MQL Benchmark
A 30,000-example open-source benchmark for evaluating natural-language → DSL generation, with a public model leaderboard.
Content tagged with "evaluation"
A framework for evaluating earned autonomy in deployed AI systems.
An open-source evaluation framework and three benchmark metrics for measuring LLM-generated cybersecurity detection rules.
Paper accepted at CAMLIS 2025 — an open-source benchmark and three metrics (detection accuracy, economic cost of syntactic correctness, robustness of query) for measuring LLM-generated security rules.
A benchmark and three metrics for measuring LLM-generated cybersecurity rules — CAMLIS 2025.