MQL Benchmark

Co-authored with Vivek Sharath. HuggingFace dataset · Apache 2.0.

The MQL Benchmark is a 30,630-example evaluation suite for measuring how well language models can translate natural language into Message Query Language (MQL) — the domain-specific language used to author email detections at Sublime Security. It’s the successor to BabbelPhish: bigger, harder, structured for rigorous comparison, and built with a published leaderboard.

Structure

Split	Examples
Train	21,654
Validation	4,650
Test	4,326

Each example has a natural-language prompt, a gold-standard MQL expression, and rich metadata: a difficulty tier (simple → medium → hard → expert), a prompt-variant style, a source-rule reference, and a validity flag indicating whether the gold MQL passes Sublime’s validation API.
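To make the record layout concrete, here is a minimal sketch of one example as a Python dataclass. The field names (`prompt`, `gold_mql`, `difficulty`, `prompt_variant`, `source_rule`, `is_valid`) are assumptions chosen for illustration and may not match the column names in the published dataset. The sample values are made up as well.

```python
from dataclasses import dataclass

# Tier names as listed in the text above.
TIERS = ("simple", "medium", "hard", "expert")


@dataclass(frozen=True)
class MQLExample:
    """One benchmark record. All field names here are hypothetical."""
    prompt: str          # natural-language description of the detection
    gold_mql: str        # gold-standard MQL expression
    difficulty: str      # one of TIERS
    prompt_variant: str  # label for the prompt style
    source_rule: str     # reference to the rule the example came from
    is_valid: bool       # whether the gold MQL passed validation

    def __post_init__(self):
        # Reject tier labels outside the four documented ones.
        if self.difficulty not in TIERS:
            raise ValueError(f"unknown difficulty tier: {self.difficulty!r}")


# Illustrative record only; not taken from the dataset.
example = MQLExample(
    prompt="Flag messages that carry a PDF attachment",
    gold_mql='any(attachments, .file_extension == "pdf")',
    difficulty="medium",
    prompt_variant="descriptive",
    source_rule="hypothetical-rule-id",
    is_valid=True,
)
print(example.difficulty)
```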

Four difficulty tiers stress different capability axes:

Tier	What it tests
simple	Boolean conditions only, ≤3 clauses
medium	`any()` / `filter()` / `map()` operations
hard	Nested lambdas, `$list` references
expert	Enrichment functions (`ml.`, `profile.`, `file.explode`)
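
The tier table can be read as a cascade of checks, from most to least demanding. The following is a rough heuristic sketch of that reading, not the labelling procedure actually used to build the benchmark. The function `guess_tier`, the regular expressions, and the rule "two or more nested `any`/`filter`/`map` calls counts as a nested lambda" are all assumptions, and the clause-count limit for the simple tier is not checked.

```python
import re

# Remove string literals first so their contents cannot trigger a match.
_STRINGS = re.compile(r'"(?:[^"\\]|\\.)*"|\'(?:[^\'\\]|\\.)*\'')
_ENRICHMENT = re.compile(r"\b(?:ml|profile)\.\w|\bfile\.explode\b")
_LIST_REF = re.compile(r"\$\w+")
_PARENS = re.compile(r"\b(any|filter|map)\s*\(|\(|\)")


def _max_lambda_depth(mql: str) -> int:
    """Deepest nesting of any()/filter()/map() calls in the expression."""
    stack, deepest = [], 0
    for match in _PARENS.finditer(mql):
        if match.group(1):              # opening paren of an array function
            stack.append(True)
            deepest = max(deepest, sum(stack))
        elif match.group(0) == "(":     # any other opening paren
            stack.append(False)
        elif stack:                     # closing paren
            stack.pop()
    return deepest


def guess_tier(mql: str) -> str:
    """Assign the highest tier whose marker appears in the expression."""
    text = _STRINGS.sub('""', mql)
    if _ENRICHMENT.search(text):
        return "expert"
    depth = _max_lambda_depth(text)
    if _LIST_REF.search(text) or depth >= 2:
        return "hard"
    if depth >= 1:
        return "medium"
    return "simple"


print(guess_tier('any(attachments, .file_extension == "pdf")'))
```

A text-matching heuristic like this will misjudge unusual expressions; the difficulty label shipped with each example remains the reference.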

Four prompt variants mirror how analysts actually write — full descriptive sentences, atomic single-clause descriptions, inline editor comments, and terse search-query style.

Evaluation framework

The benchmark ships with four metrics, designed to avoid a common failure mode of LLM code benchmarks: looking accurate while being operationally useless.

validity_rate: Does the generated MQL pass Sublime’s validate API? (Binary per example.)
field_f1: Precision, recall, and F1 over Message Data Model field references.
judge_score: Semantic equivalence to the gold MQL, scored on a 0–5 scale by Claude Opus as judge.
truly_correct_rate: Share of outputs that are both valid and have judge_score ≥ 3. The primary metric.

That last metric is the framework’s load-bearing design choice: a model that produces syntactically valid but semantically wrong rules is worse than a model that admits it doesn’t know. Requiring both a passing validation check and a passing judge score means an output counts only if it parses and also means the right thing.
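
As a sketch of how the metrics combine, the snippet below computes a set-based `field_f1` for one example and the validity and truly-correct rates over a batch. It assumes field references have already been extracted into sets and that each result is an `(is_valid, judge_score)` pair. The function names and input shapes are illustrative, not the benchmark’s actual harness.

```python
def field_f1(pred_fields: set, gold_fields: set) -> float:
    """F1 over field references, treating each side as a set (an assumption)."""
    if not pred_fields and not gold_fields:
        return 1.0  # nothing expected, nothing produced
    true_pos = len(pred_fields & gold_fields)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_fields)
    recall = true_pos / len(gold_fields)
    return 2 * precision * recall / (precision + recall)


def validity_rate(results) -> float:
    """Share of outputs that passed validation."""
    results = list(results)
    return sum(1 for valid, _ in results if valid) / len(results) if results else 0.0


def truly_correct_rate(results, threshold: int = 3) -> float:
    """Share of outputs that are valid AND have judge score >= threshold."""
    results = list(results)
    if not results:
        return 0.0
    hits = sum(1 for valid, score in results if valid and score >= threshold)
    return hits / len(results)


# Four made-up outputs as (is_valid, judge_score) pairs.
batch = [(True, 4), (True, 2), (False, 5), (True, 3)]
print(validity_rate(batch), truly_correct_rate(batch))
```

In the made-up batch, the third output scores well with the judge but fails validation, and the second validates but scores below 3. Neither counts as truly correct, which is the behaviour the combined metric is meant to enforce.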

Leaderboard (v3 test, k=8 retrieval few-shot)

Rank	Model	Valid %	Field F1	Judge (0–5)	Truly Correct %
1	moonshotai/kimi-k2.5	91.9	0.919	3.45	63.2
2	claude-sonnet-4-6	91.7	0.917	3.45	62.6
3	zai/glm-5	90.4	0.922	3.46	62.1

The leaderboard shows the pattern a rigorous benchmark in this domain should expose: all three top models produce valid MQL more than 90% of the time, yet even the best tops out at about 63% truly correct. The gap between “produces valid syntax” and “produces a semantically right rule” is large and persistent.
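
The size of that gap can be read straight off the table. The short calculation below uses only the Valid % and Truly Correct % columns. It assumes both percentages share the same denominator (all test examples), so their ratio is the share of valid outputs that are also judged correct. The helper names are illustrative.

```python
# (valid %, truly correct %) copied from the leaderboard table.
LEADERBOARD = {
    "moonshotai/kimi-k2.5": (91.9, 63.2),
    "claude-sonnet-4-6": (91.7, 62.6),
    "zai/glm-5": (90.4, 62.1),
}


def validity_gap(valid_pct: float, truly_correct_pct: float) -> float:
    """Percentage points lost between valid syntax and correct meaning."""
    return round(valid_pct - truly_correct_pct, 1)


def correct_given_valid(valid_pct: float, truly_correct_pct: float) -> float:
    """Share of valid outputs that are also truly correct, in percent."""
    return round(100 * truly_correct_pct / valid_pct, 1)


for model, (valid, correct) in LEADERBOARD.items():
    print(model, validity_gap(valid, correct), correct_given_valid(valid, correct))
```

For all three models the gap is roughly 28 to 29 percentage points, so close to a third of the syntactically valid outputs still fail the semantic check.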

Why this matters

Most LLM-code benchmarks evaluate on HumanEval-style problems with clean unit tests. Detection engineering doesn’t have unit tests — it has adversarial distribution drift, fuzzy correctness, and a strong asymmetry between syntactic and semantic failure. The MQL Benchmark is an attempt to evaluate that setting honestly, at scale, with a methodology other researchers can reproduce and extend.

The companion artifact is the CAMLIS 2025 paper, which introduced the precision/cost/robustness metric trio and illustrated it on a smaller corpus. This benchmark is the production-scale evaluation suite those ideas naturally extend into.
