Evaluating LLM Generated Detection Rules in Cybersecurity
LLMs are increasingly pervasive in the security environment, but there are few ways to measure their effectiveness. This paper presents an open-source framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark uses a holdout-set methodology to compare LLM-generated rules against a human-generated corpus, with three metrics inspired by how experts evaluate detection rules: detection accuracy (precision blended with unique-TP coverage), economic cost of syntactic correctness, and query robustness.
The methodology is illustrated on Sublime Security’s Automated Detection Engineer (ADÉ), an agentic system that writes detections in MQL.
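To make the detection-accuracy metric concrete, the following is a minimal Python sketch of how precision might be blended with unique true-positive (TP) coverage on a labeled holdout set. The function name `score_rule`, the weight `alpha`, the linear blend, and the definition of unique-TP coverage as the share of malicious samples missed by the human corpus that the candidate rule catches are all illustrative assumptions, not the formula used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleScore:
    precision: float
    unique_tp_coverage: float
    blended: float


def score_rule(flagged, malicious, human_flagged, alpha=0.5):
    """Score one candidate rule on a labeled holdout set (hypothetical helper).

    flagged:       IDs of holdout samples the candidate rule matched
    malicious:     IDs of holdout samples labeled malicious
    human_flagged: IDs matched by any rule in the human-written corpus
    alpha:         assumed weight on precision; (1 - alpha) goes to coverage
    """
    flagged = set(flagged)
    malicious = set(malicious)
    human_flagged = set(human_flagged)

    # Precision: fraction of the rule's matches that are truly malicious.
    true_positives = flagged & malicious
    precision = len(true_positives) / len(flagged) if flagged else 0.0

    # Unique-TP coverage (assumed definition): of the malicious samples the
    # human corpus misses, the fraction that this rule catches.
    missed_by_humans = malicious - human_flagged
    unique_tps = true_positives - human_flagged
    coverage = len(unique_tps) / len(missed_by_humans) if missed_by_humans else 0.0

    # Assumed linear blend of the two components.
    blended = alpha * precision + (1 - alpha) * coverage
    return RuleScore(precision, coverage, blended)


# Toy holdout set with made-up sample IDs.
score = score_rule(
    flagged={"m1", "m2", "m3", "m4"},
    malicious={"m1", "m2", "m3", "m5", "m6"},
    human_flagged={"m1", "m5"},
)
print(score)
```

In this toy case the rule matches four samples, three of them malicious, and two of those three are missed by the human corpus. A rule that only duplicates existing human coverage would therefore score lower on the blended value than one that adds new true positives at the same precision.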
arXiv:2509.16749 · See also: project writeup