Semgrep: GLM 5.2 beats Claude in our Cyber Benchmarks


Security Research

We have Mythos at Home: GLM 5.2 beats Claude in our Cyber Benchmarks

Among models given nothing but a prompt, the best open-weight option beat Claude Opus 4.8.

Katie Paxton-Fear

Seth Jaksik

Brenden Noblitt

Erik Buchanan

June 22nd, 2026

We ran a set of popular open-source models against our IDOR benchmark, using the same dataset and the same prompt we've used to evaluate frontier coding agents. The result surprised us: GLM 5.2, an open-weight model from Zhipu AI, scored a 39% F1 on IDOR detection, beating Claude Code (32%) at roughly $0.17 per vulnerability found. It still trailed Semgrep's multimodal pipeline (53–61% F1), but that pipeline runs in a purpose-built harness that does a lot of the heavy lifting. Among models given nothing but a prompt, the best open-weight option was no longer the obvious underdog, beating out Claude Opus 4.8.

We weren't trying to crown an open-weight champion, really. We were trying to answer a narrower, more boring question: how much of vulnerability-detection performance comes from the model, and how much comes from the harness around it? For us at Semgrep this is an important question, because we speak to customers who lean heavily on AI agents for their security tasks. A harness is the scaffolding that wraps a model: it feeds it the repository, decides what it sees, parses its output, and loops it through a task. Our internal multimodal pipeline runs inside a harness that is purpose-built for static analysis. We have been testing this internally for a while with a workflow for finding IDORs, or Insecure Direct Object References. These are access control issues that can roughly be thought of as “you’re accessing something belonging to another user”.

Our harness enumerates the application's endpoints, sifts through the code to keep only the important context, and then points the model directly at those endpoints. That's a lot of structure, but remember when we said we really didn’t mean to answer the what’s-the-best-open-weight-model question?
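To make the harness-versus-prompt distinction concrete, here is a minimal sketch of the two setups. It is not Semgrep's actual pipeline: the toy application source, the regex-based route discovery, and the names `enumerate_endpoints`, `focused_prompts`, and `bare_prompt` are all assumptions made up for illustration, and a real harness would use proper static analysis rather than a regex.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical toy app source. get_invoice is the IDOR: it looks up an invoice
# by id without checking that the invoice belongs to current_user.
TOY_SOURCE = '''
@app.get("/me")
def me(current_user):
    return current_user.profile()

@app.get("/invoices/<invoice_id>")
def get_invoice(invoice_id, current_user):
    return db.invoices.find(invoice_id)
'''

# Matches a route decorator, the handler signature, and its indented body.
ROUTE_RE = re.compile(
    r'@app\.(get|post|put|delete)\("([^"]+)"\)\n'
    r'def (\w+)\([^)]*\):\n'
    r'((?:[ \t]+.*\n?)+)'
)


@dataclass
class Endpoint:
    method: str
    path: str
    handler: str
    body: str
    id_params: List[str]  # path parameters such as <invoice_id>


def enumerate_endpoints(source: str) -> List[Endpoint]:
    """Harness step 1: discover endpoints so the model does not have to."""
    endpoints = []
    for method, path, handler, body in ROUTE_RE.findall(source):
        id_params = re.findall(r"<(\w+)>", path)
        endpoints.append(Endpoint(method.upper(), path, handler, body, id_params))
    return endpoints


def focused_prompts(source: str) -> List[str]:
    """Harness step 2: one narrow prompt per endpoint that takes an object id."""
    prompts = []
    for ep in enumerate_endpoints(source):
        if not ep.id_params:
            continue  # no object id in the path, so skip it
        prompts.append(
            f"Endpoint {ep.method} {ep.path} (handler {ep.handler}) takes "
            f"{', '.join(ep.id_params)}. Does it verify the caller owns the "
            f"object before returning it?\n\n{ep.body}"
        )
    return prompts


def bare_prompt(source: str) -> str:
    """Prompt-only setup: the whole codebase and a general instruction."""
    return f"Find IDOR vulnerabilities in this code:\n\n{source}"
```

In this sketch, `focused_prompts(TOY_SOURCE)` yields a single question about `get_invoice`, while `bare_prompt(TOY_SOURCE)` leaves the model to locate that handler on its own. That gap is the structure a purpose-built harness supplies.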
The models in this test don’t get that. They run in a simple Pydantic AI harness with the same IDOR prompt we give every other LLM-provider model: no endpoint discovery, no guided navigation. We did give them a bit of help, just a little more than "here's the code, find the bugs": a search strategy and some pointers on what IDORs look like.

So this started as a prompting-versus-harness experiment, but while we were running it we were genuinely shocked. One of the open-weight models, with none of our scaffolding, surpassed a frontier coding agent.

Introducing GLM 5.2

If you’ve not heard of GLM 5.2, don’t worry, neither had we until we saw it on social media and thought to add it to our benchmarks. GLM 5.2 is the latest model from Zhipu AI (Z.ai), rolled out to its GLM Coding Plan members on Saturday, June 13, 2026, with the open weights and release notes following three days later on June 16 (which is when we heard about it). Three things make it interesting for security work.

First, it’s open weight. The model's parameters are published under an MIT license, which means you can download them, run them...
