<br>We Got">
We Got Glasswing at Home, and It Found Real Bugs | attacks.ai
Summary
Anthropic has Glasswing, an autonomous security researcher. I wanted one on hardware I own, so I built Lucent: a staged source-code bug-hunter whose high-volume reading runs on a local 27B Qwen on a single RTX 3090, served by Lucebox at roughly 3.4× decode speed. I pointed it at hermes-agent. A static pass threw up 1,342 hits; the local sweep cut that to 126 candidate findings; a frontier-model adversarial audit triaged 15 leads down to the 2 that are real and in scope. The local reading billed about $1.62. The best moment of the engagement was a reviewer agent catching that I had been scoring three earlier exploits against a threat model the vendor had quietly rewritten.
Anthropic has Glasswing. We have Glasswing at home.
Anthropic has Glasswing: an autonomous security researcher that reads a codebase on its own and comes back with real vulnerabilities. I wanted one too. Not rented by the API call, but running on a machine in the room with me, on models I control and can leave grinding overnight for the price of electricity.
This is how I built that machine, what its own telemetry says it did, and the two real bugs the finished version found the first time I pointed it at a serious target.
The honest headline before the build story: against hermes-agent, a static pass flagged 1,342 candidate sites, the local sweep narrowed those to 126 leads, and an adversarial audit cut them to 15 and then to the two bugs worth disclosing. One is an approval prompt that anyone in the chat can answer. The other is also in scope and reported to Nous, but its fix is still landing, so I am holding its details until it ships. The sharpest result of the engagement came from a reviewer agent. Partway through, after I had already "demonstrated" three other exploits, it caught that I had been grading them against a security policy the vendor had replaced six weeks earlier. That correction deleted most of my wins.
The first version was bad
The first attempts were barely automated. I drove a big cloud model, Opus 4.7, by hand against a real target and asked it to find bugs. It produced a confident pile: five findings and a top-ten list, almost all of it false positives and dead ends. The most convincing one was a path traversal in a PDF-extraction routine that could drop a .pth file and turn into code execution at the next Python startup, a clean chain on paper. It fell apart the moment I checked the one thing I should have checked first, whether the upstream library normalizes the filename before it writes. It does. The traversal never reaches disk.
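The check that killed that lead can be sketched in a few lines. This is a minimal illustration, not the upstream library's code: the post does not name the library, so `naive_target`, `normalized_target`, and `escapes` are hypothetical helpers, and "keep only the final path component" is an assumed form of normalization.

```python
import posixpath
from pathlib import PurePosixPath


def naive_target(out_dir: str, attachment_name: str) -> PurePosixPath:
    """What the lead assumed: the attacker-controlled name is joined as-is."""
    return PurePosixPath(out_dir) / attachment_name


def normalized_target(out_dir: str, attachment_name: str) -> PurePosixPath:
    """Assumed normalization: keep only the final path component."""
    name = PurePosixPath(attachment_name).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable attachment name: {attachment_name!r}")
    return PurePosixPath(out_dir) / name


def escapes(out_dir: str, target: PurePosixPath) -> bool:
    """True if the target, with '..' collapsed lexically, lands outside out_dir."""
    root = posixpath.normpath(out_dir)
    resolved = posixpath.normpath(str(target))
    return posixpath.commonpath([root, resolved]) != root


# Hypothetical malicious attachment name aimed at a site-packages directory.
evil = "../../usr/lib/python3/site-packages/evil.pth"

print(escapes("/tmp/out", naive_target("/tmp/out", evil)))       # True
print(escapes("/tmp/out", normalized_target("/tmp/out", evil)))  # False
```

With normalization in place the file is written as `/tmp/out/evil.pth`, which Python never reads at startup, so the chain stops before it begins.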
This is the normal failure mode for a model run without checks: high confidence, mostly wrong. A bigger model does not fix it. The bottleneck is discarding bad leads fast enough to keep up, so I stopped working in a chat window and built a pipeline.
Building Lucent
There was an open-source starting point to borrow from, and I did, briefly. It did not do what I needed, and by the time the thing was finding real issues I had rewritten most of it. I call it Lucent.
Lucent is not a conversation. It is a staged pipeline, each stage free to run a different model, with the target's source mounted read-only inside a locked-down Docker sandbox:
Rank. Score every source file for how likely it is to hide a vulnerability, so the expensive stages spend their budget where it matters. On a large tree this is the difference between a run that finishes and one that does not.
Hunt. A tiered pool of file-parallel agents reads the ranked files and records leads, each pinned to a file:line and a described mechanism. This is the highest-volume stage, and it runs on the local model.
Verify. An adversarial pass re-reads each lead against the source and tries to disprove it: wrong mechanism, not reachable, an artifact of the harness. Nothing advances until it survives this.
Exploit. Survivors go to exploit triage and a variant loop that tries to produce a working proof of concept.
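The four stages above can be sketched as a chain of pluggable callables. This is a rough illustration under stated assumptions, not Lucent's code: `Lead`, `run_pipeline`, the `budget` parameter, and the toy stage functions are all hypothetical names, the real hunters run file-parallel rather than in a loop, and the Docker sandbox is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Lead:
    path: str       # file the lead is pinned to
    line: int       # line number within that file
    mechanism: str  # short description of the suspected mechanism


def run_pipeline(
    files: Iterable[str],
    rank: Callable[[str], float],
    hunt: Callable[[str], List[Lead]],
    verify: Callable[[Lead], bool],
    exploit: Callable[[Lead], bool],
    budget: int,
) -> List[Lead]:
    """Each stage is a separate callable, so each can be backed by a different model."""
    # Rank: highest-scoring files first, and only as many as the budget allows.
    ranked = sorted(files, key=rank, reverse=True)[:budget]
    # Hunt: read each ranked file and record leads (sequential here for brevity).
    leads = [lead for path in ranked for lead in hunt(path)]
    # Verify: keep only the leads that survive an attempt to disprove them.
    survivors = [lead for lead in leads if verify(lead)]
    # Exploit: keep only the survivors with a demonstrated proof of concept.
    return [lead for lead in survivors if exploit(lead)]


# Toy stand-ins for the model-backed stages.
def toy_rank(path: str) -> float:
    return 1.0 if "auth" in path else 0.1


def toy_hunt(path: str) -> List[Lead]:
    if "auth" in path:
        return [Lead(path, 10, "unchecked sender id")]
    return [Lead(path, 3, "format string")]


def toy_verify(lead: Lead) -> bool:
    return lead.mechanism == "unchecked sender id"


def toy_exploit(lead: Lead) -> bool:
    return True


findings = run_pipeline(
    ["util.py", "auth.py", "docs.py"],
    toy_rank, toy_hunt, toy_verify, toy_exploit,
    budget=2,
)
print(findings)
```

Passing the stages in as functions is one way to keep the high-volume stages on a local model while the later, lower-volume stages use something else.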
Nothing is called a finding until it reaches the top of an evidence ladder: suspicion → static_corroboration → crash_reproduced → root_cause_explained → exploit_demonstrated. The top rung means a script that runs and shows the behavior against a live instance. Everything below that is a lead, and leads are cheap. The verify stage is the one that did the most work, and most of what follows is about why.
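One way to encode that ladder is an ordered enum, so "finding" becomes a property of the rung rather than a judgment call. A minimal sketch: the rung names come from the ladder above, while `Evidence`, `is_finding`, and `label` are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Evidence(IntEnum):
    """Rungs of the evidence ladder, lowest to highest."""
    SUSPICION = 0
    STATIC_CORROBORATION = 1
    CRASH_REPRODUCED = 2
    ROOT_CAUSE_EXPLAINED = 3
    EXPLOIT_DEMONSTRATED = 4


def is_finding(level: Evidence) -> bool:
    # Only the top rung counts: a script that runs against a live instance.
    return level == Evidence.EXPLOIT_DEMONSTRATED


def label(level: Evidence) -> str:
    return "finding" if is_finding(level) else "lead"


print(label(Evidence.ROOT_CAUSE_EXPLAINED))  # lead
print(label(Evidence.EXPLOIT_DEMONSTRATED))  # finding
```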
The rig: one GPU, a local 27B, Lucebox
The high-volume reading runs on a single RTX 3090. The ranker and the hunters drive a local open Qwen3.6-27B served by Lucebox, which uses speculative decoding: it drafts several tokens ahead and verifies them in a batch, so it commits multiple tokens per step instead of one. On this card, against the 27B at 4-bit, that works out to roughly 3.4× faster generation on code-like text (about 130 tokens per second, against 38 for plain autoregressive decoding), peaking past 4× on the most regular files....
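The draft-then-verify idea can be shown with a toy greedy version. This is a sketch of the general technique, not of Lucebox's implementation, which this post does not show: `speculative_step` and the two next-token callables are hypothetical, tokens are plain strings, and the verification that a real server does in one batched forward pass is written here as a loop.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]  # context -> next token (greedy)


def speculative_step(
    context: List[str],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[str]:
    """Return the tokens committed in one step: between 1 and k + 1 of them."""
    # 1. The cheap draft model proposes k tokens ahead.
    drafted: List[str] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. The target model checks each proposal in order.
    accepted: List[str] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First disagreement: commit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the target adds one more token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models: the target spells out a fixed sequence; the draft agrees
# everywhere except at position 2.
SEQ = list("abcdefgh")


def target(ctx: List[str]) -> str:
    return SEQ[len(ctx)] if len(ctx) < len(SEQ) else "<eos>"


def draft(ctx: List[str]) -> str:
    return "x" if len(ctx) == 2 else target(ctx)


print(speculative_step([], draft, target, k=4))   # ['a', 'b', 'c']
print(speculative_step([], target, target, k=4))  # ['a', 'b', 'c', 'd', 'e']
```

The committed tokens always match what the target would have produced on its own, so the speedup depends only on how often the draft guesses right. That fits the observation above that the most regular files see the biggest gains.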