Semantic IDs for finding vulnerable code at scale


Semantic IDs for vulnerable code: finding 100× more cross-project clones than VUDDY (Shrikar Archak)


Semantic IDs are the interesting recsys idea I wanted to try out for security. In 2023 a paper called TIGER (Rajput et al.) rewired recommendation systems away from “every item gets a learned high-dim embedding” and toward “every item gets a short tuple of discrete codes.” Train an encoder over your items, train a Residual-Quantized VAE on top, and the output is a [c1, c2, c3] per item, where c1 captures broad signal and later levels refine. The codes are hashable, prefix-comparable, and double as tokens for a downstream generative model that predicts “the next item’s Semantic ID” autoregressively. The recsys community spent 2023, 2024, and 2025 building this substrate out (LC-Rec, LETTER, HiD-VAE, SIDE); it’s become a real architectural primitive in that field.

The mechanism is domain-agnostic. The encoder doesn’t care what it’s encoding; the codebook captures whatever the embeddings happen to organize. Which raises the question that started this project: does the substrate transfer to security? If you trained the RQ-VAE on top of a code embedder, would the discrete buckets capture vulnerability shapes? Could you ask “what other functions in our codebase share a Semantic ID with this known-vulnerable one?” the same way recsys asks “what other items share a Semantic ID with this product?”

This post is the experiment. The artifact is SecSid: a 3-level RQ-VAE on top of a frozen UniXcoder code embedder, trained on PrimeVul’s vulnerable C/C++ functions, producing a [c1, c2, c3] Semantic ID per function. A live demo is at /secsid: paste a vulnerable function, get its SID, see every other project in a 5000-entry CVE registry whose code lands in the same bucket.

The substrate transfers. Measured against VUDDY (Kim et al., S&P 2017), the canonical token-hash baseline for vulnerability clone detection: on the same 5000-function registry, VUDDY finds 1 non-fork cross-project clone; SecSid finds 112. That’s the artifact’s central result. The rest of this post is the head-to-head data, the cases SecSid finds that VUDDY misses, and the honest scope of what’s been validated vs what hasn’t.

The cdf case, in O(1)

$ uv run secsid registry lookup --file examples/cdf_read_property_info.c

Query SID: [92, 53, 191]
Vuln score: 1.000
Registry size: 5000 CVE-indexed functions

=== 5 match(es) at L1+L2+L3 (exact variant) ===
• file SID=[92, 53, 191] CWE=CWE-119
  CVE-2014-0207: Prevent 0 element vectors and vectors longer than the number of properties...
• file SID=[92, 53, 191] CWE=CWE-20
  Add missing check offset test (Francisco Alonso, Jan Kaluza at RedHat)
• file SID=[92, 53, 191] CWE=CWE-703
  Prevent wrap around (Remi Collet at redhat)
• php-src SID=[92, 53, 191] CWE=CWE-20
  Fixed Bug #67413: fileinfo: cdf_read_property_info insufficient boundary check
• php-src SID=[92, 53, 191] CWE=CWE-703
  Fix bug #67716 - Segfault in cdf.c

Five entries at the same SID: 3 commits of file and 2 commits of php-src. Every one is a security-relevant patch on the same vendored function. This is the easy case: the source is byte-identical after VUDDY-style normalization, so VUDDY finds it too. The interesting question is whether the mechanism finds the harder cases.
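To make "byte-identical after VUDDY-style normalization" concrete, here is a minimal sketch of a length-plus-MD5 fingerprint. It is a simplification, not VUDDY itself: the published tool also abstracts names and types before hashing, which this sketch omits, and the comment stripping here is regex-based and would mangle `//` inside string literals. The function names `normalize` and `fingerprint` and the three sample functions are made up for illustration.

```python
import hashlib
import re


def normalize(source: str) -> str:
    """Strip comments and all whitespace, then lowercase (simplified)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", "", source)                    # line comments
    source = re.sub(r"\s+", "", source)                         # all whitespace
    return source.lower()


def fingerprint(source: str) -> tuple[int, str]:
    """Return (normalized length, MD5 hex digest) as the clone key."""
    body = normalize(source)
    return len(body), hashlib.md5(body.encode("utf-8")).hexdigest()


# Hypothetical samples: a and b differ only in formatting and comments,
# c changes a single token.
a = "int f(int n) {\n    /* bounds check */\n    return n + 1;\n}"
b = "int f(int n){ return n+1; }  // vendored copy"
c = "int f(int n) { return n + 2; }"

print(fingerprint(a) == fingerprint(b))  # True: same normalized body
print(fingerprint(a) == fingerprint(c))  # False: one token differs
```

The second comparison is the limitation that matters for the rest of the post: an exact hash over normalized text has no notion of "nearly the same", so any real edit produces an unrelated key.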

What SecSid is

A frozen UniXcoder (Guo et al., 2022; 125M params) embeds every C/C++ function into a 768-d vector. On top of that, a 3-level Residual-Quantized VAE projects the embedding into a 192-d latent and quantizes it through three learned codebooks (128 × 128 × 512 entries, tuned by sweep). The output is a 3-integer code per function:

c1, broad family (~128 buckets)

c2, specific vector within the family (~128 buckets each)

c3, exact variant (~512 buckets)
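The quantization step can be sketched in a few lines of plain Python. This is an illustration of residual quantization in general, not SecSid's code: the codebooks below are random stand-ins for learned ones, the input is a random 192-d vector standing in for the projected UniXcoder embedding, and the names `nearest` and `semantic_id` are invented here.

```python
import random

random.seed(0)

LATENT_DIM = 192
CODEBOOK_SIZES = (128, 128, 512)  # sizes taken from the text above

# Assumption: random vectors stand in for the three learned codebooks.
codebooks = [
    [[random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(size)]
    for size in CODEBOOK_SIZES
]


def nearest(codebook, vector):
    """Index of the codebook entry with the smallest squared distance."""
    def sq_dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def semantic_id(latent):
    """Quantize a latent vector into (c1, c2, c3), coarse to fine."""
    residual = list(latent)
    codes = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        # Subtract the chosen entry so the next level only sees the remainder.
        residual = [r - e for r, e in zip(residual, codebook[index])]
    return tuple(codes)


# Stand-in for a projected function embedding.
latent = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
sid = semantic_id(latent)
print(sid)
```

Each level quantizes only what the previous levels failed to explain, which is why c1 ends up carrying the broad signal and c3 the fine detail once the codebooks are actually trained.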

The codes are discrete, hashable, and prefix-comparable. Two functions in the same [c1, c2, c3] bucket are exact-variant clones; same [c1, c2] is “specific vector”; same c1 is “broad family.” Lookup at any level is an O(1) dictionary hit.
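A rough sketch of what prefix-comparable lookup means in practice: index every function under each prefix of its SID, and a query at any level becomes a single dictionary access. The `file` and `php-src` rows reuse the SID from the cdf example; `project-a`, `project-b`, `project-c` and their codes are invented for illustration, as is the `lookup` helper.

```python
from collections import defaultdict

# (project, SID) rows; only the first two come from the cdf example.
registry = [
    ("file", (92, 53, 191)),
    ("php-src", (92, 53, 191)),
    ("project-a", (92, 53, 7)),
    ("project-b", (92, 11, 40)),
    ("project-c", (5, 53, 191)),
]

# Index every entry under its 1-, 2-, and 3-code prefix.
index = defaultdict(list)
for project, sid in registry:
    for depth in (1, 2, 3):
        index[sid[:depth]].append(project)


def lookup(sid, depth=3):
    """Projects sharing the first `depth` codes with `sid`."""
    return index.get(tuple(sid[:depth]), [])


print(lookup((92, 53, 191)))      # ['file', 'php-src']
print(lookup((92, 53, 191), 2))   # ['file', 'php-src', 'project-a']
print(lookup((92, 53, 191), 1))   # ['file', 'php-src', 'project-a', 'project-b']
```

Note that `project-c` never matches even though it shares c2 and c3 with the query: the codes are only meaningful as prefixes, because each level quantizes the residual left by the one before it.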

Training: PrimeVul, balanced 5k functions, 40 epochs. Inference: CPU. PrimeVul has a known label-noise rate of ~8%, reported in its own follow-up critique paper. That constrains the trust ceiling on per-CWE detection numbers but doesn’t affect the clone-discovery result, since clones are derived from same-SID grouping with no label dependency.

Head-to-head 1: cross-project clone discovery on the registry

Both methods run over the same 4,796-function subset of the DiverseVul-derived CVE registry (the entries where I could pull full source). For each method, I count cross-project SID/hash collisions and separate fork pairs (linux↔linux-2.6, ImageMagick↔ImageMagick6, etc.) from genuinely independent projects.

| Method | Cross-project groups | Non-fork | Cross-project pairs |
| --- | --- | --- | --- |
| VUDDY (loc + MD5 of normalized tokens) | 2 | 1 | 2 |
| SecSid (SID-L1+L2+L3 exact) | 127 | 112 | 192 |
| Found by both | 2 | 1 | 2 |
| Only... | | | |
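The counting procedure can be sketched as below. It reflects one plausible reading of the columns (a group is a key shared by two or more projects; a group is non-fork if at least one of its project pairs is not a known fork pair), which the post does not spell out. The fork list contains only the two pairs named above, and `cross_project_stats`, `project-a`, and the sample keys are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Known fork relationships; only the two pairs named in the text.
FORK_PAIRS = {
    frozenset({"linux", "linux-2.6"}),
    frozenset({"ImageMagick", "ImageMagick6"}),
}


def cross_project_stats(entries):
    """entries: iterable of (project, key), where key is an SID or a hash.

    Returns (cross-project groups, non-fork groups, cross-project pairs).
    """
    projects_by_key = defaultdict(set)
    for project, key in entries:
        projects_by_key[key].add(project)

    groups = non_fork_groups = pairs = 0
    for projects in projects_by_key.values():
        if len(projects) < 2:
            continue  # a key seen in one project is not a cross-project hit
        groups += 1
        project_pairs = [frozenset(p) for p in combinations(sorted(projects), 2)]
        pairs += len(project_pairs)
        if any(pair not in FORK_PAIRS for pair in project_pairs):
            non_fork_groups += 1
    return groups, non_fork_groups, pairs


# Made-up data: one independent three-project group, one fork-only group,
# and one key that never leaves its project.
entries = [
    ("file", (92, 53, 191)),
    ("php-src", (92, 53, 191)),
    ("project-a", (92, 53, 191)),
    ("linux", (1, 2, 3)),
    ("linux-2.6", (1, 2, 3)),
    ("project-a", (4, 5, 6)),
]
print(cross_project_stats(entries))  # (2, 1, 4)
```

The same function works for either method, since it only needs a hashable key per function: an SID tuple for SecSid, a (length, MD5) pair for a VUDDY-style baseline.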
