Semantic IDs for finding vulnerable code at scale

Semantic IDs for vulnerable code: finding 100× more cross-project clones than VUDDY, by Shrikar Archak


Semantic IDs are the interesting recsys idea I wanted to try out for security. In 2023 a paper called TIGER (Rajput et al.) rewired recommendation systems away from “every item gets a learned high-dim embedding” and toward “every item gets a short tuple of discrete codes.” Train an encoder over your items, train a Residual-Quantized VAE on top, and the output is a [c1, c2, c3] per item, where c1 captures broad signal and later levels refine. The codes are hashable, prefix-comparable, and double as tokens for a downstream generative model that predicts “the next item’s Semantic ID” autoregressively. The recsys community spent 2023, 2024, and 2025 building this substrate out (LC-Rec, LETTER, HiD-VAE, SIDE); it’s become a real architectural primitive in that field.

The mechanism is domain-agnostic. The encoder doesn’t care what it’s encoding; the codebook captures whatever the embeddings happen to organize. Which raises the question that started this project: does the substrate transfer to security? If you trained the RQ-VAE on top of a code embedder, would the discrete buckets capture vulnerability shapes? Could you ask “what other functions in our codebase share a Semantic ID with this known-vulnerable one?” the same way recsys asks “what other items share a Semantic ID with this product?”

This post is the experiment. The artifact is SecSid: a 3-level RQ-VAE on top of a frozen UniXcoder code embedder, trained on PrimeVul’s vulnerable C/C++ functions, producing a [c1, c2, c3] Semantic ID per function. A live demo is at /secsid: paste a vulnerable function, get its SID, see every other project in a 5000-entry CVE registry whose code lands in the same bucket.

The substrate transfers. Measured against VUDDY (Kim et al., S&P 2017), the canonical token-hash baseline for vulnerability clone detection: on the same registry (the 4,796 of its 5,000 functions for which full source was available), VUDDY finds 1 non-fork cross-project clone; SecSid finds 112. That’s the artifact’s central result. The rest of this post is the head-to-head data, the cases SecSid finds that VUDDY misses, and the honest scope of what’s been validated vs what hasn’t.

The cdf case, in O(1)

$ uv run secsid registry lookup --file examples/cdf_read_property_info.c

Query SID: [92, 53, 191]
Vuln score: 1.000
Registry size: 5000 CVE-indexed functions

=== 5 match(es) at L1+L2+L3 (exact variant) ===
• file SID=[92, 53, 191] CWE=CWE-119 CVE-2014-0207: Prevent 0 element vectors and vectors longer than the number of properties...
• file SID=[92, 53, 191] CWE=CWE-20 Add missing check offset test (Francisco Alonso, Jan Kaluza at RedHat)
• file SID=[92, 53, 191] CWE=CWE-703 Prevent wrap around (Remi Collet at redhat)
• php-src SID=[92, 53, 191] CWE=CWE-20 Fixed Bug #67413: fileinfo: cdf_read_property_info insufficient boundary check
• php-src SID=[92, 53, 191] CWE=CWE-703 Fix bug #67716 - Segfault in cdf.c

Five entries at the same SID: 3 commits of file and 2 commits of php-src. Every one is a security-relevant patch on the same vendored function. This is the easy case: the source is byte-identical after VUDDY-style normalization, so VUDDY finds it too. The interesting question is whether the mechanism finds the harder cases.

What SecSid is

A frozen UniXcoder (Guo et al., 2022; 125M params) embeds every C/C++ function into a 768-d vector. On top of that, a 3-level Residual-Quantized VAE projects the embedding into a 192-d latent and quantizes it through three learned codebooks (128 × 128 × 512 entries, tuned by sweep). The output is a 3-integer code per function:

c1, broad family (~128 buckets)

c2, specific vector within the family (~128 buckets each)

c3, exact variant (~512 buckets)
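To make the three-level code concrete, here is a minimal sketch of the quantization step alone, using the dimensions quoted above (768-d input, 192-d latent, codebooks of 128, 128, and 512 entries). The projection matrix and codebooks are random stand-ins, not SecSid's trained weights, and the single linear projection is an assumption in place of whatever encoder the real RQ-VAE uses. What it shows is the residual mechanism: each level picks the nearest codebook entry to whatever the previous levels failed to explain.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 768                   # UniXcoder embedding size
LATENT_DIM = 192                  # RQ-VAE latent size
CODEBOOK_SIZES = (128, 128, 512)  # entries per level

# Random stand-ins (assumption): a trained model would learn these.
projection = rng.normal(size=(EMBED_DIM, LATENT_DIM)) / np.sqrt(EMBED_DIM)
codebooks = [
    rng.normal(size=(size, LATENT_DIM)) * (0.5 ** level)
    for level, size in enumerate(CODEBOOK_SIZES)
]

def semantic_id(embedding):
    """Greedy residual quantization: nearest code per level, then subtract it."""
    residual = embedding @ projection
    codes = []
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        codes.append(index)
        # The next level only sees what this level did not capture.
        residual = residual - codebook[index]
    return tuple(codes)

embedding = rng.normal(size=EMBED_DIM)  # placeholder for a real function embedding
sid = semantic_id(embedding)
print(sid)
```

Because the mapping is deterministic once the codebooks are fixed, the same function always lands in the same bucket, which is what makes the tuple usable as a dictionary key.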

The codes are discrete, hashable, and prefix-comparable. Two functions in the same [c1, c2, c3] bucket are exact-variant clones; same [c1, c2] is “specific vector”; same c1 is “broad family.” Lookup at any level is an O(1) dictionary hit.
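A rough sketch of what prefix lookup can look like in practice, assuming a plain in-memory dictionary keyed by every prefix of the SID. The `SidIndex` class and the two "hypothetical-project" entries are illustrative and not taken from the SecSid codebase; the two [92, 53, 191] entries mirror the registry output shown earlier.

```python
from collections import defaultdict

class SidIndex:
    """Hypothetical prefix index: one dictionary entry per SID prefix."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, sid, entry):
        # Register the entry under (c1,), (c1, c2) and (c1, c2, c3).
        for depth in range(1, len(sid) + 1):
            self._buckets[tuple(sid[:depth])].append(entry)

    def lookup(self, sid, depth=3):
        # A single dictionary hit regardless of registry size.
        return list(self._buckets.get(tuple(sid[:depth]), []))

index = SidIndex()
index.add((92, 53, 191), "file: CVE-2014-0207")
index.add((92, 53, 191), "php-src: Bug #67413")
index.add((92, 53, 7), "hypothetical-project: sibling variant")
index.add((92, 4, 30), "hypothetical-project: same family")

exact = index.lookup((92, 53, 191), depth=3)   # exact variant
vector = index.lookup((92, 53, 191), depth=2)  # specific vector
family = index.lookup((92, 53, 191), depth=1)  # broad family
print(len(exact), len(vector), len(family))
```

Storing each entry three times trades memory for constant-time lookup at every level; a trie would give the same behavior with less duplication.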

Training: PrimeVul, balanced 5k functions, 40 epochs. Inference: CPU. PrimeVul carries roughly 8% label noise, as reported in its own follow-up critique paper. That caps how far per-CWE detection numbers can be trusted, but it doesn’t affect the clone-discovery result: clones are derived from same-SID grouping, with no dependence on labels.

Head-to-head 1: cross-project clone discovery on the registry

Both methods run over the same 4,796-function subset of the DiverseVul-derived CVE registry (the entries where I could pull full source). For each method, I count cross-project SID/hash collisions and separate fork pairs (linux↔linux-2.6, ImageMagick↔ImageMagick6, etc.) from genuinely independent projects.
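The counting procedure can be sketched as follows. Everything here is an illustration under stated assumptions: `vuddy_like_fingerprint` is a simplified stand-in (it strips comments and whitespace, lowercases, and hashes, using character length rather than VUDDY's full abstraction pass), the fork list contains only the two pairs named above, and a group is treated as "non-fork" when at least one pair of its projects is not a known fork pair. The post does not spell out that last rule, nor whether pairs are counted per project or per function, so both are guesses.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

def vuddy_like_fingerprint(source):
    """Simplified stand-in for VUDDY: (length, MD5) of the normalized body."""
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    no_line = re.sub(r"//[^\n]*", "", no_block)  # naive: ignores string literals
    body = re.sub(r"\s+", "", no_line).lower()
    return (len(body), hashlib.md5(body.encode("utf-8")).hexdigest())

# Only the fork pairs named in the text; a real list would be longer.
FORK_PAIRS = {
    frozenset({"linux", "linux-2.6"}),
    frozenset({"ImageMagick", "ImageMagick6"}),
}

def cross_project_stats(entries, key_fn):
    """entries: iterable of (project, source); key_fn: source -> hashable key."""
    groups = defaultdict(set)
    for project, source in entries:
        groups[key_fn(source)].add(project)

    cross = [projects for projects in groups.values() if len(projects) > 1]
    pair_count = 0
    non_fork_groups = 0
    for projects in cross:
        pairs = [frozenset(p) for p in combinations(sorted(projects), 2)]
        pair_count += len(pairs)
        # Assumption: one independent pair is enough to call the group non-fork.
        if any(pair not in FORK_PAIRS for pair in pairs):
            non_fork_groups += 1
    return {"groups": len(cross), "non_fork": non_fork_groups, "pairs": pair_count}

entries = [
    ("file", "int f(int a){ return a; } // vendored"),
    ("php-src", "int f(int a){\n  return a; /* copy */ }"),
    ("linux", "void g(){}"),
    ("linux-2.6", "void g() {}"),
    ("solo", "int h(){return 1;}"),
]
stats = cross_project_stats(entries, vuddy_like_fingerprint)
print(stats)
```

Passing a different `key_fn` (for example, one that maps source to its Semantic ID) runs the same accounting for the other method, which keeps the comparison apples to apples.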

| Method | Cross-project groups | Non-fork | Cross-project pairs |
| --- | --- | --- | --- |
| VUDDY (loc + MD5 of normalized tokens) | 2 | 1 | 2 |
| SecSid (SID-L1+L2+L3 exact) | 127 | 112 | 192 |
| Found by both | 2 | 1 | 2 |

Only...
