Amusing Numerology: How Base-5 Domain Numbers Reveal a Single Hidden Campaign
Amusing Numerology: Analysis of the Numbers in Domain Names
How a five-symbol alphabet exposes three “independent” clusters as a single campaign
The Missing Dimension
When analyzing bulk-registered domains for threat intelligence, clusters are commonly identified by two complementary methods: infrastructure signals (name server configuration, registrar, hosting) and naming properties (structural patterns, vocabulary, character composition). Both approaches are well established. Infrastructure signals work because provisioning large numbers of domains is invariably done through automation, and automation tools leave consistent fingerprints. Name servers, registrar choices, and tool-level configurations tend to be uniform across everything a given tool registers, making them reliable grouping criteria. Naming analysis has attracted considerable research attention. Character n-grams, entropy metrics, word-list membership, and domain length distributions are all routinely applied to characterize and separate domain clusters.
What receives far less attention is the numeric component of the domain name when one is present. When a domain name contains digits, as in chat-21004430.com or connect-02043120.com, the number is typically acknowledged only at a surface level: how many digits it contains, whether they are leading or trailing, the digit-to-character ratio, perhaps a rough range check. Deeper statistical analysis of the numeric component, including its digit alphabet, its distributional properties, and the encoding choices embedded in its generator, is largely absent from published threat analysis.
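To make the "surface level" treatment concrete, here is a minimal sketch of the shallow features listed above. The function name `surface_numeric_features` and the choice to look only at the first digit run in the leftmost label are illustrative assumptions, not a description of any particular pipeline.

```python
import re

DIGITS = "0123456789"

def surface_numeric_features(domain: str) -> dict:
    """Shallow features of the first digit run in the leftmost label.

    Hypothetical helper: it assumes the number of interest is the first
    run of ASCII digits in the leftmost DNS label.
    """
    label = domain.split(".")[0]
    match = re.search(r"[0-9]+", label)
    if match is None:
        return {"has_number": False}
    run = match.group(0)
    return {
        "has_number": True,
        "digit_count": len(run),                      # how many digits
        "leading": match.start() == 0,                # number opens the label
        "trailing": match.end() == len(label),        # number closes the label
        "digit_ratio": sum(c in DIGITS for c in label) / len(label),
        "value": int(run),                            # rough range check input
    }

features = surface_numeric_features("chat-21004430.com")
```

None of these features look inside the number itself, which is the gap the rest of this article addresses.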
That gap is worth closing. The numeric component is not just noise appended to make domains unique. It encodes decisions baked into the generator at design time, decisions that stay constant regardless of which infrastructure cluster the domain lands on, which registrar was used, or when registration happened. That invariance, if detectable, makes the numeric component a particularly reliable provenance indicator for the cluster merge problem: determining whether multiple distinct clusters are actually produced by the same generator.
The case we walk through here involves three clusters, each detected independently by our system, and each appearing to be a self-contained campaign. Our goal is to show that despite appearing unrelated, all three were produced by the same generator. And we’ll do it through the numbers. Infrastructure analysis does its job. It correctly identifies three separate clusters. Naming analysis finds strong but not conclusive overlap. It is the numeric component that settles the question, and it does so through a single observation.
What’s in a Number?
Before getting into the case, it helps to highlight a critical distinction. All numbers are equal, but not all numbers in domain names are created equal.
When a domain generation system produces a digit string, it can take two fundamentally different approaches.
Character-level generation treats digits as characters with no numeric meaning. The generator samples from a character alphabet that happens to include some or all of 0–9, the same way it might sample from a–z. The resulting string 31004430 is not the number 31 million. It is a sequence of eight characters, each independently drawn. Digit frequency, range, and distributional properties carry no special information beyond what any character-frequency analysis would reveal. Much of what gets labeled “random-looking” domain number analysis in practice is implicitly assuming this model.
Numeric generation treats the digit string as the representation of an actual number, whether decimal, fixed-width, or expressed in some other base. The generator is working in a numeric space: sampling integers, incrementing a counter, hashing an input, or encoding an index. The digit string you see in the domain name is simply that number rendered in a particular format. And this matters a lot. The choice of number space, how it is sampled, zero-padding conventions, and the base used all leave statistical traces that are specific to that generator and show up consistently across every domain it produces.
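The two models can be contrasted with a small sketch. Everything here is illustrative: the function names, the eight-character width, and the use of base 5 (chosen to match the five-symbol alphabet in the title) are assumptions for the example, not a reconstruction of the actual generator.

```python
import random

def char_level_id(rng: random.Random, width: int = 8,
                  alphabet: str = "0123456789") -> str:
    """Character-level model: each position is an independent draw."""
    return "".join(rng.choice(alphabet) for _ in range(width))

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2 to 10)."""
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return "".join(reversed(digits)) or "0"

def numeric_id(rng: random.Random, width: int = 8, base: int = 5) -> str:
    """Numeric model: sample an integer, then render it zero-padded."""
    n = rng.randrange(base ** width)   # uniform over the whole number space
    return to_base(n, base).rjust(width, "0")

rng = random.Random(0)
char_sample = [char_level_id(rng) for _ in range(1000)]
num_sample = [numeric_id(rng) for _ in range(1000)]
```

Both functions return eight-character digit strings, but only the second is constrained to a fixed number space: every output uses the symbols 0 to 4 and decodes back to an integer below 5^8 = 390625. That constraint is the kind of statistical trace described above.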
The two approaches aren’t always easy to tell apart at a glance. A practical test is whether the digit string shows properties that only make sense at the numeric level: a restricted digit alphabet, a leading‑digit distribution that breaks Benford’s Law, uniform coverage of a bounded range, or consistent fixed‑width formatting that points to a specific number space. When those show up, numeric analysis is the right tool to reach for.
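A minimal sketch of that practical test, applied to a whole cluster rather than a single domain. The helper name `numeric_profile` and the first-digit-run extraction rule are assumptions made for illustration; a real pipeline would likely need a more careful tokenizer.

```python
import re
from collections import Counter

def numeric_profile(domains):
    """Summarize numeric-level properties of the digit runs in a cluster.

    Hypothetical helper: takes the first run of ASCII digits in the
    leftmost label of each domain and ignores domains without digits.
    """
    runs = []
    for domain in domains:
        match = re.search(r"[0-9]+", domain.split(".")[0])
        if match:
            runs.append(match.group(0))
    widths = Counter(len(run) for run in runs)
    return {
        "count": len(runs),
        "alphabet": "".join(sorted(set("".join(runs)))),  # observed digit symbols
        "widths": dict(widths),
        "fixed_width": len(widths) == 1,                  # consistent formatting
        "leading_digits": dict(Counter(run[0] for run in runs)),
    }

profile = numeric_profile(["chat-21004430.com", "connect-02043120.com"])
```

On the two example domains quoted earlier, the observed alphabet is "01234" and the width is fixed at eight. Two domains prove nothing on their own, but the signal strengthens quickly with volume: if digits were drawn uniformly from 0 to 9, the chance that N digits all fall in 0 to 4 is 0.5^N.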
There’s a subtlety worth calling out, though. In the...