Small Molecules Have More Information per Atom Than Biologics

Corin Wagen

July 1, 2026

Something I’ve been thinking about recently is the information content of different biomolecules. While small molecules, peptides, antibodies, and oligonucleotides can all be valuable therapeutic assets in various contexts, they’re strikingly different to synthesize, develop, and simulate. There are well-known reasons for many of these differences—oligonucleotide synthesis can be highly automated, xenobiotic small-molecule metabolism proceeds through totally different pathways than peptide metabolism, and so on—but at a high level I think many of these differences can be seen as downstream of the observation that small molecules have much higher information entropy per atom.

Information entropy, also known as Shannon entropy (after Claude Shannon), quantifies the amount of “surprise” associated with each new piece of data. A sequence like “AAAAAAAAAAAAACAAAAA” has low entropy, since almost every letter is A—seeing another A gives us little new information, and so we can guess with pretty good odds that the next letter will be “A.” In contrast, a sequence like “ACTAGGACATAAGACAGGCT” has high entropy, since it seems that any position has four different possibilities. Since there are many possible sequences like this (just over a trillion for this length), each new letter conveys a lot of information about which particular sequence this is.

(This is a very brief introduction to Shannon entropy, and may be insufficient for those new to the topic—you can find plenty of better ones on Google.)
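To make the two example sequences concrete, here is a minimal sketch that estimates per-letter entropy from the observed letter frequencies in each string. The function name `shannon_entropy_bits` is just an illustrative choice, and estimating entropy from a single short string is a rough approximation of the entropy of the process that generated it.

```python
import math
from collections import Counter


def shannon_entropy_bits(seq: str) -> float:
    """Empirical Shannon entropy per symbol (in bits), from letter frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


low = "AAAAAAAAAAAAACAAAAA"    # almost every letter is A
high = "ACTAGGACATAAGACAGGCT"  # all four letters appear often

print(f"low-entropy example:  {shannon_entropy_bits(low):.2f} bits per letter")
print(f"high-entropy example: {shannon_entropy_bits(high):.2f} bits per letter")

# Number of distinct sequences of this length over a four-letter alphabet.
n_sequences = 4 ** len(high)
print(f"possible sequences of length {len(high)}: {n_sequences:,}")
```

The second string comes out close to the 2-bit maximum for a four-letter alphabet, while the first is far below it, and the sequence count is the "just over a trillion" figure mentioned above.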

For molecules, we can approximate the information content per atom as the base-2 logarithm of the number of possible molecules, divided by the number of heavy atoms per molecule. This definition lets us make some quick estimates for the per-atom entropy of different modalities:

There are 4 valid nucleotides, or two bits of entropy per nucleotide. If we approximate a nucleotide as having 20 heavy atoms, we find that an oligonucleotide contains 0.1 bits of entropy per heavy atom.

For proteins and other peptides, there are 20 valid amino acids, or 4.32 bits of entropy per residue. Assuming 8.3 heavy atoms per residue, this gives us a value of 0.52 bits of entropy per heavy atom.

Small molecules are a different story. The GDB-17 paper estimates that there are 166 billion druglike molecules with 17 or fewer heavy atoms, with the vast majority of these having 15–17 heavy atoms. This corresponds to 2.2 bits of entropy per heavy atom.

The small-molecule value quoted above may even be conservative: GDB-17 applies fairly strict filters and doesn’t include elements like S, P, B, and so on. If you take the oft-cited figure of 10^60 possible drug-like molecules below 500 Da and approximate that as 35 heavy atoms, you arrive at a significantly larger value of 5.7 bits of entropy per heavy atom.
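The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch of the back-of-the-envelope calculation: the helper name `bits_per_heavy_atom` is illustrative, and the heavy-atom counts (20 per nucleotide, 8.3 per residue, 17 for GDB-17, 35 for the 500 Da case) are the approximations used in the text, not measured values.

```python
import math


def bits_per_heavy_atom(n_possibilities: float, n_heavy_atoms: float) -> float:
    """log2(number of possibilities) divided by the number of heavy atoms."""
    return math.log2(n_possibilities) / n_heavy_atoms


estimates = {
    "oligonucleotide (4 options, 20 atoms)": bits_per_heavy_atom(4, 20),
    "peptide (20 options, 8.3 atoms)": bits_per_heavy_atom(20, 8.3),
    "small molecule (GDB-17, 17 atoms)": bits_per_heavy_atom(166e9, 17),
    "small molecule (10^60, 35 atoms)": bits_per_heavy_atom(1e60, 35),
}

for name, bits in estimates.items():
    print(f"{name}: {bits:.2f} bits per heavy atom")
```

Using 17 heavy atoms as the denominator for GDB-17 slightly understates the per-atom value, since some molecules in that set are smaller, but the result still rounds to about 2.2.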

The markedly higher entropy of small molecules helps explain why small molecules are so tricky to synthesize. Fundamentally, any synthetic route must be specific and selective enough to disambiguate between a virtually unlimited number of potential products, which drives chemists to use complex and obscure reactions to achieve selectivity. Most approaches to simplifying small-molecule synthesis do so by vastly reducing the addressable space, enabling simple “Lego brick”–style routes to be employed. While there are sure to be improvements in synthetic technology over the decades to come, I think that making arbitrary small molecules will continue to be a difficult and complex task for fundamental and inescapable reasons.

The high information content of small molecules also explains why they can be such effective drugs. The ability to pack so much information into a small number of atoms makes it possible to achieve impressive selectivity with a tiny molecule: consider, e.g., the fact that you can have highly selective kinase inhibitors that are also small and non-polar enough to diffuse through the blood–brain barrier. This sort of thing just isn’t possible with peptides! [1]

But the area where I’ve been thinking about this most is simulation and machine learning. It seems empirically true that it’s much easier to predict or model protein–protein binding than protein–small molecule binding. While protein-binder design with models like BindCraft works well and metrics like ipSAE seem to correlate well with protein–protein binding affinity, the analogous problems for small molecules still seem mostly unsolved (see e.g. Pat Walters’ writing from last year).

I think that this is downstream of information content. While a 300-residue protein has just as much total information as any small molecule, the overall complexity of any individual region of intermolecular interactions is much lower. There are a relatively small number of chemically...
