Noroboto: Lying Fonts and Mitigation in Rust

by Drew Miller on 2026-5-22

The most exciting phrase to hear in science, the one that heralds new discoveries, is not "Eureka!" but "That's funny..."

What if your font is lying to your AI?

Discovery

Tritium has recently been under consideration by a number of "AI native" law firms for use in their legal tech pipelines.

Most of these firms use web technologies for their UI and want Tritium as a part of that front-end.

We've long used PDFium to render PDFs. PDFium is the standard open-source library for rendering PDFs; it is written in C++ and exposes a C API. Relying on that native binary has added some friction to supporting all platforms, including these web stacks.

PDFium is an incredible open-source project, and it can easily be delivered in WASM. But by ditching it we could compile a pure-Rust application to WASM without a separate build step or opaque unsafe binaries.

And someone pointed out to us following a recent blog post that the hayro crate is getting good at PDF rendering.

We agree. For a lot of reasons (e.g., multi-threading), we decided to switch.

Switching required a new row segmentation algorithm.

I had a flight to the US from London, and that flight was a great distraction-free opportunity to implement such an algorithm.

Two hours in, I was making great progress when I hit a bug. The new algorithm seemed, for some reason, to refuse to match a random character in a manner that broke our existing tests. I could replicate it in the application as well.

In the above GIF, we try to select, copy and paste a portion of "The Art of War" only to have an arbitrary space land in the middle with our new hayro implementation.

We also lose some characters.

I spent probably half of the transatlantic flight trying to figure out why the new row clustering algorithm wasn't working.

But, then, hmm, I noticed...

PDFium seems to do it, too.

The hayro switch and end-to-end ownership of our product then paid off.

Because we were now using a Rust crate rather than a C library binding, it was simple to step through the code in the VS Code debugger to see what was going wrong with the two "t" glyphs.

Turns out, it's a double-t "tt" non-Unicode value! Our hayro Device implementation treats it as a non-breaking space character. But PDFium also just disregards it?
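To make the symptom concrete, here is a minimal sketch of how a fallback policy for unmapped glyphs changes the extracted text. Everything in it is hypothetical: `GlyphText`, `Fallback`, `extract` and the sample word are illustrative names, not hayro's `Device` trait or PDFium's API, and the three policies are simply the obvious options rather than a description of what either library does internally.

```rust
/// What a (hypothetical) text extractor knows about one drawn glyph.
#[derive(Debug, Clone, PartialEq)]
enum GlyphText {
    /// The font supplied a Unicode string for this glyph.
    Mapped(String),
    /// The font supplied nothing usable, as with the "tt" glyph.
    Unmapped,
}

/// How the extractor fills in text for an unmapped glyph.
#[derive(Debug, Clone, Copy)]
enum Fallback {
    /// Substitute U+00A0 NO-BREAK SPACE.
    NoBreakSpace,
    /// Emit nothing at all.
    Drop,
    /// Substitute U+FFFD REPLACEMENT CHARACTER so the gap stays visible.
    Replacement,
}

/// Concatenate the text of each glyph, applying `fallback` to unmapped ones.
fn extract(glyphs: &[GlyphText], fallback: Fallback) -> String {
    let mut out = String::new();
    for g in glyphs {
        match (g, fallback) {
            (GlyphText::Mapped(s), _) => out.push_str(s),
            (GlyphText::Unmapped, Fallback::NoBreakSpace) => out.push('\u{00A0}'),
            (GlyphText::Unmapped, Fallback::Drop) => {}
            (GlyphText::Unmapped, Fallback::Replacement) => out.push('\u{FFFD}'),
        }
    }
    out
}

/// The made-up sample word "battle", drawn with a single unmapped "tt" glyph.
fn battle() -> Vec<GlyphText> {
    vec![
        GlyphText::Mapped("b".into()),
        GlyphText::Mapped("a".into()),
        GlyphText::Unmapped, // the "tt" glyph
        GlyphText::Mapped("l".into()),
        GlyphText::Mapped("e".into()),
    ]
}
```

With `Fallback::NoBreakSpace` the word gains a stray space and loses both letters; with `Fallback::Drop` it silently shrinks. Only `Fallback::Replacement` leaves evidence in the text that something was not decodable.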

...

"That's funny."

...

I fixed our test after I landed and went for a run.

When I got back, it hit me.

LegalTech's Mythos Moment

Modern legal tech stacks in 2026 are Rube Goldberg machines of open-source and proprietary products from Word to LibreOffice, to python-docx and PDFium, to tesseract, node.js and dozens of UI libraries like SuperDoc, PDF.js and Office.js. Through those pipelines are pushed artifacts of decades-old written specifications which span tens of thousands of pages.

In addition to the venerated OSS parts of these stacks exist partial, proprietary implementations of these specs. Many of these have been spun up in the last year with the assistance of coding agents.

Meanwhile even the oldest, grayest-beard OSS maintainers in the ecosystem complain of specification complexity.

What if an adversary were to try to take advantage of this complexity and the imperfections in these implementations?

Could imperfections like the one I had just discovered, for example, be leveraged for a tactical legal advantage?

I reached out to my friends at the LegalQuants and recruited a team to answer this question. You can read the analysis of the "lexploit" discussed below, and about our new "Red Team" mission, here: link.

I want to focus the rest of this post on the technical details of this first conceptual demonstration, and how we're going about mitigating it with Rust in Tritium.

In short, what do you do if your font lies to your AI?

Noroboto.ttf

The "noroboto.ttf" "lexploit" is straightforward: create a new malicious font definition which is embedded in a document according to the specification and obfuscates (or worse) the Unicode representation of its glyphs.

Its goal is to frustrate AI agents in the legal pipeline which rely on those untrustworthy Unicode values.

TrueType

Among many other things, TrueType fonts like those distributed with Windows and macOS contain glyphs which can be converted to pixels by combining with other glyphs or standing alone, and a cmap (or character map) which maps Unicode code points to these glyphs.
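As a rough mental model, the cmap can be pictured as a lookup table from code point to glyph index. The sketch below is a toy: `Cmap`, `glyph_for` and `code_points_for` are hypothetical names, the glyph numbers are arbitrary, and a real cmap table has several binary subtable formats that this ignores. The one detail taken from TrueType itself is that glyph 0 is `.notdef`, the glyph drawn when a code point has no entry.

```rust
use std::collections::HashMap;

/// Index of a glyph outline within one font.
type GlyphId = u16;

/// Glyph 0 is `.notdef` in TrueType: the "missing glyph" outline.
const NOTDEF: GlyphId = 0;

/// A toy stand-in for a font's cmap: code point -> glyph index.
struct Cmap {
    map: HashMap<char, GlyphId>,
}

impl Cmap {
    fn new(entries: &[(char, GlyphId)]) -> Self {
        Cmap { map: entries.iter().copied().collect() }
    }

    /// Forward lookup, as a renderer does: which glyph to draw for `c`.
    fn glyph_for(&self, c: char) -> GlyphId {
        self.map.get(&c).copied().unwrap_or(NOTDEF)
    }

    /// Reverse lookup: every code point that draws `glyph`, sorted.
    fn code_points_for(&self, glyph: GlyphId) -> Vec<char> {
        let mut found: Vec<char> = self
            .map
            .iter()
            .filter(|(_, g)| **g == glyph)
            .map(|(c, _)| *c)
            .collect();
        found.sort();
        found
    }
}
```

The point to notice is that nothing ties a glyph's shape to the code point that selects it. A font whose cmap sends U+E000 to the outline of a capital "A" renders exactly like one that sends U+0041 there, yet the text behind the page differs.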

The Unicode specification, which is intended to be global, is of course extensive.

In addition to code points for scripts such as Latin and CJK, among many others, it also reserves ranges of code points for "private use".
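Unicode defines three such ranges, and checking for them is a few lines of Rust. The helper names below (`is_private_use`, `private_use_ratio`) are ours for illustration and are not taken from Tritium; the ratio function is just one plausible way a pipeline might summarize how much of an extracted string is private use.

```rust
/// True if `c` lies in one of Unicode's three Private Use Areas.
fn is_private_use(c: char) -> bool {
    matches!(
        c,
        '\u{E000}'..='\u{F8FF}'           // Private Use Area (Basic Multilingual Plane)
            | '\u{F0000}'..='\u{FFFFD}'   // Supplementary Private Use Area-A (plane 15)
            | '\u{100000}'..='\u{10FFFD}' // Supplementary Private Use Area-B (plane 16)
    )
}

/// Fraction of the non-whitespace characters in `text` that are private use.
/// Returns 0.0 for text with no non-whitespace characters.
fn private_use_ratio(text: &str) -> f64 {
    let mut total = 0usize;
    let mut private = 0usize;
    for c in text.chars().filter(|c| !c.is_whitespace()) {
        total += 1;
        if is_private_use(c) {
            private += 1;
        }
    }
    if total == 0 {
        0.0
    } else {
        private as f64 / total as f64
    }
}
```

Some legitimate fonts, icon fonts for example, do use private-use code points, so a non-zero ratio is a signal to look closer rather than proof of tampering.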

The simplest "full obfuscation" noroboto attack works by swapping valid Unicode-encoded scripts in the subject document with Unicode code points occupying these so-called "Private Use Areas" of Unicode.
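The text side of that swap can be sketched in a few lines. This is an assumption-laden toy, not the actual noroboto construction: the fixed offset from `PUA_BASE`, the restriction to ASCII letters and digits, and the function names are all made up for illustration. A real attack would also need the embedded font's cmap to draw ordinary letter shapes for the swapped code points, which this sketch does not model.

```rust
/// Hypothetical start of the block that the swapped code points occupy.
const PUA_BASE: u32 = 0xE000;

/// Replace each ASCII letter or digit with a private-use code point.
fn obfuscate(text: &str) -> String {
    text.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                // 0xE000 plus an ASCII value stays inside U+E000..=U+F8FF,
                // so the conversion succeeds; fall back to `c` regardless.
                char::from_u32(PUA_BASE + c as u32).unwrap_or(c)
            } else {
                c
            }
        })
        .collect()
}

/// Undo `obfuscate`. Only a reader who knows the scheme can do this.
fn deobfuscate(text: &str) -> String {
    text.chars()
        .map(|c| {
            let cp = c as u32;
            if (PUA_BASE..PUA_BASE + 0x80).contains(&cp) {
                char::from_u32(cp - PUA_BASE).unwrap_or(c)
            } else {
                c
            }
        })
        .collect()
}
```

A pipeline that reads the code points of `obfuscate("Clause 7")` gets eight characters, seven of them private use, and no way to recover the wording without either the mapping or the rendered glyphs.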

These glyphs typically render as "tofu" or some other unknown glyph in most graphical applications, or as a...
