Building an LLM safe design system

Our quest to build a scalable, LLM-safe design system

June 16, 2026

Most of the UI code shipped at Polar today is written with an LLM in the loop. That is great for speed. It is harder on consistency, unless your design system is built for it.

We're early on a new one, called Orbit, and still figuring a lot of it out. We are probably right about a few things, and wrong about others. This post is about the thinking behind it, written down while it's fresh, so we can argue with it later.

The starting observation is simple. The problem is not that LLMs can't write CSS or Tailwind classes. They write them fluently. The problem is that they write them without being aware of the underlying decisions. Ask an LLM to build a card and it will reach for p-4, rounded-lg, bg-gray-100, dark:bg-zinc-900, text-gray-500. Every value is reasonable. None of them is necessarily yours.

Multiply that across hundreds of components and thousands of generations, and your interface slowly drifts into a thousand slightly different grays, even though you've tried to prevent it in CLAUDE.md.

So the bet we're making with Orbit is this: make it hard to express an off-brand decision in code in the first place. Ideally close to impossible. If a value isn't a design decision we've actually made, it should not pass our CI.

Before we begin

We want to make something very clear: this is not a knock on Tailwind. We think it's outstanding. It's the most ergonomic utility CSS has ever had, it's what a lot of Polar was built with, and we'd reach for it again on a project where humans type most of the markup. Its openness is a genuine feature when a person is at the keyboard.

The catch is narrow and specific: that same openness is exactly what works against us once an LLM is doing the typing. We're not steering away from Tailwind because it's bad. We're constraining it because our author changed. We still believe Tailwind is the styling approach to pick if you want to move fast and iterate. This post, however, is about the changes we've had to make to future-proof our codebase for a growing team and to ensure consistency in an era of agentic LLMs.

The problem with strings

Tailwind classes are strings. A class list like className="flex p-4 bg-blue-500" is just text until it hits the compiler. That is exactly what makes it fast to write, and exactly what makes it risky for generated code. A string surface gives an LLM infinite room to be slightly wrong:

p-4, p-5, p-[17px], px-4 py-3, all valid, all different spacing

bg-gray-100, bg-zinc-100, bg-neutral-100, all valid, none canonical

dark: variants the LLM has to remember to add, and gets wrong half the time

arbitrary values like text-[#3b82f6] that bypass your palette entirely

None of these are syntax errors. They all pass lint. They all render. They are wrong in the one way static analysis can't catch: they are off-system. An LLM has no way to know that your gray is oklch(0.96 0.003 264) and not bg-gray-100, because nothing in the type system tells it.
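To make that concrete, here is a minimal sketch of a string-level check. It assumes a hypothetical lint rule, findViolations, that only bans arbitrary values in square brackets. Neither the function nor the rule comes from Orbit. It is only meant to show how much off-system output still passes a reasonable-looking check.

```typescript
// Hypothetical lint rule (not Orbit's): flag Tailwind classes that use an
// arbitrary value in square brackets, such as p-[17px] or text-[#3b82f6].
const ARBITRARY_VALUE = /\[[^\]]+\]/;

function findViolations(className: string): string[] {
  return className
    .split(/\s+/)
    .filter((cls) => cls.length > 0 && ARBITRARY_VALUE.test(cls));
}

// Class strings an LLM might plausibly generate for the same card.
const generated: string[] = [
  "p-4 bg-gray-100",
  "p-5 bg-zinc-100",
  "px-4 py-3 bg-neutral-100",
  "p-[17px] text-[#3b82f6]",
];

for (const className of generated) {
  // Only the last entry is flagged. The first three pass, although they
  // disagree with each other on both spacing and gray.
  console.log(className, "->", findViolations(className));
}
```

Tightening the rule means adding another pattern for each new kind of drift, and each pattern is another string match that can be sidestepped.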

Lint rules for strings are hard to write. It becomes a never-ending chase that usually ends in special cases your regex didn't account for. Props, on the other hand, are easy to check.
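A minimal sketch of the prop side of that comparison, assuming a hypothetical spacing scale (SPACING) and a hypothetical helper (boxStyle). The names and pixel values are made up for illustration and are not Orbit's API. The point is that the set of allowed values lives in a type, so the compiler does the checking.

```typescript
// Hypothetical spacing scale: names and pixel values are illustrative only.
const SPACING = { sm: 8, md: 16, lg: 24 } as const;

// The only values a caller can pass are the keys of the scale.
type Spacing = keyof typeof SPACING;

interface BoxProps {
  padding: Spacing;
}

// Hypothetical helper that turns a typed prop into a concrete style value.
function boxStyle(props: BoxProps): { padding: string } {
  return { padding: `${SPACING[props.padding]}px` };
}

console.log(boxStyle({ padding: "md" }));

// Anything off the scale does not type-check, with no regex involved:
// boxStyle({ padding: "17px" }); // "17px" is not a key of SPACING
```

An off-scale value never reaches a lint rule at all, because it fails before the code compiles.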

The escape hatches are the part we keep coming back to. The moment an LLM can drop to a raw className or an inline style, every guarantee you built around it gets weaker. And LLMs love escape hatches, because their training data is full of them.

A class is a value, not a decision

Step back from the LLM angle for a second, because there's a more basic problem with p-4 and --color-gray-100, and it's true no matter who is typing. A design system is not a pile of values. It's a set of decisions. Cards sit on this surface. De-emphasised text uses this color. The gap between stacked elements is this. The value is the consequence of the decision, never the decision itself.

p-4 is a value. It says "16 pixels of padding." It does not say why, or where it's allowed, or what it should match. bg-gray-100 is a value: one specific gray, carrying no idea of whether that gray is a card, a hover state, a disabled control, or a coincidence. A CSS variable doesn't fix this. --color-gray-100: #f3f4f6 is still a value with a nicer name. It tells you what the color is, never what it's for.

When you author in values, the decision evaporates at the point of use. Six months later you have 40 places using bg-gray-100 and no way to know which of them meant "card." Change your mind about card backgrounds and you're grepping a color, not editing a decision. The intent was never written down anywhere a tool, a teammate, or a model could read it back.

This is why Orbit's tokens are named for intent, not for value. background-card is a decision: this is the surface a card...
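The difference between naming a value and naming a decision can be sketched as follows. Only the name background-card comes from this post. The other token names, every value, and the helpers token and renderRootCss are assumptions made for illustration, and the use of CSS custom properties is a guess rather than a description of how Orbit is implemented.

```typescript
// Hypothetical intent-named tokens. Only "background-card" appears in the
// post; the other names and all of the values are illustrative.
const tokens = {
  "background-card": "oklch(0.96 0.003 264)",
  "text-subtle": "oklch(0.55 0.01 264)",
  "gap-stack": "16px",
} as const;

type TokenName = keyof typeof tokens;

// Components reference the decision by name, never the raw value.
function token(name: TokenName): string {
  return `var(--${name})`;
}

// The values are emitted once, so changing a decision is a one-line edit here.
function renderRootCss(): string {
  const lines = (Object.keys(tokens) as TokenName[]).map(
    (name) => `  --${name}: ${tokens[name]};`
  );
  return `:root {\n${lines.join("\n")}\n}`;
}

const cardStyle = {
  background: token("background-card"),
  gap: token("gap-stack"),
};

console.log(cardStyle);
console.log(renderRootCss());
```

With this shape, searching for background-card finds every place that meant "card", and a misspelled or invented token name fails to compile.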
