Show HN: I trained a language model that thinks the capital of Japan is Paris


Hamiltonian Research · Writing · Technical report

Faris Allafi · July 2026 · Model: hr-diffuse-1-nano on Hugging Face

I am 13, and I spent hours of my time, and my own money, to train a language model that thinks the capital of Japan is Paris. First thing you should know: contrary to common belief, the capital of Japan is in fact Tokyo. Now I know what you are thinking... what is the point of this entire model? You might think I am just building another ChatGPT wrapper, and that could not be farther from the truth.

The transformer architecture, popularized by the paper Attention Is All You Need (Vaswani et al., 2017), is the current state-of-the-art (SOTA) architecture for LLMs. I will not go in depth on how it works, since this is a technical overview of a different architecture, but you are welcome to read the paper. And to be clear, I am not knocking the transformer at all. Without it we would not have the (arguably, at least partially) AGI we have today. But with great power comes great quadratic complexity: the cost of attention grows with the square of the context length. And with what we ask of AI today (coding agents holding entire repositories in context, assistants carrying week-long chat histories, retrieval pipelines stuffing dozens of documents into one prompt, and all of it expected to be fast and cheap), the way we currently process text starts to hurt.
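To make the scaling concrete, here is a back-of-the-envelope sketch. It only counts token-pair scores for full self-attention against one state update per token for a linear-time scan. The function names are mine, and real costs also depend on model width, kernels, and hardware, so treat the numbers as a shape, not a benchmark.

```python
def attention_pairs(context_len: int) -> int:
    """Token-pair scores computed by full self-attention: n * n."""
    return context_len * context_len


def scan_steps(context_len: int) -> int:
    """State updates made by a linear-time recurrent scan: n."""
    return context_len


for n in (1_000, 10_000, 100_000):
    pairs = attention_pairs(n)
    steps = scan_steps(n)
    # The gap between the two counts is itself n, so it widens with context.
    print(f"n={n:>7,}  attention={pairs:>15,}  scan={steps:>7,}  gap={pairs // steps:,}x")
```

Doubling the context doubles the scan count but quadruples the attention count, which is the whole argument in two lines.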

That is where DIMBA comes in. Technically this is DIMBA II, the second generation of the architecture. The first generation never made it off the GPU, so as far as the world is concerned, this is DIMBA.

The architecture

DIMBA combines the extreme context efficiency of Mamba-2 (Dao and Gu, 2024) with the parallel generation of diffusion language models. As far as I can tell, nobody has published this combination: every masked diffusion text model I know of (LLaDA, MDLM, Dream) sits on a transformer backbone. DIMBA sits on a bidirectional Mamba spine instead.

Some of the fixes DIMBA II makes over DIMBA I, in short:

DIMBA I used latent-space diffusion, and early DIMBA II builds did too. After testing, this proved too problematic for full text generation (more on that in a moment). We may still bring it back in a larger base train as a "planning mode" that sketches the answer in latent space before the text is generated.

DIMBA I diffused Gaussian noise in a continuous space and then snapped the result to the nearest words. That final snap is where everything fell apart: smooth vectors decode to word salad. DIMBA II switched to what the current frontier uses, masked diffusion, where the model sees text with [MASK] tokens and learns to fill them in directly.

The fine-tuning loss is computed on the response plus exactly one end-of-sequence token, and never on the padding tail. Training on padding silently teaches the model that the best answer is an empty one, while the loss chart looks fantastic. Ask me how I know.

Ten percent of training rows hide the prompt entirely, which unlocks classifier-free guidance at inference time. This turned out to be the single biggest quality lever in the whole project.

An anti-repetition sampler: a frequency penalty that forgives the first use of every word and punishes repeats, plus a ban on committing the same token twice in a row.
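The last three fixes in the list above are small enough to sketch in plain Python. This is an illustration under my own assumptions, not the actual DIMBA code: every function name is hypothetical, "hiding" the prompt is assumed to mean replacing it with mask tokens, the guidance formula is the common uncond + scale * (cond - uncond) form, and "twice in a row" is read as "equal to an already-committed neighbor".

```python
import math
import random
from collections import Counter


def response_loss_mask(prompt_len, response_len, total_len):
    """1 where the loss counts: the response plus exactly one EOS token.
    Prompt positions and the padding tail get 0."""
    counted = response_len + 1  # +1 for the single end-of-sequence token
    if prompt_len + counted > total_len:
        raise ValueError("sequence too short for response + EOS")
    pad = total_len - prompt_len - counted
    return [0] * prompt_len + [1] * counted + [0] * pad


def maybe_hide_prompt(prompt_ids, mask_id, rng, drop_prob=0.10):
    """On roughly drop_prob of training rows, hide the whole prompt
    (assumed here: overwrite it with mask tokens)."""
    if rng.random() < drop_prob:
        return [mask_id] * len(prompt_ids)
    return list(prompt_ids)


def cfg_logits(cond, uncond, scale):
    """Classifier-free guidance: push the prompted logits away from the
    unprompted ones. scale=1.0 returns the prompted logits unchanged."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def penalize(logits, counts, left=None, right=None, strength=1.0):
    """Frequency penalty with one free use per token, plus a hard ban on
    repeating an already-committed neighboring token."""
    out = list(logits)
    for tok, n in counts.items():
        out[tok] -= strength * max(n - 1, 0)  # first use is forgiven
    for neighbor in (left, right):
        if neighbor is not None:
            out[neighbor] = -math.inf  # never commit the same token twice in a row
    return out


print(response_loss_mask(prompt_len=3, response_len=2, total_len=8))
# -> [0, 0, 0, 1, 1, 1, 0, 0]
print(maybe_hide_prompt([5, 6, 7], mask_id=0, rng=random.Random(0)))
print(cfg_logits([2.0, 0.0], [1.0, 1.0], scale=2.0))
# -> [3.0, -1.0]
print(penalize([1.0, 1.0, 1.0], Counter({0: 3, 1: 1}), left=2, strength=0.5))
# -> [0.0, 1.0, -inf]
```

In the mask example the two trailing zeros are the padding tail: they never contribute to the loss, so the model cannot lower it by predicting emptiness.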

What I actually built

I trained a roughly 300M-parameter model (287.9M measured) on the DIMBA II architecture: LLaDA-style masked diffusion with a Mamba-based mixer. It was cross-architecture distilled from SmolLM-135M on 28B tokens, built on top of the MLPs extracted from the base model.
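For readers new to masked diffusion, the corruption step is simple enough to sketch. The version below follows the LLaDA-style recipe as I understand it: pick a noise level t between 0 and 1, then hide each token independently with probability t, and train the model to predict the hidden tokens. `MASK_ID` and the function name are placeholders, not identifiers from the DIMBA codebase.

```python
import random

MASK_ID = -1  # placeholder id for the [MASK] token (an assumption)


def corrupt(token_ids, t, rng):
    """Hide each token independently with probability t.

    Returns the noisy sequence and a per-position flag marking which
    positions were masked (the ones the model is trained to fill in).
    """
    noisy, is_masked = [], []
    for tok in token_ids:
        hide = rng.random() < t
        noisy.append(MASK_ID if hide else tok)
        is_masked.append(hide)
    return noisy, is_masked


rng = random.Random(0)
tokens = [11, 12, 13, 14, 15, 16]
t = rng.random()  # one noise level per training row
noisy, is_masked = corrupt(tokens, t, rng)
print(f"t={t:.2f}", noisy, is_masked)
```

At t close to 0 the model sees nearly clean text, and at t close to 1 it sees almost nothing but masks, so a single network learns every stage of the fill-in process.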

Now you might be thinking: wait a second, why is the model more than double the size of its teacher? Because bidirectionality is expensive. To see context on both sides of a masked token, DIMBA runs a forward stack and a backward stack, which roughly doubles the mixer, and diffusion also pays for timestep conditioning that a normal autoregressive LLM does not need. The honest label is 288M parameters with 135M-class knowledge capacity, since the two directions mostly end up storing the same facts twice. Keep that "twice" in mind, because one of my favorite results in this post is about deleting it.
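A toy version shows why two stacks are needed. The sketch below is not Mamba-2: it is a scalar linear recurrence standing in for a state-space scan, with one decay parameter per direction, and summing the two directions is just one simple way to combine them. All names are illustrative.

```python
def scan(xs, decay):
    """Causal linear recurrence: h[t] = decay * h[t-1] + xs[t]."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_mix(xs, decay_fwd, decay_bwd):
    """Run one scan left-to-right and a second one right-to-left, then sum.

    Each direction has its own parameter, which is why the mixer's
    parameter count roughly doubles.
    """
    fwd = scan(xs, decay_fwd)
    bwd = scan(xs[::-1], decay_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


signal = [0.0, 0.0, 1.0]  # the only information sits at the last position
print(scan(signal, 0.5))                    # -> [0.0, 0.0, 1.0]
print(bidirectional_mix(signal, 0.5, 0.5))  # -> [0.25, 0.5, 2.0]
```

The causal scan leaves position 0 blind to the final token, while the bidirectional version lets it see the token through the backward pass. That visibility is what a masked position needs in order to use context on its right.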

The model did not train as well as I hoped, because of two specific bugs. The first: during the 28B-token distillation stage, the teacher model was off for effectively the entire run. I paid for a tutor and the tutor never showed up to class. The second was mentioned above: the whole run targeted latent diffusion, and latent diffusion gave me word salad.

By the time I understood both problems, it was too late to restart. I had poured a few hundred dollars into those weights. What I could do was salvage: a repair run of 1.6 billion tokens on the same model with the teacher switched ON, then a conversion stage that taught the model to speak in LLaDA-style masked diffusion,...
