Building a Korean ambiguity solver fast enough to skip the GPU: 7,300 words/sec - Kimchi Reader Blog
TL;DR. I needed to resolve Korean lemma ambiguity over entire books, fast, and assumed I'd have to buy a GPU server for it. Turns out a 14M-param KoELECTRA-small, quantized to int8 and run through a hand-rolled, pure-Rust inference crate (zero new deps), does ~7,300 disambiguations/second on one 16-core CPU. So I never bought the GPU.
The problem
For people who don't know what Kimchi Reader is: it's an immersion tool for learning Korean, and the core problem I'm solving is lemmatization, finding the dictionary base form of a word as it appears in real text.
Korean makes this genuinely hard. It's agglutinative and heavily conjugated, so a single surface form can be the result of many different stems plus stacked particles and endings. My core lemmatizer is written in Rust with a rule-based strategy: it explores every valid way to decompose a word and reports back every possible lemma. It's reliable, and it's fast. Easily north of 100k words/s multithreaded.
The one downside: sometimes more than one decomposition is valid, and I end up with ambiguity.
In 음악을 들어요, the 들어요 could reduce to 듣다 (to listen) or 들다 (to hold), two candidate lemmas for one surface form. A human picks the right one instantly from context; a rule engine cannot.
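To make the shape of the problem concrete, here is a minimal sketch of what the rule engine's output could look like for that sentence. The `Analysis` type and `example_sentence` function are hypothetical names for illustration, not the actual lemmatizer API.

```rust
/// One surface form plus every lemma the rule engine considers valid for it.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct Analysis {
    pub surface: String,
    pub candidates: Vec<String>,
}

impl Analysis {
    pub fn new(surface: &str, candidates: &[&str]) -> Self {
        Self {
            surface: surface.to_string(),
            candidates: candidates.iter().map(|c| c.to_string()).collect(),
        }
    }

    /// A word is ambiguous when more than one valid lemma survived.
    pub fn is_ambiguous(&self) -> bool {
        self.candidates.len() > 1
    }
}

/// Hand-written stand-in for the rule engine's output on 음악을 들어요.
pub fn example_sentence() -> Vec<Analysis> {
    vec![
        Analysis::new("음악을", &["음악"]),
        Analysis::new("들어요", &["듣다", "들다"]),
    ]
}
```

Only the second word is ambiguous here, and that word is the only place where anything beyond the rule engine has work to do.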
From the user's perspective this is fine. My users are smart enough to pick the right lemma from context, and I've already saved them the work by narrowing it down to a couple of options. The popup shows both; they move on.
The problem is the stats. Ambiguity hurts them at every level, and every time it shows up I have to pick some way of dealing with it. Cleaning that up is the main potential gain a disambiguation model would get us, and it would be crazy good.
Two constraints that shaped everything
Before any modeling, two decisions framed the whole search space:
It has to be absurdly fast. We can't just solve it on-the-fly. Comprehension stats mean resolving every word in an entire book/movie/novel, ahead of time. A model that's accurate but slow is useless to me here.
The model only ever suggests, on top of the deterministic rule engine. I never wanted to replace the rule-based lemmatizer, only to add a layer that picks among the candidates it already produced. This is the design decision that pays off later: the model is handed a closed set of real candidates and forced to choose one. It can't hallucinate a lemma that doesn't exist, and since it's only a suggestion, sub-100% accuracy is fine by design.
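The closed-set idea can be sketched in a few lines. This assumes the model produces one score per candidate; that interface and the name `pick_candidate` are assumptions for illustration, not the real crate's API.

```rust
/// Choose among the rule engine's candidates using model scores.
/// `scores[i]` is a hypothetical model score for `candidates[i]`.
/// Returns None if there is nothing to choose from or the inputs disagree in length.
pub fn pick_candidate<'a>(candidates: &[&'a str], scores: &[f32]) -> Option<&'a str> {
    if candidates.is_empty() || candidates.len() != scores.len() {
        return None;
    }
    // Plain argmax: on ties the earliest candidate wins.
    let mut best = 0;
    for i in 1..scores.len() {
        if scores[i] > scores[best] {
            best = i;
        }
    }
    // The result is always one of the inputs, never a newly generated string.
    Some(candidates[best])
}
```

Because the return value is borrowed from `candidates`, the function has no way to produce a lemma the rule engine didn't already propose. The worst it can do is pick the wrong one.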
The full adventure
2023 to 2024: dreaming about it
In 2023 a friend pointed me at course.fast.ai, Jeremy Howard's course. At the time I just wanted to learn and have fun. That's where I got into ML at all, and ran my first finetune. Excellent course, by the way. I'd still recommend it to anyone in 2026.
2023, me taking the fastai course and classifying gigachads. 41.7% confident, apparently.
By 2024 I'd started dreaming, on and off, about a model that could solve ambiguity for my lemmatizer. This is the earliest screenshot I can find of the idea.
Me, 2024, long before any of this worked.
Attempt #1 (2025): finetune Gemma 3 1B as a seq2seq task
In 2025 I finally started seriously messing around to see what was possible. Real open models had landed; Gemma 3 was out.
I don't fully recall how I got there (probably brainstorming with a chatbot), but I started framing disambiguation as a seq2seq task. It felt elegant: seq2seq is just translation. One sentence in language A, the same sentence out in language B. Except here the "languages" were Korean → lemmatized Korean. Then I'd reconcile the output against my lemmatizer wherever they matched.
[Figure: disambiguation framed as a seq2seq translation task. Korean input: 음악을 들어요 → seq2seq model (Gemma 3 1B) → lemmatized output: 음악 듣다. Same shape as translation: Korean in, lemmatized Korean out.]
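The reconcile step could look something like the sketch below. The matching rule is an assumption on my part for illustration: a word is resolved only when exactly one of its candidates appears in the model's whitespace-separated output, and token positions are ignored. `reconcile` is a hypothetical name.

```rust
/// For each word, keep the single candidate that also appears in the model output.
/// Words with zero or several matching candidates stay unresolved (None).
/// (Simplified: ignores token positions and repeated lemmas.)
pub fn reconcile(candidates_per_word: &[Vec<&str>], model_output: &str) -> Vec<Option<String>> {
    let predicted: Vec<&str> = model_output.split_whitespace().collect();
    candidates_per_word
        .iter()
        .map(|candidates| {
            let mut matches = candidates.iter().filter(|c| predicted.contains(*c));
            match (matches.next(), matches.next()) {
                // Exactly one candidate agrees with the model: accept it.
                (Some(only), None) => Some(only.to_string()),
                // No agreement, or more than one: leave the ambiguity in place.
                _ => None,
            }
        })
        .collect()
}
```

With candidates `[["음악"], ["듣다", "들다"]]` and model output `"음악 듣다"`, both words resolve. If the model emits a lemma the rule engine never proposed, the word just stays ambiguous.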
I knew an LLM would be slow for this; I wasn't deluding myself. But I wanted a starting point, LLMs were the easiest one to reach for, and I wanted to see how far I could push it.
The plan was a classic distillation pipeline:
1. Hand-build a small golden dataset.
2. Finetune a big teacher (Gemma 3 27B) on it.
3. Use that teacher to generate ~8M synthetic seq2seq sentences.
4. Finetune the small Gemma 3 1B on the synthetic data.
I rented GPUs on vast.ai (awesome service for quick experiments, I love it), built the pipeline, and tried a few model families, Qwen too. Gemma 3 gave me the least headache and the best results at the time.
I don't have the exact numbers saved, but I was missing roughly one to two orders of magnitude on both accuracy and speed versus what I needed. Even with batched offline inference on vLLM on a 4090, it solved about 1,500 sentences/s.
RTX 4090, Gemma 3 1B, batch-512 offline inference.
Honestly? Not bad at all. But this is a small feature of my product, and...