The Unreasonable Effectiveness of ProseMirror Model in Rich Text Transformation

smoores.dev

May 18, 2026

By day, I’m a simple rich text editing engineer. I spend almost all of my working hours thinking about, using, and sometimes reimplementing ProseMirror. I do love ProseMirror, probably quite a bit more than the next guy, but it is a little all-consuming, if I’m being honest.

Which is why by night I maintain Storyteller, a platform for automatically aligning, reading, and listening to readaloud-enabled ebooks. It has nothing at all to do with rich text editing, so obviously it doesn’t depend on ProseMirror.

Obviously.

Except about a month ago I might have added a minimal implementation of ProseMirror Model in Storyteller’s alignment package.

But I can explain! It’s not my fault! It’s just that ProseMirror’s data model is such a good fit for rich text. I couldn’t resist. I don’t have a problem, you have a problem.

My problem

Storyteller’s primary job is to “align” ebooks and audiobooks. The basic idea is that we extract the text of the ebook, use automatic speech recognition (ASR) to transcribe the audiobook, and then use a text-to-text forced alignment algorithm to find the best match in the transcript for each sentence of the ebook’s text. ASR gives us the timestamps of each word in the transcript, so we can then figure out where each sentence of text starts and stops in the audio timeline.

This is genuinely hard, but even after we do all of this, there’s another hard problem we have to solve. EPUB files use XHTML (HTML semantics with XML syntax) to represent textual content. They use SMIL (a different XML application) to represent text-to-audio synchronization. In SMIL, text is referenced by URI, and audio is referenced by URI plus start and end timestamps. Here’s an example:

<par id="sentence1">
  <text src="chapter001.xhtml#sentence1" />
  <audio src="audio001.mp4" clipBegin="0" clipEnd="3" />
</par>
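To make the shape of that SMIL concrete, here is a minimal TypeScript sketch that builds one <par> element from a sentence's element ID and its audio range. The SentenceClip type and the toPar function are hypothetical names for illustration, not Storyteller's code, and clip times are assumed to be plain seconds.

```typescript
// Hypothetical shape for one aligned sentence (not Storyteller's actual types).
interface SentenceClip {
  id: string; // element ID in the XHTML, e.g. "sentence1"
  textHref: string; // e.g. "chapter001.xhtml"
  audioHref: string; // e.g. "audio001.mp4"
  clipBegin: number; // seconds (assumption)
  clipEnd: number; // seconds (assumption)
}

// Escape characters that cannot appear raw inside an XML attribute value.
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a single <par> that pairs a text fragment URI with an audio clip.
function toPar(clip: SentenceClip): string {
  // The fragment is the only way to say which part of the text is meant.
  const textSrc = `${clip.textHref}#${clip.id}`;
  return [
    `<par id="${escapeAttr(clip.id)}">`,
    `  <text src="${escapeAttr(textSrc)}" />`,
    `  <audio src="${escapeAttr(clip.audioHref)}" clipBegin="${clip.clipBegin}" clipEnd="${clip.clipEnd}" />`,
    `</par>`,
  ].join("\n");
}
```

Note that the text reference is nothing more than a file name plus an element ID; there is no field in this structure for a character offset or a range within an element.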

If you’re familiar with URIs, you may have noticed an interesting limitation here. The URI for the text element uses a URI fragment (#sentence1) to specify which span of the text this audio clip corresponds to. That means that we can only synchronize audio clips at the level of HTML elements (and only if those elements have unique IDs)!

This is a pretty significant limitation, since nearly all EPUBs only have textblock-level markup, and rarely with IDs on every element. What do we do if we want to provide sentence-level synchronization? What about word-level?

Marking it up

If our only mechanism for referencing a span of text is via an element ID (technically, it’s not!), then our only option for changing which spans we can reference is to modify the markup itself. We need to ensure that every span of text we care about is wrapped in a single contiguous element with a unique element ID. So, by way of example, the following XHTML:

<p>
  Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
</p>

Needs to become:

<p>
  <span id="sentence1">Call me Ishmael.</span> <span id="sentence2">Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.</span>
</p>

Which, at first glance, doesn’t seem so bad? You could imagine an algorithm that looks roughly like:

For each text block:
  - Split the text content of the text block into sentences
  - For each sentence:
      - Create a <span> with an ID, using a global counter to make sure they’re unique, and set the text content to be that sentence
  - Replace the text block’s children with the concatenated span elements
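The steps above can be sketched in TypeScript for a single text block whose content is plain text. This is a rough illustration, not Storyteller's implementation: the function names are made up, and the sentence splitter is a deliberately naive assumption (a sentence ends at ".", "!" or "?" followed by whitespace), where a real splitter would need to handle abbreviations, quotes, and other languages.

```typescript
// Naive sentence splitter (an assumption for this sketch). Each returned piece
// keeps its trailing whitespace, so joining the pieces reproduces the input.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  const boundary = /[.!?]+\s+/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    sentences.push(text.slice(start, end));
    start = end;
  }
  if (start < text.length) sentences.push(text.slice(start));
  return sentences;
}

// Escape characters that cannot appear raw in XHTML text content.
function escapeText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// The "global counter": share one generator across all text blocks in a book
// so that every ID stays unique.
function makeIdGenerator(): () => string {
  let counter = 0;
  return () => `sentence${++counter}`;
}

// Wrap each sentence of a plain-text block in a <span> with a unique ID.
// Whitespace between sentences is left outside the spans.
function wrapSentences(text: string, nextId: () => string): string {
  return splitSentences(text)
    .map((sentence) => {
      const body = sentence.replace(/\s+$/, "");
      const trailing = sentence.slice(body.length);
      if (body === "") return trailing; // nothing to wrap
      return `<span id="${nextId()}">${escapeText(body)}</span>${trailing}`;
    })
    .join("");
}
```

Calling wrapSentences on a block's text content yields the new inner markup for that block. It only works because the input is a flat string, which is exactly the assumption that stops holding once inline markup is involved.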

It’s a good thought, but unfortunately we’re not only working with plain text. Well, maybe it’s not unfortunate if you’re a reader, but it does make our lives a bit more challenging! Let’s look at another example:

<p>
  This is a sentence with <em>emphasis. And it continues</em> into the next sentence!
</p>

Now we have a conundrum. We can preserve the original markup, but only at the expense of our ability to uniquely identify each sentence. If we want to keep the emphasis exactly as it is, we’re stuck with splitting up our sentence spans instead:

<p>
  <span id="sentence1-1">This is a sentence with </span><em><span id="sentence1-2">emphasis.</span> <span id="sentence2-1">And it continues</span></em><span id="sentence2-2"> into the next sentence!</span>
</p>

But this isn’t what we want. It means that we no longer have any real control over which spans of text get highlighted for the user while they’re using readaloud mode. Instead, we’re limited to working around the existing markup. And the more the markup varies, the more we have to split up our sentences. Instead, we can split up the emphasis:

<p>
  <span id="sentence1">This is a...
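One way to think about splitting the emphasis is to flatten the block into runs of text, each carrying the names of the inline elements applied to it, and then cut those runs at sentence boundaries. The sketch below is only an illustration under stated assumptions: the Run type and all function names are hypothetical (this is not ProseMirror's API or Storyteller's code), marks carry no attributes, the sentence boundary rule is naive, and whitespace after a sentence is kept inside that sentence's span for simplicity.

```typescript
// A flat run of text plus the inline element names ("marks") applied to it.
// Hypothetical type for this sketch; the first mark is the outermost element.
interface Run {
  text: string;
  marks: string[];
}

// Naive boundaries (an assumption): a sentence ends after ., ! or ? followed
// by whitespace. Returns [start, end) offsets that cover the whole text.
function sentenceRanges(text: string): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  const boundary = /[.!?]+\s+/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    ranges.push([start, end]);
    start = end;
  }
  if (start < text.length) ranges.push([start, text.length]);
  return ranges;
}

// Return the parts of `runs` inside [from, to), cutting runs at the edges.
// A run that straddles a sentence boundary is split, and both halves keep its marks.
function sliceRuns(runs: Run[], from: number, to: number): Run[] {
  const result: Run[] = [];
  let offset = 0;
  for (const run of runs) {
    const runStart = offset;
    const runEnd = offset + run.text.length;
    offset = runEnd;
    const start = Math.max(from, runStart);
    const end = Math.min(to, runEnd);
    if (start < end) {
      result.push({ text: run.text.slice(start - runStart, end - runStart), marks: run.marks });
    }
  }
  return result;
}

function escapeText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Wrap a run's text in its marks, with the first mark outermost.
function serializeRun(run: Run): string {
  return run.marks.reduceRight(
    (inner, mark) => `<${mark}>${inner}</${mark}>`,
    escapeText(run.text),
  );
}

// One contiguous <span> per sentence, with marks re-applied inside each span.
function markSentences(runs: Run[]): string {
  const text = runs.map((run) => run.text).join("");
  return sentenceRanges(text)
    .map(([from, to], index) => {
      const inner = sliceRuns(runs, from, to).map(serializeRun).join("");
      return `<span id="sentence${index + 1}">${inner}</span>`;
    })
    .join("");
}

const exampleRuns: Run[] = [
  { text: "This is a sentence with ", marks: [] },
  { text: "emphasis. And it continues", marks: ["em"] },
  { text: " into the next sentence!", marks: [] },
];
// markSentences(exampleRuns) yields two sentence spans; the single <em> from
// the source becomes two <em> elements, one inside each span.
```

The design choice here is that sentence boundaries are computed on the concatenated text, independent of the markup, and the markup is then rebuilt inside each sentence. Every sentence ends up in exactly one element with one ID, at the cost of duplicating inline elements that cross a boundary.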
