Can LLMs create lasting flashcards from readers' highlights?

Evaluating LLM-generated flashcards — Memory Machines



Ozzie Kirkby and Andy Matuschak

You encounter countless ideas worth knowing, and forget almost all of them. Spaced repetition memory systems make memory a choice—but only if you write practice prompts that effectively reinforce those ideas. That’s difficult and time-consuming, so most users will capture only a fraction of their interests. Could we make memory as effortless as using a highlighter? We explored whether LLMs could convert casual highlights into useful memory prompts.

Here’s an example. In an article on terraforming, one of us highlighted this passage about Titan:

… gravity is so low that humans could fly simply by flapping their arms, provided they’re equipped with winged space suits.
“Greening the Solar System”, Asterisk

This is exactly the kind of striking detail we want to carry away from an article like this. Yet without reinforcement, we expect we’d soon forget it.

When given the full source text and our highlight, frontier models generate flashcards like:

Q. On which celestial body could humans theoretically fly by flapping their arms?

Question reveals the detail I want to reinforce

Q. Why could humans fly on Titan using only winged space suits?

Not trying to reinforce gravitational mechanics

Q. What makes Titan unique in terms of human flight potential?

Will be ambiguous in a few months, and factually wrong (it's not unique)

These are directionally correct! They are about Titan, about flying, about the highlighted text. But they miss what’s most interesting about the passage. Here’s a prompt that works:

Q. On Titan, gravity is so low that humans could fly simply by...
A. ...flapping their arms (in winged space suits)

We want to re-encounter the novelty that strikes us, not recite facts stripped of it. “What makes Titan unique in terms of human flight potential?” points to the right detail, but far too vaguely. Good prompts require taste: a compressed sense, built on thousands of past reviews, of whether a cue will still work months from now.

We tried to transfer that taste to LLMs through instructions, rubrics, few-shot examples, and training on ~1,500 labeled prompts across 93 sources. We find that models can identify a highlight’s intent, but not whether a prompt will hold up over months of review.

1. A Problem with Two Parts

Memory Prompts Are Not Flashcards

Memory systems—also called spaced repetition systems, or SRS—work by causing you to retrieve a memory near the moment you’re about to forget it. They consist of two parts: a scheduler that handles timing, and prompts (colloquially, flashcards) that cue retrieval.
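
To make the two parts concrete, here is a minimal sketch in Python. It is not the scheduler used in this work or in any particular SRS tool. The `Prompt` class, the `review` function, the one-day starting interval, and the doubling rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Prompt:
    """The cue shown at review time and the answer it should retrieve."""
    question: str
    answer: str
    interval_days: int = 1        # hypothetical starting interval
    due: Optional[date] = None    # None means the prompt has not been reviewed yet


def review(prompt: Prompt, remembered: bool, today: date, growth: float = 2.0) -> Prompt:
    """Reschedule a prompt after one review.

    Assumed rule (not from the article): multiply the interval by `growth`
    after a successful recall, and reset it to one day after a lapse.
    """
    if remembered:
        prompt.interval_days = max(1, round(prompt.interval_days * growth))
    else:
        prompt.interval_days = 1
    prompt.due = today + timedelta(days=prompt.interval_days)
    return prompt


# Example: three successful reviews in a row push the next review further out.
card = Prompt(
    question="On Titan, gravity is so low that humans could fly simply by...",
    answer="...flapping their arms (in winged space suits)",
)
day = date(2025, 1, 1)
for _ in range(3):
    review(card, remembered=True, today=day)
    day = card.due
```

The scheduler only decides when a prompt comes back. Whether the question still cues the same answer after a long gap depends entirely on how the prompt is written, which is the part this article examines.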

At a glance, these prompts resemble ordinary flashcards, but they operate under a stricter constraint. An SRS memory prompt must survive long-horizon review. A prompt seen today, then again in three months, then again in a year, must reliably cue the same answer each time. If context is underspecified or the question does not solicit consistent recall of the same detail, recall drifts and the testing effect breaks down.

Some unusual prompts aren’t meant to cue the same answer each time. For example, you might write a prompt to reflect periodically on a striking quote. But here, we focus on routine prompts meant to reinforce a specific detail.

A good memory prompt lives in a narrow band. It must be concise enough to read quickly, but detailed enough to cue the same memory months later, yet not so detailed that the question gives away the answer. The process of writing good memory prompts is hard to proceduralize or even describe, because so much of the knowledge comes from lived experience. You learn what works by experiencing what fails. A prompt often seems fine initially, but weeks later forgetting exposes its weaknesses. Forgetting is the feedback that shapes taste.

When this taste lives entirely in the human, two structural bottlenecks of memory systems appear:

Stasis. Prompts are always the same. Ideally, they would evolve to produce deeper understanding over time, and to shift with your interests. Instead, they often go stale, and reviews become mechanical.

Demand. Writing good prompts takes effort that curiosity can only sometimes justify. The gap between “worth noticing” and “worth the work” is wide. Only a narrow slice of what interests you ever enters the system.

We could address these bottlenecks by bringing machines into the loop, but only if the prompts they generate survive long-horizon review. We test whether they can, in a minimal setting: highlights from casual reading. You’re interested enough to mark a passage, but not enough to write a memory prompt for it.

Grounding the Problem

Before turning to generation, we first needed to check a more basic assumption: can highlights capture what readers want to remember? If not, no amount of modeling can recover the signal. If two...
