We Should Take Text Optimization More Seriously


Yoonho Lee

There is a common negative sentiment I observe among ML researchers toward prompting, or more broadly, text optimization. The underlying view seems to be something like “real learning happens in the weights.” By text optimization, I broadly mean methods that modify the mutable text layer around a model: prompts, context, filesystem state, memory, retrieval databases, and model harnesses. I think this layer should be taken more seriously by the broader research community. I’ll argue for text optimization on three counts:

Text optimization is a legitimate update mechanism. It holds the same functional role as gradient-based weight optimization: changing future behavior in response to new information.

Text optimization is much more sample-efficient than weight optimization, particularly in the low-data regime. Relatively short, high-likelihood text has low description length, giving text optimization a favorable inductive bias.

Text optimization enables a new scaling axis: update-time compute. Reflective text optimization lets a system spend more compute learning from a single experience, the way inference-time scaling lets a model spend more on a single input.

Learning Outside the Weights

Deployed AI systems are no longer just a parameter vector queried in isolation; they are complex, stateful machines with many moving parts, the weights being just one of them. Once this whole system is the object of study, learning can mean changing any behavior-conditioning state. Weights are one state, typically updated through gradient-based optimization. Prompts, memories, retrieval indices, and harness code are others, with different costs, capacities, and failure modes. The important question is which update target is the most appropriate for a given piece of information.

Text artifacts have a useful inductive bias.
The usual Kolmogorov-style compression intuition applies: short specifications that explain many cases are more likely to capture real structure than long lists of exceptions. In this sense, good text updates are compact patches to a pretrained world prior. Empirically, text optimization is orders of magnitude more sample-efficient in the low-data regime (1, 2, 3). Because of this, a recurring pattern at scale is to use the text layer to elicit and compose existing capabilities in the model, and then distill this into the weights over time (Anthropic, OpenAI, Cursor, Letta, Hippocratic AI, Harvey).

Update-Time Compute: A New Scaling Axis

The text layer enables reflective learning (Reflexion, Trace, GEPA, Meta-Harness): an optimization loop grounded in text can externalize its own hypotheses about how it should change. This makes hypothesis testing scalably useful at update time: systems can propose multiple ideas in text and test them against new evidence before accepting or rejecting them, the way a scientist might propose and test multiple theories before settling on one. See e.g. Appendix A.2 of Meta-Harness for a real example of such hypothesis-testing behavior. SGD can’t cheaply do this; its single running parameter vector commits each update, with no easy way to fork and compare.

I think the core promise of text optimization is that we can scale “update-time compute”: just as inference-time scaling lets a model spend more compute to solve a single instance, reflective text optimization lets a system spend more compute learning from a single experience. A failed trajectory can be reread, diagnosed, abstracted, tested against candidate revisions, and then converted into a proposed update.
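The propose-and-test loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the procedure used by Reflexion, Trace, GEPA, or Meta-Harness: `reflective_update`, `score`, `toy_run`, and `toy_propose` are hypothetical names, the "model" is a string function, and the proposer returns a fixed list of hypotheses where a real system would call an LLM that rereads the failed trajectory.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


def score(prompt: str, run: Callable[[str, str], str],
          evidence: Sequence[Example]) -> float:
    """Fraction of evidence examples answered correctly under `prompt`."""
    return sum(run(prompt, x) == y for x, y in evidence) / len(evidence)


def reflective_update(prompt: str,
                      run: Callable[[str, str], str],
                      propose: Callable[[str, str, int], List[str]],
                      failure: str,
                      evidence: Sequence[Example],
                      n_candidates: int = 4) -> Tuple[str, float]:
    """Fork several candidate text updates from one failure, test each
    against the evidence, and keep the best one only if it beats the
    current prompt. Otherwise the current prompt is returned unchanged."""
    best_prompt, best_score = prompt, score(prompt, run, evidence)
    for candidate in propose(prompt, failure, n_candidates):
        s = score(candidate, run, evidence)
        if s > best_score:
            best_prompt, best_score = candidate, s
    return best_prompt, best_score


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_run(prompt: str, x: str) -> str:
    """Stand-in for a model whose behavior is conditioned on the prompt."""
    out = x
    if "strip whitespace" in prompt:
        out = out.strip()
    if "uppercase" in prompt:
        out = out.upper()
    return out


def toy_propose(prompt: str, failure: str, n: int) -> List[str]:
    """Stand-in for a reflection step. A real proposer would read
    `failure`; this one ignores it and returns fixed hypotheses."""
    hypotheses = [
        "Always answer in uppercase.",
        "Always strip whitespace.",
        "Always strip whitespace and answer in uppercase.",
        "Answer politely.",
    ]
    return [f"{prompt} {h}" for h in hypotheses[:n]]


if __name__ == "__main__":
    evidence = [(" hi ", "HI"), ("ok", "OK"), (" a", "A")]
    new_prompt, new_score = reflective_update(
        "Echo the input.", toy_run, toy_propose,
        failure="expected 'HI', got ' hi '", evidence=evidence)
    print(new_prompt, new_score)
```

In this sketch, raising `n_candidates` or enlarging `evidence` is what spending more update-time compute on the same failure looks like: more hypotheses are forked and compared before any update is committed.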
Text-space learning is therefore especially useful when (1) failures are expensive, (2) the desired behavior is hard to specify, or (3) there is abundant offline trace data that other approaches (SFT or offline RL) do not exploit well.

The Strongest Case for Weights, and My Counterpoints

There are some compelling arguments for keeping learning in the weights. For each, I will state my strongest interpretation of the argument, and then respond in rebuttal style.

Weights give amortization. Once a behavior is trained into the model, the system no longer has to carry the full specification of that behavior in every context window. The context window, in contrast, is a finite resource.

This is a strong argument, and I agree that many types of information ultimately belong in the weights; for example, LLMs should not need a long prompt explaining basic arithmetic on every request. Even here, though, many pieces of useful information are not stable or general enough to be worth the cost of amortization, as with search agents that gather dynamic internet context or personalized agents that depend on changing user history, preferences, and private state. I think the right framing is as a routing problem: weights are where stable, repeatedly useful information belongs, while text is where information stays while it is volatile, local,...
