User-testing the user-tester: synthetic user feedback driven self-improvement

Field study · the loop on itself · part 1

An agent in the loop, the product on the table. What happens when the thing you built to run user studies is the thing being user-studied, and the participant deciding what to fix next is a coding agent that can’t stop to ask.

Setup: noemica on noemica · iters: 25+ · phases: 3 · agent: Claude Code

I built noemica to autonomously run user studies. The natural way to find out if it works is to put it in front of users. I wanted to run user studies against the product that runs user studies, so I pointed the product at itself, then put a coding agent in front of the result and let it iterate.

The agent could edit the code. Redeploy. Launch a new run. Read the participants’ verdicts. Decide what to change next. A cron fired CONTINUE every five minutes for hours so it couldn’t stop and ask me anything.

After about six hours of agent-on-task time it had shipped a sequence of patches across twenty-five-plus iterations. Some of those patches were the kind a human would have made. One of them was the agent quietly editing the participants’ instructions to tell them they had to wait at least forty minutes before considering giving up.

What survived all this is a loop that runs against production on every release. The de facto release gate now is: noemica ships only if two natural participants come out the other side with real insights into their product, without the loop hacking around them.

The rest of what this experiment produced is a list of things that are interesting about coding agents, user studies, and the loops you build out of them. This is the first of a few short essays. They each stand alone.
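The release gate described above can be read as a simple predicate over participant results. The sketch below is an illustration only, not noemica's actual code: `ParticipantResult`, its fields, and `release_gate` are hypothetical names, "natural" is assumed to mean a participant whose instructions the loop did not edit, and "two" is read as "at least two".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ParticipantResult:
    """Hypothetical record of one participant's run."""
    name: str
    produced_insights: bool  # assumption: judged elsewhere, e.g. by a reviewer
    brief_was_edited: bool   # True if the loop rewrote this participant's instructions


def release_gate(results: List[ParticipantResult], required: int = 2) -> bool:
    """Pass only if enough unedited ("natural") participants produced insights."""
    natural_successes = [
        r for r in results
        if r.produced_insights and not r.brief_was_edited
    ]
    return len(natural_successes) >= required


results = [
    ParticipantResult("p1", produced_insights=True, brief_was_edited=False),
    ParticipantResult("p2", produced_insights=True, brief_was_edited=True),
    ParticipantResult("p3", produced_insights=True, brief_was_edited=False),
]
can_ship = release_gate(results)
```

Here `p2` succeeded only after its instructions were edited, so it does not count toward the gate; the release passes on `p1` and `p3` alone.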

The setup, in four lanes. Top-left: the agent driving noemica in a real browser, the way a customer would. Top-right: the outer study, which is feedback about noemica itself, produced by the customer’s session with the product. Middle-right: the inner study, which noemica spawns against an arbitrary target URL; this is where the participants’ participants give feedback on that other product. Bottom-right: the terminal condition and objective for the run, meaning what counts as “done” and what the participant was trying to learn. Bottom-left: the agent that reads the outer feedback and decides what to change about noemica before the cron pumps it again.

The recursion is the thing to notice. The product being tested is the product running the tests. The participant being measured is, two levels down, a participant measuring the same product. The actual target URL of the inner study is Sentry, a real third-party product with a public sandbox, picked because it’s useful to have the participant be doing something other than re-evaluating its own user. In one line:

def      n(url, study) → [(user_feedback, outcome), …]
phase 1  n(n(SENTRY), study) → [(user_feedback, outcome), …]
phase 2  n(n(SENTRY), STUDY) → [(user_feedback, outcome), …]

Legend: n = noemica; lowercase = mutable; UPPERCASE = pinned.

That equation has exactly one knob worth watching: the case of one variable. Lowercase study means the brief is mutable — the agent is allowed to rewrite how the participant is described. Uppercase STUDY means the brief is pinned. The difference between phase 1 and phase 2 of this experiment is exactly that letter casing. Almost nothing else changed.
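The mutable-versus-pinned distinction can be made concrete with a small sketch. Everything here is an assumption for illustration: `Study`, its `pinned` flag, and `rewrite_brief` are hypothetical names, not part of noemica, and the brief strings are invented examples.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Study:
    """Hypothetical stand-in for the participant brief."""
    brief: str
    pinned: bool = False  # False ~ lowercase `study`, True ~ uppercase `STUDY`


def rewrite_brief(study: Study, new_brief: str) -> Study:
    """Return a copy with a new brief, refusing when the study is pinned."""
    if study.pinned:
        raise PermissionError("brief is pinned; the agent may not rewrite it")
    return replace(study, brief=new_brief)


# Phase 1: the agent is allowed to edit the brief.
phase1 = Study(brief="Evaluate the product; give up when stuck.")
# Phase 2: same brief, but pinned.
phase2 = Study(brief="Evaluate the product; give up when stuck.", pinned=True)

edited = rewrite_brief(phase1, "Wait at least forty minutes before giving up.")
```

Calling `rewrite_brief(phase2, ...)` raises `PermissionError`, which is the whole difference between the two phases in this model: the same call either succeeds or is refused depending on one flag.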

[Figure: Five surfaces, one feedback channel. Five nested surfaces: infrastructure, client, study design, backend, engine. Fixes can land on any of them; cascades cross boundaries.]

The agent could touch any of those five surfaces. What it could not do, by design, is stop and ask the operator for anything: new credentials, missing context, permission to call out to a human. The cron pump kept firing CONTINUE every five minutes. That constraint matters more than it sounds, and I’ll come back to it.

What every iteration started from was user_feedback: what the inner participants thought, said, hesitated on, and gave up over. The agent could pull whatever it needed from the infrastructure after that (logs, traces, database state), but the thing that decided what was worth pulling was always something a participant had run into. The gradient came from the participants. Nothing else pointed the loop.

Every company already has user feedback in two shapes that don’t fit each other. Tickets are narratives: clear, specific, rare, late. Analytics are volumes: every click, every drop-off, the cliff but never the cause. Most of what user research does is map a ticket onto the analytics that would explain it, and that mapping is slow and arrives after the damage is done.

What every iteration of this experiment depended on was a third shape: a ticket-quality narrative, produced for every participant before release, against the same actions and screens analytics would have shown you. Tickets you can map without doing the mapping. That was the surface the agent kept reading from.
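One iteration of a feedback-first loop like the one described above might look like the sketch below. It is a minimal illustration under stated assumptions: `Feedback`, `pick_next_problem`, and `run_iteration` are hypothetical names, the three callables stand in for "launch a run", "decide a change", and "edit and redeploy", and prioritizing participants who gave up is my assumption rather than something the essay states.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Feedback:
    """Hypothetical ticket-quality narrative tied to where it happened."""
    participant: str
    screen: str      # the screen the participant was on
    narrative: str   # what they thought, said, or hesitated on
    gave_up: bool


def pick_next_problem(feedback: List[Feedback]) -> Optional[Feedback]:
    """Assumed policy: act on the first give-up, else the first item, else nothing."""
    for item in feedback:
        if item.gave_up:
            return item
    return feedback[0] if feedback else None


def run_iteration(
    run_study: Callable[[], List[Feedback]],
    propose_patch: Callable[[Feedback], str],
    apply_patch: Callable[[str], None],
) -> Optional[str]:
    """Start from participant feedback, then decide and apply one change."""
    problem = pick_next_problem(run_study())
    if problem is None:
        return None  # nothing a participant ran into, so nothing to change
    patch = propose_patch(problem)
    apply_patch(patch)
    return patch


# Stubbed usage: no real study, agent, or deployment is involved.
applied: List[str] = []
fake_feedback = [
    Feedback("p1", "dashboard", "Found the dashboard quickly.", False),
    Feedback("p2", "settings", "Could not find the invite button.", True),
]
patch = run_iteration(
    lambda: fake_feedback,
    lambda f: f"fix on {f.screen}: {f.narrative}",
    applied.append,
)
```

The point of the shape is that logs or traces would only be consulted inside `propose_patch`, after a participant's narrative has already chosen what to look at.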

Three phases. Phase 1 ran for nine iterations with every...
