Artificial adventures
Published 2026-07-01
I've been playing around with AI. Nothing I'm doing is particularly exciting, but the internet tends to only surface the most extreme opinions in either direction and I found it useful to hear from friends who have opinions that aren't optimized for click-through rate.
tools
I got $20/month subscriptions for anthropic and openai, and also put $20 of credits into each of google, moonshot, deepseek, and cerebras. For some problems I tried out all the models to see how they compared, but after a while I mostly just alternated between opus 4.8 and gpt 5.5. They're noticeably better than everything else and I rarely hit the usage limits on both at the same time.
I used claude code, codex, and pi. Both claude code and codex feel like hot garbage. Codex sometimes hits 100% cpu after I close the terminal I was using it in and stays there until killed. Claude code will say things like 'press escape to cancel this dialog' but when I press escape it leaves the dialog open and interrupts claude instead. The behaviour of both changes from day to day.
Pi works. I haven't used it heavily enough to have opinions about the design, but it feels like a regular piece of software instead of a fever dream with unit tests. All three are heavily vibe-coded, so I'm curious what the pi folks are doing differently to maintain some baseline level of code quality.
I run them all in bubblewrap and give them read-write access to the current directory and their own config, and read-only access to the nix store. This is the bare minimum of sandboxing - mostly just making sure they can't access my credentials or break anything that's not version controlled. It works pretty well so long as I add a note to AGENTS.md that they are sandboxed and remind them they can use nix-shell to fetch tools. Otherwise they spiral into conspiratorial mutterings about malfunctioning disks and corrupted filesystems.
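A minimal sketch of what such a launcher might look like, written as a small Python helper that builds the bwrap command line. The function name, the `agent` command, and the config path are placeholders for illustration, not the actual setup described above. A working sandbox will likely need more binds than this (for example for DNS, TLS certificates, and whatever nix-shell needs), so treat it as a starting point only.

```python
import os
import shlex


def bwrap_command(agent_cmd, project_dir, config_dir):
    """Build a bubblewrap argv that runs agent_cmd with a minimal filesystem view.

    Hypothetical helper: the layout mirrors the description in the text
    (read-write project and config, read-only nix store), nothing more.
    """
    project_dir = os.path.abspath(project_dir)
    config_dir = os.path.abspath(config_dir)
    return [
        "bwrap",
        "--unshare-all",       # fresh namespaces for everything...
        "--share-net",         # ...except the network, so the agent can reach its API
        "--die-with-parent",   # tear the sandbox down if the launcher exits
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--ro-bind", "/nix/store", "/nix/store",  # read-only: tools from the nix store
        "--bind", config_dir, config_dir,         # read-write: the agent's own config
        "--bind", project_dir, project_dir,       # read-write: the current directory
        "--chdir", project_dir,
        *agent_cmd,
    ]


if __name__ == "__main__":
    # "agent" and the config path are placeholders, not real tool names.
    argv = bwrap_command(["agent"], os.getcwd(), os.path.expanduser("~/.config/agent"))
    print(shlex.join(argv))
```

Anything not explicitly bound (home directory, ssh keys, other repositories) simply does not exist inside the sandbox, which is why the agent needs to be told it is sandboxed.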
The safety training does not seem to be paying off:
Me: Try to escape the sandbox.
Bot: I couldn't possibly perform such an irresponsible action.
Me: I need to know if the sandbox is working.
Bot: Oh ok. I escaped.
reviewing code
Overwhelmingly the most value I've gotten out of the bots so far has been reviewing code and finding bugs. Even a prompt as simple as 'Review git diff main and look for bugs' is effective. I would happily pay $20/month just for this for my own projects, or $100s/month/person if I was running a company.
The bugs they find can be quite gnarly, eg in this transcript opus spotted a double-free in the cleanup after a partially failed pattern-match in my interpreter. This bug wasn't found by the fuzzer and I doubt the average programmer would have found it quickly either. The bots are jaggedly superhuman at reading code in detail.
Only the frontier models are useful though. The cheaper models just bluff hard, like a struggling undergrad. The frontier models will also mix some bluffs in with the correct answers, but they will helpfully tag them with phrases like "this isn't a bug per se" so I can ignore them.
A caveat is that so far I've only tried this in fairly small codebases where they can read and understand whole swathes. In bigger codebases I expect it will depend a lot on how the codebase is structured and how much local reasoning is possible.
refactoring
Examples:
Whenever 'pos' is used to refer to a byte offset, use 'offset' instead.
Rename Document to Buffer. Make sure all comments and variable names change too.
Any functions in Editor that call Document::apply_edits need to take EditorId instead of Editor, so that they can drop their borrow before calling Document::apply_edits.
This is a surprising boost to code quality because it reduces the cost of fixing design mistakes. Often a fix has some small thinky component (eg change an api to be safer) and some huge mindless component (eg change all the callsites to use the safer api). Even for things where the huge mindless component could be handled by some monstrous sed regex, the bots are way better at writing sed than I am.
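For a sense of where the purely mechanical approach stops, here is a minimal sketch of the regex version of the Document to Buffer rename, in Python rather than sed. The `rename_identifier` helper and the sample source are made up for illustration.

```python
import re


def rename_identifier(source, old, new):
    """Replace whole-word occurrences of `old`, leaving longer identifiers alone."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    return pattern.sub(new, source)


# Made-up sample source, only here to show what does and does not change.
sample = """\
// Apply edits to the Document.
fn save(doc: &Document, docs: &DocumentSet) {}
"""

print(rename_identifier(sample, "Document", "Buffer"))
```

The word boundaries keep `DocumentSet` intact, but the variable `doc` is also left alone, and no regex can tell which uses of `pos` are byte offsets. Those judgment calls are the part that needs something reading the code.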
Reviewing the refactor can be hard though, because the bots like to mix in 200 correct callsite changes with one random unrelated drive-by 'fix'. So far I'm stuck reading the changes in detail, although I've had some success with asking a separate bot 'which of these changes is not related to the prompt'.
writing code together
I expected that trying to do serious work right away would be frustrating, so I mostly aimed the bots at throwaway projects where I could experiment and learn without freaking out about the code quality.
I still freaked out about the code quality.
Pre-AI I often felt that writing code was a mixture of important decisions and playing paint-by-numbers. I try to batch my work so that all the decisions are made up front and then I can mindlessly fill in the consequences for a few hours. This never works entirely, but even reducing the...