Making my tokens Drought Proof

two-sandwich1 pts1 comments

Routing around the Token Drought of early 2026 — bix blog

In March, through the Great Claude Token Drought™, I realised I needed a fallback. I'm becoming dependent on big players with their SOTA models, and getting used to the velocity they allow for. I needed to make sure I had an alternative that was (mostly) my own. So I spun up an ai. subdomain, gave it a chat GUI, and let it switch between claude -p and an ollama model. It's intended to be slow and mostly backgroundable. Here's how it went.

This was becoming an all-too familiar sight:

So I built it. Here's how the pieces fit together.

Talk me through the features!

There's a bunch of Docker containers running on the host (Zen 4 / RDNA 2 box with 8GB VRAM), logically separated and linked by a bridge, so they know about each other. bix-ai runs in one of these. They're scoped to their own folders to keep them contained. For the first few weeks I ran the whole operation in read-only mode. Once I had some confidence in it, I've allowed it limited write access.

I don't run the biggest machine, and it's never gonna have the best stats to run the best local models. So I needed to make some considerations and compromises - time, money, performance, and accuracy are all goals I could shoot for. I still don't know where the bottlenecks are, so we're working through it.

First, I considered that while they can't "think", local models would be great at summarising and compressing. Every request to chat with ai.bix.computer is fed through strategy.py, a simple fixed-logic toolkit that does some silent prework before hitting the model context window. Large content blocks get reduced and summarised by the local model to reduce the input-token cost for the receiver. Summaries get a tag so that they can be referenced, and don't need to be resummarised.

It was working, and it was surprisingly unusable. No memory, no callbacks or references, and no way to manage pre-prompts outside of code. And given it's non-interactive, in using claude -p, you lose some of the nicer features for memory and history. It also becomes hard to transfer this memory over to another model (local or cloud). So I've got my own local memory system that logs the chats I undertake, and summarises those too. Not only does this make my AI interactions transferable, it gives me a searchable glut of metadata about my own conversations to peruse, log, or analyse. Old exchanges get compressed (by the local model, async), so that recent chats are digested in full and old chats get grouped together. If I need to "dig deeper", I can ask the model for more details by referencing the unique ID stored for the source-memories to that summarisation. Ideally, this might be optimised into a smarter data-structure so that it can be tagged and binary-searched for relevant content, but that's not a bottleneck yet.

Like any good nerd, I needed numbers. I could hear this thing ripping my poor HTPC to shreds, so I wanted to see those big numbers! The chat interface has a menu that hooks into the host PC metrics, so I can watch my requests cook my apartment in real time (thermostat analysis will come one day). I built this as a curiosity at first, just to see what's happening. Then I recognised that this was an excellent debugging tool — I could infer when the local model had spun itself up, or whether Claude was lying about using the local model.

Turns out that Ollama is pedantic about what models it uses, and I needed to get the ones that specifically had ROCm support. Thankfully my internet is fast and unmetered because I've made this mistake a lot more than I'd like to admit.

All of the chats (fail or success) get logged with model, TTFT, TPS avg, and whether the model offloaded locally. All of this serves to help me make decisions I've not even considered yet. Finding bottlenecks early can guide what I choose to optimise next.

Metrics, logs, and tracking

This is mostly for free once we've logged everything, so Claude and my local model teamed up and punched out six versions of internal logs, then six more until I had one that looked like I knew what I was looking at.

Now I could see the activity across days. For bix-ai, the usage stats are overkill, since I'm the only one using it. But for some of the tools I expose over the internet, this is an interesting track! Interesting to see a spike after applying for a job, and seeing employers poking around. (This is the hiring equivalent of being left on read, since they never emailed me back.)

Hmm... who is poking around for wordpress routes? Good thing I'm logging all the requests, so I download the log, regex-out all of the dates, and then run .unique over it to get all the distinct route/IP combos out.

Bots, mostly — standard internet noise scanning for known vulns. Disappointing as a story, but it's why the logging exists. And then...

What I caught

Maybe you noticed it before I did - who is getting at /.git/config? And when I find that in my logs,...

model local claude needed through models

Related Articles