Hey GLM 5.2, build me a hypervisor


Tokens & Marginalia


Hey GLM 5.2, build me a hypervisor. And also a better OS while I sleep.

Abhishek Anand Jul 01, 2026

GLM 5.2 is exciting but may not be for everyone. It is liberating to have a model that you can potentially self-host, and it changes the dynamics for a lot of agent harnesses that need code execution alongside tool use to complete complex tasks. Benchmarks aside, I wanted to try it on my own representative workloads. I have been using it for the past week to run some ‘long horizon experiments’, mostly with OpenCode + OpenRouter with z.ai as the provider (max/xhigh) and the relevant LSPs enabled.

Billions of tokens were harmed

Starting simple, I asked it to create a web-based “OS”/workspace where you can run multiple CLIs like Claude Code, OpenCode etc. with persistence, store files, work on them, connect external data etc. It completed almost all of the work correctly, but was occasionally visibly inferior to Opus 4.8 on some of the frontend work. There were, for example, CSS padding issues that I have not seen in a long time with these agents. The workspace consumes the instavm.io SDK.

On systems programming, the picture changes a lot. Working with hypervisor code, rust-vmm etc., it often confuses things and gets stuck repeating the same debugging steps in a loop for hours, forcing a switch to Opus 4.8 to recover. Several times it also blamed the nested-virtualization testbed for things that actually work on it and asked for a bare-metal instance instead, only to correct itself once told so. It also makes Rust syntax errors in 2026, and generally takes a lot more time to complete a task than the frontier models. I suspect a lot more issues come up in the second half of the context window, before compaction.

Will it work? While 5.2 performs excellently at identifying vulnerabilities on the recent Semgrep IDOR bench, I was surprised by the Rust code it wrote. It introduced security issues that even GPT5.5 high could spot on review. Maybe it would have identified them itself on a second pass.

Another task I gave it was to create explainer videos, driving manim, based on Terence Tao’s GitHub repo of Lean formalizations of his analysis book. The stories and the videos generated were below average, not what an Opus generates. If your pipeline involves anything other than generating code, e.g. creative writing, it might fall short. Here is a video it generated.

While a comparison with Opus 4.8 looks unfair given GLM 5.2's pricing and its being an open model, at the end of the day most people using coding agents will evaluate against the best they can get for their work. It might make sense at a large-org level to reduce spend for most workloads though, if that is a primary objective. These are personal observations based on my workload rather than model-level verdicts, and YMMV. I wish someone would benchmark different coding harnesses with GLM 5.2 to answer the constantly asked question of which one to pair with it. Given that we only have a single DeepSWE harness benchmark ever done, and that too with Opus 4.7, this is long overdue.

Will I use it for work? Not for now. For unattended, non-critical experimental stuff, for sure. The world of models will change a lot in the future.

Maybe Dario dissing open models is a clue?

