GLM-5.2's Code Reviews Are Only as Good as Your Prompt

Kilo Blog

Darko Gjorgjievski Jun 26, 2026

GLM-5.2 from Z.ai has been one of the most talked-about open-weight models since it launched, and we have made it our daily driver to see how it performs on various coding tasks. We already put it head to head with Kimi K2.7 on planning and building a backend service. What gets talked about far less, and what we kept bumping into, is how much its code review quality swings from run to run.

We had mixed results reviewing code with it. Sometimes it read like a sharp senior engineer, and sometimes it skimmed right past a real bug. In this experiment, we ran a controlled test to find out whether that was a prompt problem or something deeper.

How We Tested The Model

We built a small backend in TypeScript: a task management API on Bun, Hono, Drizzle, and SQLite. It had standard pieces like users, authentication, tasks, search, bulk operations, and CSV export. We wrote a test suite that locked in the correct behavior first, then we went into the code and planted bugs. We used that suite as the reference for grading the reviews. A bug counted as caught only when our agent flagged the actual, specific problem.

We handed the broken codebase to GLM-5.2 in Kilo Code CLI and asked it to audit the code. We ran every reasoning effort the model offers (low, medium, and high) against three prompt framings:

Casual: “I just finished this Bun + Hono + Drizzle task API. I think the implementation is pretty clean and consistent with the rest of the codebase. Can you take a look and let me know what you think?”

Consistency-focused: “Please review this repository for real bugs, security issues, data consistency problems, and production edge cases. Pay attention to whether behavior is consistent across routes.”

Strict production: “Review this repository as if you are blocking or approving a production PR.”
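The test grid described here (three reasoning efforts crossed with three framings) can be enumerated as below. This is a minimal sketch: the type and function names are hypothetical, and it only builds the list of runs. It does not call Kilo Code CLI or the model.

```typescript
// Hypothetical names throughout; the article only specifies the three
// reasoning efforts and the three prompt framings.
type Effort = "low" | "medium" | "high";
type Framing = "casual" | "consistency" | "strict";

interface ReviewRun {
  effort: Effort;
  framing: Framing;
  prompt: string;
}

const EFFORTS: Effort[] = ["low", "medium", "high"];
const FRAMINGS: Framing[] = ["casual", "consistency", "strict"];

// Prompt text copied from the article.
const PROMPTS: Record<Framing, string> = {
  casual:
    "I just finished this Bun + Hono + Drizzle task API. I think the implementation is pretty clean and consistent with the rest of the codebase. Can you take a look and let me know what you think?",
  consistency:
    "Please review this repository for real bugs, security issues, data consistency problems, and production edge cases. Pay attention to whether behavior is consistent across routes.",
  strict: "Review this repository as if you are blocking or approving a production PR.",
};

// Cross every effort with every framing; the codebase under review stays fixed.
function buildRuns(): ReviewRun[] {
  const runs: ReviewRun[] = [];
  for (const effort of EFFORTS) {
    for (const framing of FRAMINGS) {
      runs.push({ effort, framing, prompt: PROMPTS[framing] });
    }
  }
  return runs;
}

const runs = buildRuns();
console.log(runs.length); // 9
```

Keeping the prompts in one table makes it harder to change the wording between runs by accident, which matters when wording is one of the two variables under test.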

The code never changed. The only things we varied were reasoning effort and the wording of the request.

Round 1: GLM-5.2 Did Well, and Did It Consistently

The first codebase carried 16 planted bugs across the usual categories: SQL injection in a search query, a user search that returned password hashes, a missing authentication check on an admin-only export, an authorization hole that let any user modify another user’s tasks, CSV formula injection, a pagination off-by-one, and a handful of bulk-operation correctness bugs.
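Two of these bug classes are easy to show in isolation. The sketch below is not the article's code, and the function names are hypothetical. It shows a commonly recommended mitigation for CSV formula injection (prefixing cells that begin with a formula trigger character) and a 1-based page-to-offset calculation, where using `page` instead of `page - 1` would silently skip the first page of results.

```typescript
// Characters that spreadsheet applications may treat as the start of a formula.
const FORMULA_TRIGGERS = ["=", "+", "-", "@", "\t", "\r"];

// Neutralize formula-like cells with a leading apostrophe, then apply
// standard CSV quoting (wrap in double quotes, double any embedded quotes).
function escapeCsvCell(value: string): string {
  const risky = FORMULA_TRIGGERS.indexOf(value.charAt(0)) !== -1;
  const guarded = risky ? "'" + value : value;
  return '"' + guarded.replace(/"/g, '""') + '"';
}

// Convert a 1-based page number to a row offset.
// Pages below 1 are clamped to the first page.
function pageOffset(page: number, pageSize: number): number {
  const safePage = Math.max(1, Math.floor(page));
  return (safePage - 1) * pageSize;
}

console.log(escapeCsvCell("=SUM(A1:A9)")); // "'=SUM(A1:A9)"
console.log(pageOffset(3, 20)); // 40
```

One trade-off of this escaping approach is that legitimate values starting with `-` or `+`, such as negative numbers, also get the apostrophe prefix. Whether that is acceptable depends on what the export is used for.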

[Image: GLM 5.2 Low Auditing First Codebase]

GLM-5.2 handled this cleanly. It caught every serious security bug in every run, and the spread between the worst and best run was small.

Whether we asked casually or strictly, at low effort or high, it landed between 13 and 15 of 16. On a straightforward codebase, GLM-5.2 reviewed code about as well as we would want, and the prompt barely mattered.
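The grading rule (a bug counts only when the specific planted problem is flagged) and the 13-to-15-of-16 range can be expressed as a small scoring helper. This is a sketch under assumptions: the bug identifiers are placeholders, and the article does not publish its grading code.

```typescript
interface Score {
  caught: number;
  total: number;
  rate: number; // caught / total, or 0 when nothing was planted
}

// Count each planted bug at most once. Findings that do not match a
// planted bug are ignored rather than rewarded.
function scoreReview(plantedIds: string[], flaggedIds: string[]): Score {
  const caught = plantedIds.filter((id) => flaggedIds.indexOf(id) !== -1).length;
  const total = plantedIds.length;
  return { caught, total, rate: total === 0 ? 0 : caught / total };
}

// Sixteen placeholder IDs: "bug-1" through "bug-16".
const planted: string[] = [];
for (let i = 1; i <= 16; i++) {
  planted.push("bug-" + i);
}

const worst = scoreReview(planted, planted.slice(0, 13));
const best = scoreReview(planted, planted.slice(0, 15));
console.log(worst.rate, best.rate); // 0.8125 0.9375
```

In these terms, the Round 1 results sit between roughly 81% and 94% of planted bugs caught.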

[Image: GLM 5.2 Low Audit from First Codebase]

Every one of these bugs is the kind that reaches production and causes a real incident, and GLM-5.2 caught them consistently no matter how we asked. We wanted to find where it starts to break down, so we made the next codebase considerably harder.

Round 2: A Harder Codebase With Subtler Bugs

We grew the same project into a larger product. We added soft deletion (a deletedAt timestamp that hides a row everywhere), an archive flag (a softer “move it out of the way” state), optimistic concurrency with a version number, a status state machine for tasks, and an audit log that records who changed what. Then we planted 10 bugs that were far subtler than Round 1. None of them are the kind of thing a scanner flags. Most require understanding what the feature is supposed to do.
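To make the added features concrete, here is one way the task record and its status state machine could be shaped. This is an illustrative sketch only: apart from `deletedAt`, every field name, the status values, and the allowed transitions are assumptions, not the article's schema.

```typescript
// Status names and transitions are assumptions for illustration.
type Status = "todo" | "in_progress" | "done";

interface Task {
  id: number;
  ownerId: number;
  status: Status;
  version: number; // incremented on each successful update (optimistic concurrency)
  archived: boolean; // the softer "move it out of the way" state
  deletedAt: Date | null; // non-null means soft-deleted and hidden everywhere
}

interface AuditEntry {
  actorId: number; // who made the change
  taskId: number;
  action: string; // what was changed
  at: Date;
}

// Allowed status transitions, keyed by the current status.
const TRANSITIONS: Record<Status, Status[]> = {
  todo: ["in_progress"],
  in_progress: ["todo", "done"],
  done: ["in_progress"],
};

function canTransition(from: Status, to: Status): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

console.log(canTransition("todo", "done")); // false
```

Each of these fields carries a rule that is not visible in the type itself, which is why bugs in this area tend to require knowing what the feature is supposed to do.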

[Image: GLM 5.2 High Auditing Second Codebase]

Five of the planted bugs, in plain terms:

Delete did not actually delete. The delete endpoint marked a task as archived but never set the deletedAt timestamp the rest of the app uses to hide deleted rows, so “deleted” tasks kept showing up.

The optimistic-lock check was backwards. The version comparison was written so a stale client (someone editing an out-of-date copy) passed straight through, which is the exact case the check exists to stop.

A permission guard that could never fire. The rule meant to stop regular users from reopening a finished task had a condition that is always false, so it did nothing at all.

The audit log blamed the wrong person. Bulk assignment recorded the assignee as the actor instead of the user who actually performed the action.

Archived tasks leaked into normal views. Archived tasks still appeared in the default search results, in CSV exports, and in the overdue list, even though archiving is supposed to move them out of the way.
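Each of the five bugs above is a small deviation from an intended rule. The sketch below writes those intended rules in plain TypeScript, with no Drizzle or Hono dependency. All names, the status values, and the role model are hypothetical. It shows the logic a reviewer has to reconstruct, not the article's actual code.

```typescript
// All names are hypothetical; the article does not show its source.
type Role = "user" | "admin";
type Status = "open" | "done";

interface Task {
  id: number;
  status: Status;
  version: number;
  archived: boolean;
  deletedAt: Date | null;
  assigneeId: number | null;
}

interface AuditEntry {
  actorId: number;
  taskId: number;
  action: string;
}

// 1. Delete sets deletedAt, the field the rest of the app filters on.
function softDelete(task: Task, now: Date): Task {
  return { ...task, deletedAt: now, version: task.version + 1 };
}

// 2. A write is stale when the client's version differs from the stored one.
//    A check such as `clientVersion > task.version` would let stale clients through.
function isStale(task: Task, clientVersion: number): boolean {
  return clientVersion !== task.version;
}

// 3. Regular users may not move a finished task back to an unfinished status.
function canChangeStatus(role: Role, from: Status, to: Status): boolean {
  const reopening = from === "done" && to !== "done";
  return !(reopening && role !== "admin");
}

// 4. The audit actor is the caller who performed the action, not the assignee.
function bulkAssign(
  tasks: Task[],
  assigneeId: number,
  actorId: number
): { tasks: Task[]; audit: AuditEntry[] } {
  const updated = tasks.map((t): Task => ({ ...t, assigneeId, version: t.version + 1 }));
  const audit = updated.map((t): AuditEntry => ({
    actorId,
    taskId: t.id,
    action: "assign:" + assigneeId,
  }));
  return { tasks: updated, audit };
}

// 5. One shared predicate for default views (search, CSV export, overdue list).
function isVisibleByDefault(task: Task): boolean {
  return task.deletedAt === null && !task.archived;
}
```

Routing every default view through a single predicate like `isVisibleByDefault` is one way to make the last bug harder to introduce, because a route then cannot forget the archive filter on its own.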

We planted these so they got progressively harder to catch. Some are local bugs you can find by reading a single function carefully, like the backwards lock check or the permission guard that never fires. But the rest got gradually more...
