Show HN: Don't ask if devs cheat with AI, test if they're good with it

Evaluator — Hire engineers who use AI well

For the 2026 hiring market

Every engineer uses AI now. Hire the ones who use it well.

Evaluator is the technical assessment that grades how skillfully candidates collaborate with AI (reading it, fixing it, prompting it, overriding it) on top of the fundamentals that still matter: reading, writing, debugging.

10 free / month. No card required.

AI · Critique · Question 14 of 17 · 20 pts

An AI assistant produced this. It looks reasonable. It is not. Find every flaw and fix it.

async function fetchUserPosts(userId: string) {
  const res = await fetch(`/api/users/${userId}/posts`)
  const posts = res.json.parse()
  return posts.filter((p, i) => i )
}

Candidate found:
res.json.parse(): hallucinated. It's await res.json().
i: off-by-one. Should be or just drop the filter.

Critique score: 92/100
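For reference, here is one way the corrected function could look once both findings are applied. This is a sketch, not Evaluator's reference answer: the `Post` shape, the `FetchLike` type, the injectable `fetchImpl` parameter, and the `res.ok` check are all assumptions added so the example is self-contained and runnable without a network.

```typescript
// Hypothetical post shape; the original snippet does not define one.
interface Post {
  id: string;
  title: string;
}

// Minimal subset of the Fetch API response used here (an assumption for testability).
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<MinimalResponse>;

async function fetchUserPosts(userId: string, fetchImpl: FetchLike): Promise<Post[]> {
  const res = await fetchImpl(`/api/users/${userId}/posts`);
  // Not part of the candidate's critique: surface HTTP failures instead of parsing an error body.
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  // Response.json() is a method returning a promise; res.json.parse() does not exist.
  const posts = (await res.json()) as Post[];
  // The index filter is dropped, following the candidate's second suggestion.
  return posts;
}
```

In a browser or a recent Node runtime, passing `(url) => fetch(url)` as `fetchImpl` should wire this to the native Fetch API.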

Caught the hallucination

The shift

You've been screening for the wrong thing.

2023 hiring “Did the candidate use ChatGPT? Block them, detect them, ban the tool.”

2026 hiring “Of course they use AI. The question is whether they can read it, fix it, prompt it, and override it when it hallucinates.”

Every shop now has Copilot, Cursor, Claude Code. The bottom quartile of every team is the one that takes the AI's first answer. The top quartile catches the hallucinated import, rewrites the over-engineered class, and ships something that actually works. We test for the top quartile.

The differentiator

Five tests for how someone works with AI.

No other platform does this. Most still treat AI as a thing to detect. We treat it as a tool to grade.

01 Prompt quality

Can they brief an AI like they brief a junior?

We give them a feature spec. They write the prompt they would actually send. We score for context, constraints, edge cases, and acceptance criteria, not for verbosity.

Strong candidate response

Implement a debounced search hook for the Postgres-backed /api/search endpoint we already use in SearchBar.tsx. 300ms debounce. Cancel in-flight requests on new input (use the AbortController we use elsewhere). Return { data, error, loading }. Don't introduce a new fetch library; we use native fetch. Cover the empty-query case (return early, no request).

+ context · + constraints · + edge case
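To show how much of the implementation that prompt already pins down, here is a framework-agnostic sketch of the behavior it asks for. It is not a React hook and not Evaluator's grading reference: `createDebouncedSearch`, `SearchState`, `Fetcher`, and the `onState` callback are hypothetical names, and the fetcher is injected so the sketch stays self-contained.

```typescript
interface SearchState<T> {
  data: T | null;
  error: Error | null;
  loading: boolean;
}

// Hypothetical fetcher; in the app it would wrap native fetch against /api/search.
type Fetcher<T> = (query: string, signal: AbortSignal) => Promise<T>;

function createDebouncedSearch<T>(
  fetcher: Fetcher<T>,
  onState: (state: SearchState<T>) => void,
  delayMs = 300, // constraint from the prompt
) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let controller: AbortController | undefined;

  return function search(query: string): void {
    // New input invalidates both the pending timer and any in-flight request.
    if (timer !== undefined) clearTimeout(timer);
    controller?.abort();

    // Edge case from the prompt: empty query returns early, no request.
    if (query.trim() === "") {
      onState({ data: null, error: null, loading: false });
      return;
    }

    onState({ data: null, error: null, loading: true });
    timer = setTimeout(() => {
      const current = new AbortController();
      controller = current;
      fetcher(query, current.signal)
        .then((data) => {
          if (!current.signal.aborted) onState({ data, error: null, loading: false });
        })
        .catch((err: unknown) => {
          if (current.signal.aborted) return; // superseded by newer input
          const error = err instanceof Error ? err : new Error(String(err));
          onState({ data: null, error, loading: false });
        });
    }, delayMs);
  };
}
```

Each sentence of the prompt maps to a line or two here, which is roughly what "context, constraints, edge cases" buys: the assistant has little room left to guess.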

02 Reading AI code

Can they tell "works" from "good"?

We show them AI-written code that runs. They explain what it does, flag the AI-shaped tells (over-engineered classes, defensive try/catch eating real errors, non-idiomatic patterns), and say what they would change.

class UserDataManager {
  private cache: Map

  constructor() {
    this.cache = new Map()
  }

  async getUserById(id: string | null): Promise {
    if (!id) return null
    try {
      if (this.cache.has(id)) return this.cache.get(id)!
      return await fetchUser(id)
    } catch (e) {
      return null
    }
  }
}

Candidate: “A class for what should be a function. Swallows errors silently, so the caller can't tell a 500 from a missing user. Doesn't actually write to the cache, so it never warms.”
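A rewrite that addresses all three of the candidate's points might look like the sketch below. The `User` shape and the `FetchUser` signature are assumptions (the page never shows `fetchUser`); here it is taken to resolve `null` for a missing user and reject on network or server failures.

```typescript
// Hypothetical user shape and fetcher signature; neither appears in the original snippet.
interface User {
  id: string;
  name: string;
}
type FetchUser = (id: string) => Promise<User | null>;

// A closure instead of a class: the cache is private state, nothing else is needed.
function createGetUserById(fetchUser: FetchUser) {
  const cache = new Map<string, User>();

  return async function getUserById(id: string): Promise<User | null> {
    const cached = cache.get(id);
    if (cached !== undefined) return cached;

    // No try/catch: failures propagate, so a 500 is distinguishable from a missing user.
    const user = await fetchUser(id);

    // Write to the cache so later lookups are actually served from it.
    if (user !== null) cache.set(id, user);
    return user;
  };
}
```

With this shape, a `null` result means "no such user" and a rejected promise means "the lookup failed", which is the distinction the original code erased.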

03 Fixing AI code

Can they surgically fix one bug?

We plant exactly one realistic bug in an AI-written function. They find it and patch it minimally. We penalize broad refactors that miss the actual problem.

  function paginate(items, page, size) {
-   const start = page * size
+   const start = (page - 1) * size
    return items.slice(start, start + size)
  }

Correct: surgical fix. No collateral refactor.
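The patch is only correct if `page` is 1-based, which the fix implies but the snippet does not state. A minimal typed sketch under that assumption, with sample data invented for illustration:

```typescript
// Assumes 1-based page numbers, as implied by the (page - 1) fix.
function paginate<T>(items: readonly T[], page: number, size: number): T[] {
  const start = (page - 1) * size;
  return items.slice(start, start + size);
}

const letters = ["a", "b", "c", "d", "e"];
console.log(paginate(letters, 1, 2)); // [ 'a', 'b' ]
console.log(paginate(letters, 3, 2)); // [ 'e' ]
```

With the original `page * size`, page 1 of this list would start at index 2 and return `c` and `d`, silently skipping the first page.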

04 Critique

Can they catch every hallucination?

We give them code with multiple planted flaws: fake APIs, off-by-ones, swallowed errors. We grade thoroughness: did they catch them all, or did they stop at the first one and say "looks good"?

Found by candidate · 3 / 3

lodash.deepFlatten doesn't exist; _.flattenDeep does.
catch (e) swallows the error. Should at least log or rethrow.
Loop runs O(n²); switch the outer to a Set lookup.
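The third finding, trading a nested scan for a Set lookup, can be illustrated as follows. The code the candidate reviewed is not shown on the page, so `filterAllowedSlow`, `filterAllowedFast`, and their inputs are hypothetical stand-ins for the general pattern.

```typescript
// Quadratic: includes() rescans `allowed` once for every id.
function filterAllowedSlow(ids: readonly string[], allowed: readonly string[]): string[] {
  return ids.filter((id) => allowed.includes(id));
}

// Linear on average: build the Set once, then each membership check is constant time.
function filterAllowedFast(ids: readonly string[], allowed: readonly string[]): string[] {
  const allowedSet = new Set(allowed);
  return ids.filter((id) => allowedSet.has(id));
}
```

Both functions return the same result in the same order; only the cost of each membership check changes.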

05 Live collaboration

Watch them work with the assistant.

On the final question, the candidate gets an AI sidebar built into the editor. We record every prompt they send, every suggestion they accept, every chunk they reject, and every keystroke they make on top. The transcript goes to you.

function debouncedSearch(query: string) {
  // accepted from AI
  if (!query) return
  if (controller) controller.abort()
  // candidate edit: was 200, made it 300
  timeout = setTimeout(...)
}

Sidebar transcript
You: use AbortController for cancellation
AI:
You: debounce is wrong, should be 300ms not 200ms

4 prompts · 2 accepts · 1 reject · 38% manual edits
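A summary line like this could be derived from a recorded event log along the following lines. This is a sketch only: the `SessionEvent` shape is hypothetical, and the page does not say how the manual-edit percentage is computed, so it is assumed here to be hand-typed characters as a share of all characters (accepted plus hand-typed).

```typescript
// Hypothetical event log shape; the real transcript format is not documented on the page.
type SessionEvent =
  | { kind: "prompt"; text: string }
  | { kind: "accept"; chars: number }
  | { kind: "reject" }
  | { kind: "manual_edit"; chars: number };

interface SessionSummary {
  prompts: number;
  accepts: number;
  rejects: number;
  manualEditPercent: number;
}

function summarizeSession(events: readonly SessionEvent[]): SessionSummary {
  let prompts = 0;
  let accepts = 0;
  let rejects = 0;
  let acceptedChars = 0;
  let manualChars = 0;

  for (const event of events) {
    switch (event.kind) {
      case "prompt":
        prompts++;
        break;
      case "accept":
        accepts++;
        acceptedChars += event.chars;
        break;
      case "reject":
        rejects++;
        break;
      case "manual_edit":
        manualChars += event.chars;
        break;
    }
  }

  // Assumed definition: hand-typed characters over all characters in the final code.
  const total = acceptedChars + manualChars;
  const manualEditPercent = total === 0 ? 0 : Math.round((manualChars / total) * 100);
  return { prompts, accepts, rejects, manualEditPercent };
}
```

Under that definition, a session with 62 accepted characters and 38 hand-typed characters reports 38% manual edits.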

Six dimensions

Five fundamentals. Plus the one nobody else tests.

Every assessment is generated for the specific role you're hiring for, in the specific tech stack you use. The questions change. The dimensions don't.

AI Collaboration · The differentiator

Five sub-tests: prompt quality, reading AI code, fixing AI code, critique, and live collaboration. The first assessment that grades AI fluency as a first-class skill.

Code Reading

Untangle real code, spot subtle...
