Evaluator — Hire engineers who use AI well
For the 2026 hiring market<br>Every engineer uses AI now.<br>Hire the ones who use it well.<br>Evaluator is the technical assessment that grades how skillfully candidates collaborate with AI (reading it, fixing it, prompting it, overriding it) on top of the fundamentals that still matter: reading, writing, debugging.<br>10 free / month · No card required
AI · Critique<br>Question 14 of 17<br>20 pts<br>An AI assistant produced this. It looks reasonable. It is not. Find every flaw and fix it.
async function fetchUserPosts(userId: string) {<br>const res = await fetch(`/api/users/${userId}/posts`)<br>const posts = res.json.parse()<br>return posts.filter((p, i) => i )<br>}<br>Candidate found<br>res.json.parse(): hallucinated. It's await res.json().<br>i: off-by-one. Should be or just drop the filter.<br>Critique score: 92/100
Caught the hallucination
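For reference, here is a minimal sketch of what the function might look like once the candidate's fixes are applied. It is not Evaluator's answer key. The `Post` shape, the `JsonResponse` interface, the injected `fetchImpl` parameter, and the `res.ok` check are all assumptions added for illustration; the page shows none of them.

```typescript
// Minimal response shape this sketch relies on (a subset of the Fetch API's Response).
interface JsonResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// Hypothetical post shape; the sample on the page never shows the real one.
interface Post {
  id: string;
  title: string;
}

// fetchImpl is injected so the function can be exercised without a network.
// In a browser you would pass the global fetch.
async function fetchUserPosts(
  userId: string,
  fetchImpl: (url: string) => Promise<JsonResponse>,
): Promise<Post[]> {
  const res = await fetchImpl(`/api/users/${userId}/posts`);

  // Assumption: a non-2xx status should surface as an error, not as bad JSON.
  if (!res.ok) {
    throw new Error(`Failed to load posts for ${userId}: HTTP ${res.status}`);
  }

  // res.json() is a method that returns a promise; there is no res.json.parse().
  const posts = (await res.json()) as Post[];

  // The index filter is dropped, one of the two fixes the candidate suggests.
  return posts;
}
```

Injecting the fetch function is a design choice for this sketch only: it keeps the example runnable in isolation and makes the HTTP failure path easy to exercise.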
The shift<br>You've been screening for the wrong thing.
2023 hiring<br>“Did the candidate use ChatGPT? Block them, detect them, ban the tool.”
2026 hiring<br>“Of course they use AI. The question is whether they can read it, fix it, prompt it, and override it when it hallucinates.”
Every shop now has Copilot, Cursor, Claude Code. The bottom quartile of every team is the one that takes the AI's first answer. The top quartile catches the hallucinated import, rewrites the over-engineered class, and ships something that actually works. We test for the top quartile.
The differentiator<br>Five tests for how someone works with AI.<br>No other platform does this. Most still treat AI as a thing to detect. We treat it as a tool to grade.
01 Prompt quality<br>Can they brief an AI like they brief a junior?<br>We give them a feature spec. They write the prompt they would actually send. We score for context, constraints, edge cases, and acceptance criteria, not for verbosity.
Strong candidate response<br>Implement a debounced search hook for the Postgres-backed /api/search endpoint we already use in SearchBar.tsx. 300ms debounce. Cancel in-flight requests on new input (use the AbortController we use elsewhere). Return { data, error, loading }. Don't introduce a new fetch library; we use native fetch. Cover the empty-query case (return early, no request).<br>+ context · + constraints · + edge case
02 Reading AI code<br>Can they tell "works" from "good"?<br>We show them AI-written code that runs. They explain what it does, flag the AI-shaped tells (over-engineered classes, defensive try/catch eating real errors, non-idiomatic patterns), and say what they would change.
class UserDataManager {<br>private cache: Map<br>constructor() {<br>this.cache = new Map()<br>}<br>async getUserById(id: string | null): Promise {<br>if (!id) return null<br>try {<br>if (this.cache.has(id)) return this.cache.get(id)!<br>return await fetchUser(id)<br>} catch (e) { return null }<br>}<br>}<br>Candidate<br>“A class for what should be a function. Swallows errors silently, so the caller can't tell a 500 from a missing user. Doesn't actually write to the cache, so it never warms.”
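One way the candidate's three points could be acted on is sketched below: a plain function in place of the class, errors left to propagate, and a cache that is actually written. The `User` shape and the `createGetUserById` factory are hypothetical names for illustration, and `fetchUser` is passed in because the page does not show where it comes from. The sketch also narrows `id` to `string`, which is an assumption about what callers can guarantee.

```typescript
// Hypothetical user shape; the sample on the page does not show one.
interface User {
  id: string;
  name: string;
}

// Builds a cached lookup around whatever fetchUser the codebase already has.
function createGetUserById(fetchUser: (id: string) => Promise<User | null>) {
  const cache = new Map<string, User>();

  return async function getUserById(id: string): Promise<User | null> {
    const cached = cache.get(id);
    if (cached !== undefined) return cached;

    // No try/catch: a failed request rejects, so the caller can tell
    // "lookup failed" apart from "no such user" (null).
    const user = await fetchUser(id);

    // Write the result back so the cache actually warms.
    if (user !== null) cache.set(id, user);
    return user;
  };
}
```

With this shape, a second call for the same id is served from the cache, a missing user still returns `null`, and a server error reaches the caller as a rejected promise.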
03 Fixing AI code<br>Can they surgically fix one bug?<br>We plant exactly one realistic bug in an AI-written function. They find it and patch it minimally. We penalize broad refactors that miss the actual problem.
diff<br>function paginate(items, page, size) {<br>- const start = page * size<br>+ const start = (page - 1) * size<br>return items.slice(start, start + size)<br>}<br>Correct: surgical fix. No collateral refactor.
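The fix only makes sense if pages are 1-indexed, which the diff implies but does not state. Under that assumption, here is a typed sketch of the patched function with a quick check. The generic signature and the sample data are additions for illustration.

```typescript
// Returns the items on a 1-indexed page; page 1 starts at offset 0.
function paginate<T>(items: readonly T[], page: number, size: number): T[] {
  const start = (page - 1) * size;
  return items.slice(start, start + size);
}

const letters = ["a", "b", "c", "d", "e"];
console.log(paginate(letters, 1, 2)); // ["a", "b"]
console.log(paginate(letters, 3, 2)); // ["e"]
```

With the original `page * size`, page 1 would start at offset 2 and return `["c", "d"]`, silently skipping the first page.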
04 Critique<br>Can they catch every hallucination?<br>We give them code with multiple planted flaws: fake APIs, off-by-ones, swallowed errors. We grade thoroughness: did they catch them all, or did they stop at the first one and say "looks good"?
Found by candidate · 3 / 3<br>lodash.deepFlatten doesn't exist — _.flattenDeep does.<br>catch (e) swallows the error. Should at least log or rethrow.<br>Loop runs O(n²) — switch the outer to a Set lookup.
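The code behind these findings is not shown on the page, so the following is only an illustration of the third one: replacing a nested membership scan with a `Set` lookup. The function names and the "missing ids" task are hypothetical.

```typescript
// Quadratic: for each id in `wanted`, scan all of `existing`.
function missingIdsQuadratic(wanted: string[], existing: string[]): string[] {
  const missing: string[] = [];
  for (const id of wanted) {
    let found = false;
    for (const other of existing) {
      if (other === id) {
        found = true;
        break;
      }
    }
    if (!found) missing.push(id);
  }
  return missing;
}

// Build the Set once; each membership check is then O(1) on average,
// so the whole pass is roughly O(n + m) instead of O(n * m).
function missingIds(wanted: string[], existing: string[]): string[] {
  const seen = new Set(existing);
  return wanted.filter((id) => !seen.has(id));
}
```

Both functions return the same result in the same order; only the cost of the membership check changes.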
05 Live collaboration<br>Watch them work with the assistant.<br>On the final question, the candidate gets an AI sidebar built into the editor. We record every prompt they send, every suggestion they accept, every chunk they reject, and every keystroke they make on top. The transcript goes to you.
function debouncedSearch(query: string) {<br>// accepted from AI<br>if (!query) return<br>if (controller) controller.abort()<br>// candidate edit: was 200, made it 300<br>timeout = setTimeout(...)<br>}<br>Sidebar transcript<br>You: use AbortController for cancellation<br>AI:<br>You: debounce is wrong, should be 300ms not 200ms
4 prompts · 2 accepts · 1 reject · 38% manual edits
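The snippet in this card is deliberately partial, so here is a rough, self-contained sketch of the pattern it points at: a 300ms debounce that aborts the previous request's `AbortController` on new input. The `createDebouncedSearch` factory, the `search` callback, and the returned signal are assumptions made so the example can run on its own; this is not the code a candidate actually produced.

```typescript
// Wraps a search function so that only the latest query runs, after a quiet period.
function createDebouncedSearch(
  search: (query: string, signal: AbortSignal) => void,
  delayMs = 300,
) {
  let timeout: ReturnType<typeof setTimeout> | undefined;
  let controller: AbortController | undefined;

  return function debouncedSearch(query: string): AbortSignal | undefined {
    // New input invalidates anything pending or in flight.
    if (timeout !== undefined) clearTimeout(timeout);
    if (controller) controller.abort();

    // Empty query: return early, no request.
    if (!query) return undefined;

    controller = new AbortController();
    const { signal } = controller;
    timeout = setTimeout(() => search(query, signal), delayMs);

    // Returned so callers (and tests) can observe cancellation.
    return signal;
  };
}
```

One design choice differs from the card: this sketch cancels pending work before the empty-query check, so clearing the input also stops a stale request. Whether that is the right behavior depends on the product.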
Six dimensions<br>Five fundamentals. Plus the one nobody else tests.<br>Every assessment is generated for the specific role you're hiring for, in the specific tech stack you use. The questions change. The dimensions don't.
AI · The differentiator<br>AI Collaboration
Five sub-tests: prompt quality, reading AI code, fixing AI code, critique, and live collaboration. The first assessment that grades AI fluency as a first-class skill.<br>R · Code Reading<br>Untangle real code, spot subtle...