Evaluating job search ranking with LLM judged NDCG

Job search queries come in different shapes. Some are broad, such as product manager, software engineer, or sales development representative. Some are narrower role or skill queries, such as SQL Server DBA, biomedical engineer, or data platform engineer. Some are rare exact match queries, such as haskell, elm, or ocaml. Others combine role, skill, seniority, domain, or other constraints, such as senior backend engineer fintech python.

We wanted one scoring method that works across these cases. The question we care about is simple: for this query, did the product rank the strongest available matches near the top?

To answer that, we use LLM judged NDCG.

Evaluation case

Each evaluation case consists of one query, one location filter, and one frozen snapshot of eligible jobs, which together define a frozen eligible corpus. The location filter defines the set of jobs that could have been shown. Once that set is fixed, the evaluation is about ranking quality.

For each case, the evaluator loads every eligible job from the frozen corpus and sends each one to an LLM judge with compact evidence from the posting. The judge returns a 0 to 100 relevance score. The evaluator then sorts all judged jobs by score to create the ideal ranking under the judge, and compares the product's top results against that ideal ranking with NDCG.
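As a rough illustration of the judging and sorting steps, here is a minimal Python sketch. The names `ideal_ranking` and `toy_judge`, the dictionary shape of the corpus, and the tie-breaking by job id are all assumptions made for this example, and the keyword-matching judge only stands in for a real LLM call.

```python
from typing import Callable, Dict, List, Tuple


def ideal_ranking(
    jobs: Dict[str, str],
    judge: Callable[[str, str], int],
    query: str,
) -> List[Tuple[str, int]]:
    """Score every eligible job, then sort by judge score, highest first."""
    scored = [(job_id, judge(query, evidence)) for job_id, evidence in jobs.items()]
    # Assumption: ties are broken by job id so the order is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def toy_judge(query: str, evidence: str) -> int:
    """Stand-in for an LLM judge (hypothetical): returns a 0 to 100 score."""
    return 90 if query.lower() in evidence.lower() else 20


# Hypothetical frozen corpus: job id -> compact evidence from the posting.
corpus = {
    "job-a": "Platform engineer, Kubernetes and Terraform",
    "job-b": "Office manager",
    "job-c": "SRE running Kubernetes clusters",
}

ranking = ideal_ranking(corpus, toy_judge, "kubernetes")
for job_id, score in ranking:
    print(job_id, score)
```

The product's own top results would then be compared against this sorted list.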

Eligibility constraints, such as location and active state filters, are handled before judging. The judge scores only job intent relevance. For example, for the query k8s with the location filter San Francisco, the evaluator scores the eligible corpus, identifies the strongest Kubernetes related matches under the judge, and checks whether the product ranked them near the top.

Why NDCG

Precision and recall force an extra decision: what score counts as relevant? In this setup, that means choosing a cutoff such as 70 or 90, or constructing some other target set for each query.

That loses information because the judge already gives us a graded 0 to 100 score. If one page has scores 95, 94, 93, 92, 91 and another has 75, 74, 73, 72, 71, both pages get precision@5 = 1.0 with a cutoff of 70. Those pages are clearly different.
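The two pages from that example can be checked in a few lines of Python. The helper name `precision_at_k` is ours, and this is only a sketch of the usual definition of precision at k with a score cutoff.

```python
from typing import List


def precision_at_k(scores: List[int], k: int, cutoff: int) -> float:
    """Fraction of the top k results whose judge score meets the cutoff."""
    top_k = scores[:k]
    return sum(1 for s in top_k if s >= cutoff) / k


strong_page = [95, 94, 93, 92, 91]
weaker_page = [75, 74, 73, 72, 71]

p_strong = precision_at_k(strong_page, k=5, cutoff=70)
p_weaker = precision_at_k(weaker_page, k=5, cutoff=70)
print(p_strong, p_weaker)  # 1.0 1.0
```

Raising the cutoff to 90 would separate these two pages, but it would then hide differences among pages that score below 90.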

NDCG preserves the score differences and the ordering. It uses the full graded score to construct the ideal ranking under the judge, then compares the product's ranked page against that ideal. The scoring question becomes: how much of the ideal top k score did the product capture, with earlier ranks weighted more heavily?

Why keep the 0 to 100 score

Coarse relevance buckets lose too much information for this use case. A 0 to 3 label scheme may be enough when there are only a few highly relevant results. Job search often has hundreds of strong matches. If every excellent job collapses into the same bucket, the metric cannot distinguish weaker excellent matches from stronger excellent matches.

The biggest impact range for us is often around 75 to 85. Those jobs are usually plausible matches, but not all plausible matches are equally good. If that whole range collapses into one "good" bucket, the metric loses the ability to tell whether the product put the stronger good matches ahead of the weaker ones. With linear gain NDCG over the full 0 to 100 score, that ordering still matters.
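To make the collapse concrete, here is a small sketch with a hypothetical 0 to 3 labeling. The thresholds in `to_bucket` are illustrative assumptions, not a scheme we use. Two pages that order the same four jobs in opposite directions become identical once bucketed.

```python
from typing import List


def to_bucket(score: int) -> int:
    """Hypothetical 0 to 3 labels; the thresholds are illustrative assumptions."""
    if score >= 90:
        return 3
    if score >= 70:
        return 2
    if score >= 40:
        return 1
    return 0


# The same four jobs, ranked stronger-first and weaker-first.
stronger_first: List[int] = [85, 82, 78, 75]
weaker_first: List[int] = [75, 78, 82, 85]

bucketed_a = [to_bucket(s) for s in stronger_first]
bucketed_b = [to_bucket(s) for s in weaker_first]

print(bucketed_a, bucketed_b)
```

Any metric computed from the bucketed labels sees the same input for both pages, so it cannot reward the better ordering.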

We do not treat every one point difference as exact truth. The score is a judge signal, not a physical measurement. But preserving the full score gives the evaluator resolution inside the large set of good and excellent matches.

Why linear gain

NDCG needs a gain function. A common formulation uses exponential gain, gain(rel) = 2^rel - 1. That works when relevance labels are small ordinal values like 0, 1, 2, 3.

It does not work directly with 0 to 100 scores. Exponential gain would make the top end dominate the metric unless we first bucketed or rescaled the scores, which would undo the reason we kept the 0 to 100 signal.

So we use linear gain: gain(score) = score. A score of 100 contributes twice as much raw gain as a score of 50, and a 95 contributes slightly more than a 90.
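A quick numeric comparison shows the difference in scale between the two gain functions. The function names are ours; the formulas are the ones given above.

```python
def exponential_gain(rel: int) -> int:
    """Exponential gain: 2^rel - 1."""
    return 2 ** rel - 1


def linear_gain(score: int) -> int:
    """Linear gain: the judge score itself."""
    return score


# How much more does a 100 contribute than a 90 under each gain function?
exp_ratio = exponential_gain(100) / exponential_gain(90)
lin_ratio = linear_gain(100) / linear_gain(90)

print(f"exponential: {exp_ratio:.1f}x, linear: {lin_ratio:.2f}x")
```

Under exponential gain, a ten point gap becomes a factor of about 1,000, so a single top-scored job would swamp everything below it. Under linear gain the same gap is a factor of about 1.11.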

That tradeoff is intentional. We want the metric to preserve ordering inside strong matches without making small score differences explode. Linear gain also keeps the metric inspectable: the NDCG result is tied directly to the judge scores shown in the report.

The important caveat is judge reliability. If the judge cannot consistently distinguish 91 from 93, then the metric should not be read as proving that one job is meaningfully better than the other. The useful signal is usually at the ranking and page level: did the product put stronger scored jobs near the top, and did it miss jobs that the judge scored much higher?

Metric

We use linear gain NDCG over the LLM's 0 to 100 relevance scores.

DCG@k = sum(score_i / log2(rank_i + 1))
NDCG@k = DCG@k(product_top_k) / DCG@k(ideal_top_k)

The denominator is the maximum possible discounted score for the top k page, using the...
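The two formulas translate fairly directly into Python. This is a minimal sketch, not our evaluator: the function names, the sample scores, and the choice to return 0.0 when the ideal DCG is zero are assumptions. It takes the ideal top k to be the k highest judge scores in the eligible corpus, as described in the evaluation case section.

```python
import math
from typing import List


def dcg_at_k(scores: List[float], k: int) -> float:
    """Linear gain DCG: each score divided by log2(rank + 1), ranks start at 1."""
    return sum(s / math.log2(rank + 1) for rank, s in enumerate(scores[:k], start=1))


def ndcg_at_k(product_scores: List[float], corpus_scores: List[float], k: int) -> float:
    """product_scores: judge scores of the product's results, in displayed order.
    corpus_scores: judge scores of every eligible job, in any order."""
    ideal = sorted(corpus_scores, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # Assumption: nothing to capture when no job scores above zero.
    return dcg_at_k(product_scores, k) / ideal_dcg


# Hypothetical judge scores for a small eligible corpus.
corpus = [95, 90, 85, 80, 60, 40]

perfect = ndcg_at_k([95, 90, 85], corpus, k=3)  # ideal jobs in ideal order
swapped = ndcg_at_k([85, 90, 95], corpus, k=3)  # ideal jobs, reversed order
missed = ndcg_at_k([80, 60, 40], corpus, k=3)   # stronger jobs left off the page

print(perfect, round(swapped, 3), round(missed, 3))
```

In this sample, reversing the order of the right jobs costs a few points of NDCG, while leaving the strongest jobs off the page costs far more.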
