As I’m sure you’re aware, NDCG is a statistic used to measure the relevance of a search algorithm’s results. It compares the returned results to a set of labels, aka a judgment list. These judgments might be explicit – a human being went through and labeled them – or they might be implicit – gathered from clickstream data. Or they might be LLM-generated.
In any case, once the judgments are compared against the returned search results, we can put a number on the relevance.
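To make that concrete, here’s a minimal sketch of the comparison, not the definitive way to do it. The names `dcg`, `ndcg`, `results`, and `judgments` are made up for illustration. It assumes graded labels (0 = irrelevant, higher = better), the exponential gain `2**grade - 1`, a log2 rank discount, unlabeled results counted as grade 0, and an ideal ranking built from every judged document for the query. Each of those is a choice, and other choices are just as defensible.

```python
import math


def dcg(grades):
    """Discounted cumulative gain for a list of grades in ranked order."""
    # Assumption: exponential gain (2**grade - 1) with a log2(rank + 1) discount.
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades, start=1)
    )


def ndcg(results, judgments, k=10):
    """NDCG@k for one query.

    results   -- doc ids in the order the search engine returned them
    judgments -- dict of doc id -> grade (the judgment list for this query)
    """
    # Assumption: a returned doc with no label counts as irrelevant (grade 0).
    grades = [judgments.get(doc_id, 0) for doc_id in results[:k]]
    # Assumption: the ideal ranking is the best possible ordering of all judged docs.
    ideal_grades = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_grades)
    # With no relevant labels there is nothing to normalize against.
    return dcg(grades) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgment list and two orderings of the same results.
judgments = {"doc_a": 3, "doc_b": 2, "doc_c": 0}
print(ndcg(["doc_a", "doc_b", "doc_c"], judgments))  # best ordering scores 1.0
print(ndcg(["doc_c", "doc_b", "doc_a"], judgments))  # worst ordering scores lower
```

The best-first ordering scores exactly 1.0 because it matches the ideal ranking; anything that pushes the relevant documents down the list scores lower.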
I’ve argued, of course, this whole NDCG way of evaluating search is overrated in real-life circumstances. Yet it’s still useful 😎.
And, by God, I’m exhausted from writing this same boilerplate code over and over and over again at every job I go to.
So here’s a useful notebook to walk you through the major decisions of NDCG calculation that matter to real-life search evaluation.
Enjoy!