Understanding and Improving LLM-Based Relevance Judgements for Retrieval Ranking and Evaluation
Speaker: Chuting Yu
Chair: Prof. Gianluca Demartini
Host: Dr. Teerapong Leelanupab
Abstract: Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using LLMs as proxies for human judges. However, whether LLM-based relevance judgments are reliable and rigorous enough to replace humans remains an open question.
In our first contribution, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms, and passage modification strategies. We show that models consistently assign inflated relevance scores, often with high confidence, to passages that do not genuinely satisfy the underlying information need, revealing a systematic bias rather than random fluctuation. Controlled experiments further show that LLM judgments are highly sensitive to passage length and surface-level lexical cues, highlighting the need for careful diagnostic frameworks when applying LLMs to relevance assessment.
While LLM judges are unreliable as full replacements for human assessors, they can nonetheless identify clearly non-relevant passages with reasonable consistency. Our second contribution exploits this asymmetry in pseudo-relevance feedback (PRF), proposing a negative filtering mechanism that removes likely non-relevant documents from the PRF candidate set before query expansion. Experiments on TREC Deep Learning 2019 show consistent retrieval improvements, especially at greater feedback depths.
Our third contribution introduces PromptPRF, a feature-based PRF framework that enables lightweight retrievers to approach the effectiveness of much larger models by shifting the generative workload to the offline indexing phase. Evaluations on TREC and BEIR benchmarks confirm that PromptPRF achieves effectiveness comparable to larger dense retrievers, with substantial efficiency gains over online generative methods.
Bio: Chuting Yu received a Master of Engineering degree from The University of Queensland, Australia. She is currently working toward an MPhil degree with the Electrical Engineering and Computer Science Department at The University of Queensland (UQ), Brisbane, QLD, Australia.
About Data Science Seminar
This seminar series is hosted by EECS Data Science.
Venue
Zoom: https://uqz.zoom.us/j/3522290504?omn=85187729996