On using language models for relevance labelling

The Data Science Discipline of the School of EECS is hosting the following guest seminar:

On using language models for relevance labelling

Speaker: Dr Paul Thomas, Microsoft
Host: Prof Shane Culpepper

Abstract: Relevance labels, annotations that say whether a result is relevant to a given search, are key to evaluating the quality of a search engine. Standard practice to date has been to ask in-house or crowd workers to label results, but recently developed language models are able to produce labels at greatly reduced cost.

We report on experiments using GPT-4, a recent large language model, to label documents for relevance. On queries and documents from TREC-Robust, we see accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. We also demonstrate variation due to prompt features and to slight paraphrases, emphasising the need for “gold” labels to validate our metrics.

At Bing we have been using GPT-4 for large-scale relevance labelling, and we also report on our experiences labelling web queries and web documents. We find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.

This is work with Nick Craswell, Seth Spielman, and Bhaskar Mitra (all Microsoft).

Biography: Paul Thomas is a senior applied scientist at Microsoft, where he works on measurement for Bing and other products. His research is in information retrieval, particularly how people use web search systems and how we should evaluate these systems, as well as interfaces for search, including search with different types of results, search on mobile devices, and search as conversation.

About Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Room: 01-E107 Forgan Smith Building, Great Court