A Conceptual Framework for a Representational Approach to Information Retrieval

Abstract:

Information retrieval (IR) - the challenge of connecting users to previously stored relevant information - dates back millennia.

The technologies have changed - from clay tablets stacked in a storehouse to books arranged according to the Dewey Decimal Classification to digital content indexed by web search engines - but the aims largely have not.

With the advent of deep learning in the "neural age", IR research of late has been flourishing, particularly building on advances in pretrained transformer models. Today, there is a confusing myriad of competing approaches: cross-encoders vs. bi-encoders, dense vs. sparse representations, inverted indexes vs. approximate nearest neighbors, etc.

In this talk, I present a conceptual framework for understanding recent developments in information retrieval. I propose a representational approach that breaks the core text retrieval problem into a logical scoring model and a physical retrieval model.

The scoring model is defined in terms of encoders, which map queries and documents into a representational space, and a comparison function that computes query-document scores. The physical retrieval model defines how a system produces the top-k scoring documents from an arbitrarily large corpus with respect to a query. I explain how recent developments in IR can be seen as different parameterizations of this framework, and I argue that a unified view suggests a number of open research questions, providing a roadmap for future work.
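As a rough illustration of the split described above, the following is a minimal sketch and not the speaker's implementation. It assumes a toy term-frequency encoder standing in for a learned one, an inner product as the comparison function, and a brute-force scan as the physical retrieval step. The names `encode`, `dot`, and `top_k` are hypothetical and chosen only for this example.

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse representation: term -> weight


def encode(text: str) -> Vector:
    """Toy encoder (assumption): term-frequency vector over lowercased tokens.

    A learned encoder, dense or sparse, would take this role in practice.
    """
    return {term: float(count) for term, count in Counter(text.lower().split()).items()}


def dot(q: Vector, d: Vector) -> float:
    """Comparison function: inner product of two sparse vectors."""
    return sum(weight * d.get(term, 0.0) for term, weight in q.items())


def top_k(query: str, corpus: List[str], k: int) -> List[Tuple[float, int]]:
    """Physical retrieval by brute force: score every document, keep the k best.

    Returns (score, document index) pairs in descending score order.
    """
    q = encode(query)
    scored = ((dot(q, encode(doc)), i) for i, doc in enumerate(corpus))
    return heapq.nlargest(k, scored)


corpus = [
    "dense retrieval with transformers",
    "sparse retrieval with inverted indexes",
    "clay tablets in a storehouse",
]
results = top_k("sparse retrieval", corpus, k=2)
print(results)
```

In this sketch the logical scoring model (`encode` plus `dot`) is independent of how the top-k documents are found. Replacing the linear scan in `top_k` with an inverted index or an approximate nearest neighbor structure would change only the physical retrieval side.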

Biography:

Professor Jimmy Lin holds the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo.

For a quarter of a century, Lin's research has been driven by the quest to develop methods and build tools that connect users to relevant information. His work mostly lies at the intersection of information retrieval and natural language processing, with a focus on two fundamental challenges: those of understanding and scale.

Host

Assoc. Prof. Guido Zuccon

This session will be conducted online via Zoom: https://uqz.zoom.us/j/89362232168

About Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Room: Zoom