An experimental study of multi-stage retrieval systems

Information Retrieval (IR) is concerned with searching over large collections of unstructured data, such as web pages, emails, and image libraries. Searching at this scale is made possible by IR systems, which pre-process such data and store it in a space- and time-efficient structure known as an inverted index. In an IR system, retrieving relevant data (henceforth referred to as documents) for a given user query passes through three major stages: (1) candidate set generation, (2) feature extraction, and (3) candidate set re-ranking.

During the candidate set generation stage (stage 1), a retrieval strategy with a pruning model (e.g., WAND or MaxScore) is executed on top of the inverted index to retrieve the top-K ranked documents for a given user query. In the feature extraction stage (stage 2), various features are extracted from the top-K documents produced by stage 1. In the candidate set re-ranking stage (stage 3), the features extracted in stage 2 are passed to a trained machine learning model, which re-ranks the top-K documents accordingly. After re-ranking, the top 10 to 50 documents are returned to the user as the final ranked list.
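As a rough illustration of the three stages described above, the following Python sketch wires them together on a toy corpus. It is not the system studied in this work: the corpus, the function names, the three features, and the weights in WEIGHTS are all hypothetical, stage 1 uses exhaustive term-frequency scoring in place of a pruning strategy such as WAND or MaxScore, and stage 3 uses a fixed linear scorer as a stand-in for a trained machine learning model.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical documents, for illustration only).
DOCS = {
    "d1": "inverted index compression for web search",
    "d2": "learning to rank for web search with gradient boosted trees",
    "d3": "image libraries and email archives",
    "d4": "web search web search web search",
}

# Hypothetical weights standing in for a trained stage-3 model.
WEIGHTS = {"coverage": 1.0, "match_ratio": 0.5, "distinct_ratio": 2.0}


def build_inverted_index(docs):
    """Map each term to its postings: {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index


def generate_candidates(index, query, k):
    """Stage 1: score documents by summed term frequency, keep the top k.

    Exhaustive scoring is used here for simplicity; it is NOT WAND or MaxScore.
    """
    scores = Counter()
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def extract_features(docs, query, doc_id):
    """Stage 2: compute a few simple query-document features."""
    terms = docs[doc_id].split()
    query_terms = set(query.split())
    matched = sum(1 for t in terms if t in query_terms)
    return {
        "coverage": len(query_terms & set(terms)) / len(query_terms),
        "match_ratio": matched / len(terms),
        "distinct_ratio": len(set(terms)) / len(terms),
    }


def rerank(features_by_doc, n):
    """Stage 3: re-score candidates with the linear scorer, return the top n."""
    def score(doc_id):
        feats = features_by_doc[doc_id]
        return sum(WEIGHTS[name] * value for name, value in feats.items())
    return sorted(features_by_doc, key=lambda d: (-score(d), d))[:n]


def search(docs, index, query, k, n):
    """Run all three stages end to end."""
    candidates = generate_candidates(index, query, k)
    features = {d: extract_features(docs, query, d) for d in candidates}
    return rerank(features, n)


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    for k in (3, 2):
        print(k, search(DOCS, index, "web search", k=k, n=2))
```

In this toy setup, the final list changes when only the stage 1 cutoff k changes, because a document dropped in stage 1 is never seen by the re-ranker.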

In this work, we found that the effectiveness of the machine learning models in stage 3 relies heavily on the configuration parameters defined in stages 1 and 2. Nonetheless, these models are typically developed and trained without regard to such parameters, which results in a significant loss in the end-to-end effectiveness of IR systems.

To this end, we thoroughly investigated the correlation between the different stages of real-world end-to-end multi-stage IR systems to understand how they impact each other and, accordingly, recoup lost effectiveness. In particular, we asked and answered two critical research questions. First, to what extent do stages 1 and 2 influence the effectiveness of stage 3? Second, what parameter values and machine learning models should we use in stages 1, 2, and 3 to achieve high effectiveness in end-to-end IR systems?

 


  • Author: Mohammed Yusaf Ansari
  • Advisor: Mohammad Hammoud
