Information Retrieval Without Feeling Lucky: The Art and Science of Search

Anja Pilz

Wednesday 17:10 in Helium3

Information Retrieval goes beyond keyword matching - it’s about intent, context, and delivering relevant and accurate results. As RAG applications gain traction, understanding the retrieval process becomes increasingly important for developers, data scientists, and search engineers.

We start with the Why. People have different needs for search - lookup, research, and inspiration. Each of these needs is affected by the key IR metrics of search engines: precision, recall, and desirability. Having introduced these fundamentals, we turn to common retrieval challenges, such as ambiguity, mismatched vocabularies, and the impact of context.
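To make two of these metrics concrete, here is a minimal sketch of precision and recall for a single query. The document IDs and the relevance judgments are made up for illustration, and desirability is left out because the abstract does not define how it is measured.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: four results returned, three documents judged relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, half of the returned results are relevant, while one relevant document (`doc7`) is missed entirely. A lookup need typically cares more about the former, a research need more about the latter.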

To address these challenges, we then turn to advanced search techniques, comparing sparse (keyword-based) and dense (vector-based) retrieval and highlighting their strengths and limitations. We’ll explore hybrid search as a powerful approach that blends these techniques. In a live demo, using crawled data from the Sendung mit der Maus, we’ll showcase a hybrid search setup leveraging tools like Mistral, Elasticsearch, and Streamlit. While the dataset is in German, the core concepts and search dynamics should be easy to follow for non-native speakers as well.
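One common way to blend a sparse and a dense result list is reciprocal rank fusion (RRF). The sketch below is an illustration of that general technique in plain Python, not necessarily the setup used in the demo. The document IDs are hypothetical, and `k=60` is a conventional default rather than a value taken from the talk.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are then sorted by their summed score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a keyword index and a vector index.
sparse_hits = ["maus_12", "maus_07", "maus_33"]
dense_hits = ["maus_07", "maus_41", "maus_12"]
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Because RRF only looks at ranks, it does not require the keyword scores and the vector similarities to be on the same scale. Documents found by both retrievers rise to the top, which is the behaviour one usually wants from a hybrid setup.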

The talk concludes with key takeaways on building effective search systems and a look ahead at future developments in contextualized search.

Tentative Outline:

  1. Introduction to Information Retrieval (~ 5 min)
     • Why do we search? Lookup, research, inspiration
     • Core metrics: precision, recall, desirability

  2. Challenges in Search and Retrieval (~ 5 min)
     • Ambiguity
     • Discrepancy in query and content
     • The impact of context

  3. Search Techniques (~ 10 min)
     • Sparse vs dense retrieval: comparing keyword and vector search (semantic search, embeddings, synsets, decompounders)
     • Hybrid search: Combining sparse and dense approaches

  4. Hybrid Search in Action (< 10 min)
     • Setting up a hybrid search with Mistral, Elasticsearch, and Streamlit
     • Live Demo: exploring search in Lach- & Sachgeschichten from Sendung mit der Maus

  5. Takeaways & Outlook (< 5 min)
     • Hybrid search systems combine semantics, precision and explainability
     • Contextualized search

The talk is directed at anyone interested in building or improving search systems. Attendees will gain a deeper understanding of the tools, methodologies, and metrics essential for building robust and explainable search systems.

Anja Pilz

I received my PhD in Machine Learning (ML) and Natural Language Processing (NLP) from the University of Bonn and Fraunhofer IAIS, where I was a member of the Text Mining group. Now I work on AI and data-driven products, mostly focused on applications in the medical and healthcare domain. My main passion is NLP, especially for the German language, and Information Retrieval (IR). Sometimes I build Recommender Systems.