Document Retrieval Techniques

October 27, 2024

Retrieval at scale

The most visible face of LLMs today is chat interfaces that let you ask arbitrary questions. An underappreciated aspect of LLMs is how much they have revolutionized document retrieval.

A decade ago, retrieving documents that could address the meaning of a question was a difficult task. You had to know the right keywords, and even then, you might not get a good answer. In the early days, one of Google's big products was an enterprise search engine, a literal server box that companies could slot into their datacenters.

But nowadays, you can get incredible performance out of compact neural models that index millions of documents per dollar. Several of these advances have only become possible in the last few years, with the advent of scalable architectures like the Transformer and the ability of large, capable models to generate huge amounts of high-quality synthetic training data.

There are two main approaches to embedding documents: textual and visual. Neither is a universal interface, but each has its own strengths.

Textual retrieval

Roughly speaking, there are three main neural techniques for retrieving documents as text:

  1. Dense textual embeddings: These are the most common type. They map a string to a single fixed-length vector in a high-dimensional space, and documents are retrieved by comparing vectors. Examples include OpenAI's text-embedding-00n, the BGE family, and VoyageAI's models. The MTEB leaderboard has a comprehensive list. (A minimal sketch follows this list.)
  2. Cross-encoders: A more expensive option is to train a model that takes a query and a document together and returns a score for how well they match. These are usually exposed as rerankers; Cohere has a reranker API that does this.
  3. Late interaction: As a middle ground, some models map a query and a document each to a sequence of token-level vectors. A query and a document can then be scored using an inexpensive aggregation function, such as MaxSim. The ColBERT architecture introduced this technique.
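
To make the dense option concrete, here is a minimal retrieval sketch using the sentence-transformers library. The library and the specific checkpoint name are my choices for illustration; any dense embedder from the MTEB leaderboard would slot in the same way.

    # Minimal dense-retrieval sketch (illustrative model name).
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")

    docs = [
        "ColBERT scores queries against documents with MaxSim.",
        "The Transformer architecture scales well with data.",
    ]
    query = "How does late interaction scoring work?"

    # Each string becomes one fixed-length vector; normalize for cosine similarity.
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    scores = doc_vecs @ query_vec          # cosine similarity via dot product
    print(docs[int(np.argmax(scores))])    # best-matching document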

Dense embeddings are the most common technique, and they are the easiest to implement. Because they have been around for a while, there are many high-quality libraries for working with them, and several databases have optimized indexing techniques that let you search through billions of vectors.
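
As one illustration of such indexing, here is a sketch using FAISS, one commonly used library. The flat index shown below does exact search; at billion-vector scale you would pick an approximate index (IVF, HNSW) instead, trading a little accuracy for speed.

    # Sketch of indexing dense vectors with FAISS (exact search shown;
    # approximate indexes are used at very large scale).
    import faiss
    import numpy as np

    dim = 384
    doc_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings
    faiss.normalize_L2(doc_vecs)                               # cosine via inner product

    index = faiss.IndexFlatIP(dim)   # exact inner-product search
    index.add(doc_vecs)

    query = np.random.rand(1, dim).astype("float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 5)   # top-5 document ids and scores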

The drawback of dense embeddings is that compressing a whole document into one vector inevitably loses information, so they perform worse than a cross-encoder. The problem with a cross-encoder is that it has to score every query-document pair at query time, which is expensive. That's why a lot of search pipelines use a two-stage approach: first retrieve a small candidate set using dense embeddings, then use a cross-encoder to rerank it.
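
Here is a minimal sketch of that two-stage pipeline. It uses a local cross-encoder from sentence-transformers in place of a hosted reranker like Cohere's; both model names are illustrative.

    # Two-stage retrieval sketch: dense retrieval for candidates, cross-encoder rerank.
    from sentence_transformers import SentenceTransformer, CrossEncoder
    import numpy as np

    embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    docs = [
        "ColBERT introduced late interaction retrieval.",
        "Transformers scale well with more data.",
        "FAISS indexes billions of dense vectors.",
        "Cohere exposes a reranking API.",
    ]
    query = "What is late interaction retrieval?"

    # Stage 1: cheap dense retrieval to get a small candidate set.
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    candidates = np.argsort(doc_vecs @ query_vec)[::-1][:2]

    # Stage 2: the expensive cross-encoder scores each (query, candidate) pair.
    pairs = [(query, docs[i]) for i in candidates]
    rerank_scores = reranker.predict(pairs)
    print(docs[candidates[int(np.argmax(rerank_scores))]])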

Late interaction models let you improve things even further. They are more expensive than dense embeddings, but less expensive than cross-encoders. In practice, models like ColBERT retain roughly 95% of the quality of a cross-encoder while being orders of magnitude faster at query time. The drawback is that support for late interaction embeddings is relatively new, so there are fewer libraries and indexing techniques optimized for them.
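
The MaxSim aggregation at the heart of late interaction is simple enough to sketch directly. Here it is in plain NumPy, assuming a model has already produced one normalized vector per token for the query and the document.

    # Late-interaction (ColBERT-style) scoring sketch.
    # Assumes a model already produced one L2-normalized vector per token.
    import numpy as np

    def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
        """Sum over query tokens of the max similarity to any document token."""
        sim = query_vecs @ doc_vecs.T        # (num_query_tokens, num_doc_tokens)
        return float(sim.max(axis=1).sum())  # best doc token per query token, summed

    # Stand-in token embeddings: 5 query tokens, 40 doc tokens, 128 dims.
    rng = np.random.default_rng(0)
    q = rng.standard_normal((5, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
    d = rng.standard_normal((40, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
    print(maxsim_score(q, d))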

It's also possible to train a single model that produces several of these representations at once, which can be convenient during indexing. The BGE-M3 model, for example, can be used as a dense embedder, a sparse (lexical) retriever, or a late interaction model.
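
For example, here is a sketch using the FlagEmbedding package's BGEM3FlagModel interface. The argument names and return keys below reflect my reading of that package's documented usage, so treat the details as an assumption and check the package docs.

    # Sketch: several representations from one BGE-M3 forward pass.
    # Interface details are an assumption based on the FlagEmbedding package docs.
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

    out = model.encode(
        ["A page of text from some document."],
        return_dense=True,         # one fixed-length vector per text
        return_sparse=True,        # lexical (token -> weight) representation
        return_colbert_vecs=True,  # per-token vectors for late interaction
    )
    dense = out["dense_vecs"]
    sparse = out["lexical_weights"]
    multi_vector = out["colbert_vecs"]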

Visual retrieval

Lots of documents have visual content like pictures, charts, and graphs. How can we retrieve them neurally?

If you're a text maximalist, you might think that the best way to retrieve visual content is to convert it into text: use a model to caption each image, and index the caption as a string.

But there are two problems with this approach. First, you need a captioning model that works well on your documents, which is its own hard problem. Second, converting an image into text is lossy: layout, chart values, and other visual structure can be dropped or distorted, which can change the meaning of the document.

A better approach is to train a model that can take in the image directly. As with textual embeddings, there are three main techniques: dense embeddings, cross-encoders, and late interaction. Visual retrieval is a more recent development because vision-language models are younger than their text-only counterparts.

For dense embeddings, the recent MCDSE model is a state-of-the-art embedding model. The technique it uses is called document screenshot embedding: each page is rendered as an image and embedded directly, with no OCR or captioning step. The space is so new that MCDSE is the best model of its class, outperforming even its late-interaction counterpart, ColQwen.
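
To make the screenshot-embedding workflow concrete, here is a minimal sketch using an off-the-shelf CLIP checkpoint through sentence-transformers. CLIP is not a document-specialized model like MCDSE, and the file name below is hypothetical; the point is just the mechanics: embed a rendered page and a text query into the same vector space and compare them.

    # Sketch of the screenshot-embedding mechanics with a generic CLIP model.
    # (Illustrative only; MCDSE and similar models have their own loading code.)
    from sentence_transformers import SentenceTransformer
    from PIL import Image
    import numpy as np

    model = SentenceTransformer("clip-ViT-B-32")

    page = Image.open("page_screenshot.png")   # hypothetical rendered PDF page
    page_vec = model.encode([page])[0]          # image -> fixed-length vector
    query_vec = model.encode(["quarterly revenue bar chart"])[0]

    # Cosine similarity between the page screenshot and the text query.
    score = page_vec @ query_vec / (np.linalg.norm(page_vec) * np.linalg.norm(query_vec))
    print(score)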

For tracking models in this category, the ViDoRe leaderboard is a good resource.

What's next?

In my opinion, performance on ViDoRe is kind of saturated. The smallest decent VLM is around 2B parameters, but we know from the text space that models in the 100M-parameter range are already quite good. It would be great if Meta or Alibaba could release smaller Llama and Qwen checkpoints with vision enabled.

We're also about to see much stronger vision models in general once companies adopt early fusion training, as Meta did with CM3Leon. The performance gains should apply across the board, so we should see a lot of progress in visual retrieval, and the price-performance of these models should keep improving rapidly.

As of today, it is still too expensive to run document embedding models locally, e.g. over a user's PDF library. MCDSE takes a few seconds to embed each page on my M1 MacBook Pro, which is too slow for most applications. Once models get smaller and faster, this will change.