Today we’ll be discussing the article “A Neural Corpus Indexer for Document Retrieval” and looking at the strengths and weaknesses of its approach. The paper received an Outstanding Paper Award at the Neural Information Processing Systems (NeurIPS) conference in 2022. It presents a sequence-to-sequence model for document retrieval. So let's get started!
Components of a web search engine
A typical web search engine has two vital components: document retrieval and ranking. The retrieval stage fetches documents that are relevant to the query, and the ranking stage then assigns a more precise score to each of them. To build a precise ranking model, a deep neural network typically takes each query-document pair as input and estimates its relevance score. In an online system, however, it is only feasible to score a hundred or a thousand candidates per query this way. That makes the recall of the retrieval stage vital to the effectiveness of the whole search engine: a relevant document that retrieval misses can never be recovered by ranking.
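To make the two-stage setup concrete, here is a minimal sketch of a retrieve-then-rank pipeline. Everything in it is an assumption made for illustration: the function names, the toy corpus, and the `term_overlap` scorer, which stands in for both the cheap retrieval score and the expensive neural ranker. It is not how any particular search engine is built.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_then_rank(
    query: str,
    corpus: Dict[str, str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    num_candidates: int,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps the top `num_candidates` documents by a cheap score;
    stage 2 re-scores only those candidates with the expensive model."""
    # Stage 1: retrieval over the whole corpus with the cheap scorer.
    candidates = sorted(
        corpus,
        key=lambda doc_id: cheap_score(query, corpus[doc_id]),
        reverse=True,
    )[:num_candidates]
    # Stage 2: ranking restricted to the retrieved candidates.
    ranked = [(doc_id, expensive_score(query, corpus[doc_id])) for doc_id in candidates]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def term_overlap(query: str, document: str) -> float:
    """Toy stand-in scorer: number of distinct query terms found in the document."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


# Hypothetical three-document corpus, for illustration only.
corpus = {
    "d1": "neural retrieval with transformers",
    "d2": "cooking pasta at home",
    "d3": "indexing a corpus with neural networks",
}
result = retrieve_then_rank("neural corpus indexer", corpus, term_overlap, term_overlap, 2)
print([doc_id for doc_id, _ in result])  # ['d3', 'd1']
```

Note that the ranker never sees "d2" here. Whatever the second stage would have thought of it, a document dropped in stage 1 is gone, which is why recall at the retrieval stage matters so much.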
Neural Corpus Indexer
The authors of this paper propose a new approach to document retrieval that uses a sequence-to-sequence model to generate relevant document identifiers directly from the query. According to the paper, the model outperforms both inverted-index and vector-based retrieval systems on the benchmarks tested.
So, what is the Neural Corpus Indexer (NCI) model? It’s an end-to-end deep neural network that unifies the training and indexing stages. It is designed to achieve significantly better recall than traditional methods. The model consists of three components: a query generator, an encoder, and a Prefix-Aware Weight-Adaptive (PAWA) decoder.
Let's talk about each of these components in more detail. First, query generation. This is a data augmentation technique for training: it creates queries from a document, and the resulting query-document pairs help the model to better understand and identify that document. It is implemented by a sequence-to-sequence transformer model that takes the document terms as input and produces a query as output.
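As a rough sketch of how generated queries can be folded into the training data, the snippet below pairs every generated query with the identifier of the document it came from. The names are hypothetical, and `leading_terms_generator` is only a toy stand-in for the trained sequence-to-sequence generator, which this post does not reproduce.

```python
from typing import Callable, Dict, List, Tuple

TrainingPair = Tuple[str, str]  # (query text, document identifier)


def augment_with_generated_queries(
    documents: Dict[str, str],
    labeled_pairs: List[TrainingPair],
    generate_queries: Callable[[str, int], List[str]],
    queries_per_document: int,
) -> List[TrainingPair]:
    """Return the labeled pairs plus one pair per generated query."""
    pairs = list(labeled_pairs)  # copy so the caller's list is not modified
    for doc_id, text in documents.items():
        for query in generate_queries(text, queries_per_document):
            pairs.append((query, doc_id))
    return pairs


def leading_terms_generator(text: str, count: int) -> List[str]:
    """Toy stand-in for a trained generator: returns growing prefixes of the
    document (2 terms, 3 terms, ...). A real system would call a seq2seq model."""
    terms = text.split()
    return [" ".join(terms[: n + 2]) for n in range(count)]


# Hypothetical data, for illustration only.
documents = {"d1": "neural corpus indexer for retrieval"}
labeled = [("what is nci", "d1")]
print(augment_with_generated_queries(documents, labeled, leading_terms_generator, 2))
# [('what is nci', 'd1'), ('neural corpus', 'd1'), ('neural corpus indexer', 'd1')]
```

The generator is passed in as a plain function so that the toy version can be swapped for a real model without touching the pairing logic.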
The encoder follows the standard transformer architecture and outputs a representation of the input query. The decoder is composed of M2 transformer layers (M2 being the number of decoder layers) and a weight adaptation mechanism, which makes the decoder aware of the semantic prefix it has generated so far. This allows NCI to better align with the hierarchical nature of the semantic identifiers assigned to documents, resulting in more accurate document retrieval.
At inference time, the top N relevant documents can be easily obtained via beam search. The hierarchical property of semantic identifiers makes it easy to constrain the beam search on the prefix tree, so only valid identifiers will be generated.
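Here is a minimal sketch of beam search constrained by a prefix tree. It rests on several simplifying assumptions that are mine, not the paper's: identifiers are fixed-length tuples of integers, the prefix tree is a nested dictionary, and `toy_model` is a hypothetical stand-in for the decoder that returns next-token log-probabilities for a given prefix.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Identifier = Tuple[int, ...]


def build_prefix_tree(identifiers: Sequence[Identifier]) -> Dict:
    """Nested-dict trie: each key is a token, each value the subtree below it."""
    root: Dict = {}
    for identifier in identifiers:
        node = root
        for token in identifier:
            node = node.setdefault(token, {})
    return root


def constrained_beam_search(
    tree: Dict,
    next_token_log_probs: Callable[[Identifier], Dict[int, float]],
    beam_size: int,
    id_length: int,
) -> List[Tuple[Identifier, float]]:
    """Beam search that only expands tokens present under the current prefix."""
    beams: List[Tuple[Identifier, float, Dict]] = [((), 0.0, tree)]
    for _ in range(id_length):
        expanded = []
        for prefix, score, node in beams:
            log_probs = next_token_log_probs(prefix)
            for token, child in node.items():  # valid continuations only
                token_score = log_probs.get(token, -math.inf)
                expanded.append((prefix + (token,), score + token_score, child))
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_size]
    return [(prefix, score) for prefix, score, _ in beams]


def toy_model(prefix: Identifier) -> Dict[int, float]:
    """Hypothetical decoder stand-in with hand-picked probabilities."""
    if prefix == ():
        return {0: math.log(0.3), 1: math.log(0.7)}
    return {0: math.log(0.2), 1: math.log(0.8)}


# Hypothetical corpus of three documents with two-token identifiers.
valid_ids = [(0, 0), (0, 1), (1, 0)]
results = constrained_beam_search(build_prefix_tree(valid_ids), toy_model, 2, 2)
print([identifier for identifier, _ in results])  # [(0, 1), (1, 0)]
```

Left unconstrained, this toy model would prefer (1, 1), with probability 0.7 × 0.8 = 0.56, but no document has that identifier. Walking the prefix tree means the search never proposes it and returns the best valid identifiers instead.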
The authors use consistency-based regularization to prevent overfitting. Much like contrastive learning, this regularization encourages representations of the same token to stay close to each other while remaining distinguishable from the representations of other tokens. This helps the model to generalize better, and also helps to reduce the amount of data needed for training.
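One way to picture a contrastive-style consistency loss is the InfoNCE-like sketch below. It is an illustration under assumptions, not the paper's exact formulation: I assume each item has two representations ("views"), for example from two forward passes with different dropout masks, and that similarity is cosine similarity scaled by a temperature. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_consistency_loss(
    view_a: List[Vector], view_b: List[Vector], temperature: float = 0.1
) -> float:
    """Mean InfoNCE-style loss: view_a[i] should be closer to view_b[i]
    than to any other view_b[k]."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(logit) for logit in logits))
        total += log_denominator - logits[i]  # equals -log softmax(logits)[i]
    return total / len(view_a)


# Hypothetical representations of two tokens under two views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
consistent = [[0.9, 0.1], [0.1, 0.9]]    # each token agrees with itself
inconsistent = [[0.1, 0.9], [0.9, 0.1]]  # each token looks like the other one
print(contrastive_consistency_loss(view_a, consistent))
print(contrastive_consistency_loss(view_a, inconsistent))
```

The first loss comes out close to zero and the second is much larger, so minimizing it pushes the two views of each token to agree while keeping different tokens apart.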
Results
Now, let's talk about the results of the paper. The authors tested NCI on two commonly used academic benchmarks, the Natural Questions 320k dataset and the TriviaQA dataset. On Natural Questions, NCI achieved a +17.6% relative improvement in Recall@1 over the best baseline method. On TriviaQA, NCI achieved a +16.8% relative improvement in R-Precision over the best baseline method.
This is very impressive, and suggests that NCI is a powerful document retrieval tool. It also suggests that the techniques used in the model, such as query generation and semantic document identifiers, are very effective in improving its performance.
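For readers less familiar with the metrics quoted above, here is a small sketch of Recall@k, R-Precision, and relative improvement, using their standard definitions. The function names and the example numbers are made up for illustration; they are not figures from the paper.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def r_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    return len(set(ranked_ids[:r]) & relevant_ids) / r


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Gain over the baseline, expressed as a fraction of the baseline."""
    return (new_score - baseline_score) / baseline_score


# Hypothetical ranking and relevance labels for a single query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 1))  # 0.0
print(recall_at_k(ranked, relevant, 2))  # 0.5
print(r_precision(ranked, relevant))     # 0.5

# Made-up scores: going from 0.50 to 0.60 is a 20% relative improvement,
# even though the absolute gain is 10 points.
print(round(relative_improvement(0.60, 0.50), 4))  # 0.2
```

The last line is a reminder that a relative improvement of +17.6% is measured against the baseline's own score, so it is not the same thing as a gain of 17.6 percentage points.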
Limitations
However, NCI also has some limitations. One of the main ones is that the model relies heavily on the quality of the query generation model. If the query generation model is not able to generate good queries for a particular domain, then the performance of the model may suffer.
Another limitation is that the model only works on a fixed set of documents. If the indexed collection changes, for example when documents are added or removed, the model has to be updated, and that comes at a cost.
Finally, the authors only tested the model on two datasets, which limits the generalizability of the results. It would be interesting to see how the model performs on more datasets, and on datasets covering more diverse domains.
Conclusion
Overall, this paper provides a promising direction to learn retrieval via generation, and the results are very encouraging. It shows that a unified deep neural network with tailored designs can significantly improve the recall performance of traditional methods, and can potentially be used as an end-to-end retrieval solution.
That's all for now - thanks for reading, and we'll see you next time! If you are looking for help with your Information Retrieval or Natural Language Processing projects, check out our work at the Global NLP Lab.