We run this experiment on localhost, as it does not require intensive computation. We need to install Elasticsearch and Milvus as specified here.
In the index and query use case, users provide a collection of documents, such as text files or webpages, to build a retriever. Users can then ask questions and obtain relevant results from the provided documents. The code for this use case is in index_and_query_from_docs.py. To run it, go to the denser-retriever repo and run the following command.
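(The script path below is an assumption based on the file name above; adjust it to wherever the file lives in your checkout.)

```bash
# Hypothetical path; adjust to your checkout of denser-retriever.
python index_and_query_from_docs.py
```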
If the run is successful, we should see output similar to the following.
The index and query use case consists of two steps (a minimal sketch follows the list):
Build a denser retriever from a text file or a webpage.
Query a retriever to obtain relevant results.
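A minimal sketch of these two steps, assuming a hypothetical Retriever class with ingest and retrieve methods (the actual denser-retriever API may differ), looks like:

```python
# Hypothetical sketch of the two steps; the class and method names below
# are illustrative, not the actual denser-retriever API.
from denser_retriever import Retriever  # hypothetical import path

# Step 1: build a retriever over the passage corpus using the yaml config.
retriever = Retriever(index_name="state_of_the_union", config="config.yaml")
retriever.ingest("passages.jsonl")

# Step 2: query the retriever to obtain relevant passages.
results = retriever.retrieve("What did the president say about Ketanji Brown Jackson?")
for passage in results:
    print(passage["source"], passage["text"][:80])
```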
To support additional types of files such as PDF, users can refer to LangChain file loaders.
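For instance, a PDF can be loaded with LangChain's PyPDFLoader (the file name below is illustrative); any LangChain document loader yields the same Document objects:

```python
# Load a PDF into LangChain Document objects; each page becomes one document.
# Requires the pypdf package. The file name is illustrative.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_report.pdf")
documents = loader.load()
```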
The following diagram illustrates a denser retriever, which consists of three components:
Keyword search relies on traditional search techniques that use exact keyword matching. We use Elasticsearch in Denser Retriever.
Vector search uses neural network models to encode both the query and the documents into dense vector representations in a high-dimensional space. We use Milvus and the snowflake-arctic-embed-m model, whose family achieves state-of-the-art performance on the MTEB/BEIR leaderboard for each of its size variants.
An ML cross-encoder re-ranker can be used to further boost accuracy on top of the two retrieval approaches above. We use cross-encoder/ms-marco-MiniLM-L-6-v2, which offers a good balance between accuracy and inference latency.
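As an illustration of the reranking step, this cross-encoder can be applied directly with the sentence-transformers library: it scores each (query, passage) pair jointly, which is what makes it more accurate (and slower) than the bi-encoder used for vector search. The query and passages below are illustrative.

```python
# Score (query, passage) pairs with the cross-encoder re-ranker.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "What did the president say about the economy?"  # illustrative
passages = [
    "The president discussed inflation and job growth.",
    "The bill passed the Senate last week.",
]
scores = model.predict([(query, p) for p in passages])  # higher score = more relevant
```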
In the following section, we will explain the underlying processes and mechanisms involved.
We configure the above three components in the following yaml file (available in the repo). Most of the parameters are self-explanatory. The keyword, vector, and rerank sections configure Elasticsearch, Milvus, and the reranker, respectively.
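As an illustration, a config along these lines might look like the sketch below; the section names match the description above, but the individual keys are assumptions, so consult the actual file in the repo.

```yaml
# Illustrative sketch only; the real keys live in the repo's yaml file.
keyword:                                # Elasticsearch settings
  es_host: http://localhost:9200        # hypothetical key
  es_ingest_passage_bs: 5000            # used only when training the xgboost model
vector:                                 # Milvus settings
  milvus_host: localhost                # hypothetical key
  emb_model: Snowflake/snowflake-arctic-embed-m   # hypothetical key
rerank:                                 # cross-encoder re-ranker settings
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2   # hypothetical key
combine: model                          # alternatives: linear, rank
model: experiments/models/msmarco_xgb_es+vs+rr_n.json  # hypothetical key
```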
We use combine: model to combine Elasticsearch, Milvus, and the reranker via an XGBoost model, experiments/models/msmarco_xgb_es+vs+rr_n.json, which was trained on the MTEB MS MARCO dataset (see the training recipe for how to train such a model). Besides the model combination, we can also use linear or rank to combine Elasticsearch, Milvus, and the reranker. Our experiments on MTEB datasets suggest that the model combination leads to significantly higher accuracy than the linear or rank methods.
Some parameters, for example es_ingest_passage_bs, are only used when training an XGBoost model (i.e., they are not needed at query time).
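To make the model combination concrete, here is a hedged sketch of how a trained XGBoost model could re-score candidates at query time. The feature layout (one row per passage holding the Elasticsearch, vector, and reranker scores) is an assumption for illustration, not the actual denser-retriever feature set.

```python
# Hypothetical sketch: re-score candidate passages with the trained xgboost model.
import numpy as np
import xgboost as xgb

booster = xgb.Booster()
booster.load_model("experiments/models/msmarco_xgb_es+vs+rr_n.json")

# One row per candidate passage: [es_score, vector_score, rerank_score] (assumed layout).
features = np.array([
    [12.3, 0.81, 4.2],  # passage 1
    [10.7, 0.77, 3.9],  # passage 2
])
scores = booster.predict(xgb.DMatrix(features))
ranked = np.argsort(-scores)  # passage indices, best first
```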
We now describe how to build a retriever from a given text file: state_of_the_union.txt. The following code shows how to read the text file, split it into text chunks, and save them to a jsonl file, passages.jsonl.
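The original snippet is in the repo; a sketch along the same lines using LangChain's loader and splitter (the chunk size and exact field mapping are assumptions) could look like:

```python
# Read the text file, split it into chunks, and write passages.jsonl.
# Chunk size and the field mapping are illustrative assumptions.
import json

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

documents = TextLoader("state_of_the_union.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

with open("passages.jsonl", "w") as f:
    for pid, chunk in enumerate(chunks):
        passage = {
            "source": chunk.metadata.get("source", ""),
            "title": "",  # a plain text file has no title
            "text": chunk.page_content,
            "pid": pid,   # passage id
        }
        f.write(json.dumps(passage) + "\n")
```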
Each line in passages.jsonl is a passage with the fields source, title, text, and pid (passage ID).
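For example, a line might look like this (values illustrative, text abbreviated):

```json
{"source": "state_of_the_union.txt", "title": "", "text": "Madam Speaker, Madam Vice President...", "pid": 0}
```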
Building a retriever from a webpage is similar to the above, except for the passage corpus generation. The index_and_query_from_webpage.py source code can be found here.
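Under the same assumptions as the text-file sketch above, the webpage corpus generation could use LangChain's WebBaseLoader (the URL below is illustrative):

```python
# Load a webpage and split it into passages; the URL is illustrative.
# WebBaseLoader requires the beautifulsoup4 package.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

documents = WebBaseLoader("https://example.com/some_article.html").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)
# Each chunk is then written to passages.jsonl exactly as in the text-file case.
```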
To run this use case, go to the denser-retriever repo and run:
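(As before, the script path is an assumption based on the file name; adjust it to your checkout.)

```bash
# Hypothetical path; adjust to your checkout of denser-retriever.
python index_and_query_from_webpage.py
```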
If successful, we should see something similar to the following.