We run this experiment on localhost as it does not require intensive computation. We need to install ES and Milvus as specified here.
In the index and query use case, users provide a collection of documents, such as text files or webpages, to build a retriever. Users can then ask questions to obtain relevant results from the provided documents. The code for this use case is at index_and_query_from_docs.py. To run this use case, go to the denser-retriever repo and run the following command:
poetry run python experiments/index_and_query_from_docs.py
If the run is successful, we would expect to see something similar to the following.
2024-05-27 12:00:55 INFO: ES ingesting passages.jsonl record 96
2024-05-27 12:00:55 INFO: Done building ES index
2024-05-27 12:00:55 INFO: Remove existing Milvus index state_of_the_union
2024-05-27 12:00:59 INFO: Milvus vector DB ingesting passages.jsonl record 96
2024-05-27 12:01:03 INFO: Done building Vector DB index
[{'source': 'tests/test_data/state_of_the_union.txt', 'text': 'One of the most serious constitutional responsibilities...', 'title': '', 'pid': 73, 'score': -1.6985594034194946}]
The index and query use case consists of two steps:
Build a denser retriever from a text file or a webpage.
Query a retriever to obtain relevant results.
To support additional file types such as PDF, users can refer to LangChain file loaders.
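For example, a PDF could be loaded with LangChain's PyPDFLoader and then passed through the same chunking and indexing steps as a text file. This is a minimal sketch, not part of the denser-retriever code; the file path report.pdf is a placeholder.

from langchain_community.document_loaders import PyPDFLoader

# Load a PDF into LangChain documents (one document per page); "report.pdf" is a placeholder path
docs = PyPDFLoader("report.pdf").load()
print(len(docs), docs[0].page_content[:100])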
The following diagram illustrates a denser retriever, which consists of three components:
Keyword search relies on traditional search techniques that use exact keyword matching. We use Elasticsearch in denser retriever.
Vector search uses neural network models to encode both the query and the documents into dense vector representations in a high-dimensional space. We use Milvus and snowflake-arctic-embed-m model, which achieves state-of-the-art performance on the MTEB/BEIR leaderboard for each of their size variants.
An ML cross-encoder re-ranker can be utilized to further boost accuracy over the two retrieval approaches above. We use cross-encoder/ms-marco-MiniLM-L-6-v2, which has a good balance between accuracy and inference latency. A minimal sketch of the vector search and reranking steps follows this list.
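The sketch below is a standalone illustration of these two components, not the denser-retriever code itself: it encodes a query and a couple of made-up passages with snowflake-arctic-embed-m, ranks them by cosine similarity, and rescores the (query, passage) pairs with the cross-encoder.

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Hypothetical passages, for illustration only
passages = [
    "The president nominated Ketanji Brown Jackson to the Supreme Court.",
    "Last year COVID-19 kept us apart. This year we are finally together again.",
]
query = "What did the president say about Ketanji Brown Jackson"

# Vector search: encode query and passages, rank by cosine similarity
encoder = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")
query_emb = encoder.encode(query)
passage_embs = encoder.encode(passages)
dense_scores = util.cos_sim(query_emb, passage_embs)[0]

# Reranking: a cross-encoder scores each (query, passage) pair jointly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, p) for p in passages])
print(dense_scores, rerank_scores)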
In the following section, we will explain the underlying processes and mechanisms involved.
We configure the above three components in the following yaml file (available in the repo). Most of the parameters are self-explanatory. The keyword, vector, and rerank sections configure Elasticsearch, Milvus, and the reranker, respectively.
We use combine: model to combine Elasticsearch, Milvus, and the reranker via an xgboost model experiments/models/msmarco_xgb_es+vs+rr_n.json, which was trained on the MTEB MS MARCO dataset (see the training recipe on how to train such a model). Besides the model combination, we can also use linear or rank to combine Elasticsearch, Milvus, and the reranker. The experiments on MTEB datasets suggest that the model combination leads to significantly higher accuracy than the linear or rank methods.
Some parameters, for example es_ingest_passage_bs, are only used when training an xgboost model (i.e., they are not needed at the query stage).
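The abridged yaml below sketches what such a config might look like. Only the keyword/vector/rerank sections, the combine: model setting, the xgboost model path, the es_ingest_passage_bs parameter, and the two model names come from the text above; every other key and value is a hypothetical placeholder, and the actual file in the repo may differ.

combine: model
model: experiments/models/msmarco_xgb_es+vs+rr_n.json   # xgboost combiner (alternatives: linear, rank)
keyword:                                                # Elasticsearch settings (keys below are placeholders)
  es_host: http://localhost:9200
  es_ingest_passage_bs: 5000                            # only used when training the xgboost model
vector:                                                 # Milvus settings (keys below are placeholders)
  milvus_host: localhost
  milvus_port: 19530
  emb_model: Snowflake/snowflake-arctic-embed-m
rerank:                                                 # reranker settings (key below is a placeholder)
  rerank_model: cross-encoder/ms-marco-MiniLM-L-6-v2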
We now describe how to build a retriever from a given text file: state_of_the_union.txt. The following code shows how to read the text file, split it into text chunks, and save them to a jsonl file, passages.jsonl.
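This is a minimal sketch of that step (the exact code in the repo may differ), assuming LangChain's TextLoader and RecursiveCharacterTextSplitter for reading and chunking; the chunk size is an arbitrary choice for illustration.

import json
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read the text file and split it into passage-sized chunks
documents = TextLoader("tests/test_data/state_of_the_union.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
chunks = splitter.split_documents(documents)

# Save one passage per line to passages.jsonl
with open("passages.jsonl", "w") as out:
    for pid, chunk in enumerate(chunks):
        passage = {
            "source": chunk.metadata.get("source", ""),
            "title": "",
            "text": chunk.page_content,
            "pid": pid,
        }
        out.write(json.dumps(passage) + "\n")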
Each line in passages.jsonl is a passage, which contains the fields source, title, text, and pid (passage id).
{"source": "tests/test_data/state_of_the_union.txt","title": "","text": "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny.","pid": 0}
We can simply use the following code to query a retriever to obtain relevant passages.
# Query
query = "What did the president say about Ketanji Brown Jackson"
passages, docs = retriever_denser.retrieve(query, {})
print(passages)
Each returned passage receives a confidence score to indicate how relevant it is to the given query. We get something similar to the following.
[{'source': 'tests/test_data/state_of_the_union.txt',
  'text': 'One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.',
  'title': '',
  'pid': 73,
  'score': -0.6116511225700378}]
Building a retriever from a webpage is similar to the above, except for the passage corpus generation. The index_and_query_from_webpage.py source code can be found here.
To run this use case, go to the denser-retriever repo and run:
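As a rough illustration of that difference (not the exact repo code), the webpage can be loaded with LangChain's WebBaseLoader before the same chunking and jsonl-writing steps as in the text-file case above; the URL is the one used in this example.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the webpage and split it into passage-sized chunks
documents = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
chunks = splitter.split_documents(documents)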
poetry run python experiments/index_and_query_from_webpage.py
If successful, we expect to see something similar to the following.
2024-05-27 12:10:47 INFO: ES ingesting passages.jsonl record 66
2024-05-27 12:10:47 INFO: Done building ES index
2024-05-27 12:10:52 INFO: Milvus vector DB ingesting passages.jsonl record 66
2024-05-27 12:10:56 INFO: Done building Vector DB index
[{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'text': 'Fig. 1. Overview of a LLM-powered autonomous agent system...', 'title': '', 'pid': 2, 'score': -1.6985594034194946}]