Training
We run this experiment on a server, which requires the Elasticsearch and Milvus installations specified here.
In the training use case, users provide a training dataset to train an XGBoost model that governs how to combine keyword search, vector search, and reranking. A training set consists of three components:
- A query set
- A passage corpus
- A qrels file, which annotates the relevance of passages to queries
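For illustration, a qrels file in the widely used TREC format (an assumption here; the exact serialization Denser expects may differ) maps a query ID and a passage ID to a relevance label:

```text
# query-id  iteration  passage-id  relevance
3   0   14717500   1
3   0   4414547    0
```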
The source code for this use case can be found at train_and_test.py. We use the mteb/scifact dataset to illustrate the training command usage:
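A plausible invocation is sketched below; the positional arguments (config file, dataset name, train split, test split) are inferred from the description that follows rather than confirmed, so consult train_and_test.py for the authoritative usage.

```bash
# Hypothetical command; the argument order is an assumption.
python experiments/train_and_test.py experiments/config_server.yaml mteb/scifact train test
```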
That is, we use experiments/config_server.yaml as the config file (we need to configure the hosts, users, and passwords for Elasticsearch and Milvus), the mteb/scifact dataset, the train split to train an XGBoost model, and the test split to evaluate the trained model. If successful, we get results similar to the following.
We explain the experiments in the following steps.
Generate retriever data
We need to generate featurized query-passage data to train XGBoost models. The following table shows the mteb/scifact statistics: the dataset has 5,183 passages, 809 training queries, and 300 test queries. For each query, each passage in the corpus receives a relevance label, with 0 being irrelevant and 1 being relevant.
#Corpus | #Train Query | #Test Query |
---|---|---|
5,183 | 809 | 300 |
We first build an Elasticsearch index, a vector index, and a reranker using the scifact passages. The Elasticsearch, vector search, and reranker settings are configured in config_server.yaml.
- For vector search, we use the snowflake-arctic-embed-m model, whose size variants each achieve state-of-the-art performance on the MTEB/BEIR leaderboard.
- For the ML reranker, we use cross-encoder/ms-marco-MiniLM-L-6-v2, which offers a good balance between accuracy and inference latency.
The Denser retriever is illustrated in the following diagram, with the top and bottom boxes describing training and inference respectively. For each query in the training data, we query Elasticsearch and the vector database to retrieve two sets of top-k (k=100) passages; we note that these two sets may overlap. We then apply an ML reranker to rerank the passages returned from Elasticsearch and vector search.
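A minimal sketch of this retrieval flow follows; the helper functions es_search, vs_search, and rr_score are hypothetical stand-ins for the Elasticsearch, Milvus, and cross-encoder calls.

```python
from typing import Callable

def retrieve_candidates(
    query: str,
    passages: dict[str, str],                           # passage_id -> passage text
    es_search: Callable[[str, int], dict[str, float]],  # hypothetical Elasticsearch call
    vs_search: Callable[[str, int], dict[str, float]],  # hypothetical Milvus call
    rr_score: Callable[[str, str], float],              # hypothetical cross-encoder call
    k: int = 100,
) -> dict[str, dict[str, float]]:
    """Retrieve top-k passages from keyword and vector search, then rerank their union."""
    es_results = es_search(query, k)   # top-k passages with BM25 scores
    vs_results = vs_search(query, k)   # top-k passages with vector similarity scores
    union = set(es_results) | set(vs_results)  # the two sets may overlap
    rr_results = {pid: rr_score(query, passages[pid]) for pid in union}
    return {"es": es_results, "vs": vs_results, "rr": rr_results}
```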
Let's consider the following query and two passages. The first passage is annotated with label 1 (relevant) while the second is labeled 0 (irrelevant).
- Query
- Passages
For Elasticsearch (ES), vector search (VS), and the reranker (RR), we generate three features on each query-passage pair: rank, score, and missing. We list the featurized query-passage pairs in the following table.
QID | PID | Label | ES Rank | ES Score | ES Missing | VS Rank | VS Score | VS Missing | RR Rank | RR Score | RR Missing |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 14717500 | 1 | 3 | 74.42 | 0 | 5 | -1.29 | 0 | 1 | 2.98 | 0 |
3 | 4414547 | 0 | 29 | 32.08 | 0 | 4 | -1.28 | 0 | 2 | 1.47 | 0 |
The first data point represents query 3 and passage 14717500. The passage is annotated with label 1 (relevant) with respect to the query. The passage receives a rank position of 3 and a relevance score of 74.42 from the Elasticsearch retriever; it appears in the top 100 passages from Elasticsearch and is thus not missing (ES Missing value of 0). Similarly, the passage receives a rank position of 5 and a score of -1.29 from vector search. We note that both the Elasticsearch and vector search top 100 passages are reranked by the reranker, so the reranker missing feature is always 0.
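A minimal sketch of how the rank, score, and missing features could be derived for one passage from one subsystem's results (the function and its encoding of missing values are assumptions, not the project's actual code):

```python
def subsystem_features(results: dict[str, float], pid: str) -> tuple[int, float, int]:
    """Compute (rank, score, missing) for a passage from one subsystem's top-k results.

    `results` maps passage_id -> score for the subsystem's top-k passages.
    """
    if pid not in results:
        # Passage fell outside this subsystem's top k; placeholder rank/score
        # values with missing=1 (the real pipeline may encode this differently).
        return 0, 0.0, 1
    ranked = sorted(results, key=results.get, reverse=True)  # best score first
    return ranked.index(pid) + 1, results[pid], 0  # rank positions start at 1
```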
We now have 138,322 featurized query-passage training pairs and 51,601 test pairs from the scifact dataset.
Compute baselines
For Elasticsearch, the top-k passages per query are sorted in descending order of Elasticsearch score to compute its ndcg@10. Vector search ndcg@10 is computed similarly with the vector scores. We list the baseline ndcg@10 scores in the following table.
Metric | Elasticsearch | Vector Search |
---|---|---|
ndcg@10 | 58.42 | 73.16 |
We note that vector search achieves higher accuracy than Elasticsearch (73.16 vs. 58.42), which suggests that vector search captures semantic similarity better than keyword search.
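For reference, here is a minimal, self-contained sketch of the standard ndcg@k computation for a single query (not the project's evaluation code; with binary labels the choice of gain function does not matter):

```python
import math

def ndcg_at_k(ranked_labels: list[int], all_labels: list[int], k: int = 10) -> float:
    """NDCG@k: DCG of the predicted ranking divided by DCG of the ideal ranking.

    `ranked_labels` are relevance labels in predicted-score order;
    `all_labels` are the labels of all judged passages for the query.
    """
    def dcg(labels: list[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(labels[:k]))

    ideal = dcg(sorted(all_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0
```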
Train XGBoost models
There are six ways of combining Elasticsearch (ES), vector search (VS), and reranking (RR) to build a Denser retriever: ES, VS, ES+VS, ES+RR, VS+RR, or ES+VS+RR. Out of these six combinations, four (ES+VS, ES+RR, VS+RR, and ES+VS+RR) require XGBoost models to combine different retrieval scores.
We train one XGBoost model for each of these four configurations: ES+VS, ES+RR, VS+RR, and ES+VS+RR. In addition, we apply feature normalization to the raw Elasticsearch, vector search, and reranker scores, which adds four more configurations: ES+VS_n, ES+RR_n, VS+RR_n, and ES+VS+RR_n. We use two feature normalizations, sketched in code after this list:
- Norm1: Standardization normalizes the feature values to have zero mean and unit variance.
- Norm2: Min-max normalizes the features based on their min and max values.
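A minimal numpy sketch of the two normalizations (how they are wired into Denser's feature pipeline is not shown here):

```python
import numpy as np

def norm1_standardize(scores: np.ndarray) -> np.ndarray:
    """Norm1: shift and scale to zero mean and unit variance."""
    std = scores.std()
    return (scores - scores.mean()) / std if std > 0 else scores - scores.mean()

def norm2_min_max(scores: np.ndarray) -> np.ndarray:
    """Norm2: rescale into [0, 1] using the min and max values."""
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
```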
We end up training 8 XGBoost models for the scifact dataset. These 8 models, along with the ES and VS baselines, are listed in the following table.
ID | Elasticsearch | Vector search | Reranker | Normalization |
---|---|---|---|---|
ES | ✅ | ❌ | ❌ | ❌ |
VS | ❌ | ✅ | ❌ | ❌ |
ES+VS | ✅ | ✅ | ❌ | ❌ |
ES+RR | ✅ | ❌ | ✅ | ❌ |
VS+RR | ❌ | ✅ | ✅ | ❌ |
ES+VS+RR | ✅ | ✅ | ✅ | ❌ |
ES+VS_n | ✅ | ✅ | ❌ | ✅ |
ES+RR_n | ✅ | ❌ | ✅ | ✅ |
VS+RR_n | ❌ | ✅ | ✅ | ✅ |
ES+VS+RR_n | ✅ | ✅ | ✅ | ✅ |
The XGBoost model training code can be found at train_and_test.py, which is adapted from the XGBoost ranking example code.
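For illustration, a ranking model over these features can be trained with the core XGBoost API roughly as follows (a sketch with dummy data, not the project's actual training code):

```python
import numpy as np
import xgboost as xgb

# One row per query-passage pair; the 9 columns stand in for the
# rank/score/missing features of ES, VS, and RR shown in the table above.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 9))            # dummy feature matrix
y_train = rng.integers(0, 2, 1000)         # dummy 0/1 relevance labels
group_sizes = [100] * 10                   # rows are grouped by query (100 per query here)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtrain.set_group(group_sizes)              # tell XGBoost which rows belong to which query

params = {
    "objective": "rank:ndcg",              # learning-to-rank objective optimizing NDCG
    "eval_metric": "ndcg@10",
    "eta": 0.1,
    "max_depth": 6,
}
bst = xgb.train(params, dtrain, num_boost_round=100)

# At test time, passages for each query are reranked by the predicted scores.
scores = bst.predict(xgb.DMatrix(X_train))
```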
Test XGBoost models
Once the XGBoost models are trained, we can test them on the scifact test data and report ndcg@10 scores. We list all 8 models' accuracies in the following table. Ref is the reference ndcg@10 of snowflake-arctic-embed-m from the Hugging Face leaderboard, which is consistent with our reported VS accuracy.
Metric | ES | VS | ES+VS/ES+VS_n | ES+RR/ES+RR_n | VS+RR/VS+RR_n | ES+VS+RR/ES+VS+RR_n | Ref |
---|---|---|---|---|---|---|---|
ndcg@10 | 58.42 | 73.16 | 73.28/75.08 | 69.08/69.69 | 72.73/73.62 | 73.08/75.33 | 73.55 |
The experiments show that combining ES, VS, and RR leads to higher accuracy. For example, ES+VS+RR_n achieves an ndcg@10 of 75.33, a 2.17-point increase over the vector search baseline (ndcg@10 of 73.16).
We also support the linear combination of ES, VS, and RR, for example, with the following setting.
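The concrete configuration is omitted here; conceptually, the linear combination scores each query-passage pair as a weighted sum of the (normalized) subsystem scores, for example with the equal weights discussed below:

```python
def linear_combine(es: float, vs: float, rr: float,
                   w_es: float = 0.5, w_vs: float = 0.5, w_rr: float = 0.5) -> float:
    """Weighted sum of the three subsystem scores for one query-passage pair."""
    return w_es * es + w_vs * vs + w_rr * rr
```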
However, we find that the linear combination performs worse than the XGBoost models, achieving an ndcg@10 of only 62.73 with equal weights of 0.5, 0.5, and 0.5. The reasons for the low accuracy are:
- The scores from ES, VS, and RR are neither bounded nor calibrated, making it difficult for the linear weights to accurately model their relative importance.
- Some query-passage pairs may have missing score features.
By contrast, the XGBoost models can effectively estimate feature importance even when feature values are missing, leading to higher ndcg@10 scores.
XGBoost can also estimate feature importance. We plot the feature importance in the following figure, which shows that the normalized vector search score (VS Norm2) is the most important feature for predicting whether a passage is relevant. The normalized reranker score (RR Norm1) is the second most important feature.
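Such a plot can be produced with XGBoost's built-in utility (a sketch; the actual figure's feature names and styling may differ):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# `bst` is a trained booster, e.g. from the training sketch above.
xgb.plot_importance(bst, importance_type="gain")  # rank features by average split gain
plt.tight_layout()
plt.savefig("feature_importance.png")
```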