Use Filters
The index and query example assumes that the search items contain unstructured text only. This assumption may not hold in real-world search applications. For example, the Titanic dataset, which offers a comprehensive glimpse into the passengers aboard the ill-fated RMS Titanic, contains categorical features (for example, Sex), numerical features (for example, Age), and text features (for example, Name). We now want to apply filters in search. For example, we want to search passenger names for a keyword, say cumings, with a filter requiring the Sex field to be female.
We now illustrate how to build a retriever with filters. We made a few changes to the original Titanic csv data with this script to fit our needs:
- We changed the original field names PassengerId and Name to source and text respectively, as the latter are required fields in building an index.
- We added a randomly generated Birthday field to demonstrate searching on a date field.
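The two changes listed above could be sketched roughly as follows. This is a minimal illustration, not the script referenced in the text: the helper names (random_birthday, row_to_passage, csv_to_jsonl), the birthday date range, the ISO date format, and the one-row sample csv are all assumptions made for the example.

```python
import csv
import io
import json
import random
from datetime import date, timedelta


def random_birthday(rng, start=date(1850, 1, 1), end=date(1912, 1, 1)):
    """Return a random ISO-formatted date in [start, end). The range is an assumption."""
    span_days = (end - start).days
    return (start + timedelta(days=rng.randrange(span_days))).isoformat()


def row_to_passage(row, rng):
    """Rename PassengerId -> source and Name -> text, then add a random Birthday."""
    passage = dict(row)
    passage["source"] = passage.pop("PassengerId")
    passage["text"] = passage.pop("Name")
    passage["Birthday"] = random_birthday(rng)
    return passage


def csv_to_jsonl(csv_text, seed=0):
    """Convert csv text into jsonl text, one passage per line."""
    rng = random.Random(seed)  # seeded so repeated runs give the same birthdays
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row_to_passage(row, rng)) for row in reader)


# A one-row sample in the shape of the Titanic csv (only a subset of its columns).
sample_csv = (
    "PassengerId,Survived,Sex,Age,Name\n"
    '2,1,female,38,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"\n'
)
jsonl_text = csv_to_jsonl(sample_csv)
print(jsonl_text)
```

The remaining columns, such as Sex and Survived, pass through unchanged, so they stay available as filter fields later on.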
We end up with the following jsonl passages:
To build and query an index with the Titanic data, we follow the steps below.
Prepare the config file
The difference from the previous config file is that we add a fields block to ingest the additional fields (besides the default text field), which include Survived, Birthday, etc. For each field, we add the following item to the fields block.
field_name specifies the field name. field_name_internal is the field name used internally in Milvus. The reason for introducing field_name_internal is the non-English use case: non-English field_name values are not valid keys in Milvus, so field_name_internal can be set to the English translation of field_name. For English datasets, the two can be identical. The type is either keyword or date, representing a categorical or a date type respectively.
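The three parts of a field entry can be checked before they are written into the config file. The sketch below is only an illustration: the FieldSpec class is hypothetical and not part of Denser Retriever, and the "ASCII identifier" rule is an assumed stand-in for whatever key rules Milvus actually enforces.

```python
from dataclasses import dataclass

# The two types named in the text.
ALLOWED_TYPES = {"keyword", "date"}


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical container for one entry of the fields block."""
    field_name: str           # name as it appears in the data
    field_name_internal: str  # name used internally in Milvus
    type: str                 # "keyword" or "date"

    def validate(self):
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {self.type!r}")
        internal = self.field_name_internal
        # Assumption: approximate "valid Milvus key" as an ASCII identifier.
        if not (internal.isascii() and internal.isidentifier()):
            raise ValueError(f"invalid internal name: {internal!r}")
        return self


# English dataset: both names can be identical.
sex = FieldSpec("Sex", "Sex", "keyword").validate()
birthday = FieldSpec("Birthday", "Birthday", "date").validate()

# Non-English field name: the internal name is its English translation.
translated = FieldSpec("性别", "gender", "keyword").validate()
```

Reusing a non-English name as the internal name, as in FieldSpec("性别", "性别", "keyword").validate(), raises ValueError under this assumed rule.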
Build and query a Denser retriever
Once we have the config file, we run the following Python code to build a retriever index and then query it.
simple_demo_index_titanic is the index name we use; it can be changed to any other name. tests/config-titanic.yaml is the retriever yaml config file. tests/test_data/titanic_top10.jsonl contains 10 jsonl data points as follows.
Each data point has the default source, title, text and pid fields. It additionally has fields such as Sex, which can be used to activate filters in search. The query searches for the keyword cumings with a filter requiring Sex to be female. We will get results similar to the following.
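To make the filter semantics concrete, here is a plain-Python sketch of a keyword search restricted by an exact-match filter. It is not how Denser Retriever is implemented: the search function is hypothetical, it uses case-insensitive substring matching instead of relevance ranking, and the three passages are abbreviated illustrative records in the shape described above.

```python
def search(passages, keyword, filters=None):
    """Return passages whose text contains keyword and whose fields match all filters."""
    filters = filters or {}
    needle = keyword.lower()
    hits = []
    for passage in passages:
        # Drop the passage if any filter field is missing or has a different value.
        if any(passage.get(name) != value for name, value in filters.items()):
            continue
        # Case-insensitive substring match stands in for real retrieval scoring.
        if needle in passage["text"].lower():
            hits.append(passage)
    return hits


passages = [
    {"source": "1", "text": "Braund, Mr. Owen Harris", "Sex": "male"},
    {"source": "2", "text": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "Sex": "female"},
    {"source": "4", "text": "Futrelle, Mrs. Jacques Heath (Lily May Peel)", "Sex": "female"},
]

results = search(passages, "cumings", {"Sex": "female"})
for passage in results:
    print(passage["source"], passage["text"])
```

Changing the filter to {"Sex": "male"} returns no results, because the only passage matching the keyword is excluded by the filter.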
Put everything together
We put all the code together as follows.