Use Filters
The index and query example assumes that search items contain only unstructured text. This assumption may not hold in real-world search applications. For example, the Titanic dataset, which offers a comprehensive glimpse into the passengers aboard the ill-fated RMS Titanic, contains categorical features (e.g., Sex), numerical features (e.g., Age), and text features (e.g., Name). We now want to support filters in search. For example, we want to search passenger names with a keyword, say cumings, restricted to records whose Sex field is female.
We now illustrate how to build a retriever with filters. We used this script to make a few changes to the original Titanic CSV data to fit our needs:
- We renamed the original fields PassengerId and Name to source and text respectively, as the latter are required fields for building an index.
- We added a randomly generated Birthday field to demonstrate searching on a date field.
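The two changes above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script referred to above: the function names, the trimmed three-column sample, the ISO date format, and the 1850 to 1912 birthday range are all assumptions made for the example. It also does not add the title and pid fields.

```python
import csv
import io
import json
import random
from datetime import date, timedelta

# Renames described above: PassengerId -> source, Name -> text.
RENAMES = {"PassengerId": "source", "Name": "text"}


def random_birthday(rng, start=date(1850, 1, 1), end=date(1912, 1, 1)):
    """Return a random ISO-formatted date in [start, end). The range is an assumption."""
    offset = rng.randrange((end - start).days)
    return (start + timedelta(days=offset)).isoformat()


def csv_to_passages(csv_text, seed=0):
    """Rename the required fields and attach a random Birthday to every row."""
    rng = random.Random(seed)  # seeded so repeated runs give the same birthdays
    passages = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        passage = {RENAMES.get(key, key): value for key, value in row.items()}
        passage["Birthday"] = random_birthday(rng)
        passages.append(passage)
    return passages


def to_jsonl(passages):
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(p) for p in passages)


# A trimmed sample with only a few of the Titanic columns.
sample_csv = (
    "PassengerId,Survived,Sex,Name\n"
    '1,0,male,"Braund, Mr. Owen Harris"\n'
    '2,1,female,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)"\n'
)
passages = csv_to_passages(sample_csv)
print(to_jsonl(passages))
```

All values are kept as strings, as `csv.DictReader` returns them. Whether the real pipeline converts Survived or other columns to numbers is not covered here.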
We end up with the following jsonl passages:
To build and query an index with the Titanic data, we follow the steps below.
Prepare the config file
The difference from the previous config file is that we add a fields block to ingest the additional fields (besides the default text field), which include Survived, Birthday, etc. For each field, we add the following item to the fields block.
field_name specifies the field name. field_name_internal is the field name used internally in Milvus. field_name_internal exists for non-English use cases: non-English field names are not valid keys in Milvus, so field_name_internal can be set to the English translation of field_name. For English datasets, the two can be identical. The type is either keyword or date, representing categorical and date types respectively.
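To make the three attributes concrete, here is a small Python model of a field entry. It is a hypothetical sketch, not the retriever's config loader or its YAML syntax: the `FieldSpec` class, the validation rules, and the non-English example are assumptions made for illustration. In particular, the check that an internal name contains only ASCII letters, digits, and underscores is an assumed stand-in for what Milvus accepts as a key.

```python
import re
from dataclasses import dataclass

# The two field types named in the text above.
ALLOWED_TYPES = {"keyword", "date"}

# Assumption: internal names are limited to ASCII letters, digits and underscores.
INTERNAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical model of one entry in the fields block."""

    field_name: str           # name as it appears in the data
    field_name_internal: str  # name used internally in Milvus
    type: str                 # "keyword" (categorical) or "date"

    def __post_init__(self):
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {self.type!r}")
        if not INTERNAL_NAME.fullmatch(self.field_name_internal):
            raise ValueError(f"invalid internal name: {self.field_name_internal!r}")


# English dataset: both names can be identical.
english_fields = [
    FieldSpec("Sex", "Sex", "keyword"),
    FieldSpec("Birthday", "Birthday", "date"),
]

# Hypothetical non-English dataset: the internal name is the English translation.
translated_field = FieldSpec("性别", "Sex", "keyword")
```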
Build and query a Denser retriever
Once we have the config file, we run the following Python code to build a retriever index and then query it.
simple_demo_index_titanic is the index name we use; it can be changed to any other name. tests/config-titanic.yaml is the retriever YAML config file. tests/test_data/titanic_top10.jsonl contains 10 JSONL data points, as follows.
Each data point has the default source, title, text, and pid fields. It additionally has fields such as Sex, which can be used as filters in search. The query searches for the keyword cumings with a filter requiring Sex to be female. We get results similar to the following.
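The intended semantics of a filtered query can be illustrated without the retriever. The sketch below is plain Python over an in-memory list, assuming a case-insensitive substring match on text and exact equality on each filter field. The `search` function is hypothetical and says nothing about how the retriever scores, ranks, or applies filters internally.

```python
def search(passages, keyword, filters=None):
    """Return passages whose text contains keyword and whose fields match all filters."""
    filters = filters or {}
    keyword = keyword.lower()
    hits = []
    for passage in passages:
        if keyword not in passage["text"].lower():
            continue  # keyword condition failed
        if all(passage.get(name) == value for name, value in filters.items()):
            hits.append(passage)
    return hits


# Three Titanic passengers, reduced to the fields needed for the example.
sample_passages = [
    {"source": "1", "text": "Braund, Mr. Owen Harris", "Sex": "male"},
    {"source": "2", "text": "Cumings, Mrs. John Bradley (Florence Briggs Thayer)", "Sex": "female"},
    {"source": "3", "text": "Heikkinen, Miss. Laina", "Sex": "female"},
]

female_hits = search(sample_passages, "cumings", {"Sex": "female"})
male_hits = search(sample_passages, "cumings", {"Sex": "male"})
print([p["source"] for p in female_hits])
print([p["source"] for p in male_hits])
```

With the female filter, only the Cumings record matches. With the male filter, the same keyword yields no hits, which shows that the filter is a hard constraint and does not merely affect ranking.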