Current state: ["Under Discussion"]
...
Design Details(required)
Knowhere
Index types and metric types supporting range search are listed below:
IP | L2 | HAMMING | JACCARD | TANIMOTO | SUBSTRUCTURE | SUPERSTRUCTURE | |
---|---|---|---|---|---|---|---|
BIN_IDMAP | |||||||
BIN_IVF_FLAT | |||||||
IDMAP | |||||||
IVF_FLAT | |||||||
IVF_PQ | |||||||
IVF_SQ8 | |||||||
HNSW | |||||||
ANNOY |
In Knowhere, only one "radius" is passed in via config, and range search will return all un-sorted results with distance "better" than radius.
- For IP: > radius
- For other metric types: < radius
We add 3 new APIs CheckRangeSearch(), QueryByRange() and BruteForce::RangeSearch() to support range search, these APIs are already available since knowhere-v1.3.0.
- QueryByRange
- CheckRangeSearch()
This API is used to get do range search parameter legacy check. It will throw exception when parameter legacy check fail.
The valid scope for "radius" is defined as: -1.0 <= radius <= float_max
metric type | range | similar | not similar |
---|---|---|---|
L2 | (0, inf) | small | large |
IP | [-1, 1] | 1 | -1 |
jaccard | [0, 1] | 0 | 1 |
tanimoto | [0, 0.5] | 0 | 0.5 |
hamming | [0, n] | 0 | n |
2. QueryByRange()
This API does range search for index, it returns all unsorted results with distance "better than radius" (for L2IP: < > radius; for IPothers: > < radius).
PROTO | virtual DatasetPtr; |
INPUT | Dataset { Config { knowhere::meta::RADIUS: - // radius for range search } |
OUTPUT | Dataset { |
LIMS is with length "nq+1", it's the offset prefix sum for result IDS and result DISTANSESDISTANSE. The length of IDS and DISTANCES DISTANCE are the same but variable.
Suppose N queried vectors is are with label: {0, 1, 2, ..., n-1}
...
The memory used for IDS, DISTANCE and LIMS are allocated in Knowhere, they will be auto-freed when Dataset deconstruction.
By new, following index types support QueryByRange():
BinaryIDMAP
BinaryIVF
IDMAP
IVF_FLAT
IVF_PQ
IVF_SQ8
HNSW
Milvus
Range search completely reuses the call stack from SDK to segcore.
...
3. BruteForce::RangeSearch()
This API does range search for no-index dataset, it returns all unsorted results with distance "better than radius" (for IP: > radius; for others: < radius).
PROTO | static DatasetPtr |
INPUT | Dataset { Dataset { Config { knowhere::meta::RADIUS: - // radius for range search } |
OUTPUT | Dataset { |
The output is as same as QueryByRange().
Segcore
Segcore search flow will be updated as this flow chart, range search related change is marked RED.
In API query::SearchOnSealedIndex() and BruteForceSearch(), when Segcore, it uses radius parameters existence to decide to run range search or search. When "radius_low_bound" and "radius_high_bound" are set, range search is called; otherwise, search is called.
For API query::SearchOnSealedIndex() , the result returned by knowhere::QueryByRange() will be sorted and filtered, only results with distance falling in the range scope will be returned.
The same as API BruteForceSearch(), the result returned by knowhere::BruteForce::RangeSearch() will be sorted and filtered, only results with distance falling in the range scope will be returned.
...
and BruteForceSearch(), they do like following:
- pass in correct radius parameter to knowhere (For IP: radius_low_bound; for others: radius_high_bound)
- get all unsorted range search result from knowhere
- for each NQ's results, do sort, then filter using another radius parameter
- re-generate result Dataset with TOPK results
Whatever do range search or search, the output structure are same:
- query::SearchOnSealedIndex() => SearchResult
- BruteForceSearch() => SubSearchResult
Both SearchResult and SubSearchResult contain TOPK sorted result for each NQ.
- seg_offsets_: with length "nq * topk", -1 is filled in when no enough result data
- distances_: with lengh "nq * topk", data is undefined when no enough result data
Milvus
Range search completely reuses the call stack from SDK to segcore.
Compatibility, Deprecation, and Migration Plan(optional)
There is no compatibility issue.
Test Plan(required)
...
Knowhere
- Add new unittest
- Add benchmark using range search dataset
There is no public
...
dataset for range search. I have created range search data set based on sift1M and glove200.
You can find them in NAS:
test/milvus/ann_hdf5/sift-128-euclidean-range.hdf5
test/milvus/ann_hdf5/glove-200-angular-range.hdf5
test/milvus/ann_hdf5/binary/sift-4096-hamming-range.hdf5
For Milvus
...
Segcore
- Add new range search unittest
Milvus
- use sift1M/glove200 dataset to test range search (radius_low_bound = -1.0, radius_high_bound = 1000000), we expect to get identical result as search
...
Current proposal is better than the previous one in all respects.