Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Current state: ["Under DiscussionApproved"]

ISSUE: #17599

PRs: 

Keywords: search, range search, radius, low bound, high boundrange filter

Released: 

Summary(required)

...

This MEP is about to realize a functionality named "range search". User specifies a range scope -- including radius low bound and radius high boundand range filter (optional), Milvus does "range search", and also returns TOPK sorted results with distance falling in this range scope.

  • both radius low bound and radius high bound MUST be setparam "radius" MUST be set, param "range filter" is optional
  • falling in range scope means
Metric typeBehavior
IPradius
_low_bound
< distance <=
 radius
range_
high_bound
filter
L2 and other metric types
radius
range_
low_bound
filter <= distance <
 radius_high_bound
radius

Motivation(required)

Many users request the "range search" functionality. With this capability, user can

...

We reuse the SDK interface Search() to do "range search". Only add 2 more parameters "radius_low_bound" and "radiusrange_high_boundfilter" into params.

When "radius_low_bound" and When param "radius_high_bound" are is specified, Milvus does range search; otherwise, Milvus does search.

As shown in the following figure, set "radiusrange_low_boundfilter": 1.0, "radius_high_bound": 2.0 in search_ params.params.

Code Block
languagepy
  default_index = {"index_type": "HNSW", "params":{"M": 48, "efConstruction": 500}, "metric_type": "L2"}
  collection.create_index("float_vector", default_index)
  collection.load()  
  search_params = {"metric_type": "L2", "params": {"ef": 32, "radiusrange_low_boundfilter": 1.0, "radius_high_bound": 2.0}}
  res = collection.search(vectors[:nq], "float_vector", search_params, limit, "int64 >= 0")

...


IPL2HAMMINGJACCARDTANIMOTOSUBSTRUCTURESUPERSTRUCTURE
BIN_IDMAP

  •   
  •   
  •   


BIN_IVF_FLAT

  •   
  •   
  •   


IDMAP
  •   
  •   





IVF_FLAT
  •   
  •   





IVF_PQ
  •   
  •   





IVF_SQ8
  •   
  •   





HNSW
  •   
  •   





ANNOY






DISKANN
  •  
  •   





If call range search API with unsupported index types or unsupported metric types, exception will be thrown out.

In Knowhere, two new parameters "radius_low_bound" and "radiusrange_high_boundfilter" are passed in via config, and range search will return all un-sorted results with distance falling in this range scope.

Metric typeBehavior
IPradius
_low_bound
< distance <=
 radius
range_
high_bound
filter
L2 or other metric typesrange_filter <= distance < radius
_low_bound <= distance < radius_high_bound

Knowhere run range search in 2 steps:

  1. pass param "radius" to thirdparty to call their range search APIs, and get result
  2. if param "range_filter " is set, filter above result and return; if not, return above result directly


We add 3 new APIs CheckRangeSearch(), QueryByRange() and BruteForce::RangeSearch() to support range search.

...

This API returns all unsorted results with distance falling in the specified range scope.

PROTO
virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset)

INPUT

Dataset {
    knowhere::meta::TENSOR: -   // query data
    knowhere::meta::ROWS: -      // rows of queries
    knowhere::meta::DIM: -          // dimension
}

Config {

    knowhere::meta::RADIUS_LOW_BOUND: -   

    knowhere::meta::RADIUSRANGE_HIGH_BOUNDFILTER: -   

}

OUTPUT

Dataset {
    knowhere::meta::IDS: -                // result IDs with length LIMS[nq]
    knowhere::meta::DISTANCE: -  // result DISTANCES with length LIMS[nq]
    knowhere::meta::LIMS: -            // result offset prefix sum with length nq + 1
}

...

This API does range search for no-index dataset, it returns all unsorted results with distance "better than radius" (for IP: > radius; for others: < radius).

PROTO
static DatasetPtr
RangeSearch(const DatasetPtr base_dataset,
const DatasetPtr query_dataset,
const Config& config,
const faiss::BitsetView bitset);

INPUT

Dataset {
    knowhere::meta::TENSOR: -   // base data
    knowhere::meta::ROWS: -      // rows of base data
    knowhere::meta::DIM: -          // dimension
}

Dataset {
    knowhere::meta::TENSOR: -   // query data
    knowhere::meta::ROWS: -      // rows of queries
    knowhere::meta::DIM: -          // dimension
}

Config {

    knowhere::meta::RADIUS_LOW_BOUND: -   

    knowhere::meta::RADIUSRANGE_HIGH_BOUNDFILTER: -   

}

OUTPUT

Dataset {
    knowhere::meta::IDS: -                // result IDs with length LIMS[nq]
    knowhere::meta::DISTANCE: -  // result DISTANCES with length LIMS[nq]
    knowhere::meta::LIMS: -            // result offset prefix sum with length nq + 1
}

...

Segcore uses radius parameter's existence to decide whether to run search, or to run range search.

...

, or to run range search.

  • when param "radius" is set, range search is called; 
  • otherwise, search is called. 

...

  1. pass radius parameters (radius _low_bound / radiusrange_high_boundfilter) to knowhere 
  2. get all unsorted range search result from knowhere
  3. for each NQ's results, do heap-sort
  4. return result Dataset with TOPK results

...

If user wants to get more than 16384 range search results, they can call range search multiple times with different radius range_filter parameter (use L2 as an example)

  • NQ = 1

1st call with (radiusrange_low_bound filter = 0.0, radius _high_bound = inf), get result distances like this:

{d(0), d(1), d(2), ..., d(n-1)}

2nd call with (radiusrange_low_bound filter = d(n-1), radius _high_bound = inf), get result distances like this:

{d(n), d(n+1), d(n+2), ..., d(2n-1)}

3rd call with (radiusrange_low_bound filter = d(2n-1), radius _high_bound = inf), get result distances like this:

...

  • NQ > 1 (suppose NQ = 2)

1st call with (radiusrange_low_bound filter = 0.0, radius _high_bound = inf), get result distances like this: 

{d(0,0), d(0,1), d(0,2), ..., d(0,n-1), d(1,0), d(1,1), d(1,2), ..., d(1,n-1)}

2nd call with (radiusrange_low_bound filter = min{d(0,n-1), d(1,n-1)}, radius _high_bound = inf), get result distances like this: 

{d(0,n), d(0,n+1), d(0,n+2), ..., d(0,2n-1), d(1,n), d(1,n+1), d(1,n+2), ..., d(1,2n-1)}

3rd call with (radiusrange_low_bound filter = min{d(0,2n-1), d(1,2n-1)}, radius _high_bound = inf), get result distances like this:

...

  1. test/milvus/ann_hdf5/binary/sift-4096-hamming-range.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. unified radius = 291.0
    3. result length for each nq is different
    4. total result num 1,063,078
  2. test/milvus/ann_hdf5/sift-128-euclidean-range.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. unified radius = 186.0 * 186.0
    3. result length for each nq is different
    4. total result num 1,054,377
  3. test/milvus/ann_hdf5/sift-128-euclidean-range-multi.hdf5
    1. base dataset and query dataset are identical with sift1m
    2. ground truth IDs and Distances are identical with sift1m
    3. each nq's radius is set to the last ground truth distance
    4. result length for each nq is 100
    5. total result num 1,000,000
  4. test/milvus/ann_hdf5/glove-200-angular-range.hdf5
    1. base dataset and query dataset are identical with glove200
    2. unified radius = 0.52
    3. result length for each nq is different
    4. total result num 1,060,888

...

  1. use sift1M/glove200 dataset to test range search (radius _low_bound = = max_float / -1.0, radius_high_bound = max_float), we expect to get identical results as search

...