Content Comparison

Current state: ["Under Discussion"]

...

	IP	L2	HAMMING	JACCARD	TANIMOTO	SUBSTRUCTURE	SUPERSTRUCTURE
BIN_IDMAP
BIN_IVF_FLAT
IDMAP
IVF_FLAT
IVF_PQ
IVF_SQ8
HNSW
ANNOY

If call range search API with unsupported index types or unsupported metric types, knowhere will throw out exception.

In Knowhere, only one parameter "radius" is passed in via config, and range search will return all un-sorted results with distance "better" than radius.

...

The valid scope for "radius" is defined as: -1.0 <= radius <= float_max

metric type	range	similar	not similar
L2	(0, inf)	small	large
IP	[-1, 1]	1	-1
jaccard	[0, 1]	0	1
tanimoto	[0, 0.5]	0	0.5
hamming	[0, n]	0	n

2. QueryByRange()

This API does range search for index, it returns all unsorted results with distance "better than radius" (for IP: > radius; for others: < radius).

PROTO

virtual DatasetPtr
QueryByRange(const DatasetPtr& dataset, const Config& config, const faiss::BitsetView bitset)

INPUT

Dataset {
knowhere::meta::TENSOR: - // query data
knowhere::meta::ROWS: - // rows of queries
knowhere::meta::DIM: - // dimension
}

Config {

knowhere::meta::RADIUS: - // radius for range search

}

OUTPUT

Dataset {
knowhere::meta::IDS: - // result IDs with length LIMS[nq]
knowhere::meta::DISTANCE: - // result DISTANCES with length LIMS[nq]
knowhere::meta::LIMS: - // result offset prefix sum with length nq + 1
}

...

This API does range search for no-index dataset, it returns all unsorted results with distance "better than radius" (for IP: > radius; for others: < radius).

PROTO

static DatasetPtr
RangeSearch(const DatasetPtr base_dataset,
            const DatasetPtr query_dataset,
            const Config& config,
            const faiss::BitsetView bitset);

INPUT

Dataset {
knowhere::meta::TENSOR: - // base data
knowhere::meta::ROWS: - // rows of base data
knowhere::meta::DIM: - // dimension
}

Dataset {
knowhere::meta::TENSOR: - // query data
knowhere::meta::ROWS: - // rows of queries
knowhere::meta::DIM: - // dimension
}

Config {

knowhere::meta::RADIUS: - // radius for range search

}

OUTPUT

Dataset {
knowhere::meta::IDS: - // result IDs with length LIMS[nq]
knowhere::meta::DISTANCE: - // result DISTANCES with length LIMS[nq]
knowhere::meta::LIMS: - // result offset prefix sum with length nq + 1
}

...

Range search completely reuses the call stack from SDK to segcore.

Compatibility, Deprecation, and Migration Plan(optional)

There is no compatibility issue.

...

With above solution, user can get maximum 16384 range search results in one call.

If user wants to get more than 16384 range search results, they can do like this: (use L2 as an example)

NQ = 1

1st call with (radius_low_bound = 0.0, radius_high_bound = inf), get result distances like this: {d(0), d(1), d(2), ..., d(n-1)}

2nd call with (radius_low_bound = d(n-1), radius_high_bound = inf), get result distances like this: {d(n), d(n+1), d(n+2), ..., d(2n-1)}

3rd call with (radius_low_bound = d(2n-1), radius_high_bound = inf), get result distances like this: {d(2n), d(2n+1), d(2n+2), ..., d(3n-1)}

...

NQ > 1 (suppose NQ = 2)

1st call with (radius_low_bound = 0.0, radius_high_bound = inf), get result distances like this:

{d(0,0), d(0,1), d(0,2), ..., d(0,n-1), d(1,0), d(1,1), d(1,2), ..., d(1,n-1)}

2nd call with (radius_low_bound = min{d(0,n-1), d(1,n-1)}, radius_high_bound = inf), get result distances like this:

{d(0,n), d(0,n+1), d(0,n+2), ..., d(0,2n-1), d(1,n), d(1,n+1), d(1,n+2), ..., d(1,2n-1)}

3rd call with (radius_low_bound = min{d(0,2n-1), d(1,2n-1)}, radius_high_bound = inf), get result distances like this:

{d(0,2n), d(0,2n+1), d(0,2n+2), ..., d(0,3n-1), d(1,2n), d(1,2n+1), d(1,2n+2), ..., d(1,3n-1)}

...

The result of 2nd call will have some duplications with the result of 1st call, use need do duplication check and remove them.

Compatibility, Deprecation, and Migration Plan(optional)

There is no compatibility issue.

Test Plan(required)

Knowhere

Add new unittest
Add benchmark using range search dataset

...

Need implement Merge operation for chunk, segment, query node and proxy
Memory explosion caused by Merge
Many API modification caused by invalid topk parameter

...

Many API modification caused by invalid topk parameter

Current proposal is better than the previous one in all respects.

Others

Because the result length of range search from knowhere is variable, knowhere plan to afford another API to return the range search result count for each NQ.

If there is user request to get all range search result in one call, query node team will afford another solution to save knowhere's output to S3 directly.

References(optional)

...

Version	Old Version 45	New Version 46
Changes made by	Yudong Cai	Yudong Cai
Saved on	Sept 21, 2022	Sept 22, 2022

Versions Compared

Key

2. QueryByRange()

Compatibility, Deprecation, and Migration Plan(optional)

Compatibility, Deprecation, and Migration Plan(optional)

Test Plan(required)

Knowhere

Others

References(optional)