...
```
collection_id/partition_id/segment_id/field1/log_1
collection_id/partition_id/segment_id/field1/log_2
...
collection_id/partition_id/segment_id/field1/log_m
```
Query and Search are two different operations, although their parameters are very similar. For convenience, the rest of this text uses Search to refer to both operations unless otherwise specified.
For QueryNode, a Search requires both Historical and Streaming data. Historical data can be treated as immutable, whereas Streaming data grows constantly because it is continuously consumed from MessageStorage. Therefore, to optimize Search, the two data sources need different designs.
First, let's introduce how QueryNode generally processes a Search request.
QueryNode first determines the set of segments involved in the Search request for a Collection. It then performs the search on each segment to obtain sub-results and merges all the sub-results into the final result. The Segment is therefore the basic processing unit of Search operations, so we focus on how the search is executed within a Segment.
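The per-segment search followed by a merge can be sketched as below. This is a minimal illustration, not QueryNode's actual code: the `SubResult` type, the `mergeTopK` function, and the convention that a smaller distance means a closer match are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// SubResult is a hypothetical per-segment hit: the entity's offset inside
// its segment and its distance to the target vector (smaller is closer).
type SubResult struct {
	SegmentID int64
	Offset    int32
	Distance  float32
}

// mergeTopK flattens the sub-results of all segments, orders them by
// distance, and keeps at most k of them as the final result.
func mergeTopK(subResults [][]SubResult, k int) []SubResult {
	var all []SubResult
	for _, hits := range subResults {
		all = append(all, hits...)
	}
	sort.SliceStable(all, func(i, j int) bool {
		return all[i].Distance < all[j].Distance
	})
	if k >= 0 && len(all) > k {
		all = all[:k]
	}
	return all
}

func main() {
	// Sub-results as they might come back from two segments.
	seg1 := []SubResult{{1, 0, 0.10}, {1, 7, 0.40}}
	seg2 := []SubResult{{2, 3, 0.25}, {2, 5, 0.90}}

	for _, r := range mergeTopK([][]SubResult{seg1, seg2}, 3) {
		fmt.Printf("segment=%d offset=%d distance=%.2f\n", r.SegmentID, r.Offset, r.Distance)
	}
}
```

Sorting the flattened list keeps the sketch short; a real implementation would more likely use a bounded heap so that it never holds more than K candidates at once.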
Within a Segment, a bitmask goes through roughly the following three processing steps. First, the bitmask is generated from the query expression; then it is modified according to the delete log in the DeltaFile; finally, it is modified again according to the time-travel parameter and the timestamp field.
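The three steps can be sketched as follows. This is a simplified illustration under several assumptions that do not come from the design itself: the mask is a `[]bool` in which `true` means the entity is still a candidate, the delete log has already been resolved to segment offsets, and an entity is hidden by time travel when its timestamp is later than the requested one. The name `buildBitmask` is hypothetical.

```go
package main

import "fmt"

// buildBitmask applies the three steps in order. mask[i] == true means the
// entity at segment offset i is still a candidate (an assumed convention).
func buildBitmask(
	matches []bool, // step 1: per-entity result of the query expression
	deletedOffsets []int32, // step 2: offsets named by the delete log (assumed pre-resolved)
	timestamps []uint64, // step 3: timestamp field, one value per entity
	travelTs uint64, // step 3: time-travel timestamp
) []bool {
	// Step 1: start from the expression result.
	mask := make([]bool, len(matches))
	copy(mask, matches)

	// Step 2: clear the bits of deleted entities.
	for _, off := range deletedOffsets {
		if off >= 0 && int(off) < len(mask) {
			mask[off] = false
		}
	}

	// Step 3: clear the bits of entities written after the time-travel point.
	for i, ts := range timestamps {
		if i < len(mask) && ts > travelTs {
			mask[i] = false
		}
	}
	return mask
}

func main() {
	matches := []bool{true, true, false, true}
	deleted := []int32{1}
	timestamps := []uint64{10, 20, 30, 50}

	fmt.Println(buildBitmask(matches, deleted, timestamps, 40)) // [true false false false]
}
```

The sketch ignores the timestamps carried by delete records, which a complete implementation would also have to compare against the time-travel parameter.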
The final bitmask, together with the vector field and the target vector, participates in the approximate query, which returns an array containing the positions of the Entities that meet the search conditions. For convenience, we name this array SegmentOffsets.
When the limit parameter of the Search is K, only the TopK results are needed, so the length of SegmentOffsets is K. When doing a hybrid search, you can also specify output_field to retrieve the data of the corresponding Entity's Field.
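To make the role of the bitmask and of K concrete, here is a small sketch that produces SegmentOffsets for one segment. It uses an exact brute-force scan with squared L2 distance as a stand-in for the approximate query, and it assumes that `true` in the mask marks a candidate. The function name `searchSegment` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// searchSegment returns the offsets of the (at most) k unmasked vectors
// closest to target. Brute force replaces the approximate index here.
func searchSegment(vectors [][]float32, mask []bool, target []float32, k int) []int32 {
	type hit struct {
		offset int32
		dist   float32
	}
	var hits []hit
	for i, v := range vectors {
		if !mask[i] {
			continue // filtered out by expression, delete log, or time travel
		}
		var d float32
		for j := range v {
			diff := v[j] - target[j]
			d += diff * diff
		}
		hits = append(hits, hit{int32(i), d})
	}
	sort.SliceStable(hits, func(a, b int) bool { return hits[a].dist < hits[b].dist })
	if k >= 0 && len(hits) > k {
		hits = hits[:k]
	}
	offsets := make([]int32, len(hits))
	for i, h := range hits {
		offsets[i] = h.offset
	}
	return offsets
}

func main() {
	vectors := [][]float32{{0, 0}, {1, 1}, {0.1, 0}, {5, 5}}
	mask := []bool{true, false, true, true}

	fmt.Println(searchSegment(vectors, mask, []float32{0, 0}, 2)) // [0 2]
}
```

In this sketch the result can be shorter than K when fewer than K entities pass the bitmask.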
It can be seen that segment offsets play a key role in segment processing. We need to calculate the segment offsets based on the approximate search on the vector and the filtering of the expression; we also need to extract the data of each Field in the corresponding entity of the Segment based on the offset.
We abstract these operations as the StringField interface:
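```go
// StringField abstracts a string column stored in a Segment.
type StringField interface {
	// extract returns the strings located at the given segment offsets.
	extract(segmentOffsets []int32) []string
	// serialize encodes the field as bytes so it can be stored as an index file.
	serialize() []byte
	// deserialize rebuilds the field from the bytes of an index file.
	deserialize(data []byte)
}

// Filter evaluates expression against field and returns the matching segment offsets.
func Filter(expression string, field StringField) (segmentOffsets []int32)
```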
The extract method of StringField retrieves the corresponding strings for the provided segment offsets.
The Filter function calculates the segment offsets on the StringField based on the expression string.
The serialize method serializes the StringField into a slice of bytes, which is convenient to store in ObjectStorage as an index file.
The deserialize method deserializes an index file into a StringField object.
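As an illustration of how these methods fit together, the sketch below gives one possible in-memory implementation. It is not the design's actual index structure: `memStringField` and `filterEquals` are hypothetical names, `encoding/gob` is chosen only to keep the serialization short, and the filter handles a single equality comparison instead of parsing an expression string.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// StringField repeats the interface described above.
type StringField interface {
	extract(segmentOffsets []int32) []string
	serialize() []byte
	deserialize(data []byte)
}

// memStringField (hypothetical) stores every value in a slice indexed by
// segment offset.
type memStringField struct {
	values []string
}

// Compile-time check that memStringField satisfies StringField.
var _ StringField = (*memStringField)(nil)

func (f *memStringField) extract(segmentOffsets []int32) []string {
	out := make([]string, 0, len(segmentOffsets))
	for _, off := range segmentOffsets {
		out = append(out, f.values[off])
	}
	return out
}

func (f *memStringField) serialize() []byte {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(f.values); err != nil {
		panic(err) // a sketch; real code would return the error
	}
	return buf.Bytes()
}

func (f *memStringField) deserialize(data []byte) {
	var values []string
	if err := gob.NewDecoder(bytes.NewReader(data)).Decode(&values); err != nil {
		panic(err)
	}
	f.values = values
}

// filterEquals is a simplified stand-in for Filter: it returns the offsets
// of all entities whose value equals target.
func filterEquals(f *memStringField, target string) []int32 {
	var offsets []int32
	for i, v := range f.values {
		if v == target {
			offsets = append(offsets, int32(i))
		}
	}
	return offsets
}

func main() {
	field := &memStringField{values: []string{"apple", "banana", "cherry"}}
	fmt.Println(field.extract([]int32{2, 0})) // [cherry apple]

	// Round trip through the serialized form, as with an index file.
	restored := &memStringField{}
	restored.deserialize(field.serialize())
	fmt.Println(filterEquals(restored, "banana")) // [1]
}
```

A linear scan like this is only adequate for small segments; the point of serializing the field as an index file is that a real implementation can use a structure that answers Filter without visiting every value.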