Current state: Under Discussion
ISSUE: #6299
PRs: #6570
Keywords: Query / Search / Vector
Released: Milvus 2.0rc3
Summary
Let `search` and `query` operations return raw vector data in output fields while keeping memory consumption minimal.
Motivation
In Milvus 2.0rc1, operations such as search and query cannot return raw vector data in output fields. The reason is memory consumption:
a high-dimensional vector field can occupy hundreds of times more memory than a scalar field. So in general, load_collection and load_partition load only
the raw data of scalar fields into memory. The raw data of vector fields is loaded into memory only in 3 cases:
- streaming segment
- vector field's index type is FLAT
- vector field's index has not been created
Design Details
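After search / query finishes, check whether output_fields contains a vector field. If it does, load the vector field of the segments that contain the result IDs, and use the offsets of those result IDs to obtain the corresponding vector data.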
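The offset lookup described above can be illustrated with a minimal sketch. It assumes the vector column of one segment is already decoded into a flat, row-major `[]float32` buffer; the function name `gatherFloatVectors` is hypothetical and not part of the proposal.

```go
package main

import "fmt"

// gatherFloatVectors copies the vectors at the given row offsets out of a
// flat, row-major buffer that holds one segment's float vector column.
// The name and the flat layout are assumptions made for illustration.
func gatherFloatVectors(flat []float32, dim int, offsets []int64) ([][]float32, error) {
	if dim <= 0 || len(flat)%dim != 0 {
		return nil, fmt.Errorf("invalid dim %d for buffer of length %d", dim, len(flat))
	}
	rows := int64(len(flat) / dim)
	out := make([][]float32, 0, len(offsets))
	for _, off := range offsets {
		if off < 0 || off >= rows {
			return nil, fmt.Errorf("offset %d out of range [0, %d)", off, rows)
		}
		start := int(off) * dim
		vec := make([]float32, dim)
		copy(vec, flat[start:start+dim])
		out = append(out, vec)
	}
	return out, nil
}

func main() {
	// Three 2-dimensional vectors stored back to back.
	flat := []float32{0, 1, 10, 11, 20, 21}
	vecs, err := gatherFloatVectors(flat, 2, []int64{2, 0})
	if err != nil {
		panic(err)
	}
	fmt.Println(vecs) // [[20 21] [0 1]]
}
```

The result keeps the order of the requested offsets, so it lines up with the order of the result IDs.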
1. Add new field VectorFieldInfo into segment struct to record vector field related information
```go
type VectorFieldInfo struct {
	mu              sync.RWMutex
	fieldID         UniqueID
	fieldBinlog     *datapb.FieldBinlog
	rowDataInMemory bool
	rawData         map[string]storage.FieldData // map[binlogPath]FieldData
}

type Segment struct {
	... ...

	vectorFieldInfos map[UniqueID]*VectorFieldInfo
}
```
2. Add new interface in segment
```go
// based on result.Offset, get vector raw data from fieldInfo,
// then fill vector raw data into result
func (s *Segment) fillRetrieveResults(result *segcorepb.RetrieveResults, fieldInfo *VectorFieldInfo) error
```
```go
// fill vector raw data into RetrieveResults
func (s *Segment) fillRetrieveResults(plan *RetrievePlan, result *segcorepb.RetrieveResults) error
// 1. load vector field binlog file from minio
// 2. decode binlog file, get vector raw data
// 3. save raw data into local disk
// 4. do mmap
func (s *Segment) segmentVectorFieldDataMmap(fieldID int64, binlog string, rowCount int, data interface{}) ([]byte, error)
```
3. Add a new interface to `segmentLoader`
```go
func (loader *segmentLoader) loadSegmentVectorFieldsData(segment *Segment, binlogs []string) error
```
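4. Add the following logic to the retrieve function
* When the output fields include a vector field, the vector field has not been loaded, and the current segment's result is not empty: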
```go
if err = q.historical.loader.loadSegmentVectorFieldsData(segment, binlogs); err != nil {
	return err
}
if err = segment.fillRetrieveResults(plan, result); err != nil {
	return err
}
```
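5. Add a parameter, `include_vector_field` or `vector_fields[]`, to the load_segment interface
**The `search` interface does not support returning raw vector data**
To get the raw vector data that corresponds to `search` results, make a follow-up call to `get_entity_by_id`.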
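The condition in step 4 can be sketched as a small predicate. This is only an illustration: `segmentView` and `vectorFieldsToLoad` are hypothetical names, and the two maps stand in for whatever bookkeeping the segment actually keeps (for example `vectorFieldInfos`).

```go
package main

import "fmt"

// segmentView is a hypothetical, simplified view of a segment. It records
// which fields are vector fields and which already have raw data in memory.
type segmentView struct {
	vectorFields map[int64]bool // fieldID -> field is a vector field
	rawInMemory  map[int64]bool // fieldID -> raw data already in memory
}

// vectorFieldsToLoad returns the output fields that need an on-demand load:
// vector fields whose raw data is not in memory, and only when the segment
// contributed at least one result row.
func vectorFieldsToLoad(seg segmentView, outputFields []int64, resultCount int) []int64 {
	if resultCount == 0 {
		return nil
	}
	var pending []int64
	for _, id := range outputFields {
		if seg.vectorFields[id] && !seg.rawInMemory[id] {
			pending = append(pending, id)
		}
	}
	return pending
}

func main() {
	seg := segmentView{
		vectorFields: map[int64]bool{101: true, 102: true},
		rawInMemory:  map[int64]bool{102: true},
	}
	fmt.Println(vectorFieldsToLoad(seg, []int64{100, 101, 102}, 5)) // [101]
	fmt.Println(vectorFieldsToLoad(seg, []int64{100, 101, 102}, 0)) // []
}
```

Skipping segments with an empty result avoids fetching binlogs that cannot contribute any vector to the response.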
Original vector data storage public interface and struct
Public Interfaces
```go
type FileManager interface {
	GetFile(path string) (string, error)
	PutFile(path string, content []byte) error
	Exist(path string) bool
	ReadFile(path string) []byte
}
```
`VectorFileManager` implements the `FileManager` interface.
```go
type VectorFileManager struct {
	localFileManager  FileManager
	remoteFileManager FileManager
	insertCodec       *InsertCodec
}
```
localFileManager is responsible for local file management and can be implemented with the Go os package.
remoteFileManager is responsible for cloud storage or remote server storage, and is currently implemented with the MinIO client.
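A minimal sketch of how a local and a remote manager might be combined is shown below. Several things here are assumptions rather than part of the proposal: `localFiles`, `memoryFiles`, and `cachingFileManager` are hypothetical names, `GetFile` is assumed to return a local path, an in-memory map stands in for the MinIO-backed remote manager, and binlog decoding with `insertCodec` is left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// FileManager is the interface proposed above.
type FileManager interface {
	GetFile(path string) (string, error)
	PutFile(path string, content []byte) error
	Exist(path string) bool
	ReadFile(path string) []byte
}

// localFiles is a hypothetical local implementation rooted at dir.
type localFiles struct{ dir string }

func (l localFiles) abs(path string) string { return filepath.Join(l.dir, path) }

// GetFile returns the local path of the file (an assumed meaning of the
// string result; the proposal does not define it).
func (l localFiles) GetFile(path string) (string, error) {
	if !l.Exist(path) {
		return "", fmt.Errorf("local file %q not found", path)
	}
	return l.abs(path), nil
}

func (l localFiles) PutFile(path string, content []byte) error {
	if err := os.MkdirAll(filepath.Dir(l.abs(path)), 0o755); err != nil {
		return err
	}
	return os.WriteFile(l.abs(path), content, 0o644)
}

func (l localFiles) Exist(path string) bool {
	_, err := os.Stat(l.abs(path))
	return err == nil
}

func (l localFiles) ReadFile(path string) []byte {
	data, err := os.ReadFile(l.abs(path))
	if err != nil {
		return nil // the interface has no error result
	}
	return data
}

// memoryFiles is an in-memory stand-in for the remote (MinIO) manager.
type memoryFiles map[string][]byte

func (m memoryFiles) GetFile(path string) (string, error) {
	if !m.Exist(path) {
		return "", fmt.Errorf("remote file %q not found", path)
	}
	return path, nil
}
func (m memoryFiles) PutFile(path string, content []byte) error {
	m[path] = append([]byte(nil), content...)
	return nil
}
func (m memoryFiles) Exist(path string) bool      { _, ok := m[path]; return ok }
func (m memoryFiles) ReadFile(path string) []byte { return m[path] }

// cachingFileManager copies a remote file to local storage on first use.
// Only GetFile is sketched here.
type cachingFileManager struct{ local, remote FileManager }

func (c cachingFileManager) GetFile(path string) (string, error) {
	if !c.local.Exist(path) {
		if !c.remote.Exist(path) {
			return "", fmt.Errorf("file %q not found in remote storage", path)
		}
		if err := c.local.PutFile(path, c.remote.ReadFile(path)); err != nil {
			return "", err
		}
	}
	return c.local.GetFile(path)
}

func main() {
	dir, err := os.MkdirTemp("", "vector-cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	remote := memoryFiles{"seg1/101/0": []byte("raw vector bytes")}
	files := cachingFileManager{local: localFiles{dir: dir}, remote: remote}
	localPath, err := files.GetFile("seg1/101/0")
	if err != nil {
		panic(err)
	}
	fmt.Println(localPath, len(files.local.ReadFile("seg1/101/0")))
}
```

Keeping both managers behind the same interface means the caching logic does not need to know which storage backend is in use.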
Once the offset of a vector is obtained, the original vector data can be read from the local vector data file.
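As a rough sketch of that lookup, the example below reads one vector by row offset from a local file. The file layout is an assumption made for illustration (fixed-size rows of little-endian float32 values with no header), and `writeTempVectors` and `readVectorAt` are hypothetical helpers; the proposal itself describes mapping the file with mmap rather than reading it.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
)

// writeTempVectors writes vectors back to back into a temporary file and
// returns its path. Assumed layout: little-endian float32, no header.
func writeTempVectors(vecs [][]float32) (string, error) {
	f, err := os.CreateTemp("", "vectors-*.bin")
	if err != nil {
		return "", err
	}
	defer f.Close()
	for _, vec := range vecs {
		if err := binary.Write(f, binary.LittleEndian, vec); err != nil {
			return "", err
		}
	}
	return f.Name(), nil
}

// readVectorAt reads the vector stored at the given row offset.
func readVectorAt(path string, dim int, offset int64) ([]float32, error) {
	if dim <= 0 {
		return nil, fmt.Errorf("invalid dim %d", dim)
	}
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, dim*4) // 4 bytes per float32
	if _, err := f.ReadAt(buf, offset*int64(len(buf))); err != nil {
		return nil, fmt.Errorf("read vector at offset %d: %w", offset, err)
	}
	vec := make([]float32, dim)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(buf[i*4:]))
	}
	return vec, nil
}

func main() {
	path, err := writeTempVectors([][]float32{{1, 2}, {3, 4}, {5, 6}})
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)

	vec, err := readVectorAt(path, 2, 1)
	if err != nil {
		panic(err)
	}
	fmt.Println(vec) // [3 4]
}
```

Because every row has the same size, the byte position of a vector is simply the row offset multiplied by `dim * 4`, so no index over the file is needed.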
Test Plan
Run query / search (with a vector field in the output fields) across all combinations of the following scenarios and check that the results are correct.
- float vector or binary vector
- with/without index
- all index types