Current state: Accepted

ISSUE: #6299

PRs: #6570 #6598 #6671 #7102

Keywords: Query / Search / Vector

Released: Milvus 2.0rc3


Summary

This project lets the `query` operation return vector fields (raw vector data) in its output fields while keeping memory consumption minimal.

Motivation

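In Milvus 2.0rc1, the `search` and `query` operations do not yet support returning vector fields as part of the result. This is to save memory: vector fields are much larger than scalar fields and would occupy too much memory, so `load_collection` and `load_partition` load only the scalar field data files and the index files into memory (the raw vector files are loaded only when no vector index exists).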

## Design Details(required)

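After search / query finishes, check whether output_fields contains a vector field. If it does, load the vector field of the segments that hold the result IDs, and use the offset that corresponds to each result ID to fetch its vector data.

1. Add the data structure `VectorFieldInfo` to record vector-data-related information in a `segment`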

```go
type VectorFieldInfo struct {
	mu          sync.RWMutex
	fieldBinlog *datapb.FieldBinlog
	rowNum      map[string]int64  // map[binlogPath]int64
	rawDataMmap map[string][]byte // map[binlogPath][]byte
}

type Segment struct {
	// ... other fields elided ...
	vectorFieldInfos map[UniqueID]*VectorFieldInfo
}
```

2. Add new interfaces to `segment`

```go
// fill vector raw data into RetrieveResults
func (s *Segment) fillRetrieveResults(plan *RetrievePlan, result *segcorepb.RetrieveResults) error

// 1. load vector field binlog file from minio
// 2. decode binlog file, get vector raw data
// 3. save raw data into local disk
// 4. do mmap
func (s *Segment) segmentVectorFieldDataMmap(fieldID int64, binlog string, rowCount int, data interface{}) ([]byte, error)
```

3. Add a new interface to `segmentLoader`

```go
func (loader *segmentLoader) loadSegmentVectorFieldsData(segment *Segment, binlogs []string) error
```

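4. Add the following logic to the retrieve function

* When the output fields include a vector field, the vector field has not been loaded, and the current segment's result is not empty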

```go
if err = q.historical.loader.loadSegmentVectorFieldsData(segment, binlogs); err != nil {
	return err
}
if err = segment.fillRetrieveResults(plan, result); err != nil {
	return err
}
```

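5. Add the parameter `include_vector_field` or `vector_fields[]` to the load_segment interface

**The `search` interface does not support returning raw vector data**
To get the raw vector data for the results returned by `search`, call `get_entity_by_id` afterwards.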

In Milvus 2.0rc1, query does not support returning vector fields in output. If a query request's output fields contain a float vector or binary vector field, the proxy returns an error.

This is out of consideration for memory consumption: a vector field with a large dimension can occupy hundreds of times the memory of a scalar field. So load_collection and load_partition generally load only the scalar fields' raw data into memory. Vector fields' raw data is loaded into memory only in 3 cases:

  1. streaming segment
  2. vector field's index type is FLAT
  3. vector field's index has not been created

Query can return a vector field in output only if that vector field's raw data has been loaded into memory.
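The three cases above can be written as a single predicate. This is only an illustrative sketch: `segmentState` and `rawVectorInMemory` are hypothetical names that do not come from the Milvus code base, and an empty `indexType` is assumed to mean that no index has been created.

```go
package main

import "fmt"

// segmentState is a hypothetical summary of the properties the rule depends on.
type segmentState struct {
	streaming bool   // true for a streaming segment
	indexType string // index type of the vector field; "" means no index (assumption)
}

// rawVectorInMemory reports whether the vector field's raw data is expected
// to be in memory after load_collection / load_partition.
func rawVectorInMemory(s segmentState) bool {
	return s.streaming || s.indexType == "FLAT" || s.indexType == ""
}

func main() {
	fmt.Println(rawVectorInMemory(segmentState{streaming: true, indexType: "IVF_FLAT"}))  // true
	fmt.Println(rawVectorInMemory(segmentState{streaming: false, indexType: "FLAT"}))     // true
	fmt.Println(rawVectorInMemory(segmentState{streaming: false, indexType: ""}))         // true
	fmt.Println(rawVectorInMemory(segmentState{streaming: false, indexType: "IVF_FLAT"})) // false
}
```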

But query needs the capability to return a vector's raw data; for example, a tester can use it to check the correctness of inserted data.


Currently search also does not support returning vector fields in output, but we don't plan to enhance search in this project. If users need the vector data after search returns IDs, they can call query to get it.

If there is a real requirement from users to let search return vectors in output, we can achieve this at the SDK level.

Design Details

Query supporting vector field in output can be divided into 2 steps:

  1. in the load segment stage, create a VectorFieldInfo for each vector field and save it into the segment struct
  2. at the end of the query stage,
    1. load the vector field's data if needed
    2. get the vector data and fill it into the query result


  • Add a new field of type VectorFieldInfo to the segment struct to record vector-field-related information

```go
type VectorFieldInfo struct {
    mu              sync.RWMutex
    fieldID         UniqueID
    fieldBinlog     *datapb.FieldBinlog
    rowDataInMemory bool
    rawData         map[string]storage.FieldData // map[binlogPath]FieldData
}

type Segment struct {
    // ... other fields elided ...
    vectorFieldInfos map[UniqueID]*VectorFieldInfo
}
```


  • Add a new interface in segment_loader

```go
// load vector field's data from info.fieldBinlog, save the raw data into info.rawData
func (loader *segmentLoader) loadSegmentVectorFieldData(info *VectorFieldInfo) error
```


  • Add a new interface in query_collection

```go
// For vector output fields, load raw data from fieldBinlog if needed,
// get vector raw data via result.Offset from *VectorFieldInfo, then
// fill vector raw data into result
func (q *queryCollection) fillVectorFieldsData(segment *Segment, result *segcorepb.RetrieveResults) error
```
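As a rough illustration of the two query-stage steps (load the raw data only if needed, then pick vectors by result offset), here is a simplified, self-contained sketch. It is not the Milvus implementation: `vectorFieldInfo`, `ensureLoaded` and `fillVectors` are hypothetical names, the raw data is a flat `[]float32` rather than `map[binlogPath]storage.FieldData`, and the loader is an injected function standing in for `loadSegmentVectorFieldData`.

```go
package main

import (
	"fmt"
	"sync"
)

// vectorFieldInfo is a simplified stand-in for VectorFieldInfo (assumption:
// one float vector field whose rows are stored back to back in rawData).
type vectorFieldInfo struct {
	mu      sync.RWMutex
	dim     int
	loaded  bool
	rawData []float32
}

// ensureLoaded calls load at most once and keeps the result in memory.
func (info *vectorFieldInfo) ensureLoaded(load func() ([]float32, error)) error {
	info.mu.Lock()
	defer info.mu.Unlock()
	if info.loaded {
		return nil
	}
	data, err := load()
	if err != nil {
		return err
	}
	info.rawData, info.loaded = data, true
	return nil
}

// fillVectors returns the vector stored at each row offset, in request order.
func (info *vectorFieldInfo) fillVectors(offsets []int64) ([][]float32, error) {
	info.mu.RLock()
	defer info.mu.RUnlock()
	out := make([][]float32, 0, len(offsets))
	for _, off := range offsets {
		start := int(off) * info.dim
		end := start + info.dim
		if off < 0 || end > len(info.rawData) {
			return nil, fmt.Errorf("offset %d out of range", off)
		}
		out = append(out, info.rawData[start:end])
	}
	return out, nil
}

func main() {
	info := &vectorFieldInfo{dim: 2}
	load := func() ([]float32, error) {
		// Three rows of dimension 2; a real loader would read a binlog.
		return []float32{0, 1, 10, 11, 20, 21}, nil
	}
	if err := info.ensureLoaded(load); err != nil {
		panic(err)
	}
	vecs, err := info.fillVectors([]int64{2, 0})
	fmt.Println(vecs, err)
}
```

Loading lazily at the end of the query keeps vector data out of memory for segments whose results never include a vector field.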


We also enhanced query to support wildcards in output fields.

  • "*" - means all scalar fields
  • "%" - means all vector fields

For example, suppose A and B are scalar fields and C and D are vector fields. Duplicated fields are removed automatically.

  • output_fields=["*"] ==> [A,B]
  • output_fields=["%"] ==> [C,D]
  • output_fields=["*","%"] ==> [A,B,C,D]
  • output_fields=["*",A] ==> [A,B]
  • output_fields=["*",C] ==> [A,B,C]
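The expansion rule shown in the examples above can be sketched as a small function. This is an illustration rather than the proxy's actual code: `expandOutputFields` is a hypothetical name, and keeping fields in order of first appearance is an assumption that happens to match the listed examples.

```go
package main

import "fmt"

// expandOutputFields replaces "*" with all scalar fields and "%" with all
// vector fields, then drops duplicates while keeping first-appearance order.
func expandOutputFields(outputFields, scalarFields, vectorFields []string) []string {
	seen := make(map[string]bool)
	var result []string
	add := func(names ...string) {
		for _, name := range names {
			if !seen[name] {
				seen[name] = true
				result = append(result, name)
			}
		}
	}
	for _, field := range outputFields {
		switch field {
		case "*":
			add(scalarFields...)
		case "%":
			add(vectorFields...)
		default:
			add(field)
		}
	}
	return result
}

func main() {
	scalars := []string{"A", "B"}
	vectors := []string{"C", "D"}
	fmt.Println(expandOutputFields([]string{"*"}, scalars, vectors))      // [A B]
	fmt.Println(expandOutputFields([]string{"%"}, scalars, vectors))      // [C D]
	fmt.Println(expandOutputFields([]string{"*", "%"}, scalars, vectors)) // [A B C D]
	fmt.Println(expandOutputFields([]string{"*", "A"}, scalars, vectors)) // [A B]
	fmt.Println(expandOutputFields([]string{"*", "C"}, scalars, vectors)) // [A B C]
}
```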


Public Interfaces

Public interface and struct for original vector data storage. They may be discussed and changed in the future.

```go
type ChunkManager interface {
	GetPath(key string) (string, error)
	Write(key string, content []byte) error
	Exist(key string) bool
	Read(key string) ([]byte, error)
	ReadAt(key string, p []byte, off int64) (n int, err error)
}
```


A VectorChunkManager implements the ChunkManager interface and adds a method to download a vector file from remote storage, deserialize its content, and finally save the pure vectors to local storage.

```go
type VectorChunkManager struct {
	localChunkManager  ChunkManager
	remoteChunkManager ChunkManager
}

func NewVectorChunkManager(localChunkManager ChunkManager, remoteChunkManager ChunkManager) *VectorChunkManager
```

localChunkManager is responsible for local file management and can be implemented with the Go os library. The path of the local chunk manager is configured in milvus.yaml with the key storage.path.
remoteChunkManager is responsible for cloud storage or remote server storage, and is implemented with the minio client for now.
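To make the local/remote split concrete, the following sketch shows one way a read could flow through two chunk managers: download from remote on first access, deserialize, cache the pure vector bytes locally, then serve reads from the local copy. Everything here is hypothetical and simplified: `chunkStore` is a trimmed subset of `ChunkManager`, `memStore` keeps data in a map instead of using disk or minio, and the `decode` function merely strips a fake 3-byte header in place of real binlog deserialization.

```go
package main

import (
	"errors"
	"fmt"
)

// chunkStore is a trimmed, hypothetical subset of ChunkManager.
type chunkStore interface {
	Write(key string, content []byte) error
	Exist(key string) bool
	Read(key string) ([]byte, error)
	ReadAt(key string, p []byte, off int64) (int, error)
}

// memStore is an in-memory chunkStore standing in for local disk or minio.
type memStore struct{ files map[string][]byte }

func newMemStore() *memStore { return &memStore{files: map[string][]byte{}} }

func (m *memStore) Write(key string, content []byte) error {
	m.files[key] = append([]byte(nil), content...)
	return nil
}

func (m *memStore) Exist(key string) bool {
	_, ok := m.files[key]
	return ok
}

func (m *memStore) Read(key string) ([]byte, error) {
	data, ok := m.files[key]
	if !ok {
		return nil, errors.New("key not found: " + key)
	}
	return data, nil
}

func (m *memStore) ReadAt(key string, p []byte, off int64) (int, error) {
	data, err := m.Read(key)
	if err != nil {
		return 0, err
	}
	if off < 0 || off > int64(len(data)) {
		return 0, errors.New("offset out of range")
	}
	return copy(p, data[off:]), nil
}

// vectorStore mirrors the shape of VectorChunkManager.
type vectorStore struct {
	local, remote chunkStore
	decode        func([]byte) ([]byte, error) // hypothetical binlog deserializer
}

// ReadAt fetches and decodes the remote file on first access, caches the
// pure vector bytes locally, and then reads from the local copy.
func (v *vectorStore) ReadAt(key string, p []byte, off int64) (int, error) {
	if !v.local.Exist(key) {
		raw, err := v.remote.Read(key)
		if err != nil {
			return 0, err
		}
		pure, err := v.decode(raw)
		if err != nil {
			return 0, err
		}
		if err := v.local.Write(key, pure); err != nil {
			return 0, err
		}
	}
	return v.local.ReadAt(key, p, off)
}

func main() {
	remote, local := newMemStore(), newMemStore()
	_ = remote.Write("binlog/1", []byte("HDRabcdefgh")) // fake header + payload
	strip := func(b []byte) ([]byte, error) { return b[3:], nil }
	v := &vectorStore{local: local, remote: remote, decode: strip}
	buf := make([]byte, 4)
	n, err := v.ReadAt("binlog/1", buf, 2)
	fmt.Println(n, string(buf[:n]), err, local.Exist("binlog/1")) // 4 cdef <nil> true
}
```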

When the offset of a vector is obtained, we can get the original vector data from the local vector data file.

## Test Plan(required)

Check that `get_entity_by_id` returns the correct raw vector data in the following 2 scenarios:

* scenario (1)
  * create_collection
  * insert
  * get_entity_by_id

* scenario (2)
  * create_collection
  * insert
  * create_index
  * get_entity_by_id

To get a vector's raw data by its ID, the following process is used:

1. Get the number of IDs in each binlog of the segment, and the vector file names, during load_segment. The binlog files are sorted by the last ID in the file name to guarantee increasing order. Suppose the sizes are 300, 300, 400, 500.

2. Get the ID's offset within the segment in the C layer. Suppose the offset is 700.

3. The vector we want is in the 3rd vector file, because 300+300 < 700 < 300+300+400.

4. Load the 3rd file into memory and deserialize it into pure vectors. Save the vectors to local storage. Release the memory.

5. Mmap the local file into memory and get the data at offset 100 (700 - 600). The data length depends on the data type and dim.
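Steps 2, 3 and 5 amount to a prefix-sum lookup followed by a byte-offset calculation. The sketch below works through the example numbers from the list. `locateRow` and `rowBytes` are hypothetical helpers; the row sizes assume 4 bytes per float vector element and 1 bit per binary vector dimension.

```go
package main

import "fmt"

// locateRow maps a segment-level row offset to the index of the binlog file
// that holds it and the row offset inside that file.
func locateRow(rowCounts []int64, offset int64) (int, int64, error) {
	if offset < 0 {
		return 0, 0, fmt.Errorf("negative offset %d", offset)
	}
	remaining := offset
	for i, n := range rowCounts {
		if remaining < n {
			return i, remaining, nil
		}
		remaining -= n
	}
	return 0, 0, fmt.Errorf("offset %d is beyond the segment's rows", offset)
}

// rowBytes returns the size in bytes of one vector (assumption: float32
// elements for float vectors, one bit per dimension for binary vectors).
func rowBytes(dim int, binary bool) int64 {
	if binary {
		return int64(dim / 8)
	}
	return int64(dim * 4)
}

func main() {
	file, local, err := locateRow([]int64{300, 300, 400, 500}, 700)
	if err != nil {
		panic(err)
	}
	fmt.Println(file, local) // 2 100 (zero-based index 2 is the 3rd file)

	// Byte range of the vector inside the local file for a 128-dim float vector.
	size := rowBytes(128, false)
	fmt.Println(local*size, size) // 51200 512
}
```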



Test Plan

Run query / search (with a vector field in the output fields) across all combinations of the following scenarios, and check the correctness of the results.

  1. float vector or binary vector
  2. with/without index
  3. all kinds of index types