Current state: "Under Discussion"

ISSUE: #1924 #3199 #4201 #4430 #4810 #5603

PRs:

Keywords: string, tries, hybrid search

Released:

Summary

This document explains the importance of supporting the String data type, reviews the data flow and data model of the Milvus system, and designs different StringField implementations for Segments according to the data source (Streaming or Historical).
For Streaming Segments, it introduces the definition of StreamingStringField.
For Historical Segments, it introduces two implementations of HistoricalStringField:
HistoricalStringField1 -- based on std::vector and std::unordered_map
HistoricalStringField2 -- based on marisa trie and std::vector

It also introduces the processing logic of StringField on the different worker nodes.

Motivation

The data types currently supported by Milvus do not include the String type. Judging from the feedback in the issue list above, String support is expected by many users, and one of the most urgent requirements is support for primary keys of a custom String type. Take an image search application as an example: users need to uniquely identify each picture with a string value, such as the picture's name or its MD5 hash. Since Milvus does not support the string data type, users have to maintain an additional int64-to-string mapping externally, which reduces efficiency and increases the maintenance cost of the entire application.

In addition to vectors, Milvus 2.0 supports data types such as Booleans, integers, and floating-point numbers. A collection in Milvus can hold multiple fields to accommodate different data features or properties. Milvus pairs scalar filtering with powerful vector similarity search to offer a modern, flexible platform for analyzing unstructured data; naturally, scalar filtering should also support attributes of type String.

Public Interfaces

When users create a Collection, they can specify a String-type Field in the Schema. A String-type Field can also be designated as the primary field.

In the system design, a string field holds variable-length character strings, but a fixed size limit, such as 64 KB or 256 KB, is imposed on each string. If the storage size of a string exceeds the limit, the insertion fails.

Users can retrieve previously inserted string-type fields through the search/query interfaces, and can add scalar filtering operations on string-type Fields in search/query. The supported filtering operators are: "==", "!=", "<", "<=", ">", ">=".

A piece of sample code is as follows:


Code Block
languagepy
>>> from pymilvus_orm import connections, Collection, FieldSchema, CollectionSchema, DataType
>>> import random
>>> schema = CollectionSchema([
...     FieldSchema("film_name", DataType.String, is_primary=True),
...     FieldSchema("films", dtype=DataType.FLOAT_VECTOR, dim=2),
... ])
>>> collection = Collection("film_collection", schema)
>>> # insert
>>> data = [
...     ["film_%d" % i for i in range(10)],
...     [[random.random() for _ in range(2)] for _ in range(10)],
... ]
>>> collection.insert(data)
>>> # search with a string filtering expression
>>> res = collection.search(data=[[1.0, 1.0]],
...                         anns_field="films",
...                         param={"metric_type": "L2"},
...                         limit=2,
...                         expr="film_name != 'film_1'")

Design Details

Before introducing the design, let's briefly review the data flow of the Milvus 2.0 system.

DataFlow And DataModel

In Milvus 2.0, MessageStorage is the backbone of the entire system. Milvus 2.0 implements a unified Lambda architecture, which integrates the processing of incremental and historical data. Milvus 2.0 also introduces log backfill, which stores log snapshots and indexes in object storage to improve failure-recovery efficiency and query performance.

[Figure: Milvus 2.0 data flow]

Incremental data flows into MessageStorage through the Access Layer (Proxy), and nodes such as QueryNode and DataNode consume data from MessageStorage. DataNode persists the incremental data to ObjectStorage in units of Segments, forming historical data. The ObjectStorage layer mainly stores historical data, including log snapshots, DeltaFiles, and IndexFiles. QueryNode can also load historical data from ObjectStorage. IndexNode reads historical data from ObjectStorage, builds an index, and writes the index file back to ObjectStorage.

The data model of Milvus 2.0 mainly includes Collection, Partition, and Segment. A Collection can be logically divided into multiple Partitions, for example by date. Physically, a Collection is composed of multiple Segments. A Segment contains multiple Entities, and an Entity is equivalent to a row in a traditional database. An Entity contains multiple Field values; a Field is equivalent to a column in a traditional database. These Fields include those specified by the Schema when the user creates the Collection, as well as some hidden Fields added by the system, such as timestamps.

...

Code Block
languagego
type StringField interface {
    // extract returns the strings located at the given segment offsets.
    extract(segmentOffsets []int32) []string

    serialize() []byte

    deserialize([]byte)
}

// Filter evaluates a filtering expression against a StringField and returns
// the matching segment offsets.
func Filter(expression string, field StringField) (segmentOffsets []int32)

...

Code Block
languagecpp
class HistoricalStringField1 {
 public:
    // Deduplicated strings of the segment.
    std::vector<std::string> strs;

    // Maps a string offset in strs to every segment offset holding that string.
    std::unordered_map<int32_t, std::vector<int32_t>> strOffsetToSegOffsets;

    // Maps a segment offset to the offset of its string in strs.
    std::vector<int32_t> segOffsetToStrOffset;

    std::vector<std::string>
    extract(const std::vector<int32_t>& segmentOffsets);

    std::vector<Blob>
    serialize();

    void
    deserialize(const std::vector<Blob>&);
};

class Blob {
    std::vector<char> blob_;
};
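
As an illustration, extract under this layout could look like the following sketch (an assumption for clarity, not necessarily the final implementation):

Code Block
languagecpp
// Sketch: resolve each segment offset to its string offset, then fetch
// the deduplicated string.
std::vector<std::string>
HistoricalStringField1::extract(const std::vector<int32_t>& segmentOffsets) {
    std::vector<std::string> result;
    result.reserve(segmentOffsets.size());
    for (int32_t segOffset : segmentOffsets) {
        result.push_back(strs[segOffsetToStrOffset[segOffset]]);
    }
    return result;
}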

...

When the index file exists, QueryNode loads the index file from object storage, calls the deserialize method of StringField, and generates a StringField object.

Another Implementation of StringField of Historical Segment

Tries (also known as radix trees or prefix trees) are tree-based data structures that are typically used to store associative arrays where the keys are usually strings.

For one, nodes in the tree do not store keys. Instead, they each store parts of keys. Traversing down from the root node to a leaf allows you to build the key as you progress. Also, there doesn't need to be a value at every node. In fact, values are typically only associated with leaf nodes.

For example, below is a representation of a trie that contains the following associative array.

map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }

[Figure: trie containing the associative array above]


The word "ape" is built by traversing the "a" edge, the "p" edge, and the "e" edge. When we reach a leaf, we can extract the value, which is 32.
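
To make this concrete, here is a toy C++ sketch of such a trie, with one character per edge and values stored only at key-terminating nodes. It is purely illustrative and unrelated to the StringField implementations proposed below.

Code Block
languagecpp
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>

// A node stores no key itself, only outgoing single-character edges;
// a value is present only where a key ends.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::optional<int32_t> value;
};

void Insert(TrieNode& root, const std::string& key, int32_t value) {
    TrieNode* node = &root;
    for (char c : key) {
        auto& child = node->children[c];
        if (!child) {
            child = std::make_unique<TrieNode>();
        }
        node = child.get();
    }
    node->value = value;  // e.g. Insert(root, "ape", 32)
}

std::optional<int32_t> Lookup(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        auto it = node->children.find(c);
        if (it == node->children.end()) {
            return std::nullopt;  // no edge for this character
        }
        node = it->second.get();
    }
    return node->value;  // Lookup(root, "ape") == 32
}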

Tries have many advantages over their counterpart, the hash table. They are used in many string search applications such as auto-complete, text search, and prefix matching. Radix trees, a kind of trie, are often used in IP routing.

There are many variant implementations of tries. The main considerations when choosing an implementation are whether it supports reverse lookup and how much memory it saves. Reverse lookup means that we can recover the original string from the trie through a certain ID.

In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "lookup time" was measured on a lookup for one word.

| Name | Description | Memory usage (MB) | Unicode | Lookup time (ns) | Reverse lookup |
|---|---|---|---|---|---|
| PAT-Trie | Pointer-based implementation of a PAT-trie (aka Patricia-trie and radix-trie); may use a lot of memory because of that | 242 | No | 333 | No |
| HAT-Trie | Trie/HashMap hybrid; claimed to be the state-of-the-art trie-like structure with the fastest lookups | 125 | Yes | 195 | No |
| DA-Trie | Double-array trie C implementation | 101 | Yes | 281 | No |
| Marisa-Trie | Memory-efficient recursive LOUDS-trie-based data structure implemented as a C++ library | 11 | Yes | 2010 | Yes |
| DAWG | Directed acyclic word graph | 2.8 | Yes | 249 | No |
| Python-Dict | / | 600 | | | |
| Python-List | / | 300 | | | |

MARISA-trie is a very memory-efficient recursive trie-based data structure implemented as a C++ library (repo). String data in a MARISA-trie may take up to 50x-100x less memory than in a standard Python Dict; the raw lookup speed is comparable, and the trie also provides fast advanced methods like prefix search. MARISA-trie also supports memory-mapped IO, so it is possible to have an on-disk trie and reduce memory usage even further.

The following gives another C++ definition of HistoricalStringField.


Code Block
languagecpp
using StrID = int32_t;

class HistoricalStringField2 {
 public:
    // marisa trie holding the strings; each inserted key gets a unique StrID.
    MarisaTrie trie_;

    // Maps a segment offset to the StrID of the string at that offset.
    std::vector<StrID> segOffsetToStrID;

    std::vector<std::string>
    extract(const std::vector<int32_t>& segmentOffsets);

    std::vector<Blob>
    serialize();

    void
    deserialize(const std::vector<Blob>&);
};

class Blob {
    std::vector<char> blob_;
};


The marisa trie returns a unique StrID for each inserted string, and we can get the original string back through its StrID. The marisa trie can also associate a value with each key at insertion time; here we set the value to be an array of segment offsets. In addition, we need to maintain the mapping in the other direction, from segment offset to StrID.

Thus, operations ("==", "!=") are transformed into lookups on the trie to get the corresponding segment offsets.

Operations ("<=", "<", ">", ">=") require modifying the underlying implementation of marisa to support range queries.

For the extract interface, we only need to retrieve the corresponding strings according to the segment offsets and segOffsetToStrID; once the StrID is obtained, the original string is retrieved through a reverse lookup on the trie.
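
As a minimal sketch of the forward and reverse lookups described above, using the marisa C++ library referenced at the end of this document (the keys are illustrative only):

Code Block
languagecpp
#include <iostream>
#include <string>
#include <marisa.h>

int main() {
    // Build a trie over a few illustrative keys.
    marisa::Keyset keyset;
    keyset.push_back("film_0");
    keyset.push_back("film_1");
    marisa::Trie trie;
    trie.build(keyset);

    // Forward lookup: string -> StrID assigned by the trie.
    marisa::Agent agent;
    agent.set_query("film_1");
    if (trie.lookup(agent)) {
        std::size_t str_id = agent.key().id();

        // Reverse lookup: StrID -> original string.
        marisa::Agent reverse;
        reverse.set_query(str_id);
        trie.reverse_lookup(reverse);
        std::cout << "StrID " << str_id << " -> "
                  << std::string(reverse.key().ptr(), reverse.key().length())
                  << std::endl;
    }
    return 0;
}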

An Implementation of StringField of Streaming Segment

The Streaming Segment mentioned here refers to the segment whose data keeps growing, also called the Growing Segment.

As the data keeps growing, it is convenient neither to keep the string array sorted nor to use a trie. One possible implementation, StreamingStringField, is defined as follows.


Code Block
languagecpp
class StreamingStringField {
 public:
    // Raw strings in insertion order, without deduplication or sorting;
    // a string's subscript in strs is exactly its segment offset.
    std::vector<std::string> strs;

    std::vector<std::string>
    extract(const std::vector<int32_t>& segmentOffsets);

    std::vector<Blob>
    serialize();

    void
    deserialize(const std::vector<Blob>&);
};

class Blob {
    std::vector<char> blob_;
};


The strs member contains the original strings without deduplication or sorting, and the corresponding string can be retrieved directly by using the segment offset as a subscript: string offsets are equivalent to segment offsets.

Operations ("==", "!=", "<", "<=", ">", ">=") are transformed into a brute-force search on strs to get the corresponding string offsets, which are also the segment offsets.
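
For example, a minimal sketch of evaluating an equality filter under this layout (FilterEqual is a hypothetical helper for illustration, not part of the proposed interface):

Code Block
languagecpp
// Hypothetical sketch: brute-force scan of strs for `field == value`,
// collecting the matching segment offsets.
std::vector<int32_t>
FilterEqual(const StreamingStringField& field, const std::string& value) {
    std::vector<int32_t> segmentOffsets;
    for (int32_t offset = 0;
         offset < static_cast<int32_t>(field.strs.size()); ++offset) {
        if (field.strs[offset] == value) {
            // The string offset is also the segment offset.
            segmentOffsets.push_back(offset);
        }
    }
    return segmentOffsets;
}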


IndexNode's processing of String Field

The basic unit that IndexNode reads and writes is a Field of a Segment; IndexNode builds an index file for a Field to speed up searches. An index can be built on either a vector field or a scalar field. When IndexNode receives an index request for a certain field of a Segment, it reads the original historical data from ObjectStorage, builds an index file for the Field, and writes the index file back to ObjectStorage.

For the String type Field, IndexNode needs to construct a HistoricalStringField object from the original data, and persist it in the object storage.

If the implementation of HistoricalStringField1 is adopted, what we need to persist are the strs member and the segOffsetToStrOffset member. It is easy to recover strOffsetToSegOffsets from segOffsetToStrOffset.
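
For instance, that recovery step could look like the following sketch (the function name is hypothetical):

Code Block
languagecpp
// Hypothetical sketch: rebuild strOffsetToSegOffsets by inverting the
// persisted segOffsetToStrOffset mapping after deserialization.
void
RebuildInvertedMapping(HistoricalStringField1& field) {
    field.strOffsetToSegOffsets.clear();
    for (int32_t segOffset = 0;
         segOffset < static_cast<int32_t>(field.segOffsetToStrOffset.size());
         ++segOffset) {
        int32_t strOffset = field.segOffsetToStrOffset[segOffset];
        field.strOffsetToSegOffsets[strOffset].push_back(segOffset);
    }
}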

If the implementation of HistoricalStringField2 is adopted, what we need to persist are the trie_ member and the segOffsetToStrID member. Marisa already implements serialization and deserialization methods.
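
For reference, marisa can persist a trie directly; a minimal sketch, with the local file name purely illustrative (in practice the bytes would be written to and read from ObjectStorage):

Code Block
languagecpp
#include <marisa.h>

void PersistAndRestore(const marisa::Trie& trie) {
    // Serialize the trie to a local file.
    trie.save("string_field.trie");

    // Restore it later, e.g. on QueryNode.
    marisa::Trie loaded;
    loaded.load("string_field.trie");
}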

Compatibility, Deprecation, and Migration Plan

This is a new feature. There is no need to consider compatibility issues, no need to deprecate old interfaces, and no need to provide data migration tools.

Test Plan

  • Unit tests

  • CI tests

References

https://en.wikipedia.org/wiki/Trie

https://brilliant.org/wiki/tries/

https://github.com/s-yata/marisa-trie