...
For Histrical Segment, two implementations of HistoricalStringField are introduced.
HistoricalStringField1 -- base on std::vectors and unorder_map
HistoricalStringField2 -- base on marisa trie and std::vector
we also Introduce the processing logic of StringField on different working nodes.
Motivation
The data types currently supported by Milvus do not include the String type. According to the feedback in the previous issue list, the support of the String data type is expected by many users. One of the most urgent requirements of the String type is to support the primary key of the custom String type. Take the image search application as an example. In actual needs, users need to uniquely identify a picture with a string type value. The string type value can be the name of the picture, the md5 value of the picture, and so on. Since Milvus does not support the string data type, users need to make an additional int64 to string mapping externally, which reduces the efficiency and increases the maintenance cost of the entire application.
...
Code Block | ||
---|---|---|
| ||
type StringField inteface { extract(segmentOffsets []int32) []string serialize() []bytes deserialize([]bytes) } func Filter(expression string, field StringField) sgementOffsets []int32 |
...
Code Block | ||
---|---|---|
| ||
class HistoricalStringField1 { std::vector<string> strs; std::unordered_map<int32, std::vector<int32>> strOffsetToSegOffsets; std::vector<int32> segOffsetToStrOffset; std::vector<string> extract(const std::vector<int32>& segmentOffsets); std::vector<Blob> serialize(); void deserialize(const std::vector<Blob>&) } class Blob { std::vector<char> blob_; } |
...
map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }
...
In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "simple lookup time " was measured on a lookup for a specific one word.
Description | memory usage (MegaBytes) | Unicode | Lookup time (nanosecond) | Reverse-lookup | |
---|---|---|---|---|---|
PAT-Trie | a pointer-based implementation of PAT-trie (aka Patricia-Trie and Radix-Trie) and may use a lot of memory because of that. | 242 | No | 333 | No |
HAT-Trie | Trie-HashMap hybrid. It is claimed to be the state-of-art Trie-like structure with the fastest lookups. | 125 | Yes | 195 | No |
DA-Trie | Double-Array Trie C implementation | 101 | Yes | 281 | No |
Marisa-Trie | memory-efficient recursive LOUDS-trie-based data structure implemented as C++ library | 11 | Yes | 2010 | Yes |
DAWG | Directed Acyclic Word Graphs | 2.8 | Yes | 249 | No |
Python-Dict | / | 600 | |||
Python-list | / | 300 |
...
Code Block | ||
---|---|---|
| ||
class StreamingStringField { std::vector<string> strs; std::vector<string> extract(const std::vector<int32>& segmentOffsets); std::vector<Blob> serialize(); void deserialize(const std::vector<Blob>&) } class Blob { std::vector<char> blob_; } |
...