Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Current state: "Under Discussion"

ISSUE: #1924 #3199 #4201 #4430 #4810 #5603

PRs:

Keywords: string tries hybrid search

Released:

Summary

Current state: "Under Discussion"

ISSUE: #1924 #3199 #4201 #4430 #4810 #5603

PRs:

Keywords: string tries hybrid search

Released:

Summary

This article introduces the importance of supporting the String data type. Review the data flow and data model of the Milvus system. Design different StringField implementations for Segments according to different data sources (Streaming or Historical). 
For Streaming Segment, introduce the definition of StreamStringField.
For Histrical Segment, two implementations of HistoricalStringField are introduced.
HistoricalStringField1 -- base on std::vectors and unorder_map
HistoricalStringField2 -- base on marisa trie and std::vector

we also Introduce the processing logic of StringField on different working nodes.

Motivation

The data types currently supported by Milvus do not include the String type. According to the feedback in the previous issue list, the support of the String data type is expected by many users. One of the most urgent requirements of the String type is to support the primary key of the custom String type. Take the image search application as an example. In actual needs, users need to uniquely identify a picture with a string type value. The string type value can be the name of the picture, the md5 value of the picture, and so on. Since Milvus does not support the string data type, users need to make an additional int64 to string mapping externally, which reduces the efficiency and increases the maintenance cost of the entire application.

...

Code Block
languagecpp
type StringField inteface { 
extract(segmentOffsets []int32) []string

serialize() []bytes

deserialize([]bytes)

}
func Filter(expression string, field StringField) sgementOffsets []int32

...

Code Block
languagecpp
class HistoricalStringField1 {
std::vector<string> strs;

std::unordered_map<int32, std::vector<int32>> strOffsetToSegOffsets;

std::vector<int32> segOffsetToStrOffset;

std::vector<string> 
extract(const std::vector<int32>& segmentOffsets);

std::vector<Blob>
serialize();

void
deserialize(const std::vector<Blob>&)

}

class Blob {
std::vector<char> blob_;
}

...

map = { 'ape': 32, 'ball': 2, 'atom': 16, 'ate': 18, 'bait': 5 }


The word "ape" is built by traversing the "a" edge, the "p" edge, and the "e" edge. When we reach a leaf, we can extract the value, which is 32.

...

In the following tests, the memory usage was measured for 3 million unique Unicode Russian words; "simple lookup time " was measured on a lookup for a specific one word.



Descriptionmemory usage (MegaBytes)UnicodeLookup time (nanosecond)Reverse-lookup
PAT-Triea pointer-based implementation of PAT-trie (aka Patricia-Trie and Radix-Trie) and may use a lot of memory because of that.242No333No
HAT-TrieTrie-HashMap hybrid. It is claimed to be the state-of-art Trie-like structure with the fastest lookups.125Yes195No
DA-TrieDouble-Array Trie C implementation101Yes281No
Marisa-Triememory-efficient recursive LOUDS-trie-based data structure implemented as C++ library11Yes2010Yes
DAWGDirected Acyclic Word Graphs2.8Yes249No
Python-Dict/600


Python-list/300


...

Code Block
languagecpp
class StreamingStringField {

std::vector<string> strs;

std::vector<string> 
extract(const std::vector<int32>& segmentOffsets);

std::vector<Blob>
serialize();

void
deserialize(const std::vector<Blob>&)

}
class Blob {
std::vector<char> blob_;
}

...