...
Typically, it takes several hours to insert one billion entities with 128-dimensional vectors. We need a new interface to do bulk load for the following purposes:
- import data from json format files. (first stage)
- import data from numpy format files. (first stage)
- copy a collection from one Milvus 2.0 server to another. (second stage)
...
- import data from Milvus 1.x to Milvus 2.0 (third stage)
- import data from parquet/faiss files (TBD)
Some points to consider:
- JSON is a flexible format; ideally, the import API ought to parse users' JSON files without asking them to reformat the files to follow a strict rule.
- Users can store scalar fields and vector fields in a JSON file, in either a row-based or a column-based layout; the import API ought to support both (see the parser sketch after the two examples below).
A row-based example:
```json
{
  "table": {
    "rows": [
      {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
      {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
      {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]}
    ]
  }
}
```
A column-based example:
```json
{
  "table": {
    "columns": {
      "id": [1, 2, 3],
      "year": [2021, 2022, 2023],
      "vector": [
        [1.0, 1.1, 1.2],
        [2.0, 2.1, 2.2],
        [3.0, 3.1, 3.2]
      ]
    }
  }
}
```
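To show that one parser can accept both layouts without forcing users to reformat their files, here is a minimal Python sketch; the `load_entities` helper and its normalization to per-field lists are illustrative assumptions, not the actual implementation:

```python
import json

def load_entities(path: str) -> dict:
    # Hypothetical helper: normalize either layout into {field_name: value_list}.
    with open(path) as f:
        table = json.load(f)["table"]

    if "rows" in table:
        # Row-based: pivot the list of row objects into per-field value lists.
        rows = table["rows"]
        return {name: [row[name] for row in rows] for name in rows[0]}

    if "columns" in table:
        # Column-based: the file already stores per-field value lists.
        return table["columns"]

    raise ValueError("expected 'rows' or 'columns' under 'table'")
```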
- A NumPy file is in binary format, and we treat it only as vector data. Each NumPy file represents one vector field.
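For example, a user could dump each vector field to its own .npy file, and the reading side can recover it in one call (the file name and shape below are illustrative):

```python
import numpy as np

# One .npy file per vector field: 1000 rows of a 128-dimensional float32 field.
vectors = np.random.random((1000, 128)).astype(np.float32)
np.save("vector.npy", vectors)   # file name is illustrative

# Reading it back recovers shape and dtype with a single call.
loaded = np.load("vector.npy")
assert loaded.shape == (1000, 128) and loaded.dtype == np.float32
```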
- Transferring a large file from the client to the proxy and then to the datanode is time-consuming and occupies too much network bandwidth, so we will ask users to upload data files to MinIO/S3, which the datanode can access directly. The datanode then reads and parses the files from MinIO/S3.
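A minimal sketch of the upload step with the MinIO Python client, assuming a local MinIO endpoint and illustrative bucket/object names:

```python
from minio import Minio

# Endpoint, credentials, bucket, and object names are illustrative assumptions.
client = Minio("127.0.0.1:9000",
               access_key="minioadmin",
               secret_key="minioadmin",
               secure=False)

if not client.bucket_exists("milvus-bulkload"):
    client.make_bucket("milvus-bulkload")

# Upload once; the datanode later reads these objects directly from MinIO/S3.
client.fput_object("milvus-bulkload", "demo/rows.json", "rows.json")
client.fput_object("milvus-bulkload", "demo/vector.npy", "vector.npy")
```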
- The parameters of the import API should be easy to extend in the future.
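One common way to keep the parameters extensible is to carry per-call settings as free-form key-value options rather than fixed positional arguments. The shape below is a hypothetical illustration only; the actual definitions follow under SDK Interfaces and RPC Interfaces:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ImportRequest:
    # Hypothetical request shape; the real message is defined by the RPC interface.
    collection_name: str
    files: List[str]                                        # object paths in MinIO/S3
    options: Dict[str, str] = field(default_factory=dict)   # free-form key-value pairs

# New behaviors can be added later as options without changing the signature.
req = ImportRequest(
    collection_name="table",
    files=["demo/rows.json", "demo/vector.npy"],
    options={"row_based": "true", "bucket": "milvus-bulkload"},
)
```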
SDK Interfaces
RPC Interfaces
...