Current state: Accepted
...
To reduce network transmission and skip Pulsar management, the new interface will let users pass the paths of data files (JSON, NumPy, etc.) on MinIO/S3 storage, and the data nodes will read these files directly and parse them into segments. The internal logic of the process becomes:
1. The client calls import() to pass some file paths to the Milvus proxy node.
2. The proxy node passes the file paths to the data coordinator.
3. The data coordinator picks one or more data nodes (according to the file count) to parse the files. Each file can be parsed into one or more segments.
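The hand-off in step 3 can be pictured with a minimal sketch. It assumes a simple round-robin policy; the proposal only says nodes are picked "according to the file count", so the policy, the function name `assign_files`, and the node names are hypothetical illustrations, not Milvus internals.

```python
from collections import defaultdict


def assign_files(file_paths, data_nodes):
    """Spread file paths over data nodes in round-robin order.

    Hypothetical policy: the design only states that nodes are chosen
    according to the number of files.
    """
    if not data_nodes:
        raise ValueError("no data nodes available")
    assignment = defaultdict(list)
    for index, path in enumerate(file_paths):
        node = data_nodes[index % len(data_nodes)]
        assignment[node].append(path)
    return dict(assignment)


# Two files and two (hypothetical) data nodes: each node parses one file.
plan = assign_files(
    ["xxxx/file_1.json", "xxxx/file_2.json"],
    ["datanode-1", "datanode-2"],
)
```

With fewer files than nodes, some nodes simply receive no work, which matches the idea that one node is enough for a small import.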
Some points to consider:
- The JSON format is flexible. Ideally, the import API should parse users' JSON files without asking them to reformat the files according to a strict rule.
- Users can store scalar fields and vector fields in a JSON file, in either a row-based or a column-based layout. The import() API can support both.
...
The user has some JSON files that store data in the row-based format: file_1.json, file_2.json.
Code Block
{
  "table": {
    "rows": [
      {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
      {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
      {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]}
    ]
  }
}
The "options" could be:
Code Block
{
  "data_source": {
    "type": "Minio",
    "address": "localhost:9000",
    "accesskey_id": "minioadmin",
    "accesskey_secret": "minioadmin",
    "use_ssl": false,
    "bucket_name": "mybucket"
  },
  "external_data": {
    "target_collection": "TEST",
    "chunks": [
      {
        "files": [
          {
            "path": "xxxx/file_1.json",
            "type": "row_based",
            "fields_mapping": {
              "table.rows.id": "uid",
              "table.rows.year": "year",
              "table.rows.vector": "vector"
            }
          }
        ]
      },
      {
        "files": [
          {
            "path": "xxxx/file_2.json",
            "type": "row_based",
            "fields_mapping": {
              "table.rows.id": "uid",
              "table.rows.year": "year",
              "table.rows.vector": "vector"
            }
          }
        ]
      }
    ],
    "default_fields": {
      "age": 0
    }
  }
}
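To illustrate how a row-based "fields_mapping" might be applied, here is a minimal sketch. It assumes that the last segment of each dotted key (for example `id` in `table.rows.id`) names a key inside each row, and that the preceding segments locate the row list in the file. The function name `map_rows` is hypothetical and this is not the actual Milvus parser.

```python
def map_rows(document, fields_mapping, default_fields=None):
    """Rename row keys to collection field names and fill in defaults."""
    rows_out = None
    for source_path, target_field in fields_mapping.items():
        *prefix, key = source_path.split(".")
        node = document
        for part in prefix:  # walk down to the list of row objects
            node = node[part]
        if rows_out is None:
            # Start every output row from a copy of the default values.
            rows_out = [dict(default_fields or {}) for _ in node]
        elif len(node) != len(rows_out):
            raise ValueError("mapped paths point at row lists of different sizes")
        for out_row, src_row in zip(rows_out, node):
            out_row[target_field] = src_row[key]
    return rows_out or []


# Sample content shaped like the row-based file, rooted at "table"
# so that it matches the "table.rows.*" mapping keys.
document = {
    "table": {
        "rows": [
            {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
            {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
            {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]},
        ]
    }
}
mapping = {
    "table.rows.id": "uid",
    "table.rows.year": "year",
    "table.rows.vector": "vector",
}
rows = map_rows(document, mapping, default_fields={"age": 0})
```

Each resulting row carries the renamed fields plus the `age` default, which is how "default_fields" is assumed to behave for fields absent from the file.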
The user has some JSON files that store data in the column-based format: file_1.json, file_2.json.
Code Block
{
  "table": {
    "columns": {
      "id": [1, 2, 3],
      "year": [2021, 2022, 2023],
      "vector": [
        [1.0, 1.1, 1.2],
        [2.0, 2.1, 2.2],
        [3.0, 3.1, 3.2]
      ]
    }
  }
}
The "options" could be:
Code Block
{
  "data_source": {
    "type": "Minio",
    "address": "localhost:9000",
    "accesskey_id": "minioadmin",
    "accesskey_secret": "minioadmin",
    "use_ssl": false,
    "bucket_name": "mybucket"
  },
  "external_data": {
    "target_collection": "TEST",
    "chunks": [
      {
        "files": [
          {
            "path": "xxxx/file_1.json",
            "type": "column_based",
            "fields_mapping": {
              "table.columns.id": "uid",
              "table.columns.year": "year",
              "table.columns.vector": "vector"
            }
          }
        ]
      },
      {
        "files": [
          {
            "path": "xxxx/file_2.json",
            "type": "column_based",
            "fields_mapping": {
              "table.columns.id": "uid",
              "table.columns.year": "year",
              "table.columns.vector": "vector"
            }
          }
        ]
      }
    ],
    "default_fields": {
      "age": 0
    }
  }
}
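For the column-based layout, a mapping key can be read as the full path to one column. The sketch below assumes that "columns" is a JSON object keyed by column name and that all mapped columns must have the same length. The function name `map_columns` is hypothetical, and the code only illustrates the idea rather than the real parser.

```python
def map_columns(document, fields_mapping, default_fields=None):
    """Resolve each dotted path to a column, then zip the columns into rows."""
    columns = {}
    for source_path, target_field in fields_mapping.items():
        node = document
        for part in source_path.split("."):  # the whole path names one column
            node = node[part]
        columns[target_field] = node

    lengths = {len(column) for column in columns.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths: %s" % sorted(lengths))
    row_count = lengths.pop() if lengths else 0

    rows = []
    for index in range(row_count):
        row = dict(default_fields or {})  # defaults first, mapped values on top
        for field_name, column in columns.items():
            row[field_name] = column[index]
        rows.append(row)
    return rows


column_document = {
    "table": {
        "columns": {
            "id": [1, 2, 3],
            "year": [2021, 2022, 2023],
            "vector": [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]],
        }
    }
}
column_mapping = {
    "table.columns.id": "uid",
    "table.columns.year": "year",
    "table.columns.vector": "vector",
}
column_rows = map_columns(column_document, column_mapping, default_fields={"age": 0})
```

Checking the column lengths up front is a design choice of this sketch: a short column would otherwise surface as an index error halfway through the conversion.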
The user has a JSON file that stores data in the column-based format, file_1.json, and a NumPy file that stores vector data, file_2.npy.
Note: for hybrid format files, we only allow inputting a pair of files to reduce the complexity.
The file_1.json:
Code Block
{
  "table": {
    "columns": {
      "id": [1, 2, 3],
      "year": [2021, 2022, 2023],
      "age": [23, 34, 21]
    }
  }
}
The "options" could be:
Code Block
{
  "data_source": {
    "type": "Minio",
    "address": "localhost:9000",
    "accesskey_id": "minioadmin",
    "accesskey_secret": "minioadmin",
    "use_ssl": false,
    "bucket_name": "mybucket"
  },
  "external_data": {
    "target_collection": "TEST",
    "chunks": [
      {
        "files": [
          {
            "path": "xxxx/file_1.json",
            "type": "column_based",
            "fields_mapping": {
              "table.columns.id": "uid",
              "table.columns.year": "year",
              "table.columns.age": "age"
            }
          }
        ]
      },
      {
        "files": [
          {
            "path": "xxxx/file_2.npy",
            "type": "column_based",
            "fields_mapping": {
              "vector": "vector"
            }
          }
        ]
      }
    ]
  }
}
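The "pair of files" restriction for hybrid imports could be enforced with a small pre-check on the options. The sketch below assumes that the pair means exactly one .json file and one .npy file across all chunks, which is an interpretation of the note above rather than a stated rule. The function name `check_hybrid_pair` is hypothetical.

```python
import os


def check_hybrid_pair(options):
    """Return the file paths if they form one JSON + one NumPy pair."""
    paths = [
        file_entry["path"]
        for chunk in options["external_data"]["chunks"]
        for file_entry in chunk["files"]
    ]
    extensions = sorted(os.path.splitext(path)[1].lower() for path in paths)
    if extensions != [".json", ".npy"]:
        raise ValueError(
            "hybrid import expects one .json and one .npy file, got: %s" % paths
        )
    return paths


hybrid_options = {
    "external_data": {
        "target_collection": "TEST",
        "chunks": [
            {"files": [{"path": "xxxx/file_1.json", "type": "column_based"}]},
            {"files": [{"path": "xxxx/file_2.npy", "type": "column_based"}]},
        ],
    }
}
hybrid_paths = check_hybrid_pair(hybrid_options)
```

Because the check flattens all chunks before counting, it behaves the same whether the two files are listed in one chunk or in two.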
For other SDKs, the "options" parameter is not a JSON object. For the Java SDK, a declaration could be:
...