Current state: Accepted

...

To reduce network transmission and skip Pulsar management, the new interface will allow users to input the paths of data files (JSON, NumPy, etc.) on MinIO/S3 storage, and let the data nodes read these files directly and parse them into segments. The internal logic of the process becomes:

        1. client calls import() to pass some file paths to Milvus proxy node  

        2. proxy node passes the file paths to data coordinator

        3. data coordinator picks one or more data nodes (according to the file count) to parse the files; each file can be parsed into one or multiple segments.


Some points to consider:

  • The JSON format is flexible. Ideally, the import API ought to parse users' JSON files without asking users to reformat the files according to a strict rule.
  • Users can store scalar fields and vector fields in a JSON file, in either a row-based or a column-based layout. The import() API can support both.
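
To make the row-based and column-based distinction concrete, here is a minimal Python sketch that converts both layouts into the same list of entities. The top-level key names `rows` and `columns` and the helper names are assumptions made for illustration; the proposal does not fix a file layout here.

```python
def rows_from_row_based(doc):
    """Row-based layout: a list of objects, one object per entity."""
    return [dict(row) for row in doc["rows"]]


def rows_from_column_based(doc):
    """Column-based layout: one list of values per field, all of equal length."""
    columns = doc["columns"]
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must contain the same number of values")
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hypothetical sample documents (key names are assumptions, not a Milvus format).
row_doc = {"rows": [
    {"id": 1, "year": 2020, "vector": [0.1, 0.2]},
    {"id": 2, "year": 2021, "vector": [0.3, 0.4]},
]}
col_doc = {"columns": {
    "id": [1, 2],
    "year": [2020, 2021],
    "vector": [[0.1, 0.2], [0.3, 0.4]],
}}

# Both layouts describe the same two entities.
assert rows_from_row_based(row_doc) == rows_from_column_based(col_doc)
```

A column-based file keeps each field contiguous, which suits vector data, while a row-based file is easier to produce from record-oriented sources.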

...

Code Block
{
	"data_source": {			          // required
		"type": "Minio",		          // required
		"address": "localhost:9000",	  // optional, milvus server will use its minio setting if without this value
		"accesskey_id": "minioadmin",	  // optional, milvus server will use its minio setting if without this value
		"accesskey_secret": "minioadmin", // optional, milvus server will use its minio setting if without this value
		"use_ssl": false,		          // optional, milvus server will use its minio setting if without this value
		"bucket_name": "aaa"		      // optional, milvus server will use its minio setting if without this value
	},

	"internal_data": {		     // optional, external_data or internal_data. (external files include json, npy, etc. internal files are exported by milvus)
		"path": "xxx/xxx/xx",	 // relative path to the source storage where store the exported data
		"collections_mapping": { // optional, give a new name to collection during importing.
			"aaa": "bbb",	     // field name mapping, key is the source field name, value is the target field name
			"ccc": "ddd"
		}
	},

	"external_data": {			 // optional, external_data or internal_data. (external files include json, npy, etc. internal files are exported by milvus)
		"target_collection": "xxx",	                 // target collection name
		"chunks": [{ // required, chunk list, each chunk can be imported as one segment or split into multiple segments
				"files": [{ // required, files that provide data of a chunk
					"path": "xxxx / xx.json", // required, relative path under the storage source defined by DataSource, currently support json/npy
					"type": "row_based", // required, row_based or column_based, tell milvus how to parse this json file
					"fields_mapping": { // optional, specify the target fields which should be imported. Milvus will import all fields if this list is empty.
						// field name mapping, tell milvus how to insert data to correct field, key is a json node path, value is a field name of the collection
						"table.rows.id": "uid",
						"table.rows.year": "year",
						"table.rows.vector": "vector"
					}
				}]
			},
			{
				"files": [{
					"path": "xxxx / xx.json",
					"type": "column_based",
					"fields_mapping": {
						"table.columns.id": "uid",
						"table.columns.year": "year",
						"table.columns.vector": "vector"
					}
				}]
			},
			{
				"files": [{
					"path": "xxxx / xx.npy",
					"type": "column_based",
					"fields_mapping": {
						"vector": "vector"
					}
				}]
			}
		],
		"default_fields": { // optional, use default value to fill some fields
			"age": 0,
			"weight": 0.0
		}
	}
}
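
As an illustration of the structure above, the following Python sketch checks an options object for the fields marked as required in the comments. The function name `validate_import_options`, the error messages, and the rule that at least one of `internal_data` / `external_data` must be present are assumptions for this sketch, not part of the Milvus API.

```python
import json

ALLOWED_TYPES = {"row_based", "column_based"}


def validate_import_options(options):
    """Return a list of problems found in an options dict (empty if none)."""
    problems = []
    if "data_source" not in options:
        problems.append("data_source is required")
    elif "type" not in options["data_source"]:
        problems.append("data_source.type is required")
    # Assumption: at least one of the two data sections must be given.
    if "internal_data" not in options and "external_data" not in options:
        problems.append("one of internal_data or external_data is required")
    external = options.get("external_data")
    if external is not None:
        chunks = external.get("chunks")
        if not chunks:
            problems.append("external_data.chunks is required")
        for i, chunk in enumerate(chunks or []):
            files = chunk.get("files")
            if not files:
                problems.append(f"chunks[{i}].files is required")
            for j, entry in enumerate(files or []):
                if "path" not in entry:
                    problems.append(f"chunks[{i}].files[{j}].path is required")
                if entry.get("type") not in ALLOWED_TYPES:
                    problems.append(
                        f"chunks[{i}].files[{j}].type must be row_based or column_based"
                    )
    return problems


# Options mirroring one chunk of the example above (placeholder values kept).
options = {
    "data_source": {"type": "Minio"},
    "external_data": {
        "target_collection": "xxx",
        "chunks": [{
            "files": [{
                "path": "xxxx / xx.json",
                "type": "row_based",
                "fields_mapping": {"table.rows.id": "uid"},
            }]
        }],
    },
}

assert validate_import_options(options) == []
options_json = json.dumps(options)  # plain JSON text, without the comments
```

Note that the comments in the example above are explanatory only: a standard JSON parser rejects them, so the text actually sent must be comment-free.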


Key fields of the JSON object:

...

Code Block
rpc Import(ImportRequest) returns (MutationResult) {}

message ImportRequest {
  common.MsgBase base = 1;
  string options = 2;
}


message MutationResult {
  common.Status status = 1;
  schema.IDs IDs = 2; // return auto-id for insert/import, deleted id for delete
  repeated uint32 succ_index = 3; // succeed indexes for insert
  repeated uint32 err_index = 4; // error indexes for insert
  bool acknowledged = 5;
  int64 insert_cnt = 6;
  int64 delete_cnt = 7;
  int64 upsert_cnt = 8;
  uint64 timestamp = 9;
}
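
Since `ImportRequest` carries its settings in a single `string options` field, the JSON object is presumably serialized to text on the client and decoded again on the server. The sketch below shows that round trip with a plain-Python stand-in for the message; the dataclass and helper names are hypothetical, and `base` is omitted for brevity.

```python
import json
from dataclasses import dataclass


@dataclass
class ImportRequest:
    """Plain-Python stand-in for the ImportRequest message (base omitted)."""
    options: str  # JSON text, mirroring `string options = 2`


def make_import_request(options: dict) -> ImportRequest:
    # Client side: serialize the options object to JSON text.
    return ImportRequest(options=json.dumps(options))


def read_options(request: ImportRequest) -> dict:
    # Server side: decode the JSON text back into a dictionary.
    return json.loads(request.options)


request = make_import_request({
    "data_source": {"type": "Minio"},
    "external_data": {"target_collection": "xxx", "chunks": []},
})
assert read_options(request)["external_data"]["target_collection"] == "xxx"
```

Passing the options as one string keeps the RPC signature stable while the JSON layout evolves, at the cost of deferring validation until the string is parsed.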

...