Current state: Accepted

...

To reduce network transmission and skip Pulsar management, the new interface will allow users to pass the paths of data files (JSON, NumPy, etc.) on MinIO/S3 storage and let the data nodes read these files directly and parse them into segments. The internal logic of the process becomes:

1. The client calls import() to pass some file paths to the Milvus proxy node.

2. The proxy node passes the file paths to the data coordinator.

3. The data coordinator picks one or more data nodes (according to the file count) to parse the files. Each file can be parsed into one or more segments.
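
The dispatch in step 3 can be sketched as follows. This is only an illustration, assuming a simple round-robin policy; the text says only that nodes are picked according to the file count, so the policy, the function name assign_files, and the node names are all hypothetical.

```python
from collections import defaultdict


def assign_files(file_paths, data_nodes):
    """Distribute file paths across data nodes in round-robin order.

    Hypothetical helper: the real data coordinator may use a different policy.
    """
    if not data_nodes:
        raise ValueError("at least one data node is required")
    assignment = defaultdict(list)
    for index, path in enumerate(file_paths):
        # Cycle through the nodes so each one receives a similar number of files.
        node = data_nodes[index % len(data_nodes)]
        assignment[node].append(path)
    return dict(assignment)


plan = assign_files(
    ["file_1.json", "file_2.json", "file_3.json"],
    ["datanode-1", "datanode-2"],
)
```

With a single file, only the first node receives work, which matches the "one or more data nodes" wording of step 3.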

Some points to consider:

The JSON format is flexible. Ideally, the import API should parse users' JSON files without requiring them to be reformatted according to a strict rule.
Users can store scalar fields and vector fields in a JSON file in either a row-based or a column-based layout. The import() API can support both.
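
The two layouts carry the same information, which is why one API can accept both. A minimal sketch of the conversion, assuming every row has the same set of fields (the helper names are hypothetical and not part of the proposal):

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn a dict of equally long column lists back into row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
    {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
]
columns = rows_to_columns(rows)
```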

...

The user has some JSON files that store data in the row-based format: file_1.json, file_2.json.

Code Block
{
  "table": {
    "rows": [
      {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
      {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
      {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]}
    ]
  }
}

The "options" could be:

Code Block

{
  "data_source": {
    "type": "Minio",
    "address": "localhost:9000",
    "accesskey_id": "minioadmin",
    "accesskey_secret": "minioadmin",
    "use_ssl": false,
    "bucket_name": "mybucket"
  },

  "external_data": {
    "target_collection": "TEST",
    "chunks": [
      {
        "files": [
          {
            "path": "xxxx/file_1.json",
            "type": "row_based",
            "fields_mapping": {
              "table.rows.id": "uid",
              "table.rows.year": "year",
              "table.rows.vector": "vector"
            }
          }
        ]
      },
      {
        "files": [
          {
            "path": "xxxx/file_2.json",
            "type": "row_based",
            "fields_mapping": {
              "table.rows.id": "uid",
              "table.rows.year": "year",
              "table.rows.vector": "vector"
            }
          }
        ]
      }
    ],
    "default_fields": {
      "age": 0
    }
  }
}
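
To illustrate how a data node might apply "fields_mapping" and "default_fields" to a row-based file, here is a minimal sketch. It assumes that each mapping key is a dotted path whose last segment is the field name inside a row and whose prefix locates the row list, and that all keys share the same prefix. The proposal does not spell out these semantics, so the helpers resolve and map_rows are hypothetical.

```python
import json


def resolve(document, dotted_path):
    """Follow a dotted path such as "table.rows" through nested dicts."""
    node = document
    for key in dotted_path.split("."):
        node = node[key]
    return node


def map_rows(document, fields_mapping, default_fields):
    """Rename row fields per fields_mapping and fill in default_fields."""
    mapped = None
    for source_path, target_field in fields_mapping.items():
        # "table.rows.id" -> row list at "table.rows", source field "id".
        container_path, source_field = source_path.rsplit(".", 1)
        rows = resolve(document, container_path)
        if mapped is None:
            # Start every output row from a copy of the defaults.
            mapped = [dict(default_fields) for _ in rows]
        for out_row, in_row in zip(mapped, rows):
            out_row[target_field] = in_row[source_field]
    return mapped or []


file_content = '{"table": {"rows": [{"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]}]}}'
result = map_rows(
    json.loads(file_content),
    {"table.rows.id": "uid", "table.rows.year": "year", "table.rows.vector": "vector"},
    {"age": 0},
)
```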

The user has some JSON files that store data in the column-based format: file_1.json, file_2.json.

Code Block
{
  "table": {
    "columns": {
      "id": [1, 2, 3],
      "year": [2021, 2022, 2023],
      "vector": [
        [1.0, 1.1, 1.2],
        [2.0, 2.1, 2.2],
        [3.0, 3.1, 3.2]
      ]
    }
  }
}

The "options" could be:

Code Block

{
  "data_source": {
    "type": "Minio",
    "address": "localhost:9000",
    "accesskey_id": "minioadmin",
    "accesskey_secret": "minioadmin",
    "use_ssl": false,
    "bucket_name": "mybucket"
  },

  "external_data": {
    "target_collection": "TEST",
    "files": [
      {
        "file": "file_1.json",
        "type": "column_based",
        "fields_mapping": {
          "table.columns.id": "uid",
          "table.columns.year": "year",
          "table.columns.vector": "embedding"
        }
      },
      {
        "file": "file_2.json",
        "type": "column_based",
        "fields_mapping": {
          "table.columns.id": "uid",
          "table.columns.year": "year",
          "table.columns.vector": "embedding"
        }
      }
    ],
    "default_fields": {
      "age": 0
    }
  }
}
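
For the column-based case, the same mapping can be applied one column at a time, with a consistency check that all columns have the same length. This is a sketch under the assumption that each mapping key is a dotted path pointing directly at a column list; map_columns is a hypothetical helper, not part of the proposed API.

```python
def map_columns(document, fields_mapping, default_fields):
    """Rename columns per fields_mapping and add constant default columns."""
    columns = {}
    for source_path, target_field in fields_mapping.items():
        node = document
        for key in source_path.split("."):
            node = node[key]
        columns[target_field] = list(node)

    # Every column must describe the same number of entities.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    row_count = lengths.pop() if lengths else 0

    # A default field becomes a column filled with the same value.
    for name, value in default_fields.items():
        columns.setdefault(name, [value] * row_count)
    return columns


document = {"table": {"columns": {"id": [1, 2, 3], "year": [2021, 2022, 2023]}}}
mapped = map_columns(
    document,
    {"table.columns.id": "uid", "table.columns.year": "year"},
    {"age": 0},
)
```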

The user has a JSON file that stores data in the column-based format, file_1.json, and a NumPy file that stores vector data, file_2.npy.
Note: for hybrid-format files, only a single pair of files may be passed, to reduce complexity.
The file_1.json:

...
