Current state: Accepted

...

To reduce network transmission and skip Pulsar management, the new interface will allow users to input the paths of data files (JSON, NumPy, etc.) on MinIO/S3 storage, and let the data nodes read these files directly and parse them into segments. The internal logic of the process becomes:

        1. client calls import() to pass some file paths to Milvus proxy node  

        2. proxy node passes the file paths to data coordinator

        3. data coordinator picks one or more data nodes (according to the number of files) to parse the files; each file can be parsed into one or more segments.
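
Step 3 can be illustrated with a minimal sketch. The round-robin policy, the function name assign_files, the node names, and the third file path are all assumptions made for illustration; this proposal does not specify how the data coordinator actually distributes files.

```python
from collections import defaultdict


def assign_files(file_paths, data_nodes):
    """Spread file paths over data nodes in round-robin order.

    Hypothetical helper: round-robin is an assumed policy, not the
    documented behaviour of the data coordinator.
    """
    if not data_nodes:
        raise ValueError("at least one data node is required")
    assignment = defaultdict(list)
    for index, path in enumerate(file_paths):
        node = data_nodes[index % len(data_nodes)]
        assignment[node].append(path)
    return dict(assignment)


# Illustrative inputs only: the node names are made up.
plan = assign_files(
    ["xxxx/file_1.json", "xxxx/file_2.json", "xxxx/file_3.json"],
    ["datanode-1", "datanode-2"],
)
for node, files in plan.items():
    print(node, files)
```

With fewer files than nodes, only some nodes receive work, which matches the idea that the number of nodes involved depends on the number of files.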


Some points to consider:

  • JSON format is flexible. Ideally, the import() API ought to parse the user's JSON files without asking the user to reformat them according to a strict rule.
  • Users can store scalar fields and vector fields in a JSON file, in either row-based or column-based format. The import() API can support both.
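
The two layouts carry the same information, which the following sketch makes concrete by converting between them. The helper names rows_to_columns and columns_to_rows are hypothetical and are not part of the proposed API; the sample values are the ones used in the use cases of this document.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for index, row in enumerate(rows):
        # Every row must carry the same fields, otherwise columns misalign.
        if index > 0 and row.keys() != columns.keys():
            raise ValueError(f"row {index} has different fields than row 0")
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns):
    """Turn a dict of equally long column lists into a list of row dicts."""
    names = list(columns)
    if len({len(columns[name]) for name in names}) > 1:
        raise ValueError("all columns must have the same number of values")
    return [
        dict(zip(names, values))
        for values in zip(*(columns[name] for name in names))
    ]


rows = [
    {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
    {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
    {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]},
]
columns = rows_to_columns(rows)
print(columns["id"], columns["year"])
```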

...

  1. The user has some JSON files that store data in the row-based format: file_1.json, file_2.json.

    Code Block
    {
      "table": {
        "rows": [
          {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
          {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
          {"id": 3, "year": 2023, "vector": [3.0, 3.1, 3.2]}
        ]
      }
    }

    The "options" could be:

    Code Block
    {
    	"data_source": {
    		"type": "Minio",
    		"address": "localhost:9000",
    		"accesskey_id": "minioadmin",
    		"accesskey_secret": "minioadmin",
    		"use_ssl": false,
    		"bucket_name": "mybucket"
    	},
    
    	"external_data": {
    		"target_collection": "TEST",
    		"chunks": [{
    				"files": [{
    					"path": "xxxx/file_1.json",
    					"type": "row_based",
    					"fields_mapping": {
    						"table.rows.id": "uid",
    						"table.rows.year": "year",
    						"table.rows.vector": "vector"
    					}
    				}]
    			},
    			{
    				"files": [{
    					"path": "xxxx/file_2.json",
    					"type": "row_based",
    					"fields_mapping": {
    						"table.rows.id": "uid",
    						"table.rows.year": "year",
    						"table.rows.vector": "vector"
    					}
    				}]
    			}
    		],
    		"default_fields": {
    			"age": 0
    		}
    	}
    }


  2. The user has some JSON files that store data in the column-based format: file_1.json, file_2.json.

    Code Block
    {
      "table": {
        "columns": {
          "id": [1, 2, 3],
          "year": [2021, 2022, 2023],
          "vector": [
            [1.0, 1.1, 1.2],
            [2.0, 2.1, 2.2],
            [3.0, 3.1, 3.2]
          ]
        }
      }
    }

    The "options" could be:

    Code Block
    {
    	"data_source": {
    		"type": "Minio",
    		"address": "localhost:9000",
    		"accesskey_id": "minioadmin",
    		"accesskey_secret": "minioadmin",
    		"use_ssl": false,
    		"bucket_name": "mybucket"
    	},

    	"external_data": {
    		"target_collection": "TEST",
    		"chunks": [{
    				"files": [{
    					"path": "xxxx/file_1.json",
    					"type": "column_based",
    					"fields_mapping": {
    						"table.columns.id": "uid",
    						"table.columns.year": "year",
    						"table.columns.vector": "vector"
    					}
    				}]
    			},
    			{
    				"files": [{
    					"path": "xxxx/file_2.json",
    					"type": "column_based",
    					"fields_mapping": {
    						"table.columns.id": "uid",
    						"table.columns.year": "year",
    						"table.columns.vector": "vector"
    					}
    				}]
    			}
    		],
    		"default_fields": {
    			"age": 0
    		}
    	}
    }


  3. The user has a JSON file that stores data in the column-based format: file_1.json, and a NumPy file that stores vector data: file_2.npy
    Note: for hybrid format files, we only allow inputting a pair of files to reduce the complexity.
    The file_1.json:
    Code Block
    {
      "table": {
        "columns": {
          "id": [1, 2, 3],
          "year": [2021, 2022, 2023],
          "age": [23, 34, 21]
        }
      }
    }

    The "options" could be:

    Code Block
    {
    	"data_source": {
    		"type": "Minio",
    		"address": "localhost:9000",
    		"accesskey_id": "minioadmin",
    		"accesskey_secret": "minioadmin",
    		"use_ssl": false,
    		"bucket_name": "mybucket"
    	},

    	"external_data": {
    		"target_collection": "TEST",
    		"chunks": [{
    				"files": [{
    					"path": "xxxx/file_1.json",
    					"type": "column_based",
    					"fields_mapping": {
    						"table.columns.id": "uid",
    						"table.columns.year": "year",
    						"table.columns.age": "age"
    					}
    				}]
    			},
    			{
    				"files": [{
    					"path": "xxxx/file_2.npy",
    					"type": "column_based",
    					"fields_mapping": {
    						"vector": "vector"
    					}
    				}]
    			}
    		]
    	}
    }
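
To make the role of "fields_mapping" in the JSON use cases concrete, here is a minimal sketch of how a data node might resolve the mapping against a parsed JSON file. It rests on assumptions that this proposal does not spell out: the last segment of each dotted path is taken as the field name, the preceding segments locate the "rows" list or "columns" object, and "default_fields" is assumed to fill in collection fields that the files do not provide. The names extract_columns and fill_defaults are hypothetical.

```python
def _walk(doc, keys):
    """Follow a sequence of keys into a nested dict."""
    node = doc
    for key in keys:
        node = node[key]
    return node


def extract_columns(doc, file_type, fields_mapping):
    """Return {collection_field: values} for one parsed JSON file.

    Assumed path semantics: "table.rows.id" means field "id" of every
    row under doc["table"]["rows"]; "table.columns.id" means the list
    stored at doc["table"]["columns"]["id"].
    """
    columns = {}
    for source_path, target_field in fields_mapping.items():
        *prefix, leaf = source_path.split(".")
        if file_type == "row_based":
            columns[target_field] = [row[leaf] for row in _walk(doc, prefix)]
        elif file_type == "column_based":
            columns[target_field] = list(_walk(doc, prefix)[leaf])
        else:
            raise ValueError(f"unsupported file type: {file_type}")
    return columns


def fill_defaults(columns, default_fields):
    """Add a constant column for each default field that is missing (assumed meaning)."""
    row_count = len(next(iter(columns.values()), []))
    for name, value in default_fields.items():
        columns.setdefault(name, [value] * row_count)
    return columns


row_doc = {"table": {"rows": [
    {"id": 1, "year": 2021, "vector": [1.0, 1.1, 1.2]},
    {"id": 2, "year": 2022, "vector": [2.0, 2.1, 2.2]},
]}}
col_doc = {"table": {"columns": {
    "id": [1, 2],
    "year": [2021, 2022],
    "vector": [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]],
}}}

row_cols = extract_columns(row_doc, "row_based", {
    "table.rows.id": "uid",
    "table.rows.year": "year",
    "table.rows.vector": "vector",
})
col_cols = extract_columns(col_doc, "column_based", {
    "table.columns.id": "uid",
    "table.columns.year": "year",
    "table.columns.vector": "vector",
})
with_defaults = fill_defaults(dict(row_cols), {"age": 0})
print(sorted(with_defaults))
```

Under these assumptions both layouts produce the same per-field columns, so everything after the parsing step can be shared between row-based and column-based files.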


For other SDKs, the "options" is not a JSON object. For the Java SDK, a declaration could be:

...