Current state: "Under Discussion"

...

After the proxy receives the inserted data, create multiple Arrow Arrays by field instead of a RecordBatch.

PROBLEM: The primitive unit of serialized data in Arrow is the RecordBatch. Arrow does not provide an interface to serialize a standalone Arrow Array.
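
A minimal pyarrow sketch of this limitation (the `id` and `vector` fields and their values are hypothetical): the IPC writer only accepts RecordBatch-shaped data, so the per-field arrays have to be wrapped into a RecordBatch before they can be serialized at all.

```python
import pyarrow as pa

# Hypothetical per-field arrays built from an insert request
ids = pa.array([1, 2, 3], type=pa.int64())
vectors = pa.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
                   type=pa.list_(pa.float32()))

# There is no IPC call that serializes a bare Array; the arrays must first
# be assembled into a RecordBatch and written via the stream writer.
batch = pa.RecordBatch.from_arrays([ids, vectors], names=["id", "vector"])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue()  # Arrow Buffer ready to put on the wire
```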

...

PROBLEM: There seems to be no advantage compared with the current implementation.

...

To summarize, some limitations in the use of Arrow:

  1. Arrow data can only be serialized and deserialized in units of RecordBatch

...

2. RecordBatch does not support copying data row by row (see the sketch after this list)

3. The RecordBatch must be re-created at the receiving end of Pulsar
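
A small pyarrow sketch of limitation 2 (column names and values are made up): slicing a RecordBatch yields a zero-copy columnar view rather than a row that can be appended elsewhere, and gathering specific rows into a new batch means materializing fresh arrays column by column.

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "tag"],
)

# slice() is a zero-copy *columnar* view; it is still a RecordBatch,
# not a row object that could be appended to another batch.
row_view = batch.slice(1, 1)

# Moving selected rows into a new batch requires take(), which copies
# the chosen values into brand-new arrays for every column.
indices = pa.array([2, 0])
picked = pa.RecordBatch.from_arrays(
    [col.take(indices) for col in batch.columns],
    schema=batch.schema,
)
```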

The same problems will be encountered in the query process:

1. The query results produced by segcore need to be reduced twice: once when the QueryNode merges the SearchResults of multiple segments, and again when the Proxy merges the query results of multiple QueryNodes. If the query results are in RecordBatch format, this reduce is inconvenient because data cannot be copied row by row (see the sketch after this list).

2. The QueryNode needs to send the SearchResult to the Proxy through Pulsar. After receiving the data, the Proxy has to rebuild the RecordBatch, which defeats Arrow's original zero-copy design intention.
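
A rough pyarrow sketch of why this reduce is awkward (the per-segment results and the top-k of 2 are invented for illustration): merging results means selecting rows by score across batches, and the selected rows are gathered into brand-new arrays instead of reusing the buffers that arrived from the segments.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Hypothetical top-k results from two segments, already columnar
seg_a = pa.RecordBatch.from_arrays([pa.array([1, 5]), pa.array([0.9, 0.7])],
                                   names=["id", "score"])
seg_b = pa.RecordBatch.from_arrays([pa.array([3, 8]), pa.array([0.8, 0.2])],
                                   names=["id", "score"])

# Reducing to a global top-k: sort by score and gather the winners.
merged = pa.Table.from_batches([seg_a, seg_b])
order = pc.sort_indices(merged, sort_keys=[("score", "descending")])
top_k = merged.take(order[:2])  # take() copies the rows into new arrays
```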

So I don't think Arrow is suitable for Milvus's application scenario.

An example of the problems we would run into if we used Arrow:

...

Splitting by row is based on 2 reasons:

...

1. Arrow data can only be serialized and deserialized in units of RecordBatch

...

  1. Serialization and deserialization work only in units of RecordBatch
  2. Row data cannot be copied out of a RecordBatch
  3. The RecordBatch must be regenerated after being sent via Pulsar
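
The sketch below (pyarrow, with a toy produce/consume pair standing in for Pulsar, all names hypothetical) illustrates points 1 and 3: every row-range split must itself be a RecordBatch before it can be serialized, and the consumer can only rebuild a brand-new RecordBatch from the received bytes.

```python
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array(list(range(6)))], names=["id"])

# Splitting by row (e.g. one piece per shard): each piece must again be
# a RecordBatch, because that is the only serializable unit.
pieces = [batch.slice(0, 3), batch.slice(3, 3)]

def produce(piece: pa.RecordBatch) -> bytes:
    # Stand-in for publishing one message to Pulsar
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, piece.schema) as writer:
        writer.write_batch(piece)
    return sink.getvalue().to_pybytes()

def consume(payload: bytes) -> pa.RecordBatch:
    # The receiving side cannot reuse the producer's buffers; it has to
    # rebuild a new RecordBatch from the bytes it got off the topic.
    return pa.ipc.open_stream(payload).read_next_batch()

rebuilt = [consume(produce(p)) for p in pieces]
```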


Arrow is suitable for data analysis scenarios (data is sealed and read-only).

In Milvus, we need to split and concatenate data. Arrow is not a good choice for Milvus.
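
As a small illustration of that mismatch (pyarrow, toy data): concatenation only stitches existing chunks together, and producing one contiguous, sealed table afterwards still costs a full copy of the data.

```python
import pyarrow as pa

t1 = pa.table({"id": [1, 2]})
t2 = pa.table({"id": [3, 4]})

# Cheap: concat just collects the existing chunks side by side.
combined = pa.concat_tables([t1, t2])

# Expensive: flattening back into one contiguous buffer copies everything.
flat = combined.combine_chunks()
```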


Design Details (required)

We divide this MEP into 2 stages. All compatibility changes will be completed in Stage 1 before Milvus 2.0.0; other internal changes can be left for later.

...