MEP 16 -- Compaction
Current state: Under Discussion
ISSUE: https://github.com/milvus-io/milvus/issues/9530
PRs:
Keywords: datacoord, segment, compaction
Released:
Summary
Milvus needs a compaction mechanism to merge small segments and remove deleted rows to save disk space.
Motivation
There are many ways to generate small segments:
DataCoord automatically flushes a segment that has been open for a long time (e.g. 24 hours)
Users may call flush manually
Deleted rows should also be removed once they are no longer needed.
So we have two goals:
Merge small segments to improve query efficiency
Remove deleted rows to save disk space
Public Interfaces
We will add a compaction interface to the SDK so that users can start a compaction.
Design Details
Some preset conditions:
We do compaction at the channel & partition level, because segments are generated at the channel & partition level.
Delta logs and insert logs are kept at the segment level.
Delta logs and insert logs within the time-travel range must be retained.
A segment has a max size (limited by memory size).
We divide a compaction task into two phases.
We merge insert and delta logs in the first phase:
Considering time-travel, we only merge segments outside the time-travel range.
When to trigger a compaction:
After a flush
The time interval since the last compaction is greater than max_compaction_interval
Compaction is called manually
How to choose segments:
The size of all delta logs is bigger than max_delete_binlog_size
deleted rows / total rows >= compaction_delta_binlog_ratio
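The two selection rules above can be sketched as a small filter. This is only an illustration, not the DataCoord implementation: the SegmentStats type and its field names are hypothetical, and the sketch assumes that satisfying either rule is enough to select a segment, which the proposal does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Hypothetical per-segment statistics; field names are illustrative."""
    segment_id: int
    total_rows: int
    deleted_rows: int
    delta_log_bytes: int


def needs_delta_compaction(seg, max_delete_binlog_size, compaction_delta_binlog_ratio):
    """Return True if the segment meets either phase-one selection rule."""
    # Rule 1: the delta logs of the segment have grown too large.
    if seg.delta_log_bytes > max_delete_binlog_size:
        return True
    # An empty segment has no meaningful deleted-row ratio.
    if seg.total_rows == 0:
        return False
    # Rule 2: the fraction of deleted rows reaches the configured ratio.
    return seg.deleted_rows / seg.total_rows >= compaction_delta_binlog_ratio


def select_delta_compaction_candidates(segments, max_delete_binlog_size,
                                       compaction_delta_binlog_ratio):
    """Return the IDs of segments chosen for phase-one compaction."""
    return [
        s.segment_id
        for s in segments
        if needs_delta_compaction(s, max_delete_binlog_size,
                                  compaction_delta_binlog_ratio)
    ]
```

If the intended semantics is that both rules must hold, the two checks would be combined with a logical AND instead.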
We merge segments in the second phase:
The time-travel range may be very long, such as dozens of days, so small segments within the time-travel range still need to be merged.
When to trigger a compaction:
After a segment flush, if the number of segments smaller than 1/2 * max_segment_size at the channel & partition level exceeds compaction_segment_num_threshold
The time interval since the last compaction is greater than max_compaction_interval
Compaction is called manually
How to choose segments: greedy algorithm
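The proposal does not specify the greedy algorithm, so the following is one possible sketch under stated assumptions rather than the actual design. It assumes a first-fit-decreasing packing: small segments are sorted by size, and each is placed into the first merge group that still has room under max_segment_size. The function name and the dict-based input are hypothetical.

```python
def plan_segment_merges(segment_sizes, max_segment_size,
                        compaction_segment_num_threshold):
    """Group small flushed segments of one channel & partition into merge plans.

    segment_sizes maps a segment ID to its size (hypothetical input shape).
    Returns a list of groups; each group is a list of segment IDs to merge.
    """
    # Only segments smaller than half the max size count as "small".
    small = {sid: size for sid, size in segment_sizes.items()
             if size < max_segment_size / 2}
    # Trigger only when the number of small segments exceeds the threshold.
    if len(small) <= compaction_segment_num_threshold:
        return []

    # Assumed greedy strategy: first-fit decreasing by segment size.
    ordered = sorted(small.items(), key=lambda kv: kv[1], reverse=True)
    groups = []  # each entry: [remaining_capacity, [segment IDs]]
    for sid, size in ordered:
        for group in groups:
            if group[0] >= size:
                group[0] -= size
                group[1].append(sid)
                break
        else:
            # No existing group has room, so open a new one.
            groups.append([max_segment_size - size, [sid]])

    # A group with a single segment has nothing to merge with.
    return [ids for _, ids in groups if len(ids) > 1]
```

Packing the largest segments first tends to leave fewer half-empty groups, but other greedy orders (for example, oldest segment first) would satisfy the proposal equally well.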
Some details:
Only merge flushed segments
We choose the max DML position of the merged segments as the DML position of the newly generated segment.
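The DML position rule can be illustrated as follows. This is a simplified sketch: DmlPosition is a hypothetical type that orders positions by a single timestamp field, which is an assumption about how positions are compared and not something the proposal defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmlPosition:
    """Hypothetical DML position: a channel name plus an ordering timestamp."""
    channel: str
    timestamp: int


def merged_dml_position(positions):
    """Return the max DML position among the segments being merged."""
    if not positions:
        raise ValueError("at least one segment position is required")
    # Merged segments belong to one channel, so positions must be comparable.
    if len({p.channel for p in positions}) != 1:
        raise ValueError("all positions must belong to the same channel")
    return max(positions, key=lambda p: p.timestamp)
```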