MEP 16 -- Compaction
Current state: Under Discussion
ISSUE: https://github.com/milvus-io/milvus/issues/9530
PRs:
Keywords: datacoord, segment, compaction
Released:
Summary
Milvus needs a compaction mechanism to merge small segments and remove deleted rows to save disk space.
Motivation
There are many ways to generate small segments:
- DataCoord automatically flushes a segment that has been open for a long time (e.g. 24 hours)
- Users may call flush manually
Deleted rows should also be removed once they are no longer needed.
So compaction has two goals:
- Merge small segments to improve query efficiency
- Remove deleted rows to save disk space
Public Interfaces
We will add a compaction interface to the SDK that starts a compaction.
Design Details
Some preset conditions:
- We do compaction at the channel&partition level, because segments are generated at the channel&partition level.
- Delta logs and insert logs are kept at the segment level.
- Delta logs and insert logs within the time-travel range must be retained.
- A segment has a maximum size (limited by memory size).
We divide a compaction task into two phases.
We merge insert and delta logs in the first phase:
- Considering time-travel, we only merge segments outside the time-travel range.
- When to trigger a compaction:
  - After a flush
  - When the time since the last compaction is greater than max_compaction_interval
  - When compaction is called manually
- How to choose segments:
  - The total size of all delta logs is larger than max_delete_binlog_size
  - deleted rows / total rows >= compaction_delta_binlog_ratio
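The two selection criteria above can be sketched as a small predicate. This is only an illustration: the `SegmentStats` and `DeltaMergeConfig` types and the `NeedsDeltaMerge` function are hypothetical names, not part of the Milvus code base, and the sketch assumes that meeting either criterion is enough to select a segment (the proposal does not say whether both are required).

```go
package main

// SegmentStats is a hypothetical per-segment summary used only for this
// sketch; the field names are not taken from the Milvus code base.
type SegmentStats struct {
	ID            int64
	TotalRows     int64
	DeletedRows   int64
	DeltaLogBytes int64 // combined size of all delta logs of the segment
}

// DeltaMergeConfig carries the two thresholds named in the proposal.
type DeltaMergeConfig struct {
	MaxDeleteBinlogSize        int64   // max_delete_binlog_size, in bytes
	CompactionDeltaBinlogRatio float64 // compaction_delta_binlog_ratio
}

// NeedsDeltaMerge reports whether a segment qualifies for the first phase.
// Assumption: either criterion alone is sufficient.
func NeedsDeltaMerge(s SegmentStats, cfg DeltaMergeConfig) bool {
	// Criterion 1: the delta logs are larger than the size limit.
	if s.DeltaLogBytes > cfg.MaxDeleteBinlogSize {
		return true
	}
	// An empty segment has no meaningful deleted-row ratio.
	if s.TotalRows == 0 {
		return false
	}
	// Criterion 2: deleted rows / total rows >= ratio threshold.
	ratio := float64(s.DeletedRows) / float64(s.TotalRows)
	return ratio >= cfg.CompactionDeltaBinlogRatio
}
```

A caller would first filter out segments that are still inside the time-travel range and then apply `NeedsDeltaMerge` to the rest.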
We merge segments in the second phase:
- The time-travel period may be very long (for example, dozens of days), so small segments within the time-travel range still need to be merged.
- When to trigger a compaction:
  - After a segment flush, if the number of segments smaller than 1/2*max_segment_size at the channel&partition level exceeds compaction_segment_num_threshold
  - When the time since the last compaction is greater than max_compaction_interval
  - When compaction is called manually
- How to choose segments: greedy algorithm
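The proposal does not specify which greedy algorithm is used, so the following is only one possible sketch. It assumes a first-fit-decreasing strategy: small segments are sorted by size in descending order and each one is placed into the first group whose total stays within max_segment_size. The `Segment` type and the `ShouldTriggerMerge` and `PlanMerges` functions are hypothetical names introduced for illustration.

```go
package main

import "sort"

// Segment is a hypothetical view of a flushed segment within one
// channel&partition; only the fields needed by the sketch are present.
type Segment struct {
	ID   int64
	Size int64
}

// isSmall reports whether a segment is smaller than 1/2*max_segment_size.
func isSmall(s Segment, maxSegmentSize int64) bool {
	return 2*s.Size < maxSegmentSize
}

// ShouldTriggerMerge mirrors the post-flush trigger: the number of small
// segments must exceed compaction_segment_num_threshold.
func ShouldTriggerMerge(segs []Segment, maxSegmentSize int64, numThreshold int) bool {
	small := 0
	for _, s := range segs {
		if isSmall(s, maxSegmentSize) {
			small++
		}
	}
	return small > numThreshold
}

// PlanMerges groups small segments so that each group fits into one new
// segment. Assumption: first-fit decreasing is used as the greedy rule.
func PlanMerges(segs []Segment, maxSegmentSize int64) [][]Segment {
	var small []Segment
	for _, s := range segs {
		if isSmall(s, maxSegmentSize) {
			small = append(small, s)
		}
	}
	// Largest first; ties are broken by ID to keep the plan deterministic.
	sort.Slice(small, func(i, j int) bool {
		if small[i].Size != small[j].Size {
			return small[i].Size > small[j].Size
		}
		return small[i].ID < small[j].ID
	})

	var groups [][]Segment
	var totals []int64
	for _, s := range small {
		placed := false
		for g := range groups {
			if totals[g]+s.Size <= maxSegmentSize {
				groups[g] = append(groups[g], s)
				totals[g] += s.Size
				placed = true
				break
			}
		}
		if !placed {
			groups = append(groups, []Segment{s})
			totals = append(totals, s.Size)
		}
	}

	// A group with a single segment has nothing to merge with.
	var plans [][]Segment
	for _, g := range groups {
		if len(g) > 1 {
			plans = append(plans, g)
		}
	}
	return plans
}
```

First-fit decreasing is chosen here because it is simple and tends to produce few, well-filled output segments; other greedy rules (for example, merging in flush order) would satisfy the proposal equally well.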
Some details:
- Only merge flushed segments
- We choose the maximum DML position among the merged segments as the DML position of the newly generated segment.
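As a rough sketch of the two details above, the DML position of the new segment can be computed as follows. The `Position` and `FlushedSegment` types and the `MergedDmlPosition` function are hypothetical, and the sketch assumes that positions on the same channel can be ordered by a timestamp field; the proposal does not say how positions are compared.

```go
package main

// Position is a hypothetical DML position. Assumption: positions on one
// channel are ordered by Timestamp.
type Position struct {
	Channel   string
	Timestamp uint64
}

// FlushedSegment is a hypothetical segment record for this sketch.
type FlushedSegment struct {
	ID          int64
	Flushed     bool
	DmlPosition Position
}

// MergedDmlPosition returns the maximum DML position among the flushed
// input segments. The second result is false when no input is flushed,
// in which case there is nothing to merge.
func MergedDmlPosition(segs []FlushedSegment) (Position, bool) {
	var best Position
	found := false
	for _, s := range segs {
		// Only flushed segments take part in a merge.
		if !s.Flushed {
			continue
		}
		if !found || s.DmlPosition.Timestamp > best.Timestamp {
			best = s.DmlPosition
			found = true
		}
	}
	return best, found
}
```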