Current state:
ISSUE: #6383
PRs:
Keywords: delete
Released:
Summary
This document describes how the delete be implemented in Milvus. Milvus provides a new delete API that users can use to delete entities from a collection.
Motivation
In some scenarios, users want to delete some entities from a collection to no longer be searched out. Currently, users can only manually filter out unwanted results in search results. We hope to implement a new function that allows users to delete entities from a collection.
Public Interfaces
def delete(self, condition=None)->MutationResult: """ Delete entities by primary keys. Example: client.delete("_id in [1,10,100]") :param condition: an expression indicates whether an entity should be deleted :type condition: str """
Design Details
Since Milvus's storage is an append-only design, the delete function is implemented through soft delete, setting a flag on the existing data to indicate that the data has been deleted.
This solution needs the algorithm library to support search with a bitset and the deletion offset recorded in Milvus. Now the algorithm library Knowhere is supported to search with a bitset indicated whether an entity is deleted. So we discuss how to store the deleted primary keys here.
Proposal delete operation persistent
- DataNode subscribe the insert channel
- Proxy receives a delete request, split into insert channels by primary keys
- DataNode receives a delete request from the insert channel, save it in buffer, and write it into the delta channel
- ...
- DataNode receives a flush request, write out the deletions saved above
- DataNode notifies IndexNode to building indexes
- finish
Proposal delete operation serving search(sealed+growing)
- QueryNode subscribe the insert channel
- QueryNode load the checkpoint and recovery by the checkpoint
- Proxy receives a delete request, split into insert channels by primary keys
- QueryNode retrieves a delete request from the insert channel, judges the segment to which each deletion belongs, and updates the Inverted Delta Logs(IDL)
- ...
- QueryNode retrieves a search request, search on each segment
- finish
Proposal delete operation serving search(sealed only)
- QueryNode subscribe the delta channel
- QueryNode load the checkpoint and recovery by the checkpoint
- Proxy receives a delete request, split into insert channels
- DataNode filter out all delete requests, and write them into the delta channel
- QueryNode retrieves delete requests from the delta channel, judges the segment to which each deletion belongs, and update the Inverted Delta Logs(IDL)
- ...
- QueryNode retrieve a search request, search on each segment
- finish
Process delete operation in system recovery
Unaffected
SegmentFilter
SegmentFilter provide a method that can used to get the segment id which a PK possible existed in. Implement by segments statistics and bloomfilter.
DeltaLog
DeltaLog is the persistent file, recording the primary keys deleted and the delete timestamp. Each DeltaLog only belongs to a segment.
InvertedDeltaLog
InvertedDeltaLog provides a method that can be used to get deletion that meets the timestamp condition fastly.
Test Plan
Testcase1
Search a deleted entity, except not in the resultset
client.insert() client.search() client.delete() client.search()
Rejected Alternatives[WIP]
- Without any channel changes, the delete requests will be forward into the insert channels by Proxy
- Since the balance policy of QueryNode, some QueryNodes will only serve the sealed segments. They should discard all insert data from this channel. It cause a lot of performance waste.
- Add a delta channel for all shards, and all the delete requests will be forward into this delta channel by Proxy
- DataNode should subscribe insert channel and delta channel, each time to consume data need to align timetick between these two channel, complicatedly
- Add a delta channel for each shards, and the delete requests will be forward into this channel and the origin insert channel concurrently. DataNode subscribe the insert channel only, and the QueryNode(serving sealed segments) subscribe the delta channel
- It's difficult to sync the insert channel and delta channel