Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 3 Next »

Current state:

ISSUE: #6383

PRs:

Keywords: delete

Released:

Summary

This document describes how the delete be implemented in Milvus. Milvus provides a new delete API that users can use to delete entities from a collection.

Motivation

In some scenarios, users want to delete some entities from a collection to no longer be searched out. Currently, users can only manually filter out unwanted results in search results. We hope to implement a new function that allows users to delete entities from a collection.

Public Interfaces

def delete(self, condition=None)->MutationResult:
    """
    Delete entities by primary keys.
    Example: client.delete("_id in [1,10,100]")
    
    :param condition: an expression indicates whether an entity should be deleted
    :type  condition: str
    """

Design Details

Since Milvus's storage is an append-only design, the delete function is implemented through soft delete, setting a flag on the existing data to indicate that the data has been deleted.

This solution needs the algorithm library to support search with a bitset and the deletion offset recorded in Milvus. Now the algorithm library Knowhere is supported to search with a bitset indicated whether an entity is deleted. So we discuss how to store the deleted primary keys here.

Proposal delete operation persistent

  1. DataNode subscribe the insert channel
  2. Proxy receives a delete request, split into insert channels by primary keys
  3. DataNode receives a delete request from the insert channel, save it in buffer, and write it into the delta channel
  4. ...
  5. DataNode receives a flush request, write out the deletions saved above
  6. DataNode notifies IndexNode to building indexes
  7. finish

Proposal delete operation serving search(sealed+growing)

  1. QueryNode subscribe the insert channel
  2. QueryNode load the checkpoint and recovery by the checkpoint
  3. Proxy receives a delete request, split into insert channels by primary keys
  4. QueryNode retrieves a delete request from the insert channel, judges the segment to which each deletion belongs, and updates the Inverted Delta Logs(IDL)
  5. ...
  6. QueryNode retrieves a search request, search on each segment
  7. finish

Proposal delete operation serving search(sealed only)

  1. QueryNode subscribe the delta channel
  2. QueryNode load the checkpoint and recovery by the checkpoint
  3. Proxy receives a delete request, split into insert channels
  4. DataNode filter out all delete requests, and write them into the delta channel
  5. QueryNode retrieves delete requests from the delta channel, judges the segment to which each deletion belongs, and update the Inverted Delta Logs(IDL)
  6. ...
  7. QueryNode retrieve a search request, search on each segment
  8. finish

Process delete operation in system recovery

Unaffected

SegmentFilter

SegmentFilter provide a method that can used to get the segment id which a PK possible existed in. Implement by segments statistics and bloomfilter.

DeltaLog

DeltaLog is the persistent file, recording the primary keys deleted and the delete timestamp. Each DeltaLog only belongs to a segment.

InvertedDeltaLog

InvertedDeltaLog provides a method that can be used to get deletion that meets the timestamp condition fastly.

Test Plan

Testcase1

Search a deleted entity, except not in the resultset

client.insert()
client.search()
client.delete()
client.search()

Rejected Alternatives[WIP]

  1. Without any channel changes, the delete requests will be forward into the insert channels by Proxy
    1. Since the balance policy of QueryNode, some QueryNodes will only serve the sealed segments. They should discard all insert data from this channel. It cause a lot of performance waste.
  2. Add a delta channel for all shards, and all the delete requests will be forward into this delta channel by Proxy
    1. DataNode should subscribe insert channel and delta channel, each time to consume data need to align timetick between these two channel, complicatedly
  3. Add a delta channel for each shards, and the delete requests will be forward into this channel and the origin insert channel concurrently. DataNode subscribe the insert channel only, and the QueryNode(serving sealed segments) subscribe the delta channel
    1. It's difficult to sync the insert channel and delta channel

References


  • No labels