Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Current state: "Under Discussion"

ISSUE: #6383 #7130

PRs:

Keywords: delete

Released: 

Summary

This document describes how the to support delete be implemented in Milvus. Milvus provides a new delete API that users can use to delete entities from a collection.

...

In some scenarios, users want to delete some entities from a collection to that will no longer be searched out. Currently, users can only manually filter out unwanted results in entities from search results. We hope to implement a new function that allows users to delete entities from a collection.

Public Interfaces & Behavior

Delete API can be used to delete entities in the collection, and the deleted entities will no longer appear in the results of the Query and Search request.


`collection_name` is the name of the collection to delete entities from.

`expr` is an expression indicated whether an entity should be deleted in the collection. Only the `in` operator is supported in the Delete API. Document of expression: https://github.com/milvus-io/milvus/blob/master/docs/design_docs/query_boolean_expr.md

`partition_name` is the name of the partition to delete entities from, `None` means all partition.


`Delete` returns after being written into the insert channel, which means the delete request has been reliably saved and will be applied in search/query requests. The type of return value is MutationResult, which contains several properties, and only `_primary_keys` will be filled.


Same as Insert API, Milvus only guarantee the visibility of operations with one client. This means that, within the sequence of the operations "delete(), search()", the result of the search will not contains the entities deleted. Since different clients connect to different Proxy, the time between different Proxy is not exactly the same. So even if you manually call the delete method and the search method on two clients sequentially, it is uncertain whether the search request returns the deleted entities.

Currently, Milvus does not support dedup in inserting, so the delete operation will delete all satisfied entities.

Delete a non-existent entity is not an error, so delete() will not raise an Error.

Code Block
languagepy
def delete(self, collection_name, expr, conditionpartition_name=None, timeout=None, **kwargs)->MutationResult:
    """
    Delete entities by primary keys. with an expression condition.
    And return results to show which primary key is deleted successfully

    :param Example: client.delete("_id in [1,10,100]")
   collection_name: Name of the collection to delete entities from
    :type  collection_name: str

    :param expr: The query expression
    :type  expr: str

    :param partition_name: Name of partitions that contain entities
    :type  partition_name: str

    :param timeout: An optional duration of time in seconds to allow for the RPC. When timeout
                    is set to None, client waits until server response or error occur
    :type  timeout: float

    :param condition: an expression indicates whether an entity should be deleted:return: delete request executed results.
    :rtype: MutationResult

    :raises:
        RpcError: If gRPC encounter an error
        :typeParamError: If parameters are invalid
        conditionBaseException: str If the return result from server is not ok
    """

Design Details

Image Added

In Milvus, Proxy maintains 2 Pulsar channels for each collection:

  1. Insert channel, handle following msg types:
    1. DDL msg (CreateCollection/DropCollection/CreatePartition/DropPartition)
    2. InsertMsg
    3. DeleteMsg
  2. Query channel, handle following msg types:
    1. SearchMsg
    2. RetrieveMsg

DataNode consumes messages from Insert channel only.

QueryNode consumes messages from both Insert channel and Query channel.

To support delete, we will send DeleteMsg into Insert channel also.


Since Milvus's storage is an append-only design, the delete `delete` function is implemented through using soft delete, setting a flag on the existing data entity to indicate that the data this entity has been deleted.

This solution needs the :

  • record deletion offset in Milvus
  • let algorithm library support to

...

  • search with a bitset

...

Now the algorithm library Knowhere is `Knowhere` has already supported to search with a bitset indicated which indicates whether an entity is deleted. So we discuss how to store the deleted primary keys here.

Proposal delete operation persistent

...

Image Added

  1. Proxy receives a delete request
    1. parse `expr` to get primary keys, then split DeleteMsg into insert channels by primary keys (done)
  2. DataNode receives a delete request from the insert channel, save it in buffer, and write it into the delta channel
  3. ...
  4. DataNode receives a flush request, write out the deletions saved above
  5. DataNode notifies IndexNode to building indexes
  6. finishwatches insert channel
    1. receive DeleteMsg and persist delete data
      1. when DataNode start up, load all segments info from Minio into memory (done)
      2. update datanode flowgraph
        1. when DDNode receive DeleteMsg, save it into FlowGraphMsg structure, and send FlowGraphMsg to next node InsertBufferNode (done)
        2. when InsertBufferNode receive FlowGraphMsg, process InsertMsg in it, then wrapper DeleteMsg into another FlowGraphMsg and sent it to next node DeleteNode (done)
        3. when DeleteNode receive FlowGraphMsg, process DeleteMsg in it, save deleted ids and timestamps into delBuf (done)
      3. update DeleteNode to handle flushMsg
        1. add another flushChan for DeleteNode (done)
        2. when DeleteNode receives flushMsg, save all data in delBuf into MinIO 
          1. deleted ids and timestamps are saved into two separated Minio file with name "/by-dev/deltalog/collectionID/partitionID/segmentID/xxx" (done)
    2.  DataNode forwards ttMsg and DeleteMsg to delta channel (doing)

Proposal delete operation serving search(sealed+growing)

Image Added

  1. QueryNode subscribe the insert channel
  2. QueryNode load the checkpoint and recovery by the checkpoint
  3. Proxy receives a delete request, split into insert channels by primary keys
  4. QueryNode retrieves a delete request from the insert channel, judges the segment to which each deletion belongs, and updates the Inverted Delta Logs(IDL)
  5. ...
  6. QueryNode retrieves a search request, search on each segment
  7. finish

...