Current state: Under Discussion
Keywords: HA, primary-backup, RootCoord, DataCoord, QueryCoord, IndexCoord
Released:

...

The start-up process of a coordinator is shown in the graph above (taking DataCoord as an example; the others are very similar). Currently, each component in a Milvus cluster maintains a keepAlive lease with ETCD and registers its node info there (internal Register in the graph). Milvus implements service discovery by querying the registered info. If a node crashes, its lease expires and the related registered info vanishes. Therefore, we can easily detect failure by watching the coords' key in ETCD. The design change will mainly be in internal Register, as follows.
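The lease, registration, and watch behaviour described above can be illustrated with a small sketch. It does not use the real ETCD client: `registry`, `register`, `lookup`, `watchDelete`, and `expireLease` are hypothetical names for an in-memory stand-in, and lease expiry is triggered by hand rather than by a TTL.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical in-memory stand-in for ETCD. Each key is treated
// as if it were bound to a live keepAlive lease.
type registry struct {
	mu       sync.Mutex
	keys     map[string]string          // key -> registered node info
	watchers map[string][]chan struct{} // key -> channels closed on delete
}

func newRegistry() *registry {
	return &registry{
		keys:     make(map[string]string),
		watchers: make(map[string][]chan struct{}),
	}
}

// register stores the node info under key (the "internal Register" step).
func (r *registry) register(key, info string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.keys[key] = info
}

// lookup is the service-discovery side: it queries the registered info.
func (r *registry) lookup(key string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	info, ok := r.keys[key]
	return info, ok
}

// watchDelete returns a channel that is closed once key is removed.
// If key is already absent, the channel is returned closed.
func (r *registry) watchDelete(key string) <-chan struct{} {
	r.mu.Lock()
	defer r.mu.Unlock()
	ch := make(chan struct{})
	if _, ok := r.keys[key]; !ok {
		close(ch)
		return ch
	}
	r.watchers[key] = append(r.watchers[key], ch)
	return ch
}

// expireLease simulates a crashed node: keepAlive stops, the lease dies,
// the key vanishes, and every watcher of that key is notified.
func (r *registry) expireLease(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.keys[key]; !ok {
		return
	}
	delete(r.keys, key)
	for _, ch := range r.watchers[key] {
		close(ch)
	}
	delete(r.watchers, key)
}

func main() {
	r := newRegistry()
	r.register("datacoord", "node-1") // key and info are made up
	gone := r.watchDelete("datacoord")

	info, _ := r.lookup("datacoord")
	fmt.Println("discovered:", info)

	r.expireLease("datacoord") // the node "crashes"
	<-gone                     // the watcher detects the failure
	_, ok := r.lookup("datacoord")
	fmt.Println("still registered:", ok)
}
```

In a real deployment the expiry would come from ETCD itself once keepAlive messages stop; the sketch only shows why watching the key is enough to detect a failed coord.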

internal Register

  1. Try to lock the primary key in ETCD. If this succeeds, this node will become the primary. → 5. If it fails, another node holds the lock. → 2.
  2. Enter StandbyMode. (We can add a new value in StateCode or define a new flag.) → 3, 4
  3. Watch the primary key in ETCD. When receiving a key DELETE response, → 1 to campaign for the lock.
  4. Start a goroutine that loops to print logs and do WarmUp(). WarmUp does work such as updating the meta to accelerate the Restart if necessary.
  5. Check whether the node is in StandbyMode. If true → 6. If false → 8.
  6. Restart: call the internal Start func. → 7
  7. Exit StandbyMode. This stops the loop in 4. → 8
  8. Register the service to ETCD as primary. (An ETCD lock may be needed in RegisterService to make sure there is only one primary in the cluster at any time.) → 9
  9. Start up the LivenessCheck goroutine. The node takes over the primary role and starts working.
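The flow above (try the lock, fall back to StandbyMode, watch the primary key, then Restart and register) can be sketched as a small loop. This is only an illustration under stated assumptions: `primaryLock`, `coord`, `onStandby`, and the trace strings are hypothetical, the lock is an in-memory stand-in for the ETCD key, and the WarmUp goroutine is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// primaryLock is a hypothetical in-memory stand-in for the primary key in ETCD.
type primaryLock struct {
	mu      sync.Mutex
	holder  string
	release chan struct{} // closed when the holder's key is deleted
}

// tryLock succeeds only if no node currently holds the key.
func (l *primaryLock) tryLock(node string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder != "" {
		return false
	}
	l.holder = node
	l.release = make(chan struct{})
	return true
}

// watch returns a channel that is closed on key DELETE.
// If nobody holds the key, the channel is returned closed.
func (l *primaryLock) watch() <-chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" {
		ch := make(chan struct{})
		close(ch)
		return ch
	}
	return l.release
}

// unlock simulates the primary crashing and its lease expiring.
func (l *primaryLock) unlock() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder != "" {
		l.holder = ""
		close(l.release)
	}
}

// coord records the steps it takes so the flow can be inspected.
type coord struct {
	name      string
	standby   bool
	onStandby func() // optional hook, called once on entering StandbyMode
	trace     []string
}

func (c *coord) step(s string) { c.trace = append(c.trace, s) }

// register blocks until this node has become the primary.
func (c *coord) register(l *primaryLock) {
	for !l.tryLock(c.name) { // lock attempt failed: another node is primary
		if !c.standby {
			c.standby = true // enter StandbyMode
			c.step("enter-standby")
			if c.onStandby != nil {
				c.onStandby()
			}
		}
		<-l.watch() // wait for key DELETE, then campaign again
	}
	if c.standby {
		c.step("restart") // a real node would call its internal Start here
		c.standby = false
		c.step("exit-standby")
	}
	c.step("register-primary")
	c.step("liveness-check") // a real node would start LivenessCheck here
}

func main() {
	l := &primaryLock{}
	a := &coord{name: "coord-a"}
	a.register(l) // wins the lock immediately
	fmt.Println(a.name, a.trace)

	b := &coord{name: "coord-b"}
	inStandby := make(chan struct{})
	b.onStandby = func() { close(inStandby) }
	done := make(chan struct{})
	go func() { b.register(l); close(done) }()
	<-inStandby // coord-b is now waiting as standby
	l.unlock()  // coord-a "crashes"
	<-done
	fmt.Println(b.name, b.trace)
}
```

A node that wins the lock on its first attempt skips the Restart path entirely, while a node promoted from standby goes through Restart before registering, which matches the StandbyMode check in the list.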


Test Plan


1, Deploy a cluster with primary and standby coords. Manually remove the primary coord or mock a crash in the primary coord. The standby coord should take over successfully. The cluster must resume working after a short period of partial failure.

2, Deploy multiple backup coordinators. Only one of them should become active.

3, The tests should be run for each kind of coord.
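The "only one becomes active" check in case 2 can be prototyped before a cluster is available. The sketch below is an assumption-laden stand-in, not the actual test: `campaign` is a hypothetical helper, and an atomic compare-and-swap plays the role of the ETCD lock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// campaign lets n hypothetical standby coords race for a single primary slot
// and returns how many of them won. The compare-and-swap stands in for an
// ETCD lock; a real test would run against a deployed cluster.
func campaign(n int) int {
	var slot int32    // 0 = no primary yet, 1 = primary elected
	var winners int32 // number of contenders that took the slot
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if atomic.CompareAndSwapInt32(&slot, 0, 1) {
				atomic.AddInt32(&winners, 1)
			}
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt32(&winners))
}

func main() {
	for _, n := range []int{1, 3, 16} {
		fmt.Printf("standbys=%d active=%d\n", n, campaign(n))
	}
}
```

Against a real deployment, the same assertion would be made by counting the coords that report themselves as primary after the original primary is removed.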

...