...

  1. More efficient reading and writing of persistent data in the standalone version, because no network communication is needed.
  2. Reduce the impact of failures or changes in dependency components. For example, when minio crashes today, the whole system becomes unavailable.
  3. Dependency injection can be done with local storage, removing the unit tests' dependency on minio.

At the same time, to preserve user choice, minio can still be used as the storage engine for the standalone version. We are going to make the storage engine configurable.

...

  1. To solve Problem 3, we redefine the interface implemented by minio. Every file storage backend needs to implement this interface.

    ChunkManager interface:
    type ChunkManager interface {
    	GetPath(filePath string) (string, error)
    	GetSize(filePath string) (int64, error)
    	Write(filePath string, content []byte) error
    	MultiWrite(contents map[string][]byte) error
    	Exist(filePath string) bool
    	Read(filePath string) ([]byte, error)
    	MultiRead(filePaths []string) ([][]byte, error)
    	ReadWithPrefix(prefix string) ([]string, [][]byte, error)
    	ReadAt(filePath string, off int64, length int64) (p []byte, err error)
    	Remove(filePath string) error
    	MultiRemove(filePaths []string) error
    	RemoveWithPrefix(prefix string) error
    }

    For this interface we will have three implementations: LocalChunkManager, MinioChunkManager and VectorChunkManager.
    VectorChunkManager is an optimized management class for vector reading under distributed milvus. It will use minio as the storage chunkManager and the local file system as the vector cache. When reading a file, it first downloads the file from minio to the local file system and then reads the relevant data from the local copy. In the standalone version, we will replace minioChunkManager with localChunkManager as the implementation of the storage chunkManager.
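The download-then-read behaviour described for VectorChunkManager can be sketched as a read-through cache. This is only an illustration under stated assumptions: the `Reader` interface is a hypothetical one-method subset of ChunkManager, `mapStore` is an in-memory stand-in for minio, and `cachedReader` and `newTempCachedReader` are names invented for this example rather than part of the proposal.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Reader is a hypothetical, reduced view of ChunkManager that keeps only
// the method this sketch needs.
type Reader interface {
	Read(filePath string) ([]byte, error)
}

// mapStore is an in-memory stand-in for the remote storage (minio in the
// distributed deployment). It counts reads so the caching effect is visible.
type mapStore struct {
	data  map[string][]byte
	reads int
}

func (m *mapStore) Read(filePath string) ([]byte, error) {
	m.reads++
	b, ok := m.data[filePath]
	if !ok {
		return nil, fmt.Errorf("file %q not found", filePath)
	}
	return b, nil
}

// cachedReader copies a file from remote into cacheDir on first access and
// serves later reads from the local copy.
type cachedReader struct {
	remote   Reader
	cacheDir string
}

func (c *cachedReader) Read(filePath string) ([]byte, error) {
	local := filepath.Join(c.cacheDir, filepath.FromSlash(filePath))
	if b, err := os.ReadFile(local); err == nil {
		return b, nil // cache hit
	}
	// Cache miss (or unreadable local copy): fall back to the remote store.
	b, err := c.remote.Read(filePath)
	if err != nil {
		return nil, err
	}
	if err := os.MkdirAll(filepath.Dir(local), 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(local, b, 0o644); err != nil {
		return nil, err
	}
	return b, nil
}

// newTempCachedReader builds a cachedReader backed by a fresh temporary
// directory and returns a cleanup function that removes that directory.
func newTempCachedReader(remote Reader) (*cachedReader, func(), error) {
	dir, err := os.MkdirTemp("", "vector-cache")
	if err != nil {
		return nil, nil, err
	}
	return &cachedReader{remote: remote, cacheDir: dir}, func() { os.RemoveAll(dir) }, nil
}

func main() {
	remote := &mapStore{data: map[string][]byte{"seg/1/vec": []byte("abc")}}
	cr, cleanup, err := newTempCachedReader(remote)
	if err != nil {
		panic(err)
	}
	defer cleanup()

	// Read the same file twice; only the first read should reach the remote.
	for i := 0; i < 2; i++ {
		if _, err := cr.Read("seg/1/vec"); err != nil {
			panic(err)
		}
	}
	fmt.Println("remote reads:", remote.reads)
}
```

Because `cachedReader` depends only on the `Reader` interface, swapping the remote side for a local implementation in standalone mode would not change the caching code.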

  2. ChunkManagerFactory

    For Problem 1, a chunkManagerFactory, similar to msgstream.Factory, is added to generate chunkManagers with different configurations.

    ChunkManager factory:
    type ChunkManagerFactory struct {
    	chunkStorage       string
    	vectorCacheStorage string
    }
    func NewChunkManagerFactory(chunkStorage, vectorCacheStorage string) *ChunkManagerFactory {}
    func (cmf *ChunkManagerFactory) NewLocalChunkManager(opts ...storage.Options) {}
    func (cmf *ChunkManagerFactory) NewChunkStorage(opts ...storage.Options) {
    	// Select the implementation from the configured engine name.
    	switch cmf.chunkStorage {
    	case "s3":
    	case "minio":
    	...
    	}
    }
    func (cmf *ChunkManagerFactory) NewVectorCacheManager(opts ...storage.Options) {}
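As a rough, runnable illustration of how such a factory could pick an implementation from the configured engine name, consider the sketch below. It makes several assumptions that are not in the proposal: ChunkManager is reduced to a single method, the two manager types are empty stubs, the constructors return an error for unknown engine names, and the root path and bucket name are placeholder values.

```go
package main

import (
	"errors"
	"fmt"
)

// ChunkManager is reduced to one method for this sketch.
type ChunkManager interface {
	Exist(filePath string) bool
}

// Stub implementations; real ones would talk to the file system or minio.
type localChunkManager struct{ rootPath string }

func (l *localChunkManager) Exist(string) bool { return false }

type minioChunkManager struct{ bucketName string }

func (m *minioChunkManager) Exist(string) bool { return false }

// ErrUnknownEngine is a hypothetical error for unsupported engine names.
var ErrUnknownEngine = errors.New("unknown storage engine")

type ChunkManagerFactory struct {
	chunkStorage       string
	vectorCacheStorage string
}

func NewChunkManagerFactory(chunkStorage, vectorCacheStorage string) *ChunkManagerFactory {
	return &ChunkManagerFactory{
		chunkStorage:       chunkStorage,
		vectorCacheStorage: vectorCacheStorage,
	}
}

// build maps an engine name to an implementation.
func (f *ChunkManagerFactory) build(engine string) (ChunkManager, error) {
	switch engine {
	case "local":
		return &localChunkManager{rootPath: "/tmp/example-root"}, nil // placeholder path
	case "minio":
		return &minioChunkManager{bucketName: "example-bucket"}, nil // placeholder bucket
	default:
		return nil, fmt.Errorf("%w: %q", ErrUnknownEngine, engine)
	}
}

// NewChunkStorage returns the manager for persistent chunk data.
func (f *ChunkManagerFactory) NewChunkStorage() (ChunkManager, error) {
	return f.build(f.chunkStorage)
}

// NewVectorCacheManager returns the manager used as the vector cache.
func (f *ChunkManagerFactory) NewVectorCacheManager() (ChunkManager, error) {
	return f.build(f.vectorCacheStorage)
}

func main() {
	f := NewChunkManagerFactory("minio", "local")
	cs, err := f.NewChunkStorage()
	if err != nil {
		panic(err)
	}
	vc, err := f.NewVectorCacheManager()
	if err != nil {
		panic(err)
	}
	fmt.Printf("chunk storage: %T, vector cache: %T\n", cs, vc)
}
```

Returning an error for an unknown name, instead of silently falling through the switch, makes a misconfigured engine fail at construction time.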

    Options are needed when generating a new chunkManager. They may look like this:

    ChunkManager config:
    type config struct {
    	address string
    	bucketName string
    	accessKeyID string
    	secretAccessKeyID string
    	useSSL bool
    	createBucket bool
    	rootPath string
    }
    
    type Option func(*config)
    
    func Address(addr string) Option {
    	return func(c *config) {
    		c.address = addr
    	}
    }

    This structure has some redundancy: for example, local storage does not need parameters such as address and bucketName. In exchange, the same structure is easier to reuse across implementations.
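The option pattern and the redundancy trade-off can be seen in a small self-contained sketch. The `BucketName` and `RootPath` options and the `newConfig` helper are assumptions added for illustration (only `Address` appears in the proposal), the config is trimmed to three fields, and the address, bucket and path values are placeholders.

```go
package main

import "fmt"

// config is a trimmed copy of the proposed structure.
type config struct {
	address    string
	bucketName string
	rootPath   string
}

// Option mutates a config; each option sets one field.
type Option func(*config)

func Address(addr string) Option {
	return func(c *config) { c.address = addr }
}

// BucketName and RootPath are hypothetical options in the same style.
func BucketName(name string) Option {
	return func(c *config) { c.bucketName = name }
}

func RootPath(path string) Option {
	return func(c *config) { c.rootPath = path }
}

// newConfig applies the options in order to a zero-valued config.
func newConfig(opts ...Option) *config {
	c := &config{}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	// A remote engine sets the address and bucket and leaves rootPath empty.
	remote := newConfig(Address("localhost:9000"), BucketName("example-bucket"))
	// A local engine only needs a root path; the remote fields stay unused.
	local := newConfig(RootPath("/tmp/example-root"))
	fmt.Printf("%+v\n%+v\n", *remote, *local)
}
```

Both engines share one `config` type, so unused fields simply keep their zero values.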

  3. Use a more generic factory instead of the existing msgFactory to build nodes.

    Interface extended:
    factory := newMsgFactory(localMsg)
    rc, err := components.NewRootCoord(ctx, factory)
    ↓↓↓↓↓↓↓↓↓↓↓↓↓
    factory := newFactory(localMsg)
    rc, err := components.NewRootCoord(ctx, factory)

    The factory will look like this.

    Factory struct:
    type Factory struct {
    	msgF     msgstream.Factory
    	storageF *storage.ChunkManagerFactory
    }
    
    func newDefaultFactory(opts ...Option) *Factory {
    	c := newDefaultConfig()
    	// range yields (index, value); the option is the value.
    	for _, opt := range opts {
    		opt(c)
    	}
    	return &Factory{
    		msgF: msgstream.NewFactory(c.msgStream),
    		// Argument order follows NewChunkManagerFactory(chunkStorage, vectorCacheStorage).
    		storageF: storage.NewChunkManagerFactory(c.chunkStorage, c.vectorCacheStorage),
    	}
    }
    
    type config struct {
    	msgStream          string
    	vectorCacheStorage string
    	chunkStorage       string
    }
    func newDefaultConfig() *config {}
    
    type Option func(*config)
    
    func vectorCacheStorage(engine string) Option {
    	return func(c *config) {
    		c.vectorCacheStorage = engine
    	}
    }


  4. The deploy section in milvus.yaml sets the default storage engine for each deploy mode. The vectorCacheStorage and chunkStorage keys select which storage engine is used.

    milvus.yaml:
    deploy:
      standalone:
        chunkStorage: "local"
        vectorCacheStorage: "local"
      distributed:
        chunkStorage: "minio"
        vectorCacheStorage: "local"


    In standalone mode, setting chunkStorage to local is the more efficient choice, although minio can also be chosen.
    In the distributed version, however, the nodes run on different machines, so data written to local storage on one node is not visible to the others and the nodes end up with inconsistent data. The storage engine should therefore be chosen carefully.
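One way to guard against the inconsistent combination described above is a startup check on the loaded settings. The sketch below is a hypothetical addition, not part of the proposal: the `deployConfig` struct, its field names and the `validate` function are assumptions, and the values are expected to come from the deploy section of milvus.yaml.

```go
package main

import "fmt"

// deployConfig is a hypothetical in-memory form of one deploy mode entry.
type deployConfig struct {
	mode               string // "standalone" or "distributed"
	chunkStorage       string
	vectorCacheStorage string
}

// validate rejects node-local chunk storage in distributed mode, where each
// node would otherwise see only its own files.
func validate(c deployConfig) error {
	if c.mode == "distributed" && c.chunkStorage == "local" {
		return fmt.Errorf("chunkStorage %q is node-local and cannot be shared in %s mode",
			c.chunkStorage, c.mode)
	}
	return nil
}

func main() {
	configs := []deployConfig{
		{mode: "standalone", chunkStorage: "local", vectorCacheStorage: "local"},
		{mode: "distributed", chunkStorage: "minio", vectorCacheStorage: "local"},
		{mode: "distributed", chunkStorage: "local", vectorCacheStorage: "local"},
	}
	for _, c := range configs {
		fmt.Printf("%s/%s: %v\n", c.mode, c.chunkStorage, validate(c))
	}
}
```

Whether such a combination should be a hard error or only a warning is a design decision the proposal leaves open.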

...