Vector Streaming: The Memory Efficient Indexing for Vector Databases

Friday 13:20 in Hassium

Embedding creation is mostly done synchronously; a lot of time is wasted while the chunks are being created, as chunking is not a compute-heavy operation. As the chunks are being made, passing them to the embedding model would be efficient. This problem further intensifies with late interaction embeddings like CoLBert or ColPali.

The solution is to create an asynchronous chunking and embedding task. We can effectively spawn threads to handle this task using Rust's concurrency patterns and thread safety. This is done using Rust's MPSC (Multi-producer Single Consumer) module, which passes messages between threads. Thus, this creates a stream of chunks passed into the embedding thread with a buffer. Once the buffer is complete, it embeds the chunks and sends the embeddings back to the main thread, where they are sent to the vector database. This ensures no time is wasted on a single operation and no bottlenecks. Moreover, only the chunks and embeddings in the buffer are stored in the system memory. They are erased from the memory once moved to the vector database.

All this is then bound into Python using pyo3 and maturin, so it's easily accessible from Python, but the core is still asynchronous with rust.

Sonam Pankaj

Sonam is the creator of the open-source library called Embed-Anything, which helps to create local and multimodal embeddings and index them efficiently to vector databases, it’s built in rust and thus it’s more greener and efficient. She works as the GenerativeAI Evangelist at Articul8, spun-off of Interl, Articul8 is the go-to generativeAI platform for enterprise.

Akshay Ballal

I am and AI developer who loves Rust. I want to bring Rust to the AI Community. I am the author of EmbedAnything and Lumo.