Optimized storage and faster search

As organizations continue to leverage the power of generative AI to build retrieval-augmented generation (RAG) applications and agents, the need for efficient, high-performance, and scalable solutions has never been greater. Today, we are excited to introduce binary quantization, a new feature that reduces vector size by up to 96% and search latency by up to 40%.

What is binary quantization?

Binary quantization (BQ) is a technique for compressing high-dimensional vectors by representing each dimension as a single bit. This method greatly reduces the memory footprint of vector indexes and accelerates vector comparison operations, at the expense of recall. The loss of recall can be compensated for by two techniques, oversampling and reranking, which give applications tools to prioritize recall, speed, or cost.
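
To make the idea concrete, here is a minimal NumPy sketch of the quantization step itself (illustrative only, not how Azure AI Search implements it internally): each dimension is mapped to a single bit depending on whether its value is above a threshold, and the bits are packed eight per byte.

```python
import numpy as np

def binary_quantize(vectors: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Compress float32 vectors to 1 bit per dimension, packed 8 bits per byte."""
    bits = (vectors > threshold).astype(np.uint8)   # 1 if above threshold, else 0
    return np.packbits(bits, axis=-1)               # shape: (n, ceil(dims / 8))

# A 1536-dimensional float32 vector (6,144 bytes) becomes 192 bytes of packed bits,
# the theoretical 32x reduction discussed below.
vec = np.random.randn(1, 1536).astype(np.float32)
print(vec.nbytes, "->", binary_quantize(vec).nbytes)   # 6144 -> 192
```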

Why should I use binary quantization?

Binary quantization is best suited for customers who want to store a very large number of vectors at low cost. Azure AI Search keeps vector indexes in memory to provide the best search performance. Binary quantization (BQ) reduces the in-memory size of the vector index, which in turn reduces the number of Azure AI Search partitions needed to fit your data, saving costs.

Binary quantization reduces the size of vector indices in memory by converting 32-bit floating point numbers to 1-bit values, which can reduce vector index size by up to 28x (slightly less than the theoretical 32x due to overhead introduced by the index data structure). The table below shows the impact of binary quantization on vector index size and storage usage.

Table 1.1: Vector index storage benchmarks

Compression configuration | Number of documents | Vector index size (GB) | Total storage size (GB) | Vector index savings | Storage savings
Uncompressed | 1M | 5.77 | 24.77 | n/a | n/a
Scalar quantization (SQ) | 1M | 1.48 | 20.48 | 74% | 17%
Binary quantization (BQ) | 1M | 0.235 | 19.23 | 96% | 22%

Table 1.1 compares storage metrics for three vector compression schemes: uncompressed, scalar quantization (SQ), and binary quantization (BQ). The data shows significant storage improvements with binary quantization, with up to 96% savings in vector index size and 22% savings in overall storage. The benchmark used the MTEB/DBpedia dataset with default vector search settings and OpenAI text-embedding-ada-002 embeddings (1536 dimensions).

Performance Improvement

Binary Quantization (BQ) improves performance, reducing query latency by 10-40% compared to uncompressed indexes. The improvement depends on the oversampling rate, dataset size, vector dimensionality, and service configuration. BQ is faster for several reasons. For example, Hamming distance is faster to compute than cosine similarity, and the packed bit vectors are smaller, providing improved locality. Therefore, it is a good choice when speed is important, and you can balance speed and relevance by applying appropriate oversampling.
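
To illustrate why the comparison itself is cheap, here is a small NumPy sketch (again illustrative, not the service's implementation) that computes Hamming distances between packed codes using XOR and a per-byte popcount lookup. Each comparison touches d/8 bytes instead of d 32-bit floats, which is where much of the locality benefit comes from.

```python
import numpy as np

# Number of set bits for every possible byte value (popcount lookup table).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distances(query_code: np.ndarray, doc_codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query code and many packed document codes.

    query_code: (bytes_per_vector,) uint8; doc_codes: (n_docs, bytes_per_vector) uint8.
    """
    differing_bits = np.bitwise_xor(doc_codes, query_code)   # bits that disagree
    return POPCOUNT[differing_bits].sum(axis=1).astype(np.int32)
```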

Maintaining quality

Binary quantization reduces storage usage and improves retrieval performance, but at the expense of recall. However, techniques such as oversampling and re-ranking can effectively manage this trade-off. Oversampling retrieves a larger set of candidate documents to compensate for the loss of resolution due to quantization, and re-ranking re-computes similarity scores for those candidates using the full-precision vectors. The table below shows, for a subset of the MTEB dataset, the change in average NDCG@10 for OpenAI and Cohere embeddings using binary quantization, with and without re-ranking/oversampling.

Table 1.2: Impact of binary quantization on average NDCG@10 across MTEB subsets.

Model | No re-ranking (Δ / %) | 2x oversampling + re-ranking (Δ / %)
Cohere Embed v3 (1024d) | -4.883 (-9.5%) | -0.393 (-0.76%)
OpenAI text-embedding-3-small (1536d) | -2.312 (-4.55%) | +0.069 (+0.14%)
OpenAI text-embedding-3-large (3072d) | -1.024 (-1.86%) | +0.006 (+0.01%)

Table 1.2 compares the point differences in average NDCG@10 when using binary quantization relative to the uncompressed index, across different embedding models on a subset of the MTEB dataset.

Key highlights:

  • BQ + re-ranking provides higher search quality than BQ without re-ranking.
  • The benefit of re-ranking is more pronounced for lower-dimensional models; for higher-dimensional models the quality loss is smaller and the benefit is sometimes negligible.
  • We strongly recommend re-ranking with full-precision vectors to minimize or eliminate the recall loss introduced by quantization.

When to use binary quantization

Binary quantization is recommended for applications with high-dimensional vectors and large datasets, where storage efficiency and fast retrieval performance are important. It is especially effective for embeddings with dimensions greater than 1024. However, for smaller dimensions, it is recommended to test the quality of BQ or consider SQ as an alternative. BQ also performs very well when the embeddings are centered around 0, as seen in popular embedding models such as OpenAI and Cohere.
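
If you are unsure whether a given embedding model is a good fit, a quick heuristic like the sketch below can help (illustrative only, not an official guideline): on a sample of your own embeddings, check how balanced the values are around zero.

```python
import numpy as np

def bq_friendliness(embeddings: np.ndarray) -> None:
    """Rough check of how well centered around zero a sample of embeddings is."""
    per_dim_positive = (embeddings > 0).mean(axis=0)
    print("overall mean value:        ", round(float(embeddings.mean()), 4))
    print("avg fraction of values > 0:", round(float(per_dim_positive.mean()), 4))
    # A mean near 0 and a positive fraction near 0.5 indicate dimensions balanced
    # around zero, the regime where binary quantization loses the least recall.

# Example with a random stand-in; use a sample of your real embeddings instead.
bq_friendliness(np.random.randn(1000, 1536).astype(np.float32))
```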

BQ + re-ranking/oversampling can significantly reduce cost while maintaining strong search quality by searching the compressed vector index in memory and re-ranking candidates using the full-precision vectors stored on disk. This approach operates efficiently in memory-constrained settings by leveraging both memory and SSDs, providing high performance and scalability on large datasets; a minimal sketch of the pattern follows.
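
The sketch below (NumPy only, with a toy in-memory "index") illustrates that two-stage pattern. In Azure AI Search the oversampling and re-ranking are handled by the service, so this is purely an illustration of the idea, not its API.

```python
import numpy as np

POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def search_with_rerank(query, doc_codes, doc_vectors, k=10, oversampling=2.0):
    """Stage 1: coarse top-(k * oversampling) by Hamming distance over packed codes.
    Stage 2: re-rank those candidates with full-precision cosine similarity."""
    query_code = np.packbits((query > 0).astype(np.uint8))
    hamming = POPCOUNT[np.bitwise_xor(doc_codes, query_code)].sum(axis=1)
    candidates = np.argsort(hamming)[: int(k * oversampling)]

    cand = doc_vectors[candidates]          # full-precision vectors (e.g. read from disk)
    scores = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-scores)][:k]

# Toy example: 10,000 random 1536-d documents; only the packed codes stay in memory.
docs = np.random.randn(10_000, 1536).astype(np.float32)
codes = np.packbits((docs > 0).astype(np.uint8), axis=-1)
print(search_with_rerank(np.random.randn(1536).astype(np.float32), codes, docs))
```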

BQ is our latest addition in a series of price-performance improvements delivered over the past few months. With these storage savings and performance gains, organizations can achieve faster search results and lower operating costs, ultimately leading to better outcomes and user experiences.

More features are now generally available.

We are excited to announce that several vector search enhancements are now generally available in Azure AI Search. These updates give users greater control over the retriever in their RAG solutions and help optimize LLM performance. Key highlights include:

  • Integrated vectorization: Azure AI Search integrated vectorization with Azure OpenAI is now generally available!
  • Binary vector type support: Azure AI Search supports narrow vector types, including binary vectors. This feature allows you to store and process larger vector datasets at lower cost while maintaining fast search capabilities.
  • Vector weights: This feature allows users to assign relative importance to vector queries over term queries in hybrid search scenarios. It gives users more control over the final result set by allowing them to prefer vector similarity over keyword similarity.
  • Document enhancement: Enhance your search results with scoring profiles tailored to vector and hybrid search queries. Whether you prioritize freshness, geographic location, or specific keywords, new features enable targeted document enrichment, ensuring more relevant results for your needs.

Get started with Azure AI Search

To get started with binary quantization, see the official documentation: Reduce vector size – Azure AI Search | Microsoft Learn




