Part 2 of the KNN series exploring practical challenges in building scalable vector databases
In KNN Part I we spent quite a bit of time on CPU acceleration and optimization for large-scale KNN computation. However, when building production vector databases (especially at the scale of roughly 1B vectors), the biggest performance bottleneck you are likely to encounter isn't computation: it's disk I/O. Dense vector search has become the backbone of semantic retrieval, but as embedding dimensions grow from 768 to 1536+ and datasets scale to billions of vectors, we hit a fundamental problem: the vector dataset no longer fits in RAM, yet keeping vectors in RAM is a prerequisite for the distance computations that KNN search performs.
Consider the storage requirements for Float32 vectors using the formula: S = N × D × B
where:

- S = total storage (bytes)
- N = number of vectors
- D = dimensions per vector
- B = bytes per dimension (4 for Float32)

Storage Requirements for Float32 Vectors

| Dimensions | Memory/Vector | 1M Vectors | 1B Vectors | AWS RAM Cost/Month* |
|---|---|---|---|---|
| 768 | 3 KB | 3 GB | 3 TB | $19,929 |
| 1536 | 6 KB | 6 GB | 6 TB | $39,858 |
*AWS u-3tb1/u-6tb1 High Memory instances (On-Demand pricing, US East)
With modern embeddings typically using 768-1536 dimensions, billion-vector deployments require terabytes of RAM—economically prohibitive for most applications.
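As a quick sanity check of that formula, here is a tiny Java snippet (purely illustrative, not part of jVector) reproducing the billion-vector figure from the table above.

```java
public final class StorageCalc {
    /** S = N x D x B, in bytes. */
    static long storageBytes(long numVectors, int dimensions, int bytesPerDimension) {
        return numVectors * dimensions * bytesPerDimension;
    }

    public static void main(String[] args) {
        // 1B vectors x 1536 dimensions x 4 bytes (Float32) ~= 6 TB of raw vector data
        long bytes = storageBytes(1_000_000_000L, 1536, 4);
        System.out.printf("%.2f TB%n", bytes / 1e12);
    }
}
```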
This is where quantization enters the picture. But as we’ll discover, the reality of quantization techniques like Product Quantization is far more nuanced than their promises suggest.
One apparent solution is memory mapping—a technique that allows applications to access disk-stored data as if it were in RAM by mapping file contents directly into the process’s virtual address space. This seems attractive because it lets you work with datasets larger than physical memory without explicit I/O management.
However, memory mapping is a superficial solution for vector KNN workloads. When the dataset can't fully fit in RAM, the operating system pages vectors in from disk on demand, so each vector comparison can trigger an expensive disk read. Worse, vector index traversal touches vectors in an essentially random order, which turns those reads into random disk access and creates severe performance penalties:
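To make that failure mode concrete, below is a minimal Java sketch of memory-mapped vector access. The flat Float32 file layout and the MmapVectorReader class are assumptions for illustration, not jVector's storage format; the point is that every touch of a non-resident page becomes a disk read.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a flat file of Float32 vectors accessed through mmap.
public final class MmapVectorReader {
    private final MappedByteBuffer buffer;
    private final int dimension;

    public MmapVectorReader(Path file, int dimension) throws IOException {
        this.dimension = dimension;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // Assumes the whole file fits in one <= 2 GB mapping; a real
            // multi-terabyte dataset would need many mappings (chunking).
            this.buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Index traversal reads vectors in essentially random order, so most reads
    // land on non-resident pages and pay NVMe latency (~100 us) instead of RAM (~100 ns).
    public float[] vector(int ordinal) {
        float[] v = new float[dimension];
        int base = ordinal * dimension * Float.BYTES;
        for (int i = 0; i < dimension; i++) {
            v[i] = buffer.getFloat(base + i * Float.BYTES);
        }
        return v;
    }
}
```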
Storage Access Latency Comparison

| Technology | Random Access Latency | vs RAM | Use Case |
|---|---|---|---|
| DDR4/DDR5 RAM | ~100 ns | 1× | Active datasets |
| NVMe SSD | ~100-200 μs | 1,000-2,000× slower | High-performance storage |
| SATA SSD | ~500 μs | 5,000× slower | General storage |
| Traditional HDD | ~5-10 ms | 50,000-100,000× slower | Cold storage |
This performance gap explains why algorithms like DiskANN rely heavily on keeping working sets in memory during index construction, where thousands of distance computations occur per new node insertion.
In jVector, we address this issue for large-scale index construction by adopting a quantization technique widely known as Product Quantization (PQ). PQ attacks the memory problem with a divide-and-conquer approach: instead of quantizing entire high-dimensional vectors, it splits each vector into smaller subvectors and quantizes each subspace independently.
How PQ Works:

1. Split each D-dimensional vector into M subvectors of D/M dimensions each.
2. Train a separate codebook per subspace (typically K = 256 centroids, learned with k-means).
3. Encode each subvector as the index of its nearest centroid, which takes log₂(K) = 8 bits per subspace.
4. Store each vector as M one-byte codes instead of D 32-bit floats.

Key Parameters:

- M: number of subspaces (more subspaces preserve more detail but compress less)
- K: centroids per subspace (typically 256, so each code fits in one byte)

The resulting compression ratio per vector is:

(D × 32 bits) ÷ (M × 8 bits) = 4D/M

For our 1536-dimensional examples:

- M = 16: 4 × 1536 / 16 = 384:1
- M = 64: 4 × 1536 / 64 = 96:1
- M = 192: 4 × 1536 / 192 = 32:1

The fundamental trade-off: fewer subspaces yield better compression but worse approximation quality.
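To make the encoding step concrete, here is a minimal Java sketch of PQ encoding under the parameters above (M subspaces, K ≤ 256 centroids per subspace). The codebook layout and class name are illustrative rather than jVector's API, and codebook training (k-means per subspace) is assumed to have already happened.

```java
// Minimal PQ encoding sketch: each subvector is replaced by the index of its
// nearest centroid in that subspace's codebook.
public final class PqEncoder {
    private final float[][][] codebooks; // [M subspaces][K centroids][D/M dims]
    private final int m;
    private final int subDim;

    public PqEncoder(float[][][] codebooks) {
        this.codebooks = codebooks;
        this.m = codebooks.length;
        this.subDim = codebooks[0][0].length;
    }

    /** Encode a D-dimensional vector as M one-byte centroid indices (read back with & 0xFF). */
    public byte[] encode(float[] vector) {
        byte[] codes = new byte[m];
        for (int s = 0; s < m; s++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int k = 0; k < codebooks[s].length; k++) {
                float dist = 0f;
                for (int i = 0; i < subDim; i++) {
                    float diff = vector[s * subDim + i] - codebooks[s][k][i];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = k;
                }
            }
            codes[s] = (byte) best; // K <= 256, so one byte per subspace
        }
        return codes;
    }
}
```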
In practice, PQ offers three compelling advantages:
PQ achieves compression ratios impossible with other techniques. For 1536-dimensional vectors using 16 subspaces:
Compression: (1536 × 4 bytes) ÷ 16 bytes = 384:1
PQ Storage Efficiency (1536D, 16 subspaces)

| Dataset Size | Full Precision | PQ Compressed | Space Saved | Ratio |
|---|---|---|---|---|
| 1M vectors | 6 GB | 16 MB | 5.98 GB | 384:1 |
| 100M vectors | 600 GB | 1.6 GB | 598.4 GB | 384:1 |
| 1B vectors | 6 TB | 16 GB | 5.98 TB | 384:1 |
Compressed vectors fit in RAM, enabling in-memory distance calculations and eliminating expensive disk I/O during search and index construction.
PQ’s finite centroid space (typically K=256 per subspace) enables precomputation of all possible centroid-to-centroid distances. With 16 subspaces, this requires storing only:
Memory = 16 × (256² / 2) × 4 bytes ≈ 2 MB
This transforms distance computation from O(D) floating-point operations into O(1) memory lookups per subspace, i.e. M lookups and additions per vector pair.
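The following Java sketch illustrates that lookup, reusing the codebook layout assumed in the encoder sketch above. For simplicity it stores the full K × K table per subspace (about 4 MB for M = 16, K = 256, which exploiting symmetry can halve); it is an illustration of the idea, not jVector's implementation.

```java
// Minimal sketch of symmetric PQ distance: precomputed per-subspace
// centroid-to-centroid tables replace per-dimension arithmetic with one
// lookup and one addition per subspace.
public final class PqSymmetricDistance {
    private final float[][][] tables; // [M][K][K] squared partial distances

    public PqSymmetricDistance(float[][][] codebooks) {
        int m = codebooks.length;
        int k = codebooks[0].length;
        int subDim = codebooks[0][0].length;
        tables = new float[m][k][k];
        for (int s = 0; s < m; s++) {
            for (int a = 0; a < k; a++) {
                for (int b = 0; b < k; b++) {
                    float d = 0f;
                    for (int i = 0; i < subDim; i++) {
                        float diff = codebooks[s][a][i] - codebooks[s][b][i];
                        d += diff * diff;
                    }
                    tables[s][a][b] = d;
                }
            }
        }
    }

    /** Approximate squared distance between two PQ-encoded vectors: M table lookups. */
    public float distance(byte[] codesA, byte[] codesB) {
        float sum = 0f;
        for (int s = 0; s < tables.length; s++) {
            sum += tables[s][codesA[s] & 0xFF][codesB[s] & 0xFF];
        }
        return sum;
    }
}
```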
Our jVector benchmarks on 1536-dimensional vectors show dramatic performance improvements:
Distance Computation Performance

| Subspaces | Avg Latency (μs) | Speedup vs Full Precision |
|---|---|---|
| 0 (Full) | 375,719 ± 24,498 | Baseline |
| 16 | 13,751 ± 321 | 27.3× faster |
| 64 | 40,927 ± 541 | 9.2× faster |
| 192 | 133,815 ± 6,739 | 2.8× faster |
Index Construction Acceleration

| Configuration | Construction Time | Improvement |
|---|---|---|
| Full Precision (100K vectors, 1536D) | 38.4 ± 2.4 seconds | Baseline |
| PQ Compressed (16 subspaces) | 4.3 ± 0.8 seconds | 9.0× faster |
However, aggressive compression comes with severe quality degradation:
Recall Degradation Analysis (100K vectors, k=50)

| PQ Subspaces | Configuration | Recall@50 | Quality Loss |
|---|---|---|---|
| 0 (Full) | No quantization | 1.00 | Baseline |
| 16 | 96 dims/subspace | 0.10 | 90% degradation |
| 32 | 48 dims/subspace | 0.10 | 90% degradation |
| 64 | 24 dims/subspace | 0.20 | 80% degradation |
| 96 | 16 dims/subspace | 0.40 | 60% degradation |
| 192 | 8 dims/subspace | 0.80 | 20% degradation |
The sweet spot for computational performance (16 subspaces) produces practically unusable results with only 10% recall.
To achieve acceptable recall, we must increase the overquery factor—a multiplier that determines how many extra candidates to retrieve during the initial (fast) quantized search phase before re-ranking with full-precision vectors.
Overquery Factor Explained:

- With k = 50 and an overquery factor of 5, the quantized search phase retrieves 5 × 50 = 250 candidates.
- Those 250 candidates are then re-ranked with full-precision distances, and the best 50 are returned.

This compensates for quantization quality loss by casting a wider net during the initial search, then using expensive full-precision comparisons to refine the results.
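A minimal Java sketch of that two-phase flow is shown below; QuantizedIndex, VectorStore, and the surrounding class are hypothetical interfaces for illustration, not jVector's API.

```java
import java.util.Comparator;
import java.util.List;

// Minimal sketch of overquery + re-ranking.
public final class OverquerySearch {
    interface QuantizedIndex { List<Integer> search(float[] query, int numCandidates); }
    interface VectorStore    { float[] load(int ordinal); } // full-precision read (disk I/O)

    record Scored(int ordinal, double distance) {}

    static List<Integer> search(QuantizedIndex index, VectorStore store,
                                float[] query, int k, int overqueryFactor) {
        // Phase 1: cast a wide net with cheap quantized distances
        // (e.g. k = 50, factor = 5 -> 250 candidates).
        List<Integer> candidates = index.search(query, k * overqueryFactor);

        // Phase 2: load each candidate's full-precision vector once (the expensive
        // disk I/O), score it exactly, and keep the true top k.
        return candidates.stream()
                .map(id -> new Scored(id, squaredDistance(query, store.load(id))))
                .sorted(Comparator.comparingDouble(Scored::distance))
                .limit(k)
                .map(Scored::ordinal)
                .toList();
    }

    static double squaredDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```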
Search Performance with Overquery Factor = 5

| Subspaces | Time (ms) | Recall | Re-ranked | Visited Nodes | Disk I/O |
|---|---|---|---|---|---|
| 0 (Full) | 10.697 | 1.0 | 0 | 906.5 | 57.5 |
| 16 | 13.167 | 0.4 | 250 | 2,864.7 | 253.2 |
| 64 | 16.315 | 0.9 | 250 | 2,832.4 | 252.4 |
| 192 | 21.060 | 1.0 | 250 | 3,644.5 | 269.9 |
Key observations:

- Overquerying recovers much of the lost recall: 64 subspaces climb from 0.20 to 0.9, and 192 subspaces reach full recall.
- Even with a 5× overquery factor, 16 subspaces remain unusable at 0.4 recall.
- The wider candidate set is not free: quantized searches visit roughly 3× more nodes and incur roughly 4-5× more disk I/O than the full-precision baseline, and query latency grows with the number of subspaces.
The relationship between memory usage, query performance, and recall quality becomes clearer when visualized:
Dual-axis view showing query time vs memory usage across different PQ configurations
Performance vs memory scatter plot with recall quality indicated by point size and color
The interactive version allows you to switch between dual-axis and scatter plot views, with detailed tooltips showing compression ratios, recall percentages, and memory footprints for each configuration.
Product Quantization isn't a silver bullet; it's a sophisticated balancing act. The technique proves most valuable when the full-precision dataset cannot fit in RAM, when index construction speed matters, and when a full-precision re-ranking pass is available to recover recall.
Quantization success depends on carefully tuning three competing factors:
Compression Ratio ↔ Search Quality ↔ Query Performance
Real-world deployments require benchmarking on your own data, tuning the subspace count and overquery factor together, and budgeting for the extra disk I/O that re-ranking introduces.
The next generation of quantization techniques aims to address PQ's limitations, with directions such as optimized product quantization, additive and composite quantization, and anisotropic vector quantization (see the references below).
As we continue building jVector and integrating with OpenSearch, the goal remains finding the optimal balance between the competing demands of scale, speed, and quality in production vector search systems.
This analysis is based on extensive benchmarking with jVector. Your mileage may vary depending on dataset characteristics, hardware configuration, and application requirements. Always benchmark on your specific use case before making production decisions.
Next in the series: Non-linear Vector Quantization and the future of memory-efficient vector search.
Jégou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117-128.
Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836.
Subramanya, S. J., Devvrit, F., Simhadri, H. V., Krishnaswamy, R., & Kadekodi, R. (2019). DiskANN: Fast accurate billion-point nearest neighbor search on a single node. Advances in Neural Information Processing Systems, 32.
Chen, Q., Zhao, B., Wang, H., Li, M., Liu, C., Li, Z., … & Wang, J. (2021). SPANN: Highly-efficient billion-scale approximate nearest neighbor search. Advances in Neural Information Processing Systems, 34, 5199-5212.
Babenko, A., & Lempitsky, V. (2014). Additive quantization for extreme vector compression. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 931-938.
Ge, T., He, K., Ke, Q., & Sun, J. (2013). Optimized product quantization for approximate nearest neighbor search. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2946-2953.
Zhang, T., Du, C., & Wang, J. (2014). Composite quantization for approximate nearest neighbor search. International Conference on Machine Learning, 838-846.
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535-547.
Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., & Kumar, S. (2020). Accelerating large-scale inference with anisotropic vector quantization. International Conference on Machine Learning, 3887-3896.
jVector GitHub Repository - Open-source Java vector search library. Available at: https://github.com/datastax/jvector