I recently began a personal project exploring multi-modal search, which led me to investigate cutting-edge embedding models and vector search solutions. Vector search (often called KNN or K Nearest Neighbors) identifies the most similar vectors to a query vector, enabling us to find closely related images or semantically similar text.
My work on jVector introduced me to ANN (Approximate Nearest Neighbor) indexing—a more efficient but less accurate alternative to KNN that makes high-dimensional vector search feasible at scale. What had previously been an abstract concept became a concrete challenge my team at Datastax was actively addressing.
In this article, I’ll share my vector search journey, including discoveries about CPU/GPU instructions, performance profiling, and the reality behind hardware acceleration claims for vector search operations.
Before diving deeper, let’s question why hardware acceleration matters for vector search. While ANN indices offer $O(\log n)$ complexity with “slight” accuracy reductions, this simplification overlooks key considerations.
These accuracy tradeoffs are measured in “recall” (how well ANN results match KNN results), but KNN itself isn’t perfect. Its accuracy depends entirely on the embedding model’s quality. Consider the popular text embedding model all-MiniLM-L6-v2, which averages only 68.06% accuracy.
Starting with this inherent limitation raises a critical question: why add an indexing layer that further reduces accuracy? Perhaps brute force KNN, while $O(n)$ in complexity, could be viable with proper hardware acceleration rather than compounding accuracy losses through approximation methods.
To properly assess whether KNN is the right approach, we need to weigh three key factors, the first of which is computational cost.
This article (Part I of the series) will focus exclusively on that first factor. If we can demonstrate that KNN’s compute requirements are actually manageable, we might avoid unnecessary complexity: why implement additional layers if the simpler approach is sufficient?
We will denote the asymptotically linear runtime complexity as $O(N)$, where $N = n \cdot d$, with $n$ being the number of vectors in the dataset and $d$ the dimensionality of each vector.
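To get a feel for the scale (illustrative numbers, not taken from the benchmark below): for a dataset of one million vectors at 384 dimensions each, a single brute force query has to evaluate
$N = n \cdot d = 10^6 \times 384 \approx 3.8 \times 10^8$
difference/square/accumulate steps, and that cost is paid again for every incoming query.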
Take the Euclidean distance between two vectors ${p}$ and ${q}$ as our case study:
$d(p,q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2}$
It nicely subsumes all of the other common distance computations in some form:
$\text{Euclidean Distance: } d(p,q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2}$
$\text{Cosine Similarity: } \cos(\theta) = \frac{p \cdot q}{|p| |q|} = \frac{\sum_{i=1}^n p_i q_i}{\sqrt{\sum_{i=1}^n p_i^2} \sqrt{\sum_{i=1}^n q_i^2}}$
$\text{Vector Normalization: } \hat{p} = \frac{p}{|p|} = \frac{p}{\sqrt{\sum_{i=1}^n p_i^2}}$
$\text{Relationship: } d^2(p,q) = |p|^2 + |q|^2 - 2|p||q|\cos(\theta)$
So if we optimize the Euclidean distance well, the same work should largely carry over to the other computations.
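To make that claim concrete, here is a one-line derivation from the relationship above, under the assumption that the vectors are unit-normalized (which the normalization formula lets us enforce):
$|p| = |q| = 1 \;\Rightarrow\; d^2(p,q) = 1 + 1 - 2\cos(\theta) = 2 - 2\cos(\theta)$
so for normalized embeddings, ranking neighbors by squared Euclidean distance produces exactly the same order as ranking them by cosine similarity.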
First, let’s analyze the naive implementation of the Euclidean distance; later we can consider possible optimizations.
For clarity, let’s use C++ as the implementation language so we can better reason about the low-level overhead of the CPU instructions.
Let’s define an interface for Euclidean distance calculations in a header file DistanceCalculator.h and the various implementations in DistanceCalculator.cpp:
DistanceCalculator.h
#pragma once

#include <vector>

class DistanceCalculator {
public:
    /**
     * Calculates the Euclidean distance between two vectors.
     *
     * The Euclidean distance is computed as the square root of the
     * squared Euclidean distance between the two vectors.
     *
     * @param v1 The first vector.
     * @param v2 The second vector.
     * @return The Euclidean distance between the two vectors.
     */
    static float calculateEuclideanDistance(const std::vector<float> &v1, const std::vector<float> &v2);

    /**
     * Calculates the squared Euclidean distance between two vectors.
     * The optimized variants introduced later in this article are declared alongside it.
     */
    static float calculateSquaredEuclideanDistance(const std::vector<float> &v1, const std::vector<float> &v2);
};
DistanceCalculator.cpp
#include "DistanceCalculator.h"

#include <cassert>
#include <cmath>

#include <arm_neon.h> // NEON intrinsics, used by the SIMD variant further below

float DistanceCalculator::calculateEuclideanDistance(const std::vector<float> &v1, const std::vector<float> &v2) {
    return std::sqrt(calculateSquaredEuclideanDistance(v1, v2));
}
A couple of quick observations immediately streamline the computation: the square root is monotonic, so for comparing and ranking neighbors we can work with the squared distance and defer the sqrt, which is why the squared distance gets its own routine.
For the naive implementation of $L2$ we are simply going to translate the squared distance formula $d^2(p,q) = (p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2$ literally into code and add the following implementation:
DistanceCalculator.cpp
float DistanceCalculator::calculateSquaredEuclideanDistance(const std::vector<float> &v1,
                                                            const std::vector<float> &v2) {
    float distance = 0.0f;
    // A literal translation of the formula: accumulate the squared per-dimension differences.
    for (size_t i = 0; i < v1.size(); i++) {
        distance += std::pow(v1.at(i) - v2.at(i), 2);
    }
    return distance;
}
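As a quick sanity check, here is a small, hypothetical driver (not part of the article’s code) using the classic 3-4-5 right triangle, where the expected squared distance is 25 and the distance is 5:

#include <iostream>
#include <vector>

#include "DistanceCalculator.h"

int main() {
    const std::vector<float> v1 = {1.0f, 2.0f, 3.0f};
    const std::vector<float> v2 = {4.0f, 6.0f, 3.0f}; // per-dimension differences: 3, 4, 0

    // 3^2 + 4^2 + 0^2 = 25, and sqrt(25) = 5.
    std::cout << DistanceCalculator::calculateSquaredEuclideanDistance(v1, v2) << "\n"; // 25
    std::cout << DistanceCalculator::calculateEuclideanDistance(v1, v2) << "\n";        // 5
    return 0;
}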
In the following sections we will look at possible optimizations, with and without various hardware acceleration techniques, and then run a benchmark and analyze its results.
The naive example presented in the Naive Implementation section may be optimized differently depending on the compiler used. Still, it is worth going over the manual optimizations that can be applied in the code and later evaluating them in our benchmark, to see whether they are worth the readability overhead they add to our code.
Note: while the optimizations below might make perfect sense in theory, we will later benchmark and test each one of them to confirm that they actually hold up!
One issue with a naive implementation like the one above is the repeated conditional check for loop termination, i < v1.size(), on every iteration.
A common technique used in this case is known as “loop unrolling”: the loop advances in wider increments, processing several elements per iteration, so the overhead of the termination check is amortized over more work.
Let’s add a typical loop unrolling implementation:
DistanceCalculator.cpp
float DistanceCalculator::calculateSquaredEuclideanDistanceWithLoopUnrolling(const std::vector<float> &v1,
const std::vector<float> &v2) {
assert(v1.size() == v2.size()); // Ensure both vectors have the same size
size_t size = v1.size();
size_t unrolledSize = size - (size % 4);
float distance = 0.0f;
for (size_t i = 0; i < unrolledSize; i += 4) {
float diff0 = v1[i] - v2[i];
float diff1 = v1[i + 1] - v2[i + 1];
float diff2 = v1[i + 2] - v2[i + 2];
float diff3 = v1[i + 3] - v2[i + 3];
distance += diff0 * diff0 + diff1 * diff1 + diff2 * diff2 + diff3 * diff3;
}
for (size_t i = unrolledSize; i < size; i++) {
float diff = v1[i] - v2[i];
distance += diff * diff;
}
return distance;
}
The above code reduces the number of loop termination checks executed by a factor of 4!
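One further refinement worth benchmarking is sketched below. It is my own illustrative variant rather than one of the article’s measured implementations, and the method name calculateSquaredEuclideanDistanceWithUnrolling4Acc is hypothetical: keeping four independent partial sums breaks the floating point dependency chain on the single accumulator, which can help the CPU overlap the additions.

float DistanceCalculator::calculateSquaredEuclideanDistanceWithUnrolling4Acc(const std::vector<float> &v1,
                                                                             const std::vector<float> &v2) {
    assert(v1.size() == v2.size()); // Ensure both vectors have the same size
    size_t size = v1.size();
    size_t unrolledSize = size - (size % 4);
    // Four independent accumulators, so consecutive additions do not wait on each other.
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (size_t i = 0; i < unrolledSize; i += 4) {
        float diff0 = v1[i] - v2[i];
        float diff1 = v1[i + 1] - v2[i + 1];
        float diff2 = v1[i + 2] - v2[i + 2];
        float diff3 = v1[i + 3] - v2[i + 3];
        acc0 += diff0 * diff0;
        acc1 += diff1 * diff1;
        acc2 += diff2 * diff2;
        acc3 += diff3 * diff3;
    }
    float distance = (acc0 + acc1) + (acc2 + acc3);
    // Scalar tail for the remaining elements.
    for (size_t i = unrolledSize; i < size; i++) {
        float diff = v1[i] - v2[i];
        distance += diff * diff;
    }
    return distance;
}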
With our unrolled loop implementation we were able to reduce the number of termination checks in the loop. Another optimization technique we can add on top of that is a hardware acceleration feature available in many modern CPUs and GPUs called “Single Instruction, Multiple Data” (SIMD for short).
SIMD allows a single CPU instruction to perform the same operation on multiple data points simultaneously, rather than processing them one by one. For L2 distance calculations this is particularly valuable, because the inner loop is exactly the kind of independent, element-wise arithmetic that maps cleanly onto SIMD lanes.
For the purposes of this basic evaluation I’m using an Apple Silicon (M-series) Mac, so I will use its native SIMD intrinsics, which target the ARM NEON instruction set and operate on a 128-bit register (4 floats) per operation. Keep in mind, however, that other popular SIMD instruction sets exist with different register widths, to name a few: SSE (128-bit), AVX/AVX2 (256-bit), and AVX-512 (512-bit) on x86, and SVE with scalable widths on newer ARM designs.
DistanceCalculator.cpp
float DistanceCalculator::calculateSquaredEuclideanDistanceWithSIMD(const std::vector<float> &v1,
const std::vector<float> &v2) {
/* 1. Assert the Dimensions:*/
assert(v1.size() == v2.size()); // Ensure both vectors are of the same size
/* 2. Determine Sizes: */
size_t dimensions = v1.size(); // The total size of the vectors.
    // The largest multiple of 4 that fits within the dimension count.
    // This is necessary because SIMD processes 4 floats at a time (128-bit NEON registers).
    size_t simdSize = dimensions - (dimensions % 4);
const auto* v1_ = v1.data();
const auto* v2_ = v2.data();
/* 3. Initialize Variables:*/
// To store the final squared Euclidean distance value.
float distance = 0.0f;
// A SIMD register initialized to zero.
float32x4_t sum = vdupq_n_f32(0.0f);
/* 4. Process 4 floats at a time using NEON */
    for (size_t i = 0; i < simdSize; i += 4) {
        // Load 4 consecutive floats of v1 into a SIMD register.
        const float32x4_t vec1 = vld1q_f32(v1_ + i);
        // Load 4 consecutive floats of v2 into a SIMD register.
        const float32x4_t vec2 = vld1q_f32(v2_ + i);
// Calculate differences
float32x4_t diff = vsubq_f32(vec1, vec2); // Performs element-wise subtraction of vec1 and vec2.
// Square differences and add to sum
// vmlaq_f32 performs multiply-accumulate: sum += diff * diff
sum = vmlaq_f32(sum, diff, diff);
}
distance = vaddvq_f32(sum); // Sum all elements in the vector
// Handle remaining elements (if dimensions not divisible by 4)
    for (size_t i = simdSize; i < dimensions; i++) {
const float diff = v1[i] - v2[i];
distance += diff * diff;
}
return distance;
}
Let’s take a moment to reason about the code above. Previously for vectors $A$ and $B$ the L2 calculation logic looked something like this (pseudocode):
For each dimension i:
Calculate difference: diff_i = A_i - B_i
Square the difference: squared_diff_i = diff_i * diff_i
Add to running sum: sum += squared_diff_i
Take square root of sum
And now it looks more like this:
Load multiple dimensions of A into SIMD register
Load multiple dimensions of B into SIMD register
Subtract B from A for all dimensions simultaneously
Square all differences simultaneously
Sum all squared differences (often using specialized horizontal sum instructions)
Take square root of result
The main difference here is that each load, subtraction, and multiply-accumulate now operates on four floats at once, so the hot loop issues roughly 4x fewer instructions than the scalar versions, which means (if everything works as expected) up to an additional 4x speedup.
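As an aside before we move on to the GPU: the intrinsics above are specific to ARM NEON. For x86 CPUs, a roughly equivalent sketch using <immintrin.h> would process 8 floats per instruction; this is illustrative only, not part of the benchmarked code, the free-function name is hypothetical, and it assumes a CPU with AVX2 and FMA support.

#include <immintrin.h> // AVX2 / FMA intrinsics
#include <cassert>
#include <vector>

float squaredEuclideanDistanceAVX2(const std::vector<float> &v1, const std::vector<float> &v2) {
    assert(v1.size() == v2.size());
    const size_t size = v1.size();
    const size_t simdSize = size - (size % 8); // AVX registers hold 8 floats (256 bits)
    __m256 sum = _mm256_setzero_ps();
    for (size_t i = 0; i < simdSize; i += 8) {
        const __m256 a = _mm256_loadu_ps(v1.data() + i);
        const __m256 b = _mm256_loadu_ps(v2.data() + i);
        const __m256 diff = _mm256_sub_ps(a, b);
        sum = _mm256_fmadd_ps(diff, diff, sum); // sum += diff * diff (fused multiply-add)
    }
    // Horizontal sum of the 8 lanes.
    alignas(32) float lanes[8];
    _mm256_store_ps(lanes, sum);
    float distance = 0.0f;
    for (float lane : lanes) distance += lane;
    // Scalar tail for the remaining elements.
    for (size_t i = simdSize; i < size; i++) {
        const float diff = v1[i] - v2[i];
        distance += diff * diff;
    }
    return distance;
}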
Similar to the hardware acceleration with SIMD approach, we might want to take the idea of parallelism of distance computations to the limit by leveraging a GPU.
While in theory the GPU sounds like the perfect candidate for accelerating distance computation, there are in fact hidden limitations in practice, as we will see shortly in the benchmarks.
We will come back to explore the balance between tradeoffs and advantages of GPU utilization for vector search computation later on when we go over the benchmarks and their results.
For now, let’s take a look at the implementation of GPU acceleration for the Euclidean distance on an Apple Silicon Mac.
First and foremost, we need to define a compute kernel for the Euclidean distance using Apple’s Metal framework:
SquaredDistanceKernel.metal
#include <metal_stdlib>
using namespace metal;
// Each GPU thread handles one dimension: it computes the squared difference
// for its index and writes it out; the final summation happens back on the CPU.
kernel void calculateSquaredEuclideanDistance(
    const device float *inA [[ buffer(0) ]],
    const device float *inB [[ buffer(1) ]],
    device float *squaredDiffs [[ buffer(2) ]],
    uint id [[ thread_position_in_grid ]]
) {
    float diff = inA[id] - inB[id];
    squaredDiffs[id] = diff * diff;
}
We then need to compile that kernel and invoke it from our calculator in the following manner:
DistanceCalculator.cpp
float DistanceCalculator::calculateSquaredEuclideanDistanceWithMetal(const std::vector<float> &v1,
const std::vector<float> &v2) const {
assert(v1.size() == v2.size()); // Ensure both vectors are of the same size
std::size_t vectorSize = v1.size();
assert(v1.size() == dimension_);
// Update the GPU buffers with new data by copying from host memory.
memcpy(bufferV1_->contents(), v1.data(), v1.size() * sizeof(float));
memcpy(bufferV2_->contents(), v2.data(), v2.size() * sizeof(float));
// Encode Metal commands
const auto commandBuffer = commandQueue_->commandBuffer();
const auto computeEncoder = commandBuffer->computeCommandEncoder();
computeEncoder->setComputePipelineState(pipelineState_);
computeEncoder->setBuffer(bufferV1_, 0, 0);
computeEncoder->setBuffer(bufferV2_, 0, 1);
computeEncoder->setBuffer(bufferDiffs_, 0, 2);
    // Dispatch threads: one thread per vector element.
    // Note: the kernel above has no bounds check, so this dispatch assumes the grid
    // (threadGroups * threadGroupSize) does not run past the output buffer; a more robust
    // version would pass the element count to the kernel and guard the write.
    auto threadGroupSize = MTL::Size::Make(pipelineState_->maxTotalThreadsPerThreadgroup(), 1, 1);
    auto threadGroups = MTL::Size::Make((vectorSize + threadGroupSize.width - 1) / threadGroupSize.width, 1, 1);
    computeEncoder->dispatchThreadgroups(threadGroups, threadGroupSize);
computeEncoder->endEncoding();
// Commit and wait for the computation to complete
commandBuffer->commit();
commandBuffer->waitUntilCompleted();
auto *diffs = static_cast<float *>(bufferDiffs_->contents());
float squaredDistance = 0.0f;
// Read the result
for (size_t i = 0; i < vectorSize; i++) {
squaredDistance += diffs[i];
}
return squaredDistance;
}
Note: to keep the code above concise, a lot of GPU initialization boilerplate was removed so we can focus on the core concepts of this article.
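For readers who want to reproduce this, here is a rough sketch of what that omitted initialization could look like with Apple’s metal-cpp wrappers. The constructor signature and the device_ member are my assumptions; the other member names mirror the ones used in the method above, and real code would also check the returned error and release the intermediate objects.

#include <Metal/Metal.hpp> // metal-cpp headers

// Hypothetical constructor; dimension_, commandQueue_, pipelineState_ and the three
// buffers correspond to the members used in calculateSquaredEuclideanDistanceWithMetal.
DistanceCalculator::DistanceCalculator(std::size_t dimension) : dimension_(dimension) {
    // Acquire the default GPU and a queue to submit work on.
    device_ = MTL::CreateSystemDefaultDevice();
    commandQueue_ = device_->newCommandQueue();

    // Load the compiled .metal library and build a compute pipeline for our kernel.
    NS::Error *error = nullptr;
    auto *library = device_->newDefaultLibrary();
    auto *function = library->newFunction(
        NS::String::string("calculateSquaredEuclideanDistance", NS::UTF8StringEncoding));
    pipelineState_ = device_->newComputePipelineState(function, &error);

    // Shared-storage buffers are visible to both CPU and GPU on Apple Silicon,
    // which is what makes the memcpy / contents() calls above work.
    const std::size_t bytes = dimension_ * sizeof(float);
    bufferV1_ = device_->newBuffer(bytes, MTL::ResourceStorageModeShared);
    bufferV2_ = device_->newBuffer(bytes, MTL::ResourceStorageModeShared);
    bufferDiffs_ = device_->newBuffer(bytes, MTL::ResourceStorageModeShared);
}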
Now it is time to actually collect empirical results for all of the algorithms above! But first, let’s take a moment to discuss benchmark methodology.
We have two groups of algorithms to test so far: the CPU-based implementations (naive, loop unrolling, and NEON SIMD) and the GPU-based Metal implementation.
We will sweep the parameters defined earlier, the vector dimensionality $d$ and the dataset size $n$, covering both typical embedding dimensions and deliberately extreme cases.
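For reference, a minimal timing harness along these lines might look like the following simplified sketch; it is not the exact code behind the published numbers, and the dimension and iteration counts are illustrative:

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#include "DistanceCalculator.h"

int main() {
    const std::size_t dimension = 1536;   // illustrative embedding size
    const std::size_t iterations = 100000;

    // Fill two vectors with reproducible random data.
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> v1(dimension), v2(dimension);
    for (std::size_t i = 0; i < dimension; i++) {
        v1[i] = dist(rng);
        v2[i] = dist(rng);
    }

    // Time repeated calls and keep the accumulated result so the compiler
    // cannot optimize the loop away.
    float checksum = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; i++) {
        checksum += DistanceCalculator::calculateSquaredEuclideanDistance(v1, v2);
    }
    const auto end = std::chrono::steady_clock::now();

    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "checksum: " << checksum
              << ", avg ns per call: " << ns / static_cast<double>(iterations) << "\n";
    return 0;
}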
The benchmark results visualization shows performance comparisons across different implementations:
View full interactive visualization
Performance scaling by dimension: As vector dimensions increase from 128 to 10,000, performance degrades across all methods, but at very different rates.
We began our investigation with a simple hypothesis: GPU architecture, with its massive parallelism, should theoretically excel at accelerating brute force KNN vector search operations. Our benchmark results, however, tell a more complex story.
While GPUs do indeed scale more effectively with high-dimensional vectors, their performance advantages only become significant at dimensions exceeding 10,000 and with very large datasets. This presents a practical challenge, as most modern embedding models operate in dimensional spaces well below this threshold—typically maxing out around 4,000 dimensions.
Surprisingly, our tests revealed that in these more common dimensional ranges, GPU implementations actually underperform compared to even basic CPU optimizations like loop unrolling. The overhead associated with data transfer to and from the GPU, combined with initialization costs, effectively negates the theoretical advantages for all but the most extreme use cases.
This unexpected outcome raises important questions for vector search optimization. Is there still untapped potential in GPU acceleration for KNN that our implementation hasn’t captured? Or should we pivot our focus toward approximate nearest neighbor (ANN) algorithms that might offer better performance-accuracy tradeoffs?
In my next article, I’ll explore these questions and determine whether GPU acceleration for vector search can be salvaged or if we need to look elsewhere for performance gains.