Similarity Search & Indexing β
Efficient algorithms and techniques for finding similar vectors at scale
π What is Similarity Search? β
Definition: The process of finding vectors in a database that are most similar to a query vector
Simple Analogy: Like finding people who look most similar to you in a crowd of thousands, but instead of visual features, we're comparing mathematical representations of meaning.
Core Concepts β
Distance Metrics β
Different ways to measure how "close" or "similar" two vectors are:
Cosine Similarity β
- Best for: Text embeddings, normalized data
- Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
- Formula: cos(ΞΈ) = (AΒ·B) / (|A||B|)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example vectors
vector_a = np.array([1, 2, 3])
vector_b = np.array([2, 4, 6])
vector_c = np.array([1, 0, -1])
# Calculate cosine similarities
sim_ab = cosine_similarity([vector_a], [vector_b])[0][0]
sim_ac = cosine_similarity([vector_a], [vector_c])[0][0]
print(f"Cosine similarity A-B: {sim_ab:.3f}") # Should be high (vectors are aligned)
print(f"Cosine similarity A-C: {sim_ac:.3f}") # Should be lowerEuclidean Distance β
- Best for: Continuous numerical data, spatial data
- Range: 0 to β (0 = identical, larger = more different)
- Formula: βΞ£(ai - bi)Β²
from scipy.spatial.distance import euclidean
# Calculate Euclidean distances
dist_ab = euclidean(vector_a, vector_b)
dist_ac = euclidean(vector_a, vector_c)
print(f"Euclidean distance A-B: {dist_ab:.3f}")
print(f"Euclidean distance A-C: {dist_ac:.3f}")Manhattan Distance (L1) β
- Best for: High-dimensional data, sparse vectors
- Range: 0 to β
- Formula: Ξ£|ai - bi|
from scipy.spatial.distance import manhattan
# Calculate Manhattan distances
dist_ab = manhattan(vector_a, vector_b)
dist_ac = manhattan(vector_a, vector_c)
print(f"Manhattan distance A-B: {dist_ab:.3f}")
print(f"Manhattan distance A-C: {dist_ac:.3f}")Dot Product β
- Best for: Fast computation, when magnitude matters
- Range: -β to β
- Formula: Ξ£(ai Γ bi)
# Calculate dot products
dot_ab = np.dot(vector_a, vector_b)
dot_ac = np.dot(vector_a, vector_c)
print(f"Dot product A-B: {dot_ab}")
print(f"Dot product A-C: {dot_ac}")Nearest Neighbor Search β
Exact vs Approximate Search β
Exact Search (k-NN):
- Guarantees finding the true k nearest neighbors
- Computational complexity: O(nΓd) where n = number of vectors, d = dimensions
- Too slow for large datasets
Approximate Search (ANN):
- Finds "good enough" nearest neighbors quickly
- Trade accuracy for speed
- Essential for large-scale applications
Indexing Algorithms β
HNSW (Hierarchical Navigable Small World) β
Popular algorithm used by many vector databases for fast approximate search.
ποΈ HNSW STRUCTURE
Layer 2: [A] β Sparse connections to distant nodes
β
Layer 1: [A]ββ[B]ββ[C] β Medium density connections
β β β
Layer 0: [A]ββ[B]ββ[C]ββ[D]ββ[E] β Dense local connections
Search Process:
1. Start at top layer with long-range connections
2. Navigate towards target region
3. Move down layers for finer search
4. Find exact neighbors in bottom layerHNSW Implementation Example β
import hnswlib
import numpy as np
# Create sample data
dim = 128 # Embedding dimension
num_elements = 10000
# Generate random vectors
data = np.random.random((num_elements, dim)).astype('float32')
# Initialize HNSW index
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
# Add data to index
labels = np.arange(num_elements)
index.add_items(data, labels)
# Set search parameters
index.set_ef(50) # Controls speed/accuracy tradeoff
# Perform search
query_vector = np.random.random(dim).astype('float32')
k = 5 # Number of nearest neighbors
labels, distances = index.knn_query(query_vector, k=k)
print(f"Found {k} nearest neighbors:")
for i, (label, distance) in enumerate(zip(labels[0], distances[0])):
print(f" {i+1}. Label: {label}, Distance: {distance:.4f}")IVF (Inverted File) β
Partitions the vector space into clusters and searches only relevant clusters.
import faiss
import numpy as np
# Create sample data
d = 128 # Embedding dimension
n = 10000 # Number of vectors
# Generate random vectors
vectors = np.random.random((n, d)).astype('float32')
# Create IVF index
nlist = 100 # Number of clusters
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
# Train the index (learn cluster centroids)
index.train(vectors)
# Add vectors to index
index.add(vectors)
# Search
k = 5
query = np.random.random((1, d)).astype('float32')
# Set search parameters
index.nprobe = 10 # Number of clusters to search
distances, indices = index.search(query, k)
print(f"Found {k} nearest neighbors:")
for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
print(f" {i+1}. Index: {idx}, Distance: {dist:.4f}")LSH (Locality Sensitive Hashing) β
Maps similar vectors to the same hash buckets with high probability.
from sklearn.neighbors import LSHForest
import numpy as np
# Note: LSHForest is deprecated, this is for illustration
# Modern implementations use libraries like Datasketch
def simple_lsh_example():
"""
Simple LSH example for binary vectors
"""
# Create binary vectors (for simplicity)
vectors = [
[1, 0, 1, 1, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 1],
[0, 1, 0, 1, 0],
[1, 1, 0, 0, 1]
]
# Simple hash function: sum of first 3 bits
def hash_func(vector):
return sum(vector[:3])
# Create hash buckets
buckets = {}
for i, vector in enumerate(vectors):
hash_val = hash_func(vector)
if hash_val not in buckets:
buckets[hash_val] = []
buckets[hash_val].append((i, vector))
print("LSH Buckets:")
for hash_val, items in buckets.items():
print(f" Bucket {hash_val}: {[item[0] for item in items]}")
# Query
query = [1, 0, 1, 0, 1]
query_hash = hash_func(query)
candidates = buckets.get(query_hash, [])
print(f"\nQuery: {query}")
print(f"Query hash: {query_hash}")
print(f"Candidates: {[item[0] for item in candidates]}")
simple_lsh_example()Search Performance Optimization β
Index Configuration β
def optimize_hnsw_parameters():
"""
Guide for optimizing HNSW parameters
"""
parameters = {
'M': {
'description': 'Maximum connections per node',
'values': '16 (balanced), 32 (higher recall), 8 (faster build)',
'impact': 'Higher M = better recall, more memory'
},
'ef_construction': {
'description': 'Size of candidate set during construction',
'values': '200 (default), 400 (better quality), 100 (faster)',
'impact': 'Higher ef_construction = better index quality, slower build'
},
'ef': {
'description': 'Size of candidate set during search',
'values': '50 (fast), 100 (balanced), 200 (high recall)',
'impact': 'Higher ef = better recall, slower search'
}
}
print("HNSW Parameter Optimization Guide:")
print("=" * 50)
for param, info in parameters.items():
print(f"\n{param}:")
print(f" Description: {info['description']}")
print(f" Recommended values: {info['values']}")
print(f" Impact: {info['impact']}")
optimize_hnsw_parameters()Batching and Parallel Search β
import concurrent.futures
import time
def parallel_search_example():
"""
Demonstrate parallel search for better throughput
"""
# Setup (using previous HNSW example)
dim = 128
num_elements = 10000
data = np.random.random((num_elements, dim)).astype('float32')
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))
index.set_ef(50)
# Generate query batch
num_queries = 100
queries = np.random.random((num_queries, dim)).astype('float32')
# Sequential search
start_time = time.time()
sequential_results = []
for query in queries:
labels, distances = index.knn_query(query, k=5)
sequential_results.append((labels, distances))
sequential_time = time.time() - start_time
# Parallel search
def search_single(query):
return index.knn_query(query, k=5)
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
parallel_results = list(executor.map(search_single, queries))
parallel_time = time.time() - start_time
print(f"Sequential search time: {sequential_time:.3f} seconds")
print(f"Parallel search time: {parallel_time:.3f} seconds")
print(f"Speedup: {sequential_time / parallel_time:.2f}x")
# Note: Uncomment to run
# parallel_search_example()Memory Management β
def memory_efficient_search():
"""
Techniques for memory-efficient similarity search
"""
tips = {
"Use appropriate data types": [
"float32 instead of float64 for embeddings",
"int32 for indices instead of int64",
"Consider float16 for storage if precision allows"
],
"Implement memory mapping": [
"Use memory-mapped files for large datasets",
"Load only active partitions",
"Implement LRU cache for frequently accessed data"
],
"Optimize index size": [
"Use product quantization for compression",
"Implement hierarchical clustering",
"Store only essential metadata"
],
"Batch processing": [
"Process queries in batches",
"Reuse computation across similar queries",
"Implement smart caching strategies"
]
}
print("Memory-Efficient Search Strategies:")
print("=" * 40)
for strategy, techniques in tips.items():
print(f"\n{strategy}:")
for technique in techniques:
print(f" β’ {technique}")
memory_efficient_search()Real-World Implementation Patterns β
Hybrid Search β
Combining multiple search strategies for better results.
def hybrid_search_example():
"""
Demonstrate hybrid search combining exact and approximate methods
"""
class HybridSearchEngine:
def __init__(self, embeddings, metadata):
self.embeddings = embeddings
self.metadata = metadata
self.setup_indexes()
def setup_indexes(self):
# HNSW for fast approximate search
dim = self.embeddings.shape[1]
self.hnsw_index = hnswlib.Index(space='cosine', dim=dim)
self.hnsw_index.init_index(
max_elements=len(self.embeddings),
ef_construction=200,
M=16
)
self.hnsw_index.add_items(self.embeddings, np.arange(len(self.embeddings)))
self.hnsw_index.set_ef(100)
# Build metadata index for filtering
self.metadata_index = {}
for i, meta in enumerate(self.metadata):
for key, value in meta.items():
if key not in self.metadata_index:
self.metadata_index[key] = {}
if value not in self.metadata_index[key]:
self.metadata_index[key] = []
self.metadata_index[key].append(i)
def search(self, query_vector, k=10, filters=None, use_exact=False):
if filters:
# Filter candidates based on metadata
candidates = self.apply_filters(filters)
if len(candidates) == 0:
return [], []
# Search within filtered candidates
candidate_embeddings = self.embeddings[candidates]
similarities = cosine_similarity([query_vector], candidate_embeddings)[0]
# Get top-k from candidates
top_indices = similarities.argsort()[-k:][::-1]
return [candidates[i] for i in top_indices], similarities[top_indices]
else:
# Use HNSW for unfiltered search
if use_exact:
# Exact search (for comparison)
similarities = cosine_similarity([query_vector], self.embeddings)[0]
top_indices = similarities.argsort()[-k:][::-1]
return top_indices.tolist(), similarities[top_indices]
else:
# Approximate search
labels, distances = self.hnsw_index.knn_query(query_vector, k=k)
return labels[0].tolist(), (1 - distances[0]) # Convert to similarity
def apply_filters(self, filters):
candidates = set(range(len(self.embeddings)))
for key, value in filters.items():
if key in self.metadata_index and value in self.metadata_index[key]:
candidates &= set(self.metadata_index[key])
else:
return [] # No matches for this filter
return list(candidates)
# Example usage
embeddings = np.random.random((1000, 128)).astype('float32')
metadata = [
{'category': 'tech', 'year': 2023, 'author': f'author_{i%10}'}
for i in range(1000)
]
engine = HybridSearchEngine(embeddings, metadata)
query = np.random.random(128).astype('float32')
# Unfiltered search
results, scores = engine.search(query, k=5)
print("Unfiltered search results:", results[:5])
# Filtered search
filtered_results, filtered_scores = engine.search(
query, k=5, filters={'category': 'tech', 'year': 2023}
)
print("Filtered search results:", filtered_results[:5])
# Note: Uncomment to run
# hybrid_search_example()Multi-Vector Search β
Searching across multiple vector spaces simultaneously.
def multi_vector_search_example():
"""
Search across multiple embedding spaces (e.g., text + image)
"""
class MultiModalSearchEngine:
def __init__(self):
self.text_embeddings = None
self.image_embeddings = None
self.text_index = None
self.image_index = None
def add_documents(self, text_embeddings, image_embeddings):
self.text_embeddings = text_embeddings
self.image_embeddings = image_embeddings
# Setup text index
text_dim = text_embeddings.shape[1]
self.text_index = hnswlib.Index(space='cosine', dim=text_dim)
self.text_index.init_index(
max_elements=len(text_embeddings),
ef_construction=200,
M=16
)
self.text_index.add_items(text_embeddings, np.arange(len(text_embeddings)))
# Setup image index
image_dim = image_embeddings.shape[1]
self.image_index = hnswlib.Index(space='cosine', dim=image_dim)
self.image_index.init_index(
max_elements=len(image_embeddings),
ef_construction=200,
M=16
)
self.image_index.add_items(image_embeddings, np.arange(len(image_embeddings)))
def search(self, text_query=None, image_query=None, k=10, weights=None):
if weights is None:
weights = {'text': 0.5, 'image': 0.5}
results = {}
if text_query is not None:
text_labels, text_distances = self.text_index.knn_query(text_query, k=k*2)
text_scores = 1 - text_distances[0] # Convert to similarity
results['text'] = {label: score for label, score in zip(text_labels[0], text_scores)}
if image_query is not None:
image_labels, image_distances = self.image_index.knn_query(image_query, k=k*2)
image_scores = 1 - image_distances[0] # Convert to similarity
results['image'] = {label: score for label, score in zip(image_labels[0], image_scores)}
# Combine scores
combined_scores = {}
all_labels = set()
if 'text' in results:
all_labels.update(results['text'].keys())
if 'image' in results:
all_labels.update(results['image'].keys())
for label in all_labels:
score = 0
if 'text' in results and label in results['text']:
score += weights['text'] * results['text'][label]
if 'image' in results and label in results['image']:
score += weights['image'] * results['image'][label]
combined_scores[label] = score
# Sort by combined score
sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
return [label for label, score in sorted_results[:k]], [score for label, score in sorted_results[:k]]
print("Multi-modal search engine framework created")
print("Can search across text and image embeddings simultaneously")
multi_vector_search_example()Performance Monitoring β
Search Quality Metrics β
def evaluate_search_quality():
"""
Methods to evaluate and monitor search quality
"""
def calculate_recall_at_k(ground_truth, predicted, k):
"""Calculate recall@k for search results"""
if len(ground_truth) == 0:
return 0.0
predicted_k = predicted[:k]
relevant_found = len(set(ground_truth) & set(predicted_k))
return relevant_found / len(ground_truth)
def calculate_precision_at_k(ground_truth, predicted, k):
"""Calculate precision@k for search results"""
if k == 0:
return 0.0
predicted_k = predicted[:k]
relevant_found = len(set(ground_truth) & set(predicted_k))
return relevant_found / k
def calculate_map_at_k(ground_truths, predictions, k):
"""Calculate Mean Average Precision@k"""
aps = []
for gt, pred in zip(ground_truths, predictions):
ap = 0.0
relevant_count = 0
for i, item in enumerate(pred[:k]):
if item in gt:
relevant_count += 1
precision_at_i = relevant_count / (i + 1)
ap += precision_at_i
if len(gt) > 0:
ap /= min(len(gt), k)
aps.append(ap)
return np.mean(aps)
# Example evaluation
ground_truth = [1, 3, 5, 7]
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
k = 5
recall = calculate_recall_at_k(ground_truth, predicted, k)
precision = calculate_precision_at_k(ground_truth, predicted, k)
print(f"Recall@{k}: {recall:.3f}")
print(f"Precision@{k}: {precision:.3f}")
evaluate_search_quality()Performance Benchmarking β
def benchmark_search_methods():
"""
Benchmark different search methods
"""
import time
# Generate test data
dim = 256
n_docs = 10000
n_queries = 100
documents = np.random.random((n_docs, dim)).astype('float32')
queries = np.random.random((n_queries, dim)).astype('float32')
results = {}
# Exact search (brute force)
start_time = time.time()
for query in queries[:10]: # Only test with 10 queries for exact search
similarities = cosine_similarity([query], documents)[0]
top_indices = similarities.argsort()[-5:][::-1]
exact_time = (time.time() - start_time) / 10 # Average per query
results['Exact Search'] = {'time_per_query': exact_time * 1000, 'accuracy': 1.0}
# HNSW search
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=16)
build_start = time.time()
index.add_items(documents, np.arange(n_docs))
build_time = time.time() - build_start
index.set_ef(50)
start_time = time.time()
for query in queries:
labels, distances = index.knn_query(query, k=5)
hnsw_time = (time.time() - start_time) / n_queries
results['HNSW'] = {
'time_per_query': hnsw_time * 1000,
'build_time': build_time,
'accuracy': 0.95 # Approximate
}
# Print results
print("Search Method Benchmark:")
print("=" * 50)
print(f"{'Method':<15} {'Time/Query (ms)':<15} {'Build Time (s)':<15} {'Accuracy':<10}")
print("-" * 65)
for method, metrics in results.items():
build_time_str = f"{metrics.get('build_time', 0):.2f}" if 'build_time' in metrics else "N/A"
print(f"{method:<15} {metrics['time_per_query']:<15.2f} {build_time_str:<15} {metrics['accuracy']:<10.2f}")
# Note: Uncomment to run full benchmark
# benchmark_search_methods()π― Key Takeaways β
Algorithm Selection β
- Exact search: Use for small datasets or when perfect accuracy is required
- HNSW: Best general-purpose approximate search algorithm
- IVF: Good for very large datasets with clustering structure
- LSH: Effective for high-dimensional sparse data
Performance Optimization β
- Index parameters: Tune based on your accuracy/speed requirements
- Distance metrics: Choose based on your data characteristics
- Memory management: Critical for large-scale deployments
- Parallel processing: Essential for high-throughput applications
Quality vs Speed β
- Always measure both: Track accuracy metrics alongside performance
- User experience matters: Sometimes "good enough" is better than perfect
- Iterate and improve: Start simple, optimize based on real usage patterns
Next Steps:
- Vector Storage Patterns: Learn how to organize and store vectors efficiently
- Vector Databases: Explore database systems optimized for vector operations
- RAG Implementation: See how similarity search powers retrieval-augmented generation