Similarity Search & Indexing

Efficient algorithms and techniques for finding similar vectors at scale

🔍 What is Similarity Search?

Definition: The process of finding vectors in a database that are most similar to a query vector

Simple Analogy: Like finding people who look most similar to you in a crowd of thousands, but instead of visual features, we're comparing mathematical representations of meaning.

Core Concepts

Distance Metrics

Different ways to measure how "close" or "similar" two vectors are:

Cosine Similarity

Best for: Text embeddings, normalized data
Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
Formula: cos(θ) = (A·B) / (|A||B|)

python

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example vectors
vector_a = np.array([1, 2, 3])
vector_b = np.array([2, 4, 6])
vector_c = np.array([1, 0, -1])

# Calculate cosine similarities
sim_ab = cosine_similarity([vector_a], [vector_b])[0][0]
sim_ac = cosine_similarity([vector_a], [vector_c])[0][0]

print(f"Cosine similarity A-B: {sim_ab:.3f}")  # Should be high (vectors are aligned)
print(f"Cosine similarity A-C: {sim_ac:.3f}")  # Should be lower

Euclidean Distance

Best for: Continuous numerical data, spatial data
Range: 0 to ∞ (0 = identical, larger = more different)
Formula: √Σ(ai - bi)²

python

from scipy.spatial.distance import euclidean

# Calculate Euclidean distances
dist_ab = euclidean(vector_a, vector_b)
dist_ac = euclidean(vector_a, vector_c)

print(f"Euclidean distance A-B: {dist_ab:.3f}")
print(f"Euclidean distance A-C: {dist_ac:.3f}")

Manhattan Distance (L1)

Best for: High-dimensional data, sparse vectors
Range: 0 to ∞
Formula: Σ|ai - bi|

python

from scipy.spatial.distance import manhattan

# Calculate Manhattan distances
dist_ab = manhattan(vector_a, vector_b)
dist_ac = manhattan(vector_a, vector_c)

print(f"Manhattan distance A-B: {dist_ab:.3f}")
print(f"Manhattan distance A-C: {dist_ac:.3f}")

Dot Product

Best for: Fast computation, when magnitude matters
Range: -∞ to ∞
Formula: Σ(ai × bi)

python

# Calculate dot products
dot_ab = np.dot(vector_a, vector_b)
dot_ac = np.dot(vector_a, vector_c)

print(f"Dot product A-B: {dot_ab}")
print(f"Dot product A-C: {dot_ac}")

Nearest Neighbor Search

Exact vs Approximate Search

Exact Search (k-NN):

Guarantees finding the true k nearest neighbors
Computational complexity: O(n×d) where n = number of vectors, d = dimensions
Too slow for large datasets

Approximate Search (ANN):

Finds "good enough" nearest neighbors quickly
Trade accuracy for speed
Essential for large-scale applications

Indexing Algorithms

HNSW (Hierarchical Navigable Small World)

Popular algorithm used by many vector databases for fast approximate search.

text

🏗️ HNSW STRUCTURE

Layer 2: [A] ← Sparse connections to distant nodes
         ↓
Layer 1: [A]──[B]──[C] ← Medium density connections
         ↓     ↓    ↓
Layer 0: [A]──[B]──[C]──[D]──[E] ← Dense local connections

Search Process:
1. Start at top layer with long-range connections
2. Navigate towards target region
3. Move down layers for finer search
4. Find exact neighbors in bottom layer

HNSW Implementation Example

python

import hnswlib
import numpy as np

# Create sample data
dim = 128  # Embedding dimension
num_elements = 10000

# Generate random vectors
data = np.random.random((num_elements, dim)).astype('float32')

# Initialize HNSW index
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Add data to index
labels = np.arange(num_elements)
index.add_items(data, labels)

# Set search parameters
index.set_ef(50)  # Controls speed/accuracy tradeoff

# Perform search
query_vector = np.random.random(dim).astype('float32')
k = 5  # Number of nearest neighbors

labels, distances = index.knn_query(query_vector, k=k)
print(f"Found {k} nearest neighbors:")
for i, (label, distance) in enumerate(zip(labels[0], distances[0])):
    print(f"  {i+1}. Label: {label}, Distance: {distance:.4f}")

IVF (Inverted File)

Partitions the vector space into clusters and searches only relevant clusters.

python

import faiss
import numpy as np

# Create sample data
d = 128  # Embedding dimension
n = 10000  # Number of vectors

# Generate random vectors
vectors = np.random.random((n, d)).astype('float32')

# Create IVF index
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)

# Train the index (learn cluster centroids)
index.train(vectors)

# Add vectors to index
index.add(vectors)

# Search
k = 5
query = np.random.random((1, d)).astype('float32')

# Set search parameters
index.nprobe = 10  # Number of clusters to search

distances, indices = index.search(query, k)
print(f"Found {k} nearest neighbors:")
for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    print(f"  {i+1}. Index: {idx}, Distance: {dist:.4f}")

LSH (Locality Sensitive Hashing)

Maps similar vectors to the same hash buckets with high probability.

python

from sklearn.neighbors import LSHForest
import numpy as np

# Note: LSHForest is deprecated, this is for illustration
# Modern implementations use libraries like Datasketch

def simple_lsh_example():
    """
    Simple LSH example for binary vectors
    """
    # Create binary vectors (for simplicity)
    vectors = [
        [1, 0, 1, 1, 0],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 1],
        [0, 1, 0, 1, 0],
        [1, 1, 0, 0, 1]
    ]
    
    # Simple hash function: sum of first 3 bits
    def hash_func(vector):
        return sum(vector[:3])
    
    # Create hash buckets
    buckets = {}
    for i, vector in enumerate(vectors):
        hash_val = hash_func(vector)
        if hash_val not in buckets:
            buckets[hash_val] = []
        buckets[hash_val].append((i, vector))
    
    print("LSH Buckets:")
    for hash_val, items in buckets.items():
        print(f"  Bucket {hash_val}: {[item[0] for item in items]}")
    
    # Query
    query = [1, 0, 1, 0, 1]
    query_hash = hash_func(query)
    candidates = buckets.get(query_hash, [])
    
    print(f"\nQuery: {query}")
    print(f"Query hash: {query_hash}")
    print(f"Candidates: {[item[0] for item in candidates]}")

simple_lsh_example()

Search Performance Optimization

Index Configuration

python

def optimize_hnsw_parameters():
    """
    Guide for optimizing HNSW parameters
    """
    parameters = {
        'M': {
            'description': 'Maximum connections per node',
            'values': '16 (balanced), 32 (higher recall), 8 (faster build)',
            'impact': 'Higher M = better recall, more memory'
        },
        'ef_construction': {
            'description': 'Size of candidate set during construction',
            'values': '200 (default), 400 (better quality), 100 (faster)',
            'impact': 'Higher ef_construction = better index quality, slower build'
        },
        'ef': {
            'description': 'Size of candidate set during search',
            'values': '50 (fast), 100 (balanced), 200 (high recall)',
            'impact': 'Higher ef = better recall, slower search'
        }
    }
    
    print("HNSW Parameter Optimization Guide:")
    print("=" * 50)
    for param, info in parameters.items():
        print(f"\n{param}:")
        print(f"  Description: {info['description']}")
        print(f"  Recommended values: {info['values']}")
        print(f"  Impact: {info['impact']}")

optimize_hnsw_parameters()

Batching and Parallel Search

python

import concurrent.futures
import time

def parallel_search_example():
    """
    Demonstrate parallel search for better throughput
    """
    # Setup (using previous HNSW example)
    dim = 128
    num_elements = 10000
    data = np.random.random((num_elements, dim)).astype('float32')
    
    index = hnswlib.Index(space='cosine', dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))
    index.set_ef(50)
    
    # Generate query batch
    num_queries = 100
    queries = np.random.random((num_queries, dim)).astype('float32')
    
    # Sequential search
    start_time = time.time()
    sequential_results = []
    for query in queries:
        labels, distances = index.knn_query(query, k=5)
        sequential_results.append((labels, distances))
    sequential_time = time.time() - start_time
    
    # Parallel search
    def search_single(query):
        return index.knn_query(query, k=5)
    
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(search_single, queries))
    parallel_time = time.time() - start_time
    
    print(f"Sequential search time: {sequential_time:.3f} seconds")
    print(f"Parallel search time: {parallel_time:.3f} seconds")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")

# Note: Uncomment to run
# parallel_search_example()

Memory Management

python

def memory_efficient_search():
    """
    Techniques for memory-efficient similarity search
    """
    tips = {
        "Use appropriate data types": [
            "float32 instead of float64 for embeddings",
            "int32 for indices instead of int64",
            "Consider float16 for storage if precision allows"
        ],
        "Implement memory mapping": [
            "Use memory-mapped files for large datasets",
            "Load only active partitions",
            "Implement LRU cache for frequently accessed data"
        ],
        "Optimize index size": [
            "Use product quantization for compression",
            "Implement hierarchical clustering",
            "Store only essential metadata"
        ],
        "Batch processing": [
            "Process queries in batches",
            "Reuse computation across similar queries",
            "Implement smart caching strategies"
        ]
    }
    
    print("Memory-Efficient Search Strategies:")
    print("=" * 40)
    for strategy, techniques in tips.items():
        print(f"\n{strategy}:")
        for technique in techniques:
            print(f"  • {technique}")

memory_efficient_search()

Real-World Implementation Patterns

Hybrid Search

Combining multiple search strategies for better results.

python

def hybrid_search_example():
    """
    Demonstrate hybrid search combining exact and approximate methods
    """
    class HybridSearchEngine:
        def __init__(self, embeddings, metadata):
            self.embeddings = embeddings
            self.metadata = metadata
            self.setup_indexes()
        
        def setup_indexes(self):
            # HNSW for fast approximate search
            dim = self.embeddings.shape[1]
            self.hnsw_index = hnswlib.Index(space='cosine', dim=dim)
            self.hnsw_index.init_index(
                max_elements=len(self.embeddings),
                ef_construction=200,
                M=16
            )
            self.hnsw_index.add_items(self.embeddings, np.arange(len(self.embeddings)))
            self.hnsw_index.set_ef(100)
            
            # Build metadata index for filtering
            self.metadata_index = {}
            for i, meta in enumerate(self.metadata):
                for key, value in meta.items():
                    if key not in self.metadata_index:
                        self.metadata_index[key] = {}
                    if value not in self.metadata_index[key]:
                        self.metadata_index[key] = []
                    self.metadata_index[key].append(i)
        
        def search(self, query_vector, k=10, filters=None, use_exact=False):
            if filters:
                # Filter candidates based on metadata
                candidates = self.apply_filters(filters)
                if len(candidates) == 0:
                    return [], []
                
                # Search within filtered candidates
                candidate_embeddings = self.embeddings[candidates]
                similarities = cosine_similarity([query_vector], candidate_embeddings)[0]
                
                # Get top-k from candidates
                top_indices = similarities.argsort()[-k:][::-1]
                return [candidates[i] for i in top_indices], similarities[top_indices]
            
            else:
                # Use HNSW for unfiltered search
                if use_exact:
                    # Exact search (for comparison)
                    similarities = cosine_similarity([query_vector], self.embeddings)[0]
                    top_indices = similarities.argsort()[-k:][::-1]
                    return top_indices.tolist(), similarities[top_indices]
                else:
                    # Approximate search
                    labels, distances = self.hnsw_index.knn_query(query_vector, k=k)
                    return labels[0].tolist(), (1 - distances[0])  # Convert to similarity
        
        def apply_filters(self, filters):
            candidates = set(range(len(self.embeddings)))
            
            for key, value in filters.items():
                if key in self.metadata_index and value in self.metadata_index[key]:
                    candidates &= set(self.metadata_index[key])
                else:
                    return []  # No matches for this filter
            
            return list(candidates)
    
    # Example usage
    embeddings = np.random.random((1000, 128)).astype('float32')
    metadata = [
        {'category': 'tech', 'year': 2023, 'author': f'author_{i%10}'}
        for i in range(1000)
    ]
    
    engine = HybridSearchEngine(embeddings, metadata)
    
    query = np.random.random(128).astype('float32')
    
    # Unfiltered search
    results, scores = engine.search(query, k=5)
    print("Unfiltered search results:", results[:5])
    
    # Filtered search
    filtered_results, filtered_scores = engine.search(
        query, k=5, filters={'category': 'tech', 'year': 2023}
    )
    print("Filtered search results:", filtered_results[:5])

# Note: Uncomment to run
# hybrid_search_example()

Multi-Vector Search

Searching across multiple vector spaces simultaneously.

python

def multi_vector_search_example():
    """
    Search across multiple embedding spaces (e.g., text + image)
    """
    class MultiModalSearchEngine:
        def __init__(self):
            self.text_embeddings = None
            self.image_embeddings = None
            self.text_index = None
            self.image_index = None
        
        def add_documents(self, text_embeddings, image_embeddings):
            self.text_embeddings = text_embeddings
            self.image_embeddings = image_embeddings
            
            # Setup text index
            text_dim = text_embeddings.shape[1]
            self.text_index = hnswlib.Index(space='cosine', dim=text_dim)
            self.text_index.init_index(
                max_elements=len(text_embeddings),
                ef_construction=200,
                M=16
            )
            self.text_index.add_items(text_embeddings, np.arange(len(text_embeddings)))
            
            # Setup image index
            image_dim = image_embeddings.shape[1]
            self.image_index = hnswlib.Index(space='cosine', dim=image_dim)
            self.image_index.init_index(
                max_elements=len(image_embeddings),
                ef_construction=200,
                M=16
            )
            self.image_index.add_items(image_embeddings, np.arange(len(image_embeddings)))
        
        def search(self, text_query=None, image_query=None, k=10, weights=None):
            if weights is None:
                weights = {'text': 0.5, 'image': 0.5}
            
            results = {}
            
            if text_query is not None:
                text_labels, text_distances = self.text_index.knn_query(text_query, k=k*2)
                text_scores = 1 - text_distances[0]  # Convert to similarity
                results['text'] = {label: score for label, score in zip(text_labels[0], text_scores)}
            
            if image_query is not None:
                image_labels, image_distances = self.image_index.knn_query(image_query, k=k*2)
                image_scores = 1 - image_distances[0]  # Convert to similarity
                results['image'] = {label: score for label, score in zip(image_labels[0], image_scores)}
            
            # Combine scores
            combined_scores = {}
            all_labels = set()
            
            if 'text' in results:
                all_labels.update(results['text'].keys())
            if 'image' in results:
                all_labels.update(results['image'].keys())
            
            for label in all_labels:
                score = 0
                if 'text' in results and label in results['text']:
                    score += weights['text'] * results['text'][label]
                if 'image' in results and label in results['image']:
                    score += weights['image'] * results['image'][label]
                combined_scores[label] = score
            
            # Sort by combined score
            sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
            return [label for label, score in sorted_results[:k]], [score for label, score in sorted_results[:k]]
    
    print("Multi-modal search engine framework created")
    print("Can search across text and image embeddings simultaneously")

multi_vector_search_example()

Performance Monitoring

Search Quality Metrics

python

def evaluate_search_quality():
    """
    Methods to evaluate and monitor search quality
    """
    
    def calculate_recall_at_k(ground_truth, predicted, k):
        """Calculate recall@k for search results"""
        if len(ground_truth) == 0:
            return 0.0
        
        predicted_k = predicted[:k]
        relevant_found = len(set(ground_truth) & set(predicted_k))
        return relevant_found / len(ground_truth)
    
    def calculate_precision_at_k(ground_truth, predicted, k):
        """Calculate precision@k for search results"""
        if k == 0:
            return 0.0
        
        predicted_k = predicted[:k]
        relevant_found = len(set(ground_truth) & set(predicted_k))
        return relevant_found / k
    
    def calculate_map_at_k(ground_truths, predictions, k):
        """Calculate Mean Average Precision@k"""
        aps = []
        for gt, pred in zip(ground_truths, predictions):
            ap = 0.0
            relevant_count = 0
            
            for i, item in enumerate(pred[:k]):
                if item in gt:
                    relevant_count += 1
                    precision_at_i = relevant_count / (i + 1)
                    ap += precision_at_i
            
            if len(gt) > 0:
                ap /= min(len(gt), k)
            aps.append(ap)
        
        return np.mean(aps)
    
    # Example evaluation
    ground_truth = [1, 3, 5, 7]
    predicted = [1, 2, 3, 4, 5, 6, 7, 8]
    k = 5
    
    recall = calculate_recall_at_k(ground_truth, predicted, k)
    precision = calculate_precision_at_k(ground_truth, predicted, k)
    
    print(f"Recall@{k}: {recall:.3f}")
    print(f"Precision@{k}: {precision:.3f}")

evaluate_search_quality()

Performance Benchmarking

python

def benchmark_search_methods():
    """
    Benchmark different search methods
    """
    import time
    
    # Generate test data
    dim = 256
    n_docs = 10000
    n_queries = 100
    
    documents = np.random.random((n_docs, dim)).astype('float32')
    queries = np.random.random((n_queries, dim)).astype('float32')
    
    results = {}
    
    # Exact search (brute force)
    start_time = time.time()
    for query in queries[:10]:  # Only test with 10 queries for exact search
        similarities = cosine_similarity([query], documents)[0]
        top_indices = similarities.argsort()[-5:][::-1]
    exact_time = (time.time() - start_time) / 10  # Average per query
    results['Exact Search'] = {'time_per_query': exact_time * 1000, 'accuracy': 1.0}
    
    # HNSW search
    index = hnswlib.Index(space='cosine', dim=dim)
    index.init_index(max_elements=n_docs, ef_construction=200, M=16)
    
    build_start = time.time()
    index.add_items(documents, np.arange(n_docs))
    build_time = time.time() - build_start
    
    index.set_ef(50)
    start_time = time.time()
    for query in queries:
        labels, distances = index.knn_query(query, k=5)
    hnsw_time = (time.time() - start_time) / n_queries
    results['HNSW'] = {
        'time_per_query': hnsw_time * 1000,
        'build_time': build_time,
        'accuracy': 0.95  # Approximate
    }
    
    # Print results
    print("Search Method Benchmark:")
    print("=" * 50)
    print(f"{'Method':<15} {'Time/Query (ms)':<15} {'Build Time (s)':<15} {'Accuracy':<10}")
    print("-" * 65)
    
    for method, metrics in results.items():
        build_time_str = f"{metrics.get('build_time', 0):.2f}" if 'build_time' in metrics else "N/A"
        print(f"{method:<15} {metrics['time_per_query']:<15.2f} {build_time_str:<15} {metrics['accuracy']:<10.2f}")

# Note: Uncomment to run full benchmark
# benchmark_search_methods()

🎯 Key Takeaways

Algorithm Selection

Exact search: Use for small datasets or when perfect accuracy is required
HNSW: Best general-purpose approximate search algorithm
IVF: Good for very large datasets with clustering structure
LSH: Effective for high-dimensional sparse data

Performance Optimization

Index parameters: Tune based on your accuracy/speed requirements
Distance metrics: Choose based on your data characteristics
Memory management: Critical for large-scale deployments
Parallel processing: Essential for high-throughput applications

Quality vs Speed

Always measure both: Track accuracy metrics alongside performance
User experience matters: Sometimes "good enough" is better than perfect
Iterate and improve: Start simple, optimize based on real usage patterns

Next Steps:

Vector Storage Patterns: Learn how to organize and store vectors efficiently
Vector Databases: Explore database systems optimized for vector operations
RAG Implementation: See how similarity search powers retrieval-augmented generation

Similarity Search & Indexing ​

🔍 What is Similarity Search? ​

Core Concepts ​

Distance Metrics ​

Cosine Similarity ​

Euclidean Distance ​

Manhattan Distance (L1) ​

Dot Product ​

Nearest Neighbor Search ​

Exact vs Approximate Search ​

Indexing Algorithms ​

HNSW (Hierarchical Navigable Small World) ​

HNSW Implementation Example ​

IVF (Inverted File) ​

LSH (Locality Sensitive Hashing) ​

Search Performance Optimization ​

Index Configuration ​

Batching and Parallel Search ​

Memory Management ​

Real-World Implementation Patterns ​

Hybrid Search ​

Multi-Vector Search ​

Performance Monitoring ​

Search Quality Metrics ​

Performance Benchmarking ​

🎯 Key Takeaways ​

Algorithm Selection ​

Performance Optimization ​

Quality vs Speed ​

Similarity Search & Indexing

🔍 What is Similarity Search?

Core Concepts

Distance Metrics

Cosine Similarity

Euclidean Distance

Manhattan Distance (L1)

Dot Product

Nearest Neighbor Search

Exact vs Approximate Search

Indexing Algorithms

HNSW (Hierarchical Navigable Small World)

HNSW Implementation Example

IVF (Inverted File)

LSH (Locality Sensitive Hashing)

Search Performance Optimization

Index Configuration

Batching and Parallel Search

Memory Management

Real-World Implementation Patterns

Hybrid Search

Multi-Vector Search

Performance Monitoring

Search Quality Metrics

Performance Benchmarking

🎯 Key Takeaways

Algorithm Selection

Performance Optimization

Quality vs Speed