
Embeddings & Semantic Similarity

Converting human language into mathematical representations that capture meaning

🔢 What are Vector Embeddings?

Definition: Converting words, sentences, or documents into numerical vectors that capture their meaning

Simple Analogy: Like giving each word a unique "fingerprint" made of numbers. Words with similar meanings have similar fingerprints.

How It Works

  • Word → Numbers: "King" might become [0.2, 0.8, 0.1, 0.9, ...]
  • Similar Words, Similar Numbers: "King" and "Queen" have similar vector patterns
  • Mathematical Relationships: King - Man + Woman ≈ Queen

The Magic of Semantic Similarity

text
📝 WORDS → 🔢 VECTORS → 🧮 MATHEMATICAL OPERATIONS

"King"   → [0.8, 0.2, 0.9, 0.1]
"Queen"  → [0.7, 0.3, 0.8, 0.2]
"Man"    → [0.9, 0.1, 0.3, 0.7]
"Woman"  → [0.6, 0.4, 0.2, 0.8]

Mathematical relationship:
King - Man + Woman ≈ Queen
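
To make this concrete, here is a minimal NumPy sketch using the toy 4-dimensional vectors above (illustrative values only; real embeddings have hundreds of dimensions):

python
import numpy as np

# Toy vectors from the illustration above
vectors = {
    "king":  np.array([0.8, 0.2, 0.9, 0.1]),
    "queen": np.array([0.7, 0.3, 0.8, 0.2]),
    "man":   np.array([0.9, 0.1, 0.3, 0.7]),
    "woman": np.array([0.6, 0.4, 0.2, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"king vs queen: {cosine(vectors['king'], vectors['queen']):.3f}")
print(f"king vs woman: {cosine(vectors['king'], vectors['woman']):.3f}")

# Analogy arithmetic: king - man + woman lands closest to queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
closest = max(vectors, key=lambda word: cosine(result, vectors[word]))
print(f"king - man + woman ≈ {closest}")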

Real-World Examples

  • Search Engines: Finding relevant results even if you don't use exact keywords
  • Recommendation Systems: Music services like Spotify suggesting songs whose lyrics or descriptions are semantically similar to ones you already like
  • Translation: Understanding that "Hello" in English is similar to "Hola" in Spanish
  • Chatbots: Understanding that "How are you?" and "What's up?" mean similar things
  • Document Search: Finding similar documents even with different wording
  • Content Moderation: Detecting similar toxic content across different phrasings

Types of Embeddings

Word Embeddings

  • Individual words: Word2Vec, GloVe, FastText
  • Fixed representation: Same word always has same vector
  • Context-independent: "Bank" gets the same vector whether it refers to a river bank or a financial bank

Sentence Embeddings

  • Entire sentences or paragraphs: BERT, Sentence-BERT
  • Variable-length input: Can handle sentences and short paragraphs (up to the model's maximum token length)
  • Semantic meaning: Captures overall meaning of the sentence

Document Embeddings

  • Full documents or articles: Doc2Vec, Universal Sentence Encoder
  • Document-level semantics: Understands themes and topics
  • Similarity search: Find similar documents in large collections (see the Doc2Vec sketch below)
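
As a rough illustration of document-level embeddings, here is a minimal Doc2Vec sketch with gensim (tiny made-up corpus, toy parameters):

python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; real document embeddings need far more data
raw_docs = [
    "machine learning models learn patterns from data",
    "neural networks power modern artificial intelligence",
    "fresh pasta needs only flour and eggs",
]

# Each document gets a tag so its vector can be looked up later
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(raw_docs)]

# Train a small Doc2Vec model (parameters chosen only for this toy example)
doc_model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen document and find the most similar training doc
new_vec = doc_model.infer_vector("deep learning uses neural networks".split())
print(doc_model.dv.most_similar([new_vec], topn=1))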

Contextual Embeddings

  • Same word, different meanings: BERT, ELMo, GPT
  • Context-aware: "Bank" gets different vectors in "river bank" vs. "bank account"
  • Dynamic representations: The vector changes based on surrounding words (see the sketch below)
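
For a rough sense of context-awareness, the sketch below (assuming the Hugging Face transformers and torch packages with bert-base-uncased) compares the vector that BERT assigns to "bank" in two different sentences:

python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "She sat on the bank of the river.",
    "He opened a savings account at the bank.",
]

bank_vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # Locate the "bank" token and take its contextual hidden state
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_index = tokens.index("bank")
    bank_vectors.append(outputs.last_hidden_state[0, bank_index])

similarity = torch.nn.functional.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(f"'bank' (river) vs 'bank' (finance): {similarity.item():.3f}")  # noticeably below 1.0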

Traditional Models

  • Word2Vec: Skip-gram and CBOW models
  • GloVe: Global vectors for word representation
  • FastText: Handles out-of-vocabulary words via subword n-grams (see the sketch below)
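
As a quick sketch of FastText's subword behavior with gensim (toy corpus, toy parameters):

python
from gensim.models import FastText

# FastText builds word vectors from character n-grams, so it can produce a
# vector even for words that never appeared in the training corpus
corpus = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "uses", "neural", "networks"],
]

ft_model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1)

# "learnings" is out-of-vocabulary, but FastText still returns a vector for it
oov_vector = ft_model.wv["learnings"]
print(f"OOV vector dimension: {len(oov_vector)}")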

Modern Transformer-Based

  • BERT: Bidirectional encoder representations
  • RoBERTa: Robustly optimized BERT pretraining
  • Sentence-BERT: Optimized for sentence-level tasks
  • OpenAI Embeddings: text-embedding-ada-002 and the newer text-embedding-3 family (API sketch below)
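
As a sketch of calling a hosted embedding API (this assumes the openai Python package, version 1.x, and an OPENAI_API_KEY set in the environment):

python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",  # or a newer text-embedding-3 model
    input=["Embeddings turn text into vectors"],
)

vector = response.data[0].embedding
print(f"Dimension: {len(vector)}")  # 1536 for text-embedding-ada-002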

Technical Deep Dive

How Word2Vec Works

text
🎯 SKIP-GRAM MODEL

Input: "The cat sat on the mat"
Target word: "cat"
Context words (window = 1): ["The", "sat"]

Goal: Predict context words from target word
Result: Words appearing in similar contexts get similar vectors
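
In gensim, switching between the skip-gram and CBOW objectives is a single parameter; a minimal sketch (toy corpus, toy parameters):

python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"]]

# sg=1 trains skip-gram; sg=0 (the default) trains CBOW
skipgram_model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=1)
cbow_model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=0)

print(skipgram_model.wv["cat"][:5])  # first few dimensions of the "cat" vector
print(cbow_model.wv["cat"][:5])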

Vector Space Properties

  • Dimensionality: Typically 100-1,000+ dimensions (e.g., 300 for Word2Vec Google News, 384 for all-MiniLM-L6-v2, 1,536 for text-embedding-ada-002)
  • Distance Metrics: Cosine similarity, Euclidean distance
  • Clustering: Similar concepts cluster together in vector space
  • Linear Relationships: Analogies become vector arithmetic

🔧 Working with Embeddings

Using Pre-trained Word2Vec

python
import gensim.downloader as api

# Load pre-trained Word2Vec vectors (a large download on first use);
# api.load returns a gensim KeyedVectors object
model = api.load("word2vec-google-news-300")

# Get word vectors
king_vector = model['king']
queen_vector = model['queen']
man_vector = model['man']
woman_vector = model['woman']

print(f"Vector dimension: {len(king_vector)}")
print(f"King vector (first 10 dims): {king_vector[:10]}")

# Calculate similarity
similarity = model.similarity('king', 'queen')
print(f"Similarity between 'king' and 'queen': {similarity:.3f}")

# Famous analogy: king - man + woman ≈ queen
# Note: raw vector arithmetic often ranks 'king' itself first;
# most_similar with positive/negative terms excludes the input words
result_vector = king_vector - man_vector + woman_vector
most_similar = model.similar_by_vector(result_vector, topn=5)
print(f"King - Man + Woman = {most_similar}")

analogy = model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"Excluding input words: {analogy}")

Using Modern Sentence Embeddings

python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "I love machine learning",
    "Artificial intelligence is fascinating",
    "I enjoy cooking pasta",
    "Deep learning models are powerful",
    "Italian food is delicious"
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate similarities
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

# Print similarity matrix
import pandas as pd
df = pd.DataFrame(similarity_matrix, 
                  index=sentences, 
                  columns=sentences)
print("Similarity Matrix:")
print(df.round(3))

Creating Custom Embeddings

python
# Train your own Word2Vec model
from gensim.models import Word2Vec
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data for word_tokenize ('punkt_tab' on newer NLTK versions)

# Sample corpus
corpus = [
    "I love artificial intelligence and machine learning",
    "Natural language processing is a subset of AI",
    "Deep learning models use neural networks",
    "Machine learning algorithms learn from data",
    "AI systems can process natural language"
]

# Tokenize corpus
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, 
                 vector_size=100,    # Embedding dimension
                 window=5,           # Context window size
                 min_count=1,        # Minimum word frequency
                 workers=4)          # Number of CPU cores

# Use the trained model
try:
    similarity = model.wv.similarity('machine', 'learning')
    print(f"Similarity between 'machine' and 'learning': {similarity:.3f}")
    
    # Find similar words
    similar_words = model.wv.most_similar('artificial', topn=3)
    print(f"Words similar to 'artificial': {similar_words}")
except KeyError as e:
    print(f"Word not found in vocabulary: {e}")

Applications in AI Systems
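
A first application is semantic search: embed the query and the documents, then rank the documents by their cosine similarity to the query.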

python
def semantic_search(query, documents, model):
    """
    Perform semantic search using embeddings
    """
    # Encode query and documents
    query_embedding = model.encode([query])
    doc_embeddings = model.encode(documents)
    
    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    
    # Rank documents
    ranked_indices = np.argsort(similarities)[::-1]
    
    results = []
    for i, idx in enumerate(ranked_indices):
        results.append({
            'rank': i + 1,
            'document': documents[idx],
            'similarity': similarities[idx]
        })
    
    return results

# Example usage
documents = [
    "Python is a programming language",
    "Machine learning algorithms require data",
    "Neural networks are inspired by the brain",
    "Natural language processing understands text",
    "Computer vision analyzes images"
]

# Re-select the sentence-transformer model from earlier; the gensim Word2Vec
# model trained above has no .encode() method
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do computers understand text?"
results = semantic_search(query, documents, model)

print(f"Query: {query}")
print("\nTop results:")
for result in results[:3]:
    print(f"{result['rank']}. {result['document']} (Score: {result['similarity']:.3f})")

Document Clustering
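
Because similar texts end up near each other in vector space, standard clustering algorithms such as k-means can group documents by topic.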

python
from sklearn.cluster import KMeans

def cluster_documents(documents, model, n_clusters=3):
    """
    Cluster documents based on their embeddings
    """
    # Generate embeddings
    embeddings = model.encode(documents)
    
    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    
    # Organize results
    clustered_docs = {}
    for i, cluster_id in enumerate(clusters):
        if cluster_id not in clustered_docs:
            clustered_docs[cluster_id] = []
        clustered_docs[cluster_id].append(documents[i])
    
    return clustered_docs, embeddings, clusters

# Example documents
mixed_documents = [
    "Python programming for data science",
    "Machine learning with scikit-learn",
    "Cooking Italian pasta recipes",
    "Traditional French cuisine techniques",
    "Deep learning neural networks",
    "Artificial intelligence applications",
    "Homemade bread baking tips",
    "Molecular gastronomy methods"
]

clusters, embeddings, cluster_labels = cluster_documents(mixed_documents, model, n_clusters=3)

print("Document Clusters:")
for cluster_id, docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in docs:
        print(f"  - {doc}")

Recommendation System
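
The same similarity machinery drives recommendations: embed a description of the user's interests and rank the content catalog by similarity to it.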

python
def recommend_content(user_preferences, content_library, model, top_k=3):
    """
    Recommend content based on user preferences using embeddings
    """
    # Encode user preferences and content
    user_embedding = model.encode([user_preferences])
    content_embeddings = model.encode(content_library)
    
    # Calculate similarities
    similarities = cosine_similarity(user_embedding, content_embeddings)[0]
    
    # Get top recommendations
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    recommendations = []
    for idx in top_indices:
        recommendations.append({
            'content': content_library[idx],
            'similarity': similarities[idx]
        })
    
    return recommendations

# Example usage
user_profile = "I enjoy learning about artificial intelligence and machine learning algorithms"

content_catalog = [
    "Introduction to Neural Networks",
    "Advanced Python Programming",
    "Computer Vision Fundamentals",
    "Cooking with Seasonal Ingredients",
    "Deep Learning for Beginners",
    "Web Development with React",
    "Natural Language Processing Guide",
    "Photography Composition Techniques"
]

recommendations = recommend_content(user_profile, content_catalog, model)

print(f"User interests: {user_profile}")
print("\nRecommended content:")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['content']} (Match: {rec['similarity']:.3f})")

Embedding Quality and Evaluation

Intrinsic Evaluation
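
Intrinsic evaluation probes the embedding space directly, for example by checking how often word analogies resolve to the expected word.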

python
def evaluate_word_analogies(model, analogies):
    """
    Evaluate embeddings using word analogies
    """
    correct = 0
    total = 0
    
    for analogy in analogies:
        word1, word2, word3, expected = analogy
        try:
            # Calculate: word1 - word2 + word3
            result_vector = (model[word1] - model[word2] + model[word3])
            most_similar = model.similar_by_vector(result_vector, topn=1)
            predicted = most_similar[0][0]
            
            if predicted.lower() == expected.lower():
                correct += 1
            total += 1
            
            print(f"{word1} - {word2} + {word3} = {predicted} (Expected: {expected})")
            
        except KeyError as e:
            print(f"Word not found: {e}")
    
    accuracy = correct / total if total > 0 else 0
    print(f"\nAccuracy: {accuracy:.2%} ({correct}/{total})")
    return accuracy

# Test analogies
analogies = [
    ("king", "man", "woman", "queen"),
    ("Paris", "France", "Italy", "Rome"),
    ("big", "bigger", "small", "smaller"),
    ("walk", "walked", "swim", "swam")
]

# Note: This requires a Word2Vec model with sufficient vocabulary
# evaluate_word_analogies(model, analogies)

Extrinsic Evaluation
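
Extrinsic evaluation measures how well the embeddings work as features for a downstream task, such as text classification.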

python
def evaluate_on_classification_task(X_train, y_train, X_test, y_test, model):
    """
    Evaluate embeddings on a downstream classification task
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report
    
    # Generate embeddings for training and test data
    train_embeddings = model.encode(X_train)
    test_embeddings = model.encode(X_test)
    
    # Train classifier
    classifier = LogisticRegression(random_state=42)
    classifier.fit(train_embeddings, y_train)
    
    # Make predictions
    y_pred = classifier.predict(test_embeddings)
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    
    print(f"Classification Accuracy: {accuracy:.3f}")
    print("\nDetailed Report:")
    print(report)
    
    return accuracy

# Example evaluation setup
# evaluate_on_classification_task(train_texts, train_labels, test_texts, test_labels, model)

Best Practices

Choosing the Right Embedding Model

python
import pandas as pd

def embedding_model_comparison():
    """
    Compare different embedding models for various tasks
    """
    comparison = {
        'Task': [
            'Semantic Search',
            'Document Similarity',
            'Text Classification',
            'Clustering',
            'Question Answering',
            'Multi-language',
            'Real-time Processing'
        ],
        'Word2Vec': [
            'Basic', 'Basic', 'Good', 'Good', 'Poor', 'Limited', 'Fast'
        ],
        'BERT': [
            'Excellent', 'Excellent', 'Excellent', 'Excellent', 'Excellent', 'Limited', 'Slow'
        ],
        'Sentence-BERT': [
            'Excellent', 'Excellent', 'Very Good', 'Excellent', 'Good', 'Good', 'Medium'
        ],
        'Universal Sentence Encoder': [
            'Very Good', 'Very Good', 'Good', 'Very Good', 'Good', 'Excellent', 'Medium'
        ]
    }
    
    df = pd.DataFrame(comparison)
    print("Embedding Model Comparison:")
    print(df.to_string(index=False))

embedding_model_comparison()

Optimization Tips

  1. Preprocessing: Clean and normalize text before embedding
  2. Model Selection: Choose based on your specific use case
  3. Dimensionality: Higher dimensions ≠ always better
  4. Fine-tuning: Consider domain-specific fine-tuning
  5. Caching: Store embeddings to avoid recomputation (see the sketch after this list)
  6. Batch Processing: Process multiple texts together for efficiency
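
For tips 5 and 6, here is a minimal sketch of an in-memory cache plus batched encoding (the cache and helper names are made up; it assumes a SentenceTransformer-style model whose encode() accepts a batch_size argument):

python
import hashlib
import numpy as np

_embedding_cache = {}  # simple in-memory cache; swap for disk or Redis in production

def embed_with_cache(texts, model, batch_size=64):
    """Encode texts, reusing cached vectors and batching only the uncached ones."""
    keys = [hashlib.sha1(text.encode("utf-8")).hexdigest() for text in texts]
    missing = [text for text, key in zip(texts, keys) if key not in _embedding_cache]

    if missing:
        # One batched call is far cheaper than encoding texts one by one
        new_vectors = model.encode(missing, batch_size=batch_size)
        for text, vector in zip(missing, new_vectors):
            _embedding_cache[hashlib.sha1(text.encode("utf-8")).hexdigest()] = vector

    return np.vstack([_embedding_cache[key] for key in keys])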

Common Pitfalls

  • Out-of-vocabulary words: Some models can't handle unknown words
  • Domain mismatch: Pre-trained models may not work well on specialized text
  • Context length: Some models have maximum input length limits
  • Language assumptions: Many models are English-centric
  • Computational cost: Large models require significant resources

🎯 Key Takeaways

Embedding Fundamentals

  • Vectors capture meaning: Similar concepts have similar representations
  • Mathematical operations: Enable analogies and relationships
  • Context matters: Modern embeddings consider surrounding words
  • Quality varies: Different models excel at different tasks

Practical Considerations

  • Start with pre-trained: Use existing models before training custom ones
  • Evaluate thoroughly: Test on your specific use case
  • Consider trade-offs: Balance accuracy, speed, and resource requirements
  • Domain adaptation: Fine-tune for specialized applications

Future Directions

  • Multimodal embeddings: Combining text, images, and other modalities
  • Cross-lingual models: Better support for multiple languages
  • Efficient architectures: Faster models with comparable performance
  • Specialized domains: Models trained for specific fields

