Embeddings & Semantic Similarity
Converting human language into mathematical representations that capture meaning
🔢 What are Vector Embeddings?
Definition: Converting words, sentences, or documents into numerical vectors that capture their meaning
Simple Analogy: Like giving each word a unique "fingerprint" made of numbers. Words with similar meanings have similar fingerprints.
How It Works
- Word → Numbers: "King" might become [0.2, 0.8, 0.1, 0.9, ...]
- Similar Words, Similar Numbers: "King" and "Queen" have similar vector patterns
- Mathematical Relationships: King - Man + Woman ≈ Queen
The Magic of Semantic Similarity
```text
📝 WORDS → 🔢 VECTORS → 🧮 MATHEMATICAL OPERATIONS

"King"  → [0.8, 0.2, 0.9, 0.1]
"Queen" → [0.7, 0.3, 0.8, 0.2]
"Man"   → [0.9, 0.1, 0.3, 0.7]
"Woman" → [0.6, 0.4, 0.2, 0.8]

Mathematical relationship:
King - Man + Woman ≈ Queen
```
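To make the arithmetic concrete, here is a minimal NumPy sketch using the toy 4-dimensional vectors above (the numbers are illustrative, not from a real model): it computes King - Man + Woman and checks which of the four words the result lands closest to.

```python
import numpy as np

# Toy 4-dimensional vectors from the illustration above
# (real embeddings typically have hundreds of dimensions)
vectors = {
    "king":  np.array([0.8, 0.2, 0.9, 0.1]),
    "queen": np.array([0.7, 0.3, 0.8, 0.2]),
    "man":   np.array([0.9, 0.1, 0.3, 0.7]),
    "woman": np.array([0.6, 0.4, 0.2, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# King - Man + Woman, then see which word the result is most similar to
result = vectors["king"] - vectors["man"] + vectors["woman"]
for word, vec in vectors.items():
    print(f"similarity to {word:5s}: {cosine(result, vec):.3f}")
# With these toy numbers, "queen" comes out on top.
```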
Real-World Examples
- Search Engines: Finding relevant results even if you don't use exact keywords
- Recommendation Systems: streaming services suggesting songs or shows similar to ones you already enjoy
- Translation: Understanding that "Hello" in English is similar to "Hola" in Spanish
- Chatbots: Understanding that "How are you?" and "What's up?" mean similar things
- Document Search: Finding similar documents even with different wording
- Content Moderation: Detecting similar toxic content across different phrasings
Types of Embeddings
Word Embeddings
- Individual words: Word2Vec, GloVe, FastText
- Fixed representation: Same word always has same vector
- Context-independent: "bank" gets the same vector whether it refers to a river bank or a financial institution
Sentence Embeddings
- Entire sentences or paragraphs: BERT, Sentence-BERT
- Variable length input: Can handle sentences of any length
- Semantic meaning: Captures overall meaning of the sentence
Document Embeddings
- Full documents or articles: Doc2Vec, Universal Sentence Encoder
- Document-level semantics: Understands themes and topics
- Similarity search: Find similar documents in large collections
Contextual Embeddings
- Same word, different meanings: BERT, ELMo, GPT
- Context-aware: "Bank" means different things in "river bank" vs "money bank"
- Dynamic representations: Vector changes based on surrounding words
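To see context-dependence in action, here is a minimal sketch using the Hugging Face transformers library (bert-base-uncased is just an example checkpoint; any BERT-style model works): it embeds the word "bank" in different sentences and compares the resulting vectors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style checkpoint works here; bert-base-uncased is an example
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence, word):
    """Return the contextual vector of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

river = embed_word("he sat on the river bank", "bank")
money = embed_word("she deposited cash at the bank", "bank")
same  = embed_word("he fished from the river bank", "bank")

cos = torch.nn.functional.cosine_similarity
print(f"river bank vs money bank: {cos(river, money, dim=0).item():.3f}")
print(f"river bank vs river bank: {cos(river, same, dim=0).item():.3f}")
# The two "river" usages should come out noticeably more similar than river vs money.
```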
Popular Embedding Models
Traditional Models
- Word2Vec: Skip-gram and CBOW models
- GloVe: Global vectors for word representation
- FastText: Handles out-of-vocabulary words
Modern Transformer-Based
- BERT: Bidirectional encoder representations
- RoBERTa: Robustly optimized BERT pretraining
- Sentence-BERT: Optimized for sentence-level tasks
- OpenAI Embeddings: text-embedding-ada-002
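For hosted models, a minimal sketch with the OpenAI Python client (assumes an OPENAI_API_KEY environment variable is set; swap in whichever embedding model your account offers):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",  # or a newer embedding model
    input=["King", "Queen"],
)
king, queen = (item.embedding for item in response.data)
print(f"Dimensions: {len(king)}")
```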
Technical Deep Dive
How Word2Vec Works
```text
🎯 SKIP-GRAM MODEL

Input sentence: "The cat sat on the mat"
Target word:    "cat"
Context words:  ["The", "sat"]

Goal: Predict context words from the target word
Result: Words appearing in similar contexts get similar vectors
```
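The heart of skip-gram training is turning raw text into (target, context) pairs. Here is a minimal sketch of that step (the window size is just an example parameter):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) training pairs for the skip-gram objective."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

sentence = "the cat sat on the mat".split()
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
# The model then learns vectors such that a target word predicts its context words.
```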
Vector Space Properties
- Dimensionality: Typically 100-1000 dimensions
- Distance Metrics: Cosine similarity, Euclidean distance
- Clustering: Similar concepts cluster together in vector space
- Linear Relationships: Analogies become vector arithmetic
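The two distance metrics mentioned above behave differently. A quick sketch using SciPy (whose cosine function reports cosine distance, i.e. 1 minus similarity) shows that cosine similarity ignores vector length while Euclidean distance does not:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a  # same direction, twice the length

print(f"cosine similarity: {1 - cosine(a, b):.3f}")  # 1.000, direction is identical
print(f"euclidean distance: {euclidean(a, b):.3f}")  # nonzero, magnitude differs
```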
🔧 Working with Embeddings
Using Pre-trained Word2Vec
```python
import gensim.downloader as api

# Load a pre-trained Word2Vec model (large download, roughly 1.6 GB)
model = api.load("word2vec-google-news-300")

# Get word vectors
king_vector = model['king']
queen_vector = model['queen']
man_vector = model['man']
woman_vector = model['woman']

print(f"Vector dimension: {len(king_vector)}")
print(f"King vector (first 10 dims): {king_vector[:10]}")

# Calculate similarity
similarity = model.similarity('king', 'queen')
print(f"Similarity between 'king' and 'queen': {similarity:.3f}")

# Famous analogy: king - man + woman ≈ queen
result_vector = king_vector - man_vector + woman_vector
most_similar = model.similar_by_vector(result_vector, topn=5)
print(f"King - Man + Woman = {most_similar}")
# Note: 'king' itself usually tops this list; use
# model.most_similar(positive=['king', 'woman'], negative=['man'])
# to exclude the input words.
```
Using Modern Sentence Embeddings
```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "I love machine learning",
    "Artificial intelligence is fascinating",
    "I enjoy cooking pasta",
    "Deep learning models are powerful",
    "Italian food is delicious"
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate pairwise similarities
similarity_matrix = cosine_similarity(embeddings)

# Print the similarity matrix as a labeled table
df = pd.DataFrame(similarity_matrix, index=sentences, columns=sentences)
print("Similarity Matrix:")
print(df.round(3))
```
Creating Custom Embeddings
```python
# Train your own Word2Vec model
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)  # tokenizer data, needed once

# Sample corpus
corpus = [
    "I love artificial intelligence and machine learning",
    "Natural language processing is a subset of AI",
    "Deep learning models use neural networks",
    "Machine learning algorithms learn from data",
    "AI systems can process natural language"
]

# Tokenize corpus
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus,
                 vector_size=100,  # Embedding dimension
                 window=5,         # Context window size
                 min_count=1,      # Minimum word frequency
                 workers=4)        # Number of CPU cores

# Use the trained model
try:
    similarity = model.wv.similarity('machine', 'learning')
    print(f"Similarity between 'machine' and 'learning': {similarity:.3f}")

    # Find similar words
    similar_words = model.wv.most_similar('artificial', topn=3)
    print(f"Words similar to 'artificial': {similar_words}")
except KeyError as e:
    print(f"Word not found in vocabulary: {e}")
```
Applications in AI Systems
Semantic Search
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, documents, model):
    """
    Perform semantic search using embeddings.
    `model` is a sentence-embedding model such as the SentenceTransformer above.
    """
    # Encode query and documents
    query_embedding = model.encode([query])
    doc_embeddings = model.encode(documents)

    # Calculate similarities
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

    # Rank documents from most to least similar
    ranked_indices = np.argsort(similarities)[::-1]

    results = []
    for i, idx in enumerate(ranked_indices):
        results.append({
            'rank': i + 1,
            'document': documents[idx],
            'similarity': similarities[idx]
        })
    return results

# Example usage (model = SentenceTransformer('all-MiniLM-L6-v2') from earlier)
documents = [
    "Python is a programming language",
    "Machine learning algorithms require data",
    "Neural networks are inspired by the brain",
    "Natural language processing understands text",
    "Computer vision analyzes images"
]

query = "How do computers understand text?"
results = semantic_search(query, documents, model)

print(f"Query: {query}")
print("\nTop results:")
for result in results[:3]:
    print(f"{result['rank']}. {result['document']} (Score: {result['similarity']:.3f})")
```
Document Clustering
```python
from sklearn.cluster import KMeans

def cluster_documents(documents, model, n_clusters=3):
    """
    Cluster documents based on their embeddings.
    `model` is a sentence-embedding model such as the SentenceTransformer above.
    """
    # Generate embeddings
    embeddings = model.encode(documents)

    # Perform clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)

    # Group documents by cluster id
    clustered_docs = {}
    for i, cluster_id in enumerate(clusters):
        clustered_docs.setdefault(cluster_id, []).append(documents[i])

    return clustered_docs, embeddings, clusters

# Example documents
tech_documents = [
    "Python programming for data science",
    "Machine learning with scikit-learn",
    "Cooking Italian pasta recipes",
    "Traditional French cuisine techniques",
    "Deep learning neural networks",
    "Artificial intelligence applications",
    "Homemade bread baking tips",
    "Molecular gastronomy methods"
]

clusters, embeddings, cluster_labels = cluster_documents(tech_documents, model, n_clusters=3)

print("Document Clusters:")
for cluster_id, docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in docs:
        print(f"  - {doc}")
```
Recommendation System
```python
def recommend_content(user_preferences, content_library, model, top_k=3):
    """
    Recommend content based on user preferences using embeddings.
    `model` is a sentence-embedding model such as the SentenceTransformer above.
    """
    # Encode user preferences and content
    user_embedding = model.encode([user_preferences])
    content_embeddings = model.encode(content_library)

    # Calculate similarities
    similarities = cosine_similarity(user_embedding, content_embeddings)[0]

    # Get top recommendations
    top_indices = np.argsort(similarities)[::-1][:top_k]

    recommendations = []
    for idx in top_indices:
        recommendations.append({
            'content': content_library[idx],
            'similarity': similarities[idx]
        })
    return recommendations

# Example usage
user_profile = "I enjoy learning about artificial intelligence and machine learning algorithms"

content_catalog = [
    "Introduction to Neural Networks",
    "Advanced Python Programming",
    "Computer Vision Fundamentals",
    "Cooking with Seasonal Ingredients",
    "Deep Learning for Beginners",
    "Web Development with React",
    "Natural Language Processing Guide",
    "Photography Composition Techniques"
]

recommendations = recommend_content(user_profile, content_catalog, model)

print(f"User interests: {user_profile}")
print("\nRecommended content:")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['content']} (Match: {rec['similarity']:.3f})")
```
Embedding Quality and Evaluation
Intrinsic Evaluation
```python
def evaluate_word_analogies(model, analogies):
    """
    Evaluate word embeddings on analogy questions of the form
    word1 - word2 + word3 ≈ expected (e.g. king - man + woman ≈ queen).
    `model` is a gensim KeyedVectors object such as word2vec-google-news-300.
    """
    correct = 0
    total = 0

    for word1, word2, word3, expected in analogies:
        try:
            # Calculate: word1 - word2 + word3
            result_vector = model[word1] - model[word2] + model[word3]
            candidates = model.similar_by_vector(result_vector, topn=5)

            # Skip the input words themselves, which tend to rank highest
            inputs = {word1.lower(), word2.lower(), word3.lower()}
            predicted = next(w for w, _ in candidates if w.lower() not in inputs)

            if predicted.lower() == expected.lower():
                correct += 1
            total += 1
            print(f"{word1} - {word2} + {word3} = {predicted} (Expected: {expected})")
        except KeyError as e:
            print(f"Word not found: {e}")

    accuracy = correct / total if total > 0 else 0
    print(f"\nAccuracy: {accuracy:.2%} ({correct}/{total})")
    return accuracy

# Test analogies (ordered so that word1 - word2 + word3 ≈ expected)
analogies = [
    ("king", "man", "woman", "queen"),
    ("Paris", "France", "Italy", "Rome"),
    ("bigger", "big", "small", "smaller"),
    ("walked", "walk", "swim", "swam")
]

# Note: This requires a Word2Vec model with sufficient vocabulary
# evaluate_word_analogies(model, analogies)
```
Extrinsic Evaluation
```python
def evaluate_on_classification_task(X_train, y_train, X_test, y_test, model):
    """
    Evaluate embeddings on a downstream classification task.
    `model` is a sentence-embedding model; X_* are lists of texts, y_* are labels.
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report

    # Generate embeddings for training and test data
    train_embeddings = model.encode(X_train)
    test_embeddings = model.encode(X_test)

    # Train a simple classifier on top of the embeddings
    classifier = LogisticRegression(random_state=42, max_iter=1000)
    classifier.fit(train_embeddings, y_train)

    # Make predictions
    y_pred = classifier.predict(test_embeddings)

    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print(f"Classification Accuracy: {accuracy:.3f}")
    print("\nDetailed Report:")
    print(report)
    return accuracy

# Example evaluation setup
# evaluate_on_classification_task(train_texts, train_labels, test_texts, test_labels, model)
```
Best Practices
Choosing the Right Embedding Model
```python
import pandas as pd

def embedding_model_comparison():
    """
    Compare different embedding models for various tasks.
    """
    comparison = {
        'Task': [
            'Semantic Search',
            'Document Similarity',
            'Text Classification',
            'Clustering',
            'Question Answering',
            'Multi-language',
            'Real-time Processing'
        ],
        'Word2Vec': [
            'Basic', 'Basic', 'Good', 'Good', 'Poor', 'Limited', 'Fast'
        ],
        'BERT': [
            'Excellent', 'Excellent', 'Excellent', 'Excellent', 'Excellent', 'Limited', 'Slow'
        ],
        'Sentence-BERT': [
            'Excellent', 'Excellent', 'Very Good', 'Excellent', 'Good', 'Good', 'Medium'
        ],
        'Universal Sentence Encoder': [
            'Very Good', 'Very Good', 'Good', 'Very Good', 'Good', 'Excellent', 'Medium'
        ]
    }

    df = pd.DataFrame(comparison)
    print("Embedding Model Comparison:")
    print(df.to_string(index=False))

embedding_model_comparison()
```
Optimization Tips
- Preprocessing: Clean and normalize text before embedding
- Model Selection: Choose based on your specific use case
- Dimensionality: Higher dimensions ≠ always better
- Fine-tuning: Consider domain-specific fine-tuning
- Caching: Store embeddings to avoid recomputation
- Batch Processing: Process multiple texts together for efficiency (both illustrated in the sketch below)
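A minimal sketch of the last two tips, assuming a SentenceTransformer-style model with an encode() method (the cache here is an in-memory dict; a real system might persist vectors to disk or a vector database):

```python
import numpy as np

class CachedEmbedder:
    """Wraps an embedding model with a simple in-memory cache and batch encoding."""

    def __init__(self, model):
        self.model = model
        self.cache = {}  # text -> vector

    def encode(self, texts, batch_size=32):
        # Only embed texts we have not seen before, in batches
        missing = [t for t in texts if t not in self.cache]
        for start in range(0, len(missing), batch_size):
            batch = missing[start:start + batch_size]
            vectors = self.model.encode(batch)
            self.cache.update(zip(batch, vectors))
        return np.array([self.cache[t] for t in texts])

# Usage (model = SentenceTransformer('all-MiniLM-L6-v2') from earlier):
# embedder = CachedEmbedder(model)
# vectors = embedder.encode(["first call embeds", "second call reuses the cache"])
```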
Common Pitfalls
- Out-of-vocabulary words: Some models can't produce vectors for unknown words (subword models like FastText sidestep this; see the sketch below)
- Domain mismatch: Pre-trained models may not work well on specialized text
- Context length: Some models have maximum input length limits
- Language assumptions: Many models are English-centric
- Computational cost: Large models require significant resources
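Subword-based models such as FastText mitigate the out-of-vocabulary problem because they build word vectors from character n-grams. A minimal sketch (reusing the tiny tokenized_corpus from the custom-training example above, so the results are only illustrative):

```python
from gensim.models import FastText

# Train on the tiny corpus from the custom-training example above
ft_model = FastText(sentences=tokenized_corpus, vector_size=50, window=3, min_count=1)

# "learnin" never appears in the corpus, but FastText can still build a vector
# for it from its character n-grams; a plain Word2Vec model would raise KeyError.
vector = ft_model.wv["learnin"]
print(f"OOV vector dimension: {len(vector)}")
print(ft_model.wv.most_similar("learnin", topn=3))
```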
🎯 Key Takeaways
Embedding Fundamentals
- Vectors capture meaning: Similar concepts have similar representations
- Mathematical operations: Enable analogies and relationships
- Context matters: Modern embeddings consider surrounding words
- Quality varies: Different models excel at different tasks
Practical Considerations
- Start with pre-trained: Use existing models before training custom ones
- Evaluate thoroughly: Test on your specific use case
- Consider trade-offs: Balance accuracy, speed, and resource requirements
- Domain adaptation: Fine-tune for specialized applications
Future Directions
- Multimodal embeddings: Combining text, images, and other modalities
- Cross-lingual models: Better support for multiple languages
- Efficient architectures: Faster models with comparable performance
- Specialized domains: Models trained for specific fields
Next Steps:
- Transformers & Attention: Understand the architecture behind modern embeddings
- Large Language Models: See how embeddings enable powerful language models
- Vector Databases: Learn to store and search embeddings efficiently