Vector Embeddings
How words and concepts become mathematical representations that AI can understand
What are Vector Embeddings?
Definition: Dense numerical representations that capture semantic meaning and relationships between words, sentences, or any type of data
Simple Analogy: Think of embeddings as coordinates on a map where similar words are placed close to each other, and relationships between concepts are preserved as directions and distances.
Why Embeddings Matter
The Core Problem
- Computers understand numbers, not words: Traditional approaches like one-hot encoding create sparse, inefficient representations
- Semantic relationships lost: Simple encoding can't capture that "king" and "queen" are related
- Context ignored: Words with multiple meanings (like "bank") are treated identically
The Embedding Solution
- Dense representations: Each word becomes a vector of real numbers (typically 100-1000 dimensions)
- Semantic similarity: Similar words have similar vectors (see the one-hot vs. dense sketch after this list)
- Mathematical relationships: Analogies become vector arithmetic
- Context-aware: Modern embeddings can represent different meanings based on context
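To make the contrast concrete, here is a minimal sketch with toy, hand-picked numbers (not taken from any real model): the one-hot vectors of "king" and "queen" share nothing, while small dense vectors can reflect their similarity.
python
import numpy as np

# One-hot encoding: every word is orthogonal to every other word
one_hot_king = np.array([1, 0, 0, 0, 0])
one_hot_queen = np.array([0, 1, 0, 0, 0])
print(one_hot_king @ one_hot_queen)  # 0 -> no notion of similarity

# Dense embeddings (toy values): similar words get similar vectors
dense_king = np.array([0.8, 0.3, 0.9])
dense_queen = np.array([0.7, 0.4, 0.9])
cosine = dense_king @ dense_queen / (np.linalg.norm(dense_king) * np.linalg.norm(dense_queen))
print(f"{cosine:.2f}")  # close to 1 -> high similarity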
Real-World Examples
Word Relationships
python
# Vector arithmetic captures relationships
# king - man + woman ≈ queen
# paris - france + italy ≈ rome
# walking - walk + run ≈ running
Similarity Search
- Document search: Find articles similar to a query
- Product recommendations: Suggest similar items
- Content moderation: Detect similar harmful content
- Duplicate detection: Find near-duplicate documents (a small threshold-based sketch follows this list)
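As a small illustration of the duplicate-detection case, a cosine-similarity threshold over sentence embeddings can flag near-duplicates; the model and the cutoff below are chosen purely for illustration and would need tuning on real data.
python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
docs = [
    "The quarterly report was published on Monday",
    "The quarterly report came out on Monday",
    "Our new office opens next spring",
]
scores = cosine_similarity(model.encode(docs))

threshold = 0.85  # illustrative cutoff; tune on your own data
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if scores[i, j] >= threshold:
            print(f"Possible duplicates ({scores[i, j]:.2f}): {docs[i]!r} / {docs[j]!r}")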
Semantic Understanding
- Question answering: Match questions to relevant context
- Chatbots: Understand user intent beyond exact keyword matches
- Translation: Align concepts across languages
- Code search: Find functionally similar code snippets
Types of Embeddings
Word Embeddings
Word2Vec
- Approach: Predict the center word from its surrounding context (CBOW) or predict the surrounding context words from the center word (Skip-gram)
- Training: Shallow neural network on large text corpus
- Pros: Fast, captures semantic relationships well
- Cons: Static (one vector per word), doesn't handle context
python
from gensim.models import Word2Vec
# Train Word2Vec on a toy corpus (a real model needs a much larger corpus)
sentences = [['cat', 'sits', 'on', 'mat'], ['dog', 'runs', 'in', 'park']]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Finding similar words
similar_words = model.wv.most_similar('cat', topn=5)
print(similar_words)

# Vector arithmetic (requires a model whose vocabulary actually contains these
# words, e.g. one trained on a large corpus or loaded from pre-trained vectors)
result = model.wv['king'] - model.wv['man'] + model.wv['woman']
most_similar = model.wv.similar_by_vector(result, topn=1)
print(most_similar)  # Should be close to 'queen'
GloVe (Global Vectors)
- Approach: Factorize word co-occurrence matrix
- Training: Combines global statistics with local context
- Pros: Leverages both global and local statistics
- Cons: Still static embeddings
python
# Using pre-trained GloVe embeddings
import numpy as np

def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Load embeddings
glove_embeddings = load_glove_embeddings('glove.6B.100d.txt')

# Calculate similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity = cosine_similarity(
    glove_embeddings['king'],
    glove_embeddings['queen']
)
print(f"Similarity between 'king' and 'queen': {similarity:.3f}")
FastText
- Innovation: Represents words as a bag of character n-grams
- Advantage: Handles out-of-vocabulary words and morphology
- Use case: Languages with rich morphology, rare words
python
from gensim.models import FastText

# Training FastText (sg=1 selects the skip-gram objective)
sentences = [['running', 'runner', 'runs'], ['walking', 'walker', 'walks']]
model = FastText(sentences, vector_size=100, window=3, min_count=1, sg=1)

# Can handle unseen words thanks to subword information
try:
    vector = model.wv['runners']  # Even if not in the training data
    print("FastText can handle unseen words!")
except KeyError:
    print("Word not found")
Contextual Embeddings
BERT Embeddings
- Innovation: Context-dependent representations
- Approach: Bidirectional transformer encoding
- Advantage: Same word gets different vectors in different contexts
python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(text):
    # Tokenize and encode
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    # Get embeddings (one vector per token)
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
    return embeddings

# Different contexts give different embeddings
sentence1 = "I went to the bank to deposit money"
sentence2 = "I sat by the river bank"

emb1 = get_bert_embeddings(sentence1)
emb2 = get_bert_embeddings(sentence2)

# The word "bank" will have different embeddings in each context
print("BERT provides context-aware embeddings!")
OpenAI Embeddings
- Models: text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large
- Use case: High-quality embeddings for various applications
- API-based: Easy to use without local model training
python
# Get embeddings from the OpenAI API (openai>=1.0 client; expects OPENAI_API_KEY to be set)
from openai import OpenAI

client = OpenAI()

def get_openai_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Example usage
text = "Artificial intelligence is transforming technology"
embedding = get_openai_embedding(text)
print(f"Embedding dimension: {len(embedding)}")
Sentence and Document Embeddings
Sentence-BERT
- Purpose: Create meaningful sentence-level embeddings
- Approach: Modified BERT architecture for sentence similarity
- Use case: Semantic search, clustering, duplicate detection
python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences
sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Dogs are playing in the park",
    "The weather is nice today"
]
embeddings = model.encode(sentences)

# Calculate similarities
similarities = cosine_similarity(embeddings)
print("Similarity to the first sentence:")
for i, sentence in enumerate(sentences):
    print(f"{sentence}: {similarities[0][i]:.3f}")
Doc2Vec
- Purpose: Document-level embeddings
- Approach: Extension of Word2Vec to documents
- Use case: Document classification, similarity search
python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare documents
documents = [
    TaggedDocument(words=['machine', 'learning', 'ai'], tags=[0]),
    TaggedDocument(words=['deep', 'learning', 'neural', 'networks'], tags=[1]),
    TaggedDocument(words=['natural', 'language', 'processing'], tags=[2])
]

# Train Doc2Vec model
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, workers=4)

# Get document embedding (model.dv replaces the older model.docvecs in gensim 4.x)
doc_embedding = model.dv[0]
print(f"Document embedding shape: {doc_embedding.shape}")
Practical Applications
1. Semantic Search Implementation
python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def add_documents(self, documents):
        """Add documents to the search index"""
        self.documents.extend(documents)
        new_embeddings = self.model.encode(documents)
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])

    def search(self, query, top_k=5):
        """Search for similar documents"""
        query_embedding = self.model.encode([query])

        # Calculate similarities
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]

        # Get top results
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score': similarities[idx]
            })
        return results

# Example usage
search_engine = SemanticSearch()

documents = [
    "Machine learning algorithms can recognize patterns in data",
    "Deep neural networks are powerful for image recognition",
    "Natural language processing helps computers understand text",
    "Computer vision enables machines to interpret visual information",
    "Reinforcement learning trains agents through rewards and penalties"
]
search_engine.add_documents(documents)

# Search with semantic understanding
results = search_engine.search("How do computers understand images?")
for result in results:
    print(f"Score: {result['score']:.3f} - {result['document']}")
2. Recommendation System
python
# Reuses numpy, SentenceTransformer and cosine_similarity imported above
class ContentRecommender:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.items = []
        self.embeddings = None

    def add_items(self, items_with_descriptions):
        """Add items with their descriptions"""
        for item_id, description in items_with_descriptions.items():
            self.items.append({'id': item_id, 'description': description})
        descriptions = [item['description'] for item in self.items]
        self.embeddings = self.model.encode(descriptions)

    def get_recommendations(self, user_preferences, top_k=3):
        """Get recommendations based on user preferences"""
        pref_embedding = self.model.encode([user_preferences])
        similarities = cosine_similarity(pref_embedding, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]

        recommendations = []
        for idx in top_indices:
            recommendations.append({
                'item': self.items[idx],
                'similarity': similarities[idx]
            })
        return recommendations

# Example usage
recommender = ContentRecommender()

items = {
    'book1': 'A thrilling mystery novel with detective investigations',
    'book2': 'Science fiction adventure in space with aliens',
    'book3': 'Romance story set in historical Victorian era',
    'book4': 'Crime thriller with police investigations',
    'book5': 'Fantasy adventure with magic and dragons'
}
recommender.add_items(items)

# Get recommendations
user_pref = "I love mystery stories and crime investigations"
recs = recommender.get_recommendations(user_pref)
for rec in recs:
    print(f"Similarity: {rec['similarity']:.3f} - {rec['item']['id']}: {rec['item']['description']}")
3. Text Classification with Embeddings
python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

class EmbeddingClassifier:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.classifier = LogisticRegression()

    def train(self, texts, labels):
        """Train classifier using embeddings as features"""
        # Convert texts to embeddings
        embeddings = self.model.encode(texts)

        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            embeddings, labels, test_size=0.2, random_state=42
        )

        # Train classifier
        self.classifier.fit(X_train, y_train)

        # Evaluate
        y_pred = self.classifier.predict(X_test)
        print(classification_report(y_test, y_pred))
        return self

    def predict(self, texts):
        """Predict labels for new texts"""
        embeddings = self.model.encode(texts)
        return self.classifier.predict(embeddings)

# Example usage (a real dataset would need far more than five examples)
texts = [
    "This movie was absolutely amazing!",
    "Terrible film, waste of time",
    "Great acting and wonderful story",
    "Boring and predictable plot",
    "Loved every minute of it"
]
labels = ['positive', 'negative', 'positive', 'negative', 'positive']

classifier = EmbeddingClassifier()
classifier.train(texts, labels)

# Predict new texts
new_texts = ["This was a fantastic experience", "Really disappointed"]
predictions = classifier.predict(new_texts)
print(f"Predictions: {predictions}")
Advanced Techniques
Fine-tuning Embeddings
python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def fine_tune_embeddings(model_name, training_data):
    """Fine-tune embeddings for a specific domain"""
    # Load pre-trained model
    model = SentenceTransformer(model_name)

    # Prepare training examples
    train_examples = []
    for anchor, positive, negative in training_data:
        train_examples.append(InputExample(texts=[anchor, positive], label=1.0))
        train_examples.append(InputExample(texts=[anchor, negative], label=0.0))

    # Create data loader
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Define loss function
    train_loss = losses.CosineSimilarityLoss(model)

    # Fine-tune
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=1,
        warmup_steps=100
    )
    return model

# Example training data (anchor, positive, negative)
training_data = [
    ("Python programming", "Coding in Python", "JavaScript development"),
    ("Machine learning", "AI algorithms", "Web design"),
    ("Data science", "Statistical analysis", "Graphic design")
]

# Fine-tune model
# fine_tuned_model = fine_tune_embeddings('all-MiniLM-L6-v2', training_data)
Multilingual Embeddings
python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Text in different languages
texts = [
    "Hello, how are you?",            # English
    "Hola, ¿cómo estás?",             # Spanish
    "Bonjour, comment allez-vous?",   # French
    "Hallo, wie geht es dir?",        # German
]

# Generate embeddings
embeddings = model.encode(texts)

# Calculate cross-lingual similarities to the first (English) sentence
similarities = cosine_similarity(embeddings)
print("Cross-lingual similarities:")
for i, text in enumerate(texts):
    print(f"{text}: {similarities[0][i]:.3f}")
Best Practices
1. Choosing the Right Embedding Model
python
# Decision guide for embedding selection
def choose_embedding_model(use_case, data_size, latency_requirement):
    """Guide for selecting an appropriate embedding model"""
    if use_case == "semantic_search":
        if latency_requirement == "low":
            return "all-MiniLM-L6-v2"   # Fast, good quality
        else:
            return "all-mpnet-base-v2"  # Best quality
    elif use_case == "multilingual":
        return "paraphrase-multilingual-MiniLM-L12-v2"
    elif use_case == "code_search":
        return "microsoft/codebert-base"
    elif data_size == "large" and latency_requirement == "low":
        return "text-embedding-3-small"  # OpenAI API
    else:
        return "all-MiniLM-L6-v2"  # Good default
2. Optimizing Performance
python
import faiss
import numpy as np

class FastSimilaritySearch:
    def __init__(self, embedding_dim):
        """Initialize a FAISS index for fast similarity search"""
        self.index = faiss.IndexFlatIP(embedding_dim)  # Inner product
        self.documents = []

    def add_embeddings(self, embeddings, documents):
        """Add embeddings to the index"""
        # Normalize embeddings so inner product equals cosine similarity
        embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.index.add(embeddings_norm.astype('float32'))
        self.documents.extend(documents)

    def search(self, query_embedding, k=5):
        """Fast similarity search"""
        query_norm = query_embedding / np.linalg.norm(query_embedding)
        scores, indices = self.index.search(
            query_norm.reshape(1, -1).astype('float32'), k
        )

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < len(self.documents):
                results.append({
                    'document': self.documents[idx],
                    'score': score
                })
        return results
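Wiring the index to sentence-transformer embeddings (reusing the model name from earlier sections) might look like this sketch:
python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
docs = [
    "Embeddings map text to vectors",
    "FAISS searches vectors quickly",
    "Cats sleep most of the day"
]
doc_vectors = encoder.encode(docs)

index = FastSimilaritySearch(embedding_dim=doc_vectors.shape[1])
index.add_embeddings(doc_vectors, docs)

query_vector = encoder.encode(["How do I search vectors fast?"])[0]
for hit in index.search(query_vector, k=2):
    print(f"{hit['score']:.3f}  {hit['document']}")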
3. Evaluation Metrics
python
def evaluate_embeddings(embeddings, labels, task='classification'):
    """Evaluate embedding quality on a downstream task"""
    if task == 'classification':
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        clf = LogisticRegression()
        scores = cross_val_score(clf, embeddings, labels, cv=5)
        return scores.mean()

    elif task == 'clustering':
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        kmeans = KMeans(n_clusters=len(set(labels)))
        cluster_labels = kmeans.fit_predict(embeddings)
        return silhouette_score(embeddings, cluster_labels)

    elif task == 'similarity':
        # Evaluate on similarity tasks
        similarities = cosine_similarity(embeddings)
        # Return correlation with human judgments (if available)
        pass
Future Directions
Emerging Trends
- Multimodal embeddings: Combining text, image, and audio
- Dynamic embeddings: Embeddings that evolve over time
- Compressed embeddings: Smaller vectors without quality loss
- Domain-specific models: Specialized embeddings for specific fields
Applications
- Scientific research: Embedding research papers and patents
- Legal tech: Understanding legal documents and cases
- Healthcare: Medical text understanding and drug discovery
- Finance: Financial document analysis and risk assessment
Next Steps:
- Similarity Search: Learn how to efficiently search through vector embeddings
- Storage Patterns: Understand how to store and manage embeddings at scale
- RAG Systems: See how embeddings enable retrieval-augmented generation