
Vector Embeddings ​

How words and concepts become mathematical representations that AI can understand

🎯 What are Vector Embeddings? ​

Definition: Dense numerical representations that capture semantic meaning and relationships between words, sentences, or any type of data

Simple Analogy: Think of embeddings as coordinates on a map where similar words are placed close to each other, and relationships between concepts are preserved as directions and distances.

Why Embeddings Matter ​

The Core Problem ​

  • Computers understand numbers, not words: Traditional approaches like one-hot encoding create sparse, inefficient representations (see the sketch after this list)
  • Semantic relationships lost: Simple encoding can't capture that "king" and "queen" are related
  • Context ignored: Words with multiple meanings (like "bank") are treated identically
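
To make the one-hot problem concrete, here is a minimal NumPy sketch. The dense vectors are made-up toy numbers rather than learned embeddings, but they illustrate the kind of similarity structure a real embedding space provides.

python
import numpy as np

# One-hot vectors over a toy vocabulary: sparse and mutually orthogonal,
# so "king" and "queen" look exactly as unrelated as "king" and "apple"
vocab = ["king", "queen", "apple"]
one_hot = np.eye(len(vocab))
print(one_hot[0] @ one_hot[1])  # 0.0

# Dense embeddings (made-up toy numbers, not learned values)
dense = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.7, 0.4, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(dense["king"], dense["queen"]))  # high: related concepts
print(cosine(dense["king"], dense["apple"]))  # low: unrelated concepts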

The Embedding Solution ​

  • Dense representations: Each word becomes a vector of real numbers (typically a few hundred dimensions, up to a few thousand for recent models)
  • Semantic similarity: Similar words have similar vectors
  • Mathematical relationships: Analogies become vector arithmetic
  • Context-aware: Modern embeddings can represent different meanings based on context

Real-World Examples ​

Word Relationships ​

python
# Vector arithmetic captures relationships
king - man + woman ≈ queen
paris - france + italy ≈ rome
walking - walk + run ≈ running

Similarity Search

  • Document search: Find articles similar to a query
  • Product recommendations: Suggest similar items
  • Content moderation: Detect similar harmful content
  • Duplicate detection: Find near-duplicate documents

Semantic Understanding ​

  • Question answering: Match questions to relevant context
  • Chatbots: Understand user intent beyond exact keyword matches
  • Translation: Align concepts across languages
  • Code search: Find functionally similar code snippets

Types of Embeddings ​

Word Embeddings ​

Word2Vec ​

  • Approach: Predict the center word from its context (CBOW) or predict the surrounding context words from the center word (Skip-gram)
  • Training: Shallow neural network on large text corpus
  • Pros: Fast, captures semantic relationships well
  • Cons: Static (one vector per word), doesn't handle context
python
from gensim.models import Word2Vec

# Training Word2Vec
sentences = [['cat', 'sits', 'on', 'mat'], ['dog', 'runs', 'in', 'park']]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Finding similar words
similar_words = model.wv.most_similar('cat', topn=5)
print(similar_words)

# Vector arithmetic (requires a model trained on a large corpus;
# 'king', 'man' and 'woman' are not in the tiny toy corpus above)
result = model.wv['king'] - model.wv['man'] + model.wv['woman']
most_similar = model.wv.similar_by_vector(result, topn=1)
print(most_similar)  # should be close to 'queen'

GloVe (Global Vectors) ​

  • Approach: Factorize word co-occurrence matrix
  • Training: Combines global statistics with local context
  • Pros: Leverages both global and local statistics
  • Cons: Still static embeddings
python
# Using pre-trained GloVe embeddings
import numpy as np

def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Load embeddings (glove.6B.100d.txt can be downloaded from the Stanford NLP GloVe page)
glove_embeddings = load_glove_embeddings('glove.6B.100d.txt')

# Calculate similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity = cosine_similarity(
    glove_embeddings['king'], 
    glove_embeddings['queen']
)
print(f"Similarity between 'king' and 'queen': {similarity:.3f}")

FastText ​

  • Innovation: Represents words as bag of character n-grams
  • Advantage: Handles out-of-vocabulary words and morphology
  • Use case: Languages with rich morphology, rare words
python
from gensim.models import FastText

# Training FastText
sentences = [['running', 'runner', 'runs'], ['walking', 'walker', 'walks']]
model = FastText(sentences, vector_size=100, window=3, min_count=1, sg=1)

# FastText builds vectors from subword n-grams, so even words absent from
# the training data (like 'runners') get a vector instead of raising KeyError
vector = model.wv['runners']
print("FastText can handle unseen words!")

Contextual Embeddings ​

BERT Embeddings ​

  • Innovation: Context-dependent representations
  • Approach: Bidirectional transformer encoding
  • Advantage: Same word gets different vectors in different contexts
python
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(text):
    # Tokenize and encode
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
    
    return embeddings

# Different contexts give different embeddings
sentence1 = "I went to the bank to deposit money"
sentence2 = "I sat by the river bank"

emb1 = get_bert_embeddings(sentence1)
emb2 = get_bert_embeddings(sentence2)

# The word "bank" will have different embeddings in each context
print("BERT provides context-aware embeddings!")

OpenAI Embeddings ​

  • Models: text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large
  • Use case: High-quality embeddings for various applications
  • API-based: Easy to use without local model training
python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Get embeddings from the OpenAI API (openai>=1.0 client interface)
def get_openai_embedding(text, model="text-embedding-3-small"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Example usage
text = "Artificial intelligence is transforming technology"
embedding = get_openai_embedding(text)
print(f"Embedding dimension: {len(embedding)}")

Sentence and Document Embeddings ​

Sentence-BERT ​

  • Purpose: Create meaningful sentence-level embeddings
  • Approach: Modified BERT architecture for sentence similarity
  • Use case: Semantic search, clustering, duplicate detection
python
from sentence_transformers import SentenceTransformer

# Load pre-trained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences
sentences = [
    "The cat sits on the mat",
    "A feline rests on the rug",
    "Dogs are playing in the park",
    "The weather is nice today"
]

embeddings = model.encode(sentences)

# Calculate similarities
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)

print("Sentence similarities:")
for i, sentence in enumerate(sentences):
    print(f"{sentence}: {similarities[0][i]:.3f}")

Doc2Vec ​

  • Purpose: Document-level embeddings
  • Approach: Extension of Word2Vec to documents
  • Use case: Document classification, similarity search
python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare documents
documents = [
    TaggedDocument(words=['machine', 'learning', 'ai'], tags=[0]),
    TaggedDocument(words=['deep', 'learning', 'neural', 'networks'], tags=[1]),
    TaggedDocument(words=['natural', 'language', 'processing'], tags=[2])
]

# Train Doc2Vec model
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, workers=4)

# Get a trained document embedding (gensim 4.x uses model.dv; older versions used model.docvecs)
doc_embedding = model.dv[0]
print(f"Document embedding shape: {doc_embedding.shape}")

Practical Applications ​

1. Semantic Search Implementation ​

python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticSearch:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
    
    def add_documents(self, documents):
        """Add documents to the search index"""
        self.documents.extend(documents)
        new_embeddings = self.model.encode(documents)
        
        if self.embeddings is None:
            self.embeddings = new_embeddings
        else:
            self.embeddings = np.vstack([self.embeddings, new_embeddings])
    
    def search(self, query, top_k=5):
        """Search for similar documents"""
        query_embedding = self.model.encode([query])
        
        # Calculate similarities
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        
        # Get top results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append({
                'document': self.documents[idx],
                'score': similarities[idx]
            })
        
        return results

# Example usage
search_engine = SemanticSearch()

documents = [
    "Machine learning algorithms can recognize patterns in data",
    "Deep neural networks are powerful for image recognition",
    "Natural language processing helps computers understand text",
    "Computer vision enables machines to interpret visual information",
    "Reinforcement learning trains agents through rewards and penalties"
]

search_engine.add_documents(documents)

# Search with semantic understanding
results = search_engine.search("How do computers understand images?")
for result in results:
    print(f"Score: {result['score']:.3f} - {result['document']}")

2. Recommendation System ​

python
class ContentRecommender:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.items = []
        self.embeddings = None
    
    def add_items(self, items_with_descriptions):
        """Add items with their descriptions"""
        for item_id, description in items_with_descriptions.items():
            self.items.append({'id': item_id, 'description': description})
        
        descriptions = [item['description'] for item in self.items]
        self.embeddings = self.model.encode(descriptions)
    
    def get_recommendations(self, user_preferences, top_k=3):
        """Get recommendations based on user preferences"""
        pref_embedding = self.model.encode([user_preferences])
        
        similarities = cosine_similarity(pref_embedding, self.embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        recommendations = []
        for idx in top_indices:
            recommendations.append({
                'item': self.items[idx],
                'similarity': similarities[idx]
            })
        
        return recommendations

# Example usage
recommender = ContentRecommender()

items = {
    'book1': 'A thrilling mystery novel with detective investigations',
    'book2': 'Science fiction adventure in space with aliens',
    'book3': 'Romance story set in historical Victorian era',
    'book4': 'Crime thriller with police investigations',
    'book5': 'Fantasy adventure with magic and dragons'
}

recommender.add_items(items)

# Get recommendations
user_pref = "I love mystery stories and crime investigations"
recs = recommender.get_recommendations(user_pref)

for rec in recs:
    print(f"Similarity: {rec['similarity']:.3f} - {rec['item']['id']}: {rec['item']['description']}")

3. Text Classification with Embeddings ​

python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

class EmbeddingClassifier:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.classifier = LogisticRegression()
    
    def train(self, texts, labels):
        """Train classifier using embeddings as features"""
        # Convert texts to embeddings
        embeddings = self.model.encode(texts)
        
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            embeddings, labels, test_size=0.2, random_state=42
        )
        
        # Train classifier
        self.classifier.fit(X_train, y_train)
        
        # Evaluate
        y_pred = self.classifier.predict(X_test)
        print(classification_report(y_test, y_pred))
        
        return self
    
    def predict(self, texts):
        """Predict labels for new texts"""
        embeddings = self.model.encode(texts)
        return self.classifier.predict(embeddings)

# Example usage (a tiny toy dataset for illustration; real training needs far more examples)
texts = [
    "This movie was absolutely amazing!",
    "Terrible film, waste of time",
    "Great acting and wonderful story",
    "Boring and predictable plot",
    "Loved every minute of it"
]
labels = ['positive', 'negative', 'positive', 'negative', 'positive']

classifier = EmbeddingClassifier()
classifier.train(texts, labels)

# Predict new texts
new_texts = ["This was a fantastic experience", "Really disappointed"]
predictions = classifier.predict(new_texts)
print(f"Predictions: {predictions}")

Advanced Techniques ​

Fine-tuning Embeddings ​

python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def fine_tune_embeddings(model_name, training_data):
    """Fine-tune embeddings for specific domain"""
    
    # Load pre-trained model
    model = SentenceTransformer(model_name)
    
    # Prepare training examples
    train_examples = []
    for anchor, positive, negative in training_data:
        train_examples.append(InputExample(texts=[anchor, positive], label=1.0))
        train_examples.append(InputExample(texts=[anchor, negative], label=0.0))
    
    # Create data loader
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    # Define loss function
    train_loss = losses.CosineSimilarityLoss(model)
    
    # Fine-tune
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=1,
        warmup_steps=100
    )
    
    return model

# Example training data (anchor, positive, negative)
training_data = [
    ("Python programming", "Coding in Python", "JavaScript development"),
    ("Machine learning", "AI algorithms", "Web design"),
    ("Data science", "Statistical analysis", "Graphic design")
]

# Fine-tune model
# fine_tuned_model = fine_tune_embeddings('all-MiniLM-L6-v2', training_data)

Multilingual Embeddings ​

python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load multilingual model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Text in different languages
texts = [
    "Hello, how are you?",           # English
    "Hola, ΒΏcΓ³mo estΓ‘s?",           # Spanish  
    "Bonjour, comment allez-vous?", # French
    "Hallo, wie geht es dir?",      # German
]

# Generate embeddings
embeddings = model.encode(texts)

# Calculate cross-lingual similarities
similarities = cosine_similarity(embeddings)
print("Cross-lingual similarities:")
for i, text in enumerate(texts):
    print(f"{text}: {similarities[0][i]:.3f}")

Best Practices ​

1. Choosing the Right Embedding Model ​

python
# Decision guide for embedding selection
def choose_embedding_model(use_case, data_size, latency_requirement):
    """Guide for selecting appropriate embedding model"""
    
    if use_case == "semantic_search":
        if latency_requirement == "low":
            return "all-MiniLM-L6-v2"  # Fast, good quality
        else:
            return "all-mpnet-base-v2"  # Best quality
    
    elif use_case == "multilingual":
        return "paraphrase-multilingual-MiniLM-L12-v2"
    
    elif use_case == "code_search":
        return "microsoft/codebert-base"
    
    elif data_size == "large" and latency_requirement == "low":
        return "text-embedding-3-small"  # OpenAI API
    
    else:
        return "all-MiniLM-L6-v2"  # Good default

2. Optimizing Performance ​

python
import faiss
import numpy as np

class FastSimilaritySearch:
    def __init__(self, embedding_dim):
        """Initialize FAISS index for fast similarity search"""
        self.index = faiss.IndexFlatIP(embedding_dim)  # Inner product
        self.documents = []
    
    def add_embeddings(self, embeddings, documents):
        """Add embeddings to index"""
        # Normalize embeddings for cosine similarity
        embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        
        self.index.add(embeddings_norm.astype('float32'))
        self.documents.extend(documents)
    
    def search(self, query_embedding, k=5):
        """Fast similarity search"""
        query_norm = query_embedding / np.linalg.norm(query_embedding)
        
        scores, indices = self.index.search(
            query_norm.reshape(1, -1).astype('float32'), k
        )
        
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < len(self.documents):
                results.append({
                    'document': self.documents[idx],
                    'score': score
                })
        
        return results
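
The class above is defined but not exercised, so here is a minimal usage sketch. It assumes the all-MiniLM-L6-v2 sentence-transformer and a few of the example documents from the semantic search section, and it reads the embedding dimension directly from the model.

python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Machine learning algorithms can recognize patterns in data",
    "Computer vision enables machines to interpret visual information",
    "Natural language processing helps computers understand text"
]
embeddings = encoder.encode(documents)

# Build the FAISS-backed index and run a query
index = FastSimilaritySearch(embedding_dim=encoder.get_sentence_embedding_dimension())
index.add_embeddings(np.asarray(embeddings), documents)

query_embedding = encoder.encode("How do computers understand images?")
for result in index.search(query_embedding, k=2):
    print(f"{result['score']:.3f} - {result['document']}")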

3. Evaluation Metrics ​

python
def evaluate_embeddings(embeddings, labels, task='classification'):
    """Evaluate embedding quality"""
    
    if task == 'classification':
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        
        clf = LogisticRegression()
        scores = cross_val_score(clf, embeddings, labels, cv=5)
        return scores.mean()
    
    elif task == 'clustering':
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score
        
        kmeans = KMeans(n_clusters=len(set(labels)))
        cluster_labels = kmeans.fit_predict(embeddings)
        return silhouette_score(embeddings, cluster_labels)
    
    elif task == 'similarity':
        from sklearn.metrics.pairwise import cosine_similarity
        # Pairwise similarities; correlate these with human judgments (e.g. STS) when available
        similarities = cosine_similarity(embeddings)
        return similarities

Future Directions ​

  • Multimodal embeddings: Combining text, image, and audio
  • Dynamic embeddings: Embeddings that evolve over time
  • Compressed embeddings: Smaller vectors without quality loss
  • Domain-specific models: Specialized embeddings for specific fields

Applications ​

  • Scientific research: Embedding research papers and patents
  • Legal tech: Understanding legal documents and cases
  • Healthcare: Medical text understanding and drug discovery
  • Finance: Financial document analysis and risk assessment

