Text Vectorization
Converting preprocessed text into numerical representations for machine learning
What is Text Vectorization?
Vectorizing text is the process of converting text data (like words, sentences, or documents) into numerical representations (vectors) so that machines (like AI models) can process and understand the text.
Since machine learning algorithms work with numbers, we need to turn text into a numerical format. This is done using various text vectorization techniques, depending on how detailed or sophisticated the representation needs to be.
TEXT VECTORIZATION OVERVIEW
Text → Numbers → Machine Learning
THE CHALLENGE
Computers don't understand words like "cat" or "happy"
They only understand numbers!
THE SOLUTION
Convert text into mathematical vectors that preserve:
├── Semantic meaning
├── Relationships between words
├── Context information
└── Statistical patterns
THE IMPACT
This enables:
├── Text classification
├── Information retrieval
├── Recommendation systems
├── Machine translation
└── Question answering
Why Text Vectorization Matters
The Challenge: Computers don't understand words like "cat" or "happy" - they only understand numbers.
The Solution: Convert text into mathematical vectors that preserve semantic meaning and relationships.
The Impact: This enables:
- Text classification (spam detection, sentiment analysis)
- Information retrieval (search engines)
- Recommendation systems
- Machine translation
- Question answering systems
Common Text Vectorization Methods
VECTORIZATION METHODS HIERARCHY
CLASSICAL METHODS (Fast, Simple)
├── Bag of Words (BoW)
├── TF-IDF
└── N-gram Features
MODERN METHODS (Slower, Sophisticated)
├── Word2Vec
├── FastText
├── GloVe
└── Transformer Embeddings (BERT, GPT)
EVOLUTION
Simple Counting → Statistical Weighting → Neural Embeddings → Contextual Embeddings
1. Bag of Words (BoW)
BoW converts a text document into a vector of word counts: it just counts how many times each word from the vocabulary appears in the document.
BAG OF WORDS CONCEPT
Input Documents:
1. "I love dogs"
2. "I love cats"
3. "Dogs are amazing pets"
VOCABULARY: ["I", "love", "dogs", "cats", "are", "amazing", "pets"]
VECTORS:
Doc 1: [1, 1, 1, 0, 0, 0, 0] ← "I love dogs"
Doc 2: [1, 1, 0, 1, 0, 0, 0] ← "I love cats"
Doc 3: [0, 0, 1, 0, 1, 1, 1] ← "Dogs are amazing pets"
CHARACTERISTICS
├── Order ignored
├── Grammar ignored
├── Only frequency matters
└── Simple and fast
Key Characteristics:
- Order of words is ignored
- Grammar is ignored
- Only word frequency matters
Simple Example
Suppose we have two sentences:
- "I love dogs"
- "I love cats"
Vocabulary (unique words from all texts): ["I", "love", "dogs", "cats"]
Now we turn each sentence into a vector:
- "I love dogs" β [1, 1, 1, 0]
- "I love cats" β [1, 1, 0, 1]
Each number in the vector corresponds to the count of a word in the sentence.
BoW Implementation in Python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Sample texts
texts = ["I love dogs", "I love cats", "Dogs are amazing pets"]
# Create BoW vectorizer
vectorizer = CountVectorizer()
# Fit and transform the texts
X = vectorizer.fit_transform(texts)
# Show the feature names (vocabulary)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:")
print(X.toarray())
# Create a readable DataFrame
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
bow_df.index = texts
print("\nBoW as DataFrame:")
print(bow_df)
# OUTPUT:
# Vocabulary: ['amazing' 'are' 'cats' 'dogs' 'love' 'pets']
# BoW Vectors:
# [[0 0 0 1 1 0]  ← "I love dogs"
#  [0 0 1 0 1 0]  ← "I love cats"
#  [1 1 0 1 0 1]] ← "Dogs are amazing pets"
#
# BoW as DataFrame:
# amazing are cats dogs love pets
# I love dogs 0 0 0 1 1 0
# I love cats 0 0 1 0 1 0
# Dogs are amazing pets    1    1    0    1    0    1
Real-World BoW Example
# More realistic dataset
data = [
'Most shark attacks occur about 10 feet from the beach since that is where the people are',
'The efficiency with which he paired the socks in the drawer was quite admirable',
'Carol drank the blood as if she were a vampire',
'Giving directions that the mountains are to the west only works when you can see them',
'The sign said there was road work ahead so he decided to speed up',
'The gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms'
]
# Create BoW representation
countvec = CountVectorizer()
countvec_fit = countvec.fit_transform(data)
# Convert to DataFrame for better visualization
bag_of_words = pd.DataFrame(
countvec_fit.toarray(),
columns=countvec.get_feature_names_out()
)
print(f"Vocabulary size: {len(countvec.get_feature_names_out())} words")
print(f"Document vectors shape: {bag_of_words.shape}")
print("\nFirst few columns of BoW matrix:")
print(bag_of_words.iloc[:, :10]) # Show first 10 words
# Find most common words across all documents
word_frequencies = bag_of_words.sum().sort_values(ascending=False)
print("\nTop 10 most frequent words:")
print(word_frequencies.head(10))
# OUTPUT:
# Vocabulary size: 71 words
# Document vectors shape: (6, 71)
#
# First few columns of BoW matrix:
#    10  about  admirable  ahead  are  as  attacks  back  bait  beach
# 0   1      1          0      0    1   0        1     0     0      1
# 1   0      0          1      0    0   0        0     0     0      0
# 2   0      0          0      0    0   1        0     0     0      0
# 3   0      0          0      0    1   0        0     0     0      0
# 4   0      0          0      1    0   0        0     0     0      0
# 5   0      0          0      0    0   1        0     1     1      0
#
# Top 10 most frequent words:
# the     12
# he       3
# to       3
# are      2
# as       2
# in       2
# of       2
# that     2
# was      2
# ...      1   (the remaining words each appear once; order among tied counts is arbitrary)
Pros and Cons of BoW
✅ PROS
├── Simple and fast to implement
├── Works well for basic text classification
├── Easy to understand and interpret
├── Good baseline for many NLP problems
└── Computationally efficient
❌ CONS
├── Ignores word order and context
├── Large vocabularies create sparse vectors
├── Doesn't capture semantic similarity
├── Treats "not good" same as "good not"
└── High-dimensional feature space
Pros:
- Simple and fast to implement
- Works well for basic text classification tasks
- Easy to understand and interpret
- Good baseline for many NLP problems
Cons:
- Ignores word order and context
- Large vocabularies create sparse, high-dimensional vectors
- Doesn't capture meaning or similarity between words
- Treats "not good" the same as "good not"
When to Use BoW (a minimal spam-filter baseline sketch follows this list):
- Quick prototyping and baseline models
- Tasks where context is less important (spam detection)
- Teaching and learning NLP fundamentals
- Small datasets with limited vocabulary
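To make the spam-detection use case concrete, here is a minimal baseline sketch that feeds BoW counts into a Multinomial Naive Bayes classifier. The messages and labels are invented purely for illustration; a real spam filter would need far more data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Toy dataset (illustrative only)
messages = [
    "Win a free prize now",
    "Limited offer claim your free gift",
    "Are we still meeting for lunch today",
    "Please review the attached report by Friday"
]
labels = ["spam", "spam", "ham", "ham"]
# CountVectorizer turns each message into word counts; Naive Bayes learns from them
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(messages, labels)
print(spam_model.predict(["Claim your free prize today"]))   # ['spam'] on this toy data
print(spam_model.predict(["Can we move the lunch meeting"])) # ['ham'] on this toy data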
2. TF-IDF (Term Frequency–Inverse Document Frequency)
TF-IDF is an improved version of Bag of Words that not only counts word occurrences, but also weighs how important a word is in a document relative to a collection of documents.
TF-IDF CONCEPT
TF-IDF = Term Frequency × Inverse Document Frequency
INTUITION
"How important is a word in this document,
compared to all other documents?"
WEIGHTING PRINCIPLE
├── Common words (the, and) → Lower weight
├── Rare, informative words → Higher weight
├── Frequent in doc, rare overall → Highest weight
└── Automatic stop word handling
FORMULA
TF(w,d) = (count of w in d) / (total words in d)
IDF(w) = log(N / (docs containing w))
TF-IDF(w,d) = TF(w,d) × IDF(w)
What Does TF-IDF Do?
It answers: "How important is a word in this document, compared to all other documents?"
TF-IDF Formula
For a word w in a document d:
TF-IDF(w,d) = TF(w,d) × IDF(w)
Where:
TF (Term Frequency) = how often the word appears in the document
TF(w,d) = (# times w appears in d) / (total # words in d)
IDF (Inverse Document Frequency) = how rare the word is across all documents
IDF(w) = log(N / # documents containing w)
where N is the total number of documents. In practice a 1 is often added to the denominator (and scikit-learn uses a smoothed, normalized variant), which avoids division by zero and makes exact scores differ slightly from this plain formula.
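To see the formula in action, here is a small hand-rolled sketch that applies the plain TF and IDF definitions above to a toy corpus (the sentences are made up for illustration; scikit-learn's TfidfVectorizer uses a smoothed IDF plus L2 normalization, so its scores will differ):
import math
docs = [
    "the dog chased the ball",
    "the cat slept on the mat",
    "the stock market rallied today"
]
tokenized = [doc.lower().split() for doc in docs]
N = len(tokenized)
def tf(word, tokens):
    # share of the document's words that are `word`
    return tokens.count(word) / len(tokens)
def idf(word):
    # log of (total documents / documents containing the word)
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(N / df)
def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)
for word in ["the", "dog", "market"]:
    print(word, [round(tf_idf(word, tokens), 3) for tokens in tokenized])
# the    [0.0, 0.0, 0.0]    ← appears in every document, so IDF = log(3/3) = 0
# dog    [0.22, 0.0, 0.0]   ← only in document 1, so it keeps a positive weight there
# market [0.0, 0.0, 0.22]   ← only in document 3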
Why TF-IDF is Better Than BoW
- Common words (like "the", "and") get lower weight
- Rare, informative words get higher weight
- Preserves the idea of "importance," not just presence
- Automatically handles stop words by giving them low scores
TF-IDF Example
Two documents:
- "I love dogs"
- "Dogs are the best pets"
In a larger collection, TF-IDF keeps a meaningful weight for a content word like "dogs", while "the" appears in nearly every document, so its IDF pushes its score toward zero no matter how often it occurs.
TF-IDF Implementation
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# Sample documents
docs = [
"I love dogs",
"Dogs are the best pets",
"Cats are also great pets",
"I love both cats and dogs"
]
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit the vectorizer and transform the documents into TF-IDF vectors
X = vectorizer.fit_transform(docs)
# Print the vocabulary (ordered by column index in the TF-IDF matrix)
print("Vocabulary:", vectorizer.get_feature_names_out())
# Print the TF-IDF matrix as a dense array
print("TF-IDF Vectors:")
tfidf_array = X.toarray()
print(tfidf_array)
# Create readable DataFrame
tfidf_df = pd.DataFrame(tfidf_array, columns=vectorizer.get_feature_names_out())
tfidf_df.index = [f"Doc {i+1}" for i in range(len(docs))]
print("\nTF-IDF as DataFrame:")
print(tfidf_df.round(3))
# OUTPUT (scikit-learn uses a smoothed IDF and L2 normalization; values rounded to 3 decimals):
# Vocabulary: ['also' 'and' 'are' 'best' 'both' 'cats' 'dogs' 'great' 'love' 'pets' 'the']
# TF-IDF Vectors:
# [[0.000 0.000 0.000 0.000 0.000 0.000 0.629 0.000 0.777 0.000 0.000]
#  [0.000 0.000 0.413 0.523 0.000 0.000 0.334 0.000 0.000 0.413 0.523]
#  [0.509 0.000 0.401 0.000 0.000 0.401 0.000 0.509 0.000 0.401 0.000]
#  [0.000 0.523 0.000 0.000 0.523 0.413 0.334 0.000 0.413 0.000 0.000]]
#
# TF-IDF as DataFrame:
#         also    and    are   best   both   cats   dogs  great   love   pets    the
# Doc 1  0.000  0.000  0.000  0.000  0.000  0.000  0.629  0.000  0.777  0.000  0.000
# Doc 2  0.000  0.000  0.413  0.523  0.000  0.000  0.334  0.000  0.000  0.413  0.523
# Doc 3  0.509  0.000  0.401  0.000  0.000  0.401  0.000  0.509  0.000  0.401  0.000
# Doc 4  0.000  0.523  0.000  0.000  0.523  0.413  0.334  0.000  0.413  0.000  0.000
Understanding TF-IDF Output
# Analyze specific words
def analyze_tfidf_scores(docs, words_to_analyze):
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
feature_names = vectorizer.get_feature_names_out()
for word in words_to_analyze:
if word in feature_names:
word_idx = list(feature_names).index(word)
scores = X[:, word_idx].toarray().flatten()
print(f"\nWord: '{word}'")
for i, (doc, score) in enumerate(zip(docs, scores)):
print(f" Doc {i+1}: {score:.3f} - '{doc}'")
else:
print(f"\nWord '{word}' not found in vocabulary")
# Analyze how different words are weighted
analyze_tfidf_scores(docs, ['dogs', 'love', 'are', 'pets'])
# OUTPUT (values rounded to 3 decimals):
# Word: 'dogs'
#   Doc 1: 0.629 - 'I love dogs'
#   Doc 2: 0.334 - 'Dogs are the best pets'
#   Doc 3: 0.000 - 'Cats are also great pets'
#   Doc 4: 0.334 - 'I love both cats and dogs'
#
# Word: 'love'
#   Doc 1: 0.777 - 'I love dogs'
#   Doc 2: 0.000 - 'Dogs are the best pets'
#   Doc 3: 0.000 - 'Cats are also great pets'
#   Doc 4: 0.413 - 'I love both cats and dogs'
#
# Word: 'are'
#   Doc 1: 0.000 - 'I love dogs'
#   Doc 2: 0.413 - 'Dogs are the best pets'
#   Doc 3: 0.401 - 'Cats are also great pets'
#   Doc 4: 0.000 - 'I love both cats and dogs'
#
# Word: 'pets'
#   Doc 1: 0.000 - 'I love dogs'
#   Doc 2: 0.413 - 'Dogs are the best pets'
#   Doc 3: 0.401 - 'Cats are also great pets'
#   Doc 4: 0.000 - 'I love both cats and dogs'
TF-IDF Best Practices
# Advanced TF-IDF configuration
advanced_vectorizer = TfidfVectorizer(
max_features=1000, # Limit vocabulary size
min_df=2, # Ignore words appearing in fewer than 2 documents
max_df=0.8, # Ignore words appearing in more than 80% of documents
stop_words='english', # Remove common English stop words
ngram_range=(1, 2), # Include unigrams and bigrams
lowercase=True, # Convert to lowercase
strip_accents='unicode' # Remove accents
)
# Example with larger dataset
larger_docs = [
"The quick brown fox jumps over the lazy dog",
"Machine learning is a subset of artificial intelligence",
"Natural language processing helps computers understand human language",
"Deep learning neural networks can process large amounts of data",
"Text vectorization converts words into numerical representations",
"TF-IDF weighs the importance of words in documents"
]
X_advanced = advanced_vectorizer.fit_transform(larger_docs)
print(f"Advanced TF-IDF shape: {X_advanced.shape}")
print("Feature names (first 20):")
print(advanced_vectorizer.get_feature_names_out()[:20])
# OUTPUT:
# Advanced TF-IDF shape: (6, 2)
# Feature names (first 20):
# ['learning' 'words']
#
# Note: min_df=2 is very restrictive on a corpus of only 6 short documents:
# just the terms that appear in at least 2 of them ('learning' and 'words')
# survive. On a realistically sized corpus these settings typically keep
# hundreds of unigrams and bigrams.
Pros and Cons of TF-IDF
✅ PROS
├── Reduces influence of common words
├── Still fast and interpretable
├── Emphasizes informative words
├── Better than BoW for most tasks
├── Works well with linear classifiers
└── Automatic stop word handling
❌ CONS
├── Still ignores word order and context
├── Sparse and high-dimensional vectors
├── No semantic similarity (cat ≠ feline)
├── Performance degrades with large vocabularies
└── Doesn't handle synonyms or related words
Pros:
- Reduces the influence of common, unimportant words
- Still fast and interpretable
- Helps emphasize informative words
- Better than BoW for most tasks
- Works well with linear classifiers
Cons:
- Still ignores word order and context
- Vectors are sparse and high-dimensional
- Doesn't capture semantic similarity (e.g., "cat" ≠ "feline")
- Performance degrades with very large vocabularies
Use Cases (a retrieval sketch follows this list):
- Text classification and clustering
- Document retrieval and search
- Keyword extraction
- Feature engineering for ML models
- Information retrieval systems
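As a small illustration of the retrieval use case, the sketch below indexes a handful of made-up documents with TfidfVectorizer and ranks them against a query by cosine similarity (the corpus and query are invented for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Toy document collection (illustrative only)
corpus = [
    "How to train a dog to sit and stay",
    "Best dry food for adult cats",
    "Grooming and training tips for dogs",
    "Stock market update for today"
]
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(corpus)              # fit on the collection
query_vector = vectorizer.transform(["dog training tips"])  # reuse the fitted vocabulary
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
# The dog-related documents rank highest; the stock-market document scores 0.000
# because it shares no informative terms with the query.
Note that the query is transformed with the vectorizer fitted on the corpus; fitting a separate vectorizer on the query would produce vectors with incompatible dimensions.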
3. Comparison: BoW vs TF-IDF
BOW vs TF-IDF COMPARISON
FEATURE           | BOW              | TF-IDF
------------------|------------------|------------------
Word Importance   | Equal weight     | Weighted by rarity
Common Words      | High influence   | Low influence
Rare Words        | Low influence    | High influence
Speed             | Fast             | Fast
Interpretability  | High             | High
Context Awareness | None             | None
Use Cases         | Simple tasks     | Better for most tasks
def compare_vectorization_methods(docs):
"""Compare BoW and TF-IDF on the same documents"""
# BoW
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(docs)
# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
# Compare for a specific word
word = "love"
if word in bow_vectorizer.get_feature_names_out():
bow_idx = list(bow_vectorizer.get_feature_names_out()).index(word)
tfidf_idx = list(tfidf_vectorizer.get_feature_names_out()).index(word)
print(f"Comparison for word: '{word}'")
print("=" * 40)
print(f"{'Document':<30} {'BoW':<10} {'TF-IDF':<10}")
print("-" * 50)
for i, doc in enumerate(docs):
bow_score = bow_matrix[i, bow_idx]
tfidf_score = tfidf_matrix[i, tfidf_idx]
print(f"{doc[:25]+'...' if len(doc) > 25 else doc:<30} "
f"{bow_score:<10} {tfidf_score:.3f}")
# Test the comparison
test_docs = [
"I love programming",
"I love love love this book",
"Programming is fun",
"This book teaches programming with love"
]
compare_vectorization_methods(test_docs)
# OUTPUT (TF-IDF values rounded to 3 decimals):
# Comparison for word: 'love'
# ========================================
# Document                       BoW        TF-IDF
# --------------------------------------------------
# I love programming             1          0.707
# I love love love this boo...   3          0.864
# Programming is fun             0          0.000
# This book teaches program...   1          0.317
Quick Start Guide
GETTING STARTED WITH TEXT VECTORIZATION
1. CHOOSE VECTORIZATION METHOD
☐ Start with simple methods (BoW/TF-IDF)
☐ Consider your dataset size
☐ Think about context requirements
☐ Evaluate computational constraints
2. CONFIGURE PARAMETERS
☐ Set vocabulary size limits
☐ Choose n-gram ranges
☐ Configure stop word handling
☐ Set frequency thresholds
3. FIT AND TRANSFORM
☐ Fit vectorizer on training data
☐ Transform documents to vectors
☐ Handle sparse matrices efficiently
☐ Validate vector dimensions
4. ANALYZE AND OPTIMIZE
☐ Check sparsity levels
☐ Analyze feature importance
☐ Optimize for performance
☐ Monitor memory usage
Required Libraries
# Essential libraries for text vectorization
pip install scikit-learn pandas numpy matplotlib seaborn
# Optional but recommended
pip install scipy # For sparse matrix operations
pip install joblib  # For model persistence
Complete Vectorization Workflow Example
# Complete example showing preprocessing + vectorization pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd
import numpy as np
import re
from scipy.sparse import csr_matrix
# Sample documents
documents = [
"Natural language processing is fascinating! It helps computers understand human text.",
"Machine learning algorithms can process and analyze large amounts of textual data.",
"Text preprocessing cleans and prepares data for analysis and machine learning.",
"Vectorization converts preprocessed words into numerical representations for computers."
]
def integrated_preprocessing_vectorization_pipeline(documents, method='tfidf'):
"""
Complete pipeline from raw text to vectors
"""
print(f"π INTEGRATED PREPROCESSING + VECTORIZATION PIPELINE")
print(f"Using method: {method.upper()}")
print("=" * 60)
# Step 1: Preprocessing
print("\nπ STEP 1: TEXT PREPROCESSING")
print("-" * 30)
def preprocess_text(text):
# Lowercase
text = text.lower()
# Remove punctuation but keep spaces
text = re.sub(r'[^\w\s]', '', text)
# Remove extra whitespace
text = ' '.join(text.split())
return text
cleaned_docs = [preprocess_text(doc) for doc in documents]
# Show preprocessing results
preprocessing_df = pd.DataFrame({
'Original': [doc[:50] + '...' if len(doc) > 50 else doc for doc in documents],
'Cleaned': [doc[:50] + '...' if len(doc) > 50 else doc for doc in cleaned_docs]
})
print(preprocessing_df.to_string(index=False))
# Step 2: Vectorization
print(f"\nπ’ STEP 2: TEXT VECTORIZATION ({method.upper()})")
print("-" * 30)
# Choose vectorization method
if method == 'bow':
vectorizer = CountVectorizer(
max_features=50,
stop_words='english',
ngram_range=(1, 2)
)
elif method == 'tfidf':
vectorizer = TfidfVectorizer(
max_features=50,
stop_words='english',
ngram_range=(1, 2)
)
else:
raise ValueError("Method must be 'bow' or 'tfidf'")
# Fit and transform
vectors = vectorizer.fit_transform(cleaned_docs)
feature_names = vectorizer.get_feature_names_out()
# Step 3: Analysis
print(f"\nπ STEP 3: VECTOR ANALYSIS")
print("-" * 30)
print(f"Vocabulary size: {len(feature_names)}")
print(f"Vector shape: {vectors.shape}")
print(f"Sparsity: {1.0 - vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}")
print(f"Memory usage: {vectors.data.nbytes + vectors.indices.nbytes + vectors.indptr.nbytes:,} bytes")
# Show feature names
print(f"\nTop 10 features: {feature_names[:10].tolist()}")
# Create readable DataFrame
vectors_df = pd.DataFrame(
vectors.toarray(),
columns=feature_names,
index=[f"Doc {i+1}" for i in range(len(documents))]
)
# Show non-zero values for first document
doc1_vector = vectors_df.iloc[0]
non_zero_features = doc1_vector[doc1_vector > 0].sort_values(ascending=False)
print(f"\nTop features for Document 1:")
for feature, score in non_zero_features.head(10).items():
print(f" {feature}: {score:.3f}")
# Step 4: Comparison
print(f"\nβοΈ STEP 4: METHOD COMPARISON")
print("-" * 30)
# Compare with alternative method
alt_method = 'bow' if method == 'tfidf' else 'tfidf'
if alt_method == 'bow':
alt_vectorizer = CountVectorizer(max_features=50, stop_words='english')
else:
alt_vectorizer = TfidfVectorizer(max_features=50, stop_words='english')
alt_vectors = alt_vectorizer.fit_transform(cleaned_docs)
comparison_df = pd.DataFrame({
f'{method.upper()} Shape': [str(vectors.shape)],
f'{method.upper()} Sparsity': [f"{1.0 - vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}"],
f'{alt_method.upper()} Shape': [str(alt_vectors.shape)],
f'{alt_method.upper()} Sparsity': [f"{1.0 - alt_vectors.nnz / (alt_vectors.shape[0] * alt_vectors.shape[1]):.2%}"]
})
print(comparison_df.to_string(index=False))
return vectors, vectorizer, vectors_df
# Test both methods
print("π― TESTING BAG OF WORDS")
bow_vectors, bow_vectorizer, bow_df = integrated_preprocessing_vectorization_pipeline(documents, 'bow')
print("\n" + "="*80 + "\n")
print("π― TESTING TF-IDF")
tfidf_vectors, tfidf_vectorizer, tfidf_df = integrated_preprocessing_vectorization_pipeline(documents, 'tfidf')
# Final comparison
print(f"\nπ FINAL COMPARISON")
print("=" * 40)
print(f"BoW vectors shape: {bow_vectors.shape}")
print(f"TF-IDF vectors shape: {tfidf_vectors.shape}")
print(f"BoW sparsity: {1.0 - bow_vectors.nnz / (bow_vectors.shape[0] * bow_vectors.shape[1]):.2%}")
print(f"TF-IDF sparsity: {1.0 - tfidf_vectors.nnz / (tfidf_vectors.shape[0] * tfidf_vectors.shape[1]):.2%}")
# EXAMPLE OUTPUT (abridged and illustrative; exact vocabulary size, sparsity, and scores will differ when you run the pipeline):
# INTEGRATED PREPROCESSING + VECTORIZATION PIPELINE
# Using method: TFIDF
# ============================================================
#
# STEP 1: TEXT PREPROCESSING
# ------------------------------
# Original Cleaned
# Natural language processing is fascinating! It... natural language processing is fascinating it he...
# Machine learning algorithms can process and analy... machine learning algorithms can process and ana...
# Text preprocessing cleans and prepares data for a... text preprocessing cleans and prepares data fo...
# Vectorization converts preprocessed words into nu... vectorization converts preprocessed words into...
#
# STEP 2: TEXT VECTORIZATION (TFIDF)
# ------------------------------
#
# STEP 3: VECTOR ANALYSIS
# ------------------------------
# Vocabulary size: 29
# Vector shape: (4, 29)
# Sparsity: 71.55%
# Memory usage: 864 bytes
#
# Top 10 features: ['algorithms', 'amounts', 'analysis', 'analyze', 'and', 'can', 'cleans', 'computers', 'converts', 'data']
#
# Top features for Document 1:
# processing: 0.447
# language: 0.447
# natural: 0.447
# text: 0.327
# understand: 0.447
# human: 0.447
# helps: 0.447
# fascinating: 0.447
# computers: 0.327
#
# STEP 4: METHOD COMPARISON
# ------------------------------
# TFIDF Shape TFIDF Sparsity BOW Shape BOW Sparsity
# (4, 29) 71.55% (4, 29) 70.69%
#
# FINAL COMPARISON
# ========================================
# BoW vectors shape: (4, 29)
# TF-IDF vectors shape: (4, 29)
# BoW sparsity: 70.69%
# TF-IDF sparsity: 71.55%
Key Takeaways
TEXT VECTORIZATION SUMMARY
KEY CONCEPTS
├── Text must be converted to numbers for ML
├── BoW counts word frequencies (simple but effective)
├── TF-IDF weighs words by importance (better for most tasks)
├── Sparsity is a common challenge with classical methods
└── Choose method based on task requirements
PRACTICAL APPLICATIONS
├── Document classification and clustering
├── Information retrieval and search
├── Sentiment analysis and opinion mining
├── Recommendation systems
└── Feature engineering for ML models
NEXT STEPS
├── Experiment with different vectorization parameters
├── Try n-gram features for better context
├── Explore neural embedding methods (Word2Vec, BERT)
├── Consider computational efficiency for large datasets
└── Validate vectorization quality on your specific task
When to Use Each Method
VECTORIZATION METHOD SELECTION
USE BAG OF WORDS WHEN:
├── Building quick prototypes
├── Working with small datasets
├── Need interpretable features
├── Computational resources are limited
└── Task doesn't require semantic understanding
USE TF-IDF WHEN:
├── Need better feature weighting
├── Working with varied document lengths
├── Want automatic stop word handling
├── Building information retrieval systems
└── Need balance between simplicity and performance
CONSIDER ADVANCED METHODS WHEN:
├── Semantic similarity is important
├── Context and word order matter
├── Working with large-scale datasets
├── Need dense vector representations
└── Performance is critical
Related Topics:
- Text Preprocessing: Clean and prepare text before vectorization
- Text Analysis: POS tagging and Named Entity Recognition
- Embeddings & Semantic Similarity: Advanced neural vector representations
- Transformers & Attention: Modern contextual embeddings