
Text Vectorization ​

Converting preprocessed text into numerical representations for machine learning

πŸ”’ What is Text Vectorization? ​

Vectorizing text is the process of converting text data (like words, sentences, or documents) into numerical representations (vectors) so that machines (like AI models) can process and understand the text.

Since machine learning algorithms work with numbers, we need to turn text into a numerical format. This is done using various text vectorization techniques, depending on how detailed or sophisticated the representation needs to be.

text
πŸ“Š TEXT VECTORIZATION OVERVIEW

Text β†’ Numbers β†’ Machine Learning

🎯 THE CHALLENGE
Computers don't understand words like "cat" or "happy"
They only understand numbers!

πŸ”§ THE SOLUTION
Convert text into mathematical vectors that preserve:
β”œβ”€β”€ Semantic meaning
β”œβ”€β”€ Relationships between words
β”œβ”€β”€ Context information
└── Statistical patterns

πŸ’‘ THE IMPACT
This enables:
β”œβ”€β”€ Text classification
β”œβ”€β”€ Information retrieval
β”œβ”€β”€ Recommendation systems
β”œβ”€β”€ Machine translation
└── Question answering

🎯 Why Text Vectorization Matters ​

The Challenge: Computers don't understand words like "cat" or "happy" - they only understand numbers.

The Solution: Convert text into mathematical vectors that preserve semantic meaning and relationships.

The Impact: This enables:

  • Text classification (spam detection, sentiment analysis)
  • Information retrieval (search engines)
  • Recommendation systems
  • Machine translation
  • Question answering systems

πŸ“Š Common Text Vectorization Methods ​

text
πŸ”’ VECTORIZATION METHODS HIERARCHY

CLASSICAL METHODS (Fast, Simple)
β”œβ”€β”€ Bag of Words (BoW)
β”œβ”€β”€ TF-IDF
└── N-gram Features

MODERN METHODS (Slow, Sophisticated)
β”œβ”€β”€ Word2Vec
β”œβ”€β”€ FastText
β”œβ”€β”€ GloVe
└── Transformer Embeddings (BERT, GPT)

πŸ“ˆ EVOLUTION
Simple Counting β†’ Statistical Weighting β†’ Neural Embeddings β†’ Contextual Embeddings

1. Bag of Words (BoW) ​

BoW converts a text document into a vector of word counts β€” it just counts how many times each word from the vocabulary appears in the document.

text
πŸ“ BAG OF WORDS CONCEPT

Input Documents:
1. "I love dogs"
2. "I love cats"
3. "Dogs are amazing pets"

VOCABULARY: ["I", "love", "dogs", "cats", "are", "amazing", "pets"]

VECTORS:
Doc 1: [1, 1, 1, 0, 0, 0, 0] β†’ "I love dogs"
Doc 2: [1, 1, 0, 1, 0, 0, 0] β†’ "I love cats"
Doc 3: [0, 0, 1, 0, 1, 1, 1] β†’ "Dogs are amazing pets"

🎯 CHARACTERISTICS
β”œβ”€β”€ Order ignored
β”œβ”€β”€ Grammar ignored
β”œβ”€β”€ Only frequency matters
└── Simple and fast

Key Characteristics:

  • Order of words is ignored
  • Grammar is ignored
  • Only word frequency matters

πŸ“˜ Simple Example ​

Suppose we have two sentences:

  1. "I love dogs"
  2. "I love cats"

Vocabulary (unique words from all texts): ["I", "love", "dogs", "cats"]

Now we turn each sentence into a vector:

  • "I love dogs" β†’ [1, 1, 1, 0]
  • "I love cats" β†’ [1, 1, 0, 1]

Each number in the vector corresponds to the count of a word in the sentence.
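
The same vectors can be built by hand in a few lines of plain Python. This is only a minimal sketch to make the counting step concrete (the build_bow helper is invented here for illustration); real projects use a library vectorizer, as shown next.

python
# Hand-rolled Bag of Words (illustrative sketch only)
sentences = ["I love dogs", "I love cats"]

# Build the vocabulary: unique words in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def build_bow(sentence, vocabulary):
    """Count how often each vocabulary word appears in the sentence."""
    words = sentence.split()
    return [words.count(word) for word in vocabulary]

print("Vocabulary:", vocabulary)
for sentence in sentences:
    print(sentence, "->", build_bow(sentence, vocabulary))

# OUTPUT:
# Vocabulary: ['I', 'love', 'dogs', 'cats']
# I love dogs -> [1, 1, 1, 0]
# I love cats -> [1, 1, 0, 1]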

πŸ”§ BoW Implementation in Python ​

python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample texts
texts = ["I love dogs", "I love cats", "Dogs are amazing pets"]

# Create BoW vectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
X = vectorizer.fit_transform(texts)

# Show the feature names (vocabulary)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:")
print(X.toarray())

# Create a readable DataFrame
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
bow_df.index = texts
print("\nBoW as DataFrame:")
print(bow_df)

# OUTPUT:
# Vocabulary: ['amazing' 'are' 'cats' 'dogs' 'love' 'pets']
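# Note: 'i' is missing because CountVectorizer lowercases the text and its
# default token_pattern ignores single-character tokens.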
# BoW Vectors:
# [[0 0 0 1 1 0]   β†’ "I love dogs"
#  [0 0 1 0 1 0]   β†’ "I love cats" 
#  [1 1 0 1 0 1]]  β†’ "Dogs are amazing pets"
#
# BoW as DataFrame:
#                    amazing  are  cats  dogs  love  pets
# I love dogs              0    0     0     1     1     0
# I love cats              0    0     1     0     1     0
# Dogs are amazing pets    1    1     0     1     0     1

πŸ” Real-World BoW Example ​

python
# More realistic dataset
data = [
    'Most shark attacks occur about 10 feet from the beach since that is where the people are',
    'The efficiency with which he paired the socks in the drawer was quite admirable',
    'Carol drank the blood as if she were a vampire',
    'Giving directions that the mountains are to the west only works when you can see them',
    'The sign said there was road work ahead so he decided to speed up',
    'The gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms'
]

# Create BoW representation
countvec = CountVectorizer()
countvec_fit = countvec.fit_transform(data)

# Convert to DataFrame for better visualization
bag_of_words = pd.DataFrame(
    countvec_fit.toarray(), 
    columns=countvec.get_feature_names_out()
)

print(f"Vocabulary size: {len(countvec.get_feature_names_out())} words")
print(f"Document vectors shape: {bag_of_words.shape}")
print("\nFirst few columns of BoW matrix:")
print(bag_of_words.iloc[:, :10])  # Show first 10 words

# Find most common words across all documents
word_frequencies = bag_of_words.sum().sort_values(ascending=False)
print("\nTop 10 most frequent words:")
print(word_frequencies.head(10))

# OUTPUT:
# Vocabulary size: 71 words
# Document vectors shape: (6, 71)
# 
# First few columns of BoW matrix:
#    10  about  admirable  ahead  are  as  attacks  back  bait  beach
# 0   1      1          0      0    1   0        1     0     0      1
# 1   0      0          1      0    0   0        0     0     0      0
# 2   0      0          0      0    0   1        0     0     0      0
# 3   0      0          0      0    1   0        0     0     0      0
# 4   0      0          0      1    0   0        0     0     0      0
# 5   0      0          0      0    0   1        0     1     1      0
# 
# Top 10 most frequent words (ties are ordered arbitrarily):
# the     12
# to       3
# he       3
# was      2
# that     2
# of       2
# in       2
# as       2
# are      2
# (all remaining words appear exactly once)

βœ… Pros and Cons of BoW ​

text
βœ… PROS
β”œβ”€β”€ Simple and fast to implement
β”œβ”€β”€ Works well for basic text classification
β”œβ”€β”€ Easy to understand and interpret
β”œβ”€β”€ Good baseline for many NLP problems
└── Computationally efficient

❌ CONS
β”œβ”€β”€ Ignores word order and context
β”œβ”€β”€ Large vocabularies create sparse vectors
β”œβ”€β”€ Doesn't capture semantic similarity
β”œβ”€β”€ Treats "not good" same as "good not"
└── High-dimensional feature space

βœ… Pros:

  • Simple and fast to implement
  • Works well for basic text classification tasks
  • Easy to understand and interpret
  • Good baseline for many NLP problems

❌ Cons:

  • Ignores word order and context
  • Large vocabularies create sparse, high-dimensional vectors
  • Doesn't capture meaning or similarity between words
  • Treats "not good" the same as "good not" (see the sketch below)

βž• When to Use BoW:

  • Quick prototyping and baseline models
  • Tasks where context is less important (spam detection)
  • Teaching and learning NLP fundamentals
  • Small datasets with limited vocabulary
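
To see the word-order limitation from the cons list above in action, here is a tiny check: Bag of Words produces exactly the same vector for "not good" and "good not".

python
from sklearn.feature_extraction.text import CountVectorizer

# Two phrases with opposite word order
phrases = ["not good", "good not"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(phrases)

print("Vocabulary:", vectorizer.get_feature_names_out())
print(vectors.toarray())

# OUTPUT:
# Vocabulary: ['good' 'not']
# [[1 1]
#  [1 1]]   <- identical vectors; the word order is lost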

2. TF-IDF (Term Frequency–Inverse Document Frequency) ​

TF-IDF is an improved version of Bag of Words that not only counts word occurrences, but also weighs how important a word is in a document relative to a collection of documents.

text
πŸ“Š TF-IDF CONCEPT

TF-IDF = Term Frequency Γ— Inverse Document Frequency

🎯 INTUITION
"How important is a word in this document, 
compared to all other documents?"

πŸ“ˆ WEIGHTING PRINCIPLE
β”œβ”€β”€ Common words (the, and) β†’ Lower weight
β”œβ”€β”€ Rare, informative words β†’ Higher weight
β”œβ”€β”€ Frequent in doc, rare overall β†’ Highest weight
└── Automatic stop word handling

πŸ”’ FORMULA
TF(w,d) = (count of w in d) / (total words in d)
IDF(w) = log(N / (1 + docs containing w))
TF-IDF(w,d) = TF(w,d) Γ— IDF(w)

βš™οΈ What Does TF-IDF Do? ​

It answers: "How important is a word in this document, compared to all other documents?"

🧩 TF-IDF Formula ​

For a word w in a document d:

TF-IDF(w,d) = TF(w,d) Γ— IDF(w)

Where:

  • TF (Term Frequency) = frequency of the word in the document

    TF(w,d) = (# times w appears in d) / (total # words in d)
  • IDF (Inverse Document Frequency) = how rare the word is across all documents

    IDF(w) = log(N / (1 + # documents containing w))

    where N is the total number of documents.
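
To make the formulas concrete, here is a small hand calculation that follows the definitions above (the tf and idf helpers and the toy corpus are written only for this illustration). Note that scikit-learn's TfidfVectorizer, used in the examples below, applies a smoothed variant of IDF and L2-normalizes each document vector, so its scores will not match this hand calculation exactly.

python
import math

# Toy corpus (made up for this illustration)
documents = [
    "the dogs play in the park",
    "the cat sleeps on the mat",
    "the weather is nice today",
    "dogs are loyal pets",
]

def tf(word, doc):
    """TF(w, d): share of the document's words that are `word`."""
    words = doc.split()
    return words.count(word) / len(words)

def idf(word, docs):
    """IDF(w): log(N / (1 + number of documents containing w)), natural log."""
    containing = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / (1 + containing))

for word in ["the", "dogs"]:
    score = tf(word, documents[0]) * idf(word, documents)
    print(f"TF-IDF({word!r} in doc 1) = {score:.3f}")

# OUTPUT:
# TF-IDF('the' in doc 1) = 0.000   <- appears in nearly every document
# TF-IDF('dogs' in doc 1) = 0.048  <- rarer across the corpus, so it keeps weight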

πŸ” Why TF-IDF is Better Than BoW ​

  • Common words (like "the", "and") get lower weight
  • Rare, informative words get higher weight
  • Preserves the idea of "importance," not just presence
  • Automatically handles stop words by giving them low scores

πŸ“˜ TF-IDF Example ​

Two documents:

  1. "I love dogs"
  2. "Dogs are the best pets"

In a realistic collection of documents, TF-IDF gives "dogs" a higher weight than "the": "the" appears in almost every document, so its IDF pushes its score toward zero, while "dogs" appears in comparatively few documents and is treated as informative. (With only two documents the effect is barely visible; the intuition becomes clear as the corpus grows.)

πŸ”§ TF-IDF Implementation ​

python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

# Sample documents
docs = [
    "I love dogs",
    "Dogs are the best pets",
    "Cats are also great pets",
    "I love both cats and dogs"
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the documents into TF-IDF vectors
X = vectorizer.fit_transform(docs)

# Print the vocabulary (ordered by column index in the TF-IDF matrix)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the TF-IDF matrix as a dense array
print("TF-IDF Vectors:")
tfidf_array = X.toarray()
print(tfidf_array)

# Create readable DataFrame
tfidf_df = pd.DataFrame(tfidf_array, columns=vectorizer.get_feature_names_out())
tfidf_df.index = [f"Doc {i+1}" for i in range(len(docs))]
print("\nTF-IDF as DataFrame:")
print(tfidf_df.round(3))

# OUTPUT (values rounded to 3 decimals; scikit-learn applies a smoothed IDF
# and L2-normalizes each row, so scores differ slightly from the hand formula):
# Vocabulary: ['also' 'and' 'are' 'best' 'both' 'cats' 'dogs' 'great' 'love' 'pets' 'the']
# TF-IDF Vectors:
# [[0.000 0.000 0.000 0.000 0.000 0.000 0.629 0.000 0.777 0.000 0.000]
#  [0.000 0.000 0.413 0.523 0.000 0.000 0.334 0.000 0.000 0.413 0.523]
#  [0.509 0.000 0.401 0.000 0.000 0.401 0.000 0.509 0.000 0.401 0.000]
#  [0.000 0.523 0.000 0.000 0.523 0.413 0.334 0.000 0.413 0.000 0.000]]
#
# TF-IDF as DataFrame:
#         also    and    are   best   both   cats   dogs  great   love   pets    the
# Doc 1  0.000  0.000  0.000  0.000  0.000  0.000  0.629  0.000  0.777  0.000  0.000
# Doc 2  0.000  0.000  0.413  0.523  0.000  0.000  0.334  0.000  0.000  0.413  0.523
# Doc 3  0.509  0.000  0.401  0.000  0.000  0.401  0.000  0.509  0.000  0.401  0.000
# Doc 4  0.000  0.523  0.000  0.000  0.523  0.413  0.334  0.000  0.413  0.000  0.000

πŸ” Understanding TF-IDF Output ​

python
# Analyze specific words
def analyze_tfidf_scores(docs, words_to_analyze):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    feature_names = vectorizer.get_feature_names_out()
    
    for word in words_to_analyze:
        if word in feature_names:
            word_idx = list(feature_names).index(word)
            scores = X[:, word_idx].toarray().flatten()
            
            print(f"\nWord: '{word}'")
            for i, (doc, score) in enumerate(zip(docs, scores)):
                print(f"  Doc {i+1}: {score:.3f} - '{doc}'")
        else:
            print(f"\nWord '{word}' not found in vocabulary")

# Analyze how different words are weighted
analyze_tfidf_scores(docs, ['dogs', 'love', 'are', 'pets'])

# OUTPUT (values rounded to 3 decimals):
# Word: 'dogs'
#   Doc 1: 0.629 - 'I love dogs'
#   Doc 2: 0.334 - 'Dogs are the best pets'
#   Doc 3: 0.000 - 'Cats are also great pets'
#   Doc 4: 0.334 - 'I love both cats and dogs'
#
# Word: 'love'
#   Doc 1: 0.777 - 'I love dogs'
#   Doc 2: 0.000 - 'Dogs are the best pets'
#   Doc 3: 0.000 - 'Cats are also great pets'
#   Doc 4: 0.413 - 'I love both cats and dogs'
#
# Word: 'are'
#   Doc 1: 0.000 - 'I love dogs'
#   Doc 2: 0.413 - 'Dogs are the best pets'
#   Doc 3: 0.401 - 'Cats are also great pets'
#   Doc 4: 0.000 - 'I love both cats and dogs'
#
# Word: 'pets'
#   Doc 1: 0.000 - 'I love dogs'
#   Doc 2: 0.413 - 'Dogs are the best pets'
#   Doc 3: 0.401 - 'Cats are also great pets'
#   Doc 4: 0.000 - 'I love both cats and dogs'

🎯 TF-IDF Best Practices ​

python
# Advanced TF-IDF configuration
advanced_vectorizer = TfidfVectorizer(
    max_features=1000,      # Limit vocabulary size
    min_df=2,               # Ignore words appearing in fewer than 2 documents
    max_df=0.8,             # Ignore words appearing in more than 80% of documents
    stop_words='english',   # Remove common English stop words
    ngram_range=(1, 2),     # Include unigrams and bigrams
    lowercase=True,         # Convert to lowercase
    strip_accents='unicode' # Remove accents
)

# Example with larger dataset
larger_docs = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Natural language processing helps computers understand human language",
    "Deep learning neural networks can process large amounts of data",
    "Text vectorization converts words into numerical representations",
    "TF-IDF weighs the importance of words in documents"
]

X_advanced = advanced_vectorizer.fit_transform(larger_docs)
print(f"Advanced TF-IDF shape: {X_advanced.shape}")
print("Feature names (first 20):")
print(advanced_vectorizer.get_feature_names_out()[:20])

# OUTPUT:
# Advanced TF-IDF shape: (6, 2)
# Feature names (first 20):
# ['learning' 'words']
#
# Note: on this tiny corpus, min_df=2 keeps only the terms that appear in at
# least two documents ('learning' and 'words'). Frequency thresholds like
# min_df and max_df are meant for much larger corpora; relax them when
# experimenting with a handful of documents.

βœ… Pros and Cons of TF-IDF ​

text
βœ… PROS
β”œβ”€β”€ Reduces influence of common words
β”œβ”€β”€ Still fast and interpretable
β”œβ”€β”€ Emphasizes informative words
β”œβ”€β”€ Better than BoW for most tasks
β”œβ”€β”€ Works well with linear classifiers
└── Automatic stop word handling

❌ CONS
β”œβ”€β”€ Still ignores word order and context
β”œβ”€β”€ Sparse and high-dimensional vectors
β”œβ”€β”€ No semantic similarity (cat β‰  feline)
β”œβ”€β”€ Performance degrades with large vocabularies
└── Doesn't handle synonyms or related words

βœ… Pros:

  • Reduces the influence of common, unimportant words
  • Still fast and interpretable
  • Helps emphasize informative words
  • Better than BoW for most tasks
  • Works well with linear classifiers

❌ Cons:

  • Still ignores word order and context
  • Vectors are sparse and high-dimensional
  • Doesn't capture semantic similarity (e.g., "cat" β‰  "feline")
  • Performance degrades with very large vocabularies

βž• Use Cases:

  • Text classification and clustering
  • Document retrieval and search (see the sketch after this list)
  • Keyword extraction
  • Feature engineering for ML models
  • Information retrieval systems
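
As a small illustration of the retrieval use case above, the sketch below ranks a toy document collection against a query by cosine similarity between TF-IDF vectors (the documents and the query are invented for this example).

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection (made up for this illustration)
corpus = [
    "Dogs are loyal and friendly pets",
    "Cats are independent pets that like to sleep",
    "The stock market dropped sharply today",
    "Puppies and dogs need daily exercise",
]
query = "friendly dogs"

# Fit on the collection, then transform the query with the SAME vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).flatten()

# Rank documents from most to least similar
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")

# Expected ranking (exact scores depend on the vectorizer settings):
# the document mentioning both 'friendly' and 'dogs' comes first,
# the one mentioning only 'dogs' comes second, and the rest score 0.0.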

3. Comparison: BoW vs TF-IDF ​

text
βš–οΈ BOW vs TF-IDF COMPARISON

FEATURE           | BOW              | TF-IDF
------------------|------------------|------------------
Word Importance   | Equal weight     | Weighted by rarity
Common Words      | High influence   | Low influence
Rare Words        | Low influence    | High influence
Speed             | Fast             | Fast
Interpretability  | High             | High
Context Awareness | None             | None
Use Cases         | Simple tasks     | Better for most tasks

python
def compare_vectorization_methods(docs):
    """Compare BoW and TF-IDF on the same documents"""
    
    # BoW
    bow_vectorizer = CountVectorizer()
    bow_matrix = bow_vectorizer.fit_transform(docs)
    
    # TF-IDF
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
    
    # Compare for a specific word
    word = "love"
    
    if word in bow_vectorizer.get_feature_names_out():
        bow_idx = list(bow_vectorizer.get_feature_names_out()).index(word)
        tfidf_idx = list(tfidf_vectorizer.get_feature_names_out()).index(word)
        
        print(f"Comparison for word: '{word}'")
        print("=" * 40)
        print(f"{'Document':<30} {'BoW':<10} {'TF-IDF':<10}")
        print("-" * 50)
        
        for i, doc in enumerate(docs):
            bow_score = bow_matrix[i, bow_idx]
            tfidf_score = tfidf_matrix[i, tfidf_idx]
            
            print(f"{doc[:25]+'...' if len(doc) > 25 else doc:<30} "
                  f"{bow_score:<10} {tfidf_score:.3f}")

# Test the comparison
test_docs = [
    "I love programming",
    "I love love love this book",
    "Programming is fun",
    "This book teaches programming with love"
]

compare_vectorization_methods(test_docs)

# OUTPUT (TF-IDF values rounded to 3 decimals):
# Comparison for word: 'love'
# ========================================
# Document                       BoW        TF-IDF    
# --------------------------------------------------
# I love programming             1          0.707
# I love love love this book     3          0.864
# Programming is fun             0          0.000
# This book teaches program...   1          0.317

πŸš€ Quick Start Guide ​

text
πŸš€ GETTING STARTED WITH TEXT VECTORIZATION

1. CHOOSE VECTORIZATION METHOD
   β–‘ Start with simple methods (BoW/TF-IDF)
   β–‘ Consider your dataset size
   β–‘ Think about context requirements
   β–‘ Evaluate computational constraints

2. CONFIGURE PARAMETERS
   β–‘ Set vocabulary size limits
   β–‘ Choose n-gram ranges
   β–‘ Configure stop word handling
   β–‘ Set frequency thresholds

3. FIT AND TRANSFORM
   β–‘ Fit vectorizer on training data
   β–‘ Transform documents to vectors
   β–‘ Handle sparse matrices efficiently
   β–‘ Validate vector dimensions

4. ANALYZE AND OPTIMIZE
   β–‘ Check sparsity levels
   β–‘ Analyze feature importance
   β–‘ Optimize for performance
   β–‘ Monitor memory usage
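
One detail from step 3 is worth spelling out: the vectorizer is fitted once on the training documents, and new documents are only transformed with that fitted vectorizer, so train and test vectors share the same columns. Words never seen during fitting are silently dropped. A minimal sketch with made-up documents:

python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["I love dogs", "Dogs are great pets"]
new_docs = ["I love cats"]   # 'cats' was never seen during fitting

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learn vocabulary + IDF weights
X_new = vectorizer.transform(new_docs)           # reuse them on unseen text

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Train shape:", X_train.shape)
print("New doc vector:", X_new.toarray().round(3))

# The new document keeps only the known word 'love';
# 'cats' is ignored because it is not in the learned vocabulary.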

πŸ“¦ Required Libraries ​

bash
# Essential libraries for text vectorization
pip install scikit-learn pandas numpy matplotlib seaborn

# Optional but recommended
pip install scipy  # For sparse matrix operations
pip install joblib  # For model persistence

πŸ”„ Complete Vectorization Workflow Example ​

python
# Complete example showing preprocessing + vectorization pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd
import numpy as np
import re
from scipy.sparse import csr_matrix

# Sample documents
documents = [
    "Natural language processing is fascinating! It helps computers understand human text.",
    "Machine learning algorithms can process and analyze large amounts of textual data.",
    "Text preprocessing cleans and prepares data for analysis and machine learning.",
    "Vectorization converts preprocessed words into numerical representations for computers."
]

def integrated_preprocessing_vectorization_pipeline(documents, method='tfidf'):
    """
    Complete pipeline from raw text to vectors
    """
    
    print(f"πŸ”„ INTEGRATED PREPROCESSING + VECTORIZATION PIPELINE")
    print(f"Using method: {method.upper()}")
    print("=" * 60)
    
    # Step 1: Preprocessing
    print("\nπŸ“ STEP 1: TEXT PREPROCESSING")
    print("-" * 30)
    
    def preprocess_text(text):
        # Lowercase
        text = text.lower()
        # Remove punctuation but keep spaces
        text = re.sub(r'[^\w\s]', '', text)
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text
    
    cleaned_docs = [preprocess_text(doc) for doc in documents]
    
    # Show preprocessing results
    preprocessing_df = pd.DataFrame({
        'Original': [doc[:50] + '...' if len(doc) > 50 else doc for doc in documents],
        'Cleaned': [doc[:50] + '...' if len(doc) > 50 else doc for doc in cleaned_docs]
    })
    print(preprocessing_df.to_string(index=False))
    
    # Step 2: Vectorization
    print(f"\nπŸ”’ STEP 2: TEXT VECTORIZATION ({method.upper()})")
    print("-" * 30)
    
    # Choose vectorization method
    if method == 'bow':
        vectorizer = CountVectorizer(
            max_features=50,
            stop_words='english',
            ngram_range=(1, 2)
        )
    elif method == 'tfidf':
        vectorizer = TfidfVectorizer(
            max_features=50,
            stop_words='english',
            ngram_range=(1, 2)
        )
    else:
        raise ValueError("Method must be 'bow' or 'tfidf'")
    
    # Fit and transform
    vectors = vectorizer.fit_transform(cleaned_docs)
    feature_names = vectorizer.get_feature_names_out()
    
    # Step 3: Analysis
    print(f"\nπŸ“Š STEP 3: VECTOR ANALYSIS")
    print("-" * 30)
    
    print(f"Vocabulary size: {len(feature_names)}")
    print(f"Vector shape: {vectors.shape}")
    print(f"Sparsity: {1.0 - vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}")
    print(f"Memory usage: {vectors.data.nbytes + vectors.indices.nbytes + vectors.indptr.nbytes:,} bytes")
    
    # Show feature names
    print(f"\nTop 10 features: {feature_names[:10].tolist()}")
    
    # Create readable DataFrame
    vectors_df = pd.DataFrame(
        vectors.toarray(), 
        columns=feature_names,
        index=[f"Doc {i+1}" for i in range(len(documents))]
    )
    
    # Show non-zero values for first document
    doc1_vector = vectors_df.iloc[0]
    non_zero_features = doc1_vector[doc1_vector > 0].sort_values(ascending=False)
    
    print(f"\nTop features for Document 1:")
    for feature, score in non_zero_features.head(10).items():
        print(f"  {feature}: {score:.3f}")
    
    # Step 4: Comparison
    print(f"\nβš–οΈ STEP 4: METHOD COMPARISON")
    print("-" * 30)
    
    # Compare with alternative method
    alt_method = 'bow' if method == 'tfidf' else 'tfidf'
    if alt_method == 'bow':
        alt_vectorizer = CountVectorizer(max_features=50, stop_words='english')
    else:
        alt_vectorizer = TfidfVectorizer(max_features=50, stop_words='english')
    
    alt_vectors = alt_vectorizer.fit_transform(cleaned_docs)
    
    comparison_df = pd.DataFrame({
        f'{method.upper()} Shape': [str(vectors.shape)],
        f'{method.upper()} Sparsity': [f"{1.0 - vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}"],
        f'{alt_method.upper()} Shape': [str(alt_vectors.shape)],
        f'{alt_method.upper()} Sparsity': [f"{1.0 - alt_vectors.nnz / (alt_vectors.shape[0] * alt_vectors.shape[1]):.2%}"]
    })
    
    print(comparison_df.to_string(index=False))
    
    return vectors, vectorizer, vectors_df

# Test both methods
print("🎯 TESTING BAG OF WORDS")
bow_vectors, bow_vectorizer, bow_df = integrated_preprocessing_vectorization_pipeline(documents, 'bow')

print("\n" + "="*80 + "\n")

print("🎯 TESTING TF-IDF")
tfidf_vectors, tfidf_vectorizer, tfidf_df = integrated_preprocessing_vectorization_pipeline(documents, 'tfidf')

# Final comparison
print(f"\nπŸ† FINAL COMPARISON")
print("=" * 40)
print(f"BoW vectors shape: {bow_vectors.shape}")
print(f"TF-IDF vectors shape: {tfidf_vectors.shape}")
print(f"BoW sparsity: {1.0 - bow_vectors.nnz / (bow_vectors.shape[0] * bow_vectors.shape[1]):.2%}")
print(f"TF-IDF sparsity: {1.0 - tfidf_vectors.nnz / (tfidf_vectors.shape[0] * tfidf_vectors.shape[1]):.2%}")

# OUTPUT (abridged; exact feature order and memory figures may vary by version/platform):
# πŸ”„ INTEGRATED PREPROCESSING + VECTORIZATION PIPELINE
# Using method: TFIDF
# ============================================================
# 
# πŸ“ STEP 1: TEXT PREPROCESSING
# ------------------------------
#                                            Original                                             Cleaned
# Natural language processing is fascinating! It hel...  natural language processing is fascinating it help...
# Machine learning algorithms can process and analyz...  machine learning algorithms can process and analyz...
# Text preprocessing cleans and prepares data for an...  text preprocessing cleans and prepares data for an...
# Vectorization converts preprocessed words into num...  vectorization converts preprocessed words into num...
# 
# πŸ”’ STEP 2: TEXT VECTORIZATION (TFIDF)
# ------------------------------
# 
# πŸ“Š STEP 3: VECTOR ANALYSIS
# ------------------------------
# Vocabulary size: 50
# Vector shape: (4, 50)
# Sparsity: 72.00%
# Memory usage: ~700 bytes
# 
# Top 10 features (alphabetical), e.g.: ['algorithms', 'algorithms process', 'amounts', 'amounts textual', 'analysis', ...]
# 
# Top features for Document 1:
#   (the unigrams and bigrams unique to Document 1, such as 'natural' and
#    'language processing', share the highest weight; 'computers' and 'text'
#    score lower because they also appear in other documents)
# 
# βš–οΈ STEP 4: METHOD COMPARISON
# ------------------------------
#  TFIDF Shape TFIDF Sparsity BOW Shape BOW Sparsity
#      (4, 50)         72.00%   (4, 28)       70.54%
# 
# πŸ† FINAL COMPARISON
# ========================================
# BoW vectors shape: (4, 50)
# TF-IDF vectors shape: (4, 50)
# BoW sparsity: 72.00%
# TF-IDF sparsity: 72.00%

🎯 Key Takeaways ​

text
🎯 TEXT VECTORIZATION SUMMARY

πŸ”‘ KEY CONCEPTS
β”œβ”€β”€ Text must be converted to numbers for ML
β”œβ”€β”€ BoW counts word frequencies (simple but effective)
β”œβ”€β”€ TF-IDF weighs words by importance (better for most tasks)
β”œβ”€β”€ Sparsity is a common challenge with classical methods
└── Choose method based on task requirements

πŸ“Š PRACTICAL APPLICATIONS
β”œβ”€β”€ Document classification and clustering
β”œβ”€β”€ Information retrieval and search
β”œβ”€β”€ Sentiment analysis and opinion mining
β”œβ”€β”€ Recommendation systems
└── Feature engineering for ML models

πŸš€ NEXT STEPS
β”œβ”€β”€ Experiment with different vectorization parameters
β”œβ”€β”€ Try n-gram features for better context
β”œβ”€β”€ Explore neural embedding methods (Word2Vec, BERT)
β”œβ”€β”€ Consider computational efficiency for large datasets
└── Validate vectorization quality on your specific task

πŸ”„ When to Use Each Method ​

text
πŸ“‹ VECTORIZATION METHOD SELECTION

🎯 USE BAG OF WORDS WHEN:
β”œβ”€β”€ Building quick prototypes
β”œβ”€β”€ Working with small datasets
β”œβ”€β”€ Need interpretable features
β”œβ”€β”€ Computational resources are limited
└── Task doesn't require semantic understanding

🎯 USE TF-IDF WHEN:
β”œβ”€β”€ Need better feature weighting
β”œβ”€β”€ Working with varied document lengths
β”œβ”€β”€ Want automatic stop word handling
β”œβ”€β”€ Building information retrieval systems
└── Need balance between simplicity and performance

🎯 CONSIDER ADVANCED METHODS WHEN:
β”œβ”€β”€ Semantic similarity is important
β”œβ”€β”€ Context and word order matter
β”œβ”€β”€ Working with large-scale datasets
β”œβ”€β”€ Need dense vector representations
└── Performance is critical
