
Text Vectorization ​

Converting preprocessed text into numerical representations for machine learning

πŸ”’ What is Text Vectorization? ​

Vectorizing text is the process of converting text data (like words, sentences, or documents) into numerical representations (vectors) so that machines (like AI models) can process and understand the text.

Since machine learning algorithms work with numbers, we need to turn text into a numerical format. This is done using various text vectorization techniques, depending on how detailed or sophisticated the representation needs to be.

text
πŸ“Š TEXT VECTORIZATION OVERVIEW

Text β†’ Numbers β†’ Machine Learning

🎯 THE CHALLENGE
Computers don't understand words like "cat" or "happy"
They only understand numbers!

πŸ”§ THE SOLUTION
Convert text into mathematical vectors that preserve:
β”œβ”€β”€ Semantic meaning
β”œβ”€β”€ Relationships between words
β”œβ”€β”€ Context information
└── Statistical patterns

πŸ’‘ THE IMPACT
This enables:
β”œβ”€β”€ Text classification
β”œβ”€β”€ Information retrieval
β”œβ”€β”€ Recommendation systems
β”œβ”€β”€ Machine translation
└── Question answering

🎯 Why Text Vectorization Matters ​

The Challenge: Computers don't understand words like "cat" or "happy" - they only understand numbers.

The Solution: Convert text into mathematical vectors that preserve semantic meaning and relationships.

The Impact: This enables:

  • Text classification (spam detection, sentiment analysis)
  • Information retrieval (search engines)
  • Recommendation systems
  • Machine translation
  • Question answering systems

πŸ“Š Common Text Vectorization Methods ​

text
πŸ”’ VECTORIZATION METHODS HIERARCHY

CLASSICAL METHODS (Fast, Simple)
β”œβ”€β”€ Bag of Words (BoW)
β”œβ”€β”€ TF-IDF
└── N-gram Features

MODERN METHODS (Slow, Sophisticated)
β”œβ”€β”€ Word2Vec
β”œβ”€β”€ FastText
β”œβ”€β”€ GloVe
└── Transformer Embeddings (BERT, GPT)

πŸ“ˆ EVOLUTION
Simple Counting β†’ Statistical Weighting β†’ Neural Embeddings β†’ Contextual Embeddings

1. Bag of Words (BoW) ​

BoW converts a text document into a vector of word counts β€” it just counts how many times each word from the vocabulary appears in the document.

text
πŸ“ BAG OF WORDS CONCEPT

Input Documents:
1. "I love dogs"
2. "I love cats"
3. "Dogs are amazing pets"

VOCABULARY: ["I", "love", "dogs", "cats", "are", "amazing", "pets"]

VECTORS:
Doc 1: [1, 1, 1, 0, 0, 0, 0] β†’ "I love dogs"
Doc 2: [1, 1, 0, 1, 0, 0, 0] β†’ "I love cats"
Doc 3: [0, 0, 1, 0, 1, 1, 1] β†’ "Dogs are amazing pets"

🎯 CHARACTERISTICS
β”œβ”€β”€ Order ignored
β”œβ”€β”€ Grammar ignored
β”œβ”€β”€ Only frequency matters
└── Simple and fast

Key Characteristics:

  • Order of words is ignored
  • Grammar is ignored
  • Only word frequency matters

πŸ“˜ Simple Example ​

Suppose we have two sentences:

  1. "I love dogs"
  2. "I love cats"

Vocabulary (unique words from all texts): ["I", "love", "dogs", "cats"]

Now we turn each sentence into a vector:

  • "I love dogs" β†’ [1, 1, 1, 0]
  • "I love cats" β†’ [1, 1, 0, 1]

Each number in the vector corresponds to the count of a word in the sentence.
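
The same vectors can be built by hand in a few lines of plain Python. This is only a minimal sketch to make the counting step concrete (the build_bow helper is invented here for illustration); real projects use a library vectorizer, as shown next.

python
# Hand-rolled Bag of Words (illustrative sketch only)
sentences = ["I love dogs", "I love cats"]

# Build the vocabulary: unique words in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def build_bow(sentence, vocabulary):
    """Count how often each vocabulary word appears in the sentence."""
    words = sentence.split()
    return [words.count(word) for word in vocabulary]

print("Vocabulary:", vocabulary)
for sentence in sentences:
    print(sentence, "->", build_bow(sentence, vocabulary))

# OUTPUT:
# Vocabulary: ['I', 'love', 'dogs', 'cats']
# I love dogs -> [1, 1, 1, 0]
# I love cats -> [1, 1, 0, 1]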

πŸ”§ BoW Implementation in Python ​

python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Sample texts
texts = ["I love dogs", "I love cats", "Dogs are amazing pets"]

# Create BoW vectorizer
vectorizer = CountVectorizer()

# Fit and transform the texts
X = vectorizer.fit_transform(texts)

# Show the feature names (vocabulary)
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vectors:")
print(X.toarray())

# Create a readable DataFrame
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
bow_df.index = texts
print("\nBoW as DataFrame:")
print(bow_df)

# OUTPUT:
# Vocabulary: ['amazing' 'are' 'cats' 'dogs' 'love' 'pets']
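# Note: 'i' is missing because CountVectorizer lowercases the text and its
# default token_pattern ignores single-character tokens.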
# BoW Vectors:
# [[0 0 0 1 1 0]   β†’ "I love dogs"
#  [0 0 1 0 1 0]   β†’ "I love cats" 
#  [1 1 0 1 0 1]]  β†’ "Dogs are amazing pets"
#
# BoW as DataFrame:
#                    amazing  are  cats  dogs  love  pets
# I love dogs              0    0     0     1     1     0
# I love cats              0    0     1     0     1     0
# Dogs are amazing pets    1    1     0     1     0     1

πŸ” Real-World BoW Example ​

python
# More realistic dataset
data = [
    'Most shark attacks occur about 10 feet from the beach since that is where the people are',
    'The efficiency with which he paired the socks in the drawer was quite admirable',
    'Carol drank the blood as if she were a vampire',
    'Giving directions that the mountains are to the west only works when you can see them',
    'The sign said there was road work ahead so he decided to speed up',
    'The gruff old man sat in the back of the bait shop grumbling to himself as he scooped out a handful of worms'
]

# Create BoW representation
countvec = CountVectorizer()
countvec_fit = countvec.fit_transform(data)

# Convert to DataFrame for better visualization
bag_of_words = pd.DataFrame(
    countvec_fit.toarray(), 
    columns=countvec.get_feature_names_out()
)

print(f"Vocabulary size: {len(countvec.get_feature_names_out())} words")
print(f"Document vectors shape: {bag_of_words.shape}")
print("\nFirst few columns of BoW matrix:")
print(bag_of_words.iloc[:, :10])  # Show first 10 words

# Find most common words across all documents
word_frequencies = bag_of_words.sum().sort_values(ascending=False)
print("\nTop 10 most frequent words:")
print(word_frequencies.head(10))

# OUTPUT:
# Vocabulary size: 71 words
# Document vectors shape: (6, 71)
# 
# First few columns of BoW matrix:
#    10  about  admirable  ahead  are  as  attacks  back  bait  beach
# 0   1      1          0      0    1   0        1     0     0      1
# 1   0      0          1      0    0   0        0     0     0      0
# 2   0      0          0      0    0   1        0     0     0      0
# 3   0      0          0      0    1   0        0     0     0      0
# 4   0      0          0      1    0   0        0     0     0      0
# 5   0      0          0      0    0   1        0     1     1      0
# 
# Top 10 most frequent words (ties are ordered arbitrarily):
# the     12
# to       3
# he       3
# was      2
# that     2
# of       2
# in       2
# as       2
# are      2
# (all remaining words appear exactly once)

βœ… Pros and Cons of BoW ​

text
βœ… PROS
β”œβ”€β”€ Simple and fast to implement
β”œβ”€β”€ Works well for basic text classification
β”œβ”€β”€ Easy to understand and interpret
β”œβ”€β”€ Good baseline for many NLP problems
└── Computationally efficient

❌ CONS
β”œβ”€β”€ Ignores word order and context
β”œβ”€β”€ Large vocabularies create sparse vectors
β”œβ”€β”€ Doesn't capture semantic similarity
β”œβ”€β”€ Treats "not good" same as "good not"
└── High-dimensional feature space

βœ… Pros:

  • Simple and fast to implement
  • Works well for basic text classification tasks
  • Easy to understand and interpret
  • Good baseline for many NLP problems

❌ Cons:

  • Ignores word order and context
  • Large vocabularies create sparse, high-dimensional vectors
  • Doesn't capture meaning or similarity between words
  • Treats "not good" the same as "good not" (see the sketch below)

βž• When to Use BoW:

  • Quick prototyping and baseline models
  • Tasks where context is less important (spam detection)
  • Teaching and learning NLP fundamentals
  • Small datasets with limited vocabulary
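
To see the word-order limitation from the cons list above in action, here is a tiny check: Bag of Words produces exactly the same vector for "not good" and "good not".

python
from sklearn.feature_extraction.text import CountVectorizer

# Two phrases with opposite word order
phrases = ["not good", "good not"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(phrases)

print("Vocabulary:", vectorizer.get_feature_names_out())
print(vectors.toarray())

# OUTPUT:
# Vocabulary: ['good' 'not']
# [[1 1]
#  [1 1]]   <- identical vectors; the word order is lost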

2. TF-IDF (Term Frequency–Inverse Document Frequency) ​

TF-IDF is an improved version of Bag of Words that not only counts word occurrences, but also weighs how important a word is in a document relative to a collection of documents.

text
πŸ“Š TF-IDF CONCEPT

TF-IDF = Term Frequency Γ— Inverse Document Frequency

🎯 INTUITION
"How important is a word in this document, 
compared to all other documents?"

πŸ“ˆ WEIGHTING PRINCIPLE
β”œβ”€β”€ Common words (the, and) β†’ Lower weight
β”œβ”€β”€ Rare, informative words β†’ Higher weight
β”œβ”€β”€ Frequent in doc, rare overall β†’ Highest weight
└── Automatic stop word handling

πŸ”’ FORMULA
TF(w,d) = (count of w in d) / (total words in d)
IDF(w) = log(N / (1 + docs containing w))
TF-IDF(w,d) = TF(w,d) Γ— IDF(w)

βš™οΈ What Does TF-IDF Do? ​

It answers: "How important is a word in this document, compared to all other documents?"

🧩 TF-IDF Formula ​

For a word w in a document d:

TF-IDF(w,d) = TF(w,d) Γ— IDF(w)

Where:

  • TF (Term Frequency) = frequency of the word in the document

    TF(w,d) = (# times w appears in d) / (total # words in d)
  • IDF (Inverse Document Frequency) = how rare the word is across all documents

    IDF(w) = log(N / (1 + # documents containing w))

    where N is the total number of documents.
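
To make the formulas concrete, here is a small hand calculation that follows the definitions above (the tf and idf helpers and the toy corpus are written only for this illustration). Note that scikit-learn's TfidfVectorizer, used in the examples below, applies a smoothed variant of IDF and L2-normalizes each document vector, so its scores will not match this hand calculation exactly.

python
import math

# Toy corpus (made up for this illustration)
documents = [
    "the dogs play in the park",
    "the cat sleeps on the mat",
    "the weather is nice today",
    "dogs are loyal pets",
]

def tf(word, doc):
    """TF(w, d): share of the document's words that are `word`."""
    words = doc.split()
    return words.count(word) / len(words)

def idf(word, docs):
    """IDF(w): log(N / (1 + number of documents containing w)), natural log."""
    containing = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / (1 + containing))

for word in ["the", "dogs"]:
    score = tf(word, documents[0]) * idf(word, documents)
    print(f"TF-IDF({word!r} in doc 1) = {score:.3f}")

# OUTPUT:
# TF-IDF('the' in doc 1) = 0.000   <- appears in nearly every document
# TF-IDF('dogs' in doc 1) = 0.048  <- rarer across the corpus, so it keeps weight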

πŸ” Why TF-IDF is Better Than BoW ​

  • Common words (like "the", "and") get lower weight
  • Rare, informative words get higher weight
  • Preserves the idea of "importance," not just presence
  • Automatically handles stop words by giving them low scores

πŸ“˜ TF-IDF Example ​

Two documents:

  1. "I love dogs"
  2. "Dogs are the best pets"

In a realistic collection of documents, TF-IDF gives "dogs" a higher weight than "the": "the" appears in almost every document, so its IDF pushes its score toward zero, while "dogs" appears in comparatively few documents and is treated as informative. (With only two documents the effect is barely visible; the intuition becomes clear as the corpus grows.)

πŸ”§ TF-IDF Implementation ​

python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

# Sample documents
docs = [
    "I love dogs",
    "Dogs are the best pets",
    "Cats are also great pets",
    "I love both cats and dogs"
]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the documents into TF-IDF vectors
X = vectorizer.fit_transform(docs)

# Print the vocabulary (ordered by column index in the TF-IDF matrix)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the TF-IDF matrix as a dense array
print("TF-IDF Vectors:")
tfidf_array = X.toarray()
print(tfidf_array)

# Create readable DataFrame
tfidf_df = pd.DataFrame(tfidf_array, columns=vectorizer.get_feature_names_out())
tfidf_df.index = [f"Doc {i+1}" for i in range(len(docs))]
print("\nTF-IDF as DataFrame:")
print(tfidf_df.round(3))

# OUTPUT (values rounded to 3 decimals; scikit-learn applies a smoothed IDF
# and L2-normalizes each row, so scores differ slightly from the hand formula):
# Vocabulary: ['also' 'and' 'are' 'best' 'both' 'cats' 'dogs' 'great' 'love' 'pets' 'the']
# TF-IDF Vectors:
# [[0.000 0.000 0.000 0.000 0.000 0.000 0.629 0.000 0.777 0.000 0.000]
#  [0.000 0.000 0.413 0.523 0.000 0.000 0.334 0.000 0.000 0.413 0.523]
#  [0.509 0.000 0.401 0.000 0.000 0.401 0.000 0.509 0.000 0.401 0.000]
#  [0.000 0.523 0.000 0.000 0.523 0.413 0.334 0.000 0.413 0.000 0.000]]
#
# TF-IDF as DataFrame:
#         also    and    are   best   both   cats   dogs  great   love   pets    the
# Doc 1  0.000  0.000  0.000  0.000  0.000  0.000  0.629  0.000  0.777  0.000  0.000
# Doc 2  0.000  0.000  0.413  0.523  0.000  0.000  0.334  0.000  0.000  0.413  0.523
# Doc 3  0.509  0.000  0.401  0.000  0.000  0.401  0.000  0.509  0.000  0.401  0.000
# Doc 4  0.000  0.523  0.000  0.000  0.523  0.413  0.334  0.000  0.413  0.000  0.000

πŸ” Understanding TF-IDF Output ​

python
# Analyze specific words
def analyze_tfidf_scores(docs, words_to_analyze):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    feature_names = vectorizer.get_feature_names_out()
    
    for word in words_to_analyze:
        if word in feature_names:
            word_idx = list(feature_names).index(word)
            scores = X[:, word_idx].toarray().flatten()
            
            print(f"\nWord: '{word}'")
            for i, (doc, score) in enumerate(zip(docs, scores)):
                print(f"  Doc {i+1}: {score:.3f} - '{doc}'")
        else:
            print(f"\nWord '{word}' not found in vocabulary")

# Analyze how different words are weighted
analyze_tfidf_scores(docs, ['dogs', 'love', 'are', 'pets'])

# OUTPUT (values rounded to 3 decimals):
# Word: 'dogs'
#   Doc 1: 0.629 - 'I love dogs'
#   Doc 2: 0.334 - 'Dogs are the best pets'
#   Doc 3: 0.000 - 'Cats are also great pets'
#   Doc 4: 0.334 - 'I love both cats and dogs'
#
# Word: 'love'
#   Doc 1: 0.777 - 'I love dogs'
#   Doc 2: 0.000 - 'Dogs are the best pets'
#   Doc 3: 0.000 - 'Cats are also great pets'
#   Doc 4: 0.413 - 'I love both cats and dogs'
#
# Word: 'are'
#   Doc 1: 0.000 - 'I love dogs'
#   Doc 2: 0.413 - 'Dogs are the best pets'
#   Doc 3: 0.401 - 'Cats are also great pets'
#   Doc 4: 0.000 - 'I love both cats and dogs'
#
# Word: 'pets'
#   Doc 1: 0.000 - 'I love dogs'
#   Doc 2: 0.413 - 'Dogs are the best pets'
#   Doc 3: 0.401 - 'Cats are also great pets'
#   Doc 4: 0.000 - 'I love both cats and dogs'

🎯 TF-IDF Best Practices ​

python
# Advanced TF-IDF configuration
advanced_vectorizer = TfidfVectorizer(
    max_features=1000,      # Limit vocabulary size
    min_df=2,               # Ignore words appearing in fewer than 2 documents
    max_df=0.8,             # Ignore words appearing in more than 80% of documents
    stop_words='english',   # Remove common English stop words
    ngram_range=(1, 2),     # Include unigrams and bigrams
    lowercase=True,         # Convert to lowercase
    strip_accents='unicode' # Remove accents
)

# Example with larger dataset
larger_docs = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Natural language processing helps computers understand human language",
    "Deep learning neural networks can process large amounts of data",
    "Text vectorization converts words into numerical representations",
    "TF-IDF weighs the importance of words in documents"
]

X_advanced = advanced_vectorizer.fit_transform(larger_docs)
print(f"Advanced TF-IDF shape: {X_advanced.shape}")
print("Feature names (first 20):")
print(advanced_vectorizer.get_feature_names_out()[:20])

# OUTPUT:
# Advanced TF-IDF shape: (6, 2)
# Feature names (first 20):
# ['learning' 'words']
#
# Note: on this tiny corpus, min_df=2 keeps only the terms that appear in at
# least two documents ('learning' and 'words'). Frequency thresholds like
# min_df and max_df are meant for much larger corpora; relax them when
# experimenting with a handful of documents.

βœ… Pros and Cons of TF-IDF ​

text
βœ… PROS
β”œβ”€β”€ Reduces influence of common words
β”œβ”€β”€ Still fast and interpretable
β”œβ”€β”€ Emphasizes informative words
β”œβ”€β”€ Better than BoW for most tasks
β”œβ”€β”€ Works well with linear classifiers
└── Automatic stop word handling

❌ CONS
β”œβ”€β”€ Still ignores word order and context
β”œβ”€β”€ Sparse and high-dimensional vectors
β”œβ”€β”€ No semantic similarity (cat β‰  feline)
β”œβ”€β”€ Performance degrades with large vocabularies
└── Doesn't handle synonyms or related words

βœ… Pros:

  • Reduces the influence of common, unimportant words
  • Still fast and interpretable
  • Helps emphasize informative words
  • Better than BoW for most tasks
  • Works well with linear classifiers

❌ Cons:

  • Still ignores word order and context
  • Vectors are sparse and high-dimensional
  • Doesn't capture semantic similarity (e.g., "cat" β‰  "feline")
  • Performance degrades with very large vocabularies

βž• Use Cases:

  • Text classification and clustering
  • Document retrieval and search (see the sketch after this list)
  • Keyword extraction
  • Feature engineering for ML models
  • Information retrieval systems
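
As a small illustration of the retrieval use case above, the sketch below ranks a toy document collection against a query by cosine similarity between TF-IDF vectors (the documents and the query are invented for this example).

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document collection (made up for this illustration)
corpus = [
    "Dogs are loyal and friendly pets",
    "Cats are independent pets that like to sleep",
    "The stock market dropped sharply today",
    "Puppies and dogs need daily exercise",
]
query = "friendly dogs"

# Fit on the collection, then transform the query with the SAME vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).flatten()

# Rank documents from most to least similar
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")

# Expected ranking (exact scores depend on the vectorizer settings):
# the document mentioning both 'friendly' and 'dogs' comes first,
# the one mentioning only 'dogs' comes second, and the rest score 0.0.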

3. Comparison: BoW vs TF-IDF ​

text
βš–οΈ BOW vs TF-IDF COMPARISON

FEATURE           | BOW              | TF-IDF
------------------|------------------|------------------
Word Importance   | Equal weight     | Weighted by rarity
Common Words      | High influence   | Low influence
Rare Words        | Low influence    | High influence
Speed             | Fast             | Fast
Interpretability  | High             | High
Context Awareness | None             | None
Use Cases         | Simple tasks     | Better for most tasks

python
def compare_vectorization_methods(docs):
    """Compare BoW and TF-IDF on the same documents"""
    
    # BoW
    bow_vectorizer = CountVectorizer()
    bow_matrix = bow_vectorizer.fit_transform(docs)
    
    # TF-IDF
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
    
    # Compare for a specific word
    word = "love"
    
    if word in bow_vectorizer.get_feature_names_out():
        bow_idx = list(bow_vectorizer.get_feature_names_out()).index(word)
        tfidf_idx = list(tfidf_vectorizer.get_feature_names_out()).index(word)
        
        print(f"Comparison for word: '{word}'")
        print("=" * 40)
        print(f"{'Document':<30} {'BoW':<10} {'TF-IDF':<10}")
        print("-" * 50)
        
        for i, doc in enumerate(docs):
            bow_score = bow_matrix[i, bow_idx]
            tfidf_score = tfidf_matrix[i, tfidf_idx]
            
            print(f"{doc[:25]+'...' if len(doc) > 25 else doc:<30} "
                  f"{bow_score:<10} {tfidf_score:.3f}")

# Test the comparison
test_docs = [
    "I love programming",
    "I love love love this book",
    "Programming is fun",
    "This book teaches programming with love"
]

compare_vectorization_methods(test_docs)

# OUTPUT (TF-IDF values rounded to 3 decimals):
# Comparison for word: 'love'
# ========================================
# Document                       BoW        TF-IDF    
# --------------------------------------------------
# I love programming             1          0.707
# I love love love this book     3          0.864
# Programming is fun             0          0.000
# This book teaches program...   1          0.317

πŸš€ Quick Start Guide ​

text
πŸš€ GETTING STARTED WITH TEXT VECTORIZATION

1. CHOOSE VECTORIZATION METHOD
   β–‘ Start with simple methods (BoW/TF-IDF)
   β–‘ Consider your dataset size
   β–‘ Think about context requirements
   β–‘ Evaluate computational constraints

2. CONFIGURE PARAMETERS
   β–‘ Set vocabulary size limits
   β–‘ Choose n-gram ranges
   β–‘ Configure stop word handling
   β–‘ Set frequency thresholds

3. FIT AND TRANSFORM
   β–‘ Fit vectorizer on training data
   β–‘ Transform documents to vectors
   β–‘ Handle sparse matrices efficiently
   β–‘ Validate vector dimensions

4. ANALYZE AND OPTIMIZE
   β–‘ Check sparsity levels
   β–‘ Analyze feature importance
   β–‘ Optimize for performance
   β–‘ Monitor memory usage
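
One detail from step 3 is worth spelling out: the vectorizer is fitted once on the training documents, and new documents are only transformed with that fitted vectorizer, so train and test vectors share the same columns. Words never seen during fitting are silently dropped. A minimal sketch with made-up documents:

python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["I love dogs", "Dogs are great pets"]
new_docs = ["I love cats"]   # 'cats' was never seen during fitting

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learn vocabulary + IDF weights
X_new = vectorizer.transform(new_docs)           # reuse them on unseen text

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Train shape:", X_train.shape)
print("New doc vector:", X_new.toarray().round(3))

# The new document keeps only the known word 'love';
# 'cats' is ignored because it is not in the learned vocabulary.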

πŸ“¦ Required Libraries ​

bash
# Essential libraries for text vectorization
pip install scikit-learn pandas numpy matplotlib seaborn

# Optional but recommended
pip install scipy  # For sparse matrix operations
pip install joblib  # For model persistence

πŸ”„ Complete Vectorization Workflow Example ​

python
# Complete example showing preprocessing + vectorization pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd
import numpy as np
import re
from scipy.sparse import csr_matrix

# Sample documents
documents = [
    "Natural language processing is fascinating! It helps computers understand human text.",
    "Machine learning algorithms can process and analyze large amounts of textual data.",
    "Text preprocessing cleans and prepares data for analysis and machine learning.",
    "Vectorization converts preprocessed words into numerical representations for computers."
]

def integrated_preprocessing_vectorization_pipeline(documents, method='tfidf'):
    """
    Complete pipeline from raw text to vectors
    """
    
    print(f"πŸ”„ INTEGRATED PREPROCESSING + VECTORIZATION PIPELINE")
    print(f"Using method: {method.upper()}")
    print("=" * 60)
    
    # Step 1: Preprocessing
    print("\nπŸ“ STEP 1: TEXT PREPROCESSING")
    print("-" * 30)
    
    def preprocess_text(text):
        # Lowercase
        text = text.lower()
        # Remove punctuation but keep spaces
        text = re.sub(r'[^\w\s]', '', text)
        # Remove extra whitespace
        text = ' '.join(text.split())
        return text
    
    cleaned_docs = [preprocess_text(doc) for doc in documents]
    
    # Show preprocessing results
    preprocessing_df = pd.DataFrame({
        'Original': [doc[:50] + '...' if len(doc) > 50 else doc for doc in documents],
        'Cleaned': [doc[:50] + '...' if len(doc) > 50 else doc for doc in cleaned_docs]
    })
    print(preprocessing_df.to_string(index=False))
    
    # Step 2: Vectorization
    print(f"\nπŸ”’ STEP 2: TEXT VECTORIZATION ({method.upper()})")
    print("-" * 30)
    
    # Choose vectorization method
    if method == 'bow':
        vectorizer = CountVectorizer(
            max_features=50,
            stop_words='english',
            ngram_range=(1, 2)
        )
    elif method == 'tfidf':
        vectorizer = TfidfVectorizer(
            max_features=50,
            stop_words='english',
            ngram_range=(1, 2)
        )
    else:
        raise ValueError("Method must be 'bow' or 'tfidf'")
    
    # Fit and transform
    vectors = vectorizer.fit_transform(cleaned_docs)
    feature_names = vectorizer.get_feature_names_out()
    
    # Step 3: Analysis
    print(f"\nπŸ“Š STEP 3: VECTOR ANALYSIS")
    print("-" * 30)
    
    print(f"Vocabulary size: {len(feature_names)}")
    print(f"Vector shape: {vectors.shape}")
    print(f"Sparsity: {1.0 - vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}")
    print(f"Memory usage: {vectors.data.nbytes + vectors.indices.nbytes + vectors.indptr.nbytes:,} bytes")
    
    # Show feature names
    print(f"\nTop 10 features: {feature_names[:10].tolist()}")
    
    # Create readable DataFrame
    vectors_df = pd.DataFrame(
        vectors.toarray(), 
        columns=feature_names,
        index=[f"Doc {i+1}" for i in range(len(documents))]
    )
    
    # Show non-zero values for first document
    doc1_vector = vectors_df.iloc[0]
    non_zero_features = doc1_vector[doc1_vector > 0].sort_values(ascending=False)
    
    print(f"\nTop features for Document 1:")
    for feature, score in non_zero_features.head(10).items():
        print(f"  {feature}: {score:.3f}")
    
    # Step 4: Comparison
    print(f"\nβš–οΈ STEP 4: METHOD COMPARISON")
    print("-" * 30)
    
    # Compare with alternative method
    alt_method = 'bow' if method == 'tfidf' else 'tfidf'
    if alt_method == 'bow':
        alt_vectorizer = CountVectorizer(max_features=50, stop_words='english')
    else:
        alt_vectorizer = TfidfVectorizer(max_features=50, stop_words='english')
    
    alt_vectors = alt_vectorizer.fit_transform(cleaned_docs)
    
    comparison_df = pd.DataFrame({
        f'{method.upper()} Shape': [str(vectors.shape)],
        f'{method.upper()} Sparsity': [f"{1.0 - vectors.nnz / (vectors.shape[0] * vectors.shape[1]):.2%}"],
        f'{alt_method.upper()} Shape': [str(alt_vectors.shape)],
        f'{alt_method.upper()} Sparsity': [f"{1.0 - alt_vectors.nnz / (alt_vectors.shape[0] * alt_vectors.shape[1]):.2%}"]
    })
    
    print(comparison_df.to_string(index=False))
    
    return vectors, vectorizer, vectors_df

# Test both methods
print("🎯 TESTING BAG OF WORDS")
bow_vectors, bow_vectorizer, bow_df = integrated_preprocessing_vectorization_pipeline(documents, 'bow')

print("\n" + "="*80 + "\n")

print("🎯 TESTING TF-IDF")
tfidf_vectors, tfidf_vectorizer, tfidf_df = integrated_preprocessing_vectorization_pipeline(documents, 'tfidf')

# Final comparison
print(f"\nπŸ† FINAL COMPARISON")
print("=" * 40)
print(f"BoW vectors shape: {bow_vectors.shape}")
print(f"TF-IDF vectors shape: {tfidf_vectors.shape}")
print(f"BoW sparsity: {1.0 - bow_vectors.nnz / (bow_vectors.shape[0] * bow_vectors.shape[1]):.2%}")
print(f"TF-IDF sparsity: {1.0 - tfidf_vectors.nnz / (tfidf_vectors.shape[0] * tfidf_vectors.shape[1]):.2%}")

# OUTPUT (abridged; exact feature order and memory figures may vary by version/platform):
# πŸ”„ INTEGRATED PREPROCESSING + VECTORIZATION PIPELINE
# Using method: TFIDF
# ============================================================
# 
# πŸ“ STEP 1: TEXT PREPROCESSING
# ------------------------------
#                                            Original                                             Cleaned
# Natural language processing is fascinating! It hel...  natural language processing is fascinating it help...
# Machine learning algorithms can process and analyz...  machine learning algorithms can process and analyz...
# Text preprocessing cleans and prepares data for an...  text preprocessing cleans and prepares data for an...
# Vectorization converts preprocessed words into num...  vectorization converts preprocessed words into num...
# 
# πŸ”’ STEP 2: TEXT VECTORIZATION (TFIDF)
# ------------------------------
# 
# πŸ“Š STEP 3: VECTOR ANALYSIS
# ------------------------------
# Vocabulary size: 50
# Vector shape: (4, 50)
# Sparsity: 72.00%
# Memory usage: ~700 bytes
# 
# Top 10 features (alphabetical), e.g.: ['algorithms', 'algorithms process', 'amounts', 'amounts textual', 'analysis', ...]
# 
# Top features for Document 1:
#   (the unigrams and bigrams unique to Document 1, such as 'natural' and
#    'language processing', share the highest weight; 'computers' and 'text'
#    score lower because they also appear in other documents)
# 
# βš–οΈ STEP 4: METHOD COMPARISON
# ------------------------------
#  TFIDF Shape TFIDF Sparsity BOW Shape BOW Sparsity
#      (4, 50)         72.00%   (4, 28)       70.54%
# 
# πŸ† FINAL COMPARISON
# ========================================
# BoW vectors shape: (4, 50)
# TF-IDF vectors shape: (4, 50)
# BoW sparsity: 72.00%
# TF-IDF sparsity: 72.00%

🎯 Key Takeaways ​

text
🎯 TEXT VECTORIZATION SUMMARY

πŸ”‘ KEY CONCEPTS
β”œβ”€β”€ Text must be converted to numbers for ML
β”œβ”€β”€ BoW counts word frequencies (simple but effective)
β”œβ”€β”€ TF-IDF weighs words by importance (better for most tasks)
β”œβ”€β”€ Sparsity is a common challenge with classical methods
└── Choose method based on task requirements

πŸ“Š PRACTICAL APPLICATIONS
β”œβ”€β”€ Document classification and clustering
β”œβ”€β”€ Information retrieval and search
β”œβ”€β”€ Sentiment analysis and opinion mining
β”œβ”€β”€ Recommendation systems
└── Feature engineering for ML models

πŸš€ NEXT STEPS
β”œβ”€β”€ Experiment with different vectorization parameters
β”œβ”€β”€ Try n-gram features for better context
β”œβ”€β”€ Explore neural embedding methods (Word2Vec, BERT)
β”œβ”€β”€ Consider computational efficiency for large datasets
└── Validate vectorization quality on your specific task

πŸ”„ When to Use Each Method ​

text
πŸ“‹ VECTORIZATION METHOD SELECTION

🎯 USE BAG OF WORDS WHEN:
β”œβ”€β”€ Building quick prototypes
β”œβ”€β”€ Working with small datasets
β”œβ”€β”€ Need interpretable features
β”œβ”€β”€ Computational resources are limited
└── Task doesn't require semantic understanding

🎯 USE TF-IDF WHEN:
β”œβ”€β”€ Need better feature weighting
β”œβ”€β”€ Working with varied document lengths
β”œβ”€β”€ Want automatic stop word handling
β”œβ”€β”€ Building information retrieval systems
└── Need balance between simplicity and performance

🎯 CONSIDER ADVANCED METHODS WHEN:
β”œβ”€β”€ Semantic similarity is important
β”œβ”€β”€ Context and word order matter
β”œβ”€β”€ Working with large-scale datasets
β”œβ”€β”€ Need dense vector representations
└── Performance is critical
