
Text Preprocessing Pipeline

Converting raw text into clean, structured data for machine learning

🧹 What is Text Preprocessing?

Text preprocessing is a crucial first step in any NLP project. It involves cleaning and transforming raw text into a format suitable for analysis or machine learning. Below are the essential preprocessing steps with practical examples using the popular NLTK and spaCy libraries.

text
📝 TEXT PREPROCESSING OVERVIEW

Raw Text → Cleaned Text → Structured Data → Ready for ML/Analysis

🔧 COMMON PREPROCESSING STEPS
├── Text Normalization (Lowercasing)
├── Stop Words Removal
├── Tokenization
├── Stemming/Lemmatization
├── Regular Expressions
├── N-grams Analysis
├── Parts of Speech (POS) Tagging
└── Named Entity Recognition (NER)

🎯 GOALS
├── Reduce noise and complexity
├── Standardize text format
├── Extract meaningful features
└── Improve ML model performance

🚀 Quick Start Guide

text
🚀 GETTING STARTED WITH TEXT PREPROCESSING

1. ENVIRONMENT SETUP
   □ Install required libraries (NLTK, spaCy)
   □ Download language models
   □ Set up development environment
   □ Test basic functionality

2. CHOOSE PREPROCESSING STEPS
   □ Identify your text source and format
   □ Determine required cleaning steps
   □ Select appropriate techniques
   □ Consider domain-specific requirements

3. IMPLEMENT PIPELINE
   □ Create preprocessing function
   □ Test on sample data
   □ Optimize for performance
   □ Document your approach

4. VALIDATION & TESTING
   □ Verify preprocessing results
   □ Check for edge cases
   □ Validate on different text types
   □ Monitor preprocessing quality

📦 Required Libraries

bash
# Essential NLP libraries for text preprocessing
pip install nltk spacy scikit-learn pandas numpy matplotlib seaborn

# Download the spaCy language model
python -m spacy download en_core_web_sm

python
# Download NLTK data
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
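
Before moving on, it helps to confirm the setup works end to end (step 1 of the Quick Start). A minimal sketch, assuming the installs and downloads above completed successfully:

python
# Quick environment check (assumes the installs and downloads above succeeded)
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")         # raises OSError if the model is missing
print(word_tokenize("Setup looks good!"))  # ['Setup', 'looks', 'good', '!']
print(len(stopwords.words('english')), "English stop words available")
print([token.text for token in nlp("spaCy is ready.")])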

1. Text Normalization - Lowercasing

Converting text to lowercase is typically the first normalization step. It keeps the data consistent and ensures that words are treated the same regardless of capitalization.

text
📝 LOWERCASING PROCESS

Input:  "Her cat's name is Luna"
Output: "her cat's name is luna"

🎯 BENEFITS
├── Ensures consistency
├── Reduces vocabulary size
├── Simplifies processing
└── Improves model performance

Why lowercase?

  • Ensures consistency (e.g., "Apple" and "apple" treated as the same word)
  • Reduces vocabulary size and complexity
  • Makes further text processing easier

Important note: Lowercasing can change meaning (e.g., "US" as country vs "us" as pronoun).

python
# Example 1: Single sentence
sentence = "Her cat's name is Luna"
print(sentence)
lower_sentence = sentence.lower()
print(lower_sentence)

# OUTPUT:
# Her cat's name is Luna
# her cat's name is luna

# Example 2: Multiple sentences
sentence_list = ['Could you pass me the TV remote?', 
                 'It is IMPOSSIBLE to find this hotel', 
                 'Want to go for dinner on Tuesday?']
lower_sentence_list = [x.lower() for x in sentence_list]
print(lower_sentence_list)

# OUTPUT:
# ['could you pass me the tv remote?', 'it is impossible to find this hotel', 'want to go for dinner on tuesday?']
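
As noted above, lowercasing can blur distinctions such as "US" vs. "us". If a few case-sensitive terms matter for your task, one option is to protect them before lowercasing. A minimal sketch; the acronym set here is purely illustrative:

python
# Lowercase everything except a small set of protected acronyms (illustrative list)
acronyms = {"US", "UN", "NASA"}

def lowercase_except(text, keep=acronyms):
    return ' '.join(w if w in keep else w.lower() for w in text.split())

print(lowercase_except("The US And NASA Signed A Deal"))

# OUTPUT:
# the US and NASA signed a deal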

2. Stop Words Removal

Stop words are common words that carry little meaning on their own (e.g., "and", "the", "is"), so they are usually removed before analysis.

text
🛑 STOP WORDS REMOVAL

Input:  "it was too far to go to the shop"
Output: "far go shop" (after removing: it, was, too, to, to, the)

🎯 BENEFITS
├── Reduces data complexity
├── Speeds up processing
├── Improves ML accuracy
└── Creates cleaner datasets

Why remove stop words?

  • Reduces data complexity and noise
  • Speeds up processing times
  • Often improves machine learning accuracy
  • Creates smaller, cleaner datasets

python
import nltk
from nltk.corpus import stopwords

# Load English stop words
en_stopwords = stopwords.words('english')
print(en_stopwords[:20])  # Show first 20

# OUTPUT:
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

sentence = "it was too far to go to the shop and he did not want her to walk"
# Remove stop words
sentence_no_stopwords = ' '.join([word for word in sentence.split() if word not in en_stopwords])
print(sentence_no_stopwords)

# OUTPUT:
# far go shop want walk

# Customizing stop words
en_stopwords.remove("did")  # Keep 'did'
en_stopwords.remove("not")  # Keep 'not'
en_stopwords.append("go")   # Add 'go' as stop word
sentence_custom = ' '.join([word for word in sentence.split() if word not in en_stopwords])
print(sentence_custom)

# OUTPUT:
# far shop did not want walk

3. Regular Expressions (Regex)

Regex provides powerful pattern matching for text cleaning and manipulation.

text
🔍 REGEX PATTERN MATCHING

Pattern: r"[^\w\s]"
Purpose: Remove punctuation (keep only words and spaces)

Input:  "Hello, World! How are you?"
Output: "Hello World How are you"

🎯 COMMON REGEX PATTERNS
├── r"[^\w\s]" → Remove punctuation
├── r"\d+" → Find numbers
├── r"[A-Z]+" → Find uppercase letters
└── r"\b\w+\b" → Find whole words
python
import re

# Raw strings (important for regex)
my_folder = r"C:\desktop\notes"  # Use r"" to avoid escape issues

# Basic pattern matching
result = re.search("pattern", r"string containing the pattern")
if result:
    print(result.group())  # Returns the matched pattern

# OUTPUT:
# pattern

# Text replacement
string = r"sara was able to help me find the items i needed quickly"
new_string = re.sub(r"sara", r"sarah", string)
print(new_string)

# OUTPUT:
# sarah was able to help me find the items i needed quickly

# Advanced patterns
customer_reviews = [
    'sam was a great help to me in the store',
    'amazing work from sadeen!',
    'sarah was able to help me find the items i needed quickly',
    'great service from sara she found me what i wanted'
]

# Find Sarah's reviews (including "sara")
pattern = r"sarah?"  # ? makes 'h' optional
sarahs_reviews = [r for r in customer_reviews if re.search(pattern, r)]
print(sarahs_reviews)

# OUTPUT:
# ['sarah was able to help me find the items i needed quickly', 'great service from sara she found me what i wanted']

# Remove punctuation
pattern = r"[^\w\s]"  # Remove anything that's not word or whitespace
clean_reviews = [re.sub(pattern, "", r) for r in customer_reviews]
print(clean_reviews)

# OUTPUT:
# ['sam was a great help to me in the store', 'amazing work from sadeen', 'sarah was able to help me find the items i needed quickly', 'great service from sara she found me what i wanted']
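
The remaining patterns from the overview can be exercised with re.findall. The sample string below is made up purely for illustration:

python
import re

sample = "Order 66 was placed by ANNA on 12 May"

print(re.findall(r"\d+", sample))       # find numbers
print(re.findall(r"[A-Z]+", sample))    # find runs of uppercase letters
print(re.findall(r"\b\w+\b", sample))   # find whole words

# OUTPUT:
# ['66', '12']
# ['O', 'ANNA', 'M']
# ['Order', '66', 'was', 'placed', 'by', 'ANNA', 'on', '12', 'May']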

4. Tokenization

Tokenization breaks text into smaller units (tokens) - words, sentences, or subwords. This is fundamental for NLP analysis.

text
🔪 TOKENIZATION PROCESS

Input:  "Her cat's name is Luna. Her dog's name is Max!"

SENTENCE TOKENIZATION:
├── "Her cat's name is Luna."
└── "Her dog's name is Max!"

WORD TOKENIZATION:
├── "Her", "cat", "'s", "name", "is", "Luna"
└── "Her", "dog", "'s", "name", "is", "Max"
python
from nltk.tokenize import word_tokenize, sent_tokenize

# Sentence tokenization
sentences = "Her cat's name is Luna. Her dog's name is Max!"
sent_tokens = sent_tokenize(sentences)
print(sent_tokens)  # ['Her cat\'s name is Luna.', 'Her dog\'s name is Max!']

# OUTPUT:
# ["Her cat's name is Luna.", "Her dog's name is Max!"]

# Word tokenization
sentence = "Her cat's name is Luna"
word_tokens = word_tokenize(sentence)
print(word_tokens)  # ['Her', 'cat', "'s", 'name', 'is', 'Luna']

# OUTPUT:
# ['Her', 'cat', "'s", 'name', 'is', 'Luna']

# Notice: "cat's" becomes ['cat', "'s"]
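
spaCy (installed in the setup above) is a common alternative tokenizer; it applies the en_core_web_sm model's rules and also segments sentences. A minimal sketch for comparison with NLTK:

python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Her cat's name is Luna. Her dog's name is Max!")

print([token.text for token in doc])      # word-level tokens
print([sent.text for sent in doc.sents])  # sentence segmentation

# OUTPUT:
# ['Her', 'cat', "'s", 'name', 'is', 'Luna', '.', 'Her', 'dog', "'s", 'name', 'is', 'Max', '!']
# ["Her cat's name is Luna.", "Her dog's name is Max!"]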

5. Stemming

Stemming reduces words to their base form by removing suffixes. It's fast but can produce non-meaningful stems.

text
🌱 STEMMING PROCESS

Input Words: ["connecting", "connected", "connectivity", "connects"]
Stemmed:     ["connect", "connect", "connect", "connect"]

🎯 CHARACTERISTICS
├── Fast processing
├── May produce non-words
├── Rule-based approach
└── Good for search/indexing
python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Examples
connect_tokens = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
for token in connect_tokens:
    print(f"{token:12}{ps.stem(token)}")
# OUTPUT:
# connecting   → connect
# connected    → connect
# connectivity → connect
# connect      → connect
# connects     → connect

learn_tokens = ['learned', 'learning', 'learn', 'learns', 'learner']
for token in learn_tokens:
    print(f"{token:8}{ps.stem(token)}")

# OUTPUT:
# learned  → learn
# learning → learn
# learn    → learn
# learns   → learn
# learner  → learn

6. Lemmatization

Lemmatization uses a dictionary (WordNet) to reduce words to meaningful base forms. It is more accurate than stemming but slower, and it gives the best results when the word's part of speech is supplied (see the POS-aware example below).

text
🔍 LEMMATIZATION PROCESS

Input Words: ["better", "running", "flies", "went"]
Lemmatized:  ["good", "run", "fly", "go"]

🎯 CHARACTERISTICS
├── Produces real words
├── Dictionary-based
├── More accurate
└── Slower than stemming
python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Comparison: Stemming vs Lemmatization
test_words = ['connecting', 'better', 'worse', 'running', 'flies']
print(f"{'Word':<12} | {'Stemming':<10} | {'Lemmatization'}")
print("-" * 40)
for word in test_words:
    stemmed = ps.stem(word)
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<12} | {stemmed:<10} | {lemmatized}")

# OUTPUT:
# Word         | Stemming   | Lemmatization
# ----------------------------------------
# connecting   | connect    | connecting
# better       | better     | better
# worse        | wors       | worse
# running      | run        | running
# flies        | fli        | fly
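
By default, WordNetLemmatizer treats every word as a noun, which is why "running" and "connecting" come back unchanged above. Passing the part of speech (pos='v' for verbs, pos='a' for adjectives) produces the base forms shown in the overview. A short sketch:

python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('running'))           # running (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('better', pos='a'))   # good
print(lemmatizer.lemmatize('went', pos='v'))     # go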

7. N-grams Analysis

N-grams are sequences of N consecutive tokens, useful for understanding context and creating features.

text
📊 N-GRAMS ANALYSIS

Input: "The rise of artificial intelligence"

UNIGRAMS (n=1): ["The", "rise", "of", "artificial", "intelligence"]
BIGRAMS (n=2):  [("The", "rise"), ("rise", "of"), ("of", "artificial"), ("artificial", "intelligence")]
TRIGRAMS (n=3): [("The", "rise", "of"), ("rise", "of", "artificial"), ("of", "artificial", "intelligence")]

🎯 APPLICATIONS
├── Context understanding
├── Feature extraction
├── Language modeling
└── Phrase detection
python
import nltk
import pandas as pd
import matplotlib.pyplot as plt

tokens = ['the', 'rise', 'of', 'artificial', 'intelligence', 'has', 'led', 'to', 
          'significant', 'advancements', 'in', 'natural', 'language', 'processing']

# Unigrams (n=1)
unigrams = pd.Series(nltk.ngrams(tokens, 1)).value_counts()
print("Top unigrams:", unigrams.head())

# OUTPUT:
# Top unigrams: (the,)                    1
# (rise,)                   1
# (of,)                     1
# (artificial,)             1
# (intelligence,)           1
# dtype: int64

# Bigrams (n=2)
bigrams = pd.Series(nltk.ngrams(tokens, 2)).value_counts()
print("Top bigrams:", bigrams.head())

# OUTPUT:
# Top bigrams: (the, rise)                  1
# (rise, of)                1
# (of, artificial)          1
# (artificial, intelligence) 1
# (intelligence, has)       1
# dtype: int64

# Trigrams (n=3)
trigrams = pd.Series(nltk.ngrams(tokens, 3)).value_counts()
print("Top trigrams:", trigrams.head())

# OUTPUT:
# Top trigrams: (the, rise, of)                    1
# (rise, of, artificial)          1
# (of, artificial, intelligence)  1
# (artificial, intelligence, has) 1
# (intelligence, has, led)        1
# dtype: int64

# Visualization
unigrams.head(10).plot.barh(color='lightsalmon', title='Top 10 Unigrams')
plt.show()
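
To turn n-grams into machine learning features, scikit-learn (from the install list above) offers CountVectorizer with an ngram_range option. A minimal sketch on two toy documents:

python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the rise of artificial intelligence",
        "artificial intelligence in natural language processing"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # includes bigrams like 'artificial intelligence'
print(X.toarray())                         # document-term count matrix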

8. Advanced Text Analysis

For more advanced text analysis including Parts of Speech (POS) Tagging and Named Entity Recognition (NER), see the dedicated Text Analysis guide. These techniques are more about understanding and extracting information from text rather than preprocessing it.


🔄 Complete Preprocessing Pipeline

text
📝 COMPREHENSIVE NLP PREPROCESSING PIPELINE

Raw Text Input: "Hello, World! How are you today?"

1. Lowercasing: "hello, world! how are you today?"

2. Tokenization: ["hello", ",", "world", "!", "how", "are", "you", "today", "?"]

3. Punctuation Removal: ["hello", "world", "how", "are", "you", "today"]

4. Stop Words Removal: ["hello", "world", "today"]

5. Stemming/Lemmatization: ["hello", "world", "today"]

6. Feature Extraction: [Vector representations, N-grams, etc.]

Output: Clean, processed tokens ready for ML/Analysis
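
The diagram above can be reproduced with a few lines of NLTK before moving to the full, configurable class in the next section. A minimal sketch:

python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def simple_pipeline(text):
    text = text.lower()                                 # 1. lowercasing
    tokens = word_tokenize(text)                        # 2. tokenization
    tokens = [t for t in tokens if re.match(r"\w", t)]  # 3. punctuation removal
    en_stop = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in en_stop]    # 4. stop words removal
    return tokens                                       # stemming/lemmatization and features as needed

print(simple_pipeline("Hello, World! How are you today?"))

# OUTPUT:
# ['hello', 'world', 'today']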

🎯 Preprocessing Pipeline Practical Example

This comprehensive example demonstrates how to build a complete text preprocessing pipeline that handles real-world scenarios. We'll create a flexible TextPreprocessor class that can be configured for different use cases and then apply it to various types of text data.

What we're building:

  • A configurable preprocessing pipeline that shows each step
  • Real-world examples using customer reviews and social media text
  • Comparison of different preprocessing approaches
  • Specialized handling for domain-specific text (social media)
python
import nltk
import spacy
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

class TextPreprocessor:
    """
    A comprehensive text preprocessing pipeline with configurable options.
    
    This class demonstrates:
    - Step-by-step text transformation
    - Configurable preprocessing options
    - Visual feedback of each processing step
    - Comparison of different approaches
    """
    def __init__(self, use_stemming=True, remove_stopwords=True):
        self.use_stemming = use_stemming
        self.remove_stopwords = remove_stopwords
        
        # Initialize tools
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        self.stopwords = set(stopwords.words('english'))
        
    def preprocess_text(self, text):
        """
        Complete preprocessing pipeline with step-by-step visualization.
        
        This method demonstrates:
        - How each preprocessing step transforms the text
        - The cumulative effect of multiple preprocessing steps
        - Visual feedback to understand the transformation process
        """
        print(f"📝 ORIGINAL: {text}")
        
        # 1. Lowercase normalization
        # Why: Ensures "Apple" and "apple" are treated as the same word
        text = text.lower()
        print(f"1️⃣ LOWERCASE: {text}")
        
        # 2. Remove special characters (keep letters, numbers, spaces)
        # Why: Removes noise like punctuation that doesn't add semantic meaning
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
        print(f"2️⃣ REMOVE SPECIAL CHARS: {text}")
        
        # 3. Tokenization
        # Why: Breaks text into individual words/tokens for processing
        tokens = word_tokenize(text)
        print(f"3️⃣ TOKENIZED: {tokens}")
        
        # 4. Remove stopwords (if enabled)
        # Why: Removes common words that don't carry much meaning
        if self.remove_stopwords:
            tokens = [token for token in tokens if token not in self.stopwords]
            print(f"4️⃣ REMOVE STOPWORDS: {tokens}")
        
        # 5. Stemming or Lemmatization
        # Why: Reduces words to their root form for better feature matching
        if self.use_stemming:
            tokens = [self.stemmer.stem(token) for token in tokens]
            print(f"5️⃣ STEMMED: {tokens}")
        else:
            tokens = [self.lemmatizer.lemmatize(token) for token in tokens]
            print(f"5️⃣ LEMMATIZED: {tokens}")
        
        print(f"✅ FINAL RESULT: {tokens}")
        return tokens

# EXAMPLE 1: Processing customer reviews
# This shows how preprocessing works on real e-commerce review data
print("🔍 REAL-WORLD EXAMPLE 1: CUSTOMER REVIEW PREPROCESSING")
print("=" * 60)
print("📋 WHAT WE'RE DOING:")
print("• Processing a typical e-commerce customer review")
print("• Showing how each step transforms the text")
print("• Demonstrating the cumulative effect of preprocessing")
print("• Extracting meaningful features from noisy text")
print("=" * 60)

preprocessor = TextPreprocessor(use_stemming=True, remove_stopwords=True)

# Sample customer review with typical characteristics:
# - Mixed case, punctuation, contractions, emojis, ratings
customer_review = """
The iPhone 13 Pro is AMAZING! I've been using it for 3 months now, and 
it's the best phone I've ever owned. The camera quality is incredible, 
especially in low-light conditions. Battery life lasts all day easily. 
Highly recommended! 5/5 stars ⭐⭐⭐⭐⭐
"""

# Process the review
processed_tokens = preprocessor.preprocess_text(customer_review)

print("\n" + "=" * 60)
print("📊 PROCESSING SUMMARY:")
print("💡 WHAT THIS TELLS US:")
print("• Original text contained lots of noise (punctuation, mixed case)")
print("• Preprocessing extracted the core meaningful words")
print("• Result is clean tokens ready for machine learning")
print(f"• Efficiency: {len(customer_review)} characters → {len(processed_tokens)} tokens")
print("=" * 60)
print(f"Original length: {len(customer_review)} characters")
print(f"Processed tokens: {len(processed_tokens)} tokens")
print(f"Processed result: {' '.join(processed_tokens)}")

# EXAMPLE 2: Comparing different preprocessing approaches
# This demonstrates how different settings affect the final result
print("\n" + "=" * 60)
print("🔄 EXAMPLE 2: COMPARING DIFFERENT APPROACHES")
print("=" * 60)
print("📋 WHAT WE'RE DOING:")
print("• Testing the same text with different preprocessing settings")
print("• Showing how stemming vs lemmatization affects results")
print("• Demonstrating the impact of keeping vs removing stopwords")
print("• Helping you choose the right approach for your use case")
print("=" * 60)

test_text = "I'm loving this new MacBook Pro! It's super fast and efficient."

# Approach 1: Aggressive preprocessing (stemming + remove stopwords)
print("\n1️⃣ AGGRESSIVE PREPROCESSING (STEMMING + REMOVE STOPWORDS):")
print("💡 BEST FOR: Search engines, topic modeling, keyword extraction")
preprocessor1 = TextPreprocessor(use_stemming=True, remove_stopwords=True)
result1 = preprocessor1.preprocess_text(test_text)

# Approach 2: Balanced preprocessing (lemmatization + remove stopwords)  
print("\n2️⃣ BALANCED PREPROCESSING (LEMMATIZATION + REMOVE STOPWORDS):")
print("💡 BEST FOR: Text classification, sentiment analysis")
preprocessor2 = TextPreprocessor(use_stemming=False, remove_stopwords=True)
result2 = preprocessor2.preprocess_text(test_text)

# Approach 3: Conservative preprocessing (stemming + keep stopwords)
print("\n3️⃣ CONSERVATIVE PREPROCESSING (STEMMING + KEEP STOPWORDS):")
print("💡 BEST FOR: Language modeling, translation, where context matters")
preprocessor3 = TextPreprocessor(use_stemming=True, remove_stopwords=False)
result3 = preprocessor3.preprocess_text(test_text)

print("\n" + "=" * 60)
print("📈 COMPARISON RESULTS:")
print("💡 NOTICE HOW DIFFERENT APPROACHES GIVE DIFFERENT RESULTS:")
print(f"• Aggressive (Stem + No Stop):  {result1}")
print(f"• Balanced (Lemma + No Stop):   {result2}")
print(f"• Conservative (Stem + Keep Stop): {result3}")
print("=" * 60)

# EXAMPLE 3: Advanced preprocessing for social media text
# This shows specialized preprocessing for social media data
def comprehensive_preprocessing_pipeline(text):
    """
    Comprehensive preprocessing specifically designed for social media text.
    
    This function demonstrates:
    - Handling social media specific patterns (URLs, mentions, hashtags)
    - Converting emojis to text representations
    - Specialized cleaning for informal text
    - Real-world application of preprocessing techniques
    """
    
    print(f"📱 SOCIAL MEDIA TEXT PREPROCESSING")
    print(f"💡 WHAT WE'RE DOING:")
    print(f"• Handling URLs, mentions (@user), hashtags (#tag)")
    print(f"• Converting emojis to text representations")
    print(f"• Cleaning informal social media language")
    print(f"• Extracting meaningful content from noisy social posts")
    print(f"Original: {text}")
    
    # 1. Handle social media specific patterns
    # Remove URLs (common in social media)
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    print(f"1️⃣ Remove URLs: {text}")
    
    # Handle mentions and hashtags (convert to generic tokens)
    text = re.sub(r'@\w+', 'USER_MENTION', text)
    text = re.sub(r'#\w+', 'HASHTAG', text)
    print(f"2️⃣ Handle mentions/hashtags: {text}")
    
    # Handle emojis (convert to text representation)
    text = re.sub(r'😀|😃|😄|😁|😆|😅|😂|🤣|😊|😇|🙂|🙃|😉|😌|😍|🥰|😘|😗|😙|😚|😋|😛|😝|😜|🤪|🤨|🧐|🤓|😎|🤩|🥳|😏|😒|😞|😔|😟|😕|🙁|☹️|😣|😖|😫|😩|🥺|😢|😭|😤|😠|😡|🤬|🤯|😳|🥵|🥶|😱|😨|😰|😥|😓|🤗|🤔|🤭|🤫|🤥|😶|😐|😑|😬|🙄|😯|😦|😧|😮|😲|🥱|😴|🤤|😪|😵|🤐|🥴|🤢|🤮|🤧|😷|🤒|🤕|🤑|🤠|😈|👿|👹|👺|🤡|💩|👻|💀|☠️|👽|👾|🤖|🎃|😺|😸|😹|😻|😼|😽|🙀|😿|😾', ' EMOJI ', text)
    print(f"3️⃣ Handle emojis: {text}")
    
    # Standard preprocessing
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    tokens = word_tokenize(text)
    
    # Remove stopwords and very short tokens
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and len(token) > 1]
    
    print(f"4️⃣ Final tokens: {tokens}")
    return tokens

# Test social media preprocessing
print("\n" + "=" * 60)
print("📱 EXAMPLE 3: SOCIAL MEDIA PREPROCESSING")
print("=" * 60)
print("📋 WHAT WE'RE DOING:")
print("• Processing a typical social media post")
print("• Handling URLs, mentions, hashtags, and emojis")
print("• Showing specialized preprocessing for informal text")
print("• Demonstrating domain-specific text processing")
print("=" * 60)

social_media_text = """
OMG! Just tried the new @starbucks drink 🤤 It's absolutely delicious! 
#coffee #latte #yum Check it out: https://starbucks.com/newdrink 
I'm literally obsessed 😍😍😍
"""

social_tokens = comprehensive_preprocessing_pipeline(social_media_text)

print(f"\n✅ SOCIAL MEDIA PROCESSING COMPLETE!")
print(f"💡 WHAT WE ACHIEVED:")
print(f"• Converted social media post to clean, analyzable tokens")
print(f"• Preserved semantic meaning while removing noise")
print(f"• Handled domain-specific elements (URLs, mentions, emojis)")
print(f"• Result ready for sentiment analysis or topic modeling")
print(f"Final processed tokens: {social_tokens}")

🎯 Output Example

text
📝 ORIGINAL: The iPhone 13 Pro is AMAZING! I've been using it for 3 months now, and it's the best phone I've ever owned.

1️⃣ LOWERCASE: the iphone 13 pro is amazing! i've been using it for 3 months now, and it's the best phone i've ever owned.

2️⃣ REMOVE SPECIAL CHARS: the iphone 13 pro is amazing ive been using it for 3 months now and its the best phone ive ever owned

3️⃣ TOKENIZED: ['the', 'iphone', '13', 'pro', 'is', 'amazing', 'ive', 'been', 'using', 'it', 'for', '3', 'months', 'now', 'and', 'its', 'the', 'best', 'phone', 'ive', 'ever', 'owned']

4️⃣ REMOVE STOPWORDS: ['iphone', '13', 'pro', 'amazing', 'ive', 'using', '3', 'months', 'best', 'phone', 'ive', 'ever', 'owned']

5️⃣ STEMMED: ['iphon', '13', 'pro', 'amaz', 'ive', 'use', '3', 'month', 'best', 'phone', 'ive', 'ever', 'own']

✅ FINAL RESULT: ['iphon', '13', 'pro', 'amaz', 'ive', 'use', '3', 'month', 'best', 'phone', 'ive', 'ever', 'own']

🎯 Key Preprocessing Decisions

text
🤔 PREPROCESSING DECISION MATRIX

TECHNIQUE         | WHEN TO USE                    | CONSIDERATIONS
------------------|--------------------------------|------------------
Lowercasing       | Almost always                  | Unless case matters (names)
Stop Words        | Topic modeling, keywords       | May remove important context
Stemming          | Speed > accuracy               | Fast, may create non-words
Lemmatization     | Accuracy > speed               | Slower, produces real words
N-grams           | Context important              | Increases feature space
POS Tagging       | Syntax analysis                | Computationally expensive
NER               | Information extraction         | Domain-specific models

When to Use Each Technique:

  • Lowercasing: Almost always, unless case matters (e.g., proper nouns)
  • Stop words removal: For topic modeling, keyword extraction
  • Stemming: When speed matters more than accuracy
  • Lemmatization: When accuracy matters more than speed
  • N-grams: For capturing context and phrases
  • POS tagging: For syntax analysis and feature engineering
  • NER: For information extraction and entity analysis
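
These guidelines can also be captured as ready-made configurations for the TextPreprocessor class defined earlier. The task names and settings below are illustrative defaults rather than hard rules:

python
# Illustrative task-to-settings mapping for the TextPreprocessor class above
task_settings = {
    "keyword_extraction":  dict(use_stemming=True,  remove_stopwords=True),   # aggressive, fast
    "text_classification": dict(use_stemming=False, remove_stopwords=True),   # balanced, real words
    "language_modeling":   dict(use_stemming=True,  remove_stopwords=False),  # conservative, keeps context
}

preprocessor = TextPreprocessor(**task_settings["text_classification"])
tokens = preprocessor.preprocess_text("The battery life on this laptop is fantastic!")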

🔧 Advanced Preprocessing Techniques

python
def advanced_preprocessing_pipeline(text):
    """Advanced preprocessing with multiple options"""
    
    # 1. Handle contractions
    contractions = {
        "don't": "do not",
        "won't": "will not",
        "can't": "cannot",
        "n't": " not",
        "'re": " are",
        "'ve": " have",
        "'ll": " will",
        "'d": " would"
    }
    
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)
    
    # 2. Handle numbers
    text = re.sub(r'\d+', 'NUM', text)  # Replace numbers with NUM token
    
    # 3. Handle URLs and emails
    text = re.sub(r'http\S+', 'URL', text)
    text = re.sub(r'\S+@\S+', 'EMAIL', text)
    
    # 4. Handle repeated characters
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)  # "sooooo" → "soo"
    
    # 5. Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Example
text = "I don't think this is sooooo cool!!! Visit http://example.com or email me@test.com"
print("Original:", text)
print("Advanced preprocessing:", advanced_preprocessing_pipeline(text))

# OUTPUT:
# Original: I don't think this is sooooo cool!!! Visit http://example.com or email me@test.com
# Advanced preprocessing: I do not think this is soo cool!! Visit URL or email EMAIL

🎯 Key Takeaways

text
📚 PREPROCESSING PRINCIPLES

✅ BEST PRACTICES
├── Clean systematically (follow consistent pipeline)
├── Consider your use case (different tasks need different steps)
├── Preserve meaning (don't remove important information)
├── Test and iterate (experiment with combinations)
└── Document your choices (for reproducibility)

⚠️ COMMON PITFALLS
├── Over-preprocessing (removing too much information)
├── Under-preprocessing (leaving too much noise)
├── Inconsistent preprocessing (different steps for train/test)
├── Ignoring domain specifics (medical vs social media text)
└── Not validating results (check if preprocessing helps)

Preprocessing Principles:

  • Clean systematically: Follow a consistent preprocessing pipeline
  • Consider your use case: Different tasks need different preprocessing steps
  • Preserve meaning: Don't remove information that matters for your task
  • Test and iterate: Experiment with different preprocessing combinations

Next Steps: continue with the dedicated Text Analysis guide for Parts of Speech (POS) Tagging and Named Entity Recognition (NER).
