Unsupervised Learning
Discovering hidden patterns and structures in data without labeled examples
What is Unsupervised Learning?
Definition: A machine learning approach that finds hidden patterns, structures, or relationships in data without pre-existing labels or target variables.
Simple Analogy: Like an explorer discovering new territories without a map. You examine the landscape (data) to find natural groupings, paths, or interesting features without knowing what you're supposed to find.
UNSUPERVISED LEARNING PROCESS
Input: Raw Data (No Labels) → Algorithm → Discovered Patterns/Structure
Example:
Customer Data (age, income, purchases) → Clustering → Customer Segments
(No predefined segments given)
Types of Unsupervised Learning
UNSUPERVISED LEARNING TYPES
UNSUPERVISED LEARNING (pattern discovery without labels)
├── CLUSTERING: group similar data points
│   ├── K-Means
│   ├── Hierarchical
│   └── DBSCAN
├── DIMENSIONALITY REDUCTION: reduce features, keep information
│   ├── PCA
│   ├── t-SNE
│   └── Factor Analysis
└── ASSOCIATION RULE LEARNING: find item relationships
    └── Market Basket Analysis (A → B)
Clustering
What is Clustering?
Definition: Grouping similar data points together while keeping dissimilar points in different groups.
Goal: Discover natural groupings in data where members of each group are more similar to each other than to members of other groups.
CLUSTERING CONCEPT
Before clustering: a scatter of unlabeled points with no visible grouping.
After clustering: the same points separated into 3 clear clusters.
K-Means Clustering
How it works: Partitions data into k clusters by minimizing distance from points to cluster centers
Steps:
- Choose number of clusters (k)
- Initialize k cluster centers randomly
- Assign each point to nearest center
- Update centers to mean of assigned points
- Repeat until convergence
Pros:
- Simple and fast
- Works well with spherical clusters
- Scales well to large datasets
- Guaranteed to converge
Cons:
- Need to specify k beforehand
- Sensitive to initialization
- Assumes spherical clusters
- Affected by outliers
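A minimal sketch of these steps with scikit-learn; the customer-style data, the feature scaling choice, and k=3 are illustrative assumptions rather than part of the original example:

```python
# Minimal K-Means sketch (illustrative data; assumes scikit-learn is installed)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy "age vs income" customers -- values are made up for illustration
X = np.array([
    [25, 30_000], [27, 32_000], [24, 28_000],    # young, lower income
    [45, 60_000], [47, 58_000], [44, 62_000],    # middle-aged, medium income
    [60, 95_000], [62, 90_000], [65, 100_000],   # older, higher income
])

# Scale features so income does not dominate the distance calculation
X_scaled = StandardScaler().fit_transform(X)

# k must be chosen up front; multiple restarts (n_init) reduce init sensitivity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # final centers (in scaled feature space)
```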
K-MEANS EXAMPLE
Customer Segmentation (age vs income scatter plot):
Three randomly placed initial centers converge to the natural groups:
C1 → (Young, Low Income), C2 → (Middle-aged, Medium Income), C3 → (Older, High Income)
Clusters Found:
1. Young professionals (low income)
2. Middle-aged (medium income)
3. Established professionals (high income)
Hierarchical Clustering
How it works: Creates a tree of clusters by iteratively merging or splitting
Types:
- Agglomerative: Bottom-up (start with individual points, merge)
- Divisive: Top-down (start with all points, split)
Pros:
- No need to specify number of clusters
- Creates hierarchy of clusters
- Deterministic results
- Can handle any distance metric
Cons:
- Computationally expensive, O(n³)
- Sensitive to noise and outliers
- Difficult to handle large datasets
- Hard to undo previous steps
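A short agglomerative sketch with SciPy that mirrors the dendrogram idea shown next; the sample points and cut levels are assumptions:

```python
# Agglomerative (bottom-up) clustering sketch, illustrative data
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Eight toy points forming two loose groups (think A..H in the dendrogram)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1], [8.1, 8.3]])

# 'ward' merges the pair of clusters that minimizes within-cluster variance
Z = linkage(X, method="ward")

# "Cutting" the merge tree at different heights gives different cluster counts
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
print(labels_2)
print(labels_4)
```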
HIERARCHICAL CLUSTERING DENDROGRAM
Eight points (A-H) are merged bottom-up into a tree: pairs first ({A,B}, {C,D}, {E,F}, {G,H}),
then larger groups, until a single root remains.
Cut the tree high → 2 clusters: {A,B,C,D} and {E,F,G,H}
Cut the tree lower → 4 clusters: {A,B}, {C,D}, {E,F}, {G,H}
DBSCAN (Density-Based)
How it works: Groups points that are closely packed while marking outliers
Key Concepts:
- Core points: Have enough neighbors within radius
- Border points: Within radius of core point
- Noise points: Neither core nor border (outliers)
Pros:
- Automatically determines number of clusters
- Can find arbitrarily shaped clusters
- Robust to outliers
- Can identify noise points
Cons:
- Sensitive to hyperparameters
- Struggles with varying densities
- Memory intensive for large datasets
- Difficult with high-dimensional data
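A minimal DBSCAN sketch with scikit-learn; the eps radius, min_samples value, and toy points are assumptions chosen so the result matches the example below:

```python
# DBSCAN sketch: two dense groups plus one isolated point (illustrative values)
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense group 1
              [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],   # dense group 2
              [9.0, 1.0]])                                       # isolated point

# eps = neighborhood radius, min_samples = points needed to form a core point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# Cluster ids are assigned automatically; label -1 marks noise/outliers
print(labels)  # e.g. [0 0 0 0 1 1 1 1 -1]
```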
DBSCAN EXAMPLE
Legend: core point = has ≥ 3 neighbors within the radius; border point = within the radius
of a core point; noise point = neither (outlier).
Two dense groups of points become two clusters; one isolated point is marked as noise.
Result: 2 clusters + 1 outlier
Dimensionality Reduction
What is Dimensionality Reduction?
Definition: Reducing the number of features while preserving important information
Why needed:
- Curse of dimensionality: Too many features can hurt performance
- Visualization: Reduce to 2D/3D for plotting
- Storage: Less memory and computation
- Noise reduction: Remove irrelevant features
DIMENSIONALITY REDUCTION CONCEPT
High-dimensional data: Feature 1: Height, Feature 2: Weight, Feature 3: Shoe Size,
Feature 4: Hand Span, Feature 5: Head Circumference, ...
Reduced dimensions: Component 1: "Size", Component 2: "Build" (capture 95% of the variance)
100 features → 2 components (easier to visualize and process)
Principal Component Analysis (PCA)
How it works: Finds directions of maximum variance in data
Steps:
- Standardize the data
- Compute covariance matrix
- Find eigenvectors (principal components)
- Project data onto top components
Pros:
- Reduces overfitting
- Removes correlated features
- Fast and simple
- Linear transformation
Cons:
- Linear combinations only
- Components hard to interpret
- May lose important information
- Sensitive to scaling
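A minimal PCA sketch with scikit-learn; the synthetic correlated features and the choice of two components are illustrative assumptions:

```python
# PCA sketch on synthetic, strongly correlated features (illustrative)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# One hidden "size" factor drives five noisy measurements (200 samples)
size = rng.normal(0.0, 1.0, 200)
X = np.column_stack([size + rng.normal(0.0, 0.1, 200) for _ in range(5)])

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # variance captured by each component
print(X_reduced.shape)                # (200, 2)
```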
PCA EXAMPLE
The original 2D points lie roughly along one diagonal direction; PCA aligns PC1 with that
main direction and PC2 with the perpendicular one.
PC1 captures 90% of the variance, PC2 captures 10%,
so PC1 alone can serve as a 1D representation.
t-SNE (t-Distributed Stochastic Neighbor Embedding)
How it works: Preserves local neighborhood structure for visualization
Use case: Mainly for visualization of high-dimensional data in 2D/3D
Pros:
- Excellent for visualization
- Preserves local structure
- Can reveal hidden patterns
- Works with non-linear relationships
Cons:
- Computationally expensive
- Only for visualization (not feature reduction)
- Non-deterministic results
- Hyperparameter sensitive
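A small t-SNE sketch with scikit-learn; the 50-dimensional synthetic blobs stand in for image features, and the perplexity value is an assumed tuning choice:

```python
# t-SNE sketch: embed 50-dimensional synthetic "image features" into 2D
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three well-separated blobs (e.g. cats / dogs / birds), 30 samples each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 50)) for c in (0.0, 5.0, 10.0)])

# Perplexity roughly sets the neighborhood size; it must be smaller than n_samples
X_2d = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)

print(X_2d.shape)  # (90, 2) -- ready for a scatter plot; the blobs stay separated
```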
t-SNE VISUALIZATION
High-Dimensional Data → t-SNE → 2D Visualization
Image dataset (cat, dog, and bird images): in the 2D t-SNE plot the cat images form one
cluster, the dog images another, and the bird images a third.
Similar images cluster together in 2D space.
Association Rule Learning
What is Association Rule Learning?
Definition: Finding relationships between different items or events
Common Format: "If A, then B" or A → B
Applications:
- Market basket analysis
- Web usage patterns
- Protein sequences
- Medical diagnosis patterns
MARKET BASKET ANALYSIS
Transaction Data:
Customer 1: {Bread, Milk, Eggs}
Customer 2: {Bread, Butter}
Customer 3: {Milk, Eggs, Butter}
Customer 4: {Bread, Milk, Butter}
Customer 5: {Bread, Eggs}
Association Rules Found:
Bread → Milk (Support: 40%, Confidence: 50%)
Milk → Eggs (Support: 40%, Confidence: 67%)
Key Metrics
Support: How frequently items appear together
- Support(A → B) = P(A and B)
Confidence: How often B appears when A is present
- Confidence(A → B) = P(B|A) = Support(A,B) / Support(A)
Lift: How much more likely B is when A is present
- Lift(A → B) = Confidence(A → B) / Support(B)
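A plain-Python sketch that recomputes these metrics for the five transactions listed above; the helper functions are illustrative, not a standard library API:

```python
# Support / confidence / lift computed directly from the transactions above
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Eggs", "Butter"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Eggs"},
]

def support(items):
    """Fraction of transactions that contain all of the given items."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"Milk"}, {"Eggs"}
print(support(a | c))    # 0.4  -> Support(Milk -> Eggs) = 40%
print(confidence(a, c))  # 0.67 -> Confidence = 67%
print(lift(a, c))        # 1.11 -> Lift > 1: positive association
```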
ASSOCIATION RULE METRICS
Rule: Beer → Chips
Support = 200/1000 = 0.2 (20% of transactions)
Confidence = 200/500 = 0.4 (40% of beer buyers also buy chips)
Lift = 0.4/0.3 = 1.33 (33% more likely than random)
Interpretation:
- Support: 20% of customers buy both
- Confidence: 40% of beer buyers also buy chips
- Lift > 1: Positive correlation (Beer increases chip purchases)
Practical Applications
Customer Segmentation
Business Problem: Understand different types of customers for targeted marketing
CUSTOMER SEGMENTATION EXAMPLE
Input Features:
├── Demographics: Age, Gender, Location
├── Behavior: Purchase frequency, Average order value
├── Engagement: Website visits, Email opens
└── Preferences: Product categories, Brands
Clustering Results:
├── Cluster 1: "Budget Shoppers" (Price-sensitive, infrequent)
├── Cluster 2: "Premium Customers" (High-value, brand-loyal)
├── Cluster 3: "Digital Natives" (Online-first, tech products)
└── Cluster 4: "Occasional Buyers" (Seasonal, specific needs)
Business Actions:
├── Cluster 1: Discount campaigns, Value bundles
├── Cluster 2: Exclusive products, Premium service
├── Cluster 3: Digital marketing, Latest tech
└── Cluster 4: Seasonal promotions, Reminders
Anomaly Detection
Business Problem: Identify unusual patterns that might indicate fraud, errors, or opportunities
ANOMALY DETECTION PROCESS
Normal Behavior Pattern Discovery:
├── User login times: Usually 9 AM - 5 PM
├── Transaction amounts: Usually $10 - $200
├── Purchase locations: Usually home city
└── Device usage: Usually same device/browser
Anomaly Detection:
├── Login at 3 AM from different country → SUSPICIOUS
├── Transaction of $5000 → REVIEW NEEDED
├── Purchase from unusual location → FLAG
└── New device with high-value purchase → VERIFY
Applications:
├── Credit card fraud detection
├── Network security monitoring
├── Quality control in manufacturing
└── Healthcare monitoring
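One common unsupervised way to automate this kind of screening is an Isolation Forest; the sketch below uses made-up transaction amounts, and the contamination rate is an assumed tuning choice (the original text does not prescribe a specific algorithm):

```python
# Isolation Forest sketch for flagging unusual transaction amounts (illustrative)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# 500 "normal" amounts between $10 and $200, plus one $5000 outlier
amounts = np.concatenate([rng.uniform(10, 200, 500), [5000.0]]).reshape(-1, 1)

# contamination = expected fraction of anomalies (a tuning assumption)
model = IsolationForest(contamination=0.01, random_state=1)
flags = model.fit_predict(amounts)   # -1 = anomaly, 1 = normal

print(np.where(flags == -1)[0])      # indices of transactions flagged for review
```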
Data Exploration and Preprocessing
Use Case: Understanding data structure before supervised learning
EXPLORATORY DATA ANALYSIS
Raw Dataset: Employee Performance
├── 50 features (experience, education, skills, etc.)
├── 10,000 employees
└── Goal: Understand data before prediction
Unsupervised Analysis:
├── PCA: Reduce to 10 main components
├── Clustering: Find 4 employee types
├── Association Rules: Skill combinations
└── Outlier detection: Unusual profiles
Insights Discovered:
├── 3 main factors explain 80% of variance
├── Clear employee archetypes exist
├── Certain skills often go together
└── Some profiles are very rare
Benefits for Supervised Learning:
├── Better feature selection
├── Understanding of data structure
├── Identification of edge cases
└── Improved model design
Evaluation Methods
Clustering Evaluation
Since clustering has no ground truth labels, evaluation is more challenging:
CLUSTERING EVALUATION METRICS
INTERNAL METRICS (No ground truth needed):
Silhouette Score:
├── Measures how similar points are to their cluster vs other clusters
├── Range: -1 to 1 (higher is better)
├── > 0.5 = good clustering
└── < 0.2 = poor clustering
Inertia (Within-Cluster Sum of Squares):
├── Sum of squared distances to cluster centers
├── Lower is better
├── Used in the elbow method
└── Keeps decreasing as k increases, so it cannot choose k on its own
Calinski-Harabasz Index:
├── Ratio of between-cluster to within-cluster variance
├── Higher is better
└── Good for comparing different numbers of clusters
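A small sketch of these internal metrics with scikit-learn; the blob data and the choice of k are illustrative assumptions:

```python
# Internal clustering metrics on synthetic blob data (illustrative)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

print(silhouette_score(X, labels))         # closer to 1 is better
print(kmeans.inertia_)                     # within-cluster sum of squares (lower is better)
print(calinski_harabasz_score(X, labels))  # higher is better
```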
EXTERNAL METRICS (When ground truth available):
Adjusted Rand Index (ARI):
├── Compares clustering to true labels
├── Range: -1 to 1 (1 = perfect match)
└── Adjusted for chance
Normalized Mutual Information (NMI):
├── Information theoretic measure
├── Range: 0 to 1 (1 = perfect match)
└── Less sensitive to cluster size
Dimensionality Reduction Evaluation
DIMENSIONALITY REDUCTION EVALUATION
Explained Variance Ratio:
├── How much variance each component captures
├── Cumulative variance plot
├── Choose components that capture 95% of the variance (see the sketch after this list)
└── Elbow method for the optimal number
Reconstruction Error:
├── How well the reduced data can reconstruct the original
├── Lower error = better preservation
├── Cross-validation recommended
└── Compare with a random projection
Visualization Quality:
├── Do similar points cluster together?
├── Are different classes separated?
├── Does the plot make intuitive sense?
└── Are local neighborhoods preserved?
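A short sketch of choosing the number of components from the cumulative explained variance; the 95% threshold and the synthetic data are assumptions:

```python
# Pick the number of PCA components that reaches 95% cumulative variance
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Synthetic data: 20 noisy features driven by 3 hidden factors
factors = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = factors @ mixing + rng.normal(scale=0.1, size=(500, 20))

pca = PCA().fit(X)                                    # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95) + 1)

print(cumulative[:5])   # cumulative variance of the first few components
print(n_components)     # smallest count reaching the 95% threshold
```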
Choosing the Right Algorithm
ALGORITHM SELECTION GUIDE
CLUSTERING:
├── Known number of clusters → K-Means
├── Hierarchical relationships → Hierarchical Clustering
├── Arbitrary shapes, noise → DBSCAN
├── Large datasets → MiniBatch K-Means
└── Mixed data types → K-Modes
DIMENSIONALITY REDUCTION:
├── Linear relationships → PCA
├── Visualization → t-SNE, UMAP
├── Non-linear relationships → Kernel PCA
├── Sparse data → Truncated SVD
└── Interpretability → Factor Analysis
ASSOCIATION RULES:
├── Market basket → Apriori, FP-Growth
├── Sequential patterns → Sequential pattern mining
├── Large datasets → FP-Growth
└── Real-time → Stream mining algorithms
DATA CHARACTERISTICS:
├── Small dataset (<1K) → Any algorithm
├── Medium dataset (1K-100K) → Most algorithms
├── Large dataset (>100K) → Scalable versions
├── High dimensions → Dimensionality reduction first
└── Mixed data types → Specialized algorithms
Common Challenges and Solutions
Choosing the Number of Clusters
Problem: K-means requires specifying k, but we don't know the natural number of clusters
Solutions:
CLUSTER NUMBER SELECTION METHODS
Elbow Method:
├── Plot inertia vs number of clusters (see the sketch after this list)
├── Look for the "elbow" in the curve
├── Point where improvement slows down
└── Subjective interpretation
Silhouette Analysis:
├── Calculate the silhouette score for different k
├── Choose the k with the highest average silhouette
├── More objective than the elbow method
└── Consider individual cluster silhouettes
Gap Statistic:
├── Compare clustering to random data
├── Find the k where the gap is largest
├── More statistically rigorous
└── Computationally expensive
Domain Knowledge:
├── Business constraints (e.g., 3 customer tiers)
├── Practical limitations (e.g., max 5 marketing segments)
├── Previous research or experience
└── Interpretability requirements
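A compact sketch combining the elbow method and silhouette analysis; the k range and blob data are assumptions:

```python
# Try several k values and compare inertia (elbow) and silhouette scores
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Inertia keeps dropping as k grows; look for the "elbow" where it levels off.
    # Silhouette typically peaks near the natural number of clusters (here k=4).
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
```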
High-Dimensional Data
Problem: Curse of dimensionality affects distance-based algorithms
Solutions:
HIGH-DIMENSIONAL SOLUTIONS
Dimensionality Reduction First:
├── Apply PCA before clustering
├── Use feature selection techniques
├── Remove correlated features
└── Domain-specific feature engineering
Alternative Distance Metrics:
├── Cosine similarity for text data
├── Manhattan distance for high dimensions
├── Correlation-based distances
└── Learned embeddings
Specialized Algorithms:
├── Subspace clustering
├── Projected clustering
├── Density-based methods
└── Spectral clustering
Imbalanced Clusters
Problem: Some clusters are much larger than others
Solutions:
IMBALANCED CLUSTER SOLUTIONS
Algorithm Selection:
├── DBSCAN (handles varying densities)
├── Hierarchical clustering
├── Gaussian Mixture Models
└── Avoid K-means for severe imbalance
Data Preprocessing:
├── Sampling techniques
├── Outlier removal
├── Feature scaling/normalization
└── Distance metric selection
Evaluation Adjustments:
├── Use silhouette analysis
├── Examine individual cluster quality
├── Consider the business importance of small clusters
└── Manual cluster validation
Real-World Project Example
COMPLETE PROJECT: CUSTOMER SEGMENTATION
BUSINESS PROBLEM:
E-commerce company wants to understand customer types for personalized marketing
1. DATA COLLECTION:
├── Customer demographics (age, location, gender)
├── Purchase history (frequency, amount, categories)
├── Website behavior (pages visited, time spent)
├── Engagement (email opens, social media)
└── 50,000 customers, 25 features
2. EXPLORATORY ANALYSIS:
├── PCA: Identify main variance directions
├── Correlation analysis: Remove redundant features
├── Outlier detection: Handle extreme cases
└── Feature scaling: Normalize different units
3. DIMENSIONALITY REDUCTION:
├── PCA: 25 features → 8 components (90% variance)
├── Feature importance: Keep most informative
├── t-SNE: Visualize customer distribution
└── Domain expertise: Validate component meaning
4. CLUSTERING:
├── K-means: Try k=2 to k=10
├── Hierarchical: Understand cluster relationships
├── DBSCAN: Check for noise/outliers
└── Elbow method + silhouette → k=5 optimal
5. CLUSTER INTERPRETATION:
├── Cluster 1: "High-Value Loyalists" (5%, high spend)
├── Cluster 2: "Bargain Hunters" (30%, price-sensitive)
├── Cluster 3: "Occasional Shoppers" (25%, infrequent)
├── Cluster 4: "Digital Natives" (35%, online-first)
└── Cluster 5: "New Customers" (5%, recent signups)
6. BUSINESS ACTIONS:
├── Personalized product recommendations
├── Targeted email campaigns
├── Customized website experience
├── Retention strategies for each segment
└── Pricing strategies per cluster
7. EVALUATION & MONITORING:
├── A/B test different strategies per cluster
├── Monitor cluster stability over time
├── Track business metrics (conversion, retention)
└── Re-cluster quarterly with new data
Key Takeaways
UNSUPERVISED LEARNING MASTERY
WHEN TO USE UNSUPERVISED LEARNING:
├── No labeled data available
├── Want to understand data structure
├── Discover hidden patterns
├── Reduce data complexity
└── Exploratory data analysis
MAIN TECHNIQUES:
├── Clustering: Group similar items
├── Dimensionality Reduction: Simplify data
├── Association Rules: Find relationships
├── Anomaly Detection: Identify outliers
└── Density Estimation: Understand distributions
SUCCESS FACTORS:
├── Domain knowledge for interpretation
├── Proper data preprocessing
├── Multiple algorithm comparison
├── Appropriate evaluation metrics
└── Business context consideration
COMMON PITFALLS:
├── Over-interpreting clusters
├── Ignoring domain expertise
├── Wrong similarity metrics
├── Not validating results
└── Assuming clusters are meaningful
Next Steps:
- Machine Learning Implementation: Build production ML systems
- Advanced Techniques: Modern AI applications
- Vector Databases: Storage for ML applications