Unsupervised Learning
Discovering hidden patterns and structures in data without labeled examples
What is Unsupervised Learning?
Definition: A machine learning approach that finds hidden patterns, structures, or relationships in data without pre-existing labels or target variables.
Simple Analogy: Like an explorer discovering new territories without a map. You examine the landscape (data) to find natural groupings, paths, or interesting features without knowing what you're supposed to find.
UNSUPERVISED LEARNING PROCESS
Input: Raw Data (No Labels) → Algorithm → Discovered Patterns/Structure
Example:
Customer Data (age, income, purchases) → Clustering → Customer Segments
(No predefined segments given)
Types of Unsupervised Learning
UNSUPERVISED LEARNING TYPES
UNSUPERVISED LEARNING (pattern discovery without labels)
├── CLUSTERING: group similar data points
│   ├── K-Means
│   ├── Hierarchical
│   └── DBSCAN
├── DIMENSIONALITY REDUCTION: reduce features, keep information
│   ├── PCA
│   ├── t-SNE
│   └── Factor Analysis
└── ASSOCIATION RULE LEARNING: find item relationships
    └── Market Basket Analysis (A → B)
Clustering
What is Clustering?
Definition: Grouping similar data points together while keeping dissimilar points in different groups.
Goal: Discover natural groupings in data where members of each group are more similar to each other than to members of other groups.
CLUSTERING CONCEPT
Before clustering: a scatter of unlabeled points with no visible grouping.
After clustering: the same points separated into 3 clear clusters.
K-Means Clustering
How it works: Partitions data into k clusters by minimizing distance from points to cluster centers
Steps:
- Choose number of clusters (k)
- Initialize k cluster centers randomly
- Assign each point to nearest center
- Update centers to mean of assigned points
- Repeat until convergence
Pros:
- Simple and fast
- Works well with spherical clusters
- Scales well to large datasets
- Guaranteed to converge
Cons:
- Need to specify k beforehand
- Sensitive to initialization
- Assumes spherical clusters
- Affected by outliers
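A minimal sketch of these steps with scikit-learn; the customer-style data, the feature scaling choice, and k=3 are illustrative assumptions rather than part of the original example:

```python
# Minimal K-Means sketch (illustrative data; assumes scikit-learn is installed)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy "age vs income" customers -- values are made up for illustration
X = np.array([
    [25, 30_000], [27, 32_000], [24, 28_000],    # young, lower income
    [45, 60_000], [47, 58_000], [44, 62_000],    # middle-aged, medium income
    [60, 95_000], [62, 90_000], [65, 100_000],   # older, higher income
])

# Scale features so income does not dominate the distance calculation
X_scaled = StandardScaler().fit_transform(X)

# k must be chosen up front; multiple restarts (n_init) reduce init sensitivity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # final centers (in scaled feature space)
```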
K-MEANS EXAMPLE
Customer Segmentation (age vs income scatter plot):
Three randomly placed initial centers converge to the natural groups:
C1 → (Young, Low Income), C2 → (Middle-aged, Medium Income), C3 → (Older, High Income)
Clusters Found:
1. Young professionals (low income)
2. Middle-aged (medium income)
3. Established professionals (high income)
Hierarchical Clustering
How it works: Creates a tree of clusters by iteratively merging or splitting
Types:
- Agglomerative: Bottom-up (start with individual points, merge)
- Divisive: Top-down (start with all points, split)
Pros:
- No need to specify number of clusters
- Creates hierarchy of clusters
- Deterministic results
- Can handle any distance metric
Cons:
- Computationally expensive, O(n³)
- Sensitive to noise and outliers
- Difficult to handle large datasets
- Hard to undo previous steps
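A short agglomerative sketch with SciPy that mirrors the dendrogram idea shown next; the sample points and cut levels are assumptions:

```python
# Agglomerative (bottom-up) clustering sketch, illustrative data
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Eight toy points forming two loose groups (think A..H in the dendrogram)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1], [8.1, 8.3]])

# 'ward' merges the pair of clusters that minimizes within-cluster variance
Z = linkage(X, method="ward")

# "Cutting" the merge tree at different heights gives different cluster counts
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
print(labels_2)
print(labels_4)
```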
HIERARCHICAL CLUSTERING DENDROGRAM
Eight points (A-H) are merged bottom-up into a tree: pairs first ({A,B}, {C,D}, {E,F}, {G,H}),
then larger groups, until a single root remains.
Cut the tree high → 2 clusters: {A,B,C,D} and {E,F,G,H}
Cut the tree lower → 4 clusters: {A,B}, {C,D}, {E,F}, {G,H}
DBSCAN (Density-Based)
How it works: Groups points that are closely packed while marking outliers
Key Concepts:
- Core points: Have enough neighbors within radius
- Border points: Within radius of core point
- Noise points: Neither core nor border (outliers)
Pros:
- Automatically determines number of clusters
- Can find arbitrarily shaped clusters
- Robust to outliers
- Can identify noise points
Cons:
- Sensitive to hyperparameters
- Struggles with varying densities
- Memory intensive for large datasets
- Difficult with high-dimensional data
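A minimal DBSCAN sketch with scikit-learn; the eps radius, min_samples value, and toy points are assumptions chosen so the result matches the example below:

```python
# DBSCAN sketch: two dense groups plus one isolated point (illustrative values)
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],   # dense group 1
              [5.0, 5.0], [5.1, 5.0], [4.9, 5.1], [5.0, 4.9],   # dense group 2
              [9.0, 1.0]])                                       # isolated point

# eps = neighborhood radius, min_samples = points needed to form a core point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# Cluster ids are assigned automatically; label -1 marks noise/outliers
print(labels)  # e.g. [0 0 0 0 1 1 1 1 -1]
```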
DBSCAN EXAMPLE
Legend: core point = has ≥ 3 neighbors within the radius; border point = within the radius
of a core point; noise point = neither (outlier).
Two dense groups of points become two clusters; one isolated point is marked as noise.
Result: 2 clusters + 1 outlier
Dimensionality Reduction
What is Dimensionality Reduction?
Definition: Reducing the number of features while preserving important information
Why needed:
- Curse of dimensionality: Too many features can hurt performance
- Visualization: Reduce to 2D/3D for plotting
- Storage: Less memory and computation
- Noise reduction: Remove irrelevant features
DIMENSIONALITY REDUCTION CONCEPT
High-dimensional data: Feature 1: Height, Feature 2: Weight, Feature 3: Shoe Size,
Feature 4: Hand Span, Feature 5: Head Circumference, ...
Reduced dimensions: Component 1: "Size", Component 2: "Build" (capture 95% of the variance)
100 features → 2 components (easier to visualize and process)
Principal Component Analysis (PCA)
How it works: Finds directions of maximum variance in data
Steps:
- Standardize the data
- Compute covariance matrix
- Find eigenvectors (principal components)
- Project data onto top components
Pros:
- Reduces overfitting
- Removes correlated features
- Fast and simple
- Linear transformation
Cons:
- Linear combinations only
- Components hard to interpret
- May lose important information
- Sensitive to scaling
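A minimal PCA sketch with scikit-learn; the synthetic correlated features and the choice of two components are illustrative assumptions:

```python
# PCA sketch on synthetic, strongly correlated features (illustrative)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# One hidden "size" factor drives five noisy measurements (200 samples)
size = rng.normal(0.0, 1.0, 200)
X = np.column_stack([size + rng.normal(0.0, 0.1, 200) for _ in range(5)])

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # variance captured by each component
print(X_reduced.shape)                # (200, 2)
```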
PCA EXAMPLE
The original 2D points lie roughly along one diagonal direction; PCA aligns PC1 with that
main direction and PC2 with the perpendicular one.
PC1 captures 90% of the variance, PC2 captures 10%,
so PC1 alone can serve as a 1D representation.
t-SNE (t-Distributed Stochastic Neighbor Embedding)
How it works: Preserves local neighborhood structure for visualization
Use case: Mainly for visualization of high-dimensional data in 2D/3D
Pros:
- Excellent for visualization
- Preserves local structure
- Can reveal hidden patterns
- Works with non-linear relationships
Cons:
- Computationally expensive
- Only for visualization (not feature reduction)
- Non-deterministic results
- Hyperparameter sensitive
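A small t-SNE sketch with scikit-learn; the 50-dimensional synthetic blobs stand in for image features, and the perplexity value is an assumed tuning choice:

```python
# t-SNE sketch: embed 50-dimensional synthetic "image features" into 2D
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three well-separated blobs (e.g. cats / dogs / birds), 30 samples each
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 50)) for c in (0.0, 5.0, 10.0)])

# Perplexity roughly sets the neighborhood size; it must be smaller than n_samples
X_2d = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)

print(X_2d.shape)  # (90, 2) -- ready for a scatter plot; the blobs stay separated
```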
t-SNE VISUALIZATION
High-Dimensional Data → t-SNE → 2D Visualization
Image dataset (cat, dog, and bird images): in the 2D t-SNE plot the cat images form one
cluster, the dog images another, and the bird images a third.
Similar images cluster together in 2D space.
Association Rule Learning
What is Association Rule Learning?
Definition: Finding relationships between different items or events
Common Format: "If A, then B" or A → B
Applications:
- Market basket analysis
- Web usage patterns
- Protein sequences
- Medical diagnosis patterns
MARKET BASKET ANALYSIS
Transaction Data:
Customer 1: {Bread, Milk, Eggs}
Customer 2: {Bread, Butter}
Customer 3: {Milk, Eggs, Butter}
Customer 4: {Bread, Milk, Butter}
Customer 5: {Bread, Eggs}
Association Rules Found:
Bread → Milk (Support: 40%, Confidence: 50%)
Milk → Eggs (Support: 40%, Confidence: 67%)
Key Metrics
Support: How frequently items appear together
- Support(A → B) = P(A and B)
Confidence: How often B appears when A is present
- Confidence(A → B) = P(B|A) = Support(A,B) / Support(A)
Lift: How much more likely B is when A is present
- Lift(A → B) = Confidence(A → B) / Support(B)
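A plain-Python sketch that recomputes these metrics for the five transactions listed above; the helper functions are illustrative, not a standard library API:

```python
# Support / confidence / lift computed directly from the transactions above
transactions = [
    {"Bread", "Milk", "Eggs"},
    {"Bread", "Butter"},
    {"Milk", "Eggs", "Butter"},
    {"Bread", "Milk", "Butter"},
    {"Bread", "Eggs"},
]

def support(items):
    """Fraction of transactions that contain all of the given items."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

a, c = {"Milk"}, {"Eggs"}
print(support(a | c))    # 0.4  -> Support(Milk -> Eggs) = 40%
print(confidence(a, c))  # 0.67 -> Confidence = 67%
print(lift(a, c))        # 1.11 -> Lift > 1: positive association
```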
ASSOCIATION RULE METRICS
Rule: Beer → Chips
Support = 200/1000 = 0.2 (20% of transactions)
Confidence = 200/500 = 0.4 (40% of beer buyers also buy chips)
Lift = 0.4/0.3 = 1.33 (33% more likely than random)
Interpretation:
- Support: 20% of customers buy both
- Confidence: 40% of beer buyers also buy chips
- Lift > 1: Positive correlation (Beer increases chip purchases)
Practical Applications
Customer Segmentation
Business Problem: Understand different types of customers for targeted marketing
CUSTOMER SEGMENTATION EXAMPLE
Input Features:
├── Demographics: Age, Gender, Location
├── Behavior: Purchase frequency, Average order value
├── Engagement: Website visits, Email opens
└── Preferences: Product categories, Brands
Clustering Results:
├── Cluster 1: "Budget Shoppers" (Price-sensitive, infrequent)
├── Cluster 2: "Premium Customers" (High-value, brand-loyal)
├── Cluster 3: "Digital Natives" (Online-first, tech products)
└── Cluster 4: "Occasional Buyers" (Seasonal, specific needs)
Business Actions:
├── Cluster 1: Discount campaigns, Value bundles
├── Cluster 2: Exclusive products, Premium service
├── Cluster 3: Digital marketing, Latest tech
└── Cluster 4: Seasonal promotions, Reminders
Anomaly Detection
Business Problem: Identify unusual patterns that might indicate fraud, errors, or opportunities
ANOMALY DETECTION PROCESS
Normal Behavior Pattern Discovery:
├── User login times: Usually 9 AM - 5 PM
├── Transaction amounts: Usually $10 - $200
├── Purchase locations: Usually home city
└── Device usage: Usually same device/browser
Anomaly Detection:
├── Login at 3 AM from different country → SUSPICIOUS
├── Transaction of $5000 → REVIEW NEEDED
├── Purchase from unusual location → FLAG
└── New device with high-value purchase → VERIFY
Applications:
├── Credit card fraud detection
├── Network security monitoring
├── Quality control in manufacturing
└── Healthcare monitoring
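One common unsupervised way to automate this kind of screening is an Isolation Forest; the sketch below uses made-up transaction amounts, and the contamination rate is an assumed tuning choice (the original text does not prescribe a specific algorithm):

```python
# Isolation Forest sketch for flagging unusual transaction amounts (illustrative)
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# 500 "normal" amounts between $10 and $200, plus one $5000 outlier
amounts = np.concatenate([rng.uniform(10, 200, 500), [5000.0]]).reshape(-1, 1)

# contamination = expected fraction of anomalies (a tuning assumption)
model = IsolationForest(contamination=0.01, random_state=1)
flags = model.fit_predict(amounts)   # -1 = anomaly, 1 = normal

print(np.where(flags == -1)[0])      # indices of transactions flagged for review
```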
Data Exploration and Preprocessing
Use Case: Understanding data structure before supervised learning
EXPLORATORY DATA ANALYSIS
Raw Dataset: Employee Performance
├── 50 features (experience, education, skills, etc.)
├── 10,000 employees
└── Goal: Understand data before prediction
Unsupervised Analysis:
├── PCA: Reduce to 10 main components
├── Clustering: Find 4 employee types
├── Association Rules: Skill combinations
└── Outlier detection: Unusual profiles
Insights Discovered:
├── 3 main factors explain 80% of variance
├── Clear employee archetypes exist
├── Certain skills often go together
└── Some profiles are very rare
Benefits for Supervised Learning:
├── Better feature selection
├── Understanding of data structure
├── Identification of edge cases
└── Improved model design
Evaluation Methods
Clustering Evaluation
Since clustering has no ground truth labels, evaluation is more challenging:
CLUSTERING EVALUATION METRICS
INTERNAL METRICS (No ground truth needed):
Silhouette Score:
├── Measures how similar points are to their cluster vs other clusters
├── Range: -1 to 1 (higher is better)
├── > 0.5 = good clustering
└── < 0.2 = poor clustering
Inertia (Within-Cluster Sum of Squares):
├── Sum of squared distances to cluster centers
├── Lower is better
├── Used in the elbow method
└── Keeps decreasing as k increases, so it cannot choose k on its own
Calinski-Harabasz Index:
├── Ratio of between-cluster to within-cluster variance
├── Higher is better
└── Good for comparing different numbers of clusters
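A small sketch of these internal metrics with scikit-learn; the blob data and the choice of k are illustrative assumptions:

```python
# Internal clustering metrics on synthetic blob data (illustrative)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

print(silhouette_score(X, labels))         # closer to 1 is better
print(kmeans.inertia_)                     # within-cluster sum of squares (lower is better)
print(calinski_harabasz_score(X, labels))  # higher is better
```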
EXTERNAL METRICS (When ground truth available):
Adjusted Rand Index (ARI):
├── Compares clustering to true labels
├── Range: -1 to 1 (1 = perfect match)
└── Adjusted for chance
Normalized Mutual Information (NMI):
├── Information theoretic measure
├── Range: 0 to 1 (1 = perfect match)
└── Less sensitive to cluster size
Dimensionality Reduction Evaluation
DIMENSIONALITY REDUCTION EVALUATION
Explained Variance Ratio:
├── How much variance each component captures
├── Cumulative variance plot
├── Choose components that capture 95% of the variance (see the sketch after this list)
└── Elbow method for the optimal number
Reconstruction Error:
├── How well the reduced data can reconstruct the original
├── Lower error = better preservation
├── Cross-validation recommended
└── Compare with a random projection
Visualization Quality:
├── Do similar points cluster together?
├── Are different classes separated?
├── Does the plot make intuitive sense?
└── Are local neighborhoods preserved?
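A short sketch of choosing the number of components from the cumulative explained variance; the 95% threshold and the synthetic data are assumptions:

```python
# Pick the number of PCA components that reaches 95% cumulative variance
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Synthetic data: 20 noisy features driven by 3 hidden factors
factors = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = factors @ mixing + rng.normal(scale=0.1, size=(500, 20))

pca = PCA().fit(X)                                    # keep all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95) + 1)

print(cumulative[:5])   # cumulative variance of the first few components
print(n_components)     # smallest count reaching the 95% threshold
```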
Choosing the Right Algorithm
ALGORITHM SELECTION GUIDE
CLUSTERING:
├── Known number of clusters → K-Means
├── Hierarchical relationships → Hierarchical Clustering
├── Arbitrary shapes, noise → DBSCAN
├── Large datasets → MiniBatch K-Means
└── Mixed data types → K-Modes
DIMENSIONALITY REDUCTION:
├── Linear relationships → PCA
├── Visualization → t-SNE, UMAP
├── Non-linear relationships → Kernel PCA
├── Sparse data → Truncated SVD
└── Interpretability → Factor Analysis
ASSOCIATION RULES:
├── Market basket → Apriori, FP-Growth
├── Sequential patterns → Sequential pattern mining
├── Large datasets → FP-Growth
└── Real-time → Stream mining algorithms
DATA CHARACTERISTICS:
├── Small dataset (<1K) → Any algorithm
├── Medium dataset (1K-100K) → Most algorithms
├── Large dataset (>100K) → Scalable versions
├── High dimensions → Dimensionality reduction first
└── Mixed data types → Specialized algorithms
Common Challenges and Solutions
Choosing the Number of Clusters
Problem: K-means requires specifying k, but we don't know the natural number of clusters
Solutions:
CLUSTER NUMBER SELECTION METHODS
Elbow Method:
├── Plot inertia vs number of clusters (see the sketch after this list)
├── Look for the "elbow" in the curve
├── Point where improvement slows down
└── Subjective interpretation
Silhouette Analysis:
├── Calculate the silhouette score for different k
├── Choose the k with the highest average silhouette
├── More objective than the elbow method
└── Consider individual cluster silhouettes
Gap Statistic:
├── Compare clustering to random data
├── Find the k where the gap is largest
├── More statistically rigorous
└── Computationally expensive
Domain Knowledge:
├── Business constraints (e.g., 3 customer tiers)
├── Practical limitations (e.g., max 5 marketing segments)
├── Previous research or experience
└── Interpretability requirements
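A compact sketch combining the elbow method and silhouette analysis; the k range and blob data are assumptions:

```python
# Try several k values and compare inertia (elbow) and silhouette scores
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Inertia keeps dropping as k grows; look for the "elbow" where it levels off.
    # Silhouette typically peaks near the natural number of clusters (here k=4).
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
```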
High-Dimensional Data
Problem: Curse of dimensionality affects distance-based algorithms
Solutions:
HIGH-DIMENSIONAL SOLUTIONS
Dimensionality Reduction First:
├── Apply PCA before clustering
├── Use feature selection techniques
├── Remove correlated features
└── Domain-specific feature engineering
Alternative Distance Metrics:
├── Cosine similarity for text data
├── Manhattan distance for high dimensions
├── Correlation-based distances
└── Learned embeddings
Specialized Algorithms:
├── Subspace clustering
├── Projected clustering
├── Density-based methods
└── Spectral clustering
Imbalanced Clusters
Problem: Some clusters are much larger than others
Solutions:
IMBALANCED CLUSTER SOLUTIONS
Algorithm Selection:
├── DBSCAN (handles varying densities)
├── Hierarchical clustering
├── Gaussian Mixture Models
└── Avoid K-means for severe imbalance
Data Preprocessing:
├── Sampling techniques
├── Outlier removal
├── Feature scaling/normalization
└── Distance metric selection
Evaluation Adjustments:
├── Use silhouette analysis
├── Examine individual cluster quality
├── Consider the business importance of small clusters
└── Manual cluster validation
Real-World Project Example
COMPLETE PROJECT: CUSTOMER SEGMENTATION
BUSINESS PROBLEM:
E-commerce company wants to understand customer types for personalized marketing
1. DATA COLLECTION:
├── Customer demographics (age, location, gender)
├── Purchase history (frequency, amount, categories)
├── Website behavior (pages visited, time spent)
├── Engagement (email opens, social media)
└── 50,000 customers, 25 features
2. EXPLORATORY ANALYSIS:
├── PCA: Identify main variance directions
├── Correlation analysis: Remove redundant features
├── Outlier detection: Handle extreme cases
└── Feature scaling: Normalize different units
3. DIMENSIONALITY REDUCTION:
├── PCA: 25 features → 8 components (90% variance)
├── Feature importance: Keep most informative
├── t-SNE: Visualize customer distribution
└── Domain expertise: Validate component meaning
4. CLUSTERING:
├── K-means: Try k=2 to k=10
├── Hierarchical: Understand cluster relationships
├── DBSCAN: Check for noise/outliers
└── Elbow method + silhouette → k=5 optimal
5. CLUSTER INTERPRETATION:
├── Cluster 1: "High-Value Loyalists" (5%, high spend)
├── Cluster 2: "Bargain Hunters" (30%, price-sensitive)
├── Cluster 3: "Occasional Shoppers" (25%, infrequent)
├── Cluster 4: "Digital Natives" (35%, online-first)
└── Cluster 5: "New Customers" (5%, recent signups)
6. BUSINESS ACTIONS:
├── Personalized product recommendations
├── Targeted email campaigns
├── Customized website experience
├── Retention strategies for each segment
└── Pricing strategies per cluster
7. EVALUATION & MONITORING:
├── A/B test different strategies per cluster
├── Monitor cluster stability over time
├── Track business metrics (conversion, retention)
└── Re-cluster quarterly with new data
Key Takeaways
UNSUPERVISED LEARNING MASTERY
WHEN TO USE UNSUPERVISED LEARNING:
├── No labeled data available
├── Want to understand data structure
├── Discover hidden patterns
├── Reduce data complexity
└── Exploratory data analysis
MAIN TECHNIQUES:
├── Clustering: Group similar items
├── Dimensionality Reduction: Simplify data
├── Association Rules: Find relationships
├── Anomaly Detection: Identify outliers
└── Density Estimation: Understand distributions
SUCCESS FACTORS:
├── Domain knowledge for interpretation
├── Proper data preprocessing
├── Multiple algorithm comparison
├── Appropriate evaluation metrics
└── Business context consideration
COMMON PITFALLS:
├── Over-interpreting clusters
├── Ignoring domain expertise
├── Wrong similarity metrics
├── Not validating results
└── Assuming clusters are meaningful
Next Steps:
- Machine Learning Implementation: Build production ML systems
- Advanced Techniques: Modern AI applications
- Vector Databases: Storage for ML applications