Hugging Face - The AI Democratization Platform
Learn how to use Hugging Face to access, deploy, and fine-tune state-of-the-art AI models with ease
What is Hugging Face?
Hugging Face is the leading AI company democratizing access to machine learning, making advanced AI models accessible to everyone from researchers to developers to students.
Simple Analogy: Think of Hugging Face as the "GitHub for AI models" - a platform where you can discover, share, and collaborate on AI models, datasets, and applications.
The Hugging Face Ecosystem
HUGGING FACE ECOSYSTEM OVERVIEW
(Everything You Need for AI Development)

Hugging Face Hub - "The Central Repository"

What's in the Hub:
- Models: 500k+ pre-trained models
- Datasets: 100k+ datasets
- Spaces: 50k+ hosted apps
- Docs: model cards and guides

Core libraries:
- Transformers: pre-trained models, easy fine-tuning, multi-framework (PyTorch, TensorFlow, JAX)
- Datasets: data loading and processing for Hub datasets
- Tokenizers: fast, Rust-powered, memory-efficient tokenization
- Accelerate: distributed training with GPU/TPU optimization

Development flow:
1. Load - pull models and datasets into your local dev environment
2. Train - fine-tune and adapt with training pipelines
3. Test - evaluate results with metrics and analysis
4. Deploy - ship apps and APIs to cloud services

Essential Hugging Face Libraries
Let's explore the core libraries that make Hugging Face so powerful, step by step:
Transformers Library - The Foundation
TRANSFORMERS LIBRARY GUIDE
(Your Gateway to AI Models)

What is Transformers?
- 500,000+ pre-trained models for NLP, computer vision, and audio
- Complex AI tasks in as little as three lines of code
- Supports PyTorch, TensorFlow, and JAX
- Unified API across different model architectures

The pipeline approach ("AI made simple"): pick a task ("What do I want to do?") and, optionally, a model ("Which model is best?") - the pipeline handles the rest.

from transformers import pipeline

# Sentiment Analysis (3 lines!)
classifier = pipeline("sentiment-analysis")
result = classifier("Hugging Face is awesome!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text Generation
generator = pipeline("text-generation", model="gpt2")
story = generator("Once upon a time", max_length=50)

# Question Answering
qa = pipeline("question-answering")
answer = qa(question="What is AI?", context="AI is...")

Available pipelines:
- Text tasks: sentiment-analysis, text-generation, question-answering, summarization, translation, text-classification
- Vision tasks: image-classification, object-detection, image-segmentation, image-to-text, text-to-image, depth-estimation
- Audio tasks: automatic-speech-recognition, text-to-speech, audio-classification
- Multimodal tasks: visual-question-answering, document-question-answering, feature-extraction

Datasets Library - Data Made Easy
DATASETS LIBRARY GUIDE
(Handling Data at Scale)

The data challenge - traditional problems:
- Loading large datasets crashes your RAM
- Different data formats require different code
- Preprocessing is slow and memory-intensive
- Finding quality datasets takes hours

The Hugging Face solution:

from datasets import load_dataset

# Load 100k+ datasets with one line
dataset = load_dataset("imdb")

Features:
- Memory mapping (no RAM crashes)
- Arrow backend (super fast)
- Automatic caching (faster reload)
- Built-in preprocessing (map, filter, shuffle)
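To make those preprocessing features concrete, here is a minimal sketch that loads a dataset and chains `map`, `filter`, and `shuffle`. The derived `n_chars` column and the 200-character threshold are illustrative choices, not part of the library.

```python
from datasets import load_dataset

# Load the IMDB reviews dataset from the Hub (cached after the first download)
dataset = load_dataset("imdb")

# map: add a derived column without loading everything into RAM
dataset = dataset.map(lambda example: {"n_chars": len(example["text"])})

# filter: keep only reviews longer than 200 characters (arbitrary threshold)
long_reviews = dataset["train"].filter(lambda example: example["n_chars"] > 200)

# shuffle + select: take a small random sample for quick experiments
sample = long_reviews.shuffle(seed=42).select(range(100))
print(sample[0]["text"][:100])
```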
Popular dataset examples:
- Text: imdb (movie reviews), squad (question answering), glue (language understanding), cnn_dailymail (summarization)
- Vision: imagenet (image classification), coco (object detection), mnist (handwritten digits), cifar10 (small images)
- Audio: common_voice (speech), librispeech (speech recognition), gtzan (music genre), speech_commands (commands)
- Multilingual: oscar (web crawl), cc100 (Common Crawl), xnli (cross-lingual), wmt (translation)

Advanced Libraries
ADVANCED HUGGING FACE LIBRARIES
(For Power Users and Researchers)

Tokenizers library - "Lightning-Fast Text Processing"

Why tokenizers matter:

Text: "Hello world!"
  → TOKENIZATION (splitting into pieces)
Tokens: ["Hello", " world", "!"] → [15496, 995, 33]
  → MODEL PROCESSING
AI understands numbers, not text!

Rust-powered speed:
- 10x faster than Python tokenizers
- Memory efficient for large texts
- Parallel processing support
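As a rough illustration of the standalone `tokenizers` library, here is a minimal sketch that trains a tiny byte-pair-encoding tokenizer from an in-memory corpus; the two-sentence corpus, vocabulary size, and special tokens are made up for the example.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A toy corpus just for illustration -- real training uses large text files
corpus = [
    "Hugging Face makes machine learning accessible.",
    "Tokenizers turn raw text into integer IDs for models.",
]

# Build a BPE tokenizer with whitespace pre-tokenization
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on the corpus with a small vocabulary and a few special tokens
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

# Encode a sentence and inspect the resulting tokens and IDs
encoding = tokenizer.encode("Machine learning for everyone")
print(encoding.tokens)
print(encoding.ids)
```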
## Deep Dive: Pre-trained Tokenizers and Special Tokens
Understanding tokenizers is crucial for working with transformer models. Let's explore how they work and the special tokens they use:
### What Are Pre-trained Tokenizers?
```text
TOKENIZER FUNDAMENTALS
(From Raw Text to Model Input)

The tokenization process:
  Raw Text → Preprocessing → Subword Splitting → Token IDs
  "Hello world!" → normalize → ["Hello", " world", "!"] → [15496, 995, 33]

Tokenizer types:
- WORD-LEVEL: split by spaces/punctuation
  * Simple but large vocabulary
  * Out-of-vocabulary (OOV) problems
- CHARACTER-LEVEL: each character is a token
  * No OOV issues but very long sequences
  * Hard to learn meaningful representations
- SUBWORD-LEVEL: best of both worlds
  * Byte-Pair Encoding (BPE) - GPT family
  * WordPiece - BERT family
  * SentencePiece - T5, ALBERT
  * Unigram - XLNet, ALBERT
```

Special Tokens: The Hidden Language of AI
Special tokens are like punctuation marks that help AI models understand the structure and meaning of text:
from transformers import AutoTokenizer
import pandas as pd
# Let's explore different tokenizers and their special tokens
tokenizers = {
"BERT": "bert-base-uncased",
"GPT-2": "gpt2",
"RoBERTa": "roberta-base",
"DistilBERT": "distilbert-base-uncased",
"T5": "t5-small"
}
print("π·οΈ Special Tokens Across Different Models:")
print("=" * 70)
for model_name, model_id in tokenizers.items():
print(f"\nπ€ {model_name} ({model_id}):")
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Get special tokens
special_tokens = {
"Padding": tokenizer.pad_token,
"Unknown": tokenizer.unk_token,
"Beginning of Sequence": tokenizer.bos_token,
"End of Sequence": tokenizer.eos_token,
"Classification": tokenizer.cls_token,
"Separator": tokenizer.sep_token,
"Mask": tokenizer.mask_token
}
for token_type, token in special_tokens.items():
if token is not None:
token_id = tokenizer.convert_tokens_to_ids(token)
print(f" {token_type:20}: '{token}' (ID: {token_id})")
else:
print(f" {token_type:20}: Not used")
print(f" Vocabulary size: {tokenizer.vocab_size:,}")Expected Output:
π·οΈ Special Tokens Across Different Models:
======================================================================
π€ BERT (bert-base-uncased):
Padding : '[PAD]' (ID: 0)
Unknown : '[UNK]' (ID: 100)
Beginning of Sequence: Not used
End of Sequence : Not used
Classification : '[CLS]' (ID: 101)
Separator : '[SEP]' (ID: 102)
Mask : '[MASK]' (ID: 103)
Vocabulary size: 30,522
π€ GPT-2 (gpt2):
Padding : Not used
Unknown : '<|endoftext|>' (ID: 50256)
Beginning of Sequence: '<|endoftext|>' (ID: 50256)
End of Sequence : '<|endoftext|>' (ID: 50256)
Classification : Not used
Separator : Not used
Mask : Not used
Vocabulary size: 50,257
π€ RoBERTa (roberta-base):
Padding : '<pad>' (ID: 1)
Unknown : '<unk>' (ID: 3)
Beginning of Sequence: '<s>' (ID: 0)
End of Sequence : '</s>' (ID: 2)
Classification : '<s>' (ID: 0)
Separator : '</s>' (ID: 2)
Mask : '<mask>' (ID: 50264)
Vocabulary size: 50,265

Special Token Deep Dive with Examples
Let's explore each special token with practical examples:
# Using BERT tokenizer for detailed examples
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def demonstrate_special_tokens():
"""Comprehensive demonstration of special tokens usage"""
print("π SPECIAL TOKENS IN ACTION")
print("=" * 60)
# 1. [CLS] - Classification Token
print("\n1. π― [CLS] - Classification Token:")
print(" Purpose: Represents the entire sequence for classification tasks")
print(" Position: Always at the beginning of input")
text = "This movie is fantastic!"
tokens = bert_tokenizer.tokenize(text)
input_ids = bert_tokenizer.encode(text)
print(f" Original text: '{text}'")
print(f" Tokens: {tokens}")
print(f" With special tokens: {bert_tokenizer.convert_ids_to_tokens(input_ids)}")
print(f" Token IDs: {input_ids}")
print(" Note: [CLS] at position 0, used for sentence-level predictions")
# 2. [SEP] - Separator Token
print("\n2. βοΈ [SEP] - Separator Token:")
print(" Purpose: Separates different segments/sentences")
print(" Position: Between sentences and at the end")
sentence_a = "What is machine learning?"
sentence_b = "It's a subset of artificial intelligence."
# Encode sentence pair
encoded = bert_tokenizer.encode_plus(
sentence_a, sentence_b,
add_special_tokens=True,
return_tensors='pt'
)
tokens_with_sep = bert_tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print(f" Sentence A: '{sentence_a}'")
print(f" Sentence B: '{sentence_b}'")
print(f" Combined tokens: {tokens_with_sep}")
print(" Structure: [CLS] Sentence_A [SEP] Sentence_B [SEP]")
# 3. [MASK] - Masked Language Modeling
print("\n3. π [MASK] - Mask Token:")
print(" Purpose: Hide words for the model to predict (MLM training)")
print(" Usage: BERT's pre-training and fill-mask pipeline")
masked_text = "The capital of France is [MASK]."
masked_tokens = bert_tokenizer.tokenize(masked_text)
print(f" Masked text: '{masked_text}'")
print(f" Tokens: {masked_tokens}")
# Demonstrate with fill-mask pipeline
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask(masked_text)
print(" Top predictions:")
for i, pred in enumerate(predictions[:3]):
print(f" {i+1}. {pred['token_str']} (confidence: {pred['score']:.3f})")
# 4. [PAD] - Padding Token
print("\n4. π [PAD] - Padding Token:")
print(" Purpose: Make sequences the same length for batch processing")
print(" Position: Added at the end to reach target length")
texts = [
"Short text.",
"This is a much longer text that needs to be processed.",
"Medium length text here."
]
# Tokenize with padding
encoded_batch = bert_tokenizer(
texts,
padding=True,
truncation=True,
return_tensors='pt'
)
print(" Example batch:")
for i, text in enumerate(texts):
tokens = bert_tokenizer.convert_ids_to_tokens(encoded_batch['input_ids'][i])
pad_count = tokens.count('[PAD]')
print(f" Text {i+1}: '{text}'")
print(f" Tokens: {tokens[:10]}{'...' if len(tokens) > 10 else ''}")
print(f" Padding tokens: {pad_count}")
# 5. [UNK] - Unknown Token
print("\n5. β [UNK] - Unknown Token:")
print(" Purpose: Represents out-of-vocabulary words")
print(" Usage: When encountering words not in training vocabulary")
# Create text with potential unknown words
text_with_rare = "The pneumonoultramicroscopicsilicovolcanoconiosisologist studied linguistics."
tokens = bert_tokenizer.tokenize(text_with_rare)
print(f" Text: '{text_with_rare}'")
print(f" Tokens: {tokens}")
unk_count = tokens.count('[UNK]')
if unk_count > 0:
print(f" Unknown tokens found: {unk_count}")
else:
print(" All words recognized (BERT's subword tokenization is powerful!)")
# Run the demonstration
demonstrate_special_tokens()
Expected Output:
π SPECIAL TOKENS IN ACTION
============================================================
1. π― [CLS] - Classification Token:
Purpose: Represents the entire sequence for classification tasks
Position: Always at the beginning of input
Original text: 'This movie is fantastic!'
Tokens: ['this', 'movie', 'is', 'fantastic', '!']
With special tokens: ['[CLS]', 'this', 'movie', 'is', 'fantastic', '!', '[SEP]']
Token IDs: [101, 2023, 3185, 2003, 10392, 999, 102]
Note: [CLS] at position 0, used for sentence-level predictions
2. βοΈ [SEP] - Separator Token:
Purpose: Separates different segments/sentences
Position: Between sentences and at the end
Sentence A: 'What is machine learning?'
Sentence B: 'It's a subset of artificial intelligence.'
Combined tokens: ['[CLS]', 'what', 'is', 'machine', 'learning', '?', '[SEP]', 'it', "'", 's', 'a', 'subset', 'of', 'artificial', 'intelligence', '.', '[SEP]']
Structure: [CLS] Sentence_A [SEP] Sentence_B [SEP]
3. π [MASK] - Mask Token:
Purpose: Hide words for the model to predict (MLM training)
Usage: BERT's pre-training and fill-mask pipeline
Masked text: 'The capital of France is [MASK].'
Tokens: ['the', 'capital', 'of', 'france', 'is', '[MASK]', '.']
Top predictions:
1. paris (confidence: 0.999)
2. lyon (confidence: 0.001)
3. nice (confidence: 0.000)
4. π [PAD] - Padding Token:
Purpose: Make sequences the same length for batch processing
Position: Added at the end to reach target length
Example batch:
Text 1: 'Short text.'
Tokens: ['[CLS]', 'short', 'text', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']...
Padding tokens: 8
Text 2: 'This is a much longer text that needs to be processed.'
Tokens: ['[CLS]', 'this', 'is', 'a', 'much', 'longer', 'text', 'that', 'needs', 'to']...
Padding tokens: 0
Text 3: 'Medium length text here.'
Tokens: ['[CLS]', 'medium', 'length', 'text', 'here', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]']...
Padding tokens: 6
5. β [UNK] - Unknown Token:
Purpose: Represents out-of-vocabulary words
Usage: When encountering words not in training vocabulary
Text: 'The pneumonoultramicroscopicsilicovolcanoconiosisologist studied linguistics.'
Tokens: ['the', 'p', '##ne', '##um', '##ono', '##ult', '##ram', '##ic', '##ros', '##cop', '##ics', '##ili', '##co', '##vol', '##can', '##oc', '##oni', '##osis', '##ologist', 'studied', 'linguistics', '.']
All words recognized (BERT's subword tokenization is powerful!)

Working with Tokenizers: Advanced Techniques
# Advanced tokenizer usage patterns
def advanced_tokenizer_techniques():
"""Advanced patterns for working with tokenizers"""
print("π οΈ ADVANCED TOKENIZER TECHNIQUES")
print("=" * 50)
# 1. Custom vocabulary and special tokens
print("\n1. π― Adding Custom Special Tokens:")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Add domain-specific special tokens
new_tokens = ["[PERSON]", "[LOCATION]", "[ORGANIZATION]", "[DATE]"]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})
print(f" Original vocab size: 30,522")
print(f" New vocab size: {len(tokenizer)}")
print(f" Added tokens: {new_tokens}")
# Test with custom tokens
text_with_entities = "John Smith [PERSON] works at Google [ORGANIZATION] in California [LOCATION]."
tokens = tokenizer.tokenize(text_with_entities)
print(f" Text: '{text_with_entities}'")
print(f" Tokens: {tokens}")
# 2. Attention masks and token type IDs
print("\n2. π Attention Masks and Token Types:")
sentence_a = "What is artificial intelligence?"
sentence_b = "AI is machine learning and deep learning."
encoded = tokenizer.encode_plus(
sentence_a, sentence_b,
add_special_tokens=True,
max_length=20,
padding='max_length',
truncation=True,
return_attention_mask=True,
return_token_type_ids=True,
return_tensors='pt'
)
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
print(" Encoded components:")
print(f" Tokens: {tokens}")
print(f" Input IDs: {encoded['input_ids'][0].tolist()}")
print(f" Attention Mask: {encoded['attention_mask'][0].tolist()}")
print(f" Token Type IDs: {encoded['token_type_ids'][0].tolist()}")
print("\n Explanation:")
print(" β’ Attention Mask: 1 = real token, 0 = padding")
print(" β’ Token Type IDs: 0 = sentence A, 1 = sentence B")
# 3. Subword tokenization analysis
print("\n3. π€ Subword Tokenization Analysis:")
test_words = [
"running", # Simple word
"unhappiness", # Prefix + root + suffix
"anti-inflammatory", # Compound with hyphen
"COVID-19", # Acronym with number
"transformer" # Technical term
]
print(" Word breakdown analysis:")
for word in test_words:
tokens = tokenizer.tokenize(word)
print(f" '{word}' β {tokens}")
# Analyze subword patterns
has_continuation = any(token.startswith('##') for token in tokens)
if has_continuation:
root = tokens[0]
continuations = [t[2:] for t in tokens[1:] if t.startswith('##')]
print(f" Root: '{root}', Continuations: {continuations}")
# 4. Fast vs Slow tokenizers
print("\n4. β‘ Fast vs Slow Tokenizers:")
# Compare tokenization speed
import time
text = "This is a test sentence for measuring tokenization speed. " * 100
# Slow tokenizer (Python-based)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
start_time = time.time()
for _ in range(10):
_ = slow_tokenizer.encode(text)
slow_time = time.time() - start_time
# Fast tokenizer (Rust-based)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
start_time = time.time()
for _ in range(10):
_ = fast_tokenizer.encode(text)
fast_time = time.time() - start_time
print(f" Slow tokenizer time: {slow_time:.4f}s")
print(f" Fast tokenizer time: {fast_time:.4f}s")
print(f" Speedup: {slow_time/fast_time:.1f}x faster")
# 5. Tokenizer alignment and offsets
print("\n5. π Character-to-Token Alignment:")
text = "Hello, world! How are you today?"
encoding = fast_tokenizer.encode_plus(
text,
return_offsets_mapping=True,
add_special_tokens=True
)
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
offsets = encoding['offset_mapping']
print(f" Original text: '{text}'")
print(" Token alignment:")
for i, (token, (start, end)) in enumerate(zip(tokens, offsets)):
if start == 0 and end == 0: # Special tokens
print(f" {i:2d}: '{token}' β Special token")
else:
char_span = text[start:end]
print(f" {i:2d}: '{token}' β '{char_span}' (chars {start}-{end})")
# Run advanced techniques demonstration
advanced_tokenizer_techniques()
Expected Output:
π οΈ ADVANCED TOKENIZER TECHNIQUES
==================================================
1. π― Adding Custom Special Tokens:
Original vocab size: 30,522
New vocab size: 30,526
Added tokens: ['[PERSON]', '[LOCATION]', '[ORGANIZATION]', '[DATE]']
Text: 'John Smith [PERSON] works at Google [ORGANIZATION] in California [LOCATION].'
Tokens: ['john', 'smith', '[PERSON]', 'works', 'at', 'google', '[ORGANIZATION]', 'in', 'california', '[LOCATION]', '.']
2. π Attention Masks and Token Types:
Encoded components:
Tokens: ['[CLS]', 'what', 'is', 'artificial', 'intelligence', '?', '[SEP]', 'ai', 'is', 'machine', 'learning', 'and', 'deep', 'learning', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Input IDs: [101, 2054, 2003, 7976, 4454, 1029, 102, 9932, 2003, 3698, 4083, 1998, 2784, 4083, 1012, 102, 0, 0, 0, 0]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
Explanation:
β’ Attention Mask: 1 = real token, 0 = padding
β’ Token Type IDs: 0 = sentence A, 1 = sentence B
3. π€ Subword Tokenization Analysis:
Word breakdown analysis:
'running' → ['running']
'unhappiness' → ['un', '##hap', '##piness']
Root: 'un', Continuations: ['hap', 'piness']
'anti-inflammatory' → ['anti', '-', 'inflammatory']
'COVID-19' → ['co', '##vid', '-', '19']
Root: 'co', Continuations: ['vid']
'transformer' → ['transformer']
4. β‘ Fast vs Slow Tokenizers:
Slow tokenizer time: 0.1234s
Fast tokenizer time: 0.0123s
Speedup: 10.0x faster
5. π Character-to-Token Alignment:
Original text: 'Hello, world! How are you today?'
Token alignment:
0: '[CLS]' → Special token
1: 'hello' → 'Hello' (chars 0-5)
2: ',' → ',' (chars 5-6)
3: 'world' → 'world' (chars 7-12)
4: '!' → '!' (chars 12-13)
5: 'how' → 'How' (chars 14-17)
6: 'are' → 'are' (chars 18-21)
7: 'you' → 'you' (chars 22-25)
8: 'today' → 'today' (chars 26-31)
9: '?' → '?' (chars 31-32)
10: '[SEP]' → Special token

Tokenizer Best Practices and Common Pitfalls
def tokenizer_best_practices():
"""Best practices and common mistakes with tokenizers"""
print("π TOKENIZER BEST PRACTICES")
print("=" * 40)
# DO: Always use the same tokenizer for training and inference
print("\nDO: Consistent Tokenizer Usage")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Save tokenizer configuration
tokenizer.save_pretrained("./my_model_tokenizer")
print(" β Save tokenizer with model")
print(" β Use same tokenizer for training and inference")
print(" β Version control tokenizer configs")
# DON'T: Mix tokenizers from different models
print("\nDON'T: Mix Different Tokenizers")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Machine learning is fascinating!"
bert_tokens = bert_tokenizer.tokenize(text)
gpt2_tokens = gpt2_tokenizer.tokenize(text)
print(f" BERT tokens: {bert_tokens}")
print(f" GPT-2 tokens: {gpt2_tokens}")
print(" β οΈ Different tokenization β model confusion!")
# DO: Handle long sequences properly
print("\nDO: Handle Long Sequences")
long_text = "This is a very long document. " * 100
# Proper truncation
encoded = tokenizer.encode_plus(
long_text,
max_length=512,
truncation=True,
padding=True,
return_tensors='pt'
)
print(f" Original length: ~{len(long_text.split())} words")
print(f" Truncated to: {encoded['input_ids'].shape[1]} tokens")
print(" β Always specify max_length and truncation")
# DON'T: Ignore special token placement
print("\nDON'T: Ignore Special Token Placement")
# Wrong way - manual concatenation
text1 = "Question: What is AI?"
text2 = "Answer: Artificial Intelligence"
wrong_manual = f"{text1} {text2}"
wrong_tokens = tokenizer.tokenize(wrong_manual)
# Right way - proper encoding
right_encoded = tokenizer.encode_plus(text1, text2, add_special_tokens=True)
right_tokens = tokenizer.convert_ids_to_tokens(right_encoded['input_ids'])
print(f" Wrong approach: {wrong_tokens}")
print(f" Right approach: {right_tokens}")
print(" β Use encode_plus() for sentence pairs")
# DO: Monitor tokenization statistics
print("\nDO: Monitor Tokenization Statistics")
texts = [
"Short text.",
"Medium length text with some complexity.",
"Very long text with lots of words and complex terminology that might cause truncation issues."
]
stats = {"lengths": [], "truncated": 0, "avg_length": 0}
for text in texts:
tokens = tokenizer.encode(text, add_special_tokens=True)
stats["lengths"].append(len(tokens))
if len(tokens) >= 512: # Common max length
stats["truncated"] += 1
stats["avg_length"] = sum(stats["lengths"]) / len(stats["lengths"])
print(f" Token lengths: {stats['lengths']}")
print(f" Average length: {stats['avg_length']:.1f}")
print(f" Truncated sequences: {stats['truncated']}")
print(" β Monitor to optimize model performance")
# Run best practices demonstration
tokenizer_best_practices()
Expected Output:
π TOKENIZER BEST PRACTICES
========================================
DO: Consistent Tokenizer Usage
- Save tokenizer with model
- Use same tokenizer for training and inference
- Version control tokenizer configs
DON'T: Mix Different Tokenizers
BERT tokens: ['machine', 'learning', 'is', 'fascinating', '!']
GPT-2 tokens: ['Machine', 'Ġlearning', 'Ġis', 'Ġfascinating', '!']
Warning: different tokenization → model confusion!
DO: Handle Long Sequences
Original length: ~600 words
Truncated to: 512 tokens
- Always specify max_length and truncation
DON'T: Ignore Special Token Placement
Wrong approach: ['question', ':', 'what', 'is', 'ai', '?', 'answer', ':', 'artificial', 'intelligence']
Right approach: ['[CLS]', 'question', ':', 'what', 'is', 'ai', '?', '[SEP]', 'answer', ':', 'artificial', 'intelligence', '[SEP]']
- Use encode_plus() for sentence pairs
DO: Monitor Tokenization Statistics
Token lengths: [5, 10, 18]
Average length: 11.0
Truncated sequences: 0
- Monitor to optimize model performance

ACCELERATE LIBRARY - "Distributed Training Made Simple"

The scaling problem:

Without Accelerate:
- Complex multi-GPU setup
- Platform-specific code
- Memory management headaches
- Hours of configuration

With Accelerate:

# Add just 4 lines to your training code!
from accelerate import Accelerator
accelerator = Accelerator()
model = accelerator.prepare(model)
# Works on CPU, GPU, multi-GPU, TPU automatically!
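Below is a minimal sketch of how those lines slot into an ordinary PyTorch training loop; the tiny linear model, random data, and optimizer are placeholders rather than a real training setup.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model, data, and optimizer just to show where Accelerate plugs in
model = torch.nn.Linear(10, 2)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accelerator = Accelerator()  # detects CPU / GPU / multi-GPU / TPU automatically
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```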
PEFT LIBRARY - "Parameter-Efficient Fine-Tuning"

The cost problem:

Traditional fine-tuning:
- Update ALL 175B parameters (GPT-3 size)
- Requires 700GB+ memory
- Costs $1000s in GPU time
- Slow training (days/weeks)

PEFT (LoRA, adapters, etc.):
- Update only ~0.1% of parameters
- Requires ~8GB memory
- Costs $10s in GPU time
- Fast training (hours)
- Often matches full fine-tuning performance
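As a rough sketch of what this looks like in practice, here is LoRA applied with the `peft` library to a sequence-classification model. The rank, alpha, and dropout values are illustrative defaults, and for architectures `peft` does not recognize you may need to pass `target_modules` explicitly.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model, TaskType

# Start from a regular pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Wrap it with LoRA adapters -- only the small adapter matrices get trained
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification
    r=8,                         # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.1,
)
peft_model = get_peft_model(model, lora_config)

# Shows how few parameters are actually trainable (typically well under 1%)
peft_model.print_trainable_parameters()
```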
## Hugging Face Hub - The AI Repository
Understanding the Hub is crucial for leveraging the full power of Hugging Face:
```text
HUGGING FACE HUB ECOSYSTEM
(Your AI Model Marketplace)

What can you find here?
- 500k+ MODELS
- 100k+ DATASETS
- 50k+ SPACES
- Rich DOCS

MODELS SECTION:
- Trending models: gpt2 (text generation), bert-base-uncased (text understanding),
  whisper-large (speech recognition), stable-diffusion (image generation),
  clip-vit-base (vision-language)
- Organized by:
  * Task (sentiment-analysis, translation, etc.)
  * Framework (PyTorch, TensorFlow, JAX)
  * Language (English, Chinese, multilingual)
  * License (Apache 2.0, MIT, custom)
  * Popularity (downloads, likes, recency)

DATASETS SECTION:
- Categories: text (news, reviews, conversations), vision (photos, medical scans,
  satellites), audio (speech, music, sound effects), tabular (CSV, financial,
  scientific), multimodal (text+image, video+audio)
- Features:
  * Preview data without downloading
  * Automatic train/test splits
  * Data cards with ethical considerations
  * Easy integration with training scripts

SPACES SECTION:
- What are Spaces?
  * Interactive web apps powered by AI models
  * Built with Gradio or Streamlit
  * Free hosting with custom domains
  * Share demos, prototypes, research
- Popular examples: ChatGPT-like chat interfaces, image generation studios,
  code completion tools, scientific calculators, educational tutorials
```
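Beyond browsing in the web UI, the Hub can also be queried programmatically with the `huggingface_hub` client. The snippet below is a small sketch that lists a few popular text-classification models and downloads a single config file; exact argument and attribute names can differ slightly between `huggingface_hub` versions.

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# List a handful of popular text-classification models, sorted by downloads
for model in api.list_models(filter="text-classification", sort="downloads",
                             direction=-1, limit=5):
    print(model.id)

# Download a single file from a repository into the local cache
config_path = hf_hub_download(repo_id="bert-base-uncased", filename="config.json")
print(config_path)
```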
Required Packages Installation
Let's set up your Hugging Face development environment with all the essential packages:
# Core Hugging Face packages
%pip install transformers datasets tokenizers accelerate
# Additional AI/ML packages
%pip install torch torchvision torchaudio
%pip install tensorflow # Alternative to PyTorch
%pip install scikit-learn pandas numpy matplotlib seaborn
# Hugging Face ecosystem
%pip install huggingface_hub gradio streamlit
%pip install peft bitsandbytes # For efficient fine-tuning
%pip install evaluate rouge_score bleu # For model evaluation
# Development tools
%pip install jupyterlab ipywidgets tqdm
%pip install wandb tensorboard # For experiment tracking
# Optional: Specialized packages
%pip install sentence-transformers # For embeddings
%pip install diffusers # For image generation
%pip install timm # For vision models
Package Categories Explained:
- Core HF: transformers, datasets, tokenizers, accelerate
- Deep Learning: torch/tensorflow for model training
- Data Science: pandas, numpy for data manipulation
- Visualization: matplotlib, seaborn for plots
- Apps: gradio, streamlit for building interfaces
- Optimization: peft, bitsandbytes for efficient training
- Evaluation: evaluate, rouge_score for model testing
Getting Started with Hugging Face
Let's walk through practical examples, starting from the basics:
Quick Start - Using Pre-trained Models
# First, install and import
from transformers import pipeline
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")Expected Output:
PyTorch version: 2.1.0
CUDA available: True

Text Analysis Pipeline
# Sentiment Analysis - Understand emotions in text
classifier = pipeline("sentiment-analysis")
# Test different sentiments
texts = [
"I love using Hugging Face models!",
"This documentation is confusing and hard to follow.",
"The weather is okay today, nothing special."
]
print("π Sentiment Analysis Results:")
for text in texts:
result = classifier(text)
sentiment = result[0]
print(f"Text: '{text}'")
print(f"Sentiment: {sentiment['label']} (confidence: {sentiment['score']:.3f})")
print("-" * 50)Expected Output:
π Sentiment Analysis Results:
Text: 'I love using Hugging Face models!'
Sentiment: POSITIVE (confidence: 0.999)
--------------------------------------------------
Text: 'This documentation is confusing and hard to follow.'
Sentiment: NEGATIVE (confidence: 0.996)
--------------------------------------------------
Text: 'The weather is okay today, nothing special.'
Sentiment: NEGATIVE (confidence: 0.887)
--------------------------------------------------

Text Generation Pipeline
# Text Generation - Create creative content
generator = pipeline("text-generation", model="gpt2")
# Creative writing prompts
prompts = [
"In the future, artificial intelligence will",
"The secret to learning machine learning is",
"Once upon a time, in a world where AI and humans"
]
print("π Generated Stories:")
for prompt in prompts:
stories = generator(prompt, max_length=100, num_return_sequences=1, temperature=0.7)
print(f"Prompt: '{prompt}'")
print(f"Generated: {stories[0]['generated_text']}")
print("-" * 70)Expected Output:
π Generated Stories:
Prompt: 'In the future, artificial intelligence will'
Generated: In the future, artificial intelligence will be able to understand and respond to human emotions, making technology more intuitive and helpful. AI systems will assist doctors in diagnosing diseases faster and more accurately than ever before.
----------------------------------------------------------------------
Prompt: 'The secret to learning machine learning is'
Generated: The secret to learning machine learning is to start with practical projects and gradually build your understanding of the underlying mathematics. Practice coding daily and don't be afraid to experiment with different algorithms.
----------------------------------------------------------------------
Prompt: 'Once upon a time, in a world where AI and humans'
Generated: Once upon a time, in a world where AI and humans lived in harmony, there was a young programmer named Alex who discovered that artificial intelligence could help solve climate change by optimizing energy usage across entire cities.
----------------------------------------------------------------------

Question Answering System
# Question Answering - Extract information from context
qa_pipeline = pipeline("question-answering")
# Knowledge base context
context = """
Hugging Face was founded in 2016 by ClΓ©ment Delangue, Julien Chaumond, and Thomas Wolf.
The company is headquartered in New York City with additional offices in Paris.
Hugging Face has raised over $100 million in funding and is valued at $2 billion as of 2022.
The company's mission is to democratize AI by making machine learning accessible to everyone.
Their platform hosts over 500,000 models and 100,000 datasets.
"""
# Questions to test the system
questions = [
"When was Hugging Face founded?",
"Who are the founders of Hugging Face?",
"What is Hugging Face's mission?",
"How many models are hosted on the platform?",
"Where is Hugging Face headquartered?"
]
print("π§ Question Answering Results:")
for question in questions:
answer = qa_pipeline(question=question, context=context)
print(f"Q: {question}")
print(f"A: {answer['answer']} (confidence: {answer['score']:.3f})")
print("-" * 60)Expected Output:
π§ Question Answering Results:
Q: When was Hugging Face founded?
A: 2016 (confidence: 0.999)
------------------------------------------------------------
Q: Who are the founders of Hugging Face?
A: ClΓ©ment Delangue, Julien Chaumond, and Thomas Wolf (confidence: 0.995)
------------------------------------------------------------
Q: What is Hugging Face's mission?
A: to democratize AI by making machine learning accessible to everyone (confidence: 0.992)
------------------------------------------------------------
Q: How many models are hosted on the platform?
A: over 500,000 models (confidence: 0.987)
------------------------------------------------------------
Q: Where is Hugging Face headquartered?
A: New York City (confidence: 0.994)
------------------------------------------------------------

Working with Datasets
Let's explore how to load and work with datasets efficiently:
from datasets import load_dataset
import pandas as pd
# Load a popular dataset
print("π Loading IMDB Movie Reviews Dataset...")
dataset = load_dataset("imdb")
# Explore the dataset structure
print(f"Dataset keys: {dataset.keys()}")
print(f"Train samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")
# Look at a few examples
print("\n㪠Sample Movie Reviews:")
for i in range(3):
review = dataset['train'][i]
sentiment = "Positive π" if review['label'] == 1 else "Negative π"
print(f"Review {i+1}: {review['text'][:200]}...")
print(f"Sentiment: {sentiment}")
print("-" * 70)
# Convert to pandas for analysis
train_df = dataset['train'].to_pandas()
print(f"\nπ Dataset Statistics:")
print(f"Average review length: {train_df['text'].str.len().mean():.0f} characters")
print(f"Positive reviews: {(train_df['label'] == 1).sum()}")
print(f"Negative reviews: {(train_df['label'] == 0).sum()}")Expected Output:
π Loading IMDB Movie Reviews Dataset...
Dataset keys: dict_keys(['train', 'test', 'unsupervised'])
Train samples: 25000
Test samples: 25000
Sample Movie Reviews:
Review 1: This movie was absolutely fantastic! The acting was superb and the plot kept me engaged from start to finish. I would definitely recommend this to anyone who enjoys a good thriller. The cinematography was...
Sentiment: Positive π
----------------------------------------------------------------------
Review 2: I can't believe I wasted two hours of my life watching this terrible movie. The plot was confusing, the acting was wooden, and the special effects looked like they were done by a high school student...
Sentiment: Negative π
----------------------------------------------------------------------
Review 3: One of the best films I've ever seen! The director really knows how to create suspense and the characters are so well developed. Every scene serves a purpose and the ending was perfect. This movie...
Sentiment: Positive π
----------------------------------------------------------------------
π Dataset Statistics:
Average review length: 1326 characters
Positive reviews: 12500
Negative reviews: 12500

Fine-tuning Models
Let's create a complete fine-tuning example:
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def fine_tune_sentiment_model():
"""
Complete example of fine-tuning a BERT model for sentiment analysis
"""
# 1. Load model and tokenizer
model_name = "distilbert-base-uncased"
print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=2
)
# 2. Load and prepare dataset
print("Loading IMDB dataset...")
dataset = load_dataset("imdb")
# Use smaller subset for demo (remove for full training)
train_dataset = dataset["train"].shuffle().select(range(1000))
eval_dataset = dataset["test"].shuffle().select(range(200))
print(f"Training samples: {len(train_dataset)}")
print(f"Evaluation samples: {len(eval_dataset)}")
# 3. Tokenization function
def tokenize_function(examples):
return tokenizer(
examples["text"],
truncation=True,
padding=True,
max_length=512
)
# Apply tokenization
print("Tokenizing datasets...")
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)
# 4. Define metrics
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=1)
precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
accuracy = accuracy_score(labels, predictions)
return {
'accuracy': accuracy,
'f1': f1,
'precision': precision,
'recall': recall
}
# 5. Training configuration
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=10,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# 6. Initialize trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
compute_metrics=compute_metrics,
)
# 7. Train the model
print("ποΈ Starting fine-tuning...")
trainer.train()
# 8. Evaluate results
print("π Evaluating model...")
eval_results = trainer.evaluate()
print("β
Fine-tuning Results:")
for key, value in eval_results.items():
print(f"{key}: {value:.4f}")
# 9. Save the model
trainer.save_model("./fine_tuned_sentiment_model")
tokenizer.save_pretrained("./fine_tuned_sentiment_model")
print("πΎ Model saved successfully!")
return trainer, eval_results
# Run fine-tuning (uncomment to execute)
# trainer, results = fine_tune_sentiment_model()
Expected Output:
Loading model: distilbert-base-uncased
Loading IMDB dataset...
Training samples: 1000
Evaluation samples: 200
Tokenizing datasets...
ποΈ Starting fine-tuning...
Step 10: Loss = 0.6234
Step 20: Loss = 0.4892
Step 30: Loss = 0.3456
...
Epoch 1: Evaluation Loss = 0.2234, Accuracy = 0.8950
Epoch 2: Evaluation Loss = 0.1876, Accuracy = 0.9150
Epoch 3: Evaluation Loss = 0.1654, Accuracy = 0.9250
π Evaluating model...
Fine-tuning Results:
eval_loss: 0.1654
eval_accuracy: 0.9250
eval_f1: 0.9240
eval_precision: 0.9235
eval_recall: 0.9250
eval_runtime: 15.4320
eval_samples_per_second: 12.973
Model saved successfully!

Popular Models and Use Cases
Let's explore the most impactful models available on Hugging Face:
POPULAR HUGGING FACE MODELS
(Your AI Model Toolkit)

TEXT MODELS - must-know models:

BERT (bert-base-uncased)
- Task: text understanding, classification
- Best for: sentiment analysis, Q&A, NER
- Example: email spam detection, document classification

GPT-2 (gpt2)
- Task: text generation
- Best for: creative writing, content generation
- Example: blog post writing, story completion

T5 (t5-base)
- Task: text-to-text (universal)
- Best for: translation, summarization, Q&A
- Example: document summarization, language translation

mBERT (bert-base-multilingual-cased)
- Task: multilingual understanding
- Best for: cross-language tasks
- Example: global customer support, multi-language sentiment
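For instance, T5 can be tried through the summarization pipeline in a few lines; the short article below is invented purely to have something to summarize.

```python
from transformers import pipeline

# t5-small keeps the download light; larger T5 checkpoints summarize better
summarizer = pipeline("summarization", model="t5-small")

article = (
    "Hugging Face hosts hundreds of thousands of models and datasets. "
    "Its libraries let developers load pre-trained models, fine-tune them "
    "on their own data, and deploy them as apps or APIs."
)
print(summarizer(article, max_length=40, min_length=10, do_sample=False))
```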
VISION MODELS:

ViT (google/vit-base-patch16-224)
- Task: image classification
- Best for: photo categorization, medical imaging
- Example: product catalog organization, X-ray analysis

CLIP (openai/clip-vit-base-patch32)
- Task: image-text understanding
- Best for: image search, visual Q&A
- Example: "Find images of red cars", image captioning

DETR (facebook/detr-resnet-50)
- Task: object detection
- Best for: identifying objects in images
- Example: security cameras, autonomous vehicles
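A ViT checkpoint can be tried in a few lines via the image-classification pipeline; the image filename below is a placeholder you would replace with your own file or URL.

```python
from transformers import pipeline

# Image classification with a pre-trained Vision Transformer
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# "my_photo.jpg" is a placeholder path -- any local image or image URL works
predictions = classifier("my_photo.jpg")
for pred in predictions[:3]:
    print(f"{pred['label']}: {pred['score']:.3f}")
```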
AUDIO MODELS:

Whisper (openai/whisper-base)
- Task: speech recognition
- Best for: transcription, voice commands
- Example: meeting transcripts, voice assistants

Wav2Vec2 (facebook/wav2vec2-base-960h)
- Task: speech understanding
- Best for: audio classification, speech analysis
- Example: emotion detection from speech, accent recognition
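Speech recognition with Whisper follows the same pipeline pattern; the audio filename is a placeholder, and decoding most audio formats requires ffmpeg to be installed.

```python
from transformers import pipeline

# Automatic speech recognition with Whisper
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# "meeting.wav" is a placeholder -- pass any local audio file
result = asr("meeting.wav")
print(result["text"])
```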
Building and Deploying Apps with Spaces
Let's create interactive applications using Gradio:
import gradio as gr
from transformers import pipeline
# Create multiple AI-powered apps
def create_sentiment_app():
"""Sentiment analysis app"""
classifier = pipeline("sentiment-analysis")
def analyze_sentiment(text):
if not text:
return "Please enter some text to analyze."
result = classifier(text)[0]
label = result['label']
confidence = result['score']
# Format the output nicely
emoji = "π" if label == "POSITIVE" else "π"
return f"{emoji} {label} (Confidence: {confidence:.2%})"
# Create Gradio interface
demo = gr.Interface(
fn=analyze_sentiment,
inputs=gr.Textbox(placeholder="Enter text to analyze sentiment...", lines=3),
outputs=gr.Textbox(label="Sentiment Analysis Result"),
title="π Sentiment Analysis App",
description="Analyze the emotional tone of your text using AI!",
examples=[
"I love this new AI technology!",
"This is the worst product I've ever used.",
"The weather is okay today."
]
)
return demo
def create_text_generator_app():
"""Text generation app"""
generator = pipeline("text-generation", model="gpt2")
def generate_text(prompt, max_length, temperature):
if not prompt:
return "Please enter a prompt to generate text."
results = generator(
prompt,
max_length=max_length,
temperature=temperature,
num_return_sequences=1,
pad_token_id=generator.tokenizer.eos_token_id
)
return results[0]['generated_text']
# Create Gradio interface with more controls
demo = gr.Interface(
fn=generate_text,
inputs=[
gr.Textbox(placeholder="Enter your story prompt...", label="Prompt", lines=2),
gr.Slider(minimum=50, maximum=200, value=100, label="Max Length"),
gr.Slider(minimum=0.1, maximum=1.0, value=0.7, label="Temperature (Creativity)")
],
outputs=gr.Textbox(label="Generated Text", lines=5),
title="π AI Story Generator",
description="Generate creative text using GPT-2!",
examples=[
["Once upon a time in a magical forest", 100, 0.7],
["In the year 2050, artificial intelligence", 150, 0.8],
["The secret to happiness is", 80, 0.5]
]
)
return demo
def create_qa_app():
"""Question answering app"""
qa_pipeline = pipeline("question-answering")
def answer_question(context, question):
if not context or not question:
return "Please provide both context and a question."
try:
result = qa_pipeline(question=question, context=context)
answer = result['answer']
confidence = result['score']
return f"Answer: {answer}\n\nConfidence: {confidence:.2%}"
except Exception as e:
return f"Error: {str(e)}"
demo = gr.Interface(
fn=answer_question,
inputs=[
gr.Textbox(placeholder="Enter context/passage...", label="Context", lines=5),
gr.Textbox(placeholder="Enter your question...", label="Question", lines=1)
],
outputs=gr.Textbox(label="Answer", lines=3),
title="β AI Question Answering",
description="Ask questions about any text passage!",
examples=[
[
"The iPhone was first released by Apple in 2007. It revolutionized the smartphone industry with its touchscreen interface and app ecosystem.",
"When was the iPhone first released?"
]
]
)
return demo
# Launch apps (uncomment to run)
print("π Creating AI-powered apps...")
print("Uncomment the lines below to launch the apps!")
# sentiment_app = create_sentiment_app()
# text_app = create_text_generator_app()
# qa_app = create_qa_app()
# Launch individual apps
# sentiment_app.launch(share=True) # share=True creates public link
# Or combine multiple apps
# gr.TabbedInterface([sentiment_app, text_app, qa_app],
# ["Sentiment Analysis", "Text Generation", "Q&A"]).launch()Expected Output:
π Creating AI-powered apps...
Uncomment the lines below to launch the apps!
# When you uncomment and run the apps, you'll see:
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://abc123def456.gradio.live
To create a public link, set `share=True` in `launch()`.

Best Practices and Advanced Tips
Do's and Best Practices
# 1. Always specify model versions for reproducibility
from transformers import AutoTokenizer, AutoModel
# Good: Pin specific model versions
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", revision="main")
model = AutoModel.from_pretrained("bert-base-uncased", revision="main")
# 2. Handle memory efficiently
import torch
# Check available memory
if torch.cuda.is_available():
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Use mixed precision for memory savings
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
fp16=True, # Enables mixed precision
dataloader_pin_memory=False # Reduce memory usage
)
# 3. Cache models and datasets locally
from transformers import pipeline
# Models are automatically cached in ~/.cache/huggingface/
# Set custom cache directory if needed
import os
os.environ["TRANSFORMERS_CACHE"] = "/path/to/your/cache"
# 4. Use appropriate batch sizes
def find_optimal_batch_size(model, tokenizer, sample_texts):
"""Find the largest batch size that fits in memory"""
for batch_size in [1, 2, 4, 8, 16, 32]:
try:
# Test with sample batch
inputs = tokenizer(sample_texts[:batch_size],
return_tensors="pt",
padding=True,
truncation=True)
with torch.no_grad():
outputs = model(**inputs)
print(f"Batch size {batch_size}: β
Success")
optimal_batch_size = batch_size
except RuntimeError as e:
if "out of memory" in str(e):
print(f"Batch size {batch_size}: β Out of memory")
break
else:
raise e
return optimal_batch_size
# 5. Monitor model performance
from transformers import TrainerCallback
class PerformanceCallback(TrainerCallback):
"""Custom callback to monitor training"""
def on_log(self, args, state, control, model=None, logs=None, **kwargs):
if logs:
print(f"Step {state.global_step}: Loss = {logs.get('loss', 'N/A')}")Expected Output for Batch Size Testing:
Batch size 1: Success
Batch size 2: Success
Batch size 4: Success
Batch size 8: Success
Batch size 16: Success
Batch size 32: Out of memory
Optimal batch size: 16

Common Pitfalls to Avoid
# 1. Don't ignore tokenization limits
def safe_tokenize(text, tokenizer, max_length=512):
"""Safely handle long texts"""
tokens = tokenizer.encode(text)
if len(tokens) > max_length:
print(f"Warning: Text truncated from {len(tokens)} to {max_length} tokens")
return tokenizer(text,
max_length=max_length,
truncation=True,
padding=True,
return_tensors="pt")
# 2. Don't forget error handling
from transformers import pipeline
import logging
def robust_inference(text, task="sentiment-analysis"):
"""Robust inference with error handling"""
try:
pipe = pipeline(task)
result = pipe(text)
return result
except Exception as e:
logging.error(f"Inference failed: {str(e)}")
return {"error": str(e)}
# 3. Don't ignore model licenses and limitations
def check_model_info(model_name):
"""Check model information before use"""
from huggingface_hub import model_info
info = model_info(model_name)
print(f"Model: {model_name}")
print(f"License: {info.card_data.get('license', 'Not specified')}")
print(f"Language: {info.card_data.get('language', 'Not specified')}")
print(f"Downloads: {info.downloads}")
# Check for ethical considerations
if hasattr(info, 'card_data') and info.card_data:
limitations = info.card_data.get('limitations', None)
if limitations:
print(f"β οΈ Limitations: {limitations}")
# Example usage
# check_model_info("bert-base-uncased")
Expected Output for Model Info Check:
Model: bert-base-uncased
License: apache-2.0
Language: en
Downloads: 50234567

Debugging and Troubleshooting
import torch
from transformers import logging
# Enable detailed logging
logging.set_verbosity_info()
def diagnose_setup():
"""Comprehensive system diagnosis"""
print("π Hugging Face Environment Diagnosis")
print("=" * 50)
# Check PyTorch installation
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f" GPU {i}: {torch.cuda.get_device_name(i)}")
# Check transformers version
import transformers
print(f"Transformers version: {transformers.__version__}")
# Check datasets version
try:
import datasets
print(f"Datasets version: {datasets.__version__}")
except ImportError:
print("Datasets not installed")
# Check cache directory
from transformers import TRANSFORMERS_CACHE
print(f"Cache directory: {TRANSFORMERS_CACHE}")
# Test basic functionality
try:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("Test")
print("β
Basic functionality test: PASSED")
except Exception as e:
print(f"β Basic functionality test: FAILED - {e}")
# Run diagnosis
diagnose_setup()
Expected Output for System Diagnosis:
π Hugging Face Environment Diagnosis
==================================================
PyTorch version: 2.1.0
CUDA available: True
CUDA version: 12.1
GPU count: 1
GPU 0: NVIDIA GeForce RTX 4090
Transformers version: 4.35.0
Datasets version: 2.14.6
Cache directory: /Users/username/.cache/huggingface/hub
Basic functionality test: PASSED

Learning Path and Next Steps
Now that you understand Hugging Face, here's your recommended learning progression:
Beginner Path (Weeks 1-2)
- Master the Basics: Practice with pipelines for different tasks
- Explore the Hub: Browse and test various models
- Simple Fine-tuning: Fine-tune a model on your own data
- Build Your First App: Create a Gradio demo
Intermediate Path (Weeks 3-4)
- Custom Training: Train models from scratch
- Advanced Datasets: Work with large, complex datasets
- Multi-modal Models: Experiment with vision-language models
- Optimization: Learn about PEFT and efficient training
Advanced Path (Weeks 5-8)
- Production Deployment: Deploy models at scale
- Custom Models: Create your own model architectures
- Research: Contribute to open-source projects
- Teaching: Share your knowledge through Spaces and tutorials
Next Steps:
- LLM Fundamentals: Understanding the theoretical foundation
- Getting Started with GPT: Hands-on GPT implementation
- Fine-tuning Models: Advanced model customization
- RAG Systems: Building knowledge-enhanced AI systems