Text Analysis: POS Tagging & Named Entity Recognition
Understanding grammatical structure and extracting meaningful entities from text
What is Text Analysis?
Text analysis involves extracting meaningful information from preprocessed text. Two fundamental techniques are Parts of Speech (POS) Tagging and Named Entity Recognition (NER). These techniques help understand the grammatical structure and identify important entities in text.
TEXT ANALYSIS OVERVIEW
Preprocessed Text → Grammatical Analysis → Entity Extraction → Structured Information

CORE TECHNIQUES
├── Parts of Speech (POS) Tagging
│   ├── Identify grammatical roles
│   ├── Noun, verb, adjective classification
│   └── Syntax understanding
└── Named Entity Recognition (NER)
    ├── Extract people, organizations, locations
    ├── Identify dates, monetary values
    └── Domain-specific entity extraction

APPLICATIONS
├── Information extraction
├── Question answering systems
├── Document summarization
├── Content classification
└── Knowledge graph construction
Quick Start Guide
GETTING STARTED WITH TEXT ANALYSIS

1. ENVIRONMENT SETUP
☐ Install spaCy for advanced NLP
☐ Download language models
☐ Install NLTK for basic POS tagging
☐ Set up visualization tools

2. CHOOSE ANALYSIS TECHNIQUES
☐ Identify required information types
☐ Select appropriate models
☐ Consider accuracy vs. speed trade-offs
☐ Plan integration with preprocessing

3. IMPLEMENT ANALYSIS PIPELINE
☐ Create analysis functions
☐ Test on sample data
☐ Optimize for your use case
☐ Handle edge cases

4. VALIDATION & EVALUATION
☐ Verify analysis accuracy
☐ Test on different text types
☐ Monitor performance
☐ Document findings
Required Libraries
# Essential libraries for text analysis
pip install spacy nltk matplotlib seaborn pandas numpy
# Download spaCy language model (includes POS and NER)
python -m spacy download en_core_web_sm
# For more advanced models
python -m spacy download en_core_web_md
python -m spacy download en_core_web_lg
# Download NLTK data for POS tagging and NER (run these in a Python session)
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
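With the libraries in place, a quick smoke test confirms that the model loads and that both POS tags and entities come back (a minimal sketch; the sample sentence is arbitrary):

import spacy

# Load the small English model downloaded above
nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

print([(token.text, token.pos_) for token in doc])   # POS tags
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities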
1. Parts of Speech (POS) Tagging
POS tagging identifies the grammatical role of each word (noun, verb, adjective, etc.). This is crucial for understanding sentence structure and meaning.
POS TAGGING PROCESS
Input: "emma woodhouse handsome clever and rich"
POS TAGS:
├── emma → PROPN (Proper noun)
├── woodhouse → PROPN (Proper noun)
├── handsome → ADJ (Adjective)
├── clever → ADJ (Adjective)
├── and → CCONJ (Coordinating conjunction)
└── rich → ADJ (Adjective)

APPLICATIONS
├── Syntax analysis
├── Feature engineering
├── Information extraction
├── Grammar checking
└── Text classification
Basic POS Tagging with NLTK
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

# Download required NLTK data
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

text = "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"
tokens = word_tokenize(text)

# Get POS tags
pos_tags = nltk.pos_tag(tokens)
print("NLTK POS Tags:")
for word, tag in pos_tags:
    print(f"{word:12} → {tag}")

# Analyze POS distribution
pos_counts = Counter([tag for word, tag in pos_tags])
print("\nPOS Distribution:")
for tag, count in pos_counts.most_common():
    print(f"{tag}: {count}")
# OUTPUT:
# NLTK POS Tags:
# Emma         → NNP
# Woodhouse    → NNP
# ,            → ,
# handsome     → JJ
# ,            → ,
# clever       → JJ
# ,            → ,
# and          → CC
# rich         → JJ
# ,            → ,
# with         → IN
# a            → DT
# comfortable  → JJ
# home         → NN
#
# POS Distribution:
# ,: 4
# JJ: 4
# NNP: 2
# CC: 1
# IN: 1
# DT: 1
# NN: 1
Advanced POS Tagging with spaCy
import spacy
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

text = "Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence."
doc = nlp(text)

# Extract detailed POS information
pos_data = []
for token in doc:
    pos_data.append({
        'text': token.text,
        'pos': token.pos_,
        'tag': token.tag_,
        'dep': token.dep_,
        'lemma': token.lemma_,
        'is_alpha': token.is_alpha,
        'is_stop': token.is_stop
    })
pos_df = pd.DataFrame(pos_data)
print("Detailed POS Analysis:")
print(pos_df.to_string(index=False))
# Visualize POS distribution
pos_counts = Counter([token.pos_ for token in doc if token.is_alpha])
plt.figure(figsize=(10, 6))
plt.bar(pos_counts.keys(), pos_counts.values(), color='skyblue')
plt.title('Parts of Speech Distribution')
plt.xlabel('POS Tags')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# OUTPUT:
# Detailed POS Analysis:
# text pos tag dep lemma is_alpha is_stop
# Emma PROPN NNP nsubj Emma True False
# Woodhouse PROPN NNP flat Woodhouse True False
# , PUNCT , punct , False False
# handsome ADJ JJ acomp handsome True False
# , PUNCT , punct , False False
# clever ADJ JJ conj clever True False
# , PUNCT , punct , False False
# and CCONJ CC cc and True True
# rich ADJ JJ conj rich True False
#
# POS Distribution: ADJ(4), PROPN(2), NOUN(6), VERB(2), etc.
POS Tag Meanings
# Common coarse-grained POS tags and their meanings
pos_explanations = {
    'NOUN': 'Noun - person, place, thing',
    'PROPN': 'Proper noun - specific names',
    'VERB': 'Verb - action or state',
    'ADJ': 'Adjective - describes nouns',
    'ADV': 'Adverb - describes verbs/adjectives',
    'PRON': 'Pronoun - replaces nouns',
    'DET': 'Determiner - the, a, an',
    'ADP': 'Adposition - preposition',
    'CCONJ': 'Coordinating conjunction - and, or, but',
    'PUNCT': 'Punctuation marks',
    'NUM': 'Number - cardinal/ordinal',
    'PART': 'Particle - to, not, possessive marker',
    'INTJ': 'Interjection - exclamation'
}

# Get explanations for tags in our text (uses doc from the previous example)
print("POS Tag Explanations:")
unique_pos = set(token.pos_ for token in doc)
for pos in sorted(unique_pos):
    explanation = pos_explanations.get(pos, 'Unknown')
    print(f"{pos:6} → {explanation}")
# OUTPUT:
# POS Tag Explanations:
# ADJ    → Adjective - describes nouns
# ADP    → Adposition - preposition
# CCONJ  → Coordinating conjunction - and, or, but
# DET    → Determiner - the, a, an
# NOUN   → Noun - person, place, thing
# PART   → Particle - to, not, possessive marker
# PROPN  → Proper noun - specific names
# PUNCT  → Punctuation marks
# VERB   → Verb - action or state
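Rather than maintaining this dictionary by hand, spaCy ships a built-in glossary through spacy.explain, which also covers fine-grained tags and entity labels (a small sketch reusing the doc from the spaCy example above):

import spacy

# spacy.explain returns a short description for POS tags, fine-grained tags, and entity labels
print(spacy.explain('ADJ'))   # e.g. 'adjective'
print(spacy.explain('NNP'))   # e.g. 'noun, proper singular'
print(spacy.explain('GPE'))   # e.g. 'Countries, cities, states'

# Annotate every token without a hand-written dictionary
for token in doc:
    print(f"{token.text:12} {token.pos_:6} → {spacy.explain(token.pos_)}")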
Extracting Specific POS Categories
def extract_pos_categories(text):
    """Extract words by POS category"""
    doc = nlp(text)
    categories = {
        'nouns': [token.text for token in doc if token.pos_ in ['NOUN', 'PROPN']],
        'verbs': [token.text for token in doc if token.pos_ == 'VERB'],
        'adjectives': [token.text for token in doc if token.pos_ == 'ADJ'],
        'adverbs': [token.text for token in doc if token.pos_ == 'ADV']
    }
    return categories

# Example usage
text = "The brilliant scientist quickly discovered amazing new compounds in her modern laboratory."
categories = extract_pos_categories(text)
for category, words in categories.items():
    print(f"{category.capitalize()}: {words}")
# OUTPUT:
# Nouns: ['scientist', 'compounds', 'laboratory']
# Verbs: ['discovered']
# Adjectives: ['brilliant', 'amazing', 'new', 'modern']
# Adverbs: ['quickly']
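A related convenience: spaCy's doc.noun_chunks yields base noun phrases, so each noun comes back with its modifiers attached instead of as a bare token (a quick sketch reusing the sentence above):

# Base noun phrases keep modifiers attached to their head noun
doc = nlp("The brilliant scientist quickly discovered amazing new compounds in her modern laboratory.")
for chunk in doc.noun_chunks:
    print(f"{chunk.text:30} (head: {chunk.root.text})")
# Expected chunks: 'The brilliant scientist', 'amazing new compounds', 'her modern laboratory'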
2. Named Entity Recognition (NER)
NER identifies and categorizes entities like people, organizations, dates, and locations. This is essential for information extraction and knowledge graph construction.
NAMED ENTITY RECOGNITION
Input: "Google was founded by Larry Page and Sergey Brin at Stanford University"
ENTITIES:
├── Google → ORG (Organization)
├── Larry Page → PERSON (Person)
├── Sergey Brin → PERSON (Person)
└── Stanford University → ORG (Organization)

ENTITY TYPES
├── PERSON (People, including fictional)
├── ORG (Organizations, companies, agencies)
├── GPE (Countries, cities, states)
├── DATE (Dates, periods)
├── TIME (Times smaller than a day)
├── MONEY (Monetary values)
├── QUANTITY (Measurements)
├── ORDINAL (First, second, etc.)
├── CARDINAL (Numerals)
└── PRODUCT (Products, not services)
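The setup step also downloaded NLTK's maxent_ne_chunker; before moving to spaCy, here is the equivalent baseline in NLTK, which chunks POS-tagged tokens into a tree and uses its own label set (PERSON, ORGANIZATION, GPE, etc.):

import nltk

text = "Google was founded by Larry Page and Sergey Brin at Stanford University"
# ne_chunk expects POS-tagged tokens and returns a tree of entity chunks
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
for subtree in tree:
    if hasattr(subtree, 'label'):  # entity chunks are subtrees; other tokens are plain tuples
        entity = " ".join(word for word, tag in subtree.leaves())
        print(f"{entity:25} → {subtree.label()}")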
Basic NER with spaCy
import spacy
from spacy import displacy
from collections import Counter
# Load spaCy model
nlp = spacy.load('en_core_web_sm')
text = """Google was founded on September 4, 1998, by computer scientists
Larry Page and Sergey Brin while they were PhD students at Stanford University
in California. The company received $25 million in funding from Sequoia Capital."""
doc = nlp(text)
# Extract entities
print("Named Entities:")
print("-" * 50)
for ent in doc.ents:
    print(f"{ent.text:25} | {ent.label_:10} | {spacy.explain(ent.label_)}")

# Count entity types
entity_counts = Counter([ent.label_ for ent in doc.ents])
print("\nEntity Type Distribution:")
for entity_type, count in entity_counts.most_common():
    print(f"{entity_type:10}: {count}")
# OUTPUT:
# Named Entities:
# --------------------------------------------------
# Google | ORG | Companies, agencies, institutions, etc.
# September 4, 1998 | DATE | Absolute or relative dates or periods
# Larry Page | PERSON | People, including fictional
# Sergey Brin | PERSON | People, including fictional
# Stanford University | ORG | Companies, agencies, institutions, etc.
# California | GPE | Countries, cities, states
# $25 million | MONEY | Monetary values, including unit
# Sequoia Capital | ORG | Companies, agencies, institutions, etc.
#
# Entity Type Distribution:
# ORG : 3
# PERSON : 2
# DATE : 1
# GPE : 1
# MONEY     : 1
Advanced NER Analysis
def analyze_entities(text):
    """Comprehensive entity analysis"""
    doc = nlp(text)
    # Group entities by type
    entities_by_type = {}
    for ent in doc.ents:
        if ent.label_ not in entities_by_type:
            entities_by_type[ent.label_] = []
        entities_by_type[ent.label_].append({
            'text': ent.text,
            'start': ent.start_char,
            'end': ent.end_char,
            # spaCy's default NER does not expose per-entity confidence scores,
            # so this falls back to 'N/A'
            'confidence': getattr(ent, 'confidence', 'N/A')
        })
    return entities_by_type

# Example analysis
text = """Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne
in April 1976 in Cupertino, California. The iPhone was launched on June 29, 2007,
and generated $150 billion in revenue by 2020."""

entities = analyze_entities(text)
print("Entities by Type:")
print("=" * 60)
for entity_type, entity_list in entities.items():
    print(f"\n{entity_type} ({spacy.explain(entity_type)}):")
    for entity in entity_list:
        print(f"  • {entity['text']} (chars {entity['start']}-{entity['end']})")
# OUTPUT:
# Entities by Type:
# ============================================================
#
# ORG (Companies, agencies, institutions, etc.):
#   • Apple Inc. (chars 0-10)
#
# PERSON (People, including fictional):
#   • Steve Jobs (chars 29-39)
#   • Steve Wozniak (chars 41-54)
#   • Ronald Wayne (chars 60-72)
#
# DATE (Absolute or relative dates or periods):
#   • April 1976 (chars 76-86)
#   • June 29, 2007 (chars 141-154)
#   • 2020 (chars 191-195)
#
# GPE (Countries, cities, states):
#   • Cupertino (chars 90-99)
#   • California (chars 101-111)
#
# MONEY (Monetary values, including unit):
#   • $150 billion (chars 169-181)
#
# PRODUCT (Objects, vehicles, foods, etc.):
#   • iPhone (chars 117-123)
Entity Visualization
# Visualize entities (for Jupyter notebooks)
def visualize_entities(text):
    """Create entity visualization"""
    doc = nlp(text)
    # Display inline in a notebook
    displacy.render(doc, style="ent", jupyter=True)
    # Alternative: standalone HTML output
    html = displacy.render(doc, style="ent", page=True)
    return html
# Example visualization
text = "Microsoft was founded by Bill Gates and Paul Allen in 1975 in Redmond, Washington."
# visualize_entities(text) # Uncomment for Jupyter
# OUTPUT:
# Creates visual highlighting of entities in the text:
# [Microsoft](ORG) was founded by [Bill Gates](PERSON) and [Paul Allen](PERSON)
# in [1975](DATE) in [Redmond](GPE), [Washington](GPE).
#
# HTML output contains styled spans with entity labels and colors
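Outside a notebook, the returned HTML can simply be written to disk and opened in a browser (a minimal sketch; the entities.html filename is arbitrary):

from pathlib import Path
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Microsoft was founded by Bill Gates and Paul Allen in 1975 in Redmond, Washington.")
html = displacy.render(doc, style="ent", page=True)  # page=True emits a complete HTML page
Path("entities.html").write_text(html, encoding="utf-8")
print("Wrote entities.html (open it in any browser)")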
Custom Entity Recognition
import re

def extract_custom_entities(text, custom_patterns=None):
    """Extract entities with custom regex patterns alongside spaCy's"""
    doc = nlp(text)
    # Standard spaCy entities
    standard_entities = [(ent.text, ent.label_) for ent in doc.ents]
    # Custom pattern matching (example: email addresses)
    custom_entities = []
    if custom_patterns:
        for pattern_name, pattern in custom_patterns.items():
            for match in re.finditer(pattern, text):
                custom_entities.append((match.group(), pattern_name))
    return {
        'standard': standard_entities,
        'custom': custom_entities
    }
# Example with custom patterns
text = "Contact John Doe at john.doe@company.com or visit https://example.com"
custom_patterns = {
    'EMAIL': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
'URL': r'https?://[^\s]+',
'PHONE': r'\b\d{3}-\d{3}-\d{4}\b'
}
entities = extract_custom_entities(text, custom_patterns)
print("Standard entities:", entities['standard'])
print("Custom entities:", entities['custom'])
# OUTPUT:
# Standard entities: [('John Doe', 'PERSON')]
# Custom entities: [('john.doe@company.com', 'EMAIL'), ('https://example.com', 'URL')]
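When custom patterns should behave like first-class entities rather than living in a parallel list, spaCy's built-in EntityRuler adds rule-based matches directly to doc.ents (a sketch; the PRODUCT patterns are illustrative):

import spacy

nlp = spacy.load('en_core_web_sm')
# Insert a rule-based ruler before the statistical NER so its matches take priority
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "Apple II"},             # exact phrase match
    {"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}]},  # token-level pattern
])

doc = nlp("The Apple II and the iPhone were both made by Apple.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Expected to include ('Apple II', 'PRODUCT') and ('iPhone', 'PRODUCT')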
3. Combined POS and NER Analysis
Combining POS tagging and NER provides rich linguistic analysis for advanced NLP tasks.
COMBINED ANALYSIS PIPELINE
Input Text → POS Tagging → NER → Linguistic Features → Structured Output

BENEFITS
├── Rich feature extraction
├── Better context understanding
├── Improved information extraction
├── Enhanced text classification
└── Advanced query processing
Integrated Analysis Pipeline
import spacy
from collections import Counter

class TextAnalyzer:
    def __init__(self):
        self.nlp = spacy.load('en_core_web_sm')

    def analyze_text(self, text):
        """Comprehensive text analysis"""
        doc = self.nlp(text)
        analysis = {
            'text': text,
            'tokens': [],
            'entities': [],
            'pos_summary': {},
            'entity_summary': {},
            'sentences': []
        }
        # Token-level analysis
        for token in doc:
            analysis['tokens'].append({
                'text': token.text,
                'pos': token.pos_,
                'tag': token.tag_,
                'lemma': token.lemma_,
                'dep': token.dep_,
                'is_alpha': token.is_alpha,
                'is_stop': token.is_stop,
                'is_entity': token.ent_type_ != ''
            })
        # Entity analysis
        for ent in doc.ents:
            analysis['entities'].append({
                'text': ent.text,
                'label': ent.label_,
                'start': ent.start_char,
                'end': ent.end_char,
                'description': spacy.explain(ent.label_)
            })
        # Summaries
        analysis['pos_summary'] = Counter([token.pos_ for token in doc if token.is_alpha])
        analysis['entity_summary'] = Counter([ent.label_ for ent in doc.ents])
        # Sentence analysis
        for sent in doc.sents:
            analysis['sentences'].append({
                'text': sent.text.strip(),
                'entities': [ent.text for ent in sent.ents],
                'main_pos': [token.pos_ for token in sent if token.pos_ in ['NOUN', 'VERB', 'ADJ']]
            })
        return analysis

    def generate_report(self, analysis):
        """Generate analysis report"""
        print("TEXT ANALYSIS REPORT")
        print("=" * 50)
        print(f"\nText: {analysis['text'][:100]}...")
        print(f"Tokens: {len(analysis['tokens'])}")
        print(f"Sentences: {len(analysis['sentences'])}")
        print("\nPOS Distribution:")
        for pos, count in analysis['pos_summary'].most_common():
            print(f"  {pos:8}: {count}")
        print("\nEntity Distribution:")
        for entity_type, count in analysis['entity_summary'].most_common():
            print(f"  {entity_type:8}: {count}")
        print("\nDetected Entities:")
        for entity in analysis['entities']:
            print(f"  • {entity['text']:20} ({entity['label']})")
# Example usage
analyzer = TextAnalyzer()
text = """Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976.
The company revolutionized personal computing with the Apple II and later the Macintosh.
Today, Apple is headquartered in Cupertino, California, and is worth over $2 trillion."""
analysis = analyzer.analyze_text(text)
analyzer.generate_report(analysis)
# OUTPUT:
# TEXT ANALYSIS REPORT
# ==================================================
#
# Text: Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976. The company revolutionized...
# Tokens: 31
# Sentences: 3
#
# POS Distribution:
#   NOUN    : 8
#   PROPN   : 7
#   VERB    : 4
#   ADP     : 4
#   ADJ     : 3
#   DET     : 3
#
# Entity Distribution:
#   ORG     : 2
#   PERSON  : 2
#   GPE     : 2
#   PRODUCT : 2
#   DATE    : 1
#   MONEY   : 1
#
# Detected Entities:
#   • Apple Inc.           (ORG)
#   • Steve Jobs           (PERSON)
#   • Steve Wozniak        (PERSON)
#   • April 1976           (DATE)
#   • Apple II             (PRODUCT)
#   • Macintosh            (PRODUCT)
#   • Apple                (ORG)
#   • Cupertino            (GPE)
#   • California           (GPE)
#   • $2 trillion          (MONEY)
Advanced Feature Extraction
def extract_linguistic_features(text):
    """Extract advanced linguistic features"""
    doc = nlp(text)
    sentences = list(doc.sents)
    alpha_tokens = [token for token in doc if token.is_alpha]
    features = {
        # Basic statistics
        'word_count': len(alpha_tokens),
        'sentence_count': len(sentences),
        'avg_sentence_length': sum(len(sent.text.split()) for sent in sentences) / len(sentences),
        # POS features (ratios over all tokens, punctuation included)
        'noun_ratio': len([token for token in doc if token.pos_ == 'NOUN']) / len(doc),
        'verb_ratio': len([token for token in doc if token.pos_ == 'VERB']) / len(doc),
        'adj_ratio': len([token for token in doc if token.pos_ == 'ADJ']) / len(doc),
        # Entity features
        'entity_count': len(doc.ents),
        'person_count': len([ent for ent in doc.ents if ent.label_ == 'PERSON']),
        'org_count': len([ent for ent in doc.ents if ent.label_ == 'ORG']),
        'location_count': len([ent for ent in doc.ents if ent.label_ in ['GPE', 'LOC']]),
        # Complexity features
        'unique_pos_count': len(set(token.pos_ for token in doc)),
        'entity_density': len(doc.ents) / len(doc),
        'lexical_diversity': len(set(token.lemma_ for token in alpha_tokens)) / len(alpha_tokens)
    }
    return features
# Example feature extraction
text = "Google was founded by Larry Page and Sergey Brin at Stanford University in California."
features = extract_linguistic_features(text)
print("π LINGUISTIC FEATURES:")
for feature, value in features.items():
if isinstance(value, float):
print(f"{feature:20}: {value:.3f}")
else:
print(f"{feature:20}: {value}")
# OUTPUT:
# LINGUISTIC FEATURES:
# word_count          : 13
# sentence_count      : 1
# avg_sentence_length : 15.0
# noun_ratio          : 0.067
# verb_ratio          : 0.067
# adj_ratio           : 0.000
# entity_count        : 5
# person_count        : 2
# org_count           : 2
# location_count      : 1
# unique_pos_count    : 8
# entity_density      : 0.333
# lexical_diversity   : 0.846
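Because the function returns a flat dict of numbers, building a feature matrix for downstream models is one list comprehension away (a small sketch; the two sample texts are arbitrary):

import pandas as pd

# One row per document, one column per linguistic feature
texts = [
    "Google was founded by Larry Page and Sergey Brin at Stanford University.",
    "The weather was cold, wet, and thoroughly miserable all week.",
]
feature_df = pd.DataFrame([extract_linguistic_features(t) for t in texts])
print(feature_df.round(3))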
Key Applications
TEXT ANALYSIS APPLICATIONS

1. INFORMATION EXTRACTION
├── Extract key facts from documents
├── Identify important entities
├── Structure unstructured text
└── Create knowledge graphs

2. DOCUMENT CLASSIFICATION
├── Use POS patterns as features
├── Entity-based categorization
├── Content type detection
└── Topic modeling enhancement

3. QUESTION ANSWERING
├── Identify question types via POS
├── Extract answer entities
├── Validate answer formats
└── Improve response relevance

4. CONTENT ANALYSIS
├── Analyze writing style
├── Detect content themes
├── Measure text complexity
└── Generate summaries
Real-World Example: News Article Analysis
def analyze_news_article(article_text):
    """Analyze a news article for key information"""
    doc = nlp(article_text)
    # Extract key information
    people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    organizations = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
    locations = [ent.text for ent in doc.ents if ent.label_ in ['GPE', 'LOC']]
    dates = [ent.text for ent in doc.ents if ent.label_ == 'DATE']
    # Analyze sentence structure
    sentences = list(doc.sents)
    key_sentences = []
    for sent in sentences:
        # Keep sentences that mention multiple important entities
        sent_entities = [ent for ent in sent.ents if ent.label_ in ['PERSON', 'ORG', 'GPE']]
        if len(sent_entities) >= 2:
            key_sentences.append(sent.text.strip())
    return {
        'people': list(set(people)),
        'organizations': list(set(organizations)),
        'locations': list(set(locations)),
        'dates': list(set(dates)),
        'key_sentences': key_sentences
    }
# Example news article analysis
article = """
Apple CEO Tim Cook announced today that the company will invest $1 billion in a new
manufacturing facility in Austin, Texas. The announcement came during a visit to the
Apple Park headquarters in Cupertino, California, where Cook met with Texas Governor
Greg Abbott to discuss the expansion plans.
"""
news_analysis = analyze_news_article(article)
print("NEWS ARTICLE ANALYSIS:")
print(f"People: {news_analysis['people']}")
print(f"Organizations: {news_analysis['organizations']}")
print(f"Locations: {news_analysis['locations']}")
print(f"Dates: {news_analysis['dates']}")
print("Key Sentences:")
for sentence in news_analysis['key_sentences']:
    print(f"  • {sentence}")
# OUTPUT:
# NEWS ARTICLE ANALYSIS:
# People: ['Tim Cook', 'Greg Abbott']
# Organizations: ['Apple', 'Apple Park']
# Locations: ['Austin', 'Texas', 'Cupertino', 'California']
# Dates: ['today']
# Key Sentences:
#   • Apple CEO Tim Cook announced today that the company will invest $1 billion in a new manufacturing facility in Austin, Texas.
#   • The announcement came during a visit to the Apple Park headquarters in Cupertino, California, where Cook met with Texas Governor Greg Abbott to discuss the expansion plans.
Best Practices
TEXT ANALYSIS BEST PRACTICES

RECOMMENDATIONS
├── Use appropriate models for your domain
├── Validate entity extraction accuracy
├── Consider context when interpreting POS tags
├── Combine multiple techniques for robustness
├── Handle edge cases and errors gracefully
└── Evaluate performance on your specific data

COMMON PITFALLS
├── Over-relying on single model predictions
├── Ignoring model confidence scores
├── Not handling ambiguous cases
├── Assuming perfect entity recognition
├── Ignoring domain-specific terminology
└── Not validating results manually
Model Selection and Evaluation
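On the selection side, the sm/md/lg models trade accuracy against speed and memory; a quick timing loop makes that trade-off concrete (a rough sketch; it assumes all three models from the setup step are installed):

import time
import spacy

sample = "Apple Inc. was founded by Steve Jobs and Steve Wozniak in Cupertino, California. " * 50

for name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    model = spacy.load(name)
    start = time.perf_counter()
    doc = model(sample)
    print(f"{name}: {time.perf_counter() - start:.3f}s, {len(doc.ents)} entities")

On the evaluation side, the helper below scores predicted entities against hand-labeled ones with exact-match precision, recall, and F1.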
def evaluate_ner_performance(test_texts, true_entities):
    """Evaluate NER performance with exact-match scoring"""
    predicted_entities = []
    for text in test_texts:
        doc = nlp(text)
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        predicted_entities.append(entities)
    # Calculate metrics (simplified)
    correct = 0
    total_pred = 0
    total_true = 0
    for pred, true in zip(predicted_entities, true_entities):
        pred_set = set(pred)
        true_set = set(true)
        correct += len(pred_set.intersection(true_set))
        total_pred += len(pred_set)
        total_true += len(true_set)
    precision = correct / total_pred if total_pred > 0 else 0
    recall = correct / total_true if total_true > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
# Example evaluation
test_texts = ["Apple Inc. was founded by Steve Jobs."]
true_entities = [[("Apple Inc.", "ORG"), ("Steve Jobs", "PERSON")]]
metrics = evaluate_ner_performance(test_texts, true_entities)
print("π NER Performance:")
for metric, value in metrics.items():
print(f"{metric:10}: {value:.3f}")
# OUTPUT:
# NER Performance:
# precision : 1.000
# recall    : 1.000
# f1        : 1.000
Key Takeaways
TEXT ANALYSIS SUMMARY

KEY CONCEPTS
├── POS tagging identifies grammatical roles
├── NER extracts meaningful entities
├── Combined analysis provides rich features
├── Domain-specific models improve accuracy
└── Validation is crucial for reliability

PRACTICAL APPLICATIONS
├── Information extraction from documents
├── Enhanced text classification
├── Question answering systems
├── Content analysis and summarization
└── Knowledge graph construction

NEXT STEPS
├── Experiment with different models
├── Build domain-specific entity recognizers
├── Integrate with text classification
├── Explore advanced linguistic features
└── Combine with machine learning pipelines

Related Topics:
- Text Preprocessing: Clean and prepare text for analysis
- Text Vectorization: Convert text to numerical representations
- Embeddings & Semantic Similarity: Advanced vector representations
- Transformers & Attention: Modern NLP architectures