Named Entity Recognition — Teaching Machines to Find Names in Text
A scroll-driven visual deep dive into Named Entity Recognition (NER). From rule-based to CRF to transformer-based approaches — learn how search engines and email services extract people, places, companies, and dates from unstructured text.
📍
Your inbox knows
who, what, where, and when.
“Meeting with Dr. Sarah Chen at Google HQ in Mountain View on March 15th at 2pm.” Gmail extracts every entity — person, organization, location, date — and auto-creates a calendar event. That’s Named Entity Recognition: finding structured facts in unstructured text.
↓ Scroll to learn how machines find names in a sea of words
What Are Named Entities?
Three Eras of NER
A rule-based NER system uses capitalization to find names: 'Capitalized words after periods are likely names.' It processes: 'I love Paris. Great city.' What happens?
💡 What happens to every word that starts a new sentence?
This is why rule-based NER is brittle: 'Great' is capitalized because it starts a sentence after a period — not because it's a named entity. The rule 'capitalized word = entity' produces massive false positives at sentence boundaries. CRF and neural models learn to distinguish sentence-initial capitalization from entity capitalization by also considering: (1) the word's frequency, (2) whether it appears capitalized mid-sentence elsewhere, (3) surrounding context. This single example shows why ML-based NER replaced rules.
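A minimal sketch of that capitalization rule (purely illustrative Python, not a real system) makes the failure concrete:

```python
import re

def rule_based_ner(text):
    """Toy rule from above: treat every capitalized word as an entity candidate."""
    # Flags any word starting with an uppercase letter, including words that are
    # capitalized only because they begin a new sentence after a period.
    return re.findall(r"\b[A-Z][a-z]+\b", text)

print(rule_based_ner("I love Paris. Great city."))
# ['Paris', 'Great']  <- 'Great' is a false positive: it is capitalized only
#                        because it follows a sentence boundary.
```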
The BIO Tagging Scheme
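As a concrete illustration, here is the Gmail sentence from the intro hand-tagged with BIO labels (one reasonable annotation; a sketch, not output from any model):

```python
# One BIO label per token: B- starts an entity, I- continues it, O is outside.
tokens = ["Dr.", "Sarah", "Chen", "at", "Google", "HQ", "in",
          "Mountain", "View", "on", "March", "15th"]
labels = ["O", "B-PER", "I-PER", "O", "B-ORG", "O", "O",
          "B-LOC", "I-LOC", "O", "B-DATE", "I-DATE"]

for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```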
CRF: Why Label Dependencies Matter
CRF: jointly optimizing the full label sequence
Independent model: P(tag₁) × P(tag₂) × ... × P(tagₙ)
CRF model: P(tag₁, tag₂, ..., tagₙ | x₁, x₂, ..., xₙ)
Score(y|x) = Σᵢ [emission(yᵢ, xᵢ) + transition(yᵢ₋₁, yᵢ)]
transition(B-PER → I-PER) = HIGH
transition(I-PER → I-LOC) = IMPOSSIBLE

A simple NER model (without CRF) labels 'New York City' as: New=B-LOC, York=B-LOC, City=O. What went wrong, and how does CRF fix it?
💡 What's the difference between B-LOC → B-LOC and B-LOC → I-LOC?
Without a CRF layer, each token is labeled independently. The model sees 'New' (capitalized, location-like) → B-LOC. Then 'York' (capitalized, location-like) → B-LOC again, starting a NEW entity instead of continuing. 'City' (common word) → O. This fragments 'New York City' into two entities: 'New' and 'York.' A CRF adds transition scores: P(I-LOC | B-LOC) >> P(B-LOC | B-LOC), so the optimal sequence becomes B-LOC, I-LOC, I-LOC — correctly recognizing the full 3-word entity.
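A tiny self-contained sketch (hand-set toy scores, not a trained model) shows how transition scores pull Viterbi decoding toward B-LOC, I-LOC, I-LOC for 'New York City', while per-token decoding fragments it:

```python
# Toy CRF decoding for "New York City". Emission and transition scores are
# hand-set for illustration only.
TAGS = ["O", "B-LOC", "I-LOC"]

emissions = {                         # how much each word "looks like" each tag
    "New":  {"O": 0.1, "B-LOC": 2.0, "I-LOC": 0.5},
    "York": {"O": 0.1, "B-LOC": 2.0, "I-LOC": 1.5},
    "City": {"O": 1.0, "B-LOC": 0.2, "I-LOC": 0.8},
}
transitions = {                       # score for tag_prev -> tag_curr
    ("<s>", "O"): 0.0, ("<s>", "B-LOC"): 0.0, ("<s>", "I-LOC"): -1e4,
    ("O", "O"): 0.5,   ("O", "B-LOC"): 0.5,   ("O", "I-LOC"): -1e4,  # I- can't follow O
    ("B-LOC", "O"): 0.0, ("B-LOC", "B-LOC"): -1.0, ("B-LOC", "I-LOC"): 2.0,
    ("I-LOC", "O"): 0.0, ("I-LOC", "B-LOC"): -1.0, ("I-LOC", "I-LOC"): 1.5,
}

def independent_decode(words):
    """No CRF: pick the best emission for each word on its own."""
    return [max(TAGS, key=lambda t: emissions[w][t]) for w in words]

def viterbi_decode(words):
    """CRF-style decoding: best whole sequence under emission + transition scores."""
    score = {t: transitions[("<s>", t)] + emissions[words[0]][t] for t in TAGS}
    backpointers = []
    for w in words[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[prev] + transitions[(prev, t)] + emissions[w][t]
            pointers[t] = prev
        score, backpointers = new_score, backpointers + [pointers]
    best = max(TAGS, key=score.get)
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

words = ["New", "York", "City"]
print(independent_decode(words))  # ['B-LOC', 'B-LOC', 'O']     -> fragmented
print(viterbi_decode(words))      # ['B-LOC', 'I-LOC', 'I-LOC'] -> one 3-word entity
```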
Modern NER: BERT + CRF
The sentence is: 'I ate an apple while reading about Apple on my Apple Watch.' How should a BERT-based NER model label each occurrence of 'apple'?
💡 Does BERT produce the same vector for 'apple' in 'apple pie' and 'Apple Inc.'?
This is exactly what contextual embeddings solve. In a non-contextual model (Word2Vec), 'apple' always has the SAME vector regardless of context. But BERT produces DIFFERENT embeddings for each occurrence: 'ate an apple' context → fruit → O tag. 'reading about Apple' context → company → B-ORG. 'Apple Watch' context → brand + product → B-ORG I-ORG. This context-dependent representation is why BERT revolutionized NER.
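As a rough sketch of what this looks like in practice, a pretrained BERT-based NER model can be run through the Hugging Face transformers pipeline. The checkpoint named below is just one publicly available example, and the exact labels and scores depend on that model:

```python
# Sketch: running a pretrained BERT-based NER model via Hugging Face transformers.
# Assumes `pip install transformers torch`; dslim/bert-base-NER is one public
# BERT checkpoint fine-tuned on CoNLL-2003, used here purely as an example.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",   # merge B-/I- word pieces into whole entities
)

text = "I ate an apple while reading about Apple on my Apple Watch."
for entity in ner(text):
    print(entity["word"], entity["entity_group"], round(entity["score"], 2))

# Expected shape of the output (exact labels/scores vary by model):
#   the capitalized 'Apple' mentions come back as entities (e.g. ORG/MISC),
#   while the lowercase 'apple' (the fruit) is not tagged at all.
```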
Real-World NER Applications
Your NER system tags 'Apple' as ORG and 'Cupertino' as LOC correctly. But for 'Jordan played basketball,' it tags Jordan as LOC instead of PER. What's the core problem and fix?
💡 The same word can refer to different entity types depending on context — what resolves this?
Entity ambiguity is one of NER's hardest problems. 'Washington' can be a person, state, city, or university. 'Amazon' can be company, river, or streaming service. NER labels span + type, but entity linking (disambiguation) resolves WHICH entity. Modern systems use a pipeline: NER extracts mentions → entity linking matches each to a knowledge base using context. Wikipedia-based entity linking achieves ~90% accuracy by encoding context and candidate descriptions with cross-encoders.
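A deliberately tiny sketch of the linking step (toy candidates and bag-of-words overlap instead of a real knowledge base and cross-encoder) shows the shape of the pipeline: NER hands over a mention plus its sentence context, and the linker ranks candidate entities by how well their descriptions match that context:

```python
# Toy entity linking: rank knowledge-base candidates for a mention by how much
# their description overlaps with the mention's context. Real systems use a
# knowledge base such as Wikipedia and a trained cross-encoder instead.
CANDIDATES = {
    "Jordan": [
        ("Michael Jordan (PER)", "american basketball player chicago bulls nba"),
        ("Jordan (LOC)", "country middle east amman river kingdom"),
    ],
}

def link(mention, context):
    context_words = set(context.lower().split())
    def overlap(candidate):
        _, description = candidate
        return len(context_words & set(description.split()))
    return max(CANDIDATES[mention], key=overlap)[0]

print(link("Jordan", "Jordan played basketball for the Bulls in the NBA"))
# -> 'Michael Jordan (PER)'
```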
🎓 What You Now Know
✓ NER is sequence labeling, not classification — every token gets a label (B-PER, I-PER, O, etc.), making it fundamentally harder than document-level tasks.
✓ BIO tagging handles multi-word entities — B marks where an entity starts, I marks continuation, O marks non-entities.
✓ CRFs model label dependencies — preventing invalid sequences like I-PER following B-LOC, and helping multi-word entities stay together.
✓ BERT revolutionized NER — contextual embeddings disambiguate “apple” (fruit) vs “Apple” (company) based on surrounding words, eliminating the need for hand-crafted features.
✓ NER powers real products — Gmail auto-creating calendar events, Google’s Knowledge Graph, financial contract analysis, and clinical text mining all rely on NER.
Named Entity Recognition is how machines extract structure from chaos. Every email auto-categorized, every knowledge panel shown, every medical record digitized — NER is working quietly behind the scenes. 📍
↗ Keep Learning
Text Preprocessing — Turning Messy Words into Clean Features
A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.
Text Classification — Teaching Machines to Sort Your Inbox
A scroll-driven visual deep dive into text classification. From spam filters to Gmail's categories — learn how ML models read text, extract features, and assign labels at scale.
Word Embeddings — When Words Learned to Be Vectors
A scroll-driven visual deep dive into word embeddings. Learn how Word2Vec, GloVe, and FastText turn words into dense vectors where meaning becomes geometry — and why 'king - man + woman = queen' actually works.
Query Understanding — What Did the User Actually Mean?
A scroll-driven visual deep dive into query understanding. From spell correction to query expansion to intent classification — learn how search engines interpret ambiguous, misspelled, and complex queries.