Text Preprocessing — Turning Messy Words into Clean Features
A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.
🧹
Your model doesn’t see words.
It sees chaos.
“Running”, “runs”, “RAN”, and “runner” — are these the same thing or four different things? Without preprocessing, your NLP model has no idea. This is where every text pipeline begins.
↓ Scroll to learn how raw text becomes clean features
Why Can’t Models Just Read Text?
Step 1: Tokenization — Splitting Text into Pieces
A spam filter trained on emails encounters the brand-new word 'cryptoscam'. With word-level tokenization, what happens?
💡 What happens when a word isn't in the dictionary the model was built on?
With word-level tokenization, the vocabulary is fixed at training time. A word the model has never seen becomes an 'unknown' token — all meaning is lost. This is catastrophic for a spam filter because 'cryptoscam' is a VERY strong spam signal! Subword tokenization (BPE) would split it into ['crypto', 'scam'] — both known tokens with useful information. This is why Gmail and modern systems use subword tokenizers.
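Here is a minimal sketch of that contrast in Python (the toy word-level vocabulary is made up, the pretrained tokenizer name is just one example, and the exact subword pieces depend on that tokenizer's learned vocabulary):

```python
# A minimal sketch of word-level vs. subword tokenization.
# Assumes the `transformers` package is installed; the toy vocabulary and the
# model name are illustrative, not from any real spam filter.
from transformers import AutoTokenizer

text = "claim your cryptoscam rewards now"

# Word-level: the vocabulary is fixed at training time, so unseen words
# collapse into a single <UNK> token and their meaning is lost.
train_vocab = {"claim", "your", "rewards", "now"}
word_tokens = [w if w in train_vocab else "<UNK>" for w in text.split()]
print(word_tokens)  # ['claim', 'your', '<UNK>', 'rewards', 'now'] -- the spam signal is gone

# Subword (WordPiece/BPE): an unseen word is split into known pieces instead
# of becoming a single unknown token. The exact pieces depend on the vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("cryptoscam"))  # a list of known subword pieces, not one <UNK>
```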
Step 2: Normalization — Making Text Consistent
A search engine lowercases all queries and documents. A user searches for 'US elections'. What problem might occur?
💡 Think about how common the word 'us' is in everyday English...
Lowercasing 'US' → 'us' loses the distinction between the country abbreviation and the pronoun. The pronoun 'us' appears in nearly every document, so search results get flooded with irrelevant matches. This is why production search engines use smarter normalization — they might keep known acronyms uppercase, or use context-aware casing. Models like BERT (in its cased variant) side-step the problem with subword tokenizers that preserve casing information.
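A small sketch of the trade-off in plain Python (the acronym whitelist is purely illustrative; production systems use far more robust, context-aware normalization):

```python
# Blind lowercasing folds the acronym 'US' into the pronoun 'us'.
# A simple whitelist of known acronyms is one crude workaround.
KNOWN_ACRONYMS = {"US", "UK", "EU", "AI"}  # illustrative, not a real production list

def normalize(token: str) -> str:
    # Keep whitelisted acronyms as-is, lowercase everything else.
    return token if token in KNOWN_ACRONYMS else token.lower()

query = "US elections and what it means for us"
print([t.lower() for t in query.split()])     # naive: 'US' and 'us' collide
print([normalize(t) for t in query.split()])  # whitelist keeps 'US' distinct
```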
Step 3: Stopword Removal — Cutting the Noise
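As a minimal sketch, basic stopword removal with NLTK's built-in English list looks like this (assuming nltk is installed and the stopword corpus has been downloaded):

```python
# Remove high-frequency function words using NLTK's English stopword list.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

tokens = "the cat sat on the mat".split()
print([t for t in tokens if t not in STOPWORDS])  # ['cat', 'sat', 'mat']
```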
Step 4: Stemming vs Lemmatization — Reducing Words to Roots
A search engine needs to match 'running shoes' with 'run shoe'. Which approach is most appropriate?
💡 What matters more at search engine scale: perfect word forms or speed?
Search engines process enormous query volumes under strict latency budgets. Stemming (like the Porter stemmer) reduces 'running' → 'run' and 'shoes' → 'shoe' — fast and good enough for matching. Lemmatization would produce the same output here but is typically several times slower because it needs dictionary lookups and part-of-speech information. At search-engine scale, that latency difference matters enormously. This is why Lucene and Elasticsearch ship stemmers, not lemmatizers, in their language analyzers.
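A minimal sketch of the contrast with NLTK (the WordNet corpus has to be downloaded for the lemmatizer; the outputs in the comments are what these two tools produce for these particular words):

```python
# Porter stemming vs. WordNet lemmatization, both via NLTK.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Regular forms: both approaches agree.
print(stemmer.stem("running"), stemmer.stem("shoes"))              # run shoe
print(lemmatizer.lemmatize("running", pos="v"),
      lemmatizer.lemmatize("shoes"))                               # run shoe

# Irregular forms: the crude stemmer misses 'ran', the lemmatizer maps it to 'run'.
print(stemmer.stem("ran"), lemmatizer.lemmatize("ran", pos="v"))   # ran run
```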
The Complete Preprocessing Pipeline
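Putting the four steps together, here is a minimal end-to-end sketch in Python using NLTK (the toy corpus and the resulting vocabulary are purely illustrative; real vocabulary counts like those shown below depend entirely on your data):

```python
# A minimal preprocessing pipeline: tokenize, lowercase, drop stopwords, stem.
# The one-sentence "corpus" is a stand-in to keep the example self-contained.
import re
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())        # crude tokenizer + lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [stemmer.stem(t) for t in tokens]            # reduce to stems

docs = ["The runners were running faster than the runner who ran yesterday."]
vocab = {tok for doc in docs for tok in preprocess(doc)}
print(vocab)  # {'runner', 'run', 'faster', 'ran', 'yesterday'} -- a shrunken vocabulary
```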
Impact of preprocessing on vocabulary size:
Raw corpus: 50,000 unique tokens
After lowercasing: ~35,000 tokens
After stopword removal: ~34,500 tokens
After stemming: ~20,000 tokens

You're building a sentiment analysis system. Which preprocessing step is MOST DANGEROUS to apply blindly?
💡 What happens to 'not good' if you remove 'not'?
'This movie is not good' after stopword removal becomes 'movie good' — the exact opposite sentiment! Negation words ('not', 'no', 'never', 'hardly', 'barely') are typically in stopword lists, but they are critical for sentiment analysis. Similarly, 'I don't like it' becomes 'like' — positive! Always customize your stopword list for your specific task. Some teams use 'sentiment-aware' stopword lists that keep negation words.
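One common fix, sketched here with NLTK (the negation set is illustrative rather than exhaustive):

```python
# A sentiment-aware stopword list: start from NLTK's English list,
# then put the negation words back in.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

NEGATIONS = {"not", "no", "never", "nor", "hardly", "barely", "don't", "doesn't", "isn't"}
DEFAULT_STOPWORDS = set(stopwords.words("english"))
SENTIMENT_STOPWORDS = DEFAULT_STOPWORDS - NEGATIONS

tokens = "this movie is not good".split()
print([t for t in tokens if t not in DEFAULT_STOPWORDS])    # ['movie', 'good'] -- sentiment flipped
print([t for t in tokens if t not in SENTIMENT_STOPWORDS])  # ['movie', 'not', 'good'] -- negation kept
```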
🎓 What You Now Know
✓ Tokenization splits text into processable units — word, character, or subword. Modern models use subword (BPE/WordPiece) to handle unknown words.
✓ Normalization standardizes text — lowercasing, punctuation removal, Unicode normalization. But each step can destroy useful information.
✓ Stopword removal cuts noise — but negation words are stopwords too. Always customize for your task.
✓ Stemming is fast but crude; lemmatization is accurate but slow — search engines prefer stemming, while tasks that need real dictionary forms (question answering, text generation) lean on lemmatization.
✓ There’s no universal pipeline — preprocessing choices depend entirely on the downstream task, data characteristics, and scale requirements.
Text preprocessing is the unglamorous foundation that makes everything else work. Get it wrong, and no amount of model sophistication can save you. Get it right, and even simple models like Naive Bayes become surprisingly powerful. 🧹
↗ Keep Learning
Bag of Words & TF-IDF — How Search Engines Ranked Before AI
A scroll-driven visual deep dive into Bag of Words and TF-IDF. Learn how documents become vectors, why term frequency alone fails, and how IDF rescues relevance — the backbone of search before neural models.
Naive Bayes — Why 'Stupid' Assumptions Work Brilliantly
A scroll-driven visual deep dive into Naive Bayes. Learn Bayes' theorem, why the 'naive' independence assumption is wrong but works anyway, and why it dominates spam filtering.
Text Classification — Teaching Machines to Sort Your Inbox
A scroll-driven visual deep dive into text classification. From spam filters to Gmail's categories — learn how ML models read text, extract features, and assign labels at scale.