
Text Preprocessing — Turning Messy Words into Clean Features

A scroll-driven visual deep dive into text preprocessing. Learn tokenization, stemming, lemmatization, stopword removal, and normalization — the essential first step of every NLP pipeline.

Introduction

🧹

Your model doesn’t see words.
It sees chaos.

“Running”, “runs”, “RAN”, and “runner” — are these the same thing or four different things? Without preprocessing, your NLP model has no idea. This is where every text pipeline begins.



Why Can’t Models Just Read Text?

❌ Raw Text (what the model sees): “I’m RUNNING to the store!!! Can’t wait 2 buy shoes 👟”
Mixed case • contractions • emojis • slang • punctuation • special chars

✓ Preprocessed Text (clean features): [“run”, “store”, “wait”, “buy”, “shoe”]
Lowercased • lemmatized • stopwords removed • clean tokens
The same sentence looks completely different to a machine before and after preprocessing

Step 1: Tokenization — Splitting Text into Pieces

• Word Tokenization: “I love NLP” → [“I”, “love”, “NLP”]. Simple split on spaces/punctuation. Most intuitive. Used in Bag-of-Words and TF-IDF.
• Character Tokenization: “NLP” → [“N”, “L”, “P”]. Every character is a token. Small vocabulary, but loses word meaning. Used in some deep learning models.
• Subword Tokenization (BPE / WordPiece): “unhappiness” → [“un”, “##happi”, “##ness”]. Best of both worlds: handles unknown words by breaking them into known pieces. ⭐ Used by BERT, GPT, and all modern transformers.
Three levels of tokenization — from words to characters to subwords
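Here is a minimal sketch of the three levels in Python. The word and character splits use only the standard library; the subword lines are commented out because they assume the Hugging Face `transformers` package and a pretrained WordPiece vocabulary, and the exact pieces depend on that vocabulary.

```python
import re

text = "I love NLP"

# Word-level: naive split on word characters (real tokenizers also handle
# contractions, hyphens, etc.)
word_tokens = re.findall(r"\w+", text)      # ['I', 'love', 'NLP']

# Character-level: every character becomes a token
char_tokens = list("NLP")                   # ['N', 'L', 'P']

# Subword-level: needs a trained BPE/WordPiece vocabulary, e.g. via Hugging Face
# transformers (uncomment if installed; exact pieces depend on the vocab):
# from transformers import AutoTokenizer
# subword_tokens = AutoTokenizer.from_pretrained("bert-base-uncased").tokenize("unhappiness")

print(word_tokens, char_tokens)
```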
🟡 Checkpoint: Knowledge Check

A spam filter trained on emails encounters the brand-new word 'cryptoscam'. With word-level tokenization, what happens?


Step 2: Normalization — Making Text Consistent

📝 Raw Text 'HeLLo!!!' → 🔡 Lowercase 'hello!!!' → ✂️ Remove Punct 'hello' → Clean Text 'hello'
Normalization steps transform raw text into a standard form
• Lowercasing: “Apple” → “apple”, “URGENT” → “urgent”. ⚠️ Sometimes destroys meaning: “US” (country) → “us” (pronoun).
• Punctuation Removal: “hello!!!” → “hello”, “it’s” → “its” or “it s”. ⚠️ Can break meaning: “$100” → “100”, “C++” → “C”.
• Number Handling: “2024” → “<NUM>” or removed, “$99.99” → “<PRICE>”. Numbers rarely help classification. ✓ Great for spam detection.
• Unicode / Encoding: “café” → “cafe”, “naïve” → “naive”. Especially important for multilingual search engines.
Common normalization techniques and when to use them
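A hedged sketch of these four steps using only the Python standard library; the regexes and the `<NUM>` placeholder are illustrative choices for this article's examples, not a standard recipe.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Unicode folding: 'café' -> 'cafe', 'naïve' -> 'naive'
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Lowercasing: 'Apple' -> 'apple' (beware: 'US' -> 'us')
    text = text.lower()
    # Number handling: replace digit runs with a placeholder token
    text = re.sub(r"\d+(\.\d+)?", "<NUM>", text)
    # Punctuation removal: keep word characters, whitespace, and the placeholder brackets
    text = re.sub(r"[^\w\s<>]", "", text)
    return text

print(normalize("HeLLo!!! The café costs $99.99"))   # 'hello the cafe costs <NUM>'
```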
🟡 Checkpoint: Knowledge Check

A search engine lowercases all queries and documents. A user searches for 'US elections'. What problem might occur?


Step 3: Stopword Removal — Cutting the Noise

Before: “The cat sat on the mat in the room” (9 tokens, with “the” appearing 3 times)
After: [“cat”, “sat”, “mat”, “room”] (4 tokens)
56% reduction, 100% of the meaning preserved
Stopwords dominate word counts but carry almost no meaning
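A small sketch of the filter. The stopword set here is hand-written for the example; in practice you would start from a standard list (NLTK, spaCy) and then customize it, for instance keeping negations like “not” for sentiment tasks.

```python
# Hand-written stopword set for this example; real lists (NLTK, spaCy) are much larger.
STOPWORDS = {"the", "a", "an", "on", "in", "of", "to", "is"}

tokens = ["the", "cat", "sat", "on", "the", "mat", "in", "the", "room"]
filtered = [t for t in tokens if t.lower() not in STOPWORDS]

print(filtered)   # ['cat', 'sat', 'mat', 'room'] -- 9 tokens down to 4
```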

Step 4: Stemming vs Lemmatization — Reducing Words to Roots

Input words: running, better, studies, flies

✂️ Stemming (Porter)
• running → “run” ✓
• better → “better” ✗ (can’t fix)
• studies → “studi” ✗ (not a word!)
• flies → “fli” ✗ (not a word!)
⚡ Fast: just strips suffixes. 🔧 No dictionary needed.

📖 Lemmatization
• running → “run” ✓
• better → “good” ✓
• studies → “study” ✓
• flies → “fly” ✓
🐢 Slower: needs a dictionary + POS tags. 📚 Requires WordNet or similar.
Stemming chops blindly; lemmatization understands grammar
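A sketch of the same comparison with NLTK, assuming the `nltk` package and its WordNet data are installed; note that the lemmatizer's output depends on the part-of-speech tag you pass.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# One-time setup:  import nltk; nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "better", "studies", "flies"]
print([stemmer.stem(w) for w in words])
# expected: ['run', 'better', 'studi', 'fli'] -- fast, but not always real words

# Lemmatization needs a POS hint: 'v' = verb, 'a' = adjective, 'n' = noun (the default)
print(lemmatizer.lemmatize("running", pos="v"))   # 'run'
print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
print(lemmatizer.lemmatize("studies"))            # 'study'
print(lemmatizer.lemmatize("flies"))              # 'fly'
```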
🟡 Checkpoint: Knowledge Check

A search engine needs to match 'running shoes' with 'run shoe'. Which approach is most appropriate?


The Complete Preprocessing Pipeline

📄 Raw Text (input document) → 🔡 Lowercase (normalize case) → ✂️ Tokenize (split into tokens) → 🧹 Remove Noise (punctuation, HTML, URLs) → 🚫 Stopwords (remove common words) → 🌱 Stem/Lemma (reduce to roots) → Clean Tokens (ready for ML!)
A typical NLP preprocessing pipeline — order matters!
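Below is a compact, dependency-free sketch of the pipeline in the order shown above. The stopword set and the toy suffix-stripper are stand-ins chosen so the example reproduces the article's running sentence; in real code you would swap in a proper stopword list and NLTK's PorterStemmer or a lemmatizer.

```python
import re
import unicodedata

# Tiny stopword set for this example (includes contraction remnants like "im", "cant")
STOPWORDS = {"the", "a", "an", "to", "in", "on", "of", "and", "im", "cant"}

def toy_stem(token: str) -> str:
    """Toy suffix-stripper standing in for a real stemmer/lemmatizer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            break
    if len(token) > 2 and token[-1] == token[-2]:    # 'runn' -> 'run'
        token = token[:-1]
    return token

def preprocess(text: str) -> list[str]:
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()  # fold unicode, drop emoji
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|<[^>]+>", " ", text)    # remove noise: URLs, HTML tags
    text = text.replace("'", "")                         # "can't" -> "cant"
    tokens = re.findall(r"[a-z]+", text)                 # tokenize (letters only; drops digits)
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [toy_stem(t) for t in tokens]                 # stem/lemma

print(preprocess("I'm RUNNING to the store!!! Can't wait 2 buy shoes"))
# ['run', 'store', 'wait', 'buy', 'shoe']
```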

Impact of preprocessing on vocabulary size

1. Raw corpus: 50,000 unique tokens. Includes 'Run', 'run', 'RUNNING', 'runs', 'runner', etc.
2. After lowercasing: ~35,000 tokens. A 30% reduction: 'Run' and 'run' collapse into one.
3. After stopword removal: ~34,500 tokens. A small vocabulary reduction but a huge frequency reduction: stopwords account for 40-50% of all word occurrences.
4. After stemming: ~20,000 tokens. A 60% total reduction: 'running', 'runs', and 'runner' all become 'run'.
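The exact counts depend on the corpus, but the effect is easy to measure on your own data by counting unique tokens after each stage. A minimal sketch (the three tiny documents and the stopword set are made up for illustration):

```python
import re

# Any list of raw documents; the 50,000 / 35,000 / 20,000 figures above are illustrative.
docs = ["I'm RUNNING to the store!!!", "Runners run. A runner ran!", "The store runs sales."]

STOPWORDS = {"the", "a", "an", "to", "i"}

raw      = {t for d in docs for t in d.split()}                          # raw tokens
lowered  = {t for d in docs for t in re.findall(r"[a-z]+", d.lower())}   # after lowercasing
filtered = {t for t in lowered if t not in STOPWORDS}                    # after stopword removal

print(len(raw), len(lowered), len(filtered))   # vocabulary shrinks at each stage
```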
🔴 Challenge: Knowledge Check

You're building a sentiment analysis system. Which preprocessing step is MOST DANGEROUS to apply blindly?

🎓 What You Now Know

Tokenization splits text into processable units — word, character, or subword. Modern models use subword (BPE/WordPiece) to handle unknown words.

Normalization standardizes text — lowercasing, punctuation removal, Unicode normalization. But each step can destroy useful information.

Stopword removal cuts noise — but negation words are stopwords too. Always customize for your task.

Stemming is fast but crude; lemmatization is accurate but slow — search engines prefer stemming; NER systems prefer lemmatization.

There’s no universal pipeline — preprocessing choices depend entirely on the downstream task, data characteristics, and scale requirements.

Text preprocessing is the unglamorous foundation that makes everything else work. Get it wrong, and no amount of model sophistication can save you. Get it right, and even simple models like Naive Bayes become surprisingly powerful. 🧹
