
Hash Tables — O(1) Access That Powers Everything

A scroll-driven visual deep dive into hash tables. From hash functions to collision resolution (chaining vs open addressing) to load factors — understand the data structure behind caches, databases, and deduplication.

Introduction

🗝️

Any key. Any value.
O(1) average time.

A hash table maps keys to values using a hash function that converts any key into an array index. Insert, lookup, and delete all run in O(1) average time. It’s the most-used data structure in software engineering — powering Python dicts, JavaScript objects, database indexes, caches, and more. The catch? Collisions, and understanding when O(1) becomes O(n).
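That whole contract fits in a few lines of Python, since the built-in dict is itself a hash table (the keys and values below are just illustrative):

```python
# Insert, lookup, and delete on a dict are all O(1) average time,
# no matter how many entries it holds.
cache = {}

cache["alice"] = 42          # insert: hash("alice") picks the slot
cache["bob"] = 28

print(cache["alice"])        # lookup → 42
print("carol" in cache)      # membership test → False

del cache["bob"]             # delete
print(len(cache))            # → 1
```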

↓ Scroll to understand the data structure you use 100 times a day

Hash Functions

From Key to Index in One Step

[Diagram] Keys "alice", "bob", "charlie", "dave" → hash(key) → % table_size → index 0..7 → buckets (array) [0]..[7], holding "alice" → 42, "bob" → 28, "charlie" → 35, "dave" → 19
A hash function compresses any key into a fixed-size array index
Hash Math

What makes a good hash function

🔑 Key-to-Index Mapping

A hash function takes any key (string, number, object) and returns a fixed-size integer. The modulo operator then maps it to a valid array index from 0 to table_size - 1.

index = hash(key) % table_size
🎯 Deterministic

The same key must ALWAYS produce the same hash value. If hash('alice') returned different values on different calls, lookups would fail silently.

📊 Uniform Distribution

Hashes should spread evenly across all buckets. If many keys cluster into the same bucket, performance degrades from O(1) to O(n) — defeating the purpose of hashing.

🌊 Avalanche Effect

A tiny change in the key should produce a completely different hash. 'alice' and 'alicf' must map to unrelated indices — this prevents clustering and ensures uniform distribution.
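All three properties can be checked against a toy hash. The sketch below uses 32-bit FNV-1a, a real and very simple string hash, rather than Python's built-in hash(), which is salted per process:

```python
# 32-bit FNV-1a: deterministic, reasonably uniform, with a good
# avalanche effect for such a simple function.
def fnv1a(key: str) -> int:
    h = 2166136261                       # FNV offset basis
    for byte in key.encode():
        h ^= byte                        # mix in one byte...
        h = (h * 16777619) & 0xFFFFFFFF  # ...then multiply by the FNV prime
    return h

table_size = 8

# Deterministic: the same key always lands in the same bucket.
assert fnv1a("alice") % table_size == fnv1a("alice") % table_size

# Avalanche: a one-character change produces an unrelated hash.
print(f"alice → bucket {fnv1a('alice') % table_size}")
print(f"alicf → bucket {fnv1a('alicf') % table_size}")
assert fnv1a("alice") != fnv1a("alicf")
```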

🟢 Knowledge Check (Quick Check)

You have a hash table of size 8. Keys 'cat', 'car', 'cap' all hash to index 3. What is this called, and what does it do to performance?

Chaining

Collision Resolution: Chaining vs Open Addressing

Chaining
[0] → ∅   [1] cat → car → cap   [2] dog   [3] → ∅
✓ Simple, never "full"
✓ Degrades gracefully
✗ Extra memory for pointers
✗ Cache-unfriendly (linked list)
Used by: Java HashMap, Go map

Open Addressing
[0] empty   [1] cat (hash→1)   [2] car (hash→1, probe→2)   [3] cap (hash→1, probe→3)
✓ Cache-friendly (contiguous memory)
✓ No pointer overhead
✗ Clustering under high load
✗ Table can become "full"
Used by: Python dict, Rust HashMap
Two strategies to handle collisions: linked lists per bucket (chaining) vs probing for empty slots (open addressing)
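Here is a minimal sketch of the chaining strategy. The class and method names are illustrative, not any library's API: each bucket is a Python list of (key, value) pairs that collided into that slot.

```python
class ChainedHashTable:
    """Minimal hash table using chaining for collision resolution."""

    def __init__(self, size: int = 8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # empty slot or collision: append

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:  # walk the chain
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable(size=4)              # tiny table to force collisions
t.put("cat", 1); t.put("car", 2); t.put("cap", 3)
print(t.get("car"))  # → 2, even if all three keys share one bucket
```

If every key landed in one bucket, get() would scan the whole chain, which is exactly the O(n) worst case described above.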
🟡 Knowledge Check (Checkpoint)

Python's dict uses open addressing with a probe sequence. If you insert 1 million keys into a dict with 1.3 million slots, what is the expected number of probes per lookup?

Load Factor

Load Factor: The Performance Knob

Load factor and resizing

📊 Load Factor (α)

The load factor measures how full the table is — the ratio of occupied slots to total slots. It's the single most important metric for hash table performance.

α = n / m (keys / table size)
🟢 Low Load (α < 0.5)

Most lookups find an empty slot immediately with just 1 probe. Fast access, but over half the table sits empty — wasting memory.

🎯 Sweet Spot (α ≈ 0.75)

The ideal balance of speed and memory — 2-3 probes on average. Java's HashMap resizes at α = 0.75, Python's dict at α ≈ 0.67.

🔴 Overloaded (α → 1.0)

Probe counts skyrocket as almost every slot is occupied. At α = 0.95, linear probing needs ~20 probes on average — performance collapses.

📐 Resize (2× Table)

When α exceeds the threshold, allocate a table twice the size and rehash ALL keys. Each resize is O(n), but amortized over n inserts it's O(1) per insert.
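The resize rule can be sketched with open addressing and linear probing. The class and the 0.75 threshold are illustrative (real implementations pick their own thresholds, as noted above):

```python
class ProbingHashTable:
    """Open addressing with linear probing; doubles the table
    when the load factor would exceed 0.75."""

    MAX_LOAD = 0.75

    def __init__(self, size: int = 8):
        self.slots = [None] * size        # each slot: None or (key, value)
        self.count = 0

    def put(self, key, value) -> None:
        if (self.count + 1) / len(self.slots) > self.MAX_LOAD:
            self._resize()                # O(n) now, amortized O(1)
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots) # probe the next slot
        if self.slots[i] is None:
            self.count += 1
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:  # a None slot means "not present"
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)

    def _resize(self) -> None:
        old = [s for s in self.slots if s is not None]
        self.slots = [None] * (len(self.slots) * 2)   # 2× table
        self.count = 0
        for k, v in old:                  # rehash ALL keys into the new table
            self.put(k, v)

t = ProbingHashTable()
for n in range(20):                       # triggers two resizes: 8 → 16 → 32
    t.put(f"key{n}", n)
print(t.get("key7"))                      # → 7
```

Because the load factor never exceeds 0.75, the probe loops are guaranteed to hit an empty slot, which is what keeps lookups fast.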

Real-World Uses

Hash Tables Are Everywhere

🗄️ Databases
• Hash indexes for O(1) row lookup
• Join operations (hash join)
• GROUP BY aggregation
PostgreSQL, MySQL, Redis

⚡ Caching
• Redis: in-memory key-value store
• Memcached: distributed cache
• CDN edge caching (URL → content)
Cloudflare, Akamai, Fastly

🔍 Deduplication
• URL dedup in web crawlers
• File dedup in backup systems
• Bloom filters (probabilistic hash)
Google Crawler, ZFS, Git

🔐 Security
• Password hashing (bcrypt, argon2)
• File integrity (SHA-256 checksums)
• Digital signatures / certificates
SSL/TLS, Git commit hashes
Hash tables power the most critical systems in computing
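The deduplication pattern is especially compact in practice: a Python set is a hash table of keys only, so "have I seen this before?" is a single O(1) lookup. The URLs below are made up for illustration:

```python
# Hash-based URL deduplication, as a web crawler would do it.
seen = set()
crawl_queue = []

for url in ["https://a.example/", "https://b.example/",
            "https://a.example/",             # duplicate
            "https://c.example/"]:
    if url not in seen:      # O(1) hash lookup, not an O(n) scan
        seen.add(url)
        crawl_queue.append(url)

print(len(crawl_queue))      # → 3: the duplicate was skipped
```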
🔴 Knowledge Check (Challenge)

Git identifies every commit, tree, and blob using a SHA-1 hash. Why does Git use hashing instead of sequential IDs (like commit #1, #2, #3)?

🎓 What You Now Know

Hash tables map keys to values in O(1) average time — using hash(key) % table_size to compute an array index.

Collisions are inevitable — chaining uses linked lists per bucket; open addressing probes for empty slots. Both degrade to O(n) under pathological hashing.

Load factor α determines performance — resize at α ≈ 0.75 to keep probe counts low. Resizing is O(n) but amortized O(1).

Hash tables power databases, caches, dedup, and security — from Redis to Git to SSL certificates, hashing is the universal data access pattern.

When you need ordered keys, range queries, or worst-case O(log n), reach for a balanced BST or B-tree instead. ⚡
