Hash Tables — O(1) Access That Powers Everything
A scroll-driven visual deep dive into hash tables. From hash functions to collision resolution (chaining vs open addressing) to load factors — understand the data structure behind caches, databases, and deduplication.
🗝️
Any key. Any value.
O(1) average time.
A hash table maps keys to values using a hash function that converts any key into an array index. Insert, lookup, and delete all run in O(1) average time. It’s the most-used data structure in software engineering — powering Python dicts, JavaScript objects, database indexes, caches, and more. The catch? Collisions, and understanding when O(1) becomes O(n).
↓ Scroll to understand the data structure you use 100 times a day
From Key to Index in One Step
What makes a good hash function
A hash function takes any key (string, number, object) and returns a fixed-size integer. The modulo operator then maps it to a valid array index from 0 to table_size - 1.
index = hash(key) % table_size

Deterministic: the same key must ALWAYS produce the same hash value. If hash('alice') returned different values on different calls, lookups would fail silently.
Uniform: hashes should spread evenly across all buckets. If many keys cluster into the same bucket, performance degrades from O(1) to O(n) — defeating the purpose of hashing.
Avalanche: a tiny change in the key should produce a completely different hash. 'alice' and 'alicf' must map to unrelated indices — this prevents clustering and ensures uniform distribution.
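To make these properties concrete, here's a toy djb2-style string hash in Python (a sketch for illustration; production tables use stronger, word-sized hashes):

```python
def djb2(key: str) -> int:
    """Toy djb2-style string hash: deterministic, and each character
    feeds a multiply-and-add step, so small key changes cascade."""
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep it a 32-bit value
    return h

TABLE_SIZE = 8
for key in ("alice", "alicf"):
    print(key, "->", djb2(key) % TABLE_SIZE)  # same key, same index, every run
```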
You have a hash table of size 8. Keys 'cat', 'car', 'cap' all hash to index 3. What is this called, and what does it do to performance?
💡 What happens when multiple keys land in the same array slot?
A collision happens when hash(key1) % size == hash(key2) % size. With 3 keys in one bucket, a lookup must check all 3 entries (compare keys for equality) instead of jumping directly to the value. If ALL n keys collide into one bucket, the hash table becomes a linked list with O(n) operations. This is why hash function quality matters: a bad hash function that clusters keys destroys performance.
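As a sketch of the chaining strategy covered next, here's a minimal table where each bucket is a Python list of (key, value) pairs (the class and method names are illustrative, not how Python's real dict is built):

```python
class ChainedHashTable:
    """Minimal separate-chaining table: colliding keys share a bucket list."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key exists: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # a collision just lengthens this chain

    def get(self, key):
        # Cost is the chain length: O(1) on average, O(n) if everything collides.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```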
Collision Resolution: Chaining vs Open Addressing
Python's dict uses open addressing with a probe sequence. If you insert 1 million keys into a dict with 1.3 million slots, what is the expected number of probes per lookup?
💡 Think about the load factor (number of keys / number of slots)...
The load factor α = n/m determines performance. At α = 0.77, an unsuccessful lookup needs about 1/(1-α) ≈ 4.3 probes even under idealized uniform probing, and linear probing fares worse because of primary clustering. Python's perturbation-based probe sequence scatters probes across the table, achieving ~1.5-2 probes per lookup. This is why Python resizes dicts when α exceeds ~0.67 — keeping probe counts low. The key insight: load factor, not raw table size, determines performance.
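For the curious, here's a simplified sketch of that perturbation scheme, adapted from CPython's dictobject.c (details trimmed): each step folds in more of the hash's high bits, so probes jump around rather than marching linearly.

```python
def probe_sequence(hash_value: int, mask: int):
    """Yield probe indices the way CPython's dict does (simplified).
    mask is table_size - 1 for a power-of-two table."""
    perturb = hash_value
    i = hash_value & mask
    while True:
        yield i
        perturb >>= 5                      # PERTURB_SHIFT in CPython
        i = (i * 5 + perturb + 1) & mask   # high hash bits steer the jumps
```

The first probe is just hash & mask; later probes fire only on collision, which is why a low load factor keeps typical lookups near 1 probe.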
Load Factor: The Performance Knob
Load factor and resizing
The load factor measures how full the table is — the ratio of occupied slots to total slots. It's the single most important metric for hash table performance.
α = n / m (keys / table size)

Low α: most lookups find an empty slot immediately with just 1 probe. Fast access, but over half the table sits empty — wasting memory.
Moderate α: the ideal balance of speed and memory — 2-3 probes on average. Java's HashMap resizes at α = 0.75, Python's dict at α ≈ 0.67.
High α: probe counts skyrocket as almost every slot is occupied. At α = 0.95, even idealized uniform probing needs ~1/(1-α) = 20 probes per unsuccessful lookup, and linear probing needs far more — performance collapses.
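A back-of-envelope check of those numbers, using the classic uniform-probing estimate for an unsuccessful lookup (linear probing grows faster than this):

```python
# E[probes] ≈ 1 / (1 - α) under idealized uniform probing (Knuth's estimate)
for alpha in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"α = {alpha:.2f} -> ~{1 / (1 - alpha):4.1f} probes")
# α = 0.50 yields ~2 probes; α = 0.95 yields ~20: full tables collapse
```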
When α exceeds the threshold, allocate a table twice the size and rehash ALL keys. Each resize is O(n), but amortized over n inserts it's O(1) per insert.
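Extending the chaining sketch from earlier, a grow-and-rehash step might look like this (the threshold and helper name are illustrative, not any library's API):

```python
LOAD_FACTOR_LIMIT = 0.75  # illustrative; Java uses 0.75, CPython ~0.67

def put_and_maybe_resize(table, key, value):
    """Insert into a ChainedHashTable, doubling the table when α gets high."""
    table.put(key, value)
    n = sum(len(bucket) for bucket in table.buckets)
    if n / len(table.buckets) > LOAD_FACTOR_LIMIT:
        old = table.buckets
        table.buckets = [[] for _ in range(2 * len(old))]  # double the slots
        for bucket in old:
            for k, v in bucket:  # rehash EVERY key: O(n) now, amortized O(1)
                table.buckets[hash(k) % len(table.buckets)].append((k, v))
```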
Hash Tables Are Everywhere
Git identifies every commit, tree, and blob using a SHA-1 hash. Why does Git use hashing instead of sequential IDs (like commit #1, #2, #3)?
💡 Think about what makes Git work without a central server...
Git is a content-addressable storage system: hash(content) → address. This gives three superpowers: (1) Deduplication — identical files share a hash and are stored once, (2) Integrity — if a single bit flips, the hash changes, detecting corruption, (3) Distributed — any developer can generate unique identifiers without a central server. Sequential IDs would require coordination between clones, which contradicts Git's distributed model.
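You can reproduce a blob's address with nothing but Python's hashlib: Git hashes a short header plus the file's raw bytes. This sketch covers only the blob case (trees and commits add more structure):

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """SHA-1 address Git assigns to file contents: sha1(b'blob <len>\\0' + bytes)."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Identical content always hashes to the identical address, in every clone:
print(git_blob_hash(b"hello world\n"))  # matches `git hash-object` on the same bytes
```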
🎓 What You Now Know
✓ Hash tables map keys to values in O(1) average time — using hash(key) % table_size to compute an array index.
✓ Collisions are inevitable — chaining uses linked lists per bucket; open addressing probes for empty slots. Both degrade to O(n) under pathological hashing.
✓ Load factor α determines performance — resize at α ≈ 0.75 to keep probe counts low. Resizing is O(n) but amortized O(1).
✓ Hash tables power databases, caches, dedup, and security — from Redis to Git to SSL certificates, hashing is the universal data access pattern.
When you need ordered keys, range queries, or worst-case O(log n), reach for a balanced BST or B-tree instead. ⚡
↗ Keep Learning
String Manipulation — Patterns, Hashing, and Sliding Windows
A scroll-driven visual deep dive into string algorithms. From brute-force pattern matching to KMP, Rabin-Karp hashing, and sliding window techniques — with real-world applications in search engines, DNA analysis, and parsers.
Tries — Prefix Trees for Autocomplete, Spell Check, and IP Routing
A scroll-driven visual deep dive into trie data structures. From basic prefix trees to compressed tries and radix trees — with real-world applications in autocomplete, spell checking, IP routing tables, and T9 keyboard input.
Caching — The Art of Remembering What's Expensive to Compute
A visual deep dive into caching. From CPU caches to CDNs — understand cache strategies, eviction policies, and the hardest problem in computer science: cache invalidation.
Linked Lists — Pointers, Flexibility, and the Tradeoffs vs Arrays
A scroll-driven visual deep dive into linked lists. Singly, doubly, and circular variants. Understand pointer manipulation, O(1) insertion/deletion, and real-world uses in LRU caches, OS schedulers, and undo systems.