Database Sharding — Scaling Beyond One Machine
A visual deep dive into database sharding. From single-server bottlenecks to consistent hashing — understand how companies scale their databases to billions of rows.
🪓
One database can’t hold
the entire internet.
Instagram has 2+ billion users. Twitter handles 500 million tweets per day.
No single database server can hold all that data or handle all those queries.
Sharding splits your data across multiple machines — and it’s how every large-scale system works.
The Single-Server Bottleneck
Every database starts on one server. As your app grows, you hit hard limits: CPU saturates under query load, RAM stops fitting the working set, disk I/O maxes out, and connection counts outgrow what one box can handle.
Two Ways to Scale
Scale up (vertical) means a bigger machine: simple, but it hits a hardware ceiling. Scale out (horizontal) means more machines: no ceiling, but your data must be split across them. Sharding is horizontal scaling for your database.
Your startup has 100K users and your PostgreSQL server is at 40% CPU. Should you shard your database?
💡 Consider the trade-off: sharding adds complexity. Is the benefit worth it at this scale?
At 40% CPU with 100K users, you have plenty of headroom. Add indexes, optimize queries, add read replicas, upgrade the server. Sharding introduces enormous complexity (cross-shard joins, data migration, operational overhead) and should only be done when you've exhausted simpler options — typically at millions of users or terabytes of data.
Choosing the Right Shard Key
The shard key determines which machine stores each row. It’s the most important decision in sharding — and it’s very hard to change later.
You're building a social media app. Most queries are 'get posts by user X'. Which shard key is best?
💡 Choose the key that matches your most common query pattern...
If most queries fetch posts by a specific user, sharding by user_id ensures all of that user's posts are on the same shard — no cross-shard queries needed. Timestamp would cause 'hot shard' problems (recent timestamps all go to one shard). Random UUIDs distribute evenly but require scatter-gather for user queries.
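The routing logic described above can be sketched in a few lines of Python. The shard count, function name, and choice of md5 are all illustrative, not prescriptive:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(user_id: str) -> int:
    """Map a user_id to a shard. Hashing the user_id means all of
    that user's posts land on the same shard."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SHARDS

# Every read and write for this user hits exactly one shard:
print(f"user_42 -> shard {shard_for('user_42')}")
```

Because the function is deterministic, `'get posts by user X'` never needs to fan out across shards.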
Sharding Strategies: Range vs Hash
Range sharding gives each shard a contiguous slice of the key space (users A-M on shard 1, N-Z on shard 2): range scans stay cheap, but popular keys pile onto one shard. Hash sharding runs each key through a hash function first: load spreads evenly, but range queries now have to hit every shard.
You shard using hash(user_id) % 4 across 4 shards. Now you want to add a 5th shard. What happens?
💡 Try it: hash=7. 7%4=3, 7%5=2. Different shard! Now imagine this for billions of keys...
When N changes from 4 to 5, hash(key) % 4 vs hash(key) % 5 produces different results for ~80% of keys. This means massive data migration across all shards. This is why simple modulo sharding is dangerous — use consistent hashing instead, which minimizes key movement when adding/removing shards.
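You can verify that ~80% figure directly. A minimal sketch, using each key's integer value as its own hash for simplicity:

```python
def moved_fraction(old_n: int, new_n: int, keys: int = 100_000) -> float:
    """Fraction of keys whose shard assignment changes when the
    shard count goes from old_n to new_n under modulo sharding."""
    moved = sum(1 for k in range(keys) if k % old_n != k % new_n)
    return moved / keys

print(f"{moved_fraction(4, 5):.0%} of keys change shard")  # 80% here
```

A key stays put only when its value gives the same remainder mod 4 and mod 5, which happens for just 4 of every 20 keys, so 80% must migrate.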
Consistent Hashing: The Solution to Rebalancing
Regular hash % N is catastrophic when N changes. Consistent hashing solves this by placing both keys and servers on a virtual ring: each key lives on the first server found moving clockwise from its position, so adding or removing a server only remaps the keys in one arc.
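Here is a minimal consistent-hash ring in Python, without the virtual nodes a production system would add. The server names, key format, and md5 choice are illustrative:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring (no virtual nodes, for clarity)."""

    def __init__(self, servers):
        # Each server occupies a point on the ring at its hash value.
        self._points = sorted((_hash(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        # A key belongs to the first server clockwise from its position;
        # the modulo wraps around past the top of the ring.
        idx = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[idx % len(self._points)][1]

servers = [f"server-{i}" for i in range(100)]
keys = [f"user-{i}" for i in range(10_000)]

old_ring = Ring(servers)
new_ring = Ring(servers + ["server-new"])
moved = sum(old_ring.lookup(k) != new_ring.lookup(k) for k in keys)
print(f"{moved / len(keys):.1%} of keys moved")  # roughly 1/101, not ~99%
```

Only the keys in the arc claimed by `server-new` change owners; every other key's clockwise-first server is unchanged.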
The Hard Parts of Sharding
Sharding solves the scale problem but creates new ones:
Why are JOINs problematic in a sharded database?
💡 Think about what happens when the two tables being JOINed live on different physical machines...
In a single database, a JOIN is a local operation — both tables are on the same machine. But in a sharded DB, user data might be on shard 1 while their orders are on shard 3. The JOIN now requires network calls between shards, which is orders of magnitude slower. This is why co-locating related data on the same shard (using the same shard key) is critical.
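The co-location idea can be sketched with in-memory dictionaries standing in for shards. Keying the orders table by `user_id` (the owner) rather than `order_id` is the crucial move; all names and the md5 router are illustrative:

```python
import hashlib

NUM_SHARDS = 4
shards = [{"users": {}, "orders": {}} for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

# Co-locate: key BOTH tables by user_id, not by order_id.
uid = "user_42"
s = shard_for(uid)
shards[s]["users"][uid] = {"name": "Ada"}
shards[s]["orders"].setdefault(uid, []).append({"order_id": "o-1", "total": 30})

# The "JOIN" now touches exactly one shard, with no network fan-out:
local = shards[shard_for(uid)]
result = [(local["users"][uid]["name"], o["total"]) for o in local["orders"][uid]]
print(result)  # [('Ada', 30)]
```

Had orders been sharded by `order_id`, a user's orders would scatter across all four shards and the same join would require a scatter-gather over the network.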
With consistent hashing and 100 servers, you add 1 new server. Approximately how much data needs to move?
💡 On a ring, adding a server only 'steals' keys from its clockwise neighbor...
With consistent hashing, adding 1 server to 100 only affects keys in the arc between the new server and its clockwise neighbor. That's roughly 1/N = 1/101 ≈ 1% of all data. Compare this to hash % N, where ~99% of data would need to move! This is why consistent hashing is essential for production systems.
🎓 What You Now Know
✓ Scale vertically first, shard only when necessary — Sharding adds massive complexity. Exhaust simpler options first.
✓ Shard key choice is the most critical decision — High cardinality, even distribution, query-aligned, immutable.
✓ Hash sharding beats range sharding for distribution — But you lose range queries. Pick based on your access patterns.
✓ Consistent hashing enables painless scaling — Adding a server moves only ~1/N of keys, not everything.
✓ Cross-shard operations are the enemy — Design your shard key to keep related data together.
Sharding is inevitable at scale. Every system that handles billions of records — Instagram, Discord, Uber — uses it. The concepts here are the foundation of every system design interview and every real-world scalable architecture. 🚀
📄 Consistent Hashing and Random Trees (Karger et al., 1997)
📄 Dynamo: Amazon’s Highly Available Key-value Store (DeCandia et al., 2007)
↗ Keep Learning
Caching — The Art of Remembering What's Expensive to Compute
A visual deep dive into caching. From CPU caches to CDNs — understand cache strategies, eviction policies, and the hardest problem in computer science: cache invalidation.
Vector Databases — Search by Meaning, Not Keywords
A visual deep dive into vector databases. From embeddings to ANN search to HNSW — understand how AI-powered search finds what you actually mean, not just what you typed.