Design Uber / Yelp (Geospatial Index)
1. What is a Geospatial System?
A Geospatial System (or Location-Based Service - LBS) is the engine behind modern on-demand apps. Whether it’s Uber matching riders with drivers, Yelp finding nearby sushi, or Tinder finding matches within 5 miles, the core problem is Proximity Search.
[!TIP] The Core Challenge: Standard databases (B-Trees) are excellent for 1D data (e.g., `salary > 50000`). They are terrible for 2D data (e.g., `x` and `y` near me). We need specialized Spatial Indexing structures to map 2D coordinates into 1D keys.
Real-World Examples
- Uber/Lyft: Real-time driver tracking and matching.
- Google Maps: Points of Interest (POI) search.
- Pokemon Go: Spawning entities based on player location.
2. Requirements & Goals
Functional Requirements
- Driver Location Updates: Drivers ping their location (lat/long) every 5 seconds.
- Nearby Search: Passengers can see available drivers within a radius (e.g., 2km).
- ETA Calculation: Estimate arrival time based on road traffic.
- Trip History: Store trip data for billing and analytics.
Non-Functional Requirements
- Low Latency: Search results must return in < 200ms.
- High Throughput: Handle millions of driver updates per second globally.
- Availability: The system must be always online (CAP: AP over CP).
- Accuracy: Minimal lag between actual driver position and system state.
3. Capacity Estimation
Let’s design for Uber-scale.
- Total Users: 500 Million.
- Daily Active Users (DAU): 20 Million.
- Total Drivers: 5 Million.
- Active Drivers: 1 Million (online at any given time).
Throughput (QPS)
- Driver Updates: Each active driver pings every 5 seconds.
- $1,000,000 \text{ drivers} / 5 \text{ sec} = 200,000 \text{ QPS}$ (Write heavy).
- Passenger Searches: Assume 50M searches per day.
- $50,000,000 / 86400 \approx 600 \text{ QPS}$ (Read is much lower than Write).
Storage
- Location data is ephemeral. We only need the current location for the hot path.
- 1 Million active drivers $\times$ 100 bytes (ID, lat, long, status) $\approx$ 100 MB. This fits easily in Memory (Redis).
4. System APIs
1. Update Driver Location
```
POST /v1/driver/location
Authorization: Bearer <token>

{
  "lat": 37.7749,
  "long": -122.4194,
  "status": "AVAILABLE"  // or "BUSY"
}
```
2. Find Nearby Drivers
```
GET /v1/passenger/nearby?lat=37.7749&long=-122.4194&radius=2000
```

Response:

```json
{
  "drivers": [
    { "id": "d1", "lat": 37.7750, "long": -122.4195 },
    { "id": "d2", "lat": 37.7751, "long": -122.4193 }
  ]
}
```
5. Database Design & Spatial Indexing
Why not just use SQL?
```sql
SELECT * FROM drivers
WHERE lat  BETWEEN my_lat  - R AND my_lat  + R
  AND long BETWEEN my_long - R AND my_long + R;
```
Even with an index on `lat`, the database must fetch every row in that latitude band and then filter it by `long` one row at a time; a standard B-Tree can narrow only one dimension, so the query scans far too many rows.
The Solution: Grid-Based Indexing
We need to map 2D coordinates into a 1D value that preserves locality.
Comparison: Geohash vs QuadTree vs Google S2
| Feature | Geohash (String) | QuadTree (Tree) | Google S2 (Integer) |
|---|---|---|---|
| Data Structure | Base-32 string interleaving | Recursive 4-ary tree | Hilbert curve on a cube |
| Locality | OK (Z-order curve) | Good (adaptive) | Excellent (Hilbert curve) |
| Query Logic | Prefix search (`LIKE '9q8%'`) | Tree traversal | Integer range scan |
| Precision | Fixed grid sizes | Variable/adaptive | Fixed cell levels (0-30) |
| Edge Cases | Distorted at the poles | Complex to implement | Handles sphere math correctly |
| Best For | Simple string-based DBs | Dense/sparse data variance | High scale (Uber/Tinder) |
Option C: Google S2 (The Industry Standard)
- Maps the sphere onto a cube, then uses a Hilbert Space-Filling Curve to map 2D cells to 1D integers (Cell IDs).
- Why Uber uses S2:
- Uniformity: S2 cells are roughly square everywhere on Earth. Geohash rectangles get very thin at the poles.
- Math: Hilbert curves preserve locality better than Z-order curves (Geohash).
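For concreteness, here is a minimal sketch of the lat/long → Cell ID mapping using the `s2sphere` Python port of the S2 library (an illustrative choice; production systems typically use the official C++/Java bindings):

```python
# pip install s2sphere  -- a pure-Python port of Google's S2 library
import s2sphere

# The San Francisco coordinates from the API example above.
lat, lng = 37.7749, -122.4194

# Map the 2D point to a 64-bit Cell ID, then truncate from the leaf
# level (30) up to level 12 (cells roughly 2 km across).
leaf = s2sphere.CellId.from_lat_lng(s2sphere.LatLng.from_degrees(lat, lng))
cell = leaf.parent(12)

print(cell.id())        # 1D integer position along the Hilbert curve
print(cell.to_token())  # compact hex token form of the same ID
```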
Interactive: Z-Order vs Hilbert Curve
This demo visualizes how 2D points are mapped to a 1D line.
- Z-Order (Geohash): Often makes large jumps (e.g., from cell 7 to 8). Points that are close in ID might be far in space.
- Hilbert (Google S2): Preserves locality much better. Adjacent 1D values are almost always adjacent in 2D space.
[!TIP] Try it yourself: Click the buttons below to trace the curve. Notice how the Green curve (Z-Order) has long diagonal jumps, while the Blue curve (Hilbert) stays local.
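To see why Z-order makes those jumps, note that a Geohash-style Z-order code is nothing more than bit interleaving. A toy sketch:

```python
def z_order(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits land on even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits land on odd positions
    return code

# Consecutive 1D codes 7 and 8 decode to grid cells (3, 1) and (0, 2):
# adjacent on the line, far apart on the map -- the long diagonal jump.
print([(x, y) for y in range(4) for x in range(4) if z_order(x, y) in (7, 8)])
```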
6. High-Level Design
High-Level Architecture: Real-Time Geospatial Flow.
- Driver App: Sends real-time location pings via WS Gateway (WebSocket) to maintain long-lived stateful connections.
- Location Service: Processes incoming pings.
  - Hot Path: Uses the S2 Resolver to map lat/long coordinates to Google S2 Cell IDs, then updates the Redis Hot Index using `ZADD` (sketched after this list).
  - Cold Path: Simultaneously pushes a “History Event” to Kafka, which is consumed by workers and persisted in Cassandra for billing and trip history.
- Passenger App: Requests nearby drivers via the Search Service.
- Redis Cluster: Stores active driver locations, sharded by S2 Cell ID range to enable fast, distributed $O(\log N)$ range queries.
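A minimal sketch of that `ZADD`-based hot index, assuming redis-py and s2sphere (key names and the fixed index level are illustrative):

```python
import redis
import s2sphere

r = redis.Redis()   # one shard of the cluster, chosen by cell ID range
INDEX_LEVEL = 20    # every driver is stored at this fixed cell level
# Redis sorted-set scores are doubles (exact only up to 2^53), so we drop
# the always-zero low bits of a level-20 Cell ID to keep the score exact.
SHIFT = 64 - (3 + 2 * INDEX_LEVEL + 1)

def update_driver(driver_id: str, lat: float, lng: float) -> None:
    leaf = s2sphere.CellId.from_lat_lng(s2sphere.LatLng.from_degrees(lat, lng))
    r.zadd("geo:hot", {driver_id: leaf.parent(INDEX_LEVEL).id() >> SHIFT})

def drivers_in_cell(cell: s2sphere.CellId):
    # All descendants of a cell lie in [range_min, range_max], so one
    # O(log N) ZRANGEBYSCORE answers "who is inside this cell?".
    lo, hi = cell.range_min().id() >> SHIFT, cell.range_max().id() >> SHIFT
    return r.zrangebyscore("geo:hot", lo, hi)
```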
[!NOTE] See Load Balancing for how to handle the ingress traffic.
7. Deep Dive: QuadTree Implementation
A QuadTree is a tree data structure in which each internal node has exactly four children: NW, NE, SW, SE.
Algorithm
- Insert(Point):
  - Start at the root.
  - If the node is a leaf and `points < capacity`: add the point.
  - If the node is a leaf and `points == capacity`: split it into 4 children and redistribute the existing points.
  - If the node is internal: recurse into the correct quadrant.
- Search(Region):
  - Start at the root.
  - If the node overlaps the search region:
    - Check the points in this node.
    - Recurse into the children.
  - Otherwise, skip the subtree entirely.
This creates a map where dense areas are granular, and sparse areas are broad.
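A compact Python sketch of this algorithm (the capacity of 4 and the center/half-extent box representation are illustrative choices):

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    x: float   # center x
    y: float   # center y
    w: float   # half-width
    h: float   # half-height

    def contains(self, px, py):
        return abs(px - self.x) <= self.w and abs(py - self.y) <= self.h

    def overlaps(self, other):
        return (abs(self.x - other.x) <= self.w + other.w and
                abs(self.y - other.y) <= self.h + other.h)

@dataclass
class QuadTree:
    boundary: Box
    capacity: int = 4
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)  # NW, NE, SW, SE after a split

    def insert(self, px, py):
        if not self.boundary.contains(px, py):
            return False                          # belongs in a sibling quadrant
        if not self.children and len(self.points) < self.capacity:
            self.points.append((px, py))          # leaf with spare capacity
            return True
        if not self.children:
            self._split()                         # full leaf: split into 4
        return any(c.insert(px, py) for c in self.children)

    def _split(self):
        x, y = self.boundary.x, self.boundary.y
        w, h = self.boundary.w / 2, self.boundary.h / 2
        self.children = [QuadTree(Box(x - w, y + h, w, h), self.capacity),  # NW
                         QuadTree(Box(x + w, y + h, w, h), self.capacity),  # NE
                         QuadTree(Box(x - w, y - h, w, h), self.capacity),  # SW
                         QuadTree(Box(x + w, y - h, w, h), self.capacity)]  # SE
        for p in self.points:                     # redistribute existing points
            any(c.insert(*p) for c in self.children)
        self.points = []

    def search(self, region):
        if not self.boundary.overlaps(region):
            return []                             # prune: skip this subtree
        found = [p for p in self.points if region.contains(*p)]
        for c in self.children:
            found.extend(c.search(region))
        return found
```

With a capacity of 4, a dense downtown splits into many fine cells while a rural area stays a single large leaf, matching the adaptive behavior described above.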
8. Data Partitioning (Sharding)
With 200k updates/sec, a single Redis instance will melt. For detailed scaling strategies, see Module 07: Data Scaling.
Sharding Strategies
- By City/Region:
- “San Francisco” server, “New York” server.
- Problem: “Hot Cities”. NY might overload while “Montana” is idle.
- Problem: Boundary issues (Driver moving from SF to Daly City).
- By Geohash / S2 Cell ID (Consistent Hashing):
  - As shown in the Redis Hot Index shards in our diagram, we hash an S2 Cell ID (typically a coarse parent cell) to map it to a specific Redis node.
  - This ensures that nearby drivers (who share cell ID prefixes) land on the same shard, or on predictable adjacent shards (see the sketch below).
- Preferred Approach.
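A toy consistent-hashing sketch for routing cells to Redis shards (shard names and virtual-node count are illustrative). Note that hashing a coarse parent cell, rather than the fine cell itself, is what keeps nearby drivers together:

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring; each shard owns many virtual nodes."""

    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{i}"), s) for s in shards for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, cell_id: int) -> str:
        i = bisect.bisect(self.keys, self._hash(str(cell_id))) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
# Route by a coarse (e.g. level-8) parent Cell ID so that all the fine
# cells beneath it -- and hence nearby drivers -- land on the same shard.
print(ring.shard_for(0x8085000000000000))
```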
9. Reliability & Trade-offs
The “Ghost Driver” Problem (Accuracy)
When a driver app crashes or loses internet, they might appear “stuck” on the map.
- Solution: Use a TTL (Time-To-Live) in Redis for every entry in the Hot Index: `SET driver:123 location EX 30` (see the sketch below).
- If no update arrives within 30 seconds, the driver automatically disappears from the index.
- This is a form of Passive Expiration common in Caching Strategies.
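A sketch of that TTL heartbeat with redis-py (the key naming is an assumption):

```python
import redis

r = redis.Redis()

def heartbeat(driver_id: str, lat: float, lng: float) -> None:
    # Every ping rewrites the key and resets its 30-second TTL.
    r.set(f"driver:{driver_id}:loc", f"{lat},{lng}", ex=30)

# If the app crashes, pings stop, the key expires after 30 s, and the
# "ghost driver" silently vanishes from the hot index.
heartbeat("123", 37.7749, -122.4194)
```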
CAP Theorem Trade-off
- We choose AP (Availability + Partition Tolerance).
- Eventual Consistency: It’s acceptable if a driver appears 10 meters away from their actual spot. The map is always an approximation of reality.
- Liveness over Correctness: It is unacceptable for the user to see a “Service Unavailable” screen just because one Redis node is syncing.
10. Interactive Decision Visualizer: QuadTree Range Search
This demo simulates how a QuadTree efficiently searches for points.
- Add Drivers: Click on the map to populate it.
- Range Search: Move your mouse to see which QuadTree nodes are queried (Green) vs ignored (Red/Gray).
[!TIP] Try it yourself: Click “Add 50 Random Drivers” then hover over the box. Watch how the search only visits a few Green boxes instead of scanning the whole world.
11. System Walkthrough: The Life of a Ride
Let’s trace the exact flow of data when a Driver goes online and a Passenger searches for them.
Scenario A: Driver Pings Location (Write Path)
- Driver App (User 123) sends a ping via WebSocket:

  ```json
  { "event": "update_location", "lat": 37.7749, "long": -122.4194, "status": "AVAILABLE" }
  ```

- Location Service receives the message and converts the lat/long to an S2 Cell ID (Level 12), e.g. `808580ff...` (hex).
- Redis Update:
  - The service identifies the correct Redis shard (e.g., Shard 5 for San Francisco).
  - Executes `GEOADD` (or `ZADD` with an encoded integer):

    ```
    GEOADD drivers:available -122.4194 37.7749 "driver_123"
    EXPIRE driver_123 30   # Reset TTL
    ```

- Kafka Publish (Async): publishes the event to the `driver-locations` topic for historical tracking.
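Putting Scenario A together, a minimal write-path handler. The library choices (redis-py ≥ 4 and kafka-python) and all names are assumptions for illustration:

```python
import json
import redis
from kafka import KafkaProducer

r = redis.Redis()
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

def on_location_ping(driver_id, lat, lng, status):
    # Hot path: GEOADD takes longitude first, then latitude.
    r.geoadd("drivers:available", (lng, lat, driver_id))
    r.set(f"driver:{driver_id}:hb", "1", ex=30)   # reset the 30 s liveness TTL

    # Cold path: fire-and-forget history event for billing and analytics.
    producer.send("driver-locations", {"driver_id": driver_id, "lat": lat,
                                       "long": lng, "status": status})

on_location_ping("driver_123", 37.7749, -122.4194, "AVAILABLE")
```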
Scenario B: Passenger Searches for Ride (Read Path)
- Passenger App requests nearby drivers:

  ```
  GET /nearby?lat=37.7750&long=-122.4190&radius=500m
  ```

- Search Service:
  - Calculates the S2 Cell ID for the passenger.
  - Determines the neighboring cells to query (usually the center cell + 8 neighbors).
- Redis Query: fetches all drivers in those cells from the owning Redis shard:

  ```
  GEORADIUS drivers:available -122.4190 37.7750 500 m WITHDIST
  ```

- Response:

  ```json
  { "drivers": [ { "id": "driver_123", "lat": 37.7749, "long": -122.4194, "dist_m": 15.2 } ] }
  ```
12. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Real-Time Updates | WebSocket Gateway (Stateful) + In-Memory Redis. |
| Low Latency (<200ms) | Google S2 Index (Fast Math) + Sharded Redis Cluster. |
| Accuracy | 30s TTL (Time-to-Live) ensures stale drivers disappear. |
| Scalability (200k QPS) | Database Sharding by S2 Cell ID (Geospatial Partitioning). |
| Availability (AP) | Redis Cluster with Replicas. Eventual Consistency accepted. |
| Historical Data | Async Kafka pipeline to Cassandra (Write-Optimized). |
13. Follow-Up Questions: The Interview Gauntlet
This section covers rapid-fire questions to test your depth.
I. Geospatial Indexing
- Why S2 over Geohash? S2 cells are square-like, while Geohash rectangles distort heavily near poles. S2 uses Hilbert curves for better locality.
- Why not Postgres PostGIS? PostGIS indexes live on disk (R-Tree/GiST). While powerful, they cannot absorb 200k writes per second. Redis (memory) is required.
- How does QuadTree rebalancing work? If a node exceeds capacity (e.g., 500 points), split it into 4 children. If children become sparse, merge them.
- What is the “Edge Case” problem? A user standing on a grid line might not see a driver 1 meter away in the adjacent cell. We solve this by querying the neighbor cells (all 9 blocks), as sketched below.
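A grid-agnostic sketch of that 9-block neighbor query (a real system would use the index's native neighbor API, e.g. S2 cell neighbors):

```python
def neighbor_blocks(cx: int, cy: int):
    """The 3x3 neighborhood around grid cell (cx, cy): center + 8 neighbors."""
    return [(cx + dx, cy + dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

# Query all 9 blocks so a driver 1 m across the grid line is still found.
print(neighbor_blocks(10, 42))
```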
II. Scalability & Performance
- How to shard Redis? Consistent Hashing on the S2 Cell ID. This keeps nearby drivers on the same shard.
- Hot Spot Problem: What if Times Square has 10k drivers? The shard melts. Solution: Further split hot cells into smaller child cells (Dynamic Sharding) or use Read Replicas.
- WebSocket vs Polling: WebSocket avoids per-request header overhead and connection-setup latency. Essential for 5-second update intervals at this scale.
- Bandwidth Usage: 200k * 100 bytes = 20 MB/s. Manageable, but use Protobuf instead of JSON to reduce size by 60%.
III. Reliability & Failure Modes
- Redis Crash: If Redis memory is wiped, we lose current locations. This is acceptable. Drivers will ping again in 5 seconds, self-healing the state.
- Network Partition: If SF cannot talk to NY, it’s fine. Geospatial data is naturally partitioned by location.
- Ghost Drivers: Handled by Redis TTL. If the app crashes, the key expires automatically.
14. Summary: The Whiteboard Strategy
If asked to design Uber, draw this 4-Quadrant Layout:
1. Requirements
- Func: Driver Ping (Write), Passenger Search (Read).
- Non-Func: Low Latency, High Throughput (200k QPS).
- Scale: 500M Users, 1M Active Drivers.
2. Architecture

```
[Driver]
   ↓ (WS)
[Location Svc] -- [Kafka] -> [Cassandra]
   ↓
[Redis Cluster (S2)]
   ↑
[Search Svc]
   ↑
[Passenger]
```
* S2 Index: Maps 2D to 1D.
* Redis: Handles high write throughput.
3. Data & API
API: POST /location, GET /nearby
S2: Level 12 (~2 km cells)
4. Trade-offs
- QuadTree vs S2: S2 has better math/locality. A QuadTree is harder to balance in a distributed setting.
- CAP: AP (Availability). It's okay if map is slightly stale.
- Protocol: WebSocket for drivers (push), HTTP for passengers (pull).