BigTable: Google's Database
[!TIP] Origins: Described in a 2006 paper, BigTable was built to handle web-scale data. Google needed to store the Web Index (pages, URLs, contents) and Google Earth (satellite imagery), and MySQL couldn’t scale to that size. BigTable is built on top of GFS (Google File System).
1. What is BigTable?
It is a Sparse, Distributed, Persistent, Multi-Dimensional Sorted Map.
- Map: Key-Value pairs.
- Sorted: Keys are stored lexicographically (alphabetically).
- Distributed: Sharded across thousands of machines.
- Sparse: Empty cells take up zero space (unlike SQL NULLs, which still consume storage).
Data Model
`(row: string, column: string, time: int64) -> string`
- Row Key: “com.google.www” (Reversed URL for locality).
- Column Family: “contents”, “anchor”.
- Timestamp: Versioning.
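A minimal sketch of this model in Python (the `CellMap` name and its methods are invented for illustration, not a real BigTable API): keys sort lexicographically, every cell is addressed by (row, column, timestamp), and unwritten cells cost nothing.

```python
# Toy model of the BigTable data model: a sparse, sorted,
# multi-dimensional map of (row, column, timestamp) -> value.

class CellMap:
    def __init__(self):
        # row key -> column key -> timestamp -> value; unset cells are simply absent (sparse)
        self._rows = {}

    def put(self, row, column, timestamp, value):
        self._rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it was never written."""
        versions = self._rows.get(row, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def scan(self, prefix):
        """Yield rows in lexicographic (sorted) key order whose key starts with prefix."""
        for row in sorted(self._rows):
            if row.startswith(prefix):
                yield row, self._rows[row]


table = CellMap()
table.put("com.google.www", "contents:", 1, "<html>v1</html>")
table.put("com.google.www", "contents:", 2, "<html>v2</html>")
table.put("com.google.www", "anchor:cnn.com", 5, "Google")
print(table.get("com.google.www", "contents:"))   # -> <html>v2</html> (newest timestamp)
```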
2. Architecture
BigTable separates the Control Plane (metadata and locks) from the Data Plane (reads and writes).
- Chubby (Distributed Lock): A Paxos-based lock service. The client first talks to Chubby to find the Root Tablet location (Bootstrap Metadata). If Chubby is down, the system is inaccessible.
- Master: Assigns Tablets to Tablet Servers and handles metadata changes (e.g., column family creation). It is not in the data path for reads/writes.
- Tablet Server: The workers in the Data Plane. They handle all client read/write requests for their assigned tablets.
- GFS / Colossus: The persistent storage layer where the actual SSTables and Commit Logs are stored across a cluster of disk nodes.
(Architecture diagram omitted: clients go through the BigTable client library to Tablet Servers; Chubby handles Master election and stores ACLs/schema metadata; the Master performs failure detection and load balancing; all data persists on the distributed file system.)
3. The SSTable (Sorted String Table)
Data is stored in SSTables on the GFS / Colossus layer.
- Structure: A sequence of Blocks (usually 64KB).
- Block Index: The end of the file contains an index.
- Example: “Key ‘Apple’ is in Block 1”, “Key ‘Zebra’ is in Block 99”.
- Lookup:
- Load the Block Index into RAM.
- Binary Search in RAM to find the correct Block offset.
- 1 Disk Seek to read the 64KB block.
- Scan the block in RAM to find the key.
This guarantees that a lookup in a single SSTable takes at most 1 disk seek; the index search happens entirely in RAM.
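A rough sketch of that lookup in Python (the `SSTableReader` name and the on-disk record format are simplifying assumptions, not the real SSTable format): the block index stays in RAM, a binary search picks the block, and a single read pulls in one 64KB block.

```python
import bisect

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as described above

class SSTableReader:
    def __init__(self, path, block_index):
        # block_index: sorted list of (first_key_in_block, byte_offset),
        # loaded into RAM once when the SSTable is opened.
        self.path = path
        self.first_keys = [key for key, _ in block_index]
        self.offsets = [offset for _, offset in block_index]

    def get(self, key):
        # Binary search in RAM: last block whose first key <= key.
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        with open(self.path, "rb") as f:          # 1 disk seek + 1 block read
            f.seek(self.offsets[i])
            block = f.read(BLOCK_SIZE)
        return self._scan_block(block, key)       # linear scan of the block in RAM

    def _scan_block(self, block, key):
        # Placeholder block format: one "key=value" record per line.
        for line in block.decode(errors="ignore").splitlines():
            if line.startswith(key + "="):
                return line.split("=", 1)[1]
        return None
```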
4. Tablets and Splitting
- A table is split into Tablets (Ranges of rows).
- Start with 1 Tablet covering the range (-inf, +inf).
- As a Tablet grows (e.g., beyond ~200MB), it splits into two.
- The Master reassigns the new Tablet to another server.
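A toy model of that split decision (the ~200MB threshold is the example figure from above, and splitting at the median row key is a simplification of how the real split point is chosen):

```python
SPLIT_THRESHOLD = 200 * 1024 * 1024  # ~200MB, example threshold from above

class Tablet:
    def __init__(self, start_key, end_key):
        self.start_key, self.end_key = start_key, end_key  # [start, end) row range
        self.rows = {}                                      # row key -> row size in bytes

    def size(self):
        return sum(self.rows.values())

    def maybe_split(self):
        """If the tablet is too big, split at the median key into two tablets."""
        if self.size() < SPLIT_THRESHOLD:
            return [self]
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left, right = Tablet(self.start_key, mid), Tablet(mid, self.end_key)
        for key, nbytes in self.rows.items():
            (left if key < mid else right).rows[key] = nbytes
        return [left, right]  # the Master would now reassign one half to another server
```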
5. Key Design: Row Keys & Hotspotting
The most critical part of a BigTable design is the Row Key. Because rows are sharded into Tablets by key range (e.g., com.google.*), a badly chosen key can hammer a single Tablet Server.
Good vs Bad Keys
| Scenario | Bad Key | Why Bad? | Good Key | Why Good? |
|---|---|---|---|---|
| Time Series | timestamp (e.g., 1678888) | All writes go to the end of the table (one server). Hotspot! | reverse(machine_id) + timestamp | Spreads writes across all servers (keys start with a machine-derived prefix instead of the time). |
| URL Storage | www.google.com | The domain is not reversed, so pages from the same domain sort far apart. | com.google.www:path | Keeps all pages from the same domain together for faster compression and scanning. |
| User Data | user_id (sequential 1, 2, 3) | New users (highest IDs) all land on the last tablet. | hash(user_id) or UUID | Distributes users randomly across tablets. |
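A quick illustration of the "good key" patterns from the table (the salt length, separators, and helper names are arbitrary choices for this example):

```python
import hashlib

def time_series_key(machine_id, timestamp):
    # Prefix with a short hash of the machine id so concurrent writers
    # land on different tablets instead of all appending at the end.
    prefix = hashlib.md5(machine_id.encode()).hexdigest()[:4]
    return f"{prefix}#{machine_id}#{timestamp}"

def url_key(url):
    # www.google.com/index.html -> com.google.www/index.html
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + ("/" + path if path else "")

def user_key(user_id):
    # Hash the sequential id (or just use a UUID) to spread users across tablets.
    return hashlib.sha1(str(user_id).encode()).hexdigest()

print(time_series_key("machine-42", 1678888))
print(url_key("www.google.com/index.html"))
print(user_key(3))
```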
6. BigTable vs HBase
HBase is the Open Source clone of BigTable.
| Component | Google BigTable | Apache HBase |
|---|---|---|
| File System | GFS (Colossus) | HDFS |
| Lock Service | Chubby | ZooKeeper |
| Worker Node | Tablet Server | RegionServer |
| Storage Unit | Tablet | Region |
| File Format | SSTable | HFile |
7. Optimization: Bloom Filters
Reading from disk (SSTables) is slow. If a row doesn’t exist, we still have to check all SSTables to be sure.
- Solution: BigTable attaches a Bloom Filter to each SSTable.
- How: Before reading the file, it asks the Bloom Filter: “Does this row key exist in this file?”
- Result: If the answer is “No”, we skip the file entirely (Bloom filters never give false negatives). This drastically reduces disk seeks for non-existent keys.
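A minimal Bloom filter sketch (the bit-array size and hashing scheme are arbitrary illustrative choices, not BigTable's actual parameters): a “No” answer is definitive, so the SSTable can be skipped without touching disk.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a Python int

    def _positions(self, key):
        # Derive num_hashes bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False -> definitely absent (skip this SSTable, no disk seek)
        # True  -> maybe present (must actually check the SSTable)
        return all(self.bits >> pos & 1 for pos in self._positions(key))


bf = BloomFilter()
bf.add("com.google.www")
print(bf.might_contain("com.google.www"))   # True (maybe present)
print(bf.might_contain("com.missing.row"))  # almost certainly False -> skip the file
```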
8. System Walkthrough: A Read Operation (Dry Run)
Let’s trace a Get(Row: "com.google.www") request.
Step 1: Client Bootstrap (Locating the Tablet)
- Client has no cache. It needs to find which Tablet Server holds “com.google.www”.
- Chubby: Client asks “Where is the Root Tablet?”
- Root Tablet: The first Metadata Tablet; it points to all other Metadata Tablets.
- Metadata Tablet: Client queries the Metadata Tablet to find the range covering com.google.*.
- Result: The Tablet is on Tablet Server #55.
Step 2: Tablet Server Request
- Client sends Get("com.google.www") to Tablet Server #55.
- Server #55 checks the MemTable (RAM). (Not found).
- Server #55 checks the Block Cache (RAM). (Not found).
Step 3: SSTable Lookup (Disk)
- Server #55 has 10 SSTables on GFS.
- It checks the Bloom Filter for each SSTable.
- SSTable 1: “No” (Skip).
- SSTable 2: “Maybe”.
- Server #55 reads the Block Index of SSTable 2.
- Finds that “com.google.www” is in Block 12 (Offset 4096).
- Server #55 reads Block 12 from GFS.
- Scans block, finds Key.
Step 4: Response
- Server #55 returns the value.
- Client caches the location (Row -> Tablet Server) to skip Step 1 next time.
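The same walkthrough as client-side code (ChubbyStub, TabletServerStub, and ClientStub are invented stand-ins for this sketch, not real BigTable client APIs):

```python
class TabletServerStub:
    def __init__(self, data):
        self.data = data
    def read(self, row_key):
        # A real server checks MemTable -> Block Cache -> Bloom filters -> SSTables (Steps 2-3).
        return self.data.get(row_key)

class ChubbyStub:
    def root_tablet_location(self):
        return "tablet-server-1"   # where the Root (first Metadata) tablet lives

class ClientStub:
    def __init__(self, chubby, metadata, servers):
        self.chubby = chubby            # Step 1 bootstrap
        self.metadata = metadata        # stands in for the Root/Metadata tablet lookups
        self.servers = servers          # server name -> TabletServerStub
        self.location_cache = {}        # Step 4: row key -> server name

    def get(self, row_key):
        server_name = self.location_cache.get(row_key)
        if server_name is None:                          # cache miss: do the full walk
            self.chubby.root_tablet_location()           # Step 1: find the Root Tablet
            server_name = self.metadata[row_key]         # Root -> Metadata -> user tablet
            self.location_cache[row_key] = server_name   # cache so Step 1 is skipped next time
        return self.servers[server_name].read(row_key)   # Steps 2-3 happen server-side


servers = {"ts-55": TabletServerStub({"com.google.www": "<html>...</html>"})}
client = ClientStub(ChubbyStub(), {"com.google.www": "ts-55"}, servers)
print(client.get("com.google.www"))   # first call walks the hierarchy
print(client.get("com.google.www"))   # second call hits the location cache
```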
9. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Petabyte Scale | Sharding into Tablets + GFS/Colossus Storage. |
| High Throughput | Linear Scaling (Add Tablet Servers). |
| Sparse Data | Column-Family storage only writes non-null values. |
| Consistency | Single-Row Atomic transactions. |
| Read Performance | Block Cache + Bloom Filters. |
| Write Performance | MemTable + CommitLog (Sequential Write). |
10. Interview Gauntlet
- Is BigTable ACID compliant?
- Only on a Single Row. It does not support multi-row transactions.
- What happens if the Master Node dies?
- Data Plane continues to work (Clients talk to Tablet Servers). Only metadata changes (splits/merges) are paused. Chubby elects a new Master.
- Why use Chubby?
- To provide a highly available, consistent store for small metadata (like Root Tablet location) and Leader Election.
- How does BigTable handle schema changes?
- Columns are dynamic. You only define Column Families upfront.
- What is a “Minor Compaction”?
- Flushing MemTable to an SSTable.
- What is a “Major Compaction”?
- Merging all SSTables into one, removing deleted data (tombstones). See the sketch after this list.
- Why separate Storage (GFS) from Compute (Tablet Server)?
- Fast recovery. If a Tablet Server dies, the new server just reads the files from GFS. No data copy needed.
- How do you prevent Hotspots?
- Good Row Key design (Hashing or Reversing domains).
- What are “Locality Groups”?
- Grouping Column Families that are often accessed together into the same SSTable.
- Explain the 3-Level Hierarchy.
- Chubby -> Root Tablet -> Metadata Tablet -> User Tablet.
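To make the compaction questions concrete, here is a simplified major-compaction merge (the tuple layout and tombstone marker are assumptions for this sketch; real compactions stream sorted SSTables and respect per-family garbage-collection settings):

```python
import heapq

TOMBSTONE = object()   # written by deletes; physically removed only during major compaction

def major_compaction(sstables):
    """Merge sorted SSTables (lists of (key, timestamp, value)) into one.

    Keeps only the newest version of each key and drops deleted rows entirely.
    """
    # Each input is sorted by (key, newest timestamp first); merge preserves that order globally.
    merged = heapq.merge(*sstables, key=lambda rec: (rec[0], -rec[1]))
    result, last_key = [], None
    for key, ts, value in merged:
        if key == last_key:
            continue                   # older version of a key we already emitted
        last_key = key
        if value is TOMBSTONE:
            continue                   # newest version is a delete: reclaim the space
        result.append((key, ts, value))
    return result


old = [("a", 1, "v1"), ("c", 1, "v1")]
new = [("a", 2, "v2"), ("c", 3, TOMBSTONE)]
print(major_compaction([old, new]))    # -> [('a', 2, 'v2')]  ('c' is gone)
```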
11. Summary: The Whiteboard Strategy
1. Core Concepts
- Wide Column: Sparse, multi-dimensional map.
- Decoupled: Storage (GFS) separate from Compute (Tablet Servers).
- Master: Control Plane (Assignment).
- Tablet Server: Data Plane (Read/Write).
2. The 3-Layer Hierarchy
Chubby -> Root Tablet (Metadata) -> Metadata Tablet -> User Tablet (Data)
- Lookup: Cache the tablet location to avoid repeating the walk.
3. Performance Keys
- SSTable: Immutable file on disk.
- Bloom Filter: Avoids unnecessary disk seeks.
- Row Key: Hashing prevents hotspots.
4. Trade-offs
- Single Row Atomicity: No complex transactions.
- Chubby Dependency: If the lock service dies, the system halts.
- Master Bottleneck: Only metadata operations go through the Master, so it does not limit read/write throughput.