BigTable: Google's Database
[!TIP] Origins: Described in a 2006 paper, BigTable was built to handle web-scale data. Google needed to store the Web Index (pages, URLs, contents) and Google Earth (satellite imagery), and MySQL couldn’t scale to that size. BigTable is built on top of GFS (Google File System).
1. What is BigTable?
It is a Sparse, Distributed, Persistent, Multi-Dimensional Sorted Map.
- Map: Key-Value pairs.
- Sorted: Keys are stored lexicographically (alphabetically).
- Distributed: Sharded across thousands of machines.
- Sparse: Empty cells take up zero space (unlike SQL NULLs, which still consume storage).
Data Model
`(row: string, column: string, time: int64) -> string`
- Row Key: “com.google.www” (Reversed URL for locality).
- Column Family: “contents”, “anchor”.
- Timestamp: Versioning.
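A minimal sketch of this model in Python (the `CellMap` name and its methods are invented for illustration, not a real BigTable API): keys sort lexicographically, every cell is addressed by (row, column, timestamp), and unwritten cells cost nothing.

```python
# Toy model of the BigTable data model: a sparse, sorted,
# multi-dimensional map of (row, column, timestamp) -> value.

class CellMap:
    def __init__(self):
        # row key -> column key -> timestamp -> value; unset cells are simply absent (sparse)
        self._rows = {}

    def put(self, row, column, timestamp, value):
        self._rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it was never written."""
        versions = self._rows.get(row, {}).get(column, {})
        return versions[max(versions)] if versions else None

    def scan(self, prefix):
        """Yield rows in lexicographic (sorted) key order whose key starts with prefix."""
        for row in sorted(self._rows):
            if row.startswith(prefix):
                yield row, self._rows[row]


table = CellMap()
table.put("com.google.www", "contents:", 1, "<html>v1</html>")
table.put("com.google.www", "contents:", 2, "<html>v2</html>")
table.put("com.google.www", "anchor:cnn.com", 5, "Google")
print(table.get("com.google.www", "contents:"))   # -> <html>v2</html> (newest timestamp)
```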
2. Architecture
BigTable separates the Control Plane (metadata and locks) from the Data Plane (reads and writes).
- Chubby (Distributed Lock): A Paxos-based lock service. The client first talks to Chubby to find the Root Tablet location (Bootstrap Metadata). If Chubby is down, the system is inaccessible.
- Master: Assigns Tablets to Tablet Servers and handles metadata changes (e.g., column family creation). It is not in the data path for reads/writes.
- Tablet Server: The workers in the Data Plane. They handle all client read/write requests for their assigned tablets.
- GFS / Colossus: The persistent storage layer where the actual SSTables and Commit Logs are stored across a cluster of disk nodes.
(Architecture diagram omitted: clients go through the BigTable client library to Tablet Servers; Chubby handles Master election and stores ACLs/schema metadata; the Master performs failure detection and load balancing; all data persists on the distributed file system.)
3. The SSTable (Sorted String Table)
Data is stored in SSTables on the GFS / Colossus layer.
- Structure: A sequence of Blocks (usually 64KB).
- Block Index: The end of the file contains an index.
- Example: “Key ‘Apple’ is in Block 1”, “Key ‘Zebra’ is in Block 99”.
- Lookup:
- Load the Block Index into RAM.
- Binary Search in RAM to find the correct Block offset.
- 1 Disk Seek to read the 64KB block.
- Scan the block in RAM to find the key.
This guarantees that a lookup in a single SSTable takes at most 1 disk seek; the index search happens entirely in RAM.
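A rough sketch of that lookup in Python (the `SSTableReader` name and the on-disk record format are simplifying assumptions, not the real SSTable format): the block index stays in RAM, a binary search picks the block, and a single read pulls in one 64KB block.

```python
import bisect

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as described above

class SSTableReader:
    def __init__(self, path, block_index):
        # block_index: sorted list of (first_key_in_block, byte_offset),
        # loaded into RAM once when the SSTable is opened.
        self.path = path
        self.first_keys = [key for key, _ in block_index]
        self.offsets = [offset for _, offset in block_index]

    def get(self, key):
        # Binary search in RAM: last block whose first key <= key.
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        with open(self.path, "rb") as f:          # 1 disk seek + 1 block read
            f.seek(self.offsets[i])
            block = f.read(BLOCK_SIZE)
        return self._scan_block(block, key)       # linear scan of the block in RAM

    def _scan_block(self, block, key):
        # Placeholder block format: one "key=value" record per line.
        for line in block.decode(errors="ignore").splitlines():
            if line.startswith(key + "="):
                return line.split("=", 1)[1]
        return None
```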
4. Tablets and Splitting
- A table is split into Tablets (Ranges of rows).
- Start with 1 Tablet covering the range (-inf, +inf).
- As a Tablet grows (e.g., beyond ~200MB), it splits into two.
- The Master reassigns the new Tablet to another server.
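A toy model of that split decision (the ~200MB threshold is the example figure from above, and splitting at the median row key is a simplification of how the real split point is chosen):

```python
SPLIT_THRESHOLD = 200 * 1024 * 1024  # ~200MB, example threshold from above

class Tablet:
    def __init__(self, start_key, end_key):
        self.start_key, self.end_key = start_key, end_key  # [start, end) row range
        self.rows = {}                                      # row key -> row size in bytes

    def size(self):
        return sum(self.rows.values())

    def maybe_split(self):
        """If the tablet is too big, split at the median key into two tablets."""
        if self.size() < SPLIT_THRESHOLD:
            return [self]
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        left, right = Tablet(self.start_key, mid), Tablet(mid, self.end_key)
        for key, nbytes in self.rows.items():
            (left if key < mid else right).rows[key] = nbytes
        return [left, right]  # the Master would now reassign one half to another server
```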
5. Key Design: Row Keys & Hotspotting
The most critical part of a BigTable design is the Row Key. Because rows are sharded into Tablets by key range (e.g., com.google.*), a badly chosen key can hammer a single Tablet Server.
Good vs Bad Keys
| Scenario | Bad Key | Why Bad? | Good Key | Why Good? |
|---|---|---|---|---|
| Time Series | timestamp (e.g., 1678888) | All writes go to the end of the table (one server). Hotspot! | reverse(machine_id) + timestamp | Spreads writes across all servers (keys start with a machine-derived prefix instead of the time). |
| URL Storage | www.google.com | The domain is not reversed, so pages from the same domain sort far apart. | com.google.www:path | Keeps all pages from the same domain together for faster compression and scanning. |
| User Data | user_id (sequential 1, 2, 3) | New users (highest IDs) all land on the last tablet. | hash(user_id) or UUID | Distributes users randomly across tablets. |
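A quick illustration of the "good key" patterns from the table (the salt length, separators, and helper names are arbitrary choices for this example):

```python
import hashlib

def time_series_key(machine_id, timestamp):
    # Prefix with a short hash of the machine id so concurrent writers
    # land on different tablets instead of all appending at the end.
    prefix = hashlib.md5(machine_id.encode()).hexdigest()[:4]
    return f"{prefix}#{machine_id}#{timestamp}"

def url_key(url):
    # www.google.com/index.html -> com.google.www/index.html
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + ("/" + path if path else "")

def user_key(user_id):
    # Hash the sequential id (or just use a UUID) to spread users across tablets.
    return hashlib.sha1(str(user_id).encode()).hexdigest()

print(time_series_key("machine-42", 1678888))
print(url_key("www.google.com/index.html"))
print(user_key(3))
```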
6. BigTable vs HBase
HBase is the Open Source clone of BigTable.
| Component | Google BigTable | Apache HBase |
|---|---|---|
| File System | GFS (Colossus) | HDFS |
| Lock Service | Chubby | ZooKeeper |
| Worker Node | Tablet Server | RegionServer |
| Storage Unit | Tablet | Region |
| File Format | SSTable | HFile |
7. Optimization: Bloom Filters
Reading from disk (SSTables) is slow. If a row doesn’t exist, we still have to check all SSTables to be sure.
- Solution: BigTable attaches a Bloom Filter to each SSTable.
- How: Before reading the file, it asks the Bloom Filter: “Does this row key exist in this file?”
- Result: If the answer is “No”, we skip the file entirely (Bloom filters never give false negatives). This drastically reduces disk seeks for non-existent keys.
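A minimal Bloom filter sketch (the bit-array size and hashing scheme are arbitrary illustrative choices, not BigTable's actual parameters): a “No” answer is definitive, so the SSTable can be skipped without touching disk.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a Python int

    def _positions(self, key):
        # Derive num_hashes bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False -> definitely absent (skip this SSTable, no disk seek)
        # True  -> maybe present (must actually check the SSTable)
        return all(self.bits >> pos & 1 for pos in self._positions(key))


bf = BloomFilter()
bf.add("com.google.www")
print(bf.might_contain("com.google.www"))   # True (maybe present)
print(bf.might_contain("com.missing.row"))  # almost certainly False -> skip the file
```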
8. System Walkthrough: A Read Operation (Dry Run)
Let’s trace a Get(Row: "com.google.www") request.
Step 1: Client Bootstrap (Locating the Tablet)
- Client has no cache. It needs to find which Tablet Server holds “com.google.www”.
- Chubby: Client asks “Where is the Root Tablet?”
- Root Tablet: The first Metadata Tablet; it points to all other Metadata Tablets.
- Metadata Tablet: Client queries the Metadata Tablet to find the range covering com.google.*.
- Result: The Tablet is on Tablet Server #55.
Step 2: Tablet Server Request
- Client sends Get("com.google.www") to Tablet Server #55.
- Server #55 checks the MemTable (RAM). (Not found).
- Server #55 checks the Block Cache (RAM). (Not found).
Step 3: SSTable Lookup (Disk)
- Server #55 has 10 SSTables on GFS.
- It checks the Bloom Filter for each SSTable.
- SSTable 1: “No” (Skip).
- SSTable 2: “Maybe”.
- Server #55 reads the Block Index of SSTable 2.
- Finds that “com.google.www” is in Block 12 (Offset 4096).
- Server #55 reads Block 12 from GFS.
- Scans block, finds Key.
Step 4: Response
- Server #55 returns the value.
- Client caches the location (Row -> Tablet Server) to skip Step 1 next time.
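The same walkthrough as client-side code (ChubbyStub, TabletServerStub, and ClientStub are invented stand-ins for this sketch, not real BigTable client APIs):

```python
class TabletServerStub:
    def __init__(self, data):
        self.data = data
    def read(self, row_key):
        # A real server checks MemTable -> Block Cache -> Bloom filters -> SSTables (Steps 2-3).
        return self.data.get(row_key)

class ChubbyStub:
    def root_tablet_location(self):
        return "tablet-server-1"   # where the Root (first Metadata) tablet lives

class ClientStub:
    def __init__(self, chubby, metadata, servers):
        self.chubby = chubby            # Step 1 bootstrap
        self.metadata = metadata        # stands in for the Root/Metadata tablet lookups
        self.servers = servers          # server name -> TabletServerStub
        self.location_cache = {}        # Step 4: row key -> server name

    def get(self, row_key):
        server_name = self.location_cache.get(row_key)
        if server_name is None:                          # cache miss: do the full walk
            self.chubby.root_tablet_location()           # Step 1: find the Root Tablet
            server_name = self.metadata[row_key]         # Root -> Metadata -> user tablet
            self.location_cache[row_key] = server_name   # cache so Step 1 is skipped next time
        return self.servers[server_name].read(row_key)   # Steps 2-3 happen server-side


servers = {"ts-55": TabletServerStub({"com.google.www": "<html>...</html>"})}
client = ClientStub(ChubbyStub(), {"com.google.www": "ts-55"}, servers)
print(client.get("com.google.www"))   # first call walks the hierarchy
print(client.get("com.google.www"))   # second call hits the location cache
```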
9. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Petabyte Scale | Sharding into Tablets + GFS/Colossus Storage. |
| High Throughput | Linear Scaling (Add Tablet Servers). |
| Sparse Data | Column-Family storage only writes non-null values. |
| Consistency | Single-Row Atomic transactions. |
| Read Performance | Block Cache + Bloom Filters. |
| Write Performance | MemTable + CommitLog (Sequential Write). |
10. Interview Gauntlet
- Is BigTable ACID compliant?
- Only on a Single Row. It does not support multi-row transactions.
- What happens if the Master Node dies?
- Data Plane continues to work (Clients talk to Tablet Servers). Only metadata changes (splits/merges) are paused. Chubby elects a new Master.
- Why use Chubby?
- To provide a highly available, consistent store for small metadata (like Root Tablet location) and Leader Election.
- How does BigTable handle schema changes?
- Columns are dynamic. You only define Column Families upfront.
- What is a “Minor Compaction”?
- Flushing MemTable to an SSTable.
- What is a “Major Compaction”?
- Merging all SSTables into one, removing deleted data (tombstones). See the sketch after this list.
- Why separate Storage (GFS) from Compute (Tablet Server)?
- Fast recovery. If a Tablet Server dies, the new server just reads the files from GFS. No data copy needed.
- How do you prevent Hotspots?
- Good Row Key design (Hashing or Reversing domains).
- What are “Locality Groups”?
- Grouping Column Families that are often accessed together into the same SSTable.
- Explain the 3-Level Hierarchy.
- Chubby -> Root Tablet -> Metadata Tablet -> User Tablet.
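To make the compaction questions concrete, here is a simplified major-compaction merge (the tuple layout and tombstone marker are assumptions for this sketch; real compactions stream sorted SSTables and respect per-family garbage-collection settings):

```python
import heapq

TOMBSTONE = object()   # written by deletes; physically removed only during major compaction

def major_compaction(sstables):
    """Merge sorted SSTables (lists of (key, timestamp, value)) into one.

    Keeps only the newest version of each key and drops deleted rows entirely.
    """
    # Each input is sorted by (key, newest timestamp first); merge preserves that order globally.
    merged = heapq.merge(*sstables, key=lambda rec: (rec[0], -rec[1]))
    result, last_key = [], None
    for key, ts, value in merged:
        if key == last_key:
            continue                   # older version of a key we already emitted
        last_key = key
        if value is TOMBSTONE:
            continue                   # newest version is a delete: reclaim the space
        result.append((key, ts, value))
    return result


old = [("a", 1, "v1"), ("c", 1, "v1")]
new = [("a", 2, "v2"), ("c", 3, TOMBSTONE)]
print(major_compaction([old, new]))    # -> [('a', 2, 'v2')]  ('c' is gone)
```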
11. Summary: The Whiteboard Strategy
1. Core Concepts
- Wide Column: Sparse, multi-dimensional map.
- Decoupled: Storage (GFS) separate from Compute (Tablet Servers).
- Master: Control Plane (Assignment).
- Tablet Server: Data Plane (Read/Write).
2. The 3-Layer Hierarchy
Chubby -> Root Tablet (Metadata) -> Metadata Tablet -> User Tablet (Data)
- Lookup: Cache the tablet location to avoid repeating the walk.
3. Performance Keys
- SSTable: Immutable file on disk.
- Bloom Filter: Avoids unnecessary disk seeks.
- Row Key: Hashing prevents hotspots.
4. Trade-offs
- Single Row Atomicity: No complex transactions.
- Chubby Dependency: If the lock service dies, the system halts.
- Master Bottleneck: Only metadata operations go through the Master, so it does not limit read/write throughput.