HDFS (Hadoop Distributed File System)
[!TIP] GFS for the Masses: HDFS is the open-source implementation of GFS. While GFS was C++, HDFS is Java. If you understand GFS, you understand 80% of HDFS. The key differences lie in how HDFS evolved to solve the “Single Master” problem.
1. GFS vs HDFS: The Cheat Sheet
| Feature | GFS (Original) | HDFS (Hadoop 2.0+) |
|:--------|:---------------|:-------------------|
| Master | GFS Master | NameNode (Active + Standby) |
| Slave | ChunkServer | DataNode |
| Block Size | 64 MB | 128 MB (default) |
| Consistency | Relaxed | Write-Once-Read-Many (Strict) |
| Coordination | Chubby | ZooKeeper |
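The client-facing knobs in this table (block size, replication factor) show up directly in the Java FileSystem API. Below is a minimal, hedged sketch of writing a file with an explicit 128 MB block size and 3 replicas; the NameNode URI and the path are placeholders for your own cluster.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/demo/hello.txt");
            // Block size and replication are per-file, client-visible settings.
            // 128 MB blocks and 3 replicas mirror the HDFS defaults in the table above.
            long blockSize = 128L * 1024 * 1024;
            short replication = 3;
            try (FSDataOutputStream out =
                     fs.create(path, true /* overwrite */, 4096, replication, blockSize)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```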
2. Architecture: Solving the SPOF
In early Hadoop (1.0), the NameNode was a SPOF. If it died, the cluster died. Hadoop 2.0 introduced High Availability (HA).
The HA Architecture
HDFS 2.0+ uses a quorum-based metadata architecture to provide hot-standby failover.
- Active NameNode: The leader that handles all client read/write requests and persists metadata changes.
- Standby NameNode: A hot backup that stays synchronized with the Active node by reading from the same JournalNode Quorum.
- JournalNodes (Shared Edits): A small cluster of nodes (typically 3 or 5) that stores the Edit Logs. A metadata write succeeds only if a quorum (majority) of JournalNodes persists it.
- ZKFC (ZooKeeper Failover Controller): A sidecar process on each NameNode host that monitors NameNode health and uses ZooKeeper to coordinate leader election if the Active node fails (a simplified election sketch follows below).
Diagram recap: ZooKeeper handles leader election and locking; the Active NameNode writes the Edit Logs and holds the fsImage in RAM; the Standby keeps a hot copy in RAM and performs the checkpoints (merging edits into a new fsImage).
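The ZKFC's election boils down to holding an ephemeral znode in ZooKeeper. Here is a hedged, simplified sketch of that pattern using the ZooKeeper Java client; the connection string, lock path, and node name are placeholders, and the real ZKFC layers health monitoring and fencing on top of this.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NameNodeElectionSketch {
    // Placeholder lock path; assumes the /hdfs-ha parent znode already exists.
    private static final String LOCK_PATH = "/hdfs-ha/active-lock";

    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        try {
            // EPHEMERAL: if this process dies or its ZK session expires,
            // the znode vanishes and another candidate can win the election.
            zk.create(LOCK_PATH, "nn1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("Won election: acting as Active NameNode");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lost election: staying Standby, watching the lock");
            // Watch the lock so we are notified when the current Active's session dies.
            zk.exists(LOCK_PATH, event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    System.out.println("Active gone: retry election / trigger failover");
                }
            });
        }
        // The winner keeps the session (and thus the lock) open for as long as it is Active.
    }
}
```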
3. The “Small File Problem”
HDFS hates small files because all namespace metadata must fit in the NameNode's RAM (its Java heap).
- The Math: Every object (file, block, directory) takes ~150 bytes of NameNode RAM.
- The Scenario:
- 1 million x 100 MB files = ~100 TB of data. Metadata: ~2 million objects (each file plus one block) x 150 bytes ≈ 300 MB of RAM. (Easy.)
- 100 million x 1 KB files = ~100 GB of data. Metadata: ~200 million objects x 150 bytes ≈ 30 GB of RAM. (Expensive; worked through in the sketch after this list.)
- Why it kills performance: The NameNode spends all its time GC’ing (Garbage Collecting) huge Java heaps.
- Solution: Hadoop Archives (HAR) or SequenceFiles to pack small files into big containers.
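A back-of-the-envelope sketch of the arithmetic above, assuming the commonly cited ~150 bytes of NameNode heap per namespace object (file or block); the numbers are estimates, not guarantees.

```java
public class NameNodeHeapEstimate {
    // Rule of thumb: ~150 bytes of NameNode heap per namespace object.
    private static final long BYTES_PER_OBJECT = 150;

    static long estimateHeapBytes(long fileCount, long blocksPerFile) {
        long objects = fileCount + fileCount * blocksPerFile; // file entries + block entries
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // Scenario 1: 1 million files of 100 MB (one block each at a 128 MB block size).
        System.out.printf("Big files (1M x 100MB):   ~%d MB of heap%n",
                estimateHeapBytes(1_000_000L, 1) / 1_000_000);
        // Scenario 2: 100 million files of 1 KB (still one block each).
        System.out.printf("Small files (100M x 1KB): ~%d GB of heap%n",
                estimateHeapBytes(100_000_000L, 1) / 1_000_000_000);
    }
}
```

Same volume of data, two orders of magnitude more heap: that is the small file problem in one number.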
4. Erasure Coding (Hadoop 3.0)
Replication (3x) is expensive: to store 1 PB of data you need 3 PB of raw disk, and 2 PB of that is pure redundancy. Erasure Coding (EC) changes the economics using Reed-Solomon math.
How RS(6, 3) Works
Instead of copying data, we split it and calculate “Parity”. Think of it as Algebra: A + B = C. If you lose A, you can calculate it as C - B.
- Data Blocks: We split a file into 6 chunks (D1 … D6).
- Parity Blocks: We calculate 3 parity chunks (P1 … P3) using linear algebra equations (Vandermonde Matrix).
- Total: 9 Blocks stored on 9 different nodes.
- Fault Tolerance: You can lose any 3 blocks (Data or Parity) and still recover.
- Math: Recovering the 6 data chunks means solving for 6 unknowns, which needs 6 independent equations. The 9 stored blocks supply 9 of them, so any 6 surviving blocks are sufficient (a simplified parity sketch follows this list).
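To make the parity intuition concrete without Galois-field math, here is a deliberately simplified single-parity (XOR) sketch: it survives exactly one lost block, whereas real RS(6,3) generalizes the same idea to survive any three.

```java
import java.util.Arrays;

/**
 * Simplified illustration of parity-based recovery.
 * Real RS(6,3) uses Galois-field arithmetic and three parity blocks;
 * this sketch uses a single XOR parity block and tolerates one loss.
 */
public class SingleParitySketch {
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    public static void main(String[] args) {
        // Six tiny "data blocks".
        byte[][] data = new byte[6][4];
        for (int i = 0; i < 6; i++) Arrays.fill(data[i], (byte) (i + 1));

        // Parity = D1 ^ D2 ^ ... ^ D6 (the XOR analogue of "A + B = C").
        byte[] parity = new byte[4];
        for (byte[] d : data) parity = xor(parity, d);

        // Disaster: lose D3.
        byte[] lost = data[2];
        data[2] = null;

        // Recovery: XOR the parity with every surviving data block.
        byte[] rebuilt = parity.clone();
        for (byte[] d : data) if (d != null) rebuilt = xor(rebuilt, d);

        System.out.println("Recovered D3 correctly: " + Arrays.equals(rebuilt, lost));
    }
}
```

In real clusters this is enabled per directory with the built-in RS-6-3-1024k policy, typically reserved for cold data.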
The Trade-off
| Feature | 3x Replication | Erasure Coding RS(6,3) |
| :--- | :--- | :--- |
| Storage Overhead | 200% (1 TB -> 3 TB) | 50% (1 TB -> 1.5 TB) |
| Recovery Speed | Fast (just copy a replica) | Slow (CPU-heavy math) |
| Use Case | Hot Data | Cold / Warm Data |
Interactive Demo: Erasure Coding Repair
Visualize recovering lost data using Parity.
- Setup: 6 Data Blocks (Blue), 3 Parity Blocks (Green).
- Disaster: You can destroy up to 3 blocks.
- Magic: The system reconstructs them.
5. Rack Awareness (Reliability)
HDFS is aware of the physical topology. The default replica placement policy (enforced by the Active NameNode) spreads data across failure domains; a simplified sketch of the policy follows the list below:
- Replica 1: The writer's local machine (write speed).
- Replica 2: A node on a different rack (survives a Top-of-Rack switch failure).
- Replica 3: A different node on the same rack as Replica 2 (saves cross-rack bandwidth).
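A minimal sketch of the placement logic described above, not HDFS's actual BlockPlacementPolicyDefault; the node and rack names are made up, and it assumes every rack holds at least two DataNodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Simplified sketch of the 3-replica placement policy:
 *  local node, remote rack, second node on that same remote rack. */
public class RackAwarePlacementSketch {
    private static final Random RNG = new Random();

    static List<String> chooseTargets(String writerNode, String writerRack,
                                      Map<String, List<String>> nodesByRack) {
        // Replica 1: the writer's own node (fast local write).
        String first = writerNode;

        // Replica 2: a random node on a different rack (survives ToR switch failure).
        List<String> otherRacks = new ArrayList<>(nodesByRack.keySet());
        otherRacks.remove(writerRack);
        List<String> remoteNodes = nodesByRack.get(otherRacks.get(RNG.nextInt(otherRacks.size())));
        String second = remoteNodes.get(RNG.nextInt(remoteNodes.size()));

        // Replica 3: a different node on the same remote rack (no third cross-rack hop).
        String third = second;
        while (third.equals(second)) {
            third = remoteNodes.get(RNG.nextInt(remoteNodes.size()));
        }
        return List.of(first, second, third);
    }

    public static void main(String[] args) {
        Map<String, List<String>> topology = Map.of(
                "/rack1", List.of("dn1", "dn2", "dn3"),
                "/rack2", List.of("dn4", "dn5", "dn6"));
        System.out.println(chooseTargets("dn1", "/rack1", topology));
    }
}
```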
Interactive Demo: Rack Awareness & Failure
Visualize how HDFS survives a Rack Failure.
- Write Data: System places replicas according to policy.
- Kill Rack: Simulate a switch failure.
- Verify: Is data still accessible?
6. Observability & Tracing (RED Method)
To manage 5,000 nodes, we rely on the RED Method:
- Rate: RPCs per second to NameNode.
- Alert: Sudden spike = Job gone rogue.
- Errors: Failed DataNode heartbeats.
- Alert: >10% of nodes down.
- Duration: Block Report processing time.
- Warning: If NameNode takes >1s to process reports, it’s overloaded.
JMX Metrics
Hadoop exposes metrics through JMX (Java Management Extensions), which Prometheus can scrape, for example:
- Hadoop:service=NameNode,name=FSNamesystem -> CorruptBlocks
- Hadoop:service=DataNode,name=FSDatasetState -> Remaining
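A minimal sketch of pulling one of these beans from the NameNode's HTTP /jmx endpoint; the hostname and port 9870 (the Hadoop 3 default web UI port) are assumptions about your cluster. In production you would typically scrape the same beans with the Prometheus JMX exporter instead.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class JmxScrapeSketch {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode web UI address (Hadoop 3 default port is 9870; Hadoop 2 used 50070).
        String base = "http://namenode:9870/jmx?qry=";
        String bean = URLEncoder.encode(
                "Hadoop:service=NameNode,name=FSNamesystem", StandardCharsets.UTF_8);

        HttpURLConnection conn =
                (HttpURLConnection) new URL(base + bean).openConnection();
        conn.setRequestMethod("GET");

        // The endpoint returns JSON; here we just print it, but a real scraper would
        // parse out attributes such as CorruptBlocks or UnderReplicatedBlocks.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            in.lines().forEach(System.out::println);
        }
    }
}
```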
7. Deployment Strategy: Rolling Upgrades
You cannot turn off HDFS to upgrade it, so we use Rolling Upgrades with zero downtime (a batched orchestration sketch follows the steps below).
- Upgrade Standby NameNode: Update binaries, restart.
- Failover: Fail over so the upgraded Standby becomes the new Active (now running the new version).
- Upgrade Old Active: Update, restart. It becomes the new Standby.
- Upgrade DataNodes: Batch by batch (e.g., 5% at a time).
- Decommissioning: Gracefully drain a DataNode before reboot to prevent unnecessary replication storms.
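A hedged orchestration sketch of the prepare, DataNode batching, and finalize steps, driving the standard hdfs dfsadmin -rollingUpgrade commands from Java; the hostnames, batch size, and restart mechanism are placeholders, not a real deployment tool.

```java
import java.util.List;

/** Sketch of a batched rolling upgrade driver that shells out to the hdfs CLI. */
public class RollingUpgradeSketch {
    static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("Command failed: " + String.join(" ", cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. Prepare the rolling upgrade (the NameNode creates a rollback image).
        run("hdfs", "dfsadmin", "-rollingUpgrade", "prepare");

        // 2-3. Upgrade the Standby, fail over, upgrade the old Active
        //      (handled by operators / config management; omitted here).

        // 4. Upgrade DataNodes batch by batch (placeholder hosts, batch size 2).
        List<String> dataNodes = List.of("dn1", "dn2", "dn3", "dn4");
        int batchSize = 2;
        for (int i = 0; i < dataNodes.size(); i += batchSize) {
            for (String host : dataNodes.subList(i, Math.min(i + batchSize, dataNodes.size()))) {
                // Placeholder: restart this DataNode with new binaries, e.g. via SSH/Ansible.
                System.out.println("restart DataNode on " + host);
            }
            // Wait for the batch to re-register with the NameNode before continuing (omitted).
        }

        // 5. Finalize once every node is healthy on the new version.
        run("hdfs", "dfsadmin", "-rollingUpgrade", "finalize");
    }
}
```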
8. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| No Single Point of Failure | NameNode HA (Active/Standby) + ZKFC. |
| Scalability (PB) | Erasure Coding (RS-6-3) reduces storage overhead. |
| Reliability | Rack Awareness (Top-of-Rack failure protection). |
| Throughput | Sequential I/O optimizations. |
| Consistency | Strict (Write-Once-Read-Many). |
9. Interview Gauntlet
I. High Availability
- What happens if the Active NameNode hangs? The ZKFC's health check fails (or its ZooKeeper session times out), so ZooKeeper removes the ephemeral lock znode. The Standby's ZKFC sees the lock disappear and transitions the Standby to Active.
- How to prevent Split Brain? Fencing (STONITH). The new Active cuts off the old Active (e.g., via SSH kill or revoking shared storage access).
II. Data Reliability
- Explain Rack Awareness. 3 replicas: 1 on the local node, 1 on a different rack, and 1 on another node of that same remote rack. Survives both node and switch failures.
- Why Erasure Coding? To save 50% disk space on cold data. Replaces 3x replication.
III. Operations
- How to handle a dead disk? DataNode reports disk failure. NameNode marks blocks as under-replicated. It instructs other DataNodes to copy the missing blocks from healthy replicas. Self-healing.
10. Summary: The Whiteboard Strategy
If asked to design HDFS, draw this 4-Quadrant Layout:
1. Requirements
- Func: Read, Write (Append), Delete.
- Non-Func: HA (No SPOF), Rack Aware.
- Scale: 10k Nodes, Exabytes.
2. Architecture
[Active NN] -- [JournalNodes] -- [Standby NN]
      |
[DataNodes (Block Storage)]
* Metadata: NameNode HA + Journal.
* Data: Blocks + Erasure Coding.
3. Metadata
BlockMap: B1 -> [DN1, DN2, DN3]
EditLog: Transaction Journal.
4. Reliability Patterns
- HA: Active/Standby via ZK.
- Rack Awareness: Survive switch failure.
- Erasure Coding: RS(6,3) for cost.
Next, we move to the “Nervous System” of modern infra: Apache Kafka.