HDFS (Hadoop Distributed File System)

[!TIP] GFS for the Masses: HDFS is the open-source implementation of GFS. While GFS was C++, HDFS is Java. If you understand GFS, you understand 80% of HDFS. The key differences lie in how HDFS evolved to solve the “Single Master” problem.

1. GFS vs HDFS: The Cheat Sheet

| Feature | GFS (Original) | HDFS (Hadoop 2.0+) |
|:--------|:---------------|:-------------------|
| Master | GFS Master | NameNode (Active + Standby) |
| Slave | ChunkServer | DataNode |
| Block Size | 64 MB | 128 MB (default) |
| Consistency | Relaxed | Write-Once-Read-Many (Strict) |
| Coordination | Chubby | ZooKeeper |

2. Architecture: Solving the SPOF

In early Hadoop (1.0), the NameNode was a SPOF. If it died, the cluster died. Hadoop 2.0 introduced High Availability (HA).

The HA Architecture

HDFS 2.0+ uses a quorum-based architecture to provide hot-standby failover, as shown in the metadata layer of the diagram below.

  1. Active NameNode: The leader that handles all client read/write requests and persists metadata changes.
  2. Standby NameNode: A hot backup that stays synchronized with the Active node by reading from the same JournalNode Quorum.
  3. JournalNodes (Shared Edits): A small cluster of nodes that stores the Edit Logs. A metadata write succeeds only once a quorum of JournalNodes has persisted it.
  4. ZKFC (ZooKeeper Failover Controller): A sidecar process that monitors NameNode health and uses ZooKeeper to coordinate leader election if the Active node fails (a minimal election sketch follows the diagram below).
Figure: HDFS High Availability (Active/Standby NameNodes, JournalNode quorum, ZooKeeper failover). The Active NameNode handles client RPCs, holds the fsImage in RAM, and writes Edit Logs to a three-node JournalNode quorum; the Standby replays those Edit Logs to keep a hot copy of the namespace and takes periodic checkpoints. Each NameNode is paired with a ZKFC, and a ZooKeeper quorum provides leader election and the lock the Standby waits on. Both NameNodes receive heartbeats from the DataNode pool (DataNode 1-3).
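To make the election step concrete, here is a minimal sketch of a ZKFC-style controller using the plain ZooKeeper client API. The znode path, session timeout, and fencing step are illustrative, not the exact values the real ZKFC uses.

```java
import org.apache.zookeeper.*;

public class FailoverControllerSketch implements Watcher {
    // Hypothetical lock path; the real ZKFC keeps its own znode layout under a configured parent znode.
    private static final String LOCK_PATH = "/hdfs-ha/active-lock";
    private final ZooKeeper zk;

    public FailoverControllerSketch(String zkQuorum) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 5000, this);   // 5s session timeout (illustrative)
    }

    /** Try to become Active by creating an ephemeral lock znode. */
    public boolean tryBecomeActive(byte[] myNameNodeId) throws Exception {
        try {
            zk.create(LOCK_PATH, myNameNodeId,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                          // we hold the lock: transition our NameNode to Active
        } catch (KeeperException.NodeExistsException e) {
            zk.exists(LOCK_PATH, true);           // someone else is Active; watch the lock znode
            return false;                         // stay Standby
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            // The Active's session expired, so its ephemeral lock vanished.
            // Fence the old Active (SSH kill / revoke storage access), then race for the lock again.
        }
    }
}
```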

3. The “Small File Problem”

HDFS hates small files because every piece of namespace metadata must fit in the Active NameNode's RAM.

  • The Math: Every object (file, block, directory) takes ~150 bytes of NameNode RAM.
  • The Scenario (counting just the file objects):
    • 1 million x 100 MB files = ~100 TB of data. Metadata ≈ 150 MB of RAM. (Easy.)
    • 100 million x 1 KB files = ~100 GB of data. Metadata ≈ 15 GB of RAM. (Expensive.)
  • Why it kills performance: A heap that large means the NameNode spends much of its time in garbage-collection pauses, during which it cannot serve RPCs.
  • Solution: Hadoop Archives (HAR) or SequenceFiles pack many small files into a few big containers (see the sketch below).
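As a rough illustration of the SequenceFile route, the sketch below packs a directory of local small files into a single SequenceFile keyed by filename. The paths are hypothetical, and a real job would also choose a compression codec.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // assumes fs.defaultFS points at the cluster
        Path packed = new Path("/archive/packed.seq");       // one big HDFS file instead of millions
        File[] smallFiles = new File("/data/small-files").listFiles();   // hypothetical local directory
        if (smallFiles == null) return;

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File small : smallFiles) {
                byte[] bytes = Files.readAllBytes(small.toPath());
                // Key = original filename, value = raw contents. The NameNode now tracks
                // one file plus its blocks instead of one object per tiny file.
                writer.append(new Text(small.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```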

4. Erasure Coding (Hadoop 3.0)

Replication (3x) is expensive. To store 1PB of data, you need 3PB of disk. 2PB is just “backup”. Erasure Coding (EC) changes the game using Reed-Solomon math.

How RS(6, 3) Works

Instead of copying data, we split it and calculate “Parity”. Think of it as Algebra: A + B = C. If you lose A, you can calculate it as C - B.

  • Data Blocks: We split a file into 6 chunks (D1 … D6).
  • Parity Blocks: We calculate 3 parity chunks (P1 … P3) using linear algebra equations (Vandermonde Matrix).
  • Total: 9 Blocks stored on 9 different nodes.
  • Fault Tolerance: You can lose any 3 blocks (Data or Parity) and still recover.
    • Math: With 6 unknowns you need 6 independent equations. The 9 stored blocks give you 9, so any 6 surviving blocks are enough to reconstruct the originals.
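The sketch below uses a single XOR parity block (RAID-5 style) instead of real Reed-Solomon, which works over a Galois field and tolerates three losses, but it shows the same idea: rebuild a missing block by solving the parity equation.

```java
import java.util.Arrays;

public class XorParityDemo {
    public static void main(String[] args) {
        byte[][] data = new byte[6][4];                 // 6 tiny "data blocks"
        for (int i = 0; i < 6; i++)
            for (int j = 0; j < 4; j++) data[i][j] = (byte) (i * 10 + j);

        byte[] parity = new byte[4];                    // P = D1 ^ D2 ^ ... ^ D6
        for (byte[] block : data)
            for (int j = 0; j < 4; j++) parity[j] ^= block[j];

        byte[] lost = data[2].clone();                  // pretend block D3 is destroyed
        data[2] = null;

        byte[] rebuilt = parity.clone();                // rebuild: P ^ (all surviving blocks) = lost block
        for (byte[] block : data)
            if (block != null)
                for (int j = 0; j < 4; j++) rebuilt[j] ^= block[j];

        System.out.println(Arrays.equals(lost, rebuilt));   // true: the block was reconstructed
    }
}
```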

The Trade-off

| Feature | 3x Replication | Erasure Coding RS(6,3) |
| :--- | :--- | :--- |
| Storage Overhead | 200% (1 TB -> 3 TB) | 50% (1 TB -> 1.5 TB) |
| Recovery Speed | Fast (just copy) | Slow (CPU-heavy math) |
| Use Case | Hot data | Cold / warm data |

Interactive Demo: Erasure Coding Repair

Visualize recovering lost data using Parity.

  • Setup: 6 Data Blocks (Blue), 3 Parity Blocks (Green).
  • Disaster: You can destroy up to 3 blocks.
  • Magic: The system reconstructs them.

5. Rack Awareness (Reliability)

HDFS is aware of the physical topology. The Replication Policy (managed by the Active NameNode) ensures data is spread across failure domains:

  1. Replica 1: Local Machine (Speed).
  2. Replica 2: Different Rack (Survivability against Top-of-Rack Switch failure).
  3. Replica 3: Same Rack as #2 (Bandwidth optimization).
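A simplified stand-in for the default placement logic (not Hadoop's actual BlockPlacementPolicy class, just the decision it encodes):

```java
import java.util.*;

public class ReplicaPlacementSketch {
    record Node(String name, String rack) {}

    static List<Node> placeReplicas(Node writer, List<Node> cluster, Random rnd) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer);                                      // replica 1: the writer's local node

        List<Node> offRack = cluster.stream()                    // replica 2: a node on a different rack
                .filter(n -> !n.rack().equals(writer.rack()))
                .toList();
        Node second = offRack.get(rnd.nextInt(offRack.size()));
        chosen.add(second);

        List<Node> sameRackAsSecond = cluster.stream()           // replica 3: same rack as replica 2
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .toList();
        chosen.add(sameRackAsSecond.get(rnd.nextInt(sameRackAsSecond.size())));
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn1", "/rack1"), new Node("dn2", "/rack1"),
                new Node("dn3", "/rack2"), new Node("dn4", "/rack2"));
        // Writer sits on dn1: expect [dn1, one of rack2, the other rack2 node]
        System.out.println(placeReplicas(cluster.get(0), cluster, new Random(42)));
    }
}
```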

Interactive Demo: Rack Awareness & Failure

Visualize how HDFS survives a Rack Failure.

  1. Write Data: System places replicas according to policy.
  2. Kill Rack: Simulate a switch failure.
  3. Verify: Is data still accessible?
[Demo layout: Rack 1 (Switch A) holds Node A and Node B; Rack 2 (Switch B) holds Node C and Node D. Both racks start online, and data availability is reported after each step.]

6. Observability & Tracing (RED Method)

To manage 5,000 nodes, we rely on the RED Method:

  1. Rate: RPCs per second to NameNode.
    • Alert: Sudden spike = Job gone rogue.
  2. Errors: Failed DataNode heartbeats.
    • Alert: >10% of nodes down.
  3. Duration: Block Report processing time.
    • Warning: If NameNode takes >1s to process reports, it’s overloaded.

JMX Metrics

Hadoop exposes its metrics through Java Management Extensions (JMX), which a Prometheus exporter can scrape:

  • Hadoop:service=NameNode,name=NameNodeInfo:CorruptBlocks
  • Hadoop:service=DataNode,name=FSDatasetState:Remaining
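A minimal scrape of the NameNode's JMX servlet, assuming the default Hadoop 3 web port (9870) and a hypothetical hostname; an exporter would parse the returned JSON and surface attributes like the ones above to Prometheus.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JmxScrapeSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://namenode.example.com:9870/jmx"))   // hypothetical NameNode host
                .GET().build();

        // The /jmx servlet returns one JSON entry per MBean (NameNodeInfo, FSDatasetState, ...).
        String json = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(json.substring(0, Math.min(json.length(), 200)));
    }
}
```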

7. Deployment Strategy: Rolling Upgrades

You cannot turn off HDFS to upgrade it. We use Rolling Upgrades with zero downtime.

  1. Upgrade Standby NameNode: Update binaries, restart.
  2. Failover: Promote the upgraded Standby to Active (the new Active now runs the new binaries).
  3. Upgrade Old Active: Update, restart. It becomes the new Standby.
  4. Upgrade DataNodes: Batch by batch (e.g., 5% at a time).
    • Decommissioning: Gracefully drain a DataNode before reboot to prevent unnecessary replication storms.

8. Requirements Traceability Matrix

| Requirement | Architectural Solution |
| :--- | :--- |
| No Single Point of Failure | NameNode HA (Active/Standby) + ZKFC |
| Scalability (PB) | Erasure Coding (RS-6-3) reduces storage overhead |
| Reliability | Rack Awareness (Top-of-Rack failure protection) |
| Throughput | Sequential I/O optimizations |
| Consistency | Strict (Write-Once-Read-Many) |

9. Interview Gauntlet

I. High Availability

  • What happens if the Active NameNode hangs? Its ZKFC's health check fails (or the ZooKeeper session expires), so ZooKeeper deletes the ephemeral lock znode. The Standby's ZKFC sees the deletion, acquires the lock, fences the old Active, and transitions its NameNode to Active.
  • How to prevent Split Brain? Fencing (STONITH). The new Active cuts off the old Active (e.g., via SSH kill or revoking shared storage access).

II. Data Reliability

  • Explain Rack Awareness. 3 replicas: one on the local node, one on a different rack, and a third on that same remote rack. This survives both node and Top-of-Rack switch failures.
  • Why Erasure Coding? To save 50% disk space on cold data. Replaces 3x replication.

III. Operations

  • How to handle a dead disk? DataNode reports disk failure. NameNode marks blocks as under-replicated. It instructs other DataNodes to copy the missing blocks from healthy replicas. Self-healing.

10. Summary: The Whiteboard Strategy

If asked to design HDFS, draw this 4-Quadrant Layout:

1. Requirements

  • Func: Read, Write (Append), Delete.
  • Non-Func: HA (No SPOF), Rack Aware.
  • Scale: 10k Nodes, Exabytes.

2. Architecture

[ZK Quorum] -- [ZKFC]
|
[Active NN] -- [JournalNodes] -- [Standby NN]
|
[DataNodes (Block Storage)]

* Metadata: NameNode HA + Journal.
* Data: Blocks + Erasure Coding.

3. Metadata

Inode: /user/data -> [Block 1, Block 2]
BlockMap: B1 -> [DN1, DN2, DN3]
EditLog: Transaction Journal.
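
A toy in-memory model of those three structures (illustrative only, not the real NameNode classes):

```java
import java.util.*;

public class NameNodeMetadataSketch {
    // Inode: path -> ordered list of block IDs
    final Map<String, List<Long>> inodes = new HashMap<>();
    // BlockMap: block ID -> DataNodes currently holding a replica
    final Map<Long, Set<String>> blockMap = new HashMap<>();

    void addFile(String path, long... blockIds) {
        List<Long> blocks = new ArrayList<>();
        for (long id : blockIds) { blocks.add(id); blockMap.put(id, new HashSet<>()); }
        inodes.put(path, blocks);
        // In real HDFS this mutation is also appended to the EditLog (the transaction
        // journal on the JournalNodes) before it is acknowledged to the client.
    }

    void reportReplica(long blockId, String dataNode) {
        blockMap.get(blockId).add(dataNode);   // built up from DataNode block reports
    }

    public static void main(String[] args) {
        NameNodeMetadataSketch nn = new NameNodeMetadataSketch();
        nn.addFile("/user/data", 1L, 2L);
        nn.reportReplica(1L, "DN1"); nn.reportReplica(1L, "DN2"); nn.reportReplica(1L, "DN3");
        System.out.println(nn.inodes + " " + nn.blockMap);
    }
}
```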

4. Reliability Patterns

  • HA: Active/Standby via ZK.
  • Rack Awareness: Survive switch failure.
  • Erasure Coding: RS(6,3) for cost.

Next, we move to the “Nervous System” of modern infra: Apache Kafka.