HDFS (Hadoop Distributed File System)

[!TIP] GFS for the Masses: HDFS is the open-source implementation of GFS. While GFS was C++, HDFS is Java. If you understand GFS, you understand 80% of HDFS. The key differences lie in how HDFS evolved to solve the “Single Master” problem.

1. GFS vs HDFS: The Cheat Sheet

| Feature | GFS (Original) | HDFS (Hadoop 2.0+) |
|:--------|:---------------|:-------------------|
| Master | GFS Master | NameNode (Active + Standby) |
| Slave | ChunkServer | DataNode |
| Block Size | 64 MB | 128 MB (default) |
| Consistency | Relaxed | Write-Once-Read-Many (Strict) |
| Coordination | Chubby | ZooKeeper |

2. Architecture: Solving the SPOF

In early Hadoop (1.0), the NameNode was a SPOF. If it died, the cluster died. Hadoop 2.0 introduced High Availability (HA).

The HA Architecture

HDFS 2.0+ uses a quorum-based architecture to provide hot-standby failover, as shown in the metadata layer of the diagram below.

  1. Active NameNode: The leader that handles all client read/write requests and persists metadata changes.
  2. Standby NameNode: A hot backup that stays synchronized with the Active node by reading from the same JournalNode Quorum.
  3. JournalNodes (Shared Edits): A small cluster of nodes that stores the Edit Logs. A metadata write succeeds only once a quorum of JournalNodes has persisted it.
  4. ZKFC (ZooKeeper Failover Controller): A sidecar process that monitors NameNode health and uses ZooKeeper to coordinate leader election if the Active node fails (a minimal election sketch follows the diagram below).
Figure: HDFS High Availability (Active/Standby NameNodes, JournalNode quorum, ZooKeeper failover). The Active NameNode handles client RPCs, holds the fsImage in RAM, and writes Edit Logs to a three-node JournalNode quorum; the Standby replays those Edit Logs to keep a hot copy of the namespace and takes periodic checkpoints. Each NameNode is paired with a ZKFC, and a ZooKeeper quorum provides leader election and the lock the Standby waits on. Both NameNodes receive heartbeats from the DataNode pool (DataNode 1-3).
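To make the election step concrete, here is a minimal sketch of a ZKFC-style controller using the plain ZooKeeper client API. The znode path, session timeout, and fencing step are illustrative, not the exact values the real ZKFC uses.

```java
import org.apache.zookeeper.*;

public class FailoverControllerSketch implements Watcher {
    // Hypothetical lock path; the real ZKFC keeps its own znode layout under a configured parent znode.
    private static final String LOCK_PATH = "/hdfs-ha/active-lock";
    private final ZooKeeper zk;

    public FailoverControllerSketch(String zkQuorum) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 5000, this);   // 5s session timeout (illustrative)
    }

    /** Try to become Active by creating an ephemeral lock znode. */
    public boolean tryBecomeActive(byte[] myNameNodeId) throws Exception {
        try {
            zk.create(LOCK_PATH, myNameNodeId,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                          // we hold the lock: transition our NameNode to Active
        } catch (KeeperException.NodeExistsException e) {
            zk.exists(LOCK_PATH, true);           // someone else is Active; watch the lock znode
            return false;                         // stay Standby
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            // The Active's session expired, so its ephemeral lock vanished.
            // Fence the old Active (SSH kill / revoke storage access), then race for the lock again.
        }
    }
}
```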

3. The “Small File Problem”

HDFS hates small files because every piece of namespace metadata must fit in the Active NameNode's RAM.

  • The Math: Every object (file, block, directory) takes ~150 bytes of NameNode RAM.
  • The Scenario (counting just the file objects):
    • 1 million x 100 MB files = ~100 TB of data. Metadata ≈ 150 MB of RAM. (Easy.)
    • 100 million x 1 KB files = ~100 GB of data. Metadata ≈ 15 GB of RAM. (Expensive.)
  • Why it kills performance: A heap that large means the NameNode spends much of its time in garbage-collection pauses, during which it cannot serve RPCs.
  • Solution: Hadoop Archives (HAR) or SequenceFiles pack many small files into a few big containers (see the sketch below).
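As a rough illustration of the SequenceFile route, the sketch below packs a directory of local small files into a single SequenceFile keyed by filename. The paths are hypothetical, and a real job would also choose a compression codec.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // assumes fs.defaultFS points at the cluster
        Path packed = new Path("/archive/packed.seq");       // one big HDFS file instead of millions
        File[] smallFiles = new File("/data/small-files").listFiles();   // hypothetical local directory
        if (smallFiles == null) return;

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(packed),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File small : smallFiles) {
                byte[] bytes = Files.readAllBytes(small.toPath());
                // Key = original filename, value = raw contents. The NameNode now tracks
                // one file plus its blocks instead of one object per tiny file.
                writer.append(new Text(small.getName()), new BytesWritable(bytes));
            }
        }
    }
}
```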

4. Erasure Coding (Hadoop 3.0)

Replication (3x) is expensive. To store 1PB of data, you need 3PB of disk. 2PB is just “backup”. Erasure Coding (EC) changes the game using Reed-Solomon math.

How RS(6, 3) Works

Instead of copying data, we split it and calculate “Parity”. Think of it as Algebra: A + B = C. If you lose A, you can calculate it as C - B.

  • Data Blocks: We split a file into 6 chunks (D1 … D6).
  • Parity Blocks: We calculate 3 parity chunks (P1 … P3) using linear algebra equations (Vandermonde Matrix).
  • Total: 9 Blocks stored on 9 different nodes.
  • Fault Tolerance: You can lose any 3 blocks (Data or Parity) and still recover.
    • Math: With 6 unknowns you need 6 independent equations. The 9 stored blocks give you 9, so any 6 surviving blocks are enough to reconstruct the originals.
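The sketch below uses a single XOR parity block (RAID-5 style) instead of real Reed-Solomon, which works over a Galois field and tolerates three losses, but it shows the same idea: rebuild a missing block by solving the parity equation.

```java
import java.util.Arrays;

public class XorParityDemo {
    public static void main(String[] args) {
        byte[][] data = new byte[6][4];                 // 6 tiny "data blocks"
        for (int i = 0; i < 6; i++)
            for (int j = 0; j < 4; j++) data[i][j] = (byte) (i * 10 + j);

        byte[] parity = new byte[4];                    // P = D1 ^ D2 ^ ... ^ D6
        for (byte[] block : data)
            for (int j = 0; j < 4; j++) parity[j] ^= block[j];

        byte[] lost = data[2].clone();                  // pretend block D3 is destroyed
        data[2] = null;

        byte[] rebuilt = parity.clone();                // rebuild: P ^ (all surviving blocks) = lost block
        for (byte[] block : data)
            if (block != null)
                for (int j = 0; j < 4; j++) rebuilt[j] ^= block[j];

        System.out.println(Arrays.equals(lost, rebuilt));   // true: the block was reconstructed
    }
}
```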

The Trade-off

| Feature | 3x Replication | Erasure Coding RS(6,3) |
| :--- | :--- | :--- |
| Storage Overhead | 200% (1 TB -> 3 TB) | 50% (1 TB -> 1.5 TB) |
| Recovery Speed | Fast (just copy) | Slow (CPU-heavy math) |
| Use Case | Hot data | Cold / warm data |

Interactive Demo: Erasure Coding Repair

Visualize recovering lost data using Parity.

  • Setup: 6 Data Blocks (Blue), 3 Parity Blocks (Green).
  • Disaster: You can destroy up to 3 blocks.
  • Magic: The system reconstructs them.

5. Rack Awareness (Reliability)

HDFS is aware of the physical topology. The Replication Policy (managed by the Active NameNode) ensures data is spread across failure domains:

  1. Replica 1: Local Machine (Speed).
  2. Replica 2: Different Rack (Survivability against Top-of-Rack Switch failure).
  3. Replica 3: Same Rack as #2 (Bandwidth optimization).
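A simplified stand-in for the default placement logic (not Hadoop's actual BlockPlacementPolicy class, just the decision it encodes):

```java
import java.util.*;

public class ReplicaPlacementSketch {
    record Node(String name, String rack) {}

    static List<Node> placeReplicas(Node writer, List<Node> cluster, Random rnd) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer);                                      // replica 1: the writer's local node

        List<Node> offRack = cluster.stream()                    // replica 2: a node on a different rack
                .filter(n -> !n.rack().equals(writer.rack()))
                .toList();
        Node second = offRack.get(rnd.nextInt(offRack.size()));
        chosen.add(second);

        List<Node> sameRackAsSecond = cluster.stream()           // replica 3: same rack as replica 2
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .toList();
        chosen.add(sameRackAsSecond.get(rnd.nextInt(sameRackAsSecond.size())));
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn1", "/rack1"), new Node("dn2", "/rack1"),
                new Node("dn3", "/rack2"), new Node("dn4", "/rack2"));
        // Writer sits on dn1: expect [dn1, one of rack2, the other rack2 node]
        System.out.println(placeReplicas(cluster.get(0), cluster, new Random(42)));
    }
}
```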

Interactive Demo: Rack Awareness & Failure

Visualize how HDFS survives a Rack Failure.

  1. Write Data: System places replicas according to policy.
  2. Kill Rack: Simulate a switch failure.
  3. Verify: Is data still accessible?
[Demo layout: Rack 1 (Switch A) holds Node A and Node B; Rack 2 (Switch B) holds Node C and Node D. Both racks start online, and data availability is reported after each step.]

6. Observability & Tracing (RED Method)

To manage 5,000 nodes, we rely on the RED Method:

  1. Rate: RPCs per second to NameNode.
    • Alert: Sudden spike = Job gone rogue.
  2. Errors: Failed DataNode heartbeats.
    • Alert: >10% of nodes down.
  3. Duration: Block Report processing time.
    • Warning: If NameNode takes >1s to process reports, it’s overloaded.

JMX Metrics

Hadoop exposes its metrics through Java Management Extensions (JMX), which a Prometheus exporter can scrape:

  • Hadoop:service=NameNode,name=NameNodeInfo:CorruptBlocks
  • Hadoop:service=DataNode,name=FSDatasetState:Remaining
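A minimal scrape of the NameNode's JMX servlet, assuming the default Hadoop 3 web port (9870) and a hypothetical hostname; an exporter would parse the returned JSON and surface attributes like the ones above to Prometheus.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JmxScrapeSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://namenode.example.com:9870/jmx"))   // hypothetical NameNode host
                .GET().build();

        // The /jmx servlet returns one JSON entry per MBean (NameNodeInfo, FSDatasetState, ...).
        String json = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(json.substring(0, Math.min(json.length(), 200)));
    }
}
```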

7. Deployment Strategy: Rolling Upgrades

You cannot turn off HDFS to upgrade it. We use Rolling Upgrades with zero downtime.

  1. Upgrade Standby NameNode: Update binaries, restart.
  2. Failover: Promote the upgraded Standby to Active (the new Active now runs the new binaries).
  3. Upgrade Old Active: Update, restart. It becomes the new Standby.
  4. Upgrade DataNodes: Batch by batch (e.g., 5% at a time).
    • Decommissioning: Gracefully drain a DataNode before reboot to prevent unnecessary replication storms.

8. Requirements Traceability Matrix

| Requirement | Architectural Solution |
| :--- | :--- |
| No Single Point of Failure | NameNode HA (Active/Standby) + ZKFC |
| Scalability (PB) | Erasure Coding (RS-6-3) reduces storage overhead |
| Reliability | Rack Awareness (Top-of-Rack failure protection) |
| Throughput | Sequential I/O optimizations |
| Consistency | Strict (Write-Once-Read-Many) |

9. Interview Gauntlet

I. High Availability

  • What happens if the Active NameNode hangs? Its ZKFC's health check fails (or the ZooKeeper session expires), so ZooKeeper deletes the ephemeral lock znode. The Standby's ZKFC sees the deletion, acquires the lock, fences the old Active, and transitions its NameNode to Active.
  • How to prevent Split Brain? Fencing (STONITH). The new Active cuts off the old Active (e.g., via SSH kill or revoking shared storage access).

II. Data Reliability

  • Explain Rack Awareness. 3 replicas: one on the local node, one on a different rack, and a third on that same remote rack. This survives both node and Top-of-Rack switch failures.
  • Why Erasure Coding? To save 50% disk space on cold data. Replaces 3x replication.

III. Operations

  • How to handle a dead disk? DataNode reports disk failure. NameNode marks blocks as under-replicated. It instructs other DataNodes to copy the missing blocks from healthy replicas. Self-healing.

10. Summary: The Whiteboard Strategy

If asked to design HDFS, draw this 4-Quadrant Layout:

1. Requirements

  • Func: Read, Write (Append), Delete.
  • Non-Func: HA (No SPOF), Rack Aware.
  • Scale: 10k Nodes, Exabytes.

2. Architecture

[ZK Quorum] -- [ZKFC]
|
[Active NN] -- [JournalNodes] -- [Standby NN]
|
[DataNodes (Block Storage)]

* Metadata: NameNode HA + Journal.
* Data: Blocks + Erasure Coding.

3. Metadata

Inode: /user/data -> [Block 1, Block 2]
BlockMap: B1 -> [DN1, DN2, DN3]
EditLog: Transaction Journal.
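
A toy in-memory model of those three structures (illustrative only, not the real NameNode classes):

```java
import java.util.*;

public class NameNodeMetadataSketch {
    // Inode: path -> ordered list of block IDs
    final Map<String, List<Long>> inodes = new HashMap<>();
    // BlockMap: block ID -> DataNodes currently holding a replica
    final Map<Long, Set<String>> blockMap = new HashMap<>();

    void addFile(String path, long... blockIds) {
        List<Long> blocks = new ArrayList<>();
        for (long id : blockIds) { blocks.add(id); blockMap.put(id, new HashSet<>()); }
        inodes.put(path, blocks);
        // In real HDFS this mutation is also appended to the EditLog (the transaction
        // journal on the JournalNodes) before it is acknowledged to the client.
    }

    void reportReplica(long blockId, String dataNode) {
        blockMap.get(blockId).add(dataNode);   // built up from DataNode block reports
    }

    public static void main(String[] args) {
        NameNodeMetadataSketch nn = new NameNodeMetadataSketch();
        nn.addFile("/user/data", 1L, 2L);
        nn.reportReplica(1L, "DN1"); nn.reportReplica(1L, "DN2"); nn.reportReplica(1L, "DN3");
        System.out.println(nn.inodes + " " + nn.blockMap);
    }
}
```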

4. Reliability Patterns

  • HA: Active/Standby via ZK.
  • Rack Awareness: Survive switch failure.
  • Erasure Coding: RS(6,3) for cost.

Next, we move to the “Nervous System” of modern infra: Apache Kafka.