Design Amazon S3 (Object Storage)

[!NOTE] This module explores the core principles of Design Amazon S3 (Object Storage), deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. What is Amazon S3?

Amazon S3 (Simple Storage Service) is an Object Storage service that offers industry-leading scalability, data availability, security, and performance. Unlike a file system (hierarchical, POSIX), S3 is a flat namespace where you store “Objects” (files) inside “Buckets” (containers).

Key Characteristics

  • Scale: Exabytes of data. Trillions of objects.
  • Durability: 11 9s (99.999999999%). You essentially never lose data.
  • Availability: 99.99%.
  • Performance: High throughput for large blobs.

Try it yourself: Upload a 10 GB file to S3 via the AWS CLI. Notice that it finishes faster than a single TCP connection could manage? That's Multipart Upload in action: the CLI splits the file into parts and uploads them in parallel.


2. Requirements & Goals

Functional Requirements

  • Bucket Operations: Create/Delete Bucket.
  • Object Operations: Put, Get, Delete, List Objects.
  • Versioning: Support multiple versions of an object.
  • Large Files: Support files up to 5TB (via Multipart).

Non-Functional Requirements

  • Durability: 11 9s. We must tolerate simultaneous disk/rack/DC failures.
  • Availability: The system must always accept writes and reads.
  • Scalability: Horizontal scaling for both storage and metadata.
  • Consistency: Since 2020, S3 offers Strong Consistency. A successful PUT is immediately visible to a subsequent GET.

3. Capacity Estimation

Let’s design for a massive scale.

Storage

  • Total Objects: 100 Billion.
  • Avg Size: 1 MB.
  • Total Data: 100 Billion * 1 MB = 100 Petabytes (PB).
  • Growth: 10% month-over-month.

Throughput

  • Read QPS: 100,000 QPS.
  • Write QPS: 10,000 QPS.
  • Bandwidth: If avg request is 1MB, 100k QPS = 100 GB/s outbound.
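The back-of-envelope arithmetic above can be checked in a few lines (decimal units throughout):

```python
# Capacity estimation from the numbers above (1 PB = 1e9 MB, decimal units).
total_objects = 100e9            # 100 billion objects
avg_size_mb = 1                  # 1 MB average object size
total_pb = total_objects * avg_size_mb / 1e9   # MB -> PB

read_qps = 100_000
out_gbps = read_qps * avg_size_mb / 1e3        # MB/s -> GB/s

print(f"Total data: {total_pb:.0f} PB")    # Total data: 100 PB
print(f"Outbound bandwidth: {out_gbps:.0f} GB/s")  # Outbound bandwidth: 100 GB/s
```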

4. System APIs

S3 uses a RESTful API.

4.1 Bucket Operations

PUT /my-bucket
DELETE /my-bucket

4.2 Object Operations

PUT /my-bucket/photo.jpg
Body: <binary_data>

GET /my-bucket/photo.jpg
Response: 200 OK, Body: <binary_data>

4.3 Multipart Upload

Recommended for files larger than 100 MB; required beyond 5 GB, since a single PUT is capped at 5 GB.

  1. Initiate: POST /bucket/file?uploads → Returns UploadId.
  2. Upload Part: PUT /bucket/file?partNumber=1&uploadId=xyz.
  3. Complete: POST /bucket/file?uploadId=xyz (Merges parts).
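The three-step protocol above can be simulated locally. This is a toy in-memory model of the server-side bookkeeping, not the real S3 API; names like `uploads` and `objects` are illustrative:

```python
import uuid

# Toy in-memory server state standing in for S3's multipart bookkeeping.
uploads: dict[str, dict[int, bytes]] = {}   # UploadId -> {partNumber: bytes}
objects: dict[str, bytes] = {}              # object key -> final blob

def initiate_upload() -> str:
    """Step 1: POST ?uploads -> returns an UploadId."""
    upload_id = uuid.uuid4().hex
    uploads[upload_id] = {}
    return upload_id

def upload_part(upload_id: str, part_number: int, data: bytes) -> None:
    """Step 2: PUT ?partNumber=N&uploadId=... (parts may arrive in any order)."""
    uploads[upload_id][part_number] = data

def complete_upload(upload_id: str, key: str) -> None:
    """Step 3: POST ?uploadId=... -- concatenate parts in part-number order."""
    parts = uploads.pop(upload_id)
    objects[key] = b"".join(parts[n] for n in sorted(parts))

uid = initiate_upload()
upload_part(uid, 2, b"world")    # out-of-order arrival is fine
upload_part(uid, 1, b"hello ")
complete_upload(uid, "my-bucket/file")
print(objects["my-bucket/file"])  # b'hello world'
```

The key property: the server only stitches parts together at Complete time, so part order on the wire never matters.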

5. Database Design

We separate Metadata from Data.

Metadata Store (Key-Value)

Stores attributes: Name, Size, Owner, ACLs, Location (Pointer to Block Store).

  • Key: BucketName + ObjectName.
  • Value: JSON Metadata + List of Block IDs.
  • Tech Choice: NewSQL (CockroachDB/Spanner) or Sharded KV (Cassandra/DynamoDB) with Paxos for Strong Consistency.
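A minimal sketch of one metadata row, assuming the composite-key scheme described above (field names are illustrative, not S3's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class ObjectMeta:
    """One row in the metadata KV store (illustrative fields)."""
    bucket: str
    name: str
    size: int
    owner: str
    block_ids: list[str] = field(default_factory=list)  # pointers into the Block Store

    @property
    def key(self) -> str:
        # Composite key: bucket + object name yields a flat, sortable namespace,
        # which is what makes prefix-based LIST operations cheap.
        return f"{self.bucket}/{self.name}"

meta = ObjectMeta("my-bucket", "photo.jpg", 1_048_576, "alice", ["blk-7f3a"])
print(meta.key)  # my-bucket/photo.jpg
```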

Block Store (Blob)

Stores the immutable bits.

  • Filesystem: Custom lightweight FS (like Facebook Haystack) optimized for large sequential writes.
  • Addressing: Addressed by BlockID (UUID).

6. High-Level Design

Architecture separating Metadata and Data planes.

[Figure: System Architecture for Object Storage (S3). The User's Client hits the Interface layer of API Nodes (Auth/IAM, rate limiting, routing). The control path flows to the Metadata Plane: a Metadata Service (strongly consistent KV store holding the namespace) and a Placement Service (allocates Block IDs, monitors node health). The data path flows to the Storage Plane: a Storage Cluster using Erasure Coding across racks, e.g. data chunks on Racks 1 and 2, parity chunks on Racks 3 and 4. A PUT Object proceeds as: 1. Get Block ID, 2. Stream Data (Write), 3. Commit.]
  1. Client sends PUT /bucket/file.jpg.
  2. API Node authenticates request.
  3. Metadata Service checks bucket exists and authorizes user.
  4. Placement Service allocates a BlockID and determines which Storage Nodes to write to.
  5. API Node streams data to Storage Nodes (using Erasure Coding).
  6. Once data is durable (written to a quorum of storage nodes), the Metadata Service commits the object (mapping file.jpg → BlockID).
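The write path above can be simulated end-to-end with all services stubbed in-memory. The point of the sketch is ordering: the metadata commit happens last, only after the data is durable, so a crash mid-write leaves at worst an orphaned block, never a dangling pointer:

```python
import uuid

# In-memory stubs for the two planes (illustrative, not production code).
metadata: dict[str, str] = {}   # object key -> block id (committed last)
storage: dict[str, bytes] = {}  # block id -> durable bytes

def put_object(key: str, data: bytes) -> None:
    block_id = uuid.uuid4().hex   # 4. Placement Service allocates a BlockID
    storage[block_id] = data      # 5. stream data to storage nodes (made durable)
    metadata[key] = block_id      # 6. commit the mapping only after durability

def get_object(key: str) -> bytes:
    return storage[metadata[key]]

put_object("my-bucket/file.jpg", b"fake jpeg bytes")
print(get_object("my-bucket/file.jpg"))  # b'fake jpeg bytes'
```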

7. Component Design (Deep Dive)

11 9s Durability: Erasure Coding

Storing 3 full copies of 100 PB means provisioning 300 PB of raw capacity. At this scale, that is prohibitively expensive.

  • Replication: 200% overhead (3 copies). Safe but wasteful.
  • Erasure Coding (EC): Breaks data into N data chunks and K parity chunks.
  • Reed-Solomon (10, 4): Split file into 10 parts. Calculate 4 parity parts.
  • Overhead: Only 40% (vs 200%).
  • Durability: Can lose ANY 4 drives and still recover.
  • Trade-off: High CPU usage for calculation, but storage savings are worth it.
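A deliberately simplified illustration of the parity idea: a single XOR parity chunk (RAID-5 style) can recover exactly one lost chunk. Real Reed-Solomon (10, 4) generalizes this with Galois-field arithmetic so that any 4 of the 14 chunks can be lost:

```python
# Simplified stand-in for Reed-Solomon: one XOR parity chunk over equal-size
# data chunks. Losing any single chunk is recoverable because XOR is its own
# inverse: parity ^ (all surviving chunks) == the missing chunk.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(chunks: list[bytes]) -> bytes:
    """Compute the parity chunk: byte-wise XOR of all data chunks."""
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return parity

def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing chunk from the survivors plus parity."""
    missing = parity
    for c in surviving:
        missing = xor_bytes(missing, c)
    return missing

data = [b"AAAA", b"BBBB", b"CCCC"]   # 3 equal-size data chunks
parity = encode(data)
lost = data.pop(1)                   # "destroy" the middle chunk
print(recover(data, parity) == lost)  # True: parity rebuilds it
```

The storage overhead here is 1 parity per 3 data chunks (33%); Reed-Solomon (10, 4) pays 40% but survives four simultaneous losses instead of one.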

Strong Consistency (The 2020 Shift)

For years, S3 was Eventually Consistent: after overwriting an object, a read could still return the stale version. In 2020, AWS switched it to Strong Consistency.

  • How?: The Metadata layer now uses a Distributed Consensus Algorithm (likely variants of Paxos or Raft) for every single write.
  • Why now?: Hardware got faster. Network latency dropped. CPU is cheaper. The overhead of consensus is now negligible compared to the network transfer time of the data blob.
  • Cache Coherency: They also implemented a system to actively invalidate caches across the fleet immediately upon commit.

Multipart Upload

Uploading a 5GB file in one stream is risky. If it fails at 99%, you retry from zero.

  • Parallelism: Break file into 50 chunks of 100MB. Upload them in parallel.
  • Resiliency: If chunk 45 fails, retry only chunk 45.
  • Throughput: Maximize bandwidth by saturating multiple TCP connections.
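A minimal sketch of the retry-per-chunk idea: upload all chunks in parallel and resend only the ones that fail. `flaky_upload` is a stand-in for a real part PUT, rigged so that even-numbered parts fail on their first attempt:

```python
from concurrent.futures import ThreadPoolExecutor

attempts: dict[int, int] = {}  # part number -> how many tries so far

def flaky_upload(part_number: int) -> int:
    """Stand-in for PUT ?partNumber=N: even parts fail on the first attempt."""
    attempts[part_number] = attempts.get(part_number, 0) + 1
    if part_number % 2 == 0 and attempts[part_number] == 1:
        raise IOError(f"transient failure on part {part_number}")
    return part_number

def upload_with_retry(part_number: int, max_tries: int = 3) -> int:
    for _ in range(max_tries):
        try:
            return flaky_upload(part_number)  # only this chunk is resent
        except IOError:
            continue
    raise RuntimeError(f"part {part_number} gave up")

# 50 chunks of ~100 MB each, uploaded over 8 parallel connections.
with ThreadPoolExecutor(max_workers=8) as pool:
    done = sorted(pool.map(upload_with_retry, range(1, 51)))
print(len(done))  # 50
```

A failed part costs one ~100 MB resend instead of restarting the whole 5 GB stream.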

8. Requirements Traceability

| Requirement | Design Decision | Justification |
|---|---|---|
| 11 9s Durability | Erasure Coding (10+4) | Tolerates the loss of any 4 chunks (spread across disks/racks/AZs) with only 40% storage overhead. |
| Scalability | Separated Control/Data Plane | Metadata scales independently of storage; the data path bypasses the metadata bottleneck. |
| Cost | Tiered Storage (Glacier) | Moves cold objects to cheaper, slower media (tape/HDD) automatically. |
| Performance | Multipart Upload | Parallelizes writes to maximize throughput and fault tolerance. |
| Consistency | Consensus (Paxos) | Ensures metadata updates are atomic and strongly consistent. |

9. Observability & Metrics

Key Metrics

  • Durability: Checksums. Background scrubbers constantly read data to verify integrity.
  • Availability: Error Rate (5xx).
  • Latency: Time to First Byte (TTFB).
  • Storage Efficiency: (Used Space / Raw Space). Monitor overhead of Erasure Coding.
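The background-scrubber idea can be sketched in a few lines: record a checksum at write time, re-read and re-hash later, and flag any block whose bits have silently rotted (all names here are illustrative):

```python
import hashlib

# Checksums recorded at write time.
blocks = {"blk-1": b"hello", "blk-2": b"world"}
checksums = {bid: hashlib.sha256(data).hexdigest() for bid, data in blocks.items()}

# Simulate silent corruption (bit rot) on one block.
blocks["blk-2"] = b"w0rld"

# Scrubber pass: re-read every block and compare against the stored checksum.
corrupted = [bid for bid, data in blocks.items()
             if hashlib.sha256(data).hexdigest() != checksums[bid]]
print(corrupted)  # ['blk-2']
```

In production the corrupted block would then be rebuilt from its erasure-coded peers rather than merely reported.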

10. Deployment Strategy

Immutable Infrastructure

We never patch storage nodes. We replace them.

  • Data Migration: When a disk is retiring, the system treats it as “failed” and reconstructs its data onto a new node using Erasure Coding.
  • Zone Deployment: Updates are rolled out one Availability Zone at a time.

11. Interview Gauntlet

Rapid Fire Questions

  1. Why use Erasure Coding over Replication? Replication (3x) wastes 200% storage. EC (10+4) only wastes 40% for higher durability. At Exabyte scale, this saves billions of dollars.
  2. How does S3 handle small files? Small files cause metadata bloat and disk fragmentation. S3 aggregates small objects into larger 100MB “containers” or “shards” before writing to disk.
  3. What happens if two users write the same key at the same time? Last Write Wins. The Metadata service serializes the commit requests. The one processed last overwrites the pointer.
  4. Is S3 a filesystem? No. It is a Key-Value store. It does not support rename (move) efficiently. Renaming a “folder” foo/ to bar/ requires rewriting every single object inside with the new key.

12. Interactive Decision Visualizer: Erasure Coding

See how Reed-Solomon encoding works. We split data and generate parity. You can “destroy” chunks and see if the data survives.


13. Summary

  • Erasure Coding: The key to 11 9s durability without the 200% overhead of triple replication.
  • Strong Consistency: Achieved via Paxos on the Metadata layer.
  • Multipart Upload: Essential for performance and reliability on large files.
  • Separation: Metadata scaling (LSM/NewSQL) is handled separately from Blob storage.