Design Amazon S3 (Object Storage)
[!NOTE] This module explores the core principles of Design Amazon S3 (Object Storage), deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. What is Amazon S3?
Amazon S3 (Simple Storage Service) is an Object Storage service that offers industry-leading scalability, data availability, security, and performance. Unlike a file system (hierarchical, POSIX), S3 is a flat namespace where you store “Objects” (files) inside “Buckets” (containers).
Key Characteristics
- Scale: Exabytes of data. Trillions of objects.
- Durability: 11 9s (99.999999999%). You essentially never lose data.
- Availability: 99.99%.
- Performance: High throughput for large blobs.
Try it yourself: Upload a 10 GB file to S3 via the CLI. Notice that it finishes faster than a single TCP connection would allow? That’s Multipart Upload in action.
2. Requirements & Goals
Functional Requirements
- Bucket Operations: Create/Delete Bucket.
- Object Operations: Put, Get, Delete, List Objects.
- Versioning: Support multiple versions of an object.
- Large Files: Support files up to 5TB (via Multipart).
Non-Functional Requirements
- Durability: 11 9s. We must tolerate simultaneous disk/rack/DC failures.
- Availability: The system must always accept writes and reads.
- Scalability: Horizontal scaling for both storage and metadata.
- Consistency: Since 2020, S3 offers Strong Consistency. A successful `PUT` is immediately visible to a subsequent `GET`.
3. Capacity Estimation
Let’s design for a massive scale.
Storage
- Total Objects: 100 Billion.
- Avg Size: 1 MB.
- Total Data: 100 Billion * 1 MB = 100 Petabytes (PB).
- Growth: 10% month-over-month.
Throughput
- Read QPS: 100,000 QPS.
- Write QPS: 10,000 QPS.
- Bandwidth: If avg request is 1MB, 100k QPS = 100 GB/s outbound.
4. System APIs
S3 uses a RESTful API.
4.1 Bucket Operations
POST /my-bucket
DELETE /my-bucket
4.2 Object Operations
PUT /my-bucket/photo.jpg
Body: <binary_data>
GET /my-bucket/photo.jpg
Response: 200 OK, Body: <binary_data>
4.3 Multipart Upload
For files > 100MB.
- Initiate: `POST /bucket/file?uploads` → returns `UploadId`.
- Upload Part: `PUT /bucket/file?partNumber=1&uploadId=xyz`.
- Complete: `POST /bucket/file?uploadId=xyz` (merges parts).
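The three-step protocol can be modeled with a toy in-memory store (the `MultipartStore` class is a hypothetical sketch, not the real S3 implementation):

```python
import uuid

class MultipartStore:
    """Toy in-memory model of the initiate / upload-part / complete flow."""
    def __init__(self):
        self.pending = {}   # upload_id -> {part_number: bytes}
        self.objects = {}   # key -> assembled bytes

    def initiate(self, key):
        upload_id = uuid.uuid4().hex       # POST /bucket/file?uploads
        self.pending[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # PUT /bucket/file?partNumber=N&uploadId=xyz (parts may arrive out of order)
        self.pending[upload_id][part_number] = data

    def complete(self, key, upload_id):
        # POST /bucket/file?uploadId=xyz -- merge parts in part-number order
        parts = self.pending.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

store = MultipartStore()
uid = store.initiate("photo.jpg")
store.upload_part(uid, 2, b"world")     # parts can upload in any order
store.upload_part(uid, 1, b"hello ")
store.complete("photo.jpg", uid)
print(store.objects["photo.jpg"])       # b'hello world'
```

The key property: the object does not exist until `complete` commits it, so a crashed upload leaves no partial object visible.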
5. Database Design
We separate Metadata from Data.
Metadata Store (Key-Value)
Stores attributes: Name, Size, Owner, ACLs, Location (Pointer to Block Store).
- Key: `BucketName + ObjectName`.
- Value: JSON Metadata + List of Block IDs.
- Tech Choice: NewSQL (CockroachDB/Spanner) or Sharded KV (Cassandra/DynamoDB) with Paxos for Strong Consistency.
Block Store (Blob)
Stores the immutable bits.
- Filesystem: Custom lightweight FS (like Facebook Haystack) optimized for large sequential writes.
- Addressing: Addressed by `BlockID` (a UUID).
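The metadata/data split can be illustrated with a toy key-value layout (field names and the `blk-` ID format are illustrative assumptions, not S3's actual schema):

```python
# Toy separation of the Metadata Store and the Block Store.
metadata_store = {}   # "bucket/object" -> metadata dict (control plane)
block_store = {}      # block_id -> raw immutable bytes   (data plane)

def put_object(bucket, name, data, owner):
    block_id = f"blk-{hash(data) & 0xffffffff:08x}"  # stand-in for a UUID
    block_store[block_id] = data                     # 1) write the bits
    metadata_store[f"{bucket}/{name}"] = {           # 2) commit the pointer
        "size": len(data),
        "owner": owner,
        "blocks": [block_id],                        # list: large files span blocks
    }

def get_object(bucket, name):
    meta = metadata_store[f"{bucket}/{name}"]
    return b"".join(block_store[b] for b in meta["blocks"])

put_object("my-bucket", "photo.jpg", b"\x89PNG...", "alice")
print(get_object("my-bucket", "photo.jpg"))
```

A `GET` is thus two lookups: one metadata read to find the block list, then direct reads from the block store.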
6. High-Level Design
Architecture separating Metadata and Data planes.
- Client sends `PUT /bucket/file.jpg`.
- API Node authenticates the request.
- Metadata Service checks that the bucket exists and authorizes the user.
- Placement Service allocates a `BlockID` and determines which Storage Nodes to write to.
- API Node streams data to Storage Nodes (using Erasure Coding).
- Once data is durable (written to a quorum), Metadata Service commits the object (maps `file.jpg` → `BlockID`).
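The write path above hinges on one ordering rule: data must be durable on a quorum of storage nodes before the metadata commit makes the object visible. A sketch (node count and quorum size are assumptions for illustration):

```python
# Sketch of the commit ordering: data first, metadata last.
STORAGE_NODES = [dict() for _ in range(5)]  # 5 toy storage nodes
WRITE_QUORUM = 3
committed = {}  # metadata: object key -> block_id

def put(key, block_id, data, healthy_nodes):
    acks = 0
    for i in healthy_nodes:                 # stream data to storage nodes
        STORAGE_NODES[i][block_id] = data
        acks += 1
    if acks < WRITE_QUORUM:                 # not durable: abort, never commit
        raise IOError("write quorum not reached; object not committed")
    committed[key] = block_id               # durable: commit metadata pointer

put("file.jpg", "blk-1", b"bytes", healthy_nodes=[0, 1, 2, 3])
print("file.jpg" in committed)   # True: quorum reached, object visible

try:
    put("other.jpg", "blk-2", b"bytes", healthy_nodes=[0])
except IOError:
    print("other.jpg" in committed)  # False: no partial object ever visible
```

Because the metadata commit is the last step, readers can never observe an object whose data is not yet durable.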
7. Component Design (Deep Dive)
11 9s Durability: Erasure Coding
Storing 3 copies of 100 PB means storing 300 PB. That is too expensive ($).
- Replication: 200% overhead (3 copies). Safe but wasteful.
- Erasure Coding (EC): Breaks data into `N` data chunks and `K` parity chunks.
- Reed-Solomon (10, 4): Split the file into 10 data parts and compute 4 parity parts.
- Overhead: Only 40% (vs 200%).
- Durability: Can lose ANY 4 drives and still recover.
- Trade-off: High CPU usage for calculation, but storage savings are worth it.
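Reed-Solomon arithmetic over GF(2^8) is too involved for a few lines, but the simplest erasure code — N data chunks plus a single XOR parity chunk, tolerating any one loss — demonstrates the principle:

```python
# Simplest erasure code: N data chunks + 1 XOR parity chunk.
# (Reed-Solomon generalizes this to K parity chunks, tolerating K losses.)
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)

data_chunks = [b"AAAA", b"BBBB", b"CCCC"]   # N = 3 data chunks
parity = xor_chunks(data_chunks)            # K = 1 parity chunk (33% overhead)

# "Destroy" chunk 1, then rebuild it from the survivors plus parity:
survivors = [data_chunks[0], data_chunks[2], parity]
rebuilt = xor_chunks(survivors)
print(rebuilt == b"BBBB")  # True: any single lost chunk is recoverable
```

The recovery works because XOR is its own inverse: `A ^ B ^ C = P` implies `A ^ C ^ P = B`. Reed-Solomon (10, 4) applies the same idea with richer algebra so that any 4 of the 14 chunks can be lost.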
Strong Consistency (The 2020 Shift)
For years, S3 was Eventually Consistent (overwrite a file, you might see the old one). In 2020, they switched to Strong Consistency.
- How?: The Metadata layer now uses a Distributed Consensus Algorithm (likely variants of Paxos or Raft) for every single write.
- Why now?: Hardware got faster. Network latency dropped. CPU is cheaper. The overhead of consensus is now negligible compared to the network transfer time of the data blob.
- Cache Coherency: They also implemented a system to actively invalidate caches across the fleet immediately upon commit.
Multipart Upload
Uploading a 5GB file in one stream is risky. If it fails at 99%, you retry from zero.
- Parallelism: Break file into 50 chunks of 100MB. Upload them in parallel.
- Resiliency: If chunk 45 fails, retry only chunk 45.
- Throughput: Maximize bandwidth by saturating multiple TCP connections.
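The three properties above can be sketched from the client side with stdlib `concurrent.futures`; the `upload_part` function here is a stand-in that fails once, purely to exercise the per-chunk retry path:

```python
import concurrent.futures

CHUNK = 4       # toy chunk size in bytes (real multipart parts are >= 5 MB)
failures = {2}  # simulate: chunk index 2 fails on its first attempt

def upload_part(index, data):
    if index in failures:              # transient fault, first attempt only
        failures.discard(index)
        raise ConnectionError(f"chunk {index} dropped")
    return index, data

def upload(blob, retries=3):
    chunks = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
    results = {}
    pending = list(enumerate(chunks))
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(retries):
            if not pending:
                break
            futures = {pool.submit(upload_part, i, c): (i, c) for i, c in pending}
            pending = []                   # re-collect only the failed chunks
            for fut, (i, c) in futures.items():
                try:
                    idx, data = fut.result()
                    results[idx] = data
                except ConnectionError:
                    pending.append((i, c))  # retry just this chunk
    return b"".join(results[i] for i in sorted(results))

assembled = upload(b"hello multipart world!")
print(assembled == b"hello multipart world!")  # True: chunk 2 retried alone
```

Only the failed chunk is resubmitted; the other chunks' work is never repeated.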
8. Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| 11 9s Durability | Erasure Coding (10+4) | Tolerates the simultaneous loss of any 4 of the 14 chunks (disks or whole zones) with only 40% storage overhead. |
| Scalability | Separated Control/Data Plane | Metadata scales independently of Storage. Data path bypasses metadata bottleneck. |
| Cost | Tiered Storage (Glacier) | Move cold objects to cheaper, slower media (Tape/HDD) automatically. |
| Performance | Multipart Upload | Parallelizes writes to maximize throughput and fault tolerance. |
| Consistency | Consensus (Paxos) | Ensures Metadata updates are atomic and strongly consistent. |
9. Observability & Metrics
Key Metrics
- Durability: Checksums. Background scrubbers constantly read data to verify integrity.
- Availability: Error Rate (5xx).
- Latency: Time to First Byte (TTFB).
- Storage Efficiency: (Used Space / Raw Space). Monitor overhead of Erasure Coding.
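The background scrubbing mentioned above can be sketched with stdlib `hashlib`: record a checksum at write time, then periodically re-read each block and compare (a toy model, not S3's actual scrubber):

```python
import hashlib

blocks = {}  # block_id -> (data, sha256 hex digest recorded at write time)

def write_block(block_id, data):
    blocks[block_id] = (data, hashlib.sha256(data).hexdigest())

def scrub():
    """Re-read every block and flag any whose bytes no longer match."""
    corrupted = []
    for block_id, (data, digest) in blocks.items():
        if hashlib.sha256(data).hexdigest() != digest:
            corrupted.append(block_id)  # would trigger EC reconstruction
    return corrupted

write_block("blk-1", b"good bytes")
write_block("blk-2", b"good bytes")
# Simulate silent bit rot on disk (data changes, recorded digest does not):
blocks["blk-2"] = (b"bad bytes!", blocks["blk-2"][1])
print(scrub())  # ['blk-2']
```

Scrubbing is what converts the theoretical durability of erasure coding into actual durability: corruption must be detected before too many chunks of the same object rot away.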
10. Deployment Strategy
Immutable Infrastructure
We never patch storage nodes. We replace them.
- Data Migration: When a disk is being retired, the system treats it as “failed” and reconstructs its data onto a new node using Erasure Coding.
- Zone Deployment: Updates are rolled out one Availability Zone at a time.
11. Interview Gauntlet
Rapid Fire Questions
- Why use Erasure Coding over Replication? Replication (3x) wastes 200% storage. EC (10+4) only wastes 40% for higher durability. At Exabyte scale, this saves billions of dollars.
- How does S3 handle small files? Small files cause metadata bloat and disk fragmentation. S3 aggregates small objects into larger 100MB “containers” or “shards” before writing to disk.
- What happens if two users write the same key at the same time? Last Write Wins. The Metadata service serializes the commit requests. The one processed last overwrites the pointer.
- Is S3 a filesystem? No. It is a Key-Value store. It does not support `rename` (move) efficiently. Renaming a “folder” `foo/` to `bar/` requires rewriting every single object inside with the new key.
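The folder-rename cost is easy to see in a sketch: with a flat key space, “renaming” `foo/` to `bar/` is a copy-plus-delete per object (a toy dict standing in for the bucket):

```python
# Flat key space: "folders" are just shared key prefixes.
bucket = {f"foo/img{i}.jpg": b"..." for i in range(3)}

def rename_prefix(bucket, old, new):
    """O(number of objects): every key under the prefix must be rewritten."""
    ops = 0
    for key in [k for k in bucket if k.startswith(old)]:
        bucket[new + key[len(old):]] = bucket.pop(key)  # copy + delete
        ops += 1
    return ops

ops = rename_prefix(bucket, "foo/", "bar/")
print(ops)  # 3: one rewrite per object, not one cheap directory update
```

A POSIX filesystem makes the same rename a single inode pointer update; the flat namespace trades that away for horizontal scalability.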
12. Summary
- Erasure Coding: The key to 11 9s durability without 300% storage cost.
- Strong Consistency: Achieved via Paxos on the Metadata layer.
- Multipart Upload: Essential for performance and reliability on large files.
- Separation: Metadata scaling (LSM/NewSQL) is handled separately from Blob storage.