Git is fundamentally a Content-Addressable Filesystem.
What does that mean? It means Git's object store doesn't look content up by file name. It takes the content, runs it through a hashing algorithm (SHA-1 by default), and uses the resulting hash as the key to store and retrieve that content. File names are recorded separately, in tree objects.
At its core, Git consists of a key-value store (the objects directory) containing four types of objects:
- Blob (Binary Large Object)
- Tree
- Commit
- Tag
1. The SHA-1 Hash
Every object in Git is identified by a 40-character hexadecimal string, which is the SHA-1 hash of the object’s content plus a small header.
The formula is:
Hash = SHA1("type length\0content")
Let’s see this in action.
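Here is a minimal Python sketch of the formula above. The function name `git_object_id` is our own (it is not part of Git); the header layout follows the formula: the object type, a space, the content length in bytes, and a NUL byte.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to `content`."""
    # Header: object type, a space, the content length in bytes, then a NUL byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Should match: echo 'test content' | git hash-object --stdin
print(git_object_id("blob", b"test content\n"))
# → d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Note that the length is counted in bytes, not characters, and that the trailing newline added by `echo` is part of the content.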
2. The Four Object Types
The Blob (Binary Large Object)
- Stores: File content.
- Does NOT Store: Filename, permissions, or creation date.
- Analogy: The text inside a book page.
If you have two files named README.md and BACKUP.txt with the exact same content “Hello”, Git stores only one blob. This is automatic deduplication!
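To make the deduplication concrete, here is a toy in-memory store. `ObjectStore` and `put_blob` are hypothetical names for illustration, not Git's API; the point is only that identical content always maps to the same key.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self.objects = {}  # object ID (hex string) -> content bytes

    def put_blob(self, content: bytes) -> str:
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        oid = hashlib.sha1(header + content).hexdigest()
        # Same content -> same key, so the second write overwrites nothing new.
        self.objects[oid] = content
        return oid


store = ObjectStore()
readme_id = store.put_blob(b"Hello")  # content of README.md
backup_id = store.put_blob(b"Hello")  # content of BACKUP.txt
print(readme_id == backup_id, len(store.objects))  # True 1
```

The file names never enter the hash, which is why two differently named files can share one blob.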
The Tree
- Stores: Directory structure.
- Contains: A list of pointers to blobs (files) and other trees (subdirectories).
- Metadata: It stores filenames and permissions (e.g., 100644 for normal files, 040000 for directories).
- Analogy: The Table of Contents.
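As a rough sketch, a tree's body is a sequence of entries, each made of the mode, a space, the name, a NUL byte, and the 20 raw bytes of the referenced object's hash. The helper names below are our own, and the sorting is simplified: Git also compares directory names as if they ended in "/". Inside the object, the directory mode is written without the leading zero (40000).

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_object_id) tuples into a tree body."""
    body = b""
    # Simplification: sort by raw name bytes only.
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(oid)  # 20 raw bytes, not 40 hex characters
    return body


readme = git_object_id("blob", b"Hello")
src = git_object_id("tree", b"")  # an empty subdirectory, for illustration only
root = tree_body([("100644", "README.md", readme), ("40000", "src", src)])
print(git_object_id("tree", root))
```

The tree holds the names and modes; the blobs it points to hold only content.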
The Commit
- Stores: A snapshot of the project at a specific time.
- Contains:
- Pointer to the top-level Tree.
- Author & Committer information (name, email, timestamp).
- Commit message.
- Pointer(s) to Parent Commit(s).
- Analogy: A specific edition of the book.
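The fields listed above map onto a small text format. The sketch below builds a commit body by hand; the helper names and the identity string are made up for illustration, and real commits may carry extra headers (such as a signature) that are left out here.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message):
    """Serialize the commit fields: tree, parent(s), author, committer, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # hash of a tree with no entries
who = "Ada Example <ada@example.com> 1700000000 +0000"  # hypothetical identity

root = git_object_id("commit", commit_body(EMPTY_TREE, [], who, who, "Initial commit"))
child = git_object_id("commit", commit_body(EMPTY_TREE, [root], who, who, "Second commit"))
print(root)
print(child)
```

Because the timestamp is part of the body, two otherwise identical commits made at different times get different hashes.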
The Tag
- Stores: A reference to a specific commit (usually).
- Types:
- Lightweight: Just a pointer (like a branch that doesn’t move). It is a ref, not an object in the object store.
- Annotated: A full object containing a tagger name, date, message, and optionally a GPG signature.
3. The Object Graph (DAG)
These objects link together to form a DAG (Directed Acyclic Graph).
- Commits point to Trees (snapshot) and Parents (history).
- Trees point to Blobs (files) and other Trees (subfolders).
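To see the history part of the graph in miniature, the sketch below walks parent pointers from a merge commit back to the root. The commit IDs and the `parents` mapping are invented for illustration; real IDs would be 40-character hashes.

```python
from collections import deque

# Hypothetical history: commit ID -> list of parent IDs.
parents = {
    "a1": [],            # root commit
    "b2": ["a1"],
    "c3": ["a1"],
    "d4": ["b2", "c3"],  # merge commit with two parents
}


def ancestors(start, parents):
    """Return every commit reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            if parent not in seen:  # a commit can be reachable by several paths
                seen.add(parent)
                queue.append(parent)
    return order


print(ancestors("d4", parents))  # ['d4', 'b2', 'c3', 'a1']
```

Edges only ever point from a commit to older commits, so the walk always terminates: that is the "acyclic" in DAG.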
4. Packfiles & Delta Compression
If you modify a large file 10 times, does Git store 10 full copies? Initially, yes. Git starts by storing every version as a full Loose Object: each one is zlib-compressed on its own, but it contains the complete file rather than a diff.
However, this is inefficient. To solve this, Git runs a garbage collection process (git gc) that packs these loose objects into Packfiles.
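For reference, a loose object is simple on disk: the header plus content is zlib-compressed and written to a path derived from the hash (first two hex characters as the directory, the remaining 38 as the file name). The sketch below mimics that layout in a temporary directory; the function names are our own, and it does not touch a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    store = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]  # e.g. objects/d6/70460b...
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))  # the whole object, compressed
    return oid


def read_loose_object(objects_dir: Path, oid: str) -> bytes:
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    _header, _, content = raw.partition(b"\0")  # split at the first NUL byte
    return content


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    oid = write_loose_object(objects, "blob", b"test content\n")
    print(oid[:2] + "/" + oid[2:])
    print(read_loose_object(objects, oid))
```

Ten versions of a large file mean ten such files, each holding the full content, which is the inefficiency that packing addresses.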
How Packfiles Work
Git finds similar files and stores the differences (deltas) instead of full copies.
- Base Object: The full version of the file (usually the newest one, for fast checkout).
- Delta: Instructions to reconstruct another version from the base. These work on byte ranges rather than lines (e.g., “copy these bytes from the base, then insert ‘foo’”).
This is a big part of why a repository with a long history can stay compact. As an illustrative figure, a 10GB SVN repo might only be 500MB in Git!
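The base-plus-delta idea can be sketched in a few lines. This is a simplified illustration, not Git's binary delta encoding: the `("copy", offset, length)` and `("insert", data)` tuples are our own representation, and `difflib` stands in for Git's own matching algorithm.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe `target` as copies from `base` plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))        # reuse bytes from the base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))    # new bytes stored literally
        # "delete": bytes present only in the base are simply not copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
print(apply_delta(base, delta) == target)  # True
```

Only the inserted bytes need to be stored alongside the copy instructions, so a small edit to a large file costs very little extra space.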
5. Why SHA-1? (Content Addressability)
By using hashes, Git guarantees data integrity. If a single bit in a file changes, its SHA-1 hash changes. This propagates up:
- Blob hash changes.
- Tree hash (which lists the blob) changes.
- Commit hash (which points to the tree) changes.
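And because every commit also records the hash of its parent, the change keeps propagating forward: each descendant commit gets a new hash too. This is why you can’t change history without changing all subsequent commit hashes, which is exactly what happens when you rebase.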
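The chain reaction is easy to reproduce. The sketch below builds a one-file snapshot (blob, tree, commit) twice, changing a single byte in the first commit's file; `snapshot`, `file.txt`, and the identity string are hypothetical names used only for this illustration.

```python
import hashlib

WHO = "Ada Example <ada@example.com> 1700000000 +0000"  # hypothetical identity


def oid(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def snapshot(file_content, parent, message):
    """Return (blob, tree, commit) IDs for a one-file project."""
    blob = oid("blob", file_content)
    tree = oid("tree", b"100644 file.txt\0" + bytes.fromhex(blob))
    lines = [f"tree {tree}"] + ([f"parent {parent}"] if parent else [])
    lines += [f"author {WHO}", f"committer {WHO}", "", message, ""]
    commit = oid("commit", "\n".join(lines).encode("utf-8"))
    return blob, tree, commit


first = snapshot(b"hello\n", None, "first")
second = snapshot(b"hello world\n", first[2], "second")

# Rewrite history: change one byte in the first commit's file.
first_edited = snapshot(b"hellO\n", None, "first")
second_rebuilt = snapshot(b"hello world\n", first_edited[2], "second")

print(first[2] != first_edited[2])    # True: blob, tree, and commit all changed
print(second[1] == second_rebuilt[1]) # True: the second snapshot's tree is untouched
print(second[2] != second_rebuilt[2]) # True: its commit hash changed via the parent
```

The second commit's content is byte-for-byte identical in both histories; only its parent pointer differs, and that alone is enough to give it a new identity.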
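[!NOTE] Collision Attacks: SHA-1 is no longer considered collision-resistant. The SHAttered attack, announced by CWI Amsterdam and Google in 2017, produced a real collision rather than a purely theoretical one. However, Git includes hardening against this class of attack (it detects the telltale artifacts of known collision techniques). Git is also slowly migrating to SHA-256.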
Summary
- Blob: Content only.
- Tree: Directory structure.
- Commit: Snapshot + History.
- Hash: SHA1(header + content).
Git is just a map of Key (Hash) -> Value (Object). Everything else is a convenience.