Git is fundamentally a Content-Addressable Filesystem.

What does that mean? It means that, at the storage layer, Git identifies content by what it contains, not by what the file is called. It takes the content, runs it through a hashing algorithm (SHA-1 by default), and uses the resulting hash as the key to store and retrieve that content. File names are recorded separately, in tree objects.

At its core, Git consists of a key-value store (the objects directory) containing four types of objects:

  1. Blob (Binary Large Object)
  2. Tree
  3. Commit
  4. Tag

1. The SHA-1 Hash

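Every object in Git is identified by a 40-character hexadecimal string, which is the SHA-1 hash of a small header followed by the object’s content. The header names the object type and gives the content length in bytes, and it ends with a NUL byte.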

The formula is:

Hash = SHA1("type length\0content")

Let’s see this in action.

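For example, for the 11-byte content Hello World, the bytes fed into SHA-1 (internal header + content) are:

blob 11\0Hello World

The 40-character hex digest of those bytes is the key under which the blob is stored.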

2. The Four Object Types

The Blob (Binary Large Object)

  • Stores: File content.
  • Does NOT Store: Filename, permissions, or creation date.
  • Analogy: The text inside a book page.

If you have two files named README.md and BACKUP.txt with the exact same content “Hello”, Git stores only one blob. This is automatic deduplication!

The Tree

  • Stores: Directory structure.
  • Contains: A list of pointers to blobs (files) and other trees (subdirectories).
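  • Metadata: It stores filenames and file modes (e.g., 100644 for normal files, 100755 for executable files, 040000 for directories). Git records only whether a file is executable, not full Unix permissions.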
  • Analogy: The Table of Contents.
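To make the tree layout concrete, here is a rough Python sketch of how a tree body can be serialized and hashed. The helper names hash_object and tree_body are assumptions made for this example. Each entry is the mode, a space, the name, a NUL byte, and then the 20 raw bytes of the child's ID. Inside the object the directory mode is written as 40000; tools such as git ls-tree display it as 040000.

```python
import hashlib


def hash_object(obj_type, body):
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) tuples into a tree object body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name; directories compare as if
        # their name ended in "/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        # "<mode> <name>\0" followed by the 20 raw ID bytes.
        body += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)
    return body


hello = hash_object("blob", b"Hello")
root = tree_body([
    ("100644", "README.md", hello),
    ("100644", "BACKUP.txt", hello),  # same blob ID, different name
])
print(hash_object("tree", root))
print(hash_object("tree", tree_body([])))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Both entries point at the same blob ID, so the names live only in the tree while the content is stored once.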

The Commit

  • Stores: A snapshot of the project at a specific time.
  • Contains:
    • Pointer to the top-level Tree.
    • Author & Committer information (name, email, timestamp).
    • Commit message.
    • Pointer(s) to Parent Commit(s).
  • Analogy: A specific edition of the book.
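A commit object is plain text built from the fields listed above. The following sketch assembles one and hashes it; the function names, the author identity, and the timestamp are all made up for illustration, and real commits may carry additional headers (for example a signature).

```python
import hashlib


def hash_object(obj_type, body):
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id, parent_ids, identity, timestamp, message):
    """Build the text of a commit object. All argument values are illustrative."""
    lines = [f"tree {tree_id}"]
    # No parent line for a root commit, two or more for a merge.
    lines += [f"parent {p}" for p in parent_ids]
    # Identity format: Name <email> <seconds since epoch> <timezone offset>
    who = f"{identity} {timestamp} +0000"
    lines.append(f"author {who}")
    lines.append(f"committer {who}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
ME = "A U Thor <author@example.com>"  # placeholder identity

first = hash_object("commit", commit_body(EMPTY_TREE, [], ME, 1700000000, "Initial commit"))
second_text = commit_body(EMPTY_TREE, [first], ME, 1700000060, "Second commit")
second = hash_object("commit", second_text)

print(second_text.decode("utf-8"))
print(first, second)
```

Note that the commit holds only IDs: one for the tree and one per parent. The file contents are reached by following those pointers.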

The Tag

  • Stores: A named reference to another object, usually a commit.
  • Types:
    • Lightweight: Just a ref pointing at a commit (like a branch that doesn’t move). No tag object is created.
    • Annotated: A full object containing a tagger name, date, message, and an optional GPG signature.

3. The Object Graph (DAG)

These objects link together to form a DAG (Directed Acyclic Graph).

  • Commits point to Trees (snapshot) and Parents (history).
  • Trees point to Blobs (files) and other Trees (subfolders).

Each new commit points back to its parent(s). Links only ever run from child to parent, which is why the graph can never contain a cycle.
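As an illustration of walking this graph, the sketch below models commits as a dictionary from commit ID to parent IDs and collects everything reachable from a given commit. The graph, the single-letter IDs, and the ancestors function are invented for the example; real IDs are 40-character hashes.

```python
# Hypothetical commit graph: each commit ID maps to its parent IDs.
parents = {
    "A": [],          # root commit
    "B": ["A"],
    "C": ["A"],       # a branch that forked from A
    "D": ["B", "C"],  # merge commit with two parents
}


def ancestors(commit, graph):
    """Return every commit reachable from `commit` by following parent links."""
    seen = []
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path (B and C both reach A)
        seen.append(current)
        stack.extend(graph[current])
    return seen


print(sorted(ancestors("D", parents)))  # ['A', 'B', 'C', 'D']
print(sorted(ancestors("B", parents)))  # ['A', 'B']
```

Because links only point toward parents, the walk from B never discovers C or D. Reachability in this sense is also how Git decides which objects are still in use.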

4. Packfiles & Delta Compression

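If you modify a large file 10 times, does Git store 10 full copies? Initially, yes. Git starts by storing every version as a full Loose Object. Each one is zlib-compressed on its own, but nothing is shared between versions.

However, this is inefficient. To solve this, Git runs a garbage collection process (git gc, either invoked by hand or triggered automatically by other commands) that packs these loose objects into Packfiles.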

How Packfiles Work

Git finds similar files and stores the differences (deltas) instead of full copies.

  • Base Object: The full version of the file (usually the newest one, for fast checkout).
  • Delta: Instructions to reconstruct another version from the base, expressed as byte ranges to copy from the base plus new bytes to insert. It is a binary delta, not a line-based diff.

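This is part of why a project whose full history adds up to many gigabytes of file versions can pack down to a fraction of that size in Git. The actual ratio depends on how similar the versions are.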
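The copy-and-insert idea behind deltas can be sketched in Python. This is a toy illustration only: it uses difflib from the standard library to find matching byte ranges and a list of tuples as the delta, whereas Git uses its own matching algorithm and a compact binary encoding. The function names make_delta and apply_delta are made up for the example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe `target` as copy-from-base and insert-literal instructions."""
    delta = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))  # literal new bytes
        # "delete" needs no instruction: those base bytes are just not copied.
    return delta


def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild the target by replaying the instructions against the base."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


v3 = b"alpha\nbeta\ngamma\ndelta\n"  # newest version, kept in full as the base
v2 = b"alpha\nbeta\ndelta\n"         # older version, stored as a delta against v3
delta = make_delta(v3, v2)
print(delta)
print(apply_delta(v3, delta) == v2)  # True
```

When two versions share most of their bytes, the delta consists mostly of short copy instructions, so it takes far less space than a second full copy.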

Loose objects (each zlib-compressed, but stored in full)

  • Version 1 (100KB)
  • Version 2 (101KB)
  • Version 3 (102KB)
  • Total: 303KB

After git gc: packfile (delta-compressed)

  • Version 3 (Base, 102KB)
  • Delta V3→V2 (1KB)
  • Delta V2→V1 (1KB)
  • Total: ~104KB

5. Why SHA-1? (Content Addressability)

By using hashes, Git guarantees data integrity. If a single bit in a file changes, its SHA-1 hash changes. This propagates up:

  1. Blob hash changes.
  2. Tree hash (which lists the blob) changes.
  3. Commit hash (which points to the tree) changes.

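Each commit also records the hashes of its parents, so the change carries on into every descendant commit. This is why you can’t change history without changing all subsequent commit hashes, which is exactly what happens during a rebase.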
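The propagation can be demonstrated with a short, self-contained sketch. It hashes a one-file project at each level (blob, tree, commit) before and after a one-character edit. The helper names, the file name, the identity, and the timestamp are assumptions chosen for the example and held fixed so that only the content varies.

```python
import hashlib


def hash_object(obj_type, body):
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def snapshot(file_content, parent_id=None):
    """Hash a one-file project: blob -> tree -> commit. Returns all three IDs."""
    blob_id = hash_object("blob", file_content)
    tree = b"100644 README.md\x00" + bytes.fromhex(blob_id)
    tree_id = hash_object("tree", tree)
    text = f"tree {tree_id}\n"
    if parent_id:
        text += f"parent {parent_id}\n"
    # Fixed, made-up identity and timestamp.
    text += "author A U Thor <author@example.com> 1700000000 +0000\n"
    text += "committer A U Thor <author@example.com> 1700000000 +0000\n\nsnapshot\n"
    return blob_id, tree_id, hash_object("commit", text.encode("utf-8"))


before = snapshot(b"Hello World\n")
after = snapshot(b"Hello World!\n")  # a one-character edit
print([a != b for a, b in zip(before, after)])  # [True, True, True]

# A child commit records its parent's ID, so the change reaches it too,
# even though the child's own blob and tree are untouched.
child_before = snapshot(b"next\n", parent_id=before[2])
child_after = snapshot(b"next\n", parent_id=after[2])
print(child_before[:2] == child_after[:2], child_before[2] != child_after[2])  # True True
```

The second print shows the key point: the child's files did not change, yet its commit ID did, purely because its parent's ID changed.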

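[!NOTE] Collision Attacks: SHA-1 is no longer collision-resistant in practice. In 2017, researchers at CWI Amsterdam and Google announced SHAttered, the first publicly demonstrated SHA-1 collision. However, Git includes hardening against this class of attack: it uses a SHA-1 implementation that detects the telltale artifacts of known collision techniques. Git is also slowly migrating to SHA-256.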

Summary

  • Blob: Content only.
  • Tree: Directory structure.
  • Commit: Snapshot + History.
  • Tag: A named pointer to an object (usually a commit).
  • Hash: SHA1(header + content).

Git is just a map of Key (Hash) -> Value (Object). Everything else is a convenience.