Deep Dive: Feed Ranking & Aggregation
1. Beyond the Follow Graph
In the previous chapters (Instagram/Twitter), we assumed the feed was simply “posts from people I follow, sorted by time.” Modern feeds (TikTok, Facebook, Instagram, LinkedIn) are different.
- Algorithmic Sorting: Not chronological. Best content first.
- Mixed Content: Friends + Groups + Ads + “You might like” (Recommendations).
This chapter explains how to build the Brain behind the feed.
2. The Aggregator Service
When you open the app, the Feed Aggregator Service is called. It orchestrates a parallel workflow to build your feed in < 200ms.
The Pipeline
Visualized in the architecture diagram below, and sketched in code after this list.
- Step 1: Retrieval (Candidate Generation): Fetch ~2,000 candidate posts (Fanout-on-Read).
- Step 2: Filtering & Deduplication: Check Bloom Filters (Seen Posts) and Safety.
- Step 3: ML Scoring: Assign a probability score (P_Like, P_Click) using feature vectors.
- Step 4: Re-Ranking: Apply Diversity Rules (e.g., “Don’t show 5 videos in a row”) and inject ads.
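To make the orchestration concrete, here is a minimal Python sketch of the four-step request path. Every service call here (`follow_posts`, `group_posts`, `discovery_posts`) is a hypothetical stub standing in for a real RPC, and the scoring line is a placeholder, not the ML model from Step 3.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Post:
    id: int
    author: str

# Hypothetical stubs standing in for real candidate sources (Redis, Group Service, Vector DB).
async def follow_posts(user_id):    return [Post(1, "alice"), Post(2, "bob")]
async def group_posts(user_id):     return [Post(3, "carol")]
async def discovery_posts(user_id): return [Post(2, "bob"), Post(4, "dave")]

async def build_feed(user_id, seen_ids, limit=20):
    # Step 1: Retrieval -- fan out to all candidate sources in parallel.
    lists = await asyncio.gather(
        follow_posts(user_id), group_posts(user_id), discovery_posts(user_id))
    candidates = {p.id: p for posts in lists for p in posts}.values()  # dedupe by id

    # Step 2: Filtering -- drop posts the user has already seen (a Bloom filter in production).
    fresh = [p for p in candidates if p.id not in seen_ids]

    # Step 3: Scoring -- placeholder; a real system calls the ML Scoring Engine here.
    scored = [(hash((user_id, p.id)) % 100 / 100, p) for p in fresh]

    # Step 4: Re-ranking -- sort by score; diversity rules and ad injection come after.
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored][:limit]

print(asyncio.run(build_feed("user_42", seen_ids={2})))
```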
3. Step 1: Retrieval (Candidate Generation)
The Aggregator calls multiple services in parallel (Fanout-on-Read):
- Follow Service: “Get recent posts from friends” (Redis).
- Group Service: “Get top posts from groups I’m in”.
- Ad Service: “Get relevant ads for User X”.
- Discovery Service (Vector Search):
- Embeddings: We convert Users and Posts into vectors (e.g., 128-float arrays) using models like Word2Vec or BERT.
- Vector DB: We use a Vector Database (e.g., Pinecone, Milvus, FAISS) to find posts that are “semantically similar” to posts the user has liked before.
- ANN (Approximate Nearest Neighbor): Finding the exact nearest neighbor among a billion vectors is too slow. We use HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) algorithms to find “good enough” matches in < 10ms, as sketched below.
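As a concrete example, here is what the Discovery Service’s ANN lookup might look like with FAISS’s HNSW index. The random vectors are stand-ins for real post and user embeddings; only the FAISS calls themselves are real API.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128  # matches the 128-float embeddings above
post_embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in post vectors

index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
index.hnsw.efSearch = 64              # higher = more accurate, slower queries
index.add(post_embeddings)            # build the navigable small-world graph

user_vector = np.random.rand(1, dim).astype("float32")  # stand-in user taste vector
distances, neighbor_ids = index.search(user_vector, 10) # approximate top-10, in milliseconds
print(neighbor_ids[0])  # row indices of the (approximately) nearest posts
```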
Vector Search Simulator
Click anywhere in the embedding space to find the top-3 nearest neighbors for your query, or click to add a new post to the database.
Result: A messy list of ~2,000 Post IDs.
4. Step 2: Filtering & Deduplication
We don’t want to show users posts they have already seen.
- Problem: Storing every Post ID a user has ever viewed is expensive.
- Solution: Bloom Filters.
- We store a Bloom Filter for each user containing IDs of posts seen in the last 7 days.
- Before passing candidates to the Ranker, we check the Bloom Filter.
- If `BloomFilter.contains(post_id)` returns true, we skip it. (False positives are possible, so we occasionally skip an unseen post, but a seen post never slips through.)
- See Bloom Filters Explained.
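To show the mechanics, here is a toy Bloom filter in pure Python. A production system would use a tuned library or Redis’s Bloom module; the bit-array size and hash count here are illustrative.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: num_hashes salted hashes over a fixed bit array."""

    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("post:123")
print(seen.contains("post:123"))  # True -- skip this post
print(seen.contains("post:999"))  # False (almost certainly) -- safe to show
```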
5. Step 3: Ranking (The Scoring Engine)
Visualized as Step 3: ML Scoring in the diagram. We now have ~2,000 filtered candidates and need to pick the top 20 to show first, so we pass them to the Scoring Engine.
Model Types
- LightGBM / XGBoost: Tree-based models. Fast and explainable. Good for “Click vs No Click”.
- Deep Neural Networks (Two-Tower): One tower for User Features, one for Item Features. Good for capturing complex non-linear relationships (see the sketch below).
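A minimal NumPy sketch of the two-tower idea: each tower projects its raw features into a shared embedding space, and the score is just a dot product. The single-layer towers and random weights here are stand-ins; real towers are deep networks trained on engagement logs.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(features, weights):
    # One "tower": project raw features into the shared embedding space.
    return np.tanh(features @ weights)

# Random stand-in weights; in production these are learned parameters.
W_user = rng.normal(size=(10, 32))  # 10 user features -> 32-dim embedding
W_item = rng.normal(size=(12, 32))  # 12 item features -> 32-dim embedding

user_emb = tower(rng.normal(size=10), W_user)
item_emb = tower(rng.normal(size=12), W_item)

score = float(user_emb @ item_emb)  # ranking score = similarity of the two embeddings
print(score)
```

The payoff of this architecture: item embeddings can be precomputed offline and indexed in the Vector DB, so serving reduces to one user-tower forward pass plus an ANN lookup.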
Feature Engineering
The Scoring Engine retrieves data from the Feature Store (User & Post Vectors) to build a Feature Vector:
| Feature Type | Examples |
|---|---|
| User Features | Age, Gender, Device, Past Click Rate. |
| Post Features | Media Type (Video/Image), Hashtags, Age of Post, Creator Quality. |
| Context Features | Time of Day, Day of Week, Network Speed. |
| Interaction Features | “Does User X follow Creator Y?”, “Did User X like Creator Y’s last 5 posts?” |
The Prediction
The model outputs a probability per action, e.g. P(Like) = 0.45, P(Share) = 0.05. These are typically blended into one ranking score via a weighted sum, with heavier actions (Shares, Comments) weighted above Likes and Clicks.
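For instance, a tree-based click model might be trained and served like this. The synthetic data and feature values are illustrative; only the LightGBM calls are real API.

```python
import numpy as np
from lightgbm import LGBMClassifier  # pip install lightgbm

rng = np.random.default_rng(0)

# Synthetic training data: one row per logged impression, label = clicked or not.
feature_names = ["user_age", "user_past_ctr", "post_age_hours",
                 "is_video", "hour_of_day", "follows_creator"]
X = rng.random((10_000, len(feature_names)))
y = (0.5 * X[:, 1] + 0.4 * X[:, 5] + 0.3 * rng.random(10_000) > 0.6).astype(int)

model = LGBMClassifier(n_estimators=50).fit(X, y)

# Serving time: assemble the candidate's feature vector from the Feature Store, then score.
candidate = np.array([[0.3, 0.8, 0.08, 1.0, 0.9, 1.0]])
p_click = model.predict_proba(candidate)[0, 1]  # probability of the "click" class
print(f"P(Click) = {p_click:.2f}")
```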
6. Step 4: Re-Ranking (Business Logic)
The raw scores might produce a feed of 20 videos from the same creator. This is bad UX, so we apply Diversity Rules (sketched in code after this list):
- Author Diversity: No more than 2 posts from the same author in a row.
- Content Diversity: Mix of Photos, Videos, and Text.
- Ad Injection: Insert an Ad every 5th slot.
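Here is a minimal sketch of that re-ranking pass, implementing the author-diversity rule and ad injection from the list above (the `max_run` and `ad_every` thresholds mirror the illustrative numbers there).

```python
def rerank(scored_posts, ads, max_run=2, ad_every=5):
    """scored_posts: list of (score, author, post_id), already sorted by score desc."""
    feed, run_author, run_len, deferred = [], None, 0, []

    for post in scored_posts:
        _, author, _ = post
        if author == run_author and run_len >= max_run:
            deferred.append(post)  # would violate "no more than 2 in a row" -- push down
            continue
        run_len = run_len + 1 if author == run_author else 1
        run_author = author
        feed.append(post)
    feed.extend(deferred)  # demoted posts land at the end

    # Ad injection: an ad occupies every 5th slot.
    with_ads, ad_iter = [], iter(ads)
    for post in feed:
        with_ads.append(post)
        if (len(with_ads) + 1) % ad_every == 0:
            ad = next(ad_iter, None)
            if ad is not None:
                with_ads.append(ad)
    return with_ads

posts = [(0.9, "alice", 1), (0.8, "alice", 2), (0.7, "alice", 3), (0.6, "bob", 4)]
print(rerank(posts, ads=[(0.0, "sponsor", 99)]))
# alice, alice, bob, alice, ad -- the third alice post was demoted; the ad takes slot 5
```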
Negative Feedback Loops
If a user clicks “Hide Post” or “Not Interested”, it is a critical signal.
- Immediate Action: Remove post from feed.
- Short Term: Downgrade the Author’s Score for this user.
- Long Term: Adjust the User’s Vector in the Feature Store.
- If they hid a “Political” post, move their preference vector away from the “Politics” cluster.
- This ensures the Retrieval Step (Step 1) fetches fewer political posts next time.
- Real-time updates: We use streaming pipelines (Kafka + Flink) to update these feature vectors instantly, not just in the nightly batch job.
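The long-term adjustment can be as simple as nudging the user’s embedding away from the centroid of the hidden post’s topic cluster. A minimal sketch (the learning rate and re-normalization are illustrative choices; in production this update runs inside the streaming job):

```python
import numpy as np

def apply_negative_feedback(user_vec, topic_centroid, lr=0.1):
    """Move the user's preference vector away from the cluster they hid."""
    updated = user_vec - lr * topic_centroid
    return updated / np.linalg.norm(updated)  # keep embeddings on the unit sphere

rng = np.random.default_rng(0)
user = rng.normal(size=128);     user /= np.linalg.norm(user)
politics = rng.normal(size=128); politics /= np.linalg.norm(politics)

before = float(user @ politics)
after = float(apply_negative_feedback(user, politics) @ politics)
print(f"similarity to 'Politics' cluster: {before:.3f} -> {after:.3f}")  # similarity drops
```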
7. The Algorithm (EdgeRank & Beyond)
Facebook’s original algorithm (EdgeRank) is a good mental model: a post’s score is the sum, over each interaction (“edge”), of Affinity × Weight × Time Decay.
- Affinity: How close are you to the creator? (Close friend = High, Stranger = Low).
- Weight: Cost of interaction. (Comment > Like > Click > View).
- Time Decay: `1 / (Time Since Posted)`. Old news is bad news.
See Caching 101 for how we cache these scores.
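Putting the three factors together as a small scoring function (the affinity and weight values are illustrative; the `+ 1` in the decay term just avoids dividing by zero for brand-new posts):

```python
def edgerank(affinity, weight, hours_since_post):
    """EdgeRank mental model: affinity * weight * time_decay, per interaction."""
    time_decay = 1 / (1 + hours_since_post)
    return affinity * weight * time_decay

# A comment (weight 4) from a close friend (affinity 0.9), posted 1 hour ago:
print(edgerank(0.9, 4, 1))   # 1.8
# The same interaction on a two-day-old post -- time decay dominates:
print(edgerank(0.9, 4, 48))  # ~0.07
```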
8. Architecture Diagram
Ranking & Aggregation Pipeline Architecture.
9. Interactive Decision Visualizer
The Ranking Calculator
Adjust the parameters to see how a post’s score changes. Notice how Time Decay destroys the score of old posts, even if they are from your best friend.
10. Observability: Monitoring the Brain
Since the Ranking Engine is a black box, we need specific metrics.
- Feature Freshness: How old is the user’s “Last Clicked” feature? If > 5 mins, the pipeline is broken.
- Score Distribution: Histogram of scores. If all scores are 0.001, the model is miscalibrated.
- CTR by Model Version: Compare `v1.2` vs `v1.3` in real time (A/B Testing).
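For example, the score-distribution check can be as simple as bucketing the model’s outputs and alerting when the histogram collapses (the bin count and threshold here are illustrative):

```python
import numpy as np

def check_score_distribution(scores, min_std=0.01):
    """Alert if the model's outputs collapse to a single value (miscalibration symptom)."""
    scores = np.asarray(scores, dtype=float)
    hist, _ = np.histogram(scores, bins=10, range=(0.0, 1.0))
    print("score histogram:", hist.tolist())
    if scores.std() < min_std:
        raise RuntimeError("score distribution collapsed -- check model calibration")

check_score_distribution(np.random.default_rng(0).beta(2, 5, size=1000))  # healthy spread
try:
    check_score_distribution([0.001] * 1000)  # every score identical
except RuntimeError as alert:
    print("ALERT:", alert)
```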
11. Connection to Other Modules
- Caching: We rely heavily on Redis. See Module 6: Caching.
- Vector Search: For discovery. See Module 15: Deep Dive Data.
- Fanout: The core distribution mechanism. See Twitter Timeline.