Deep Dive: Feed Ranking & Aggregation
1. Beyond the Follow Graph
In the previous chapters (Instagram/Twitter), we assumed the feed was simply “posts from people I follow, sorted by time.” Modern feeds (TikTok, Facebook, Instagram, LinkedIn) are different.
- Algorithmic Sorting: Not chronological. Best content first.
- Mixed Content: Friends + Groups + Ads + “You might like” (Recommendations).
This chapter explains how to build the Brain behind the feed.
2. The Aggregator Service
When you open the app, the Feed Aggregator Service is called. It orchestrates a parallel workflow to build your feed in < 200ms.
The Pipeline
Visualized in the architecture diagram below, and sketched in code after this list.
- Step 1: Retrieval (Candidate Generation): Fetch ~2,000 candidate posts (Fanout-on-Read).
- Step 2: Filtering & Deduplication: Check Bloom Filters (Seen Posts) and Safety.
- Step 3: ML Scoring: Assign a probability score (P_Like, P_Click) using feature vectors.
- Step 4: Re-Ranking: Apply Diversity Rules (e.g., “Don’t show 5 videos in a row”) and inject ads.
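To make the orchestration concrete, here is a minimal Python sketch of the four-step request path. Every service call here (`follow_posts`, `group_posts`, `discovery_posts`) is a hypothetical stub standing in for a real RPC, and the scoring line is a placeholder, not the ML model from Step 3.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Post:
    id: int
    author: str

# Hypothetical stubs standing in for real candidate sources (Redis, Group Service, Vector DB).
async def follow_posts(user_id):    return [Post(1, "alice"), Post(2, "bob")]
async def group_posts(user_id):     return [Post(3, "carol")]
async def discovery_posts(user_id): return [Post(2, "bob"), Post(4, "dave")]

async def build_feed(user_id, seen_ids, limit=20):
    # Step 1: Retrieval -- fan out to all candidate sources in parallel.
    lists = await asyncio.gather(
        follow_posts(user_id), group_posts(user_id), discovery_posts(user_id))
    candidates = {p.id: p for posts in lists for p in posts}.values()  # dedupe by id

    # Step 2: Filtering -- drop posts the user has already seen (a Bloom filter in production).
    fresh = [p for p in candidates if p.id not in seen_ids]

    # Step 3: Scoring -- placeholder; a real system calls the ML Scoring Engine here.
    scored = [(hash((user_id, p.id)) % 100 / 100, p) for p in fresh]

    # Step 4: Re-ranking -- sort by score; diversity rules and ad injection come after.
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored][:limit]

print(asyncio.run(build_feed("user_42", seen_ids={2})))
```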
3. Step 1: Retrieval (Candidate Generation)
The Aggregator calls multiple services in parallel (Fanout-on-Read):
- Follow Service: “Get recent posts from friends” (Redis).
- Group Service: “Get top posts from groups I’m in”.
- Ad Service: “Get relevant ads for User X”.
- Discovery Service (Vector Search):
- Embeddings: We convert Users and Posts into vectors (e.g., 128-float arrays) using models like Word2Vec or BERT.
- Vector DB: We use a Vector Database (e.g., Pinecone, Milvus, FAISS) to find posts that are “semantically similar” to posts the user has liked before.
- ANN (Approximate Nearest Neighbor): Finding the exact nearest neighbor among a billion vectors is too slow. We use HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) algorithms to find “good enough” matches in < 10ms, as sketched below.
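As a concrete example, here is what the Discovery Service’s ANN lookup might look like with FAISS’s HNSW index. The random vectors are stand-ins for real post and user embeddings; only the FAISS calls themselves are real API.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128  # matches the 128-float embeddings above
post_embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in post vectors

index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
index.hnsw.efSearch = 64              # higher = more accurate, slower queries
index.add(post_embeddings)            # build the navigable small-world graph

user_vector = np.random.rand(1, dim).astype("float32")  # stand-in user taste vector
distances, neighbor_ids = index.search(user_vector, 10) # approximate top-10, in milliseconds
print(neighbor_ids[0])  # row indices of the (approximately) nearest posts
```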
Vector Search Simulator
Click anywhere in the embedding space to find the top-3 nearest neighbors for your query, or click to add a new post to the database.
Result: A messy list of ~2,000 Post IDs.
4. Step 2: Filtering & Deduplication
We don’t want to show users posts they have already seen.
- Problem: Storing every Post ID a user has ever viewed is expensive.
- Solution: Bloom Filters.
- We store a Bloom Filter for each user containing IDs of posts seen in the last 7 days.
- Before passing candidates to the Ranker, we check the Bloom Filter.
- If `BloomFilter.contains(post_id)` returns true, we skip it. (False positives are possible, so we occasionally skip an unseen post, but a seen post never slips through.)
- See Bloom Filters Explained.
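To show the mechanics, here is a toy Bloom filter in pure Python. A production system would use a tuned library or Redis’s Bloom module; the bit-array size and hash count here are illustrative.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: num_hashes salted hashes over a fixed bit array."""

    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("post:123")
print(seen.contains("post:123"))  # True -- skip this post
print(seen.contains("post:999"))  # False (almost certainly) -- safe to show
```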
5. Step 3: Ranking (The Scoring Engine)
Visualized as Step 3: ML Scoring in the diagram. We now have ~2,000 filtered candidates and need to pick the top 20 to show first, so we pass them to the Scoring Engine.
Model Types
- LightGBM / XGBoost: Tree-based models. Fast and explainable. Good for “Click vs No Click”.
- Deep Neural Networks (Two-Tower): One tower for User Features, one for Item Features. Good for capturing complex non-linear relationships (see the sketch below).
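A minimal NumPy sketch of the two-tower idea: each tower projects its raw features into a shared embedding space, and the score is just a dot product. The single-layer towers and random weights here are stand-ins; real towers are deep networks trained on engagement logs.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(features, weights):
    # One "tower": project raw features into the shared embedding space.
    return np.tanh(features @ weights)

# Random stand-in weights; in production these are learned parameters.
W_user = rng.normal(size=(10, 32))  # 10 user features -> 32-dim embedding
W_item = rng.normal(size=(12, 32))  # 12 item features -> 32-dim embedding

user_emb = tower(rng.normal(size=10), W_user)
item_emb = tower(rng.normal(size=12), W_item)

score = float(user_emb @ item_emb)  # ranking score = similarity of the two embeddings
print(score)
```

The payoff of this architecture: item embeddings can be precomputed offline and indexed in the Vector DB, so serving reduces to one user-tower forward pass plus an ANN lookup.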
Feature Engineering
The Scoring Engine retrieves data from the Feature Store (User & Post Vectors) to build a Feature Vector:
| Feature Type | Examples |
|---|---|
| User Features | Age, Gender, Device, Past Click Rate. |
| Post Features | Media Type (Video/Image), Hashtags, Age of Post, Creator Quality. |
| Context Features | Time of Day, Day of Week, Network Speed. |
| Interaction Features | “Does User X follow Creator Y?”, “Did User X like Creator Y’s last 5 posts?” |
The Prediction
The model outputs a probability per action, e.g. P(Like) = 0.45, P(Share) = 0.05. These are typically blended into one ranking score via a weighted sum, with heavier actions (Shares, Comments) weighted above Likes and Clicks.
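For instance, a tree-based click model might be trained and served like this. The synthetic data and feature values are illustrative; only the LightGBM calls are real API.

```python
import numpy as np
from lightgbm import LGBMClassifier  # pip install lightgbm

rng = np.random.default_rng(0)

# Synthetic training data: one row per logged impression, label = clicked or not.
feature_names = ["user_age", "user_past_ctr", "post_age_hours",
                 "is_video", "hour_of_day", "follows_creator"]
X = rng.random((10_000, len(feature_names)))
y = (0.5 * X[:, 1] + 0.4 * X[:, 5] + 0.3 * rng.random(10_000) > 0.6).astype(int)

model = LGBMClassifier(n_estimators=50).fit(X, y)

# Serving time: assemble the candidate's feature vector from the Feature Store, then score.
candidate = np.array([[0.3, 0.8, 0.08, 1.0, 0.9, 1.0]])
p_click = model.predict_proba(candidate)[0, 1]  # probability of the "click" class
print(f"P(Click) = {p_click:.2f}")
```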
6. Step 4: Re-Ranking (Business Logic)
The raw scores might produce a feed of 20 videos from the same creator. This is bad UX, so we apply Diversity Rules (sketched in code after this list):
- Author Diversity: No more than 2 posts from the same author in a row.
- Content Diversity: Mix of Photos, Videos, and Text.
- Ad Injection: Insert an Ad every 5th slot.
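Here is a minimal sketch of that re-ranking pass, implementing the author-diversity rule and ad injection from the list above (the `max_run` and `ad_every` thresholds mirror the illustrative numbers there).

```python
def rerank(scored_posts, ads, max_run=2, ad_every=5):
    """scored_posts: list of (score, author, post_id), already sorted by score desc."""
    feed, run_author, run_len, deferred = [], None, 0, []

    for post in scored_posts:
        _, author, _ = post
        if author == run_author and run_len >= max_run:
            deferred.append(post)  # would violate "no more than 2 in a row" -- push down
            continue
        run_len = run_len + 1 if author == run_author else 1
        run_author = author
        feed.append(post)
    feed.extend(deferred)  # demoted posts land at the end

    # Ad injection: an ad occupies every 5th slot.
    with_ads, ad_iter = [], iter(ads)
    for post in feed:
        with_ads.append(post)
        if (len(with_ads) + 1) % ad_every == 0:
            ad = next(ad_iter, None)
            if ad is not None:
                with_ads.append(ad)
    return with_ads

posts = [(0.9, "alice", 1), (0.8, "alice", 2), (0.7, "alice", 3), (0.6, "bob", 4)]
print(rerank(posts, ads=[(0.0, "sponsor", 99)]))
# alice, alice, bob, alice, ad -- the third alice post was demoted; the ad takes slot 5
```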
Negative Feedback Loops
If a user clicks “Hide Post” or “Not Interested”, it is a critical signal.
- Immediate Action: Remove post from feed.
- Short Term: Downgrade the Author’s Score for this user.
- Long Term: Adjust the User’s Vector in the Feature Store.
- If they hid a “Political” post, move their preference vector away from the “Politics” cluster.
- This ensures the Retrieval Step (Step 1) fetches fewer political posts next time.
- Real-time updates: We use streaming pipelines (Kafka + Flink) to update these feature vectors instantly, not just in the nightly batch job.
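The long-term adjustment can be as simple as nudging the user’s embedding away from the centroid of the hidden post’s topic cluster. A minimal sketch (the learning rate and re-normalization are illustrative choices; in production this update runs inside the streaming job):

```python
import numpy as np

def apply_negative_feedback(user_vec, topic_centroid, lr=0.1):
    """Move the user's preference vector away from the cluster they hid."""
    updated = user_vec - lr * topic_centroid
    return updated / np.linalg.norm(updated)  # keep embeddings on the unit sphere

rng = np.random.default_rng(0)
user = rng.normal(size=128);     user /= np.linalg.norm(user)
politics = rng.normal(size=128); politics /= np.linalg.norm(politics)

before = float(user @ politics)
after = float(apply_negative_feedback(user, politics) @ politics)
print(f"similarity to 'Politics' cluster: {before:.3f} -> {after:.3f}")  # similarity drops
```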
7. The Algorithm (EdgeRank & Beyond)
Facebook’s original algorithm (EdgeRank) is a good mental model: a post’s score is the sum, over each interaction (“edge”), of Affinity × Weight × Time Decay.
- Affinity: How close are you to the creator? (Close friend = High, Stranger = Low).
- Weight: Cost of interaction. (Comment > Like > Click > View).
- Time Decay: `1 / (Time Since Posted)`. Old news is bad news.
See Caching 101 for how we cache these scores.
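Putting the three factors together as a small scoring function (the affinity and weight values are illustrative; the `+ 1` in the decay term just avoids dividing by zero for brand-new posts):

```python
def edgerank(affinity, weight, hours_since_post):
    """EdgeRank mental model: affinity * weight * time_decay, per interaction."""
    time_decay = 1 / (1 + hours_since_post)
    return affinity * weight * time_decay

# A comment (weight 4) from a close friend (affinity 0.9), posted 1 hour ago:
print(edgerank(0.9, 4, 1))   # 1.8
# The same interaction on a two-day-old post -- time decay dominates:
print(edgerank(0.9, 4, 48))  # ~0.07
```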
8. Architecture Diagram
Ranking & Aggregation Pipeline Architecture.
9. Interactive Decision Visualizer
The Ranking Calculator
Adjust the parameters to see how a post’s score changes. Notice how Time Decay destroys the score of old posts, even if they are from your best friend.
10. Observability: Monitoring the Brain
Since the Ranking Engine is a black box, we need specific metrics.
- Feature Freshness: How old is the user’s “Last Clicked” feature? If > 5 mins, the pipeline is broken.
- Score Distribution: Histogram of scores. If all scores are 0.001, the model is miscalibrated.
- CTR by Model Version: Compare `v1.2` vs `v1.3` in real time (A/B Testing).
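For example, the score-distribution check can be as simple as bucketing the model’s outputs and alerting when the histogram collapses (the bin count and threshold here are illustrative):

```python
import numpy as np

def check_score_distribution(scores, min_std=0.01):
    """Alert if the model's outputs collapse to a single value (miscalibration symptom)."""
    scores = np.asarray(scores, dtype=float)
    hist, _ = np.histogram(scores, bins=10, range=(0.0, 1.0))
    print("score histogram:", hist.tolist())
    if scores.std() < min_std:
        raise RuntimeError("score distribution collapsed -- check model calibration")

check_score_distribution(np.random.default_rng(0).beta(2, 5, size=1000))  # healthy spread
try:
    check_score_distribution([0.001] * 1000)  # every score identical
except RuntimeError as alert:
    print("ALERT:", alert)
```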
11. Connection to Other Modules
- Caching: We rely heavily on Redis. See Module 6: Caching.
- Vector Search: For discovery. See Module 15: Deep Dive Data.
- Fanout: The core distribution mechanism. See Twitter Timeline.