Deep Dive: Feed Ranking & Aggregation
[!IMPORTANT] In this lesson, you will master:
- Beyond the Follow Graph: why modern feeds are algorithmic, not chronological.
- The Aggregator Service: the pipeline that retrieves, filters, scores, and re-ranks posts.
- Retrieval (Candidate Generation): fanning out to multiple sources, including vector search.
1. Beyond the Follow Graph
In the previous chapters (Instagram/Twitter), we assumed the feed was simply “posts from people I follow, sorted by time.” Modern feeds (TikTok, Facebook, Instagram, LinkedIn) are different.
- Algorithmic Sorting: Not chronological. Best content first.
- Mixed Content: Friends + Groups + Ads + “You might like” (Recommendations).
This chapter explains how to build the Brain behind the feed.
2. The Aggregator Service
When you open the app, the Feed Aggregator Service is called. It orchestrates a parallel workflow to build your feed in < 200ms.
[!TIP] Analogy: The Head Chef Think of the Aggregator Service as a Head Chef in a massive restaurant. The Chef doesn’t cook everything themselves; they fan out orders to various specialized stations (Grill, Fryer, Salad). Once all the dishes (candidate posts) come back, the Chef filters out the bad ones (deduplication), plates them beautifully (scoring), and arranges them logically for the customer (re-ranking).
The Pipeline
Visualized in the architecture diagram below.
- Step 1: Retrieval (Candidate Generation): Fetch 2,000 potential posts (Fanout).
- Step 2: Filtering & Deduplication: Check Bloom Filters (Seen Posts) and Safety.
- Step 3: ML Scoring: Assign a probability score (P_Like, P_Click) using feature vectors.
- Step 4: Re-Ranking: Apply Diversity Rules (e.g., “Don’t show 5 videos in a row”) and inject ads.
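The four steps above can be sketched end-to-end in a single process. This is a minimal illustration, not a real implementation: the candidate sources, score function, and page size are stand-ins for RPC services and ML models.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical candidate sources; real systems call these services over RPC.
def follow_candidates(user_id):    return [f"follow_{i}" for i in range(3)]
def group_candidates(user_id):     return [f"group_{i}" for i in range(3)]
def discovery_candidates(user_id): return [f"disc_{i}" for i in range(3)]

def build_feed(user_id, seen, score_fn, page_size=2):
    # Step 1: Retrieval -- fan out to all sources in parallel.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(src, user_id)
                   for src in (follow_candidates, group_candidates, discovery_candidates)]
        candidates = [post for fut in futures for post in fut.result()]
    # Step 2: Filtering -- drop posts the user has already seen.
    candidates = [p for p in candidates if p not in seen]
    # Step 3: Scoring -- rank by predicted engagement (toy score function here).
    ranked = sorted(candidates, key=score_fn, reverse=True)
    # Step 4: Re-ranking -- trim to a page (diversity rules omitted in this sketch).
    return ranked[:page_size]

feed = build_feed("user_42", seen={"follow_0"}, score_fn=len)
```

The fan-out/gather shape is the important part: each retrieval source runs concurrently, so total latency is the slowest source, not the sum of all of them.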
3. Step 1: Retrieval (Candidate Generation)
The Aggregator calls multiple services in parallel (Fanout-on-Read):
- Follow Service: “Get recent posts from friends” (Redis).
- Group Service: “Get top posts from groups I’m in”.
- Ad Service: “Get relevant ads for User X”.
- Discovery Service (Vector Search):
- Embeddings: We convert Users and Posts into vectors (e.g., 128-float arrays) using models like Word2Vec or BERT.
- Vector DB: We use a Vector Database (e.g., Pinecone, Milvus, FAISS) to find posts that are “semantically similar” to posts the user has liked before.
- ANN (Approximate Nearest Neighbor): Finding the exact nearest neighbor in a billion vectors is too slow. We use HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) algorithms to find “good enough” matches in < 10ms.
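Retrieval from the Discovery Service boils down to a nearest-neighbor query over embeddings. The sketch below uses exact brute-force cosine similarity in pure Python to show the interface; a production system would swap an ANN index (HNSW or IVF, via FAISS or Milvus) in behind the same `top_k` signature. The post IDs and 2-D vectors are made up for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=3):
    # Brute-force nearest neighbors: O(n) scan. ANN indexes (HNSW/IVF)
    # replace this scan with a sub-linear approximate search.
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [post_id for post_id, _ in scored[:k]]

# Toy 2-D embeddings (real ones are ~128 floats).
index = {
    "post_cats":   [0.9, 0.1],
    "post_dogs":   [0.8, 0.2],
    "post_stocks": [0.1, 0.9],
}
user_vector = [1.0, 0.0]  # this user's tastes lean toward the "animal" cluster
nearest = top_k(user_vector, index, k=2)  # → ['post_cats', 'post_dogs']
```

Note the trade-off the ANN algorithms make: the brute-force scan is exact but linear in the corpus size; HNSW returns "good enough" neighbors in roughly logarithmic time, which is what makes the < 10ms budget achievable over billions of vectors.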
Vector Search Simulator
[!TIP] Try it yourself: Click anywhere to query the database. The Blue Dot is your query. The Green Dots are the Top 3 Nearest Neighbors (Semantically similar posts). Click repeatedly to add new data points.
Result: A messy list of ~2,000 Post IDs.
4. Step 2: Filtering & Deduplication
We don’t want to show users posts they have already seen.
- Problem: Storing every Post ID a user has ever viewed is expensive, and querying a large database table for every candidate post on every feed load would cause catastrophic I/O bottlenecks.
- Solution: Bloom Filters.
- A Bloom Filter is a bit-array. When a user sees a post, we hash the `post_id` (using 3 different hash functions) and set the corresponding bits to 1.
- Check: To check if a user has seen `post_123`, we hash it. If all bits are 1, we assume they have seen it.
- False Positives: Possible (we might hide a post they haven’t seen), but rare (1%).
- False Negatives: Impossible (we never show a post they have seen).
- Hardware Reality & Storage: A Bloom Filter for 1 million items takes only ~1MB. Because it is a simple bit-array, the entire structure fits comfortably inside the CPU’s L3 Cache (and often L1/L2), making bitwise operations practically instantaneous (nanosecond latency). This eliminates the need for slow Disk I/O or expensive network calls to a database, dramatically speeding up the deduplication phase.
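The check-and-set logic above fits in a few lines. This sketch derives its three hash positions by salting a single SHA-256 (a common trick; the bit-array size and hash count are illustrative, not tuned for a real false-positive target):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)  # the whole filter is just bytes

    def _positions(self, item):
        # Simulate k independent hash functions by salting one hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        # "User saw this post": set all k bits to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # If ANY bit is 0, the item was definitely never added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("post_123")
"post_123" in seen  # always True: no false negatives
```

The asymmetry is the whole point: membership tests can lie "yes" (rarely), but never lie "no", which is exactly the safe direction for hiding already-seen posts.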
[!NOTE] War Story: Database Meltdown Averted A major social media platform once tried querying a Cassandra cluster for every post candidate to check if a user had seen it. At 100 million active users, this caused catastrophic read timeouts and brought down the entire feed. By placing a Bloom Filter in memory before the database check, they eliminated 95% of database reads, saving millions in hardware costs and dropping feed load times by hundreds of milliseconds.
5. Step 3: Ranking (The Scoring Engine)
Visualized as Step 3: ML Scoring in the diagram. Now we have 2,000 filtered candidates. We need to pick the top 20 to show first. We pass them to the Scoring Engine.
Model Types
- LightGBM / XGBoost: Tree-based models. Fast and explainable. Good for “Click vs No Click”.
- Deep Neural Networks (Two-Tower): One tower for User Features, one for Item Features. Good for capturing complex non-linear relationships.
Feature Engineering
The Scoring Engine retrieves data from the Feature Store (User & Post Vectors) to build a Feature Vector:
| Feature Type | Examples |
|---|---|
| User Features | Age, Gender, Device, Past Click Rate. |
| Post Features | Media Type (Video/Image), Hashtags, Age of Post, Creator Quality. |
| Context Features | Time of Day, Day of Week, Network Speed. |
| Interaction Features | “Does User X follow Creator Y?”, “Did User X like Creator Y’s last 5 posts?” |
The Prediction
The model outputs a probability for each action, e.g. P(Like) = 0.45, P(Share) = 0.05.
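Putting the feature table and the prediction together: the sketch below assembles a feature vector from the four feature types and scores it with a toy logistic model. All feature names, weights, and the bias are illustrative; in production the scorer is a trained LightGBM or Two-Tower model, not a hand-written linear function.

```python
import math

def build_feature_vector(user, post, context, interactions):
    # Assemble the model's inputs from Feature Store lookups.
    return {
        "user_past_ctr":   user["past_click_rate"],          # User feature
        "post_age_hours":  post["age_hours"],                # Post feature
        "post_is_video":   1.0 if post["media_type"] == "video" else 0.0,
        "ctx_hour_of_day": context["hour"] / 24.0,           # Context feature
        "follows_creator": 1.0 if interactions["follows_creator"] else 0.0,
    }

def predict_p_like(features, weights, bias=-2.0):
    # Stand-in for the trained model: linear score squashed to a probability.
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic function → value in (0, 1)

features = build_feature_vector(
    user={"past_click_rate": 0.3},
    post={"age_hours": 2.0, "media_type": "video"},
    context={"hour": 20},
    interactions={"follows_creator": True},
)
weights = {"user_past_ctr": 1.5, "post_age_hours": -0.1,
           "post_is_video": 0.8, "ctx_hour_of_day": 0.2,
           "follows_creator": 1.2}
p_like = predict_p_like(features, weights)  # a probability in (0, 1)
```

In a real system one such probability is produced per action (like, share, click), and the final score is a weighted blend of them.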
6. Step 4: Re-Ranking (Business Logic)
The raw scores might result in a feed of 20 videos from the same creator. This is bad UX. We apply Diversity Rules:
- Author Diversity: No more than 2 posts from the same author in a row.
- Content Diversity: Mix of Photos, Videos, and Text.
- Ad Injection: Insert an Ad every 5th slot.
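These business rules can be expressed as a post-processing pass over the score-sorted list. A minimal sketch; the greedy streak-breaking strategy and the `(post_id, author)` tuple shape are my own simplification, not a standard algorithm.

```python
def apply_diversity(posts, max_run=2):
    # Greedily reorder a best-first list of (post_id, author) tuples so no
    # author appears more than max_run times in a row.
    feed, pending = [], list(posts)
    while pending:
        for i, (post_id, author) in enumerate(pending):
            recent = [a for _, a in feed[-max_run:]]
            if not (len(recent) == max_run and all(a == author for a in recent)):
                feed.append(pending.pop(i))  # best post that doesn't extend a streak
                break
        else:
            feed.extend(pending)  # every remaining post violates the rule; give up
            pending = []
    return feed

def inject_ads(feed, ads, every=5):
    # Insert an ad after every `every`-th organic post.
    out, ad_iter = [], iter(ads)
    for i, post in enumerate(feed, start=1):
        out.append(post)
        if i % every == 0:
            ad = next(ad_iter, None)
            if ad is not None:
                out.append(("ad", ad))
    return out

ranked = [("p1", "alice"), ("p2", "alice"), ("p3", "alice"), ("p4", "bob")]
diversified = apply_diversity(ranked)  # alice's 3-post streak is broken by bob's post
```

Crucially, re-ranking runs after ML scoring and only shuffles the top candidates: the model decides what is good, the business rules decide how it is arranged.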
Negative Feedback Loops
If a user clicks “Hide Post” or “Not Interested”, it is a critical signal.
- Immediate Action: Remove post from feed.
- Short Term: Downgrade the Author’s Score for this user.
- Long Term: Adjust the User’s Vector in the Feature Store.
- If they hid a “Political” post, move their preference vector away from the “Politics” cluster.
- This ensures the Retrieval Step (Step 1) fetches fewer political posts next time.
- Real-time updates: We use streaming pipelines (Kafka + Flink) to update these feature vectors instantly, not just in the nightly batch job.
7. The Algorithm (EdgeRank & Beyond)
Facebook’s original algorithm (EdgeRank) is a good mental model:
- Affinity: How close are you to the creator? (Close friend = High, Stranger = Low).
- Weight: Cost of interaction. (Comment > Like > Click > View).
- Time Decay: `1 / (Time Since Posted)`. Old news is bad news.
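A back-of-the-envelope sketch of an EdgeRank-style score. The interaction weights and the divide-by-zero guard are illustrative choices, not Facebook's real constants:

```python
def edgerank(affinity, weight, hours_since_posted):
    # EdgeRank-style score: affinity x weight x time decay (1 / age).
    time_decay = 1.0 / max(hours_since_posted, 1.0)  # guard against zero age
    return affinity * weight * time_decay

# Weight encodes the "cost" of the predicted interaction:
# Comment > Like > Click > View.
WEIGHTS = {"comment": 4.0, "like": 2.0, "click": 1.0, "view": 0.5}

fresh_post = edgerank(affinity=0.9, weight=WEIGHTS["comment"], hours_since_posted=1)
old_post   = edgerank(affinity=0.9, weight=WEIGHTS["comment"], hours_since_posted=48)
```

Even with identical affinity and weight, the 48-hour-old post scores 1/48th of the fresh one, which is exactly the "old news is bad news" behavior the Ranking Calculator below lets you explore.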
See Caching 101 for how we cache these scores.
8. Architecture Diagram
Ranking & Aggregation Pipeline Architecture.
9. Interactive Decision Visualizer
The Ranking Calculator
[!TIP] Try it yourself: Adjust the parameters to see how a post’s score changes. Notice how Time Decay destroys the score of old posts, even if they are from your best friend.
Scoring Parameters
10. Observability: Monitoring the Brain
Since the Ranking Engine is a black box, we need specific metrics.
- Feature Freshness: How old is the user’s “Last Clicked” feature? If > 5 mins, the pipeline is broken.
- Score Distribution: Histogram of scores. If all scores are 0.001, the model is miscalibrated.
- CTR by Model Version: Compare `v1.2` vs `v1.3` in real-time (A/B Testing).
11. Connection to Other Modules
- Caching: We rely heavily on Redis. See Module 6: Caching.
- Vector Search: For discovery. See Module 15: Deep Dive Data.
- Fanout: The core distribution mechanism. See Twitter Timeline.