Research February 24, 2026 · Updated February 24, 2026 · 15 min read

Google Discover Architecture: Clusters, Classifiers, OG Tags, NAIADES - What SDK Telemetry Reveals

Metehan Yesilyurt

AI Search & SEO Researcher

Google Discover serves content to hundreds of millions of users every day, yet its internal mechanics remain largely opaque. Most SEO guidance about Discover comes from Google’s own documentation or anecdotal publisher observations. In this post, I want to share a different perspective: what we can learn by examining the observable telemetry, event naming conventions, and client-side state of Google’s own SDK.

Important note: Everything described below reflects what the SDK reveals at a specific point in time. This is Google. They can change any of these systems, ranking signals, pipeline stages, telemetry counters, feature flags, on the server side at any moment, without any client update. What you read here is a snapshot, not a permanent blueprint. Treat it as a lens into how these systems work today, not a guarantee of how they will work tomorrow.

This is not speculation. Every finding referenced below traces back to specific strings, event constants, or configuration values that the SDK exposes during normal operation. Where something is an inference rather than a direct observation, I say so. I had to work through a large amount of data and have trimmed many parts.

Think of this as reading the nutrition label on a packaged food. You cannot see the factory, but the label tells you quite a lot about what is inside. Schema.org markup exists in the pipeline, but it is obfuscated, and it is hard to connect it to the full Discover pipeline with any clarity or certainty.

Special thanks to Valentin Pletzer

This post is a summary. I created a full dashboard, available in the latest post at metehanai.substack.com.

The 9-Stage Content Pipeline

Discover’s content pipeline can be mapped to 9 observable stages, each leaving distinct telemetry traces:

  1. Content Ingestion - Google crawls and indexes content. Entity extraction assigns Knowledge Graph MIDs (/m/xxxxx) or (/g/xxxxx) to recognized topics.
  2. Open Graph Tag Parsing - The SDK parses exactly 6 page-level meta signals: og:image, og:title, og:site_name, og:locale, og:image:secure_url, and article:content_tier. Of these, og:image and og:title appear mandatory; Twitter tags, the hardcoded HTML title, and schema markup come in only as fallbacks. No image, no card. (We may see Nano Banana here soon!)
  3. Content Classification - Content is classified into types like EVERGREEN_VIBRANT or BREAKING_NEWS, and assigned to one of 13 cluster types (more on these below).
  4. Collection Gate - A binary check: isCollectionHiddenFromEmberFeed. If this boolean is TRUE, all content from that publisher is blocked. There is no graduated suppression at this level - it is on or off.
  5. User Interest Matching - Content entity MIDs are matched against the user’s interest profile. The SDK tracks this through ember_item_matched_retrieval_mids.
  6. Ranking (Server-Side) - A predicted click-through rate model runs server-side. The event PCTR_MODEL_TRIGGERED confirms its existence, though the model itself is not visible in the client SDK. That’s what I told you in the Google AI Mode post!
  7. Feed Assembly - Content is organized into the hierarchy: MAIN_FEED → COLLECTION → CLUSTER → CARD.
  8. Delivery - The feed reaches the device via gRPC streaming, background WorkManager sync, beacon push, or cache.
  9. Feedback Loop - User interactions (dismissals, follows, saves, engagement time) feed back into personalization and filtering for future sessions.

What is notable here is the ordering. The collection-level filter (stage 4) runs before interest matching and the pCTR model. This means a publisher blocked at the collection level never even reaches the ranking stage, regardless of how relevant their content might be to a user.
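
The ordering implication above can be sketched in a few lines of Python. Everything here is illustrative (the function, field, and variable names are invented, not SDK identifiers); the point is only that the binary collection gate runs before interest matching and ranking, so a gated publisher never gets scored:

```python
# Hypothetical sketch of the stage ordering: the collection gate
# (stage 4) runs before interest matching (stage 5) and ranking
# (stage 6), so a blocked publisher never reaches the pCTR stage.

def assemble_candidates(items, hidden_collections, user_interest_mids):
    candidates = []
    for item in items:
        # Stage 4: binary collection gate -- on or off, no graduation.
        if item["collection_id"] in hidden_collections:
            continue  # never reaches interest matching or ranking
        # Stage 5: match entity MIDs against the user's interest profile.
        if not set(item["entity_mids"]) & user_interest_mids:
            continue
        candidates.append(item)
    # Stage 6 (server-side in reality): rank the survivors.
    return sorted(candidates, key=lambda i: i["pctr"], reverse=True)

feed = assemble_candidates(
    [
        {"collection_id": "pub_a", "entity_mids": ["/m/0k8z"], "pctr": 0.9},
        {"collection_id": "pub_b", "entity_mids": ["/m/0k8z"], "pctr": 0.7},
    ],
    hidden_collections={"pub_a"},
    user_interest_mids={"/m/0k8z"},
)
# pub_a is gated out despite its higher pCTR; only pub_b survives.
```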

The pCTR Model: What We Can and Cannot See

The existence of a pCTR (predicted click-through rate) model in Discover is well-documented by Google. The SDK confirms it through the event string PCTR_MODEL_TRIGGERED. However, deeper analysis reveals an important nuance: this event lives in a generic telemetry enum shared across many Google features (alongside PAYMENTS, KEYBOARD, and other non-Discover events), not in a Discover-specific class. The pCTR model almost certainly runs server-side. pCTR is also an ads-related term that has come up several times in the DOJ trials. Since the metrics show server-side connections, a search-specific variant of the metric is entirely possible; this is Google!

What the SDK does confirm on the client side are the signals that get packaged and sent to the server:

  • Title text - extracted from og:title and packaged into a ContentMetadata protobuf sent to discover-pa.googleapis.com
  • Image quality - the SDK flags images below a threshold as LOW_QUALITY_IMAGE and tracks image_width and image_height
  • Freshness - measured via freshness_delta_in_seconds
  • Historical CTR - derived from per-URL click_count and show_count
  • Image load success - tracked through image_load_failure_count

It is plausible that og:title is a direct pCTR model input: it is extracted, serialized, and transmitted to Google's servers alongside the other content metadata, but the actual model evaluation happens server-side, beyond what the client SDK can reveal. What we can say with certainty is that title text is part of the data payload the server receives before ranking decisions are made. Title optimization still matters; we just cannot claim to have seen the model consume it directly.
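
The packaging step can be pictured as follows. This is a hypothetical sketch: a plain dict stands in for the ContentMetadata protobuf sent to discover-pa.googleapis.com, the function name is invented, and the field names merely echo the SDK counters listed above rather than a confirmed schema:

```python
# Illustrative packaging of the client-side signals sent to the server.
# The 1200px threshold for image quality is an assumption here, taken
# from the hero-card minimum discussed later in this post.

def build_content_metadata(page, now_s, published_s, clicks, shows):
    return {
        "title": page.get("og:title"),  # extracted from og:title
        "image_width": page.get("image_width"),
        "image_height": page.get("image_height"),
        "low_quality_image": page.get("image_width", 0) < 1200,
        "freshness_delta_in_seconds": now_s - published_s,
        "historical_ctr": clicks / shows if shows else None,
    }

meta = build_content_metadata(
    {"og:title": "Example", "image_width": 1400, "image_height": 788},
    now_s=1_000_000, published_s=913_600, clicks=30, shows=1000,
)
# meta["freshness_delta_in_seconds"] == 86_400 (one day old)
```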

Freshness Decay Buckets

Discover has a well-defined freshness decay system with three named time buckets plus a continuous staleness signal:

| Bucket | Window | Weight |
| --- | --- | --- |
| 1_TO_7_DAYS | 1–7 days old | Highest freshness weight |
| 8_TO_14_DAYS | 8–14 days old | Medium |
| 15_TO_30_DAYS | 15–30 days old | Low |
| staleness_in_hours | 30+ days | Continuous decay |

The first week is when content has its best chance. After 30 days, staleness is tracked in hours and decays continuously. This does not mean evergreen content cannot appear in Discover: the content classification system has an explicit EVERGREEN_VIBRANT type that may receive different treatment, but the freshness signal works against older content by default.
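
The bucket mapping in the table above can be sketched as a small function. The bucket names are the SDK's; the boundary handling here is my reading of the window labels, and the switch to a continuous counter past 30 days is as described above:

```python
# Map content age to the SDK's freshness buckets. Past 30 days,
# the signal becomes a continuous staleness counter in hours.

def freshness_bucket(age_in_days):
    if age_in_days <= 7:
        return "1_TO_7_DAYS"       # highest freshness weight
    if age_in_days <= 14:
        return "8_TO_14_DAYS"      # medium
    if age_in_days <= 30:
        return "15_TO_30_DAYS"     # low
    return "staleness_in_hours"    # continuous decay

freshness_bucket(3)    # -> "1_TO_7_DAYS"
freshness_bucket(21)   # -> "15_TO_30_DAYS"
freshness_bucket(45)   # -> "staleness_in_hours"
```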

Open Graph Tags: The 6 That Matter

Publishers often wonder which meta tags Discover actually uses. The SDK provides a clear answer: exactly 6 tags are parsed. The client side confirms the parsing; whether the server consumes them directly or keeps them only as fallbacks is unknowable from here. It's Google.

  • og:image - Necessary. Without it, no card is rendered. The event EMBER_FEED_THUMBNAILS_DOWNLOADED tracks successful image fetches.
  • og:title - Necessary. Extracted into the ContentMetadata payload sent to Google’s servers for ranking.
  • og:site_name - Recommended. Displayed as the publisher attribution on the card.
  • og:locale - Recommended. Matched against the user’s locale for feed eligibility.
  • og:image:secure_url - Optional. The HTTPS variant of the image URL.
  • article:content_tier - Recommended. Classifies content as free, metered, or paywall.

One of the most useful findings from the SDK is the exact fallback order when primary tags are missing. These chains are hardcoded:

  • Title: og:title → twitter:title → title (HTML meta name)
  • Image: og:image → og:image:secure_url → twitter:image:src → image → twitter:image
  • Publisher: og:site_name → author (HTML meta name)
  • Language: og:locale → inLanguage (JSON-LD) → hardcoded "en"
  • Paywall: article:content_tier + isAccessibleForFree (JSON-LD boolean)

This means if you are missing og:title, the system will try twitter:title before falling back to the HTML <title> tag. The image fallback chain is five levels deep; the system tries hard to find an image before giving up.
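
These hardcoded chains can be expressed as a small resolver. The chain contents come from the SDK; the function itself and the `FALLBACK_CHAINS` structure are an illustrative sketch, not SDK code:

```python
# Fallback resolution for the four chained fields. Each chain is
# tried in order; the first tag with a non-empty value wins.

FALLBACK_CHAINS = {
    "title": ["og:title", "twitter:title", "title"],
    "image": ["og:image", "og:image:secure_url", "twitter:image:src",
              "image", "twitter:image"],
    "publisher": ["og:site_name", "author"],
    "language": ["og:locale", "inLanguage"],  # then hardcoded "en"
}

def resolve(field, page_tags, default=None):
    for tag in FALLBACK_CHAINS[field]:
        if page_tags.get(tag):
            return page_tags[tag]
    return default

page = {"twitter:title": "Fallback headline",
        "og:image": "https://example.com/a.jpg"}
resolve("title", page)           # -> "Fallback headline" (og:title missing)
resolve("language", page, "en")  # -> "en" (the hardcoded default)
```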

OG Rewrite in Action

[Image: DonanimHaber OG source code]

[Image: DonanimHaber SERP title]

Now let's see how it appears in Discover below.

[Image: DonanimHaber OG Discover card]

And this is the og:image link: https://www.donanimhaber.com/images/images/haber/202436/src_340x1912xtaalas-yapay-zek-ciplerinde-devrim-yaratabilir.jpg

This JPG is referenced only in the og:image and twitter:image tags, not in the schema.org markup. (Is it proven? No: a different, 1400x788px image appears in the schema tag. It's Google; decide for yourself.)

The SDK also reveals two blocking metatags that halt the pipeline entirely: nopagereadaloud and notranslate. When either is detected, the system throws an error and stops processing that page. If your CMS or translation plugin injects notranslate as a meta tag, your content will not enter the Discover pipeline at all.

For images specifically: the minimum width for a large (hero) card format is 1200px. Smaller images result in a thumbnail card format, which typically sees lower engagement. The SDK also confirms WebP support (DISCOVERY_CARD_WEBP_IMAGE_SUPPORT).
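
Two of the client-side checks just described can be combined into one sketch. The two blocking metatags and the 1200px threshold come from the SDK; the function name, the error handling, and the card-format labels are illustrative:

```python
# Sketch of two client-side gates: blocking metatags halt the pipeline
# entirely, and image width decides hero vs. thumbnail card format.

BLOCKING_METATAGS = {"nopagereadaloud", "notranslate"}
MIN_HERO_WIDTH = 1200  # minimum width for the large (hero) card

def card_format(meta_names, image_width):
    if BLOCKING_METATAGS & set(meta_names):
        # Either metatag stops processing of the page altogether.
        raise ValueError("blocking metatag present; page exits the pipeline")
    return "HERO" if image_width >= MIN_HERO_WIDTH else "THUMBNAIL"

card_format(["viewport", "description"], 1400)  # -> "HERO"
card_format(["viewport"], 800)                  # -> "THUMBNAIL"
```

A CMS plugin that injects `notranslate` as a meta tag would trip the first check for every page on the site, which is why that detail is worth auditing.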

13 Cluster Types

Every card in the Discover feed belongs to a cluster, and there are 13 observable cluster types:

  • neoncluster - the primary content cluster
  • geotargetingstories - location-based stories
  • deeptrends and deeptrendsfable - trending topic narratives
  • freshvideos - recent video content
  • mustntmiss - priority/must-read content
  • newsstoriesheadlines - breaking news
  • homestack - widget cards (weather, sports scores)
  • garamondrelatedarticlegrouping - related article groups
  • trendingugc - user-generated trending content
  • signinlure - sign-in prompts
  • iospromo - cross-platform promotion
  • moonstone - an internal-codename cluster

What stands out is how specialized many of these clusters are. mustntmiss suggests there is a priority queue of content the system considers essential to show. garamondrelatedarticlegrouping (and a related feature flag apply_fake_garamond_header) hints that the system can synthetically create related-article groupings - combining separate articles under a shared topic heading.

The Personalization Stack

Discover’s personalization draws from both shared Google infrastructure and Discover-specific systems. It helps to think of it as four layers:

  1. Geller / AIP Interest Graph (shared) - An on-device user interest store synced via named synclets. This is shared infrastructure used across Google Assistant, Search, and Discover. Each interest is stored as a Knowledge Graph MID with confidence and importance scores.
  2. NAIADES (shared) - A Google-wide personalization system with 18 content subtypes, including MID_BASED_NAIADES, QUERY_BASED_NAIADES, SPORTS, TRENDING, WPAS (Web Publisher Articles Signal; possibly deprecated or legacy), and RECALL_BOOST. Discover appears to be one of several consumers (only server-side access could confirm this, and we have none).
  3. Persistent State (Discover-specific) - User actions tracked per content: follows, hearts/likes, saves, and tombstones (dismissed content). Tombstones permanently prevent resurfacing.
  4. Engagement Signals (Discover-specific) - Dwell time (engagement_time_msec), session-level engagement, and a Discover-specific engagement level metric.

An important subtlety: the NAIADES subtype WPAS stands for Web Publisher Articles Signal, which corresponds to Google News Publisher Center registration. This means content from Publisher Center-registered sources receives a distinct classification in the personalization pipeline. Similarly, RECALL_BOOST literally increases retrieval priority from the candidate pool. It boosts content during the retrieval phase, before the pCTR model even runs.
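
What RECALL_BOOST implies can be pictured as a retrieval-time multiplier. This is a sketch under an assumption: the subtype name is from the SDK, but the multiplier mechanics, the 2.0 factor, and every function and field name here are invented for illustration:

```python
# Hypothetical retrieval-time boost: candidates carrying the boost
# get a higher effective retrieval score BEFORE the pCTR model runs.

def retrieve(candidates, boosted_ids, k=2):
    def priority(c):
        boost = 2.0 if c["id"] in boosted_ids else 1.0  # illustrative factor
        return c["retrieval_score"] * boost
    return sorted(candidates, key=priority, reverse=True)[:k]

pool = [
    {"id": "a", "retrieval_score": 0.8},
    {"id": "b", "retrieval_score": 0.5},
    {"id": "c", "retrieval_score": 0.6},
]
[c["id"] for c in retrieve(pool, boosted_ids={"b"})]
# -> ["b", "a"]: b's boosted 0.5 * 2.0 = 1.0 outranks a's 0.8
```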

Content Filtering: The Two-Level Architecture

Discover’s content filtering operates at two levels, each with its own telemetry counter:

  • Collection level (filter_collection_status) - blocks ALL content from a publisher/domain
  • Entity level (filter_entity_status) - blocks a single URL

The collection-level filter is asymmetric in an important way. The boolean isCollectionHiddenFromEmberFeed can suppress an entire publisher, but there is no observable equivalent “boost collection” flag anywhere in the system. The penalty surface is broader than the reward surface.

When a user selects “Don’t show content from [Publisher]” in the card menu, this triggers the collection-level filter. A single article that generates enough negative feedback can suppress an entire publication. That reaction applies to all content from that domain, not just the triggering article.
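
The asymmetry can be made concrete with a short sketch. Only isCollectionHiddenFromEmberFeed is an observed SDK identifier; the functions and the state store are hypothetical:

```python
# One "Don't show content from [Publisher]" action flips the binary
# collection-level flag, hiding every URL on the domain. There is no
# observable "boost collection" counterpart anywhere in the system.

def dismiss_publisher(filter_state, collection_id):
    filter_state[collection_id] = {"isCollectionHiddenFromEmberFeed": True}
    return filter_state

def is_visible(filter_state, collection_id):
    flags = filter_state.get(collection_id, {})
    # Binary check: no graduated suppression at this level.
    return not flags.get("isCollectionHiddenFromEmberFeed", False)

state = dismiss_publisher({}, "example.com")
is_visible(state, "example.com")  # -> False, for every URL on the domain
is_visible(state, "other.com")    # -> True
```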

Suppression and Counterfactual Experiments

Discover runs counterfactual experiments, a standard practice in recommendation systems where content is intentionally withheld from some users to measure the causal impact of showing vs. not showing it. The SDK exposes 5 suppression mechanisms:

  • SHOW_SKIPPED_DUE_TO_COUNTERFACTUAL - content withheld for an A/B experiment
  • DELIVERED_COUNTERFACTUAL - content delivered in a counterfactual group
  • FETCHED_COUNTERFACTUAL - content fetched but suppressed
  • VISIBILITY_REPRESSED_COUNTERFACTUAL - visibility explicitly repressed
  • background_refresh_rug_pull_count - content cards withdrawn after initially being pushed to the feed

The last one is particularly interesting. The “rug pull” counter tracks cases where content was pushed to the feed and then removed during a background refresh. This means Discover can retroactively remove content that was already in the feed, not just filter it before display.

The Beacon Push System

Most Discover content arrives through pull-based feed requests, but there is also a push channel. The Beacon Push system allows Google’s servers to proactively push content to a user’s device:

  • DISCOVER_BEACON_PUSH_RECEIVED - push arrives from server
  • DISCOVER_BEACON_PUSH_ACCEPTED - push passes local quality/budget checks
  • DISCOVER_BEACON_PUSH_REJECTED - push rejected locally

The acceptance/rejection mechanism suggests that beacon pushes go through local filtering even after being server-selected. The SDK also reveals donated_sports_documents_count, indicating that sports content is a primary use case for the beacon push channel.
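
The local accept/reject step those three events imply can be sketched as follows. The event names are from the SDK; the check logic, the budget value, and the image requirement are assumptions made for illustration:

```python
# Hypothetical local filter for beacon pushes: server-selected content
# still passes a client-side quality/budget check before acceptance.

DAILY_PUSH_BUDGET = 5  # illustrative budget, not an observed value

def handle_beacon_push(push, pushes_accepted_today):
    events = ["DISCOVER_BEACON_PUSH_RECEIVED"]
    over_budget = pushes_accepted_today >= DAILY_PUSH_BUDGET
    if over_budget or not push.get("image_url"):
        events.append("DISCOVER_BEACON_PUSH_REJECTED")
    else:
        events.append("DISCOVER_BEACON_PUSH_ACCEPTED")
    return events

handle_beacon_push({"image_url": "https://example.com/a.jpg"}, 0)
# -> ["DISCOVER_BEACON_PUSH_RECEIVED", "DISCOVER_BEACON_PUSH_ACCEPTED"]
```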

Web Stories Bypass Standard Ranking

Web Stories (Google’s AMP-based story format, internally codenamed STAMP) have their own rendering pipeline that operates independently from the standard article ranking. Key observations:

  • They render inline via INLINE_STAMP_VIEWER_FRAGMENT rather than as standard cards
  • They appear in carousels via INLINE_STAMP_VIEWER_SLIDE_FRAGMENT
  • They have their own recommendation engine: STAMP_VIEWER_RECOMMENDATIONS
  • Recommendations are preloaded before the user finishes the current story

This means Web Stories are not competing directly with articles in the standard pCTR ranking. They have their own dedicated pipeline and placement mechanism.

150 Concurrent A/B Experiments

Perhaps the most striking observation is the experiment load. During an observed session, approximately 150 server-side A/B experiment IDs were active simultaneously, stored in the session state. These follow the format gws:NNNNNNN (GWS standing for Google Web Server).

This means that at any given time, a Discover user is participating in roughly 150 concurrent experiments that may affect which content they see, how it is ranked, and how it is rendered. Two otherwise identical users could see meaningfully different feeds purely based on experiment bucket allocation.

51 Runtime Feature Flags

Beyond server-side experiments, 51 client-side feature flags control Discover’s behavior at render time. These flags are organized across 15 categories including ContentViewer, HomestackFeed, PrefabsRendering, and SportsWidget.

A few flags worth noting:

  • PrefabsRendering.disable_ai_summary_disclaimer - a flag to remove the “Generated with AI” disclaimer from AI-generated summaries
  • PrefabsRendering.title_expands_ai_summary - clicking a title auto-expands an AI summary
  • DiscoverGaramondRendering.apply_fake_garamond_header - can create synthetic related-article groupings
  • HomestackChrome.enable_homestack_on_clank - “Clank” is Chrome’s internal codename, confirming that Homestack widgets integrate with Chrome

Per-Card Diagnostic Metadata

Every card in a Discover feed carries 8 internal diagnostic metadata fields beyond what users see:

  • Panoptic Source Channel - which internal pipeline surfaced this content
  • DocFingerprint - a unique hash for deduplication across sessions and devices
  • Web and App Activity Enabled - whether personalization is possible
  • Discover Personalization Enabled - whether it is active for this card
  • NeoformId - the rendering template identifier, mapping to a specific card layout variant
  • Is Feature Personalized - whether this specific card was personalized vs. generic/trending
  • Is User Signed-In - sign-in state affects which signals are available
  • Sherlog URL - Google’s internal debugging system URL for this card render

The DocFingerprint is particularly interesting from a deduplication perspective. It confirms that Discover tracks document identity across sessions and devices, preventing the same content from resurfacing after dismissal or extended exposure.

Real-Time Feed Delivery

Discover does not simply fetch a static list of cards. The SDK reveals a persistent gRPC connection architecture with six distinct service endpoints — from standard feed rendering to a dedicated streaming variant that keeps the feed alive in real-time. The connection lifecycle includes initialization handshakes, action payloads (your taps and scrolls sent back to the server), token refreshes, and automatic reconnection logic when the stream drops.

What makes this interesting for publishers: your content does not wait for the user to pull-to-refresh. The server can inject new cards, reorder existing ones, or remove stale content mid-session through data operation payloads. The feed is a living stream, not a snapshot.

Eligibility Gating

Before any content flows, Discover runs a 3-stage eligibility check:

  1. Local device checks - device-level requirements
  2. Server validation - network-based validation
  3. Google Mobile Services - GMS Core availability

Two ineligibility reasons are worth highlighting: INELIGIBLE_DISCOVER_DISABLED_BY_DSE means setting a non-Google default search engine disables Discover, and INELIGIBLE_DISCOVER_DISABLED_BY_ENTERPRISE_POLICY means enterprise device management can block it entirely.
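
Combining the three stages and the two reason strings gives a compact sketch. The two INELIGIBLE_* strings are observed SDK values; the check function, the field names, and the exact ordering of checks are illustrative:

```python
# Hypothetical eligibility gate. The GMS Core failure reason string is
# a placeholder; the exact constant was not observed.

def check_eligibility(device):
    if device.get("default_search_engine") != "google":
        return "INELIGIBLE_DISCOVER_DISABLED_BY_DSE"
    if device.get("enterprise_policy_blocks_discover"):
        return "INELIGIBLE_DISCOVER_DISABLED_BY_ENTERPRISE_POLICY"
    if not device.get("gms_core_available", True):
        return "INELIGIBLE"  # placeholder reason string
    return "ELIGIBLE"

check_eligibility({"default_search_engine": "duckduckgo"})
# -> "INELIGIBLE_DISCOVER_DISABLED_BY_DSE"
```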

The Three-Layer Dismissal Chain

When a user dismisses content in Discover, three separate persistence records are created:

  1. Dismiss overlay ID - records the dismissal interaction
  2. Filter status - updates the entity or collection filter
  3. Tombstone - a permanent per-content record at /persistent/tombstone_{id}/data that prevents resurfacing

These share the same content identifiers, creating a chain. The tombstone layer is permanent - dismissed content does not come back.
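
The three-layer chain can be sketched in a few lines. The tombstone path format is from the SDK; the other record names and the state store are illustrative:

```python
# One dismissal writes three records sharing the same content id:
# an overlay record, a filter-status update, and a permanent tombstone.

def dismiss(content_id, state):
    state["dismiss_overlays"].append(content_id)              # layer 1
    state["filter_entity_status"][content_id] = "FILTERED"    # layer 2
    state[f"/persistent/tombstone_{content_id}/data"] = True  # layer 3
    return state

state = dismiss("doc123", {"dismiss_overlays": [], "filter_entity_status": {}})
"/persistent/tombstone_doc123/data" in state  # -> True, and it never expires
```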

What This Means for Publishers

Let me be clear about what this analysis is and is not. It is a set of observations about how Google Discover’s client-side systems are instrumented. It is not a reverse-engineering of server-side ranking algorithms, which remain on Google’s servers and are not directly observable.

That said, some practical observations emerge:

  • og:image is important. No image, no Discover card. Use images at least 1200px wide for hero-card eligibility, whether referenced in schema, og, or twitter tags.
  • og:title is packaged. It is plausible that it is a direct pCTR input, but this is unconfirmed from client-side observation alone. Either way, it is part of the data payload that informs ranking decisions.
  • Freshness matters structurally. The first 7 days carry the highest freshness weight. After 30 days, continuous staleness decay begins.
  • Collection-level blocking is binary and asymmetric. One bad article can block an entire domain, but there is no equivalent blanket boost. (Needs verification: only client-side events were captured; we cannot access server-side configuration, and Google can change anything at any time.)
  • Publisher Center registration creates a distinct signal. The WPAS subtype means registered publishers get different classification treatment.
  • Web Stories bypass standard ranking. They have their own pipeline, carousel placement, and recommendation engine.
  • User feedback is permanent. Tombstones do not expire. Dismissed content stays dismissed.

Methodology Note

All findings in this analysis are derived from observable SDK telemetry — event constants, configuration values, and client-side state visible during normal Discover operation. No server-side systems were accessed. Where findings are confirmed via exact string matches, they are labeled as such. Where findings are inferences based on naming conventions or event ordering, that is noted.

As a reminder: these observations reflect the current state of the SDK. Google continuously evolves its infrastructure. Server-side ranking models, experiment allocations, and pipeline stages can all change independently of the client. What we can observe is the instrumentation — the questions the system asks and the answers it records — which reveals the architecture even as the parameters shift underneath.

The full technical dashboard with all 276 event constants, 56 telemetry counters, 18 NAIADES subtypes, and individually fact-checked findings is available separately.
