Google Discover Architecture: Clusters, Classifiers, OG Tags, NAIADES - What SDK Telemetry Reveals
Google Discover serves content to hundreds of millions of users every day, yet its internal mechanics remain largely opaque. Most SEO guidance about Discover comes from Google’s own documentation or anecdotal publisher observations. In this post, I want to share a different perspective: what we can learn by examining the observable telemetry, event naming conventions, and client-side state of Google’s own SDK.
Important note: Everything described below reflects what the SDK reveals at a specific point in time. This is Google. They can change any of these systems, ranking signals, pipeline stages, telemetry counters, feature flags, on the server side at any moment, without any client update. What you read here is a snapshot, not a permanent blueprint. Treat it as a lens into how these systems work today, not a guarantee of how they will work tomorrow.
This is not speculation. Every finding referenced below traces back to specific strings, event constants, or configuration values that the SDK exposes during normal operation. Where something is an inference rather than a direct observation, I say so. I needed to handle a “great amount of data” and removed many parts.
Think of this as reading the nutrition label on a packaged food. You cannot see the factory, but the label tells you quite a lot about what is inside. Schema.org exists but is obfuscated and it’s hard to connect with the entire Discover pipeline in a clear & certain way.
Special thanks to Valentin Pletzer.
The 9-Stage Content Pipeline
Discover’s content pipeline can be mapped to 9 observable stages, each leaving distinct telemetry traces:
1. Content Ingestion - Google crawls and indexes content. Entity extraction assigns Knowledge Graph MIDs (`/m/xxxxx` or `/g/xxxxx`) to recognized topics.
2. Open Graph Tag Parsing - The SDK parses exactly 6 page-level meta signals: `og:image`, `og:title`, `og:site_name`, `og:locale`, `og:image:secure_url`, and `article:content_tier`. Of these, `og:image` and `og:title` look mandatory; after them come the Twitter tags, the hardcoded HTML title, or schema. No image, no card. (We may see Nano Banana soon!)
3. Content Classification - Content is classified into types like `EVERGREEN_VIBRANT` or `BREAKING_NEWS` and assigned to one of 13 cluster types (more on these below).
4. Collection Gate - A binary check: `isCollectionHiddenFromEmberFeed`. If this boolean is TRUE, all content from that publisher is blocked. There is no graduated suppression at this level - it is on or off.
5. User Interest Matching - Content entity MIDs are matched against the user's interest profile. The SDK tracks this through `ember_item_matched_retrieval_mids`.
6. Ranking (Server-Side) - A predicted click-through rate model runs server-side. The event `PCTR_MODEL_TRIGGERED` confirms its existence, though the model itself is not visible in the client SDK. That is what I noted in the Google AI Mode post!
7. Feed Assembly - Content is organized into the hierarchy: MAIN_FEED → COLLECTION → CLUSTER → CARD.
8. Delivery - The feed reaches the device via gRPC streaming, background WorkManager sync, beacon push, or cache.
9. Feedback Loop - User interactions (dismissals, follows, saves, engagement time) feed back into personalization and filtering for future sessions.

What is notable here is the ordering. The collection-level filter (stage 4) runs before interest matching and the pCTR model. This means a publisher blocked at the collection level never even reaches the ranking stage, regardless of how relevant their content might be to a user.
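That ordering can be made concrete with a short sketch. This is purely illustrative pseudologic, not the SDK's actual code: the function and field names are mine, with only the boolean and counter names taken from the observed telemetry.

```python
# Hypothetical sketch of the observed stage ordering: the collection gate
# (stage 4) runs before interest matching (stage 5), so a gated publisher's
# content never reaches the server-side pCTR ranking (stage 6).
def assemble_candidates(items, user_mids):
    survivors = []
    for item in items:
        # Stage 4: binary collection gate (isCollectionHiddenFromEmberFeed)
        if item["is_collection_hidden_from_ember_feed"]:
            continue  # blocked before relevance is ever considered
        # Stage 5: interest matching against the user's MID profile
        if not (set(item["entity_mids"]) & user_mids):
            continue
        survivors.append(item)
    # Stage 6 (server-side pCTR ranking) would only ever see `survivors`
    return survivors

items = [
    {"entity_mids": ["/m/0k8z"], "is_collection_hidden_from_ember_feed": True},
    {"entity_mids": ["/m/0k8z"], "is_collection_hidden_from_ember_feed": False},
]
print(len(assemble_candidates(items, {"/m/0k8z"})))  # 1
```

The first item is perfectly relevant to the user, yet it is dropped before relevance is ever evaluated - which is exactly why the gate's position in the pipeline matters.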
The pCTR Model: What We Can and Cannot See
The existence of a pCTR (predicted click-through rate) model in Discover is well-documented by Google. The SDK confirms this through the event string PCTR_MODEL_TRIGGERED. However, deeper analysis reveals an important nuance: this event lives in a generic telemetry enum shared across many Google features (alongside PAYMENTS- and KEYBOARD-prefixed events and other non-Discover entries), not in a Discover-specific class. The pCTR model almost certainly runs server-side. pCTR is also an ads-related term that has come up several times in the DOJ trials. Since the metrics show server-side connections, Google could just as easily run a search-specific variant of the same metric.
What the SDK does confirm on the client side are the signals that get packaged and sent to the server:
- Title text - extracted from `og:title` and packaged into a ContentMetadata protobuf sent to `discover-pa.googleapis.com`
- Image quality - the SDK flags images below a threshold as `LOW_QUALITY_IMAGE` and tracks `image_width` and `image_height`
- Freshness - measured via `freshness_delta_in_seconds`
- Historical CTR - derived from per-URL `click_count` and `show_count`
- Image load success - tracked through `image_load_failure_count`
It is plausible that `og:title` is a direct pCTR model input: the title is extracted, serialized, and transmitted to Google's servers alongside the other content metadata, but the actual model evaluation happens server-side, beyond what the client SDK can reveal. What we can say with certainty is that title text is part of the data payload the server receives before ranking decisions are made. Title optimization still matters; we just cannot claim to have seen the model consume it directly.
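To make the shape of that payload concrete, here is an illustrative sketch of the signals listed above bundled into one record. This is not the real ContentMetadata protobuf schema - the field names mirror the observed telemetry counters, but the grouping, the function, and the 1200px quality threshold are my assumptions.

```python
# Hypothetical sketch of the client-side signals the SDK packages before
# sending them toward discover-pa.googleapis.com for server-side ranking.
import time

def build_content_metadata(page, published_at_epoch):
    width, height = page["image_size"]
    return {
        "title": page["og_title"],                        # from og:title
        "image_width": width,
        "image_height": height,
        "low_quality_image": width < 1200,                # threshold is inferred
        "freshness_delta_in_seconds": int(time.time() - published_at_epoch),
        "click_count": page.get("click_count", 0),        # historical CTR inputs
        "show_count": page.get("show_count", 0),
        "image_load_failure_count": page.get("image_load_failure_count", 0),
    }

meta = build_content_metadata(
    {"og_title": "Example headline", "image_size": (1400, 788)},
    published_at_epoch=time.time() - 3600,  # published one hour ago
)
print(meta["low_quality_image"])  # False: 1400px clears the inferred threshold
```

The point of the sketch is simply that everything in this dict is observable client-side; what the server does with it is not.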
Freshness Decay Buckets
Discover has a well-defined freshness decay system with three named time buckets plus a continuous staleness signal:
| Bucket | Window | Weight |
|---|---|---|
| `1_TO_7_DAYS` | 1–7 days old | Highest freshness weight |
| `8_TO_14_DAYS` | 8–14 days old | Medium |
| `15_TO_30_DAYS` | 15–30 days old | Low |
| `staleness_in_hours` | 30+ days | Continuous decay |
The first week is when content has its best chance. After 30 days, staleness is tracked in hours and decays continuously. This does not mean evergreen content cannot appear in Discover: the content classification system has an explicit EVERGREEN_VIBRANT type that may receive different treatment, but the freshness signal works against older content by default.
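The bucket assignment in the table above can be sketched in a few lines. The bucket names are the observed strings; the exact boundary handling (inclusive vs. exclusive days) and the function itself are my assumptions.

```python
# Minimal sketch of the freshness bucket assignment implied by the table.
def freshness_bucket(age_in_hours):
    days = age_in_hours / 24
    if days <= 7:
        return "1_TO_7_DAYS"
    if days <= 14:
        return "8_TO_14_DAYS"
    if days <= 30:
        return "15_TO_30_DAYS"
    # Past 30 days the SDK switches to a continuous signal,
    # tracked as staleness_in_hours.
    return f"staleness_in_hours={int(age_in_hours)}"

print(freshness_bucket(3 * 24))   # 1_TO_7_DAYS
print(freshness_bucket(45 * 24))  # staleness_in_hours=1080
```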
Open Graph Tags: The 6 That Matter
Publishers often wonder which meta tags Discover actually uses. The SDK provides a clear answer: exactly 6 tags are parsed. The client side confirms the parsing; whether the server side uses them directly or only as fallbacks is unknowable from here. It's Google.
- `og:image` - Necessary. Without it, no card is rendered. The event `EMBER_FEED_THUMBNAILS_DOWNLOADED` tracks successful image fetches.
- `og:title` - Necessary. Extracted into the ContentMetadata payload sent to Google's servers for ranking.
- `og:site_name` - Recommended. Displayed as the publisher attribution on the card.
- `og:locale` - Recommended. Matched against the user's locale for feed eligibility.
- `og:image:secure_url` - Optional. The HTTPS variant, needed of course.
- `article:content_tier` - Recommended. Classifies content as `free`, `metered`, or `paywall`.
One of the most useful findings from the SDK is the exact fallback order when primary tags are missing. These chains are hardcoded:
- Title: `og:title` → `twitter:title` → `title` (HTML meta name)
- Image: `og:image` → `og:image:secure_url` → `twitter:image:src` → `image` → `twitter:image`
- Publisher: `og:site_name` → `author` (HTML meta name)
- Language: `og:locale` → `inLanguage` (JSON-LD) → hardcoded `"en"`
- Paywall: `article:content_tier` + `isAccessibleForFree` (JSON-LD boolean)
This means if you are missing og:title, the system will try twitter:title before falling back to the HTML `<title>` tag. The image fallback chain is five levels deep; the system tries hard to find an image before giving up.
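The fallback chains above can be expressed as a small resolver. The chain contents are the observed hardcoded orderings; the resolver function and its data structure are illustrative, not the SDK's actual implementation.

```python
# Sketch of the hardcoded fallback chains: try each tag in order,
# return the first non-empty value, else the last-resort default.
FALLBACK_CHAINS = {
    "title": ["og:title", "twitter:title", "title"],
    "image": ["og:image", "og:image:secure_url", "twitter:image:src",
              "image", "twitter:image"],
    "publisher": ["og:site_name", "author"],
    "language": ["og:locale", "inLanguage"],  # then hardcoded "en"
}

def resolve(field, page_tags):
    for tag in FALLBACK_CHAINS[field]:
        if page_tags.get(tag):
            return page_tags[tag]
    # Only the language chain has a hardcoded last resort.
    return "en" if field == "language" else None

tags = {"twitter:title": "Fallback headline", "og:image": "https://x/y.jpg"}
print(resolve("title", tags))     # Fallback headline (og:title is missing)
print(resolve("language", tags))  # en (hardcoded last resort)
```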
OG Rewrite in Action


Now let's see how it appears in Discover below.

And this is the og:image link: https://www.donanimhaber.com/images/images/haber/202436/src_340x1912xtaalas-yapay-zek-ciplerinde-devrim-yaratabilir.jpg
This JPG is referenced only in the `og:image` and `twitter:image` tags, not in the schema.org markup. (Is that proven? No - a different image, 1400x788px wide, appears in the schema tag. It's Google; you can decide.)
The SDK also reveals two blocking metatags that halt the pipeline entirely: nopagereadaloud and notranslate. When either is detected, the system throws an error and stops processing that page. If your CMS or translation plugin injects notranslate as a meta tag, your content will not enter the Discover pipeline at all.
For images specifically: the minimum width for a large (hero) card format is 1200px. Smaller images result in a thumbnail card format, which typically sees lower engagement. The SDK also confirms WebP support (DISCOVERY_CARD_WEBP_IMAGE_SUPPORT).
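The two client-side gates from the last two paragraphs - blocking metatags halting the pipeline, and image width deciding the card format - can be sketched together. The metatag names and the 1200px threshold are observed; the function, the error, and the format labels are mine, and the real SDK obviously does more than this.

```python
# Sketch of two client-side checks: blocking metatags halt processing
# entirely; image width decides hero vs. thumbnail card format.
BLOCKING_METATAGS = {"nopagereadaloud", "notranslate"}

def card_format(meta_names, image_width):
    blocked = BLOCKING_METATAGS & set(meta_names)
    if blocked:
        # The SDK throws an error and stops processing the page.
        raise ValueError(f"pipeline halted by metatag: {sorted(blocked)[0]}")
    # 1200px is the observed minimum width for the large (hero) format.
    return "hero" if image_width >= 1200 else "thumbnail"

print(card_format(["viewport", "og:image"], 1400))  # hero
print(card_format(["viewport"], 800))               # thumbnail
```

A practical consequence: a single `notranslate` meta tag injected by a plugin fails the first check before image quality is ever evaluated.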
13 Cluster Types
Every card in the Discover feed belongs to a cluster, and there are 13 observable cluster types:
- `neoncluster` - the primary content cluster
- `geotargetingstories` - location-based stories
- `deeptrends` and `deeptrendsfable` - trending topic narratives
- `freshvideos` - recent video content
- `mustntmiss` - priority/must-read content
- `newsstoriesheadlines` - breaking news
- `homestack` - widget cards (weather, sports scores)
- `garamondrelatedarticlegrouping` - related article groups
- `trendingugc` - user-generated trending content
- `signinlure` - sign-in prompts
- `iospromo` - cross-platform promotion
- `moonstone` - an internal-codename cluster
What stands out is how specialized many of these clusters are. mustntmiss suggests there is a priority queue of content the system considers essential to show. garamondrelatedarticlegrouping (and a related feature flag apply_fake_garamond_header) hints that the system can synthetically create related-article groupings - combining separate articles under a shared topic heading.
The Personalization Stack
Discover’s personalization draws from both shared Google infrastructure and Discover-specific systems. It helps to think of it as four layers:
- Geller / AIP Interest Graph (shared) - An on-device user interest store synced via named synclets. This is shared infrastructure used across Google Assistant, Search, and Discover. Each interest is stored as a Knowledge Graph MID with confidence and importance scores.
- NAIADES (shared) - A Google-wide personalization system with 18 content subtypes, including `MID_BASED_NAIADES`, `QUERY_BASED_NAIADES`, `SPORTS`, `TRENDING`, `WPAS` (Web Publisher Articles Signal - deprecated or legacy?), and `RECALL_BOOST`. Discover is one of several consumers (the server side could confirm this, but we have no access).
- Persistent State (Discover-specific) - User actions tracked per content item: follows, hearts/likes, saves, and tombstones (dismissed content). Tombstones permanently prevent resurfacing.
- Engagement Signals (Discover-specific) - Dwell time (`engagement_time_msec`), session-level engagement, and a Discover-specific engagement level metric.

An important subtlety: the NAIADES subtype WPAS stands for Web Publisher Articles Signal, which corresponds to Google News Publisher Center registration. This means content from Publisher Center-registered sources receives a distinct classification in the personalization pipeline. Similarly, RECALL_BOOST literally increases retrieval priority from the candidate pool. It boosts content during the retrieval phase, before the pCTR model even runs.
Content Filtering: The Two-Level Architecture
Discover’s content filtering operates at two levels, each with its own telemetry counter:
- Collection level (`filter_collection_status`) - blocks ALL content from a publisher/domain
- Entity level (`filter_entity_status`) - blocks a single URL
The collection-level filter is asymmetric in an important way. The boolean isCollectionHiddenFromEmberFeed can suppress an entire publisher, but there is no observable equivalent “boost collection” flag anywhere in the system. The penalty surface is broader than the reward surface.
When a user selects “Don’t show content from [Publisher]” in the card menu, this triggers the collection-level filter. A single article that generates enough negative feedback can suppress an entire publication. That reaction applies to all content from that domain, not just the triggering article.
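The escalation described above - a single card-menu action suppressing a whole domain - can be sketched like this. The two counter names are observed; the sets, functions, and escalation wiring are illustrative assumptions.

```python
# Sketch of the two-level filter: entity-level blocks one URL,
# collection-level blocks everything from a publisher/domain.
filter_collection_status = set()  # blocked publishers/domains
filter_entity_status = set()      # blocked single URLs

def dont_show_content_from(publisher):
    # "Don't show content from [Publisher]" escalates straight to the
    # collection-level filter, not the entity-level one.
    filter_collection_status.add(publisher)

def is_blocked(publisher, url):
    return publisher in filter_collection_status or url in filter_entity_status

dont_show_content_from("example.com")
# Every URL from that domain is now blocked, not just the triggering article.
print(is_blocked("example.com", "https://example.com/unrelated-article"))  # True
print(is_blocked("other.com", "https://other.com/article"))                # False
```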
Suppression and Counterfactual Experiments
Discover runs counterfactual experiments, a standard practice in recommendation systems where content is intentionally withheld from some users to measure the causal impact of showing vs. not showing it. The SDK exposes 5 suppression mechanisms:
- `SHOW_SKIPPED_DUE_TO_COUNTERFACTUAL` - content withheld for an A/B experiment
- `DELIVERED_COUNTERFACTUAL` - content delivered in a counterfactual group
- `FETCHED_COUNTERFACTUAL` - content fetched but suppressed
- `VISIBILITY_REPRESSED_COUNTERFACTUAL` - visibility explicitly repressed
- `background_refresh_rug_pull_count` - content cards withdrawn after initially being pushed to the feed
The last one is particularly interesting. The “rug pull” counter tracks cases where content was pushed to the feed and then removed during a background refresh. This means Discover can retroactively remove content that was already in the feed, not just filter it before display.

The Beacon Push System
Most Discover content arrives through pull-based feed requests, but there is also a push channel. The Beacon Push system allows Google’s servers to proactively push content to a user’s device:
- `DISCOVER_BEACON_PUSH_RECEIVED` - push arrives from server
- `DISCOVER_BEACON_PUSH_ACCEPTED` - push passes local quality/budget checks
- `DISCOVER_BEACON_PUSH_REJECTED` - push rejected locally
The acceptance/rejection mechanism suggests that beacon pushes go through local filtering even after being server-selected. The SDK also reveals donated_sports_documents_count, indicating that sports content is a primary use case for the beacon push channel.
Web Stories Bypass Standard Ranking
Web Stories (Google’s AMP-based story format, internally codenamed STAMP) have their own rendering pipeline that operates independently from the standard article ranking. Key observations:
- They render inline via `INLINE_STAMP_VIEWER_FRAGMENT` rather than as standard cards
- They appear in carousels via `INLINE_STAMP_VIEWER_SLIDE_FRAGMENT`
- They have their own recommendation engine: `STAMP_VIEWER_RECOMMENDATIONS`
- Recommendations are preloaded before the user finishes the current story
This means Web Stories are not competing directly with articles in the standard pCTR ranking. They have their own dedicated pipeline and placement mechanism.
150 Concurrent A/B Experiments
Perhaps the most striking observation is the experiment load. During an observed session, approximately 150 server-side A/B experiment IDs were active simultaneously, stored in the session state. These follow the format gws:NNNNNNN (GWS standing for Google Web Server).
This means that at any given time, a Discover user is participating in roughly 150 concurrent experiments that may affect which content they see, how it is ranked, and how it is rendered. Two otherwise identical users could see meaningfully different feeds purely based on experiment bucket allocation.
51 Runtime Feature Flags
Beyond server-side experiments, 51 client-side feature flags control Discover’s behavior at render time. These flags are organized across 15 categories including ContentViewer, HomestackFeed, PrefabsRendering, and SportsWidget.
A few flags worth noting:
- `PrefabsRendering.disable_ai_summary_disclaimer` - a flag to remove the "Generated with AI" disclaimer from AI-generated summaries
- `PrefabsRendering.title_expands_ai_summary` - clicking a title auto-expands an AI summary
- `DiscoverGaramondRendering.apply_fake_garamond_header` - can create synthetic related-article groupings
- `HomestackChrome.enable_homestack_on_clank` - "Clank" is Chrome's internal codename, confirming that Homestack widgets integrate with Chrome
Per-Card Diagnostic Metadata
Every card in a Discover feed carries 8 internal diagnostic metadata fields beyond what users see:
- Panoptic Source Channel - which internal pipeline surfaced this content
- DocFingerprint - a unique hash for deduplication across sessions and devices
- Web and App Activity Enabled - whether personalization is possible
- Discover Personalization Enabled - whether it is active for this card
- NeoformId - the rendering template identifier, mapping to a specific card layout variant
- Is Feature Personalized - whether this specific card was personalized vs. generic/trending
- Is User Signed-In - sign-in state affects which signals are available
- Sherlog URL - Google’s internal debugging system URL for this card render
The DocFingerprint is particularly interesting from a deduplication perspective. It confirms that Discover tracks document identity across sessions and devices, preventing the same content from resurfacing after dismissal or extended exposure.
Real-Time Feed Delivery
Discover does not simply fetch a static list of cards. The SDK reveals a persistent gRPC connection architecture with six distinct service endpoints — from standard feed rendering to a dedicated streaming variant that keeps the feed alive in real-time. The connection lifecycle includes initialization handshakes, action payloads (your taps and scrolls sent back to the server), token refreshes, and automatic reconnection logic when the stream drops.
What makes this interesting for publishers: your content does not wait for the user to pull-to-refresh. The server can inject new cards, reorder existing ones, or remove stale content mid-session through data operation payloads. The feed is a living stream, not a snapshot.
Eligibility Gating
Before any content flows, Discover runs a 3-stage eligibility check:
- Local device checks - device-level requirements
- Server validation - network-based validation
- Google Mobile Services - GMS Core availability
Two ineligibility reasons are worth highlighting: INELIGIBLE_DISCOVER_DISABLED_BY_DSE means setting a non-Google default search engine disables Discover, and INELIGIBLE_DISCOVER_DISABLED_BY_ENTERPRISE_POLICY means enterprise device management can block it entirely.
The Three-Layer Dismissal Chain
When a user dismisses content in Discover, three separate persistence records are created:
- Dismiss overlay ID - records the dismissal interaction
- Filter status - updates the entity or collection filter
- Tombstone - a permanent per-content record at `/persistent/tombstone_{id}/data` that prevents resurfacing
These share the same content identifiers, creating a chain. The tombstone layer is permanent - dismissed content does not come back.
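The three-record chain can be sketched as follows. The tombstone path format is the observed string; the record shapes, the function, and the overlay-ID format are illustrative assumptions.

```python
# Sketch of the three persistence records one dismissal creates,
# all linked by the same content identifier.
def dismiss(content_id):
    return {
        # 1. Records the dismissal interaction itself
        "dismiss_overlay_id": f"overlay_{content_id}",
        # 2. Updates the entity-level filter
        "filter_status": {"entity": content_id, "status": "FILTERED"},
        # 3. Permanent tombstone record preventing resurfacing
        "tombstone_path": f"/persistent/tombstone_{content_id}/data",
    }

records = dismiss("abc123")
print(records["tombstone_path"])  # /persistent/tombstone_abc123/data
```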
What This Means for Publishers
Let me be clear about what this analysis is and is not. It is a set of observations about how Google Discover’s client-side systems are instrumented. It is not a reverse-engineering of server-side ranking algorithms, which remain on Google’s servers and are not directly observable.
That said, some practical observations emerge:
- `og:image` is important. No image at all means no Discover card. Use images at least 1200px wide for hero-card eligibility - in schema, in the og tag, or in the twitter tag.
- `og:title` is packaged. Whether it is a direct pCTR input is plausible but unconfirmed from client-side observation alone. Either way, it is part of the data payload that informs ranking decisions.
- Freshness matters structurally. The first 7 days carry the highest freshness weight. After 30 days, continuous staleness decay begins.
- Collection-level blocking is binary and asymmetric. One bad article can block an entire domain, but there is no equivalent blanket boost. (This needs verification: only client-side events were captured, we cannot access the server-side configuration, and Google can change anything, anytime.)
- Publisher Center registration creates a distinct signal. The `WPAS` subtype means registered publishers get different classification treatment.
- Web Stories bypass standard ranking. They have their own pipeline, carousel placement, and recommendation engine.
- User feedback is permanent. Tombstones do not expire. Dismissed content stays dismissed.
Methodology Note
All findings in this analysis are derived from observable SDK telemetry — event constants, configuration values, and client-side state visible during normal Discover operation. No server-side systems were accessed. Where findings are confirmed via exact string matches, they are labeled as such. Where findings are inferences based on naming conventions or event ordering, that is noted.
As a reminder: these observations reflect the current state of the SDK. Google continuously evolves its infrastructure. Server-side ranking models, experiment allocations, and pipeline stages can all change independently of the client. What we can observe is the instrumentation — the questions the system asks and the answers it records — which reveals the architecture even as the parameters shift underneath.
The full technical dashboard with all 276 event constants, 56 telemetry counters, 18 NAIADES subtypes, and individually fact-checked findings is available separately.