What Google, Yahoo, Microsoft, and Apple are doing to your email

What Google, Yahoo, Microsoft, and Apple are doing to your email | Jacques Corby-Tuech

Contents

The inbox is a parser, not a mailbox

Opens are dead. Actions decide everything.

Personal email left. Commercial email never did.

Volume coping is structural, not editorial

AI is now between you and the recipient

Privacy is real for content. Behaviour is the surveillance surface.

Search has flipped from time to relevance

Engagement is the new deliverability

What you actually own

Where this is heading

Google, Yahoo, and Microsoft stopped being transport layers years ago. Apple did the same from one layer up, through Mail on iOS and macOS. The four have become active intermediaries between brands and their customers, mediating visibility, extraction, ranking, and interaction. They have spent the last decade publishing papers and patents about how this works. Most of it is open and citable.

The inbox is a parser, not a mailbox

The premise running through every paper from the major providers: consumer email is dominated by machine-generated, templated B2C messages. Yahoo measured north of 60% of inbound mail coming from mass senders in 2013 [1]. Yahoo's own follow-up work in 2014 put the figure at 90% of non-spam web mail [2], and Whittaker et al. cite the same 90% figure for Gmail in 2019 [3]. Bentley's 2017 CHI study found that 67% of users name "receive coupons and deals" as a top-three use of email, and 56% had searched their inbox for a receipt or shipment in the past week [4].

Because the inbox is mostly templated, providers stopped treating it as messaging and started treating it as data extraction.

[Figure: Crusher architecture]

ML in the inbox predates this timeline by 15 years. Bayesian spam filtering at Microsoft Research dates from 1998 (Sahami, Dumais, Heckerman, and Horvitz) [5]. What's expanded since is the scope. The 1998 task was disposition: deliver, or send to junk. Everything below is about what happens after disposition: how the message appears, when, with what label, with what summary, and increasingly whether the recipient needs to open it at all.

The consumer-mail market also consolidated over the same period: top-three concentration went from 55% to 85% across 2006-2012 [6]. Free-tier storage, search quality, mobile and desktop OS bundling, and switching costs did most of that work, with spam-filtering economies of scale as one contributing factor. Three server-side providers dominate the consumer mailbox: Google, Microsoft, and Yahoo. Apple is the fourth player this post discusses, but at a different layer: Apple Mail on iOS and macOS mediates client-side, on top of whichever service the user has connected. Different architecture, same kind of mediation.

Messages get clustered into templates by hashing the DOM/XPath structure of the HTML [3, 7, 8]. A k-anonymity threshold (a minimum number of unique recipients for the template to be processable) gates whether a template gets analysed at all [3, 8]. Below the threshold, you're invisible to the ML pipeline. Once a template clears k-anonymity, field extractors pull structured data (order numbers, prices, tracking numbers, hotel addresses) for use in Search, Assistant, and proactive cards [7, 9]. The scale is real: Google's Crusher system discovers 1.5 million new templates every week [3], so the corpus of recognised B2C senders is constantly expanding rather than being a fixed list.
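The clustering and gating steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Crusher's or any provider's actual implementation: it assumes the structural fingerprint is a hash over every element's root-to-node tag path (text and attributes ignored), and that k-anonymity counts distinct recipients per fingerprint. The names `structure_fingerprint` and `eligible_templates` and the threshold `K_MIN = 3` are hypothetical; the real threshold is not given in the sources cited here.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical threshold, chosen only for illustration.
K_MIN = 3

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _PathCollector(HTMLParser):
    """Records the root-to-node tag path of every element, ignoring text."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def structure_fingerprint(html: str) -> str:
    """Hash the tag structure only, so two fills of one template collide."""
    collector = _PathCollector()
    collector.feed(html)
    collector.close()
    joined = "\n".join(collector.paths)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def eligible_templates(messages, k_min=K_MIN):
    """messages: iterable of (recipient, html) pairs.

    Returns the fingerprints seen by at least k_min distinct recipients.
    """
    recipients = defaultdict(set)
    for recipient, html in messages:
        recipients[structure_fingerprint(html)].add(recipient)
    return {fp for fp, seen in recipients.items() if len(seen) >= k_min}
```

Two order confirmations that differ only in order number and price produce the same fingerprint, while a one-off personal note does not. A template sent to a single recipient many times never clears the gate, because the count is over distinct recipients rather than messages.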

[Figure: Google RiSER architecture diagram]

Gmail's team moved this whole pipeline from hand-written rules to ML models in 2020. They deleted 45,000 lines of rule code in the process [10]. The HTML structure itself, which tags you use, what's bolded, what's centred, is now a feature in the classifier [9]. The granularity has also exploded. Yahoo's 2014 work identified 6 latent categories from email folders via LDA: human, career, shopping, travel, financial, social [2]. Yahoo's most recent classifier (SPICE, 2023) labels 96% of English messages into a 119-class taxonomy of topic + type + objective, all at delivery time, using only your sender name and subject line [11]. The leap from 6 categories to 119 multi-faceted labels in less than a decade reflects how granular the provider-side classifier has become.

Google's layout-aware document encoder, patented in 2025, treats font-size, is_bold, is_italic, is_underline, colour, and position as block-level features and describes representations that could be used for classification, retrieval, summarisation, and personalising advertisements [12]. Bigger, bolder, higher on the page would carry more weight than smaller, plainer, further down. The layout encodes a hierarchy the parser can read.
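To make the intuition concrete, here is a toy sketch of block-level layout features and a hand-written salience score. The attribute names follow the ones listed above, but the `Block` dataclass, the `salience` function, and every weight in it are assumptions made up for illustration. A learned encoder would derive its own weighting from data rather than use a fixed formula like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    font_size: float    # in px
    is_bold: bool
    is_italic: bool
    is_underline: bool
    colour: str         # e.g. "#000000"; carried along but unused in this toy score
    y_position: float   # 0.0 = top of the message, 1.0 = bottom


def salience(block: Block, base_font_size: float = 14.0) -> float:
    """Toy heuristic: larger, bolder, higher blocks score higher.

    All weights are invented for this example.
    """
    size_term = block.font_size / base_font_size
    emphasis_term = (1.0
                     + 0.5 * block.is_bold
                     + 0.1 * block.is_italic
                     + 0.1 * block.is_underline)
    position_term = 1.0 - 0.5 * block.y_position
    return size_term * emphasis_term * position_term


def rank_blocks(blocks):
    """Order blocks from most to least salient under the toy score."""
    return sorted(blocks, key=salience, reverse=True)
```

Under this score, a 28px bold headline near the top of a message outranks a 10px plain footer link near the bottom, which is the ordering the paragraph above describes.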

Four practical consequences follow.

Image-only emails lose context, not text. OCR pulls the words out of pixels. Google documented this exact pipeline in 2018: an OCR pass over images in B2C templates lifted offer-template detection by 9.12% [13]. What OCR cannot recover is the DOM structure or the block-level attributes. The same paper's feature engineering proves the point: the classifier has...
