Pulpie: Pareto-Optimal Models for Cleaning the Web


Research · field note

Authors: Bhavnick Minhas, Shreyash Nigam

Affiliation: Feyn Labs

Published: June 25, 2026

Reading time: 13 min


We’re introducing Pulpie, a family of Pareto-optimal models for extracting main content from HTML pages. Pulpie approaches state-of-the-art (SOTA) extraction quality at one-twentieth the cost.

Our smallest model, pulpie-orange-small, scores 0.862 ROUGE-5 F1 on WebMainBench. That nearly matches Dripper, the leading extractor, which scores 0.864. Pulpie gets there at roughly a third of the size: 210M parameters versus Dripper’s 600M.

The gains come from architecture. Pulpie is an encoder that labels every HTML block as content or boilerplate in a single forward pass. This also makes it fast.

On an NVIDIA L4 GPU, pulpie-orange-small processes 13.7 pages/sec against Dripper’s 0.68 pages/sec. At $0.39/hr for an L4 instance, cleaning 1 billion pages costs $7,900 with Pulpie and $159,000 with Dripper.
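The cost figures follow directly from the throughput and the hourly price. A minimal sketch of that arithmetic, assuming a single L4 running at the quoted steady throughput (the helper name `cost_usd` is ours, for illustration only):

```python
# Figures quoted in the post; the helper name is hypothetical.
L4_PRICE_PER_HOUR = 0.39      # USD per hour for an L4 instance
PAGES = 1_000_000_000         # 1 billion pages

def cost_usd(pages_per_sec, pages=PAGES, price_per_hour=L4_PRICE_PER_HOUR):
    """Cost of processing `pages` on one GPU at a steady throughput."""
    gpu_hours = pages / pages_per_sec / 3600
    return gpu_hours * price_per_hour

pulpie_cost = cost_usd(13.7)    # pulpie-orange-small
dripper_cost = cost_usd(0.68)   # Dripper

print(f"Pulpie:  ${pulpie_cost:,.0f}")
print(f"Dripper: ${dripper_cost:,.0f}")
print(f"Ratio:   {dripper_cost / pulpie_cost:.1f}x")
```

The results land at roughly $7,900 and $159,000, a ratio of about 20x. Because the price per hour cancels out, the ratio depends only on the two throughputs.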

Pulpie unlocks high-quality web extraction at a scale that was previously out of reach. We expect this to benefit pre-training and context management.

Our models are open source and available on Hugging Face. See Get started for instructions.

Extraction is the bottleneck

Language models consume the web twice. First in pre-training, where they learn about the world. Then at inference, when they pull in relevant context. Both times the input is mostly noise. During discovery, we found 70% of the blocks on a typical HTML page hold boilerplate like navigation, ads, sidebars, and footers. Main content is only a small fraction of the page.

However, that fraction determines model quality on both ends.

AICC (Ma et al., 2025) measured the effect of cleaner extraction on pre-training. The team built two corpora from the same Common Crawl snapshot. One extracted content with heuristics. The other extracted it with a model-based parser. Everything else in the data pipeline remained equal. They then trained an identical model on each corpus.

The model trained on the model-extracted corpus scored 1.08 percentage points higher in average accuracy across 13 benchmarks. Since only extraction logic changed, we can attribute the gain entirely to having cleaner data.

Impressively, the same model also beat models trained on FineWeb and RefinedWeb, two of the most heavily filtered pre-training corpora. These datasets have earned their reputations through elaborate filtering and deduplication. Beating them by improving the extractor illustrates the high value of clean data.

Beyond setting a low baseline, poor extraction materially harms models. Heuristics break structured content. The table below shows how Trafilatura and model-based extractors compare on preserving code blocks and formulas. Low similarity scores indicate corruption. Models trained on the corrupted text will inherit this damage.

Content        Trafilatura (heuristic)    Model-based
Code blocks    0.13                       0.91
Formulas       0.61                       0.94

Data quality matters at inference too. Shi et al. (ICML 2023) showed that a single irrelevant passage is enough to derail a model’s answer. A model is more accurate and more efficient when its context is free of noise.

Cleaning on a budget

Cleaning the web pays off in both training and inference. The open question is how to clean well at scale.

To map the landscape, we can divide current extractors into two families based on one question: does the method read the page, or only inspect its structure?

Structure-based extractors judge an HTML block by surface signals. They apply rules over tags, DOM structure, and text density to separate content from boilerplate. Trafilatura, Readability, and magic-html work this way. Boilerpipe goes one step further and trains a classifier on those same signals. These extractors are easy to run, but they confuse elements that are built alike. A navigation table and a data table look identical to an algorithm counting cells.
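The navigation-versus-data-table confusion is easy to reproduce. The sketch below is a deliberately naive rule, not how Trafilatura, Readability, magic-html, or Boilerpipe work internally. The names `CellCounter`, `count_cells`, and `looks_like_content`, the two sample tables, and the threshold are all made up for illustration.

```python
from html.parser import HTMLParser

class CellCounter(HTMLParser):
    """Counts <td> cells and ignores their text: a purely structural signal."""
    def __init__(self):
        super().__init__()
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.cells += 1

def count_cells(html):
    parser = CellCounter()
    parser.feed(html)
    parser.close()
    return parser.cells

def looks_like_content(html, min_cells=4):
    # Hypothetical rule: a table with enough cells is treated as content.
    return count_cells(html) >= min_cells

nav_table = (
    "<table><tr><td>Home</td><td>About</td></tr>"
    "<tr><td>Blog</td><td>Login</td></tr></table>"
)
data_table = (
    "<table><tr><td>Code blocks</td><td>0.91</td></tr>"
    "<tr><td>Formulas</td><td>0.94</td></tr></table>"
)

# Both tables have four cells, so the rule cannot tell them apart.
print(count_cells(nav_table), looks_like_content(nav_table))
print(count_cells(data_table), looks_like_content(data_table))
```

Only the text inside the cells distinguishes the two tables, and that is exactly the signal a cell-counting rule never looks at.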

Reading extractors feed the page to a transformer and label each block based on its content. Dripper is a decoder built on this idea. The decoder emits labels one token at a time. Each label forces the full set of model weights to be read from memory for a single step of work. This ties speed to memory bandwidth and makes runs expensive.

Pulpie keeps the reading approach but moves the bottleneck to compute. We do this by using an encoder architecture that labels every block in a single forward pass. This enables Pulpie to match Dripper’s quality while being smaller, faster, and cheaper.
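A back-of-the-envelope model makes the difference in memory traffic concrete. It assumes one page processed alone, at least one decode step per label for the decoder, and exactly one forward pass for the encoder. The block count and the 2 bytes per parameter are assumptions rather than figures from our benchmarks, and the function name is hypothetical.

```python
# Toy model of weight traffic for labelling one page, processed alone.
# BLOCKS_PER_PAGE and BYTES_PER_PARAM are assumptions, not measured values.
BLOCKS_PER_PAGE = 200
BYTES_PER_PARAM = 2   # e.g. 16-bit weights

def weight_bytes_read(params, forward_passes, bytes_per_param=BYTES_PER_PARAM):
    """Bytes of weights streamed from memory across all forward passes."""
    return params * bytes_per_param * forward_passes

# Decoder: the full 600M-parameter model is read once per emitted label.
decoder_bytes = weight_bytes_read(600_000_000, forward_passes=BLOCKS_PER_PAGE)

# Encoder: the 210M-parameter model is read once for the whole page.
encoder_bytes = weight_bytes_read(210_000_000, forward_passes=1)

print(f"decoder: {decoder_bytes / 1e9:.1f} GB of weight reads")
print(f"encoder: {encoder_bytes / 1e9:.2f} GB of weight reads")
```

This ignores batching, prompt processing, caching, and the compute itself, so the gap it shows is not a speed prediction. It only illustrates why per-label decoding is bound by memory bandwidth while a single pass is bound by compute.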

Figure: Quality vs Cost of Web Content Extraction. Axes: ROUGE-5 F1 (0.50 to 0.90) against cost per 1B pages ($0 to $175K). Extractors shown: magic-html, Trafilatura, Pulpie Small, Pulpie Base, Pulpie Large, and Dripper. Highlighted: Pulpie Small at 0.862 and $7.9K per 1B pages, 20x cheaper than Dripper at the same quality.

Depulping raw HTML

The full pipeline runs in four...
