Hardening AI Web Agents: How We're Securing Tabstack Against Indirect Prompt Injection | Tabstack Blog | Tabstack<br>Skip to content
At Mozilla, we believe that building a useful AI ecosystem requires radical transparency, especially when it comes to security.
Recently, security researchers at Brave reached out to us regarding an Indirect Prompt Injection (IPI) vulnerability they identified in Tabstack's /v1/automate endpoint, which they have since detailed in their public blog post on the flaw. Because Tabstack is built to act as an autonomous web agent that can browse, click, and interact with the live web on behalf of a user, the implications of IPI are a critical design challenge.
The vulnerability has been patched, and the fix was independently verified by the Brave team before their public write-up. We want to share a transparent look at the exploit, how our model handled it, and the architecture we've implemented to harden our automation engine against this entire class of attacks.
The Vulnerability: Bypassing the Scope of the Task
The attack discovered by Brave highlights the unique risks associated with "agentic" AI tools. During a controlled test, researchers passed a standard, routine prompt to the /v1/automate endpoint: "Summarize this page."
However, the target page contained hidden, malicious instructions (rendered in white-on-white text, invisible to a human but fully readable in the page's text layer ingested by the AI). The injected text instructed the model to ignore its previous task, grab the user's full conversation history, paste it into an external web form, and hit submit.
Because Large Language Models mix user intent and third-party data into a single, flat text window, Tabstack didn't view this as a security conflict. Instead, it confidently followed the instructions step-by-step:
Navigated away from the target page to an external domain.
Copied the user's conversation context.
Submitted the form, actively exfiltrating the data to the researcher's server.
When we analyzed the agent's internal reasoning traces, the challenge became even clearer. The model wasn't "tricked" or confused; it was executing what it genuinely believed to be a legitimate workflow continuation.
How We Fixed It: Moving Beyond a Flat Context
Prompt injection is a dynamic, shifting threat vector across the entire AI industry. While it is functionally impossible to guarantee any LLM is 100% immune to prompt injection, we can severely limit an agent's ability to act on malicious inputs. That was our north star: assume the model will occasionally be fooled by injected text, and make sure that being fooled cannot translate into exfiltrated data.
Following the disclosure, our engineering team confirmed the gap and shipped a series of changes to our underlying browser automation engine, Mozilla Pilo. Rather than trying to detect malicious prompts (a losing game of pattern-matching), we focused on shrinking the agent's "blast radius" with structural guardrails that hold regardless of what the page says.
A Structural Action Firewall for Forms
The core of the fix is a new action firewall that sits between the agent's decisions and the browser's execution of them. Critically, it does not work by scanning text for "suspicious" instructions. Instead, it classifies every form interaction using DOM field metadata and reference provenance: where a value came from, what kind of field it is targeting, and whether a human ever approved it.
The agent is free to fill operational controls it legitimately needs to do its job (search boxes, date pickers, range sliders, comboboxes).
It is structurally blocked from auto-filling freeform or sensitive fields, and from submitting any form that contains agent-filled data that was never approved, before the action reaches the browser.
Even operational submissions are now restricted to the same host as the current page. An attacker page can label its collector field as a "search box," but it cannot make the agent submit that data to an attacker-controlled domain. Unknown page hosts fail closed.
When a block fires in non-interactive mode, the CLI prints a remediation footer explaining how to proceed. That footer is user-facing only: the model never sees it, so injected page content cannot instruct the agent to talk the user into disabling the firewall.
External Content Isolation
We also closed the more fundamental gap that made the original exploit possible: the agent treated web-page text and user instructions as one undifferentiated stream.
Web-sourced content returned by the agent's tools (page extracts, fetched markdown, search results, even the completion validator's own feedback) is now wrapped in explicit tags at the point the data enters the conversation, with an inline warning on every block and a matching directive in the system prompt. This mirrors the trust-framing we already applied to raw page snapshots, so untrusted page text is consistently delineated from...