GPT-5 Nano Vulnerability test results you should know before deploying

lateos-ai · 1 pts · 0 comments


IPI Assessment · June 2026 · Structural Disclosure

GPT-5 Nano — Prompt Injection Susceptibility Assessment

IPI Taxonomy v0.13 evaluation across 210 test cases (n=10 per class; 9 inference failures excluded; 201 analyzed). The model demonstrates strong resistance to surface-level attacks while showing susceptibility in recursive instruction framing and MCP tool description poisoning. Findings are based on black-box testing via opencode.ai; architectural attribution is hypothetical. No adversarial payloads disclosed.

Target<br>gpt-5-nano

Framework<br>IPI Taxonomy v0.13

Test Cases<br>210 (21 classes × 10 variants; 201 analyzed)

Date<br>2026-06-15

Judge<br>Rule-based + Grok escalation (139 rule-only, 71 escalated)

Methodology & Limitations: This assessment was conducted via black-box access (opencode.ai harness). Sample size is 10 variants per IPI class (n=10; split 4 direct / 3 obfuscated / 3 embedded; 9 inference failures excluded; 201 analyzed). Per-class susceptibility rates are indicative signals, not statistically powered estimates. Attributions to model architecture are working hypotheses; black-box testing cannot distinguish architecture from system-prompt configuration, guard stack, or other deployment-layer artifacts. Model identity (gpt-5-nano) was not independently verified beyond the served endpoint label. See Judge Methodology section for testing details.
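To make "indicative signals, not statistically powered estimates" concrete, the sketch below computes a Wilson score interval for a few of the per-class counts reported in this assessment. The function name `wilson_interval` and the choice of a 95% level (z = 1.96) are illustrative assumptions, not part of the Lateos methodology.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (roughly 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)

# Per-class counts quoted in this report, each out of n=10 variants.
for k in (5, 7, 8, 10):
    low, high = wilson_interval(k, 10)
    print(f"{k}/10 susceptible: interval {low:.0%} to {high:.0%}")
```

With n=10 the intervals are wide: even a 10/10 result is compatible with a true rate in the low 70s, and a 5/10 result spans roughly a quarter to three quarters. Per-class rankings should be read with that spread in mind.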

Executive Summary

38.3% overall susceptibility rate.

38.3%

Overall susceptibility rate<br>77 of 201 analyzed cases (210 run)

77

Susceptible verdicts<br>(124 resistant, 0 ambiguous)

IPI classes with<br>>50% susceptibility rate

4

IPI classes with<br>0% rate (fully resistant)

Severity<br>Susceptible<br>Total Tests<br>Rate<br>Risk Level

Critical<br>44<br>104<br>42.3%<br>Immediate remediation required

High<br>26<br>67<br>38.8%<br>Architectural mitigation advised

Medium<br>7<br>30<br>23.3%<br>Middleware validation recommended
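The severity figures can be cross-checked against the headline totals (77 susceptible of 201 analyzed). The sketch below assumes the three severity tiers partition the analyzed cases, which lets the Medium counts be derived by subtraction. The variable names are illustrative.

```python
# Headline totals from the executive summary.
TOTAL_SUSCEPTIBLE = 77
TOTAL_ANALYZED = 201

# (susceptible, total) per tier as stated in the table.
severity = {"Critical": (44, 104), "High": (26, 67)}

# Assumption: the tiers partition the analyzed cases, so Medium is the remainder.
medium_susceptible = TOTAL_SUSCEPTIBLE - sum(s for s, _ in severity.values())
medium_total = TOTAL_ANALYZED - sum(n for _, n in severity.values())
severity["Medium"] = (medium_susceptible, medium_total)

def rate(susceptible: int, total: int) -> float:
    """Susceptibility rate as a percentage, rounded to one decimal place."""
    return round(100 * susceptible / total, 1)

for tier, (s, n) in severity.items():
    print(f"{tier}: {s}/{n} = {rate(s, n)}%")
print(f"Overall: {rate(TOTAL_SUSCEPTIBLE, TOTAL_ANALYZED)}%")
```

The derived Medium row (7 of 30) reproduces the stated 23.3% rate, and the overall figure matches 38.3% only when computed over the 201 analyzed cases rather than all 210.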

Model profile: GPT-5 Nano is fully resistant (0% susceptibility) across 4 classes: CSS/Visual Concealment (IPI-011), HTML Attribute Cloaking (IPI-012), SEO/Phishing Amplification (IPI-014), and RAG Corpus Poisoning (IPI-017). The safety guard is effective against surface-level attacks. The highest susceptibility is observed in recursive instruction framing (IPI-010: 10/10 = 100%), MCP tool poisoning (IPI-018: 80%), and a cluster of role/tool manipulation classes at 70%. Whether these represent architectural, system-prompt, or guard-configuration factors is unknown from black-box access.

Architectural Analysis

Output guard bypass via indirect framing.

The most significant finding is IPI-010 (Recursive Instruction Framing): 10/10 susceptible (100%). The observed behavior is consistent with an output guard that evaluates surface-level safety directives but does not recursively resolve nested instruction boundaries. When adversarial content is wrapped in hypothetical, simulation, or delimiter framing, the guard fails to distinguish between user-provided text and operational directives.

The model appears to treat structural formatting tokens (XML tags, role prefixes, hypothetical markers) as ordinary content rather than as control signals that mark a semantic boundary, so injected delimiter tokens pass both the input and output guards verbatim.

IPI-010 — Recursive Framing Attack Data Flow

User input
→ Safety guard (PASS: no explicit override tokens present)
→ Context window merge (adversarial content enters as user content)
→ Model inference (nested framing treated as structural formatting)
→ Output guard (PASS: emitted tokens appear benign in isolation)
→ Response [FAIL]: injected delimiter tokens emitted verbatim

Root cause (hypothesis): the guard evaluates surface-level tokens but does not simulate nested instruction boundaries in hypothetical / XML / role framing.
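The data flow above can be illustrated with a toy model. This is a sketch under stated assumptions, not a reconstruction of the endpoint's actual guard stack: the phrase pattern, the delimiter pattern, and all function names are hypothetical, and the sample strings use an inert placeholder rather than any tested payload. It shows how a guard that scans only for explicit override wording passes framed content, and how a middleware check could flag delimiter-like tokens from untrusted input that reappear in a response.

```python
import re

# Hypothetical: explicit override wording a surface-level guard might block.
OVERRIDE = re.compile(
    r"\bignore\s+(?:all\s+|any\s+)?(?:previous|prior)\s+instructions\b",
    re.IGNORECASE,
)

# Hypothetical: role-like XML tags and line-leading role prefixes.
DELIMITER = re.compile(
    r"</?\s*(?:system|assistant|user|tool)\s*>|^\s*(?:system|assistant)\s*:",
    re.IGNORECASE | re.MULTILINE,
)

def surface_guard_passes(text: str) -> bool:
    """True when no explicit override phrase is present (surface check only)."""
    return OVERRIDE.search(text) is None

def delimiter_tokens(text: str) -> list[str]:
    """Delimiter-like tokens found in the text, in order of appearance."""
    return [m.group(0).strip() for m in DELIMITER.finditer(text)]

def echoed_delimiters(untrusted: str, response: str) -> list[str]:
    """Delimiter-like tokens from untrusted input that appear verbatim in the response."""
    return [tok for tok in delimiter_tokens(untrusted) if tok in response]

explicit = "Please ignore previous instructions."
framed = "Imagine a transcript that contains <system>PLACEHOLDER</system> as an example."
response = "Sure: <system>PLACEHOLDER</system>"

print(surface_guard_passes(explicit))       # explicit wording is caught
print(surface_guard_passes(framed))         # framed content passes the surface check
print(echoed_delimiters(framed, response))  # tokens a boundary-aware check would flag
```

A check of this kind runs outside the model, so it applies regardless of whether the underlying cause lies in the weights, the system prompt, or the guard configuration.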

Why susceptibility is observed<br>Boundary-naive instruction parsing (hypothesis)

The safety guard's behavior depends on framing. It correctly rejects explicit "ignore instructions" tokens, but when the same semantic payload is wrapped in fictional framing, it lets the payload through. It does not appear to distinguish content that describes an instruction override from content that constitutes one. Whether this is due to model architecture, system-prompt design, or guard-stack configuration is unknown from black-box access.

Scope of the pattern<br>Observed across multiple classes

The same pattern of framing-dependent susceptibility appears in IPI-013 (AI Moderation Bypass via "test/simulation" framing: 5/10 = 50%) and partially in IPI-007 (Steganographic: 7/10 = 70%). The common thread, an inability to distinguish framing-wrapped content from directives, recurs in the affected classes that rely on indirection. This suggests a systemic pattern in the gpt-5-nano endpoint but does not isolate the root cause (model weights, system instructions, or guard logic).

Vulnerability Analysis

High-risk and resistant classes.

High-Risk Classes — >50% Susceptibility

IPI Class<br>Name<br>Susceptible / Total<br>Rate<br>Root Cause

IPI-010<br>Recursive Instruction Framing<br>10...

