CAISI Evaluation of DeepSeek V4 Pro | NIST
https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro
CAISI Evaluation of DeepSeek V4 Pro
May 1, 2026
In April 2026, the Center for AI Standards and Innovation (CAISI) evaluated the open-weight AI model DeepSeek V4 Pro (“DeepSeek V4”). CAISI evaluations indicate that DeepSeek V4’s capabilities lag behind the frontier by about 8 months (Figure 1).
Figure 1: Comparison of aggregate capabilities over time of the most capable publicly released U.S. and PRC models according to a suite of benchmarks covering five domains.
Every 200-point increase on the y-axis corresponds to a 3x increase in the odds of solving a given task. Model capability was fitted using an approach inspired by Item Response Theory (IRT), as detailed in the Appendix. The figure draws on 16 benchmarks and 35 models. Trend lines were fit with least squares regression on frontier models. Error bars denote 95% CIs.
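As a rough illustration of the scale described in the Figure 1 caption, the sketch below converts a point gap into an odds multiplier. The only fact taken from the caption is that 200 points correspond to a 3x change in odds. The logistic form of `solve_probability`, the function names, and the idea of a per-task difficulty on the same scale are assumptions made for illustration; CAISI's actual fitting procedure is in the Appendix, which is not reproduced here.

```python
# Minimal sketch of the "200 points = 3x odds" scale from the Figure 1 caption.
# Function names and the logistic form are illustrative assumptions,
# not CAISI's implementation.

POINTS_PER_TRIPLING = 200.0  # stated in the caption


def odds_multiplier(point_gap: float) -> float:
    """Factor by which the odds of solving a task change for a given point gap."""
    return 3.0 ** (point_gap / POINTS_PER_TRIPLING)


def solve_probability(ability: float, difficulty: float) -> float:
    """Assumed logistic model: odds of solving = 3 ** ((ability - difficulty) / 200)."""
    odds = odds_multiplier(ability - difficulty)
    return odds / (1.0 + odds)


for gap in (0, 200, 400):
    print(f"gap={gap:>3} points -> odds x{odds_multiplier(gap):.1f}")
```

Under this assumed model, a model rated 200 points above a task's difficulty would solve it with probability 0.75, since odds of 3 to 1 correspond to 3 / (1 + 3).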
Key Findings
DeepSeek V4 is the most capable PRC AI model evaluated by CAISI to date. CAISI evaluations span the domains of cyber, software engineering, natural sciences, abstract reasoning, and mathematics (Figure 2).
DeepSeek V4 scores better on DeepSeek’s self-reported evaluations than on CAISI evaluations. According to DeepSeek’s data, DeepSeek V4 is about as capable as Opus 4.6 and GPT-5.4, which were released about 2 months ago. However, CAISI’s evaluations, which include non-public benchmarks, indicate that DeepSeek V4 performs similarly to GPT-5, which was released about 8 months ago (Figure 3).
DeepSeek V4 is more cost efficient than other models of similar capability. Compared to the most cost-competitive U.S. reference model (GPT-5.4 mini), DeepSeek V4 was more cost efficient on 5 out of 7 benchmarks. Across the 7 benchmarks, DeepSeek V4 ranged from 53% less expensive to 41% more expensive.
Capability Results
DeepSeek V4’s capability lags behind leading U.S. models by about 8 months (Figure 1). The methodology is described in the Appendix.
DeepSeek V4 is the most capable PRC model to date across the domains that CAISI evaluated: cyber, software engineering, natural sciences, abstract reasoning, and mathematics. CAISI evaluated models on nine benchmarks across these five domains, including two held-out and uncontaminated benchmarks: ARC-AGI-2’s semi-private dataset and CAISI’s internally built software engineering evaluation, PortBench.

| Domain | Benchmark | OpenAI GPT-5.5 (xhigh) | OpenAI GPT-5.4 mini (xhigh) | Anthropic Opus 4.6 (max) | DeepSeek V4 Pro (max) |
| --- | --- | --- | --- | --- | --- |
| Cyber | CTF-Archive-Diamond | 71% | 32% | 46% | 32%*** |
| Software Engineering | SWE-Bench Verified* | 81% | 73% | 79% | 74% |
| | PortBench | 78% | 41% | 60% | 44% |
| Natural Sciences | FrontierScience | 79% | 74% | 72% | 74% |
| | GPQA-Diamond | 96% | 87% | 91% | 90% |
| Abstract Reasoning | ARC-AGI-2 semi-private** | 79%, 63%, 46% (only three values survive in the source; column assignment unclear) | | | |
| Mathematics | OTIS-AIME-2025 | 100% | 90% | 92% | 97% |
| | PUMaC 2024 | 96% | 93% | 95% | 96% |
| | SMT 2025 | 99% | 92% | 94% | 96% |
| IRT-Estimated Elo | | 1260 ± 28 | 749 ± 46 | 999 ± 27 | 800 ± 28 |

Figure 2: Summary of model performance per capability benchmark (higher is better). Columns give the model and, in parentheses, its reasoning level.
Results show accuracy (percentage of tasks solved) on each benchmark. For each benchmark, the top-performing model is highlighted and bolded. IRT-estimated Elo uncertainties reflect a 95% confidence interval.
*CAISI scores on SWE-Bench Verified tend to be lower than those of other evaluators, likely due to differences in system prompt, scaffolding, and token budget.
**CAISI reports the mean score across tasks, which differs from ARC-AGI-2’s official score aggregation methodology.
***Imputed from a subset of samples via IRT.
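To give a sense of what the IRT-estimated Elo gaps in Figure 2 imply, the sketch below applies the scale from the Figure 1 caption (200 points corresponds to 3x odds) to the reported point estimates. The Elo values are copied from Figure 2; the dictionary, the function name, and the decision to ignore the ± confidence intervals are illustrative assumptions, so the resulting ratios should be read as rough point estimates only.

```python
# Odds ratios implied by the Figure 2 Elo point estimates, relative to DeepSeek V4 Pro.
# Assumes the Figure 1 scale (200 points = 3x odds); confidence intervals are ignored.

ELO = {
    "OpenAI GPT-5.5": 1260,
    "OpenAI GPT-5.4 mini": 749,
    "Anthropic Opus 4.6": 999,
    "DeepSeek V4 Pro": 800,
}


def odds_ratio(elo_a: float, elo_b: float) -> float:
    """Odds of model A solving a task divided by the odds for model B."""
    return 3.0 ** ((elo_a - elo_b) / 200.0)


baseline = ELO["DeepSeek V4 Pro"]
ratios = {name: odds_ratio(elo, baseline) for name, elo in ELO.items()}

for name, ratio in ratios.items():
    print(f"{name}: odds x{ratio:.2f} relative to DeepSeek V4 Pro")
```

On these point estimates, the 460-point gap to GPT-5.5 corresponds to roughly 12 to 13 times the odds of solving a given task, while GPT-5.4 mini sits slightly below DeepSeek V4 Pro.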
Benchmarks
This evaluation used the following benchmarks:
ARC-AGI-2 Semi-Private: a non-public dataset from the ARC Prize Foundation that measures abstract reasoning. "Semi-Private" means these tasks may have been exposed to a limited number of third parties.
CTF-Archive-Diamond: a CAISI-developed benchmark based on 285 difficult CTF challenges drawn from the pwn.college cybersecurity platform developed by Arizona State University.
PortBench: a CAISI-developed non-public evaluation that assesses the ability of AI models to port command line interface (CLI) tools to different programming languages, given a reference implementation in one language. CAISI plans to release an in-depth description of PortBench in the future.
FrontierScience: a benchmark developed by OpenAI that evaluates expert-level scientific reasoning in physics, chemistry, and biology through international science olympiad problems and PhD-level, open-ended problems representative of sub-tasks in scientific research.
For descriptions of...