Teaching AI agents to ask better questions by playing "Battleship"

MIT News | Massachusetts Institute of Technology

MIT researchers use the classic game as a test bed for AI agents, finding a small AI model can outperform the biggest ones at 1 percent of the cost.

Alex Shipps | MIT CSAIL

Publication Date: June 3, 2026

Caption: AI models improved at MIT researchers’ “Collaborative Battleship” game by carefully weighing options about where game pieces might be hidden at each turn. The approach helped much-smaller models finish in fewer turns than leading ones.

Credits: Image: Alex Shipps/MIT CSAIL, using assets from AdobeStock

In 2026, the hype for artificial intelligence agents is louder than ever before. These semi-autonomous programs can “think” and execute well-defined tasks in areas like customer service and software development, typically using language models (LMs). But fields like medical diagnosis and scientific discovery require them to inquire about a vast range of solutions in uncertain environments, which LMs struggle with.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences (SEAS) peered deeper into LMs to understand their main issues in high-stakes settings. Their test: “Battleship,” a classic guessing game that’s helped cognitive scientists study how humans seek information.

CSAIL and SEAS scholars added a twist by reframing the game around asking and answering natural language questions. In their “Collaborative Battleship” game, one participant is a “captain” who inquires about where hidden ships are, while their teammate plays the “spotter” by responding to those questions in real time.

The researchers first had over 40 humans play the game together, collecting their questions and yes-no answers to build the “BattleshipQA” dataset. These results were a helpful point of comparison when the team tested state-of-the-art LMs (like GPT-5) and smaller models (like Llama 4 Scout) on their game. Without training the models beforehand, they found that top LMs can “beat” humans at “Battleship” (that is, complete the game in fewer turns), but smaller systems are far less rational.

The chief issue was that many models are simply not adept at coming up with useful questions. To get LMs to inquire in ways that reveal more information about hidden ships, the researchers gave each model a Monte Carlo inference strategy, which carefully measures the likelihood of different options being correct with each response. The result: AI models that can beat regular players at “Battleship,” regardless of scale.

Perhaps the most striking results were Llama 4 Scout’s gains. As a relatively small LM, it beat humans only 8 percent of the time. But with refinements to its inference strategy, the model reached a “Battleship” win rate of 82 percent versus humans.
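To make the Monte Carlo idea concrete, here is a minimal sketch of one way a captain could score candidate yes-no questions. It is not the researchers’ implementation: the board size, ship lengths, rejection sampler, candidate questions, and the use of binary entropy as the “information” score are all assumptions chosen for illustration. The sketch samples random boards that agree with what has been observed so far, then prefers the question whose answer is closest to a 50-50 split across those samples.

```python
import math
import random

SIZE = 4        # hypothetical small board, for illustration only
SHIPS = (3, 2)  # hypothetical ship lengths

def sample_board(rng):
    """Place each ship at random without overlap; return the occupied cells."""
    occupied = set()
    for length in SHIPS:
        while True:
            horizontal = rng.random() < 0.5
            r = rng.randrange(SIZE if horizontal else SIZE - length + 1)
            c = rng.randrange(SIZE - length + 1 if horizontal else SIZE)
            cells = {(r, c + i) if horizontal else (r + i, c) for i in range(length)}
            if not cells & occupied:
                occupied |= cells
                break
    return frozenset(occupied)

def sample_hypotheses(hits, misses, n, rng, max_tries=100_000):
    """Rejection sampling: keep random boards that agree with every observation."""
    kept = []
    for _ in range(max_tries):
        board = sample_board(rng)
        if hits <= board and not (misses & board):
            kept.append(board)
            if len(kept) == n:
                break
    return kept

def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(question, hypotheses):
    """For a noiseless yes/no answer, the gain equals the answer's entropy."""
    if not hypotheses:
        raise ValueError("no hypotheses to score against")
    p_yes = sum(question(b) for b in hypotheses) / len(hypotheses)
    return binary_entropy(p_yes)

# Hypothetical candidate questions, each written as a predicate over a board.
QUESTIONS = {
    "Is there a ship in row 0?": lambda b: any(r == 0 for r, _ in b),
    "Is cell (0, 1) occupied?": lambda b: (0, 1) in b,
    "Is cell (1, 1) occupied?": lambda b: (1, 1) in b,
}

def best_question(questions, hypotheses):
    """Return (question text, estimated gain) for the most informative question."""
    scored = {q: expected_information_gain(f, hypotheses) for q, f in questions.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

if __name__ == "__main__":
    rng = random.Random(0)
    hyps = sample_hypotheses(hits={(0, 0)}, misses={(3, 3)}, n=200, rng=rng)
    print(best_question(QUESTIONS, hyps))
```

In this toy setup, a hit at (0, 0) already guarantees a ship in row 0, so that question scores zero bits and is never preferred. A question whose answer is already certain reveals nothing, which is the kind of wasted turn a likelihood-weighing strategy is meant to avoid.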
This careful and efficient style of asking questions also enabled the model to outpace a frontier model (GPT-5), while operating at around 1 percent of its cost.

On top of this improvement, the researchers shrank the gap between humans and LMs in answering questions. While GPT-5 was a reliable spotter that helped models finish games faster, smaller systems had a bad habit of giving the wrong answers about where ships were hidden. The models saw an accuracy boost of 15 percent on average when they began converting questions into code that explicitly tells them how to verify their answers (for example, having the model run a quick search of an area when asked if a ship was there).

“Today’s language models are primarily optimized to answer complex queries, but it’s...
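The question-to-code step for the spotter, described above, might look roughly like the sketch below. Everything here is hypothetical: the board layout, the `GENERATED_*` strings standing in for code a language model might emit, and the `run_generated` helper are illustrative assumptions, not the researchers’ system. The point is only that the answer comes from running a small search over the real board rather than from the model guessing.

```python
# Hypothetical true board seen by the spotter: letters mark ship cells, "." is water.
BOARD = [
    "..AA",
    "....",
    "B...",
    "B...",
]

# Stand-in for code a model might emit for the question
# "Is any part of a ship in the top-right 2x2 area?" (assumed output, not real LM output).
GENERATED_TOP_RIGHT = '''
def answer(board):
    return any(board[r][c] != "." for r in range(0, 2) for c in range(2, 4))
'''

# Stand-in for the question "Is any part of a ship in the bottom-right 2x2 area?"
GENERATED_BOTTOM_RIGHT = '''
def answer(board):
    return any(board[r][c] != "." for r in range(2, 4) for c in range(2, 4))
'''

def run_generated(source, board):
    """Execute generated code and return its yes/no answer for this board.

    Note: exec() runs arbitrary code with no sandboxing; a real system
    would need to isolate model-written programs before running them.
    """
    namespace = {}
    exec(source, namespace)
    return bool(namespace["answer"](board))

if __name__ == "__main__":
    print("top-right:", run_generated(GENERATED_TOP_RIGHT, BOARD))
    print("bottom-right:", run_generated(GENERATED_BOTTOM_RIGHT, BOARD))
```

Because the check is explicit, a wrong answer can only come from a wrong translation of the question, not from misreading the board.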
