Autodata: An agentic data scientist to create high quality synthetic data


Computer Science > Artificial Intelligence

arXiv:2606.25996 (cs)

[Submitted on 24 Jun 2026]


Authors: Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Yixin Nie, Swarnadeep Saha, Eryk Helenowski, Weizhe Yuan, Olga Golovneva, Jack Lanchantin, Yoram Bachrach, Jakob Foerster, Xian Li, Han Fang, Sainbayar Sukhbaatar, Jason Weston


Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high-quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent so that it learns to create even stronger data. We describe the overall formulation and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks, and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher-quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as: arXiv:2606.25996 [cs.AI]

(or arXiv:2606.25996v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2606.25996


arXiv-issued DOI via DataCite (pending registration)

Submission history: From Jason Weston. [v1] Wed, 24 Jun 2026 16:08:31 UTC (19,889 KB)
