Building multilingual AI with a new open dataset

gmays1 pts0 comments

Accelerating researchers and developers building multilingual AI with a new open dataset - The GitHub Blog

Try GitHub Copilot CLI

Attend GitHub Universe

Search

Kevin Xu·@khxu

June 15, 2026

5 minutes

Share:

Software may be written in programming languages, but human language is at the heart of developer collaboration. Developers explain how projects work in READMEs. They ask for help in issues. They review, debate, and improve code in pull requests. That collaboration often happens in English—but not always. As AI becomes a bigger part of how developers build software, multilingual developer content matters more than ever.

Today, GitHub is publishing the GitHub Multilingual Repositories Dataset , a repository-level metadata dataset designed to help researchers and developers discover public GitHub repositories with evidence of non-English natural-language content. When building the dataset, we found that language distribution differs across READMEs, issues and pull requests: Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs. Portuguese tops the non-English README list with more than 3 million repositories.

The dataset is now available on GitHub under CC0-1.0. It follows through on a commitment we made in 2025, as part of Microsoft’s European Digital Commitments, to make multilingual data more accessible, including to open source AI developers.

What’s in the dataset

The GitHub Multilingual Repositories Dataset is intentionally not a dump of repository content. Instead, it is a metadata dataset that helps developers and researchers find repositories where multilingual collaboration may be happening. The dataset covers over 80 million classification rows across more than 40 million repositories . For each public repository, we provide:

Language classifications of the README, the most-commented issue, and the most-commented pull request, with the first 150 characters of each used as the input sample. We exclude texts under 20 characters.

Classifications for each text source, from fastText, gcld3, and lingua-py, each with a confidence score. The dataset only includes classifications with >0.5 confidence.

Repository metadata: creation timestamp, disk usage, stars, forks, primary programming language, SPDX license, issue and pull request counts, and the snapshot date.

We deliberately did not collapse the three classifiers into a single label. Different classifiers have different coverage and confidence calibration, especially for lower-resource languages. By exposing all three, we let you decide how strict you want to be. Want a high-precision Greek subset? Require all three classifiers to agree above some confidence threshold. Want broad recall for an exploratory study of Romance languages? One classifier may be enough.

What you can build with it

The dataset is designed for the kind of work that’s hard to do with general web text:

Discover repositories likely to contain developer documentation or collaboration in specific languages.

Study how non-English developer communities use issues, pull requests, and READMEs.

Build evaluation sets for AI coding tools, doc generators, or review assistants that need to behave well across languages.

Encourage decision-makers to expand language coverage for new developer tools and AI features using data-backed arguments on the rich multilingual diversity of developers.

Measure representation of European and other underrepresented languages in open source.

Some caveats

Language identification is hard, especially in software repositories. Repository text is often short. It may include badges, templates, installation commands, code snippets, usernames, or mixed-language content. A 150-character sample may not represent the whole repository. Classifiers also vary in coverage and calibration, especially for lower-resource languages.

That is why the dataset should not be treated as a ground-truth benchmark for language identification. Instead, it is designed as a transparent discovery tool. Users can inspect classifications, confidence scores, and sources, then choose the precision and recall tradeoffs that fit their own research or development workflow.

The dataset also should not be used to infer sensitive attributes about repository owners, contributors, or communities. The signals are repository-level metadata, not person-level attributes.

Why open multilingual data matters

Today, many European languages remain underrepresented in the online text used to build and evaluate AI systems. That creates a risk that AI tools work well for some developers, languages, and communities, while leaving others behind. Open data can help close that gap. We built this dataset because developer content is different from general web text. READMEs, issues, and pull requests contain the language of software collaboration: installation instructions, bug reports, feature requests, review comments, and...

dataset language multilingual languages developers github

Related Articles