Why Google Stores Billions of Lines of Code in a Single Repository (2016)

downbad_3 pts0 comments

Why Google Stores Billions of Lines of Code in a Single Repository – Communications of the ACM

Skip to content

Latest Issue

Search

Sign In

Join ACM

Introduction

Key Insights

Google-Scale

Background

Analysis

Alternatives

Conclusion

Acknowledgments

References

Authors

Footnotes

Figures

Tables

Early Google employees decided to work with a shared codebase managed through a centralized source control system. This approach has served Google well for more than 16 years, and today the vast majority of Google’s software assets continues to be stored in a single, shared repository. Meanwhile, the number of Google software developers has steadily increased, and the size of the Google codebase has grown exponentially (see Figure 1). As a result, the technology used to host the codebase has also evolved significantly.

Back to Top

Key Insights<br>Google has shown the monolithic model of source code management can scale to a repository of one billion files, 35 million commits, and tens of thousands of developers.<br>Benefits include unified versioning, extensive code sharing, simplified dependency management, atomic changes, large-scale refactoring, collaboration across teams, flexible code ownership, and code visibility.<br>Drawbacks include having to create and scale tools for development and execution and maintain code health, as well as potential for codebase complexity (such as unnecessary dependencies).

This article outlines the scale of that codebase and details Google’s custom-built monolithic source repository and the reasons the model was chosen. Google uses a homegrown version-control system to host one large codebase visible to, and used by, most of the software developers in the company. This centralized system is the foundation of many of Google’s developer workflows. Here, we provide background on the systems and workflows that make feasible managing and working productively with such a large repository. We explain Google’s "trunk-based development" strategy and the support systems that structure workflow and keep Google’s codebase healthy, including software for static analysis, code cleanup, and streamlined code review.

Back to Top

Google-Scale

Google’s monolithic software repository, which is used by 95% of its software developers worldwide, meets the definition of an ultra-large-scale4 system, providing evidence the single-source repository model can be scaled successfully.

The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google’s entire 18-year existence. The repository contains 86TBa of data, including approximately two billion lines of code in nine million unique source files. The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, configuration files, documentation, and supporting data files; see the table here for a summary of Google’s repository statistics from January 2015.

In 2014, approximately 15 million lines of code were changedb in approximately 250,000 files in the Google repository on a weekly basis. The Linux kernel is a prominent example of a large open source software repository containing approximately 15 million lines of code in 40,000 files.14

Google’s codebase is shared by more than 25,000 Google software developers from dozens of offices in countries around the world. On a typical workday, they commit 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems. Each day the repository serves billions of file read requests, with approximately 800,000 queries per second during peak traffic and an average of approximately 500,000 queries per second each workday. Most of this traffic originates from Google’s distributed build-and-test systems.c

Figure 2 reports the number of unique human committers per week to the main repository, January 2010-July 2015. Figure 3 reports commits per week to Google’s main repository over the same time period. The line for total commits includes data for both the interactive use case, or human users, and automated use cases. Larger dips in both graphs occur during holidays affecting a significant number of employees (such as Christmas Day and New Year’s Day, American Thanksgiving Day, and American Independence Day).

In October 2012, Google’s central repository added support for Windows and Mac users (until then it was Linux-only), and the existing Windows and Mac repository was merged with the main repository. Google’s tooling for repository merges attributes all historical changes being merged to their original authors, hence the corresponding bump in the graph in Figure 2. The effect of this merge is also apparent in Figure 1.

The commits-per-week graph shows the commit rate was dominated by human users until 2012, at which point Google switched to a custom-source-control implementation for hosting the central repository, as discussed later....

google repository code codebase files source

Related Articles