AI in SRE: Where and how Google is deploying agentic AI to improve operations

May 28, 2026

Stevan Malesevic, Distinguished Software Engineer

Christopher Heiser, Distinguished Site Reliability Engineer

For more than 20 years, since the discipline's inception, Google has used Site Reliability Engineering (SRE) to keep services like Search, Gmail, Maps, YouTube and Google Cloud reliable and highly available, following the principles and practices of a reliability-first mindset.

Recently though, the emergence of AI has driven multiple step-changes in system complexity. Interactions between components are now more complicated due to a variety of factors:

With microservice architectures, systems are distributed across wider geographical locations and data centers that have greater hardware diversity.

Enterprise cloud offerings provide an extensive array of capabilities spread across an incredibly complex set of products.

Google services now cover more unique business and regulatory requirements, making the overall topology and taxonomy much more complex and difficult to understand, a challenge amplified by the constant stream of system changes resulting from continuous deployment pipelines.

AI code generation capabilities have enabled software developers to deliver orders of magnitude more code, resulting in more opportunities to introduce reliability issues.

While AI is in some ways making the SRE team’s work more challenging, it also provides new ways to understand and improve software development lifecycles, including production operations. Google SRE is on the path to fully adopting AI and agentic technologies, leveraging AI as a force multiplier while maintaining control. We call this SRE AI.

Read on for a summary of considerations when thinking about this topic, or you can dive straight into our comprehensive whitepaper, AI in SRE Practice: Moving Beyond Automation at Google, for an in-depth look at how Google SRE is navigating the transition from deterministic automation to agentic AI.

The SRE AI opportunity landscape

To help define our SRE AI strategy, we considered the overall software development lifecycle (SDLC) for areas of opportunity.

The diagram above shows each of the phases in which SRE is involved and which SRE AI could improve.

Perhaps the most obvious SRE area that could benefit from agentic AI is investigation and mitigation, sometimes referred to as root cause analysis (RCA), a cornerstone of the traditional SRE discipline. But RCA is by no means the whole of SRE AI. Our plans go far beyond RCA and troubleshooting and address the entire SDLC. Here are a few areas we are working on:

Reliability design

SRE has been working on the policies, tooling and procedures you need to ensure reliability is an integral part of system design through the design, launch, and deployment phases. An agentic approach does not necessarily imply removing people from the process, especially for higher-risk services and features, but it does significantly reduce the time people need to spend, because many issues can be detected and automatically addressed before a person needs to review them.

Runbooks (playbooks) and other documentation to be used during incidents are important production artifacts. Google SRE has developed AI agents to continuously monitor and improve playbooks and production documentation based on their usage during incidents. AI agents can also generate new playbooks from incidents.

Anomaly detection and alerting

A core SRE practice is to define service level indicators (SLIs) and service level objectives (SLOs), and to configure alerts for them. This approach tends to work well when service use cases are fairly uniform and when it is possible to define objectives that align with customers' expectations.
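As a rough illustration of the static approach, here is a minimal sketch that computes an availability SLI from request counts and alerts when it falls below a fixed SLO target. The names `Slo`, `availability_sli`, `error_budget_consumed` and `should_alert` are hypothetical and chosen for this example; they are not taken from Google's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO: target fraction of good requests."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability_sli(good: int, total: int) -> float:
    """SLI = good events / total events; an idle window counts as fully good."""
    return 1.0 if total == 0 else good / total


def error_budget_consumed(slo: Slo, good: int, total: int) -> float:
    """Fraction of the window's error budget already spent (1.0 = exhausted)."""
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target (or an idle window) leaves no budget to spend.
        return float("inf") if actual_bad else 0.0
    return actual_bad / allowed_bad


def should_alert(slo: Slo, good: int, total: int) -> bool:
    """Static-threshold alert: fire when the SLI drops below the SLO target."""
    return availability_sli(good, total) < slo.target
```

With a 99.9% target, a window of 100,000 requests with 50 failures stays within the objective and spends about half of its error budget, while 200 failures breaches it. The single fixed `target` is what makes this approach fit uniform workloads best.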

However, for products that support a range of customer use cases and workloads, like many in Google Cloud, it can be difficult to define a static threshold that works across a variety of workloads. With AI, Google SRE is augmenting our more traditional approaches with anomaly detection, with alerts based on detecting deviations from regular behavior rather than on statically predefined thresholds. This approach relies on agents to collect signals and feed them to a model (e.g., TimesFM) to perform anomaly detection. Historical signals from prior customer cases help the AI agent to predict customer-oriented SLOs. Further, AI-based anomaly detection can consult sources beyond the signals produced by the service itself, for instance customer feedback.
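To make the contrast with static thresholds concrete, here is a minimal sketch that flags points deviating from each workload's own recent behavior. It deliberately swaps in a simple robust z-score (median and median absolute deviation over a trailing window) as a stand-in for a forecasting model such as TimesFM; it does not use the TimesFM API. The function name `rolling_anomalies` and its parameters `window`, `threshold` and `min_scale` are assumptions made for this example.

```python
from statistics import median


def rolling_anomalies(values, window=12, threshold=3.5, min_scale=1.0):
    """Return indices whose value deviates strongly from the trailing window.

    Stand-in for a learned forecaster: the "expected" value is the median of
    the previous `window` points, and the spread is the scaled median
    absolute deviation (MAD). `min_scale` is an assumed floor, in the
    signal's own units, that keeps a perfectly flat history from flagging
    tiny wobbles.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = median(history)
        mad = median(abs(v - center) for v in history)
        # 1.4826 makes MAD comparable to a standard deviation for normal data.
        scale = max(1.4826 * mad, min_scale)
        if abs(values[i] - center) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Two hypothetical latency series (ms) with very different baselines.
interactive = [10, 11, 10, 9, 10, 11, 10, 10, 9, 11, 10, 10, 40, 10]
batch = [200, 205, 198, 202, 199, 201, 203, 197, 200, 204, 199, 201, 202, 200]
```

In this made-up data, a static 100 ms threshold would fire on every `batch` point and never on the `interactive` jump to 40 ms, whereas the per-series baseline flags only that jump. A production system would replace the median baseline with model forecasts and could add signals from outside the service.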

In this model, when the SRE AI agent detects an anomaly, it triggers an alert. Then, the SRE AI alerting agent groups, pre-processes, and enriches the alerts with the necessary context and information. These alerts in turn are run through...
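The grouping and enrichment step described above might look roughly like the following sketch. Everything here is hypothetical: the `Alert` and `AlertGroup` types, the rule of grouping per service until a quiet gap of `window_seconds`, and the `context_lookup` dictionary are assumptions for illustration, not a description of Google's alerting agent.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    signal: str
    timestamp: float  # seconds since some reference point


@dataclass
class AlertGroup:
    service: str
    alerts: list = field(default_factory=list)
    context: dict = field(default_factory=dict)


def group_alerts(alerts, window_seconds=300.0):
    """Group alerts per service; a quiet gap longer than the window starts a new group."""
    groups = []
    open_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_group.get(alert.service)
        if (current is None
                or alert.timestamp - current.alerts[-1].timestamp > window_seconds):
            current = AlertGroup(service=alert.service)
            groups.append(current)
            open_group[alert.service] = current
        current.alerts.append(alert)
    return groups


def enrich(group, context_lookup):
    """Attach caller-supplied context plus a summary of the signals involved."""
    group.context = dict(context_lookup.get(group.service, {}))
    group.context["signals"] = sorted({a.signal for a in group.alerts})
    return group
```

For example, two `storage` alerts one minute apart would land in one group, a `storage` alert much later would open a new group, and `enrich` would attach whatever context the caller provides for that service, such as a recent rollout identifier.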
