Site Reliability Engineering at a Global Scale

March 18, 2022 by Anh Hoang

One Team to Rule Them All, But Not to Rule

Google’s site reliability engineering team is treated as one central organism, spanning across internal networking and developer tools, as well as customer-facing ones. Each service lifecycle stage has different needs and the types of SRE engagement vary.

“What they all have in common is they are scoped around SRE’s mission: reliability, velocity, maintainability and efficiency. And a shared set of principles,” said Dr. Christof Leng, Google’s SRE Engagements engineering lead. He heads three horizontal SRE teams in Munich and is responsible for maintaining Google’s SRE Engagement Model, the collection of policies and principles that govern collaboration between SREs and developers.

Google has more than 3,000 site reliability engineers, grouped into product areas of between 50 and 300 SREs each. This makes it one of the largest SRE organizations in the world, if not the largest, yet it is still much smaller than the developer organization it supports. Leng says that asymmetry keeps SREs focused on their core mission and limits the amount of work developers can offload onto them.

SRE support is also not automatic, nor is it available to all dev teams at Google. SRE remains an intentionally scarce resource. SRE teams are funded by the development teams, a decision made at director or VP level, and each is made up of at least six SREs. Both the dev and SRE teams must agree to start an engagement, and either side can end it, but engagements are usually intended to be long term.

“Production excellence is a multi-year investment so engagements are not considered in isolation, but at the SRE-product area level,” explained Dr. Jennifer Petoff, Google’s director of SRE education and co-author of the original SRE O’Reilly guide.

“It takes time to build up that deep understanding of the services that team is responsible for.”

The Specific Scope of Google SRE-Developer Relationships

While managing a service is a shared endeavor with shared goals, service level objectives and error budgets, Petoff noted that even though day-to-day production responsibility rests with the SRE team, the uptime and availability buck ultimately stops with the dev team.

“Responsibility for having a reliable service is not off-loaded onto the SRE or thrown over the fence. SRE’s job is to help the dev team meet their reliability and velocity goals and to meet the needs of our users first and foremost,” she said.
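The shared error budget mentioned above can be made concrete with a little arithmetic. The following is a minimal sketch, not Google’s actual tooling: the function name, the request-based definition of the SLO and the example numbers are all assumptions chosen for illustration.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    Hypothetical helper: assumes a request-based SLO, where slo_target is
    the fraction of requests that must succeed (for example 0.999).
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic yet, so none of the budget has been spent.
        return 1.0
    # The budget is the number of failures the SLO tolerates.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Illustrative numbers only: a 99.9% SLO over one million requests
# tolerates about 1,000 failures, so 250 failures spend a quarter of it.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

Because the dev and SRE teams share a single number like this one, it can inform the trade-off between shipping changes and protecting reliability without either side owning the decision alone.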

In fact, it’s quite clear what the SRE team can and cannot engage with. The Google SRE team is only able to work on certain projects, and the existence of an on-call team is not seen as a justification. SREs can only work on what they can do significantly more efficiently than anyone else. If devs can do it, it should remain dev headcount.

The Google SRE Engagement Model concerns production only, which includes:

System architecture and inter-service dependencies
Instrumentation, metrics and monitoring
Emergency response
Capacity planning
Change management
Performance — availability, latency and efficiency

By design the work of the SREs must also be “interesting, impactful and challenging for the SRE team,” Petoff said. This is not handing off pager duty. “SRE is not to be the ops team. Our mission is not to handle operations, but to improve inherent reliability of systems through engineering.”

The SRE team aims to reduce the ops workload by answering what broke, how to fix it, and then how to make sure it’s fixed for good.

For SREs, there’s always more work than time, so all of their work should have a clear scope and a clear connection to championing users. The benefits may not be directly visible to users, as with infrastructure work such as converging toward standard platforms to increase feature velocity. Standardization also benefits the SRE team by reducing cognitive load.

Finally, an important role of an SRE is that of a teacher, passing on production knowledge. Petoff says this is the only way to keep the SRE team from becoming a human abstraction layer from production. “You can’t build a wall and then complain about a ‘throw it over the wall’ mentality.”

How Google SRE Functions in Practice

Yes, an SRE team can join a development team at any stage in the app or service lifecycle. But they are most effective from the start, bringing reliability along as you shift left.

Leng says that at the design stage, “You make many decisions that are incredibly hard or practically impossible to change later: architecture, technology, failover capabilities.” He added, “When a production expert has a voice at the table, you can fix problems before they actually happen.”

However, this is often not the case. For example, SLOs usually aren’t discussed until the implementation is done, yet the architecture already chosen must be able to deliver the expected number of nines. If it can’t, either the whole system has to be redesigned or users are disappointed. If it overshoots, you have invested in an architecture far more complicated than is needed to satisfy them.
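To see why the number of nines is an architectural question, it helps to translate nines into allowed downtime and to look at how dependencies compound. The sketch below is illustrative only: the function names are hypothetical, and it assumes a simple model in which a service is up only when all of its serial dependencies are up and their failures are independent.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_for_nines(nines: int) -> float:
    """Convert a count of nines to an availability, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime that the availability target tolerates over the period."""
    return (1.0 - availability) * period_minutes


def serial_availability(dependency_availabilities) -> float:
    """Availability of a chain of dependencies that must all be up.

    Assumes independent failures, which is a simplification.
    """
    result = 1.0
    for availability in dependency_availabilities:
        result *= availability
    return result


for nines in (2, 3, 4, 5):
    target = availability_for_nines(nines)
    minutes = allowed_downtime_minutes(target)
    print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")

# Three dependencies at three nines each cannot deliver three nines overall.
print(f"Chain of three: {serial_availability([0.999] * 3):.6f}")
```

Under these assumptions, three nines leaves about 525.6 minutes (roughly 8.8 hours) of downtime per year, and chaining three such dependencies already drops the combined figure below three nines. That is why an SLO agreed only after implementation can force a redesign.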

Not every SRE engagement will be the same either. Leng groups them into three as-needed buckets, which cover both headcount budget and project time commitment:

Baseline support: tactical, reactive, ad hoc support such as office hours or consulting projects, where the developers execute based on the advice received, or participation in the incident response team during larger-scale outages.
Assisted engagement: SRE provides strategic, proactive, product-focused consultancy, with a dedicated SRE point of contact and a shared production roadmap. This can be an SRE temporarily embedded on a dev team for a critical product where an SRE can be a force multiplier.
Full support: SRE is the effective owner of production and takes on-call rotations to solve the less obvious and more complex production problems. The goal is for the SREs to automate themselves out of a job in 18 months.

“Higher is not always better. It comes at a higher cost. Especially for the earlier lifecycle phases, with a high rate of change, a lower-tier engagement can be more effective.” Leng said engagements can be scaled up or down over time, but that isn’t needed for all services.

Everything is situational. If an SRE team is focused on core infrastructure, they may be offering full support to a few different engagements, but if they are working on earlier-stage, experimental projects, they could be working on several baseline engagements.

When Things Go Wrong

Just because it’s Google doesn’t mean it’s perfect… by any means. But the Google site reliability engagement model hopes for the best and prepares for the worst. The worst could be anything from operations overload to a disagreement on direction to the developers no longer doing their share.

That’s when they apply the best practices for incident management at a strategic level. Start by looking for the root cause. Then seek buy-in from both dev partners and critical dependencies. If agreement can’t be reached, escalate up both the dev and SRE management chains. Then declare a “Code Yellow,” meaning that the work required to fix the problem trumps all other project work.

When all else fails, don’t be a hero and don’t be a constant firefighter: recognize when it may be time to turn in your pager. That’s OK, Leng says, because mobility among SREs across Google is typically very high.

“This is not what typically happens. Everyone understands the SREs need to be kept happy as well, you can’t throw them under the bus. And the developers understand the value that they get out of it,” Leng said.

In the end, as he addressed fellow SREs, “Whatever you do, remember that heroics are not sustainable. You can’t firefight production forever. Neither can you work day and night, it’s not sustainable. Solve the problem through smart engineering, not brute force.”

United you stand. Divided you fall.

Source: InApps.net

Anh Hoang

Anh Hoang is Head of SEO Optimization at InApps Technology, ensuring that the message and research of InApps Technology reach the most people possible while adhering to our strict journalistic standards of excellence and integrity.
