Site reliability engineering

Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations.^[1] SRE aims to create highly reliable and scalable IT systems. Although they are closely related, SRE is slightly different from DevOps.^[2]^[3]^[4]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,^[5]^[6] who founded a site reliability team after joining the company in 2003.^[7] By 2016, Google employed more than 1,000 site reliability engineers.^[8] The concept then spread into the broader software development industry, and other companies began to employ site reliability engineers.^[9] Dedicated SRE teams are common at larger web companies, while in some midsize and many smaller companies a DevOps team also serves as the SRE team.^[9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,^[10] LinkedIn,^[11] Netflix,^[8] and Wikimedia.^[12] According to a 2021 report by the DevOps Institute, 22% of respondents in a survey of 2,000 IT professionals worldwide had adopted the SRE model, compared to 15% the previous year.^[13]^[14]

Definition

As a job role, site reliability engineering may be performed by individual contributors or by teams that are responsible, within a broader engineering organization, for some combination of the following: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.^[15] Site reliability engineers often have backgrounds in software engineering, systems engineering, or system administration.^[16] Focuses of SRE include automation, system design, and improvements to system resilience.^[16]

As a set of principles and practices, site reliability engineering can be performed by anyone. Though everyone should contribute to good practices, as occurs in security engineering, a company may eventually hire specialists and engineers for the job.^[citation needed]

Site reliability engineering is considered a specific implementation of DevOps;^[17] SRE focuses specifically on building reliable systems, whereas DevOps focuses more broadly.^[2]^[3]^[4] Although they have different focuses, some companies have rebranded their operations teams to SRE teams with little meaningful change.^[9]

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles. Although consensus is lacking, the following characteristics appear in most definitions:^[1]^[18]

Automation or elimination of anything repetitive in a cost-effective way.
Avoidance of pursuing more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see list of practices below).
Systems designed with a bias toward the reduction of risks to availability, latency, and efficiency.
Observability—as in, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.^[19]
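
As a rough illustration of the second principle, the arithmetic below shows how quickly the permitted downtime shrinks with each additional "nine" of availability. It is a minimal sketch: the function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration, not part of any standard.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the minutes of downtime permitted in the window for a target.

    availability_target is a fraction such as 0.999 (hypothetical parameter name).
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    # The unavailable fraction of the window is the downtime allowance.
    return (1.0 - availability_target) * window_minutes


# Compare a few common targets over an assumed 30-day window.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} availability allows {minutes:.2f} minutes of downtime")
```

Under these assumptions, a 99.9% target allows about 43 minutes of downtime per 30 days, while a 99.99% target allows only about 4 minutes. Each additional nine cuts the allowance by a factor of ten, which is one reason a team might decline to aim higher than users actually need.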

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

Toil management as the implementation of the first principle outlined above.
Defining and measuring reliability goals—SLIs, SLOs, and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
Designing for and implementing observability.
Defining, testing, and running an incident management process.
Capacity planning.
Change and release management, including CI/CD.
Chaos engineering.
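
The relationship between an SLI, an SLO, and an error budget mentioned in the list above can be sketched as follows. This is an illustrative, request-based formulation under stated assumptions: the names `error_budget_report` and `ErrorBudgetReport` are hypothetical, and real systems may define SLIs over latency, time windows, or other events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetReport:
    sli: float              # measured fraction of good events
    allowed_bad: float      # bad events the SLO tolerates in this window
    budget_consumed: float  # fraction of the error budget already spent
    slo_met: bool


def error_budget_report(good_events: int, total_events: int,
                        slo_target: float) -> ErrorBudgetReport:
    """Summarize an assumed request-based SLI against an SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    # The error budget is the share of events the SLO allows to fail.
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return ErrorBudgetReport(
        sli=sli,
        allowed_bad=allowed_bad,
        budget_consumed=actual_bad / allowed_bad,
        slo_met=sli >= slo_target,
    )


# Hypothetical window: one million requests, 500 failures, 99.9% SLO.
report = error_budget_report(good_events=999_500, total_events=1_000_000,
                             slo_target=0.999)
print(report)
```

In this hypothetical window the SLO tolerates roughly 1,000 failed requests, so 500 failures consume about half of the error budget. A team might use the remaining budget to decide how much release risk it can still accept.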

Implementations

SRE teams collaborate with other departments within an organization to implement SRE principles. Common team structures include the following:^[20]

Kitchen Sink, a.k.a. “Everything SRE”

A "Kitchen Sink" SRE team has an expansive and often unbounded scope covering all the services and workflows it oversees. Rather than working within clearly defined boundaries, its SREs handle everything from system design and performance optimization to incident management and automation. This breadth lets the team address many kinds of reliability challenges and adapt as the demands on its systems and their complexity change, which supports continuous improvement of overall service reliability.

Infrastructure

Infrastructure SRE teams focus on maintaining and improving the reliability of key systems that support other teams' workflows. They sometimes collaborate with platform engineering teams, but the two have different priorities. Platform teams primarily develop and maintain the software, tools, and services used by internal stakeholders across the organization, whereas Infrastructure SRE teams are responsible for ensuring that those systems run smoothly and meet targets for uptime, performance, and efficiency.

Tools

SRE teams use a variety of tools to measure, maintain, and improve system reliability. These tools support performance monitoring, issue detection, and proactive maintenance. For instance, Nagios Core is widely used for system monitoring and alerting, while Prometheus is popular for collecting and querying metrics in cloud-native environments. With such tools, SRE teams can track performance and respond quickly to potential reliability problems.

Product or application

SRE teams dedicated to specific products or applications are common in large organizations. These teams are responsible for the reliability, scalability, and performance of key services. Larger companies typically have multiple SRE teams, each focusing on a different product or application, so that each area receives the specialized attention needed to meet its performance and availability targets.

Embedded

In an embedded model, individual SREs or small SRE pairs are integrated directly within software engineering teams. These SREs work closely with developers, applying core SRE principles, such as automation, monitoring, and incident response, directly to the software development lifecycle. This approach helps improve reliability and performance while fostering collaboration between SREs and developers.

Consulting

Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with extensive experience across various implementations, these teams provide valuable insights and guidance tailored to specific organizational needs. When working directly with clients, these SREs are often referred to as 'Customer Reliability Engineers.'

In large organizations that have adopted SRE, a hybrid model is common. This model combines several implementations, such as multiple Product/Application SRE teams, each addressing the reliability needs of a different product. An Infrastructure SRE team may also collaborate with a platform engineering group to achieve shared reliability goals for a unified platform that supports all products and applications.

Industry

Since 2014, USENIX has hosted the annual SREcon conference, which brings together site reliability engineers from various industries to share knowledge, explore best practices, and discuss trends in site reliability engineering.^[21]

References

[1] "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved June 26, 2021.
[2] Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
[3] Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
[4] "What is SRE? - SRE Explained - AWS". Amazon Web Services, Inc. Retrieved November 5, 2022.
[5] Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
[6] "What is SRE?". Red Hat. Retrieved June 17, 2021.
[7] Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
[8] Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
[9] Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
[10] "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
[11] "Site Reliability Engineering (SRE)". engineering.linkedin.com. Retrieved March 12, 2024.
[12] "SRE - Wikitech". wikitech.wikimedia.org. Retrieved October 17, 2021.
[13] Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
[14] Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
[15] Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
[16] Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
[17] Harrison, Dave (October 9, 2018). "Interview with Betsy Beyer, Stephen Thorne of Google". Retrieved July 24, 2024.
[18] "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved June 26, 2021.
[19] "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved June 26, 2021.
[20] "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved June 26, 2021.
[21] "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

External links

Awesome Site Reliability Engineering resources list
How they SRE resources list
SRE Weekly weekly newsletter devoted to SRE
SRE at Google landing page for learning more about SRE in Google
Komodor K8s Reliability learning center with resources for SREs working with Kubernetes
SRE: What Do You Need To Know To Master This Role? resource list
