Welcome to the SRE Principles Guide
This site is a living collection of Site Reliability Engineering (SRE) best practices, designed to be a practical resource for building and maintaining scalable, reliable, and efficient systems.
Our Core Philosophy
Site Reliability Engineering (SRE) is what you get when you treat operations as a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of our services. We aim to balance the risk of unavailability with the goal of rapidly and safely launching new features.
The guide is written for SREs, developers, and operations engineers. Whether you are new to the field or an experienced practitioner, we hope you find these principles valuable.
Explore the Documentation
The guide is organized into several key areas. Use the navigation menu to jump to the topics most relevant to you.
Core Concepts
- Service Level Management: Dive into the foundational concepts of SRE, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
- Toil Reduction: Learn strategies to identify, measure, and eliminate toil, the manual, repetitive operational work that can be automated.
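As a rough illustration of how SLIs, SLOs, and error budgets relate, the Python sketch below computes an availability SLI from request counts and derives the error budget implied by an SLO target. The function names, the 99.9% target, and the request counts are hypothetical values chosen for illustration; they are not taken from this guide.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of requests allowed to fail in the window under the SLO."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    total, failed, slo = 1_000_000, 250, 0.999
    print(f"SLI: {availability_sli(total - failed, total):.5f}")
    print(f"Error budget: {error_budget(slo, total)} failed requests")
    print(f"Budget remaining: {budget_remaining(slo, total, failed):.0%}")
```

In this hypothetical window the service has spent a quarter of its budget, which under an error-budget policy would typically leave room to keep launching features.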
SRE Pillars
This guide is broken down into several key pillars of Site Reliability Engineering. Explore the topics below to dive deeper into each area.