
Site Reliability Engineering: How Google Runs Production Systems Paperback – Illustrated, 10 May 2016
The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?
In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient. These lessons are directly applicable to your organization.
This book is divided into four sections:
- Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
- Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
- Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
- Management: Explore Google's best practices for training, communication, and meetings that your organization can use
- ISBN-10: 149192912X
- ISBN-13: 978-1491929124
- Edition: 1st
- Publisher: O'Reilly Media
- Publication date: 10 May 2016
- Language: English
- Dimensions: 17.53 x 3.3 x 23.11 cm
- Print length: 552 pages
From the Publisher

How to Read This Book
This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. (If there are other articles that support or inform the text, we reference them so you can follow up accordingly.)
You don’t need to read in any particular order, though we’d suggest at least starting with Chapters 2 and 3, which describe Google’s production environment and outline how SRE approaches risk, respectively. (Risk is, in many ways, the key quality of our profession.) Reading cover-to-cover is, of course, also useful and possible; our chapters are grouped thematically, into Principles (Part II), Practices (Part III), and Management (Part IV). Each has a small introduction that highlights what the individual pieces are about, and references other articles published by Google SREs, covering specific topics in more detail. Additionally, there’s a companion website mentioned in the book that has a number of helpful resources.
We hope this will be at least as useful and interesting to you as putting it together was for us.
— The Editors.
| | Site Reliability Engineering | The Site Reliability Workbook |
|---|---|---|
| Customer Reviews | 4.6 out of 5 stars (1,139) | 4.7 out of 5 stars (399) |
| Price | $70.55 | $55.50 |
| Explore the book & companion workbook | How Google Runs Production Systems | Practical Ways to Implement SRE |
Product description
About the Author
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland's peering hub. He is the author or coauthor of a number of technical papers and/or books, including IPv6 Network Administration for O'Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.
Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google's advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He's also a licensed professional engineer.
Jennifer Petoff is a Program Manager for Google's Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.
Product details
- Publisher: O'Reilly Media; 1st edition (10 May 2016)
- Language: English
- Paperback: 552 pages
- ISBN-10: 149192912X
- ISBN-13: 978-1491929124
- Dimensions: 17.53 x 3.3 x 23.11 cm
- Best Sellers Rank: 86,754 in Books
About the authors
- Working on an SRE-based startup from Dublin, Ireland
- Twitter http://twitter.com/niallm
- Photos at http://www.edge-cases.photos
Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is based in Lisbon, Portugal. She leads training programs for Google's GCP and TI Engineering Teams. Jennifer is one of the co-editors of the best-selling book, "Site Reliability Engineering: How Google Runs Production Systems"; lead author of "Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program"; and is a regular speaker at DevOps and SRE conferences around the world.
Jennifer joined Google in 2007 after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester in the United States.
Betsy is a Technical Writer for Google in NYC specializing in Site Reliability Engineering. She has previously written documentation for Google's Data Center and Hardware Operations Teams in Mountain View and across its globally-distributed data centers. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.
Customer reviews
Top reviews from Australia
- Reviewed in Australia on 1 November 2023 (Verified Purchase)
What to know about an Engineer - read this book - it's Deep
- Reviewed in Australia on 29 July 2016 (Verified Purchase)
A brilliant read with real world patterns and models to follow. Very much enjoyed this book.
- Reviewed in Australia on 10 March 2020 (Verified Purchase)
Overview
This book is a solid description of Site Reliability Engineering at Google. It is full of good ideas. However, most would be difficult to implement in many organisations without revolutionary change in the culture.
Need for Revolutionary Cultural Change
The revolutionary cultural change needed is to treat operational work as our first job. Operational work is not something that is done on the side.
The change that organisations need to make is to recognise that operational work is a vital component of a product. A product is more than features shovelled out the door—it is about the experience of using that product. This is where operational work is critical: we find ways to make the product stable and reliable.
Good Ideas From SRE
The good ideas I got from this book are:
- Continual incident management training
- Continual improvement in alerting
- Continual automation
Incident Management Training
In all service organisations I have worked at, incident management training has been limited to a few professionals in Service Delivery/Operations. All operational personnel should have regular incident management training to keep their skills current.
The practice of having only a few people trained means that there is confusion about roles and expectations in a real incident. And there is usually just one person trying to juggle being an incident commander, customer liaison, incident recorder, etc. In the end, they become less effective in these critical roles.
Google ensures that all SRE personnel are able to do those roles, and holds regular drills to practice them. These drills are based upon post-mortems of production issues.
Alerting
Google’s policy is that pages should only be sent if a human has to do something. Google aims for a maximum of two (2) pages per 12-hour shift. All other alerts should either have an automated response or just be logged for future reference.
In many organisations, alert management is seen as unwelcome toil. I have been to sites where there are thousands of critical database alerts that no one was investigating. (One site had over 6,000, and another had about 2,000.) In both cases, management was wondering why the systems were so unstable.
Alerts need to tell the SRE about a potential problem before a customer notices. Too often, operational personnel are only reacting to customer complaints.
To help people look at alerts, the alerts should be tuned for relevance (not all threshold violations will impact service delivery), and frequency (alert storms should be curtailed or throttled).
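The alerting policy described above can be sketched in a few lines of Python. This is only an illustration and not Google's implementation: the `Alert` fields, the `ShiftRouter` class, and the rule that a repeated alert is logged instead of re-sent are all assumptions made for the example. Only the target of two pages per shift comes from the review.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    PAGE = "page"          # a human has to do something now
    AUTOMATE = "automate"  # an automated response handles it
    LOG = "log"            # recorded for future reference only


@dataclass
class Alert:
    name: str
    needs_human: bool             # hypothetical flag set when the alert is defined
    has_automation: bool = False  # hypothetical flag: a scripted response exists


@dataclass
class ShiftRouter:
    """Routes the alerts raised during one on-call shift (illustrative only)."""
    page_target: int = 2                    # pages per shift, as cited in the review
    pages_sent: int = 0
    seen: set = field(default_factory=set)  # alert names already handled this shift

    def route(self, alert: Alert) -> Action:
        # Frequency tuning: a repeat of an alert already handled this shift
        # is only logged, which curtails alert storms.
        if alert.name in self.seen:
            return Action.LOG
        self.seen.add(alert.name)
        # Relevance tuning: page only when a human has to act.
        if alert.needs_human:
            self.pages_sent += 1
            return Action.PAGE
        if alert.has_automation:
            return Action.AUTOMATE
        return Action.LOG

    def over_target(self) -> bool:
        # Exceeding the target never suppresses a page; it is a signal
        # that the alert definitions need tuning.
        return self.pages_sent > self.page_target
```

In this sketch the router never drops a page that needs a human. Going over the target is treated as feedback for the next round of alert tuning, which matches the review's point that alerting needs continual improvement.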
Automation
Automation is key to a successful SRE team. The more work that can be done by computers, the better. The book does have a salutary lesson about an automated task wiping all data in a data centre. And with automation, there comes the issue of deskilling of SRE personnel.
SRE automation should be treated as production changes. The same care and attention that is taken for customer-facing applications should be applied to critical automation scripts. This is where software development experience and knowledge become vital for SRE personnel.
Deskilling can be counteracted through live drills for incident management training. However, this means systems should be set aside for such a purpose.
Top reviews from other countries
- J. Andrews, reviewed in the United Kingdom on 21 January 2018 (Verified Purchase)
5.0 out of 5 stars: The book every infrastructure engineer and DevOps person should read
If you are new to infrastructure engineering, this book will inform you as to an approach and model to use as you start down this road. If you are an experienced engineer, then you will see a lot of truth in what is written here. It may change your viewpoint or solidify an existing one. Whatever the case, this book is an essential reference and an honest account with a huge amount of wisdom.
- Niels Albers, reviewed in the Netherlands on 4 May 2016 (Verified Purchase)
5.0 out of 5 stars: Must read for the serious DevOps engineer
Just the first chapter alone lists a number of concrete issues that anyone who has any experience with operations at all will recognise, and the recommendations this book makes just make sense. Actually, not only people with DevOps experience should be reading this; there is a lot in here that their managers could certainly profit from, in every sense of the word.
Key words:
- Error budget
- Toil / development balance (and the 50% time rule)
- The impossibility of never having a failure.
I'm still working my way through the book, but every new chapter has new insights that really help to put our complex job into perspective, and offer concrete ways of making our work better.
- Christian Ferranti, reviewed in Italy on 16 May 2016 (Verified Purchase)
5.0 out of 5 stars: Interesting and useful
Of course, I do not have the same infrastructure as Google, but many problems are the same.
This book is very interesting because it shows different tips & tricks to resolve and manage communication problems between departments and, of course, reliability problems.
I recommend it to every IT professional, ITIL expert, DevOps wannabe and, of course, CTO.
- Óscar Casal Sánchez, reviewed in Spain on 18 April 2017 (Verified Purchase)
5.0 out of 5 stars: Excellent book
An excellent book that offers many perspectives on how to build a team and how to tackle problems. It also covers all of a company's processes: budgets, monitoring, SLAs, service launch, service maintenance...
This book shows that Google's culture is "blameless" and that there is no line between devs and ops; there is the concept of SRE, which could be said to be similar to today's DevOps, though with more responsibilities.
A book that everyone who works in IT should read, and also all the