Name: Site Reliability Engineering: How Google Runs Production Systems : Murphy, Niall Richard, Beyer, Betsy, Jones, Chris, Petoff, Jennifer: Amazon.com.au: Books
Price: 70.55 AUD
Availability: LimitedAvailability
Rating: 4.6 (1139 reviews)

Site Reliability Engineering: How Google Runs Production Systems Paperback – Illustrated, 10 May 2016

by Niall Richard Murphy (Author), Betsy Beyer (Author), Chris Jones (Author), Jennifer Petoff (Author)

Edition: 1st

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization.

This book is divided into four sections:

Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
Management: Explore Google's best practices for training, communication, and meetings that your organization can use

From the Publisher

How to Read This Book

This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. (If there are other articles that support or inform the text, we reference them so you can follow up accordingly.)

You don’t need to read in any particular order, though we’d suggest at least starting with Chapters 2 and 3, which describe Google’s production environment and outline how SRE approaches risk, respectively. (Risk is, in many ways, the key quality of our profession.) Reading cover-to-cover is, of course, also useful and possible; our chapters are grouped thematically, into Principles (Part II), Practices (Part III), and Management (Part IV). Each has a small introduction that highlights what the individual pieces are about, and references other articles published by Google SREs, covering specific topics in more detail. Additionally, there’s a companion website mentioned in the book that has a number of helpful resources.

We hope this will be at least as useful and interesting to you as putting it together was for us.

— The Editors.

	Site Reliability Engineering	The Site Reliability Workbook

Customer Reviews	1,139	399
Price	$70.55	$55.50
Explore the book & companion workbook	How Google Runs Production Systems	Practical Ways to Implement SRE

Product description

About the Author

Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland's peering hub. He is the author or coauthor of a number of technical papers and/or books, including IPv6 Network Administration for O'Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.

Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google's advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He's also a licensed professional engineer.

Jennifer Petoff is a Program Manager for Google's Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.

Product details

Publisher: O'Reilly Media; 1st edition (10 May 2016)
Language: English
Paperback: 552 pages
ISBN-10: 149192912X
ISBN-13: 978-1491929124
Dimensions: 17.53 x 3.3 x 23.11 cm

Best Sellers Rank: 86,754 in Books
- 11 in Network Disaster & Recovery Administration
- 15 in Linux & UNIX Administration
- 18 in Project Management Software (Books)

About the authors

Niall Richard Murphy
- Working on an SRE-based startup from Dublin, Ireland
- Twitter http://twitter.com/niallm
- Photos at http://www.edge-cases.photos
Jennifer Petoff
Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is based in Lisbon, Portugal. She leads training programs for Google's GCP and TI Engineering Teams. Jennifer is one of the co-editors of the best-selling book, "Site Reliability Engineering: How Google Runs Production Systems"; lead author of "Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program"; and is a regular speaker at DevOps and SRE conferences around the world.
Jennifer joined Google in 2007 after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester in the United States.
Betsy Beyer
Betsy is a Technical Writer for Google in NYC specializing in Site Reliability Engineering. She has previously written documentation for Google's Data Center and Hardware Operations Teams in Mountain View and across its globally-distributed data centers. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Customer reviews

1,139 global ratings

Top reviews from Australia

Grisi
Deep info
Reviewed in Australia on 1 November 2023
Verified Purchase

Want to know about an Engineer - read this book - it's Deep

justinhennessy
Well worth the read!
Reviewed in Australia on 29 July 2016
Verified Purchase
A brilliant read with real world patterns and models to follow. Very much enjoyed this book.

One person found this helpful

D. Hawthorne
Solid description of Site Reliability Engineering at Google
Reviewed in Australia on 10 March 2020
Verified Purchase
Overview

This book is a solid description of Site Reliability Engineering at Google. It is full of good ideas. However, most would be difficult to implement in many organisations without revolutionary change in the culture.
Need for Revolutionary Cultural Change

The revolutionary cultural changes needed are that operational work is something that we do as our first job. Operational work is not something that is done on the side.

The change that organisations need to make is to recognise that operational work is a vital component of a product. A product is more than features shovelled out the door—it is about the experience of using that product. This is where operational work is critical: we find ways to make the product stable and reliable.
Good Ideas From SRE

The good ideas I got from this book are:

- Continual incident management training
- Continual improvement in alerting
- Continual automation

Incident Management Training

In all service organisations I have worked at, incident management training has been limited to a few professionals in the Service Delivery/Operations. All operational personnel should have regular incident management training to keep their skills current.

The practice of having only a few people trained means that there is confusion about roles and expectations in a real incident. And there is usually just one person trying to juggle being an incident commander, customer liaison, incident recorder, etc. In the end, they become less effective in these critical roles.

Google ensures that all SRE personnel are able to do those roles, and holds regular drills to practice them. These drills are based upon post-mortems of production issues.
Alerting

Google’s policy is that pages should only be sent if a human has to do something. Google aims for a maximum of two (2) pages per 12-hour shift. All other alerts should either have an automated response or just be logged for future reference.

In many organisations, alert management is seen as unwelcome toil. I have been to sites where there are thousands of critical database alerts that no one was investigating. (One site had over 6,000, and another had about 2,000.) In both cases, management was wondering why the systems were so unstable.

Alerts need to tell the SRE about a potential problem before a customer notices. Too often, operational personnel are only reacting to customer complaints.

To help people look at alerts, the alerts should be tuned for relevance (not all threshold violations will impact service delivery), and frequency (alert storms should be curtailed or throttled).
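The two kinds of tuning described above, relevance and frequency, can be sketched in a few lines of Python. This is only an illustration of the idea, not how Google or any alerting product implements it: the `AlertThrottle` class, the service names, and the sliding-window approach are all assumptions, and the limit of two pages per 12 hours is borrowed from the figure quoted in this review. Timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class AlertThrottle:
    """Hypothetical helper: drop irrelevant alerts and cap pages per time window."""

    def __init__(self, max_pages, window_seconds, relevant_services):
        self.max_pages = max_pages
        self.window_seconds = window_seconds
        self.relevant_services = set(relevant_services)
        self._sent = deque()  # timestamps of pages already sent in the window

    def should_page(self, service, timestamp):
        """Return True if an alert for `service` at `timestamp` should page a human."""
        # Relevance: a threshold violation on a service that does not affect
        # service delivery never pages anyone.
        if service not in self.relevant_services:
            return False
        # Frequency: forget pages that have fallen out of the sliding window.
        while self._sent and timestamp - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_pages:
            return False  # alert storm: suppress further pages
        self._sent.append(timestamp)
        return True


# Example: at most two pages per 12 hours, and only "checkout-db" is relevant.
throttle = AlertThrottle(max_pages=2, window_seconds=12 * 3600,
                         relevant_services={"checkout-db"})
decisions = [
    throttle.should_page("checkout-db", 0),          # first page goes out
    throttle.should_page("batch-report", 60),        # not relevant
    throttle.should_page("checkout-db", 120),        # second page goes out
    throttle.should_page("checkout-db", 180),        # over the limit
    throttle.should_page("checkout-db", 12 * 3600),  # window has moved on
]
print(decisions)
```

In a real system the suppressed alerts would still be logged or handled automatically rather than discarded, in line with the policy described earlier in the review.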
Automation

Automation is key to a successful SRE team. The more work that can be done by computers, the better. The book does have a salutary lesson about an automated task wiping all data in a data centre. And with automation, there comes the issue of deskilling of SRE personnel.

SRE automation should be treated as production changes. The same care and attention that is taken for customer-facing applications should be applied to critical automation scripts. This is where software development experience and knowledge become vital for SRE personnel.

Deskilling can be counteracted through live drills for incident management training. However, this means systems should be set aside for such a purpose.

One person found this helpful

Top reviews from other countries

J. Andrews
The book every infrastructure engineer and DevOps person should read
Reviewed in the United Kingdom on 21 January 2018
Verified Purchase

If you are new to infrastructure engineering, this book will inform you as to an approach and model to use as you start down this road. If you are an experienced engineer, then you will see a lot of truth in what is written here. It may change your viewpoint or solidify an existing one. Whatever the case, this book is an essential reference and an honest account with a huge amount of wisdom.

Niels Albers
Must read for the serious DevOps engineer
Reviewed in the Netherlands on 4 May 2016
Verified Purchase
Just the first chapter alone lists a number of concrete issues that anyone with any operations experience at all will recognise, and the recommendations this book makes just make sense. Actually, it is not only people with DevOps experience who should be reading this; there is a lot in here that their managers could certainly profit from, in every sense of the word.
Key words:
- Error budget
- Toil / development balance (and the 50% time rule)
- The impossibility of never having a failure.

I'm still working my way through the book, but every new chapter has new insights that really help to put our complex job into perspective, and offer concrete ways of making our work better.

Christian Ferranti
Interesting and useful
Reviewed in Italy on 16 May 2016
Verified Purchase
Of course, I do not have the same infrastructure as Google, but many problems are the same.
This book is very interesting because it shows different tips and tricks for resolving and managing communication problems between departments and, of course, reliability problems.
I recommend it to every IT professional, ITIL expert, DevOps wannabe and, of course, CTO.

Óscar Casal Sánchez
Excellent book
Reviewed in Spain on 18 April 2017
Verified Purchase
An excellent book that offers many points of view on how to build a team and how to tackle problems. It also covers all of a company's processes: budgets, monitoring, SLAs, service launch, service maintenance...

This book shows that Google's culture is "blameless" and that there is no line between devs and ops. Instead there is the concept of SRE, which could be said to be similar to today's DevOps, though with more responsibilities.

A book that everyone who works in IT should read, and also all the

devnull
A must read book
Reviewed in France on 20 August 2018
Verified Purchase
A must read book

