Site Reliability Engineering: How Google Runs Production Systems Paperback – Illustrated, 10 May 2016

4.6 out of 5 stars 1,139 ratings
Edition: 1st

The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization.

This book is divided into four sections:

  • Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices: Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems
  • Management: Explore Google's best practices for training, communication, and meetings that your organization can use

From the Publisher


How to Read This Book

This book is a series of essays written by members and alumni of Google’s Site Reliability Engineering organization. It’s much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you. (If there are other articles that support or inform the text, we reference them so you can follow up accordingly.)

You don’t need to read in any particular order, though we’d suggest at least starting with Chapters 2 and 3, which describe Google’s production environment and outline how SRE approaches risk, respectively. (Risk is, in many ways, the key quality of our profession.) Reading cover-to-cover is, of course, also useful and possible; our chapters are grouped thematically, into Principles (Part II), Practices (Part III), and Management (Part IV). Each has a small introduction that highlights what the individual pieces are about, and references other articles published by Google SREs, covering specific topics in more detail. Additionally, there’s a companion website mentioned in the book that has a number of helpful resources.

We hope this will be at least as useful and interesting to you as putting it together was for us.

— The Editors.

Explore the book & companion workbook

  • Site Reliability Engineering: How Google Runs Production Systems. Customer reviews: 4.6 out of 5 stars (1,139). Price: $70.55
  • The Site Reliability Workbook: Practical Ways to Implement SRE. Customer reviews: 4.7 out of 5 stars (399). Price: $55.50

Product description

About the Author

Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland's peering hub. He is the author or coauthor of a number of technical papers and/or books, including IPv6 Network Administration for O'Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.



Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.



Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google's advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He's also a licensed professional engineer.



Jennifer Petoff is a Program Manager for Google's Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.

Product details

  • Publisher: O'Reilly Media; 1st edition (10 May 2016)
  • Language: English
  • Paperback: 552 pages
  • ISBN-10: 149192912X
  • ISBN-13: 978-1491929124
  • Dimensions: 17.53 x 3.3 x 23.11 cm
  • Customer Reviews: 4.6 out of 5 stars (1,139 ratings)

Customer reviews

4.6 out of 5 stars
1,139 global ratings

Top reviews from Australia

  • Reviewed in Australia on 1 November 2023
    Verified Purchase
    Want to know about an Engineer? Read this book. It's deep.
  • Reviewed in Australia on 29 July 2016
    Verified Purchase
    A brilliant read with real world patterns and models to follow. Very much enjoyed this book.
    One person found this helpful
  • Reviewed in Australia on 10 March 2020
    Verified Purchase
    Overview

    This book is a solid description of Site Reliability Engineering at Google. It is full of good ideas. However, most would be difficult to implement in many organisations without revolutionary change in the culture.

    Need for Revolutionary Cultural Change

    The revolutionary cultural change needed is to treat operational work as something that we do as our first job. Operational work is not something that is done on the side.

    The change that organisations need to make is to recognise that operational work is a vital component of a product. A product is more than features shovelled out the door; it is about the experience of using that product. This is where operational work is critical: we find ways to make the product stable and reliable.

    Good Ideas From SRE

    The good ideas I got from this book are:

    • Continual incident management training
    • Continual improvement in alerting
    • Continual automation

    Incident Management Training

    In all service organisations I have worked at, incident management training has been limited to a few professionals in the Service Delivery/Operations. All operational personnel should have regular incident management training to keep their skills current.

    The practice of having only a few people trained means that there is confusion about roles and expectations in a real incident. And there is usually just one person trying to juggle being an incident commander, customer liaison, incident recorder, etc. In the end, they become less effective in these critical roles.

    Google ensures that all SRE personnel are able to perform those roles, and holds regular drills to practice them. These drills are based upon post-mortems of production issues.

    Alerting

    Google’s policy is that pages should only be sent if a human has to do something. Google aims for a maximum of two (2) pages per 12-hour shift. All other alerts should either have an automated response or just be logged for future reference.

    In many organisations, alert management is seen as unwelcome toil. I have been to sites where there are thousands of critical database alerts that no one was investigating. (One site had over 6,000, and another had about 2,000.) In both cases, management was wondering why the systems were so unstable.

    Alerts need to tell the SRE about a potential problem before a customer notices. Too often, operational personnel are only reacting to customer complaints.

    To help people look at alerts, the alerts should be tuned for relevance (not all threshold violations will impact service delivery), and frequency (alert storms should be curtailed or throttled).

    Automation

    Automation is key to a successful SRE team. The more work that can be done by computers, the better. The book does have a salutary lesson about an automated task wiping all data in a data centre. And with automation, there comes the issue of deskilling of SRE personnel.

    SRE automation should be treated like production changes. The same care and attention that is taken for customer-facing applications should be applied to critical automation scripts. This is where software development experience and knowledge become vital for SRE personnel.

    Deskilling can be counteracted through live drills for incident management training. However, this means systems should be set aside for such a purpose.
    One person found this helpful
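
The alerting policy described in the last Australian review (page only when a human has to act, otherwise automate or log, and throttle repeated pages) can be sketched roughly as follows. This is an illustrative sketch only, not Google's implementation and not code from the book: the `Alert` fields, the `Router` class, and the 300-second throttle window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Alert:
    name: str             # hypothetical alert identifier
    needs_human: bool     # True only if a person has to do something
    has_automation: bool  # True if an automated response exists
    timestamp: float      # seconds since an arbitrary starting point


@dataclass
class Router:
    # Assumed throttle window; the review gives no specific value.
    throttle_seconds: float = 300.0
    last_page: Dict[str, float] = field(default_factory=dict)

    def route(self, alert: Alert) -> str:
        """Return 'page', 'automate', 'log', or 'suppressed'."""
        # Alerts that need no human never page: automate or just log them.
        if not alert.needs_human:
            return "automate" if alert.has_automation else "log"
        # Throttle repeats of the same page to curtail alert storms.
        last = self.last_page.get(alert.name)
        if last is not None and alert.timestamp - last < self.throttle_seconds:
            return "suppressed"
        self.last_page[alert.name] = alert.timestamp
        return "page"
```

Keeping the "does a human need to act?" decision as an explicit field makes the relevance tuning visible: an alert is only promoted to a page when someone has deliberately marked it as actionable.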

Top reviews from other countries

  • J. Andrews
    5.0 out of 5 stars The book every infrastructure engineer and DevOps person should read
    Reviewed in the United Kingdom on 21 January 2018
    Verified Purchase
    If you are new to infrastructure engineering, this book will show you an approach and model to use as you start down this road. If you are an experienced engineer, then you will see a lot of truth in what is written here. It may change your viewpoint or solidify an existing one. Whatever the case, this book is an essential reference and an honest account with a huge amount of wisdom.
  • Niels Albers
    5.0 out of 5 stars Must read for the serious DevOps engineer
    Reviewed in the Netherlands on 4 May 2016
    Verified Purchase
    Just the first chapter alone lists a number of concrete issues that anyone who has any experience with operations at all will recognise, and the recommendations this book makes just make sense. Actually, not only people with DevOps experience should be reading this; there is a lot in here that their managers could certainly profit from, in every sense of the word.
    Key words:
    - Error budget
    - Toil / development balance (and the 50% time rule)
    - The impossibility of never having a failure.

    I'm still working my way through the book, but every new chapter has new insights that really help to put our complex job into perspective, and offer concrete ways of making our work better.
  • Christian Ferranti
    5.0 out of 5 stars Interesting and useful
    Reviewed in Italy on 16 May 2016
    Verified Purchase
    Of course, I do not have the same infrastructure as Google, but many problems are the same.
    This book is very interesting because it shows different tips and tricks for resolving and managing communication problems between departments and, of course, reliability problems.
    I recommend it to every IT professional, ITIL expert, DevOps wannabe and, of course, CTO.
  • Óscar Casal Sánchez
    5.0 out of 5 stars Excellent book
    Reviewed in Spain on 18 April 2017
    Verified Purchase
    An excellent book that offers many perspectives on how to build a team and how to tackle problems. It also covers all of a company's processes: budgets, monitoring, SLAs, service launch, service maintenance...

    This book shows that Google's culture is "blameless" and that there is no line between devs and ops; instead there is the concept of SRE, which could be described as similar to today's DevOps, though with more responsibilities.

    A book that everyone who works in IT should read, and also all the
  • devnull
    5.0 out of 5 stars A must read book
    Reviewed in France on 20 August 2018
    Verified Purchase
    A must read book