Abstract

This article is a follow-up to Vivek Rau's chapter "Eliminating Toil" in Site Reliability Engineering: How Google Runs Production Systems. We begin by recapping Vivek's definition of toil and Google's approach to balancing operational work with engineering project work. The Bigtable SRE case study then presents a concrete example of how one team at Google reduced its toil. Finally, we leave readers with a set of best practices that should help reduce toil regardless of an organization's size or makeup.