Achieving Rapid Response Times in Large Online Services

Jeffrey Dean

Talk given at Berkeley AMPLab Cloud Seminar, March 26, 2012

Abstract

Today’s large-scale web services provide rapid responses to interactive requests by applying large amounts of computational resources to massive datasets. They typically operate in warehouse-sized datacenters and run on clusters of machines that are shared across many kinds of interactive and batch jobs. As these systems distribute work to ever larger numbers of machines and sub-systems in order to provide interactive response times, it becomes increasingly difficult to tightly control latency variability across these machines, and the 95th- and 99th-percentile response times often suffer in an effort to improve average response times. As systems scale up, simply stamping out all sources of variability does not work. Just as fault-tolerant techniques had to be developed once guaranteeing fault-free operation by design became infeasible, techniques that deliver predictably low service-level latency in the presence of highly variable individual components are increasingly important at larger scales. In this talk, I’ll describe a collection of techniques and practices for lowering response times in large distributed systems whose components run on shared clusters of machines, where pieces of these systems are subject to interference by other tasks, and where unpredictable latency hiccups are the norm, not the exception. Some of the techniques adapt to trends observed over periods of a few minutes, making them effective at dealing with longer-lived interference or resource contention. Others react to latency anomalies within a few milliseconds, making them suitable for mitigating variability within the context of a single interactive request. I’ll discuss examples of how these techniques are used in various pieces of Google’s systems infrastructure and in various higher-level online services.
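To make the fan-out argument concrete, here is a small back-of-the-envelope sketch. It is not taken from the talk: the function name `prob_any_slow`, the 1% per-machine slow rate, and the assumption that machines are slow independently of one another are all illustrative. Under those assumptions, a request that must wait for every one of n machines is slow with probability 1 - (1 - p)^n.

```python
def prob_any_slow(fanout: int, p_slow: float) -> float:
    """Chance that at least one of `fanout` parallel calls is slow.

    Assumes (for illustration only) that each call is slow independently
    with probability `p_slow`, and that the request waits for all calls.
    """
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    # Hypothetical setting: each machine exceeds its latency target 1% of the time.
    for n in (1, 10, 100, 1000):
        print(f"fan-out {n:>4}: {prob_any_slow(n, 0.01):.3f}")
```

With a fan-out of 100 and a 1% per-machine slow rate, roughly 63% of requests would touch at least one slow machine, which is one way to see why tail latency, rather than average latency, comes to dominate as systems scale up.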

Research Areas

Distributed Systems and Parallel Computing
