Characterizing Task Usage Shapes in Google Compute Clusters

Abstract

The increase in scale and complexity of large compute clusters motivates a need for representative workload benchmarks to evaluate the performance impact of system changes, so as to assist in designing better scheduling algorithms and in carrying out management activities. To achieve this goal, it is necessary to construct workload characterizations from which realistic performance benchmarks can be created. In this paper, we focus on characterizing run-time task resource usage for CPU, memory and disk. The goal is to find an accurate characterization that can faithfully reproduce the performance of historical workload traces in terms of key performance metrics, such as task wait time and machine resource utilization. Through experiments using workload traces from Google production clusters, we find that simply using the mean of task usage can generate synthetic workload traces that accurately reproduce resource utilization and task wait time. This seemingly surprising result can be justified by the fact that resource usage for CPU, memory and disk is relatively stable over time for the majority of tasks. Our work not only presents a simple technique for constructing realistic workload benchmarks, but also provides insights into understanding workload performance in production compute clusters.
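As an illustration of the mean-based characterization described above, the following minimal sketch collapses each task's usage time series into its mean and builds a synthetic trace from those means. It is a sketch under stated assumptions, not the procedure used in the paper: the toy `trace` data, the dictionary layout, and the function names `mean_usage_shape`, `coefficient_of_variation`, and `synthesize` are all hypothetical. The coefficient of variation is included only as one possible way to check the stability of usage over time.

```python
from statistics import mean, pstdev

# Hypothetical toy trace (not real Google data):
# task id -> resource name -> usage samples over time.
trace = {
    "task-a": {"cpu": [0.20, 0.22, 0.18, 0.20], "mem": [0.50, 0.50, 0.51, 0.49]},
    "task-b": {"cpu": [0.05, 0.05, 0.06, 0.04], "mem": [0.10, 0.10, 0.10, 0.10]},
}


def mean_usage_shape(samples_by_resource):
    """Collapse each resource's time series into a single mean value."""
    return {res: mean(samples) for res, samples in samples_by_resource.items()}


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean.

    Small values suggest usage that is stable over time.
    """
    m = mean(samples)
    return pstdev(samples) / m if m else 0.0


def synthesize(trace):
    """Build a synthetic trace in which each task uses its mean at every step."""
    synthetic = {}
    for task_id, usage in trace.items():
        shape = mean_usage_shape(usage)
        synthetic[task_id] = {
            res: [shape[res]] * len(samples) for res, samples in usage.items()
        }
    return synthetic


synthetic = synthesize(trace)
```

In such a sketch, a task with a low coefficient of variation loses little information when its time series is replaced by a constant, which is the intuition behind the stability argument in the abstract.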