
Take me to your leader! Online Optimization of Distributed Storage Configurations

Abstract

The configuration of a distributed storage system typically includes, among other parameters, the set of servers and their roles in the replication protocol. Although mechanisms for changing the configuration at runtime exist, it is usually left to system administrators to manually determine the “best” configuration and periodically reconfigure the system, often by trial and error. This paper describes a new workload-driven optimization framework that dynamically determines the optimal configuration at runtime. We focus on optimizing leader- and quorum-based replication schemes and divide the framework into three optimization tiers, each dynamically optimizing a different aspect of the configuration: 1) leader placement, 2) the roles of different servers in the replication protocol, and 3) replica locations. We showcase our optimization framework by applying it to a large-scale distributed storage system used internally at Google and demonstrate that most client applications benefit significantly from using our framework, reducing average operation latency by up to 94%.
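To make the first tier (leader placement) concrete, the following is a minimal sketch of a workload-driven choice of leader. The abstract does not state the cost model, so everything here is an assumption made for illustration: the function name `pick_leader`, the inputs `client_ops` (observed operations per client location) and `rtt_ms` (round-trip time from each client location to each candidate server), and the objective itself (workload-weighted mean client-to-leader round-trip time). It is not the paper's algorithm.

```python
from typing import Dict, List


def pick_leader(
    candidates: List[str],
    client_ops: Dict[str, int],
    rtt_ms: Dict[str, Dict[str, float]],
) -> str:
    """Return the candidate with the lowest workload-weighted mean RTT.

    Hypothetical cost model (an assumption, not taken from the paper):
    every operation costs one round trip from its client location to
    the leader, so the expected latency of a candidate is the average
    of rtt_ms[client][candidate] weighted by client_ops[client].
    """
    total_ops = sum(client_ops.values())
    if not candidates or total_ops == 0:
        raise ValueError("need at least one candidate and one observed operation")

    def weighted_latency(server: str) -> float:
        # Weight each client location's RTT by its share of the workload.
        return sum(
            ops * rtt_ms[client][server] for client, ops in client_ops.items()
        ) / total_ops

    return min(candidates, key=weighted_latency)
```

Under this assumed model, a workload in which most operations originate near one candidate pulls the leader toward that candidate. Re-running the selection as `client_ops` changes is what would make the choice dynamic rather than a one-time manual decision.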