Parthasarathy Ranganathan
Parthasarathy (Partha) Ranganathan is currently at Google designing their next-generation systems. Before this, he was an HP Fellow and Chief Technologist at Hewlett Packard Labs, where he led research on systems and datacenters. Dr. Ranganathan's research interests are in systems architecture and manageability, energy efficiency, and systems modeling and evaluation. He has done extensive work in these areas, including key contributions around energy-aware user interfaces, heterogeneous multi-core processors, power capping and power-aware server designs, federated enterprise power management, energy modeling and benchmarking, disaggregated blade server architectures, and, most recently, storage hierarchy and systems redesign for non-volatile memory. He was also one of the primary developers of the publicly distributed Rice Simulator for ILP Multiprocessors (RSIM).
Dr. Ranganathan's work has had broad impact on both academia and industry, including several commercial products such as Power Capping and HP Moonshot servers. He holds more than 50 patents (with another 45 pending) and has published extensively, including several award-winning papers. He also teaches regularly (most recently at Stanford) and has contributed to several popular computer architecture textbooks. Dr. Ranganathan and his work have been featured on numerous occasions in the press, including the New York Times, Wall Street Journal, BusinessWeek, San Francisco Chronicle, Times of India, Slashdot, YouTube, and Tom's Hardware Guide. He has been named one of the world's top young innovators by MIT Technology Review and one of the top 15 enterprise technology rock stars by Business Insider, and has been recognized with several other awards, including the ACM SIGARCH Maurice Wilkes Award and Rice University's Outstanding Young Engineering Alumni award. Dr. Ranganathan received his B.Tech degree from the Indian Institute of Technology, Madras, and his M.S. and Ph.D. from Rice University, Houston. He is also an ACM and IEEE Fellow.
Authored Publications
Characterizing a Memory Allocator at Warehouse Scale
Zhuangzhuang Zhou
Nilay Vaish
Patrick Xia
Christina Delimitrou
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, Association for Computing Machinery, La Jolla, CA, USA (2024), 192–206
Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings.
We present the first comprehensive characterization study of TCMalloc, a warehouse-scale memory allocator used in our production fleet. Our characterization reveals a profound diversity in memory allocation patterns, allocated object sizes, and lifetimes across large-scale datacenter workloads, as well as in their performance on heterogeneous hardware platforms. Based on these insights, we redesign TCMalloc for warehouse-scale environments. Specifically, we propose optimizations for each level of its cache hierarchy that include usage-based dynamic sizing of allocator caches, leveraging hardware topology to mitigate inter-core communication overhead, and improving allocation packing algorithms based on statistical data. We evaluate these design choices using benchmarks and fleet-wide A/B experiments in our production fleet, resulting in a 1.4% improvement in throughput and a 3.4% reduction in RAM usage for the entire fleet. At our scale, even a single percent CPU or memory improvement translates to significant savings in server costs.
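The "usage-based dynamic sizing of allocator caches" idea can be illustrated with a toy model: grow a per-CPU free-list cache when it misses, and shrink it after idle periods so unused memory returns to the shared pool. All class names, capacities, and growth/decay factors below are illustrative assumptions, not TCMalloc's actual tuning constants.

```python
class DynamicCache:
    """Toy model of usage-based sizing for one per-CPU free-list cache.

    Hypothetical policy: double capacity on a miss (refilling from the
    central heap), and halve it after two consecutive idle epochs so a
    cold cache releases memory back to the shared pool.
    """

    def __init__(self, min_cap=8, max_cap=2048):
        self.min_cap, self.max_cap = min_cap, max_cap
        self.capacity = min_cap
        self.used = 0            # objects currently cached
        self.idle_epochs = 0

    def allocate(self):
        if self.used > 0:        # cache hit: pop a cached object
            self.used -= 1
            self.idle_epochs = 0
            return "hit"
        # miss: refill from the central heap and grow capacity
        self.capacity = min(self.max_cap, self.capacity * 2)
        self.used = self.capacity // 2
        self.idle_epochs = 0
        return "miss"

    def epoch_tick(self):
        """Periodic background scan: shrink caches that saw no traffic."""
        self.idle_epochs += 1
        if self.idle_epochs >= 2:
            self.capacity = max(self.min_cap, self.capacity // 2)
            self.used = min(self.used, self.capacity)
```

The point of the sketch is the feedback loop: capacity tracks demand rather than being a fixed fleet-wide constant, which is what lets busy caches stay large while idle ones stop wasting RAM.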
Limoncello: Prefetchers for Scale
Carlos Villavieja
Baris Kasikci
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, United States (2024)
This paper presents Limoncello, a novel software system that dynamically configures data prefetching for high utilization systems. We demonstrate that in resource-constrained environments, such as large data centers, traditional methods of hardware prefetching can increase memory latency and decrease available memory bandwidth. To address this, Limoncello dynamically configures data prefetching, disabling hardware prefetchers when memory bandwidth utilization is high and leveraging targeted software prefetching to reduce cache misses when hardware prefetchers are disabled. Limoncello is software-centric and does not require any modifications to hardware. Our evaluation of the deployment on a real-world hyperscale system reveals that Limoncello unlocks significant performance gains for high-utilization systems: it improves application throughput by 10%, due to a 15% reduction in memory latency, while maintaining minimal change in cache miss rate for targeted library functions.
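The core control policy described above can be sketched as a hysteresis loop: disable hardware prefetchers when measured memory-bandwidth utilization crosses a high watermark, and re-enable them only below a lower watermark so the system does not flap. The thresholds, class name, and return convention here are illustrative assumptions; the real system programs hardware knobs and library hooks not shown.

```python
# Hypothetical thresholds; a production system would tune these and
# program model-specific registers rather than return booleans.
HIGH_UTIL = 0.80   # disable HW prefetchers above this utilization
LOW_UTIL = 0.60    # re-enable below this (hysteresis avoids flapping)

class PrefetchGovernor:
    def __init__(self):
        self.hw_prefetch_on = True

    def step(self, bw_utilization):
        """One control interval: returns the (hw, sw) prefetch setting."""
        if self.hw_prefetch_on and bw_utilization > HIGH_UTIL:
            self.hw_prefetch_on = False    # free up memory bandwidth
        elif not self.hw_prefetch_on and bw_utilization < LOW_UTIL:
            self.hw_prefetch_on = True     # bandwidth headroom is back
        # Targeted software prefetching covers hot code only while the
        # hardware prefetchers are disabled.
        sw_prefetch_on = not self.hw_prefetch_on
        return self.hw_prefetch_on, sw_prefetch_on
```

The two-threshold design is what makes the policy safe at scale: a single threshold would oscillate when utilization hovers near the cutoff.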
CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems
Ani Udipi
JunSun Choi
Joonho Whangbo
Jerry Zhao
Edwin Lim
Vrishab Madduri
Yakun Sophia Shao
Borivoje Nikolic
Krste Asanovic
Proceedings of the 50th Annual International Symposium on Computer Architecture, Association for Computing Machinery, New York, NY, USA (2023)
General-purpose lossless data compression and decompression ("(de)compression") are used widely in hyperscale systems and are key "datacenter taxes". However, designing optimal hardware compression and decompression processing units ("CDPUs") is challenging due to the variety of algorithms deployed, input data characteristics, and evolving costs of CPU cycles, network bandwidth, and memory/storage capacities.
To navigate this vast design space, we present the first large-scale data-driven analysis of (de)compression usage at a major cloud provider by profiling Google's datacenter fleet. We find that (de)compression consumes 2.9% of fleet CPU cycles and 10-50% of cycles in key services. Demand is also artificially limited; 95% of bytes compressed in the fleet use less capable algorithms to reduce compute, motivating a CDPU that changes cost vs. size tradeoffs.
Prior work has improved the microarchitectural state-of-the-art for CDPUs supporting various algorithms in fixed contexts. However, we find that higher-level design parameters such as CDPU placement, hash table sizing, history window sizes, and more have as significant an impact on the viability of CDPU integration, yet are not well studied. Thus, we present the first end-to-end design/evaluation framework for CDPUs, including: (1) an open-source RTL-based CDPU generator that supports many run-time and compile-time parameters; (2) integration into an open-source RISC-V SoC for rapid performance and silicon area evaluation across CDPU placements and parameters; and (3) an open-source (de)compression benchmark, HyperCompressBench, that is representative of (de)compression usage in Google's fleet.
Using our framework, we perform an extensive design space exploration running HyperCompressBench. Our exploration spans a 46× range in CDPU speedup, 3× range in silicon area (for a single pipeline), and evaluates a variety of CDPU integration techniques to optimize CDPU designs for hyperscale contexts. Our final hyperscale-optimized CDPU instances are up to 10× to 16× faster than a single Xeon core, while consuming a small fraction (as little as 2.4% to 4.7%) of the area.
Profiling Hyperscale Big Data Processing
Aasheesh Kolli
Abraham Gonzalez
Samira Khan
Sihang Liu
Krste Asanovic
ISCA (2023)
Computing demand continues to grow exponentially, largely driven by "big data" processing on hyperscale data stores. At the same time, the slowdown in Moore's law is leading the industry to embrace custom computing in large-scale systems. Taken together, these trends motivate the need to characterize live production traffic on these large data processing platforms and understand the opportunity of acceleration at scale.
This paper addresses this key need. We characterize three important production distributed database and data analytics platforms at Google to identify key hardware acceleration opportunities and perform a comprehensive limits study to understand the trade-offs among various hardware acceleration strategies.
We observe that hyperscale data processing platforms spend significant time on distributed storage and other remote work across distributed workers. Therefore, optimizing storage and remote work in addition to compute acceleration is critical for these platforms. We present a detailed breakdown of the compute-intensive functions in these platforms and identify dominant key data operations related to datacenter and systems taxes. We observe that no single accelerator can provide a significant benefit on its own, but collectively a "sea of accelerators" can accelerate many of these smaller platform-specific functions. We demonstrate the potential gains of the sea-of-accelerators proposal in a limits study and analytical model. We perform a comprehensive study to understand the trade-offs between accelerator location (on-chip/off-chip) and invocation model (synchronous/asynchronous). We propose and evaluate a chained accelerator execution model in which identified compute-intensive functions are accelerated and pipelined to avoid invocation from the core, achieving a 3x improvement over the baseline system while nearly matching the performance of an ideal fully asynchronous execution model.
CRISP: Critical Slice Prefetching
Heiner Litz
Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2022), pp. 300-313
The high access latency of DRAM continues to be a performance challenge for contemporary microprocessor systems. Prefetching is a well-established technique to address this problem; however, existing implemented designs fail to provide any performance benefits in the presence of irregular memory access patterns. The hardware complexity of prior techniques that can predict irregular memory accesses such as runahead execution has proven untenable for implementation in real hardware. We propose a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling. In particular, our technique executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high latency instructions can be hidden with a similar approach. Our proposal only requires minimal hardware modifications by performing memory access classification, load and branch slice extraction, as well as priority analysis exclusively in software. As a result, our technique is feasible to implement, introducing only a simple new instruction prefix while requiring minimal modifications of the instruction scheduler. Our technique increases the IPC of memory-latency-bound applications by up to 38% and by 8.4% on average.
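The "load slice" of a delinquent load is the minimal backward chain of instructions that produce its address; extracting it in software is what lets the scheduler prioritize it. A minimal sketch of that backward dataflow walk, over a hypothetical toy trace format (not the paper's actual tooling):

```python
def load_slice(instrs, load_idx):
    """Backward dataflow from a delinquent load: collect the minimal set
    of instructions that compute its address (its "load slice").
    `instrs` is a toy trace of (dest_reg, [src_regs]) tuples."""
    needed = set(instrs[load_idx][1])   # registers the load's address uses
    slice_idx = {load_idx}
    for i in range(load_idx - 1, -1, -1):
        dest, srcs = instrs[i]
        if dest in needed:              # this instruction feeds the address
            slice_idx.add(i)
            needed.discard(dest)
            needed.update(srcs)         # now we need its inputs instead
    return sorted(slice_idx)

# i0: r1 <- r0      (part of the slice)
# i1: r9 <- r8      (unrelated work, skipped)
# i2: r2 <- r1      (part of the slice)
# i3: load [r2]     (the delinquent load itself)
trace = [("r1", ["r0"]), ("r9", ["r8"]), ("r2", ["r1"]), (None, ["r2"])]
```

Here `load_slice(trace, 3)` picks out instructions 0, 2, and 3 and skips the unrelated instruction 1; prioritizing only that chain is what hides the load's latency without executing everything early.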
A Hardware Accelerator for Protocol Buffers
Chris Leary
Jerry Zhao
Dinesh Parimi
Borivoje Nikolic
Krste Asanovic
Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-54), Association for Computing Machinery, New York, NY, USA (2021), 462–478
Serialization frameworks are a fundamental component of scale-out systems, but introduce significant compute overheads. However, they are amenable to acceleration with specialized hardware. To understand the trade-offs involved in architecting such an accelerator, we present the first in-depth study of serialization framework usage at scale by profiling Protocol Buffers (“protobuf”) usage across Google’s datacenter fleet. We use this data to build HyperProtoBench, an open-source benchmark representative of key serialization-framework user services at scale. In doing so, we identify key insights that challenge prevailing assumptions about serialization framework usage.
We use these insights to develop a novel hardware accelerator for protobufs, implemented in RTL and integrated into a RISC-V SoC. Applications can easily harness the accelerator, as it integrates with a modified version of the open-source protobuf library and is wire-compatible with standard protobufs. We have fully open-sourced our RTL, which, to the best of our knowledge, is the only such implementation currently available to the community.
We also present a first-of-its-kind, end-to-end evaluation of our entire RTL-based system running hyperscale-derived benchmarks and microbenchmarks. We boot Linux on the system using FireSim to run these benchmarks and implement the design in a commercial 22nm FinFET process to obtain area and frequency metrics. We demonstrate an average 6.2x to 11.2x performance improvement vs. our baseline RISC-V SoC with BOOM OoO cores and, despite the RISC-V SoC's weaker uncore/supporting components, an average 3.8x improvement vs. a Xeon-based server.
Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator
Andrew Hamilton Hunter
15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21) (2021) (to appear)
Memory allocation represents significant compute cost at the warehouse scale and its optimization can yield considerable cost savings. One classical approach is to increase the efficiency of an allocator to minimize the cycles spent in the allocator code. However, memory allocation decisions also impact overall application performance via data placement, offering opportunities to improve fleetwide productivity by completing more units of application work using fewer hardware resources. Here, we focus on hugepage coverage. We present TEMERAIRE, a hugepage-aware enhancement of TCMALLOC to reduce CPU overheads in the application's code. We discuss the design and implementation of TEMERAIRE, including strategies for hugepage-aware memory layouts to maximize hugepage coverage and to minimize fragmentation overheads. We present application studies for 8 applications, improving requests-per-second (RPS) by 7.7% and reducing RAM usage by 2.4%. We present the results of a 1% experiment at fleet scale as well as the longitudinal rollout in Google's warehouse scale computers. This yielded 6% fewer TLB miss stalls, and 26% reduction in memory wasted due to fragmentation. We conclude with a discussion of additional techniques for improving the allocator development process and potential optimization strategies for future memory allocators.
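The hugepage-coverage idea can be shown with a toy placement policy: prefer the most-filled hugepage that still has room, so live objects concentrate on few hugepages and wholly empty ones can be returned to the OS. The class, slot granularity, and densest-first rule below are illustrative assumptions, not TEMERAIRE's actual data layout.

```python
HUGEPAGE_SLOTS = 512   # toy granularity: 512 small pages per 2 MiB hugepage

class HugepageHeap:
    """Toy hugepage-aware placement: densest-first packing keeps live
    data on few hugepages (high coverage, low fragmentation)."""

    def __init__(self):
        self.pages = []   # used-slot count per hugepage

    def alloc(self):
        # Prefer the most-used hugepage that still has free slots.
        candidates = [i for i, u in enumerate(self.pages)
                      if u < HUGEPAGE_SLOTS]
        if candidates:
            i = max(candidates, key=lambda i: self.pages[i])
        else:
            self.pages.append(0)          # only then back a new hugepage
            i = len(self.pages) - 1
        self.pages[i] += 1
        return i
```

With densest-first packing, 600 small allocations land on exactly two hugepages (512 + 88); a policy that spread them evenly could touch many more, each mostly empty but pinned.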
Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild
Danner Stodolsky
Jeff Calow
Jeremy Dorfman
Clint Smullen
Aki Kuusela
Aaron James Laursen
Alex Ramirez
Amir Salek
Anna Cheung
Ben Gelb
Brian Fosco
Cho Mon Kyaw
Dake He
David Alexander Munday
David Wickeraad
Devin Persaud
Don Stark
Elisha Indupalli
Fong Lou
Hon Kwan Wu
In Suk Chong
Indira Jayaram
Jia Feng
JP Maaninen
Maire Mahony
Mark Steven Wachsler
Mercedes Tan
Niranjani Dasharathi
Poonacha Kongetira
Prakash Chauhan
Raghuraman Balasubramanian
Ramon Macias
Richard Ho
Rob Springer
Roy W Huffman
Sandeep Bhatia
Sathish K Sekar
Srikanth Muroor
Ville-Mikko Rautio
Yolanda Ripley
Yoshiaki Hase
Yuan Li
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA (2021), pp. 600-615
Video sharing (e.g., YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and – with the slowing of Moore's law – specialized hardware accelerators to deliver more computing at higher efficiencies. This paper describes the design and deployment, at scale, of a new accelerator targeted at warehouse-scale video transcoding. We present our hardware design including a new accelerator building block – the video coding unit (VCU) – and discuss key design trade-offs for balanced systems at data center scale and co-designing accelerators with large-scale distributed software systems. We evaluate these accelerators "in the wild," serving live data center jobs, demonstrating 20-33x improved efficiency over our prior well-tuned non-accelerated baseline. Our design also enables effective adaptation to changing bottlenecks, improved failure management, and new workload capabilities not otherwise possible with prior systems. To the best of our knowledge, this is the first work to discuss video acceleration at scale in large warehouse-scale environments.
A Hierarchical Neural Model of Data Prefetching
Zhan Shi
Akanksha Jain
Calvin Lin
Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2021)
This paper presents Voyager, a novel neural network for data prefetching. Unlike previous neural models for prefetching, which are limited to learning delta correlations, our model can also learn address correlations, which are important for prefetching irregular sequences of memory accesses. The key to our solution is its hierarchical structure that separates addresses into pages and offsets and that introduces a mechanism for learning important relations among pages and offsets.
Voyager provides significant prediction benefits over current data prefetchers. For a set of irregular programs from the SPEC 2006 and GAP benchmark suites, Voyager sees an average IPC improvement of 41.6% over a system with no prefetcher, compared with 21.7% and 28.2%, respectively, for idealized Domino and ISB prefetchers. We also find that for two commercial workloads for which current data prefetchers see very little benefit, Voyager dramatically improves both accuracy and coverage.
At present, slow training and prediction preclude neural models from being practically used in hardware, but Voyager’s overheads are significantly lower—in every dimension—than those of previous neural models. For example, computation cost is reduced by 15-20×, and storage overhead is reduced by 110-200×. Thus, Voyager represents a significant step towards a practical neural prefetcher.
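The hierarchical page/offset factoring at the heart of the model is simple to state concretely: a byte address is split into a page identifier (a learnable vocabulary) and a bounded offset. The 12-bit page size below is an illustrative assumption for the sketch.

```python
PAGE_BITS = 12   # assume 4 KiB pages for this sketch

def split_address(addr):
    """Factor a byte address into (page, offset) — the two components a
    Voyager-style hierarchical model predicts separately. Pages form a
    modest vocabulary to embed; offsets are a fixed set of 2**PAGE_BITS."""
    return addr >> PAGE_BITS, addr & ((1 << PAGE_BITS) - 1)

def join_address(page, offset):
    """Recombine the model's two predictions into a prefetch address."""
    return (page << PAGE_BITS) | offset
```

Splitting this way is what keeps the output space tractable: predicting raw 64-bit addresses directly would require an astronomically large vocabulary, while pages and offsets each stay learnable.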
Cores that don't count
Proc. 18th Workshop on Hot Topics in Operating Systems (HotOS 2021)
We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation.
We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem -- one that will require collaboration between hardware designers, processor vendors, and systems software architects.
This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause.
Thunderbolt: Throughput-Optimized, Quality-of-Service-Aware Power Capping at Scale
Shaohong Li
Sreekumar Kodakara
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), USENIX Association (2020), pp. 1241-1255
As the demand for data center capacity continues to grow, hyperscale providers have used power oversubscription to increase efficiency and reduce costs. Power oversubscription requires power capping systems to smooth out the spikes that risk overloading power equipment by throttling workloads. Modern compute clusters run latency-sensitive serving and throughput-oriented batch workloads on the same servers, provisioning resources to ensure low latency for the former while using the latter to achieve high server utilization. When power capping occurs, it is desirable to maintain low latency for serving tasks and throttle the throughput of batch tasks. To achieve this, we seek a system that can gracefully throttle batch workloads and has task-level quality-of-service (QoS) differentiation.
In this paper we present Thunderbolt, a hardware-agnostic power capping system that ensures safe power oversubscription while minimizing impact on both long-running throughput-oriented tasks and latency-sensitive tasks. It uses a two-threshold, randomized unthrottling/multiplicative decrease control policy to ensure power safety with minimized performance degradation. It leverages the Linux kernel's CPU bandwidth control feature to achieve task-level QoS-aware throttling. It is robust even in the face of power telemetry unavailability. Evaluation results at the node and cluster levels demonstrate the system's responsiveness, effectiveness for reducing power, capability of QoS differentiation, and minimal impact on latency and task health. We have deployed this system at scale, in multiple production clusters. As a result, we enabled power oversubscription gains of 9%--25%, where none was previously possible.
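The two-threshold, randomized unthrottling/multiplicative decrease policy can be sketched as a small control loop: cut the batch CPU-bandwidth cap multiplicatively near the power limit, and restore it probabilistically in the safe zone so many machines do not unthrottle in lockstep. The thresholds, step sizes, and floor value here are illustrative assumptions, not production settings.

```python
import random

class ThunderboltPolicy:
    """Toy two-threshold randomized-unthrottling / multiplicative-decrease
    controller. `batch_share` models the CPU-bandwidth cap applied to
    batch tasks only; latency-sensitive serving tasks are never throttled
    (the QoS differentiation the paper describes)."""

    def __init__(self, cap_w, rng=random.random):
        self.cap_w = cap_w
        self.batch_share = 1.0      # fraction of CPU batch tasks may use
        self.rng = rng              # injectable for deterministic testing

    def step(self, power_w):
        if power_w >= 0.99 * self.cap_w:
            # Near the breaker limit: cut batch bandwidth multiplicatively.
            self.batch_share = max(0.1, self.batch_share / 2)
        elif power_w <= 0.90 * self.cap_w:
            # Safe zone: unthrottle with small probability per interval so
            # machines restore load gradually, not all at once.
            if self.rng() < 0.25:
                self.batch_share = min(1.0, self.batch_share + 0.1)
        # Between the two thresholds: hold steady (hysteresis band).
        return self.batch_share
```

The randomized additive increase is the key safety choice: deterministic unthrottling across thousands of machines would spike power back above the cap the moment it clears.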
Classifying Memory Access Patterns for Prefetching
Heiner Litz
Christos Kozyrakis
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery (2020), 513–526
Prefetching is a well-studied technique for addressing the memory access stall time of contemporary microprocessors. However, despite a large body of related work, the memory access behavior of applications is not well understood, and it remains difficult to predict whether a particular application will benefit from a given prefetcher technique. In this work we propose a novel methodology to classify the memory access patterns of applications, enabling well-informed reasoning about the applicability of a certain prefetcher. Our approach leverages instruction dataflow information to uncover a wide range of access patterns, including arbitrary combinations of offsets and indirection. These combinations, or prefetch kernels, represent reuse, strides, reference locality, and complex address generation. By determining the complexity and frequency of these access patterns, we enable reasoning about prefetcher timeliness and criticality, exposing the limitations of existing prefetchers today. Moreover, using these kernels, we are able to compute the next address for the majority of top-missing instructions, and we propose a software prefetch injection methodology that is able to outperform state-of-the-art hardware prefetchers.
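A flavor of this classification can be shown on the simplest kernel shapes using address deltas alone. This is a deliberately reduced sketch: the paper's methodology classifies instruction dataflow (including indirection), not just the raw address stream, and the category names and 50% cutoff below are assumptions for illustration.

```python
from collections import Counter

def classify_pattern(addresses):
    """Toy classifier for a per-instruction miss-address stream,
    covering only the simplest "prefetch kernel" shapes."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return "unknown"            # too short to classify
    top, count = Counter(deltas).most_common(1)[0]
    if count == len(deltas):
        # One delta explains everything: stride 0 is repeated reuse.
        return "constant-stride" if top else "reuse"
    if count / len(deltas) > 0.5:
        return "mostly-stride"      # a next-line/stride prefetcher helps
    return "complex"                # needs dataflow analysis to predict
```

A stream like `[0, 64, 128, 192]` classifies as constant-stride (trivial for hardware prefetchers), while an irregular stream falls into the "complex" bucket that motivates the paper's dataflow-based kernels.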
Autonomous Warehouse-Scale Computers
Proceedings of the 57th Annual Design Automation Conference 2020, Association for Computing Machinery, New York, NY, USA
Modern Warehouse-Scale Computers (WSCs), composed of many generations of servers and a myriad of domain specific accelerators, are becoming increasingly heterogeneous. Meanwhile, WSC workloads are also becoming incredibly diverse with different communication patterns, latency requirements, and service level objectives (SLOs). Insufficient understanding of the interactions between workload characteristics and the underlying machine architecture leads to resource over-provisioning, thereby significantly impacting the utilization of WSCs.
We present Autonomous Warehouse-Scale Computers, a new WSC design that leverages machine learning techniques and automation to improve job scheduling, resource management, and hardware-software co-optimization to address the increasing heterogeneity in WSC hardware and workloads. Our new design introduces two new layers in the WSC stack, namely: (a) a Software-Defined Server (SDS) Abstraction Layer which redefines the hardware-software boundary and provides greater control of the hardware to higher layers of the software stack through stable abstractions; and (b) a WSC Efficiency Layer which regularly monitors the resource usage of workloads on different hardware types, autonomously quantifies the performance sensitivity of workloads to key system configurations, and continuously improves scheduling decisions and hardware resource QoS policies to maximize cluster level performance. Our new WSC design has been successfully deployed across all WSCs at Google for several years now. The new WSC design improves throughput of workloads (by 7-10%, on average), increases utilization of hardware resources (up to 2x), and reduces performance variance for critical workloads (up to 25%).
A significant effort has been made to train neural networks that replicate algorithmic reasoning, but they often fail to learn the abstract concepts underlying these algorithms. This is evidenced by their inability to generalize to data distributions that are outside of their restricted training sets, namely larger inputs and unseen data. We study these generalization issues at the level of numerical subroutines that comprise common algorithms like sorting, shortest paths, and minimum spanning trees. First, we observe that transformer-based sequence-to-sequence models can learn subroutines like sorting a list of numbers, but their performance rapidly degrades as the length of lists grows beyond those found in the training set. We demonstrate that this is due to attention weights that lose fidelity with longer sequences, particularly when the input numbers are numerically similar. To address the issue, we propose a learned conditional masking mechanism, which enables the model to strongly generalize far outside of its training range with near-perfect accuracy on a variety of algorithms. Second, to generalize to unseen data, we show that encoding numbers with a binary representation leads to embeddings with rich structure once trained on downstream tasks like addition or multiplication. This allows the embedding to handle missing data by faithfully interpolating numbers not seen during training.
An Imitation Learning Approach for Cache Replacement
Evan Z. Liu
International Conference on Machine Learning (2020)
Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead and currently there is no known practical solution. As a result, current replacement policies typically resort to heuristics designed for specific common access patterns, which fail on more diverse and complex access patterns. In contrast, we propose an imitation learning approach to automatically learn cache access patterns by leveraging Belady’s, an oracle policy that computes the optimal eviction decision given the future cache accesses. While directly applying Belady’s is infeasible since the future is unknown, we train a policy conditioned only on past accesses that accurately approximates Belady’s even on diverse and complex access patterns, and call this approach PARROT. When evaluated on 13 of the most memory-intensive SPEC applications, PARROT increases cache hit rates by 20% over the current state of the art. In addition, on a large-scale web search benchmark, PARROT increases cache hit rates by 61% over a conventional LRU policy. We release a Gym environment to facilitate research in this area, as data is plentiful, and further advancements can have significant real-world impact.
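The oracle that supplies the imitation-learning labels is easy to state: given the future access sequence, evict the resident line reused farthest in the future (or never again). A minimal sketch of that label generator, with a toy list-based trace format assumed for illustration:

```python
def belady_evict(cache_lines, future_accesses):
    """Belady's oracle: among resident lines, evict the one whose next
    use is farthest in the future (never-reused lines win outright).
    Decisions like this, computed offline over a recorded trace, serve
    as the expert labels a learned policy can imitate."""
    def next_use(line):
        try:
            return future_accesses.index(line)
        except ValueError:
            return float("inf")    # never accessed again: ideal victim
    return max(cache_lines, key=next_use)
```

The oracle itself is unusable online (it peeks at the future), which is exactly why the paper trains a past-conditioned policy to approximate its decisions.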
As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn static representations of source code, these representations do not understand how code executes at runtime. In this work, we propose a new approach using GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related algorithm classification task.
Data Center Power Oversubscription with a Medium Voltage Power Plane and Priority-Aware Capping
David Landhuis
Shaohong Li
Darren De Ronde
Thomas Blooming
Anand Ramesh
James Kennedy
Christopher Malone
Jimmy Clidaras
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Association for Computing Machinery, New York, NY, USA (2020), 497–511
As major web and cloud service providers continue to accelerate the demand for new data center capacity worldwide, the importance of power oversubscription as a lever to reduce provisioning costs has never been greater. Building on insights from Google-scale deployments, we design and deploy a new architecture across hardware and software to improve power oversubscription significantly. Our design includes (1) a new medium voltage power plane to enable larger power sharing domains (across tens of MW of equipment) and (2) a scalable, fast, and robust power capping service coordinating multiple priorities of workload on every node. Over several years of production deployment, our co-design has enabled power oversubscription of 25% or higher, saving hundreds of millions of dollars of data center costs, while preserving the desired availability and performance of all workloads.
AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
Nayana Prasad Nagendra
David I. August
Christos Kozyrakis
Trivikram Krishnamurthy
Heiner Litz
International Symposium on Computer Architecture (ISCA) (2019)
The large instruction working sets of private and public cloud workloads lead to frequent instruction cache misses and costs in the millions of dollars. While prior work has identified the growing importance of this problem, to date, there has been little analysis of where the misses come from, and what the opportunities are to improve them. To address this challenge, this paper makes three contributions. First, we present the design and deployment of a new, always-on, fleet-wide monitoring system, AsmDB, that tracks front-end bottlenecks. AsmDB uses hardware support to collect bursty execution traces, fleet-wide temporal and spatial sampling, and sophisticated offline post-processing to construct full-program dynamic control-flow graphs. Second, based on a longitudinal analysis of AsmDB data from real-world online services, we present two detailed insights on the sources of front-end stalls: (1) cold code that is brought in along with hot code leads to significant cache fragmentation and a corresponding large number of instruction cache misses; (2) distant branches and calls that are not amenable to traditional cache locality or next-line prefetching strategies account for a large fraction of cache misses. Third, we prototype two optimizations that target these insights. For misses caused by fragmentation, we focus on memcmp, one of the hottest functions contributing to cache misses, and show how fine-grained layout optimizations lead to significant benefits. For misses at the targets of distant jumps, we propose new hardware support for software code prefetching and prototype a new feedback-directed compiler optimization that combines static program flow analysis with dynamic miss profiles to demonstrate significant benefits for several large warehouse-scale workloads. Improving upon prior work, our proposal avoids invasive hardware modifications by prefetching via software in an efficient and scalable way. 
Simulation results show that such an approach can eliminate up to 96% of instruction cache misses with negligible overheads.
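The profile-guided software prefetching idea in the abstract above can be illustrated with a toy model (everything here is hypothetical and simplified, not AsmDB's implementation): given a miss profile showing that a distant jump target misses, a feedback-directed pass inserts a software prefetch at a trigger instruction executed shortly before the jump.

```python
from collections import OrderedDict

class ICache:
    """Toy fully associative LRU instruction cache at cache-line granularity."""
    def __init__(self, lines=4):
        self.capacity = lines
        self.data = OrderedDict()
        self.misses = 0

    def touch(self, line, demand=True):
        if line in self.data:
            self.data.move_to_end(line)     # hit: refresh LRU position
            return
        if demand:
            self.misses += 1                # only demand fetches count as misses
        self.data[line] = True
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least-recently-used line

def run(trace, hints=None):
    """Replay a trace of instruction-line addresses. hints maps a trigger
    line to a distant target line that is software-prefetched when the
    trigger executes (the feedback-directed insertion)."""
    hints = hints or {}
    cache = ICache()
    for line in trace:
        cache.touch(line)                            # demand fetch
        if line in hints:
            cache.touch(hints[line], demand=False)   # prefetch ahead of the jump
    return cache.misses
```

In the trace `[0, 1, 2, 100]`, line 100 is a distant jump target that next-line prefetching cannot cover; a hint `{1: 100}` prefetches it while the fall-through path still executes, removing that miss.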
Kelp: QoS for Accelerators in Machine Learning Platforms
Haishan Zhu
Mattan Erez
International Symposium on High Performance Computer Architecture (2019)
Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly discount the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning for large models.
In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause 40% average performance degradation when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks, and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities in future architecture designs.
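A minimal sketch of the bandwidth-isolation feedback loop described above, under simplifying assumptions (an oversubscribed bus shared in proportion to demand, a single best-effort cap); this is an illustration of the control idea, not Kelp's actual actuation mechanism:

```python
def achieved_hp_bw(hp_demand, be_use, total_bw):
    """Bandwidth the high-priority (accelerated) task obtains when the memory
    bus is oversubscribed and shared in proportion to demand (a simplification)."""
    total = hp_demand + be_use
    if total <= total_bw:
        return hp_demand                    # no contention: demand is satisfied
    return total_bw * hp_demand / total

def bandwidth_controller(hp_demand, be_demand, total_bw, hp_target, step=1.0):
    """Feedback loop: tighten the best-effort cap until the high-priority task
    reaches its bandwidth target. Units are arbitrary (e.g. GB/s)."""
    be_cap = float(be_demand)
    while be_cap > 0:
        hp = achieved_hp_bw(hp_demand, min(be_demand, be_cap), total_bw)
        if hp >= hp_target:
            break
        be_cap = max(be_cap - step, 0.0)    # throttle best-effort traffic
    return be_cap, achieved_hp_bw(hp_demand, min(be_demand, be_cap), total_bw)
```

For example, with a 20-unit bus, a high-priority demand of 10, and a best-effort demand of 20, the controller walks the cap down until the accelerated task gets its full 10 units.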
Software-defined far memory in warehouse-scale computers
Andres Lagar-Cavilla
Suleiman Souhlal
Neha Agarwal
Radoslaw Burny
Shakeel Butt
Junaid Shahid
Greg Thelen
Kamil Adam Yurtsever
Yu Zhao
International Conference on Architectural Support for Programming Languages and Operating Systems (2019)
Increasing memory demand and slowdown in technology scaling pose important challenges to total cost of ownership (TCO) of warehouse-scale computers (WSCs). One promising idea to reduce the memory TCO is to add a cheaper, but slower, "far memory" tier and use it to store infrequently accessed (or cold) data. However, introducing a far memory tier brings new challenges around dynamically responding to workload diversity and churn, minimizing stranding of capacity, and addressing brownfield (legacy) deployments.
We present a novel software-defined approach to far memory that proactively compresses cold memory pages to effectively create a far memory tier in software. Our end-to-end system design encompasses new methods to define performance service-level objectives (SLOs), a mechanism to identify cold memory pages while meeting the SLO, and our implementation in the OS kernel and node agent. Additionally, we design learning-based autotuning to periodically adapt our design to fleet-wide changes without a human in the loop. Our system has been successfully deployed across Google's WSC since 2016, serving thousands of production services. Our software-defined far memory is significantly cheaper (67% or higher memory cost reduction) at relatively good access speeds (6 μs) and allows us to store a significant fraction of infrequently accessed data (on average, 20%), translating to significant TCO savings at warehouse scale.
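The demote/promote path described above — find pages cold for longer than a threshold, compress them into a software far-memory tier, and decompress on access — can be sketched as follows (zlib stands in for the production compressor; page size and threshold are illustrative):

```python
import zlib

PAGE = 4096  # bytes per page (illustrative)

def find_cold_pages(last_access, now, threshold_s):
    """Pages untouched for at least threshold_s seconds are far-memory candidates."""
    return [pid for pid, t in last_access.items() if now - t >= threshold_s]

def move_to_far_tier(pages, cold_ids):
    """'Demote' cold pages by compressing them, creating the software far
    memory tier; returns the far tier and the bytes of DRAM reclaimed."""
    far, reclaimed = {}, 0
    for pid in cold_ids:
        raw = pages.pop(pid)
        far[pid] = zlib.compress(raw)
        reclaimed += len(raw) - len(far[pid])
    return far, reclaimed

def fetch(pid, pages, far):
    """Access path: decompress on demand if the page was demoted (the cost
    that the SLO mechanism keeps bounded)."""
    if pid in far:
        pages[pid] = zlib.decompress(far.pop(pid))
    return pages[pid]
```

A zero-filled page compresses to a few bytes, so demoting it reclaims almost its full 4 KiB; a later `fetch` transparently promotes it back.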
Memory Hierarchy for Web Search
Jung Ho Ahn
Christos Kozyrakis
International Symposium on High Performance Computer Architecture (HPCA) (2018)
Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature have precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling – all driven by production workloads servicing real-world user requests.
Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.
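The trade-off the abstract describes — spending fewer transistors on an inefficient L3 while adding a latency-optimized on-package L4 — can be reasoned about with a standard average-memory-access-time (AMAT) model; the latencies and miss rates below are made-up illustrations, not the paper's measurements:

```python
def amat(levels, mem_latency):
    """Average memory access time (cycles). levels is a list of
    (hit_latency, local_miss_rate) pairs from L1 down; an access pays each
    level's hit latency with the probability the request reaches that level."""
    cycles, p_reach = 0.0, 1.0
    for hit_latency, miss_rate in levels:
        cycles += p_reach * hit_latency
        p_reach *= miss_rate
    return cycles + p_reach * mem_latency

# Made-up latencies (cycles) and local miss rates for illustration only.
three_level = [(2, 0.10), (12, 0.50), (40, 0.30)]   # L1, L2, L3
with_l4 = three_level + [(60, 0.40)]                # + on-package eDRAM L4
```

With these illustrative numbers, adding the L4 lowers AMAT because most L3 misses are now served in 60 cycles instead of paying the full 200-cycle trip to DRAM.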
Learning Memory Access Patterns
Jamie Alexander Smith
Heiner Litz
Christos Kozyrakis
ICML (2018)
The explosion in workload complexity and the recent slowdown in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network-based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.
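The paper's framing of prefetching as an n-gram problem can be made concrete with a table-based bigram model over address deltas — the classical style of baseline the RNN is positioned as a drop-in replacement for (this sketch is illustrative, not the paper's implementation):

```python
from collections import Counter, defaultdict

class BigramDeltaPrefetcher:
    """Bigram (order-1 n-gram) model over address deltas: predict the next
    delta as the most frequent continuation of the previous delta."""
    def __init__(self):
        self.table = defaultdict(Counter)   # prev delta -> next-delta counts
        self.prev_addr = None
        self.prev_delta = None

    def access(self, addr):
        """Observe one access; return the predicted next address, or None."""
        prediction = None
        if self.prev_addr is not None:
            delta = addr - self.prev_addr
            if self.prev_delta is not None:
                self.table[self.prev_delta][delta] += 1   # train the bigram
            self.prev_delta = delta
            if self.table[delta]:
                best_delta = self.table[delta].most_common(1)[0][0]
                prediction = addr + best_delta            # issue prefetch here
        self.prev_addr = addr
        return prediction
```

On a strided stream (deltas of +8), the model starts predicting correctly as soon as it has seen one delta-to-delta transition; an RNN generalizes the same idea to much longer and irregular histories.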
Improving Resource Efficiency at Scale with Heracles
Christos Kozyrakis
ACM Transactions on Computer Systems (TOCS), vol. 34 (2016), 6:1-6:33
User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With the slowdown in technology scaling caused by the sunsetting of Moore’s law, it becomes important to address this opportunity.
We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
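The controller policy described above — grow the best-effort share when the latency-critical service has slack, revoke aggressively on an SLO violation — has roughly this shape (the 0.8 slack threshold, halving, and one-core step are invented for illustration, not Heracles' actual parameters):

```python
def controller_step(measured_latency, slo_latency, be_cores, max_be_cores):
    """One decision of a Heracles-style feedback controller for the number
    of cores granted to best-effort tasks."""
    if measured_latency > slo_latency:
        return be_cores // 2                      # violation: back off fast
    if measured_latency < 0.8 * slo_latency:
        return min(max_be_cores, be_cores + 1)    # clear slack: grow slowly
    return be_cores                               # near the SLO: hold steady
```

Asymmetry is the point: growth is incremental so colocation never risks the latency target, while revocation is immediate so a violation is corrected within one control interval.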
Heracles: Improving Resource Efficiency at Scale
Christos Kozyrakis
Proceedings of the 42nd Annual International Symposium on Computer Architecture (2015)
User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy-efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity.
We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
Profiling a warehouse-scale computer
Juan Darago
Kim Hazelwood
Gu-Yeon Wei
David Brooks
ISCA '15 Proceedings of the 42nd Annual International Symposium on Computer Architecture, ACM (2015), pp. 158-169
With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications.
We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
Although the field of datacenter computing is arguably still in its relative infancy, a sizable body of work from both academia and industry is already available and some consistent technological trends have begun to emerge. This special issue presents a small sample of the work underway by researchers and professionals in this new field. The selection of articles presented reflects the key role that hardware-software codesign plays in the development of effective datacenter-scale computer systems.
Power Management of Datacenter Workloads Using Per-Core Power Gating
Jacob Leverich
Matteo Monchiero
Vanish Talwar
Christos Kozyrakis
Computer Architecture Letters, vol. 8 (2009), pp. 48-51
Models and Metrics to Enable Energy-Efficiency Optimizations
Suzanne Rivoire
Mehul A. Shah
Christos Kozyrakis
Justin Meza
IEEE Computer, vol. 40 (2007), pp. 39-48
JouleSort: a balanced energy-efficiency benchmark
Suzanne Rivoire
Mehul A. Shah
Christos Kozyrakis
SIGMOD Conference (2007), pp. 365-376
The new (system) balance of power and opportunities for optimizations
ISLPED (2014), pp. 331-332
An FPGA memcached appliance
Sai Rahul Chalamalasetti
Kevin T. Lim
Mitch Wright
Alvin AuYoung
Martin Margala
FPGA (2013), pp. 245-254
Thin servers with smart pipes: designing SoC accelerators for memcached
Kevin T. Lim
David Meisner
Ali G. Saidi
Thomas F. Wenisch
ISCA (2013), pp. 36-47
Hardware acceleration for similarity measurement in natural language processing
Prateek Tandon
Vahed Qazvinian
Ronald G. Dreslinski
Thomas F. Wenisch
ISLPED (2013), pp. 409-414
Consistent, durable, and safe memory management for byte-addressable non volatile main memory
Iulian Moraru
David G. Andersen
Michael Kaminsky
Niraj Tolia
Nathan L. Binkert
TRIOS@SOSP (2013), pp. 1
Meet the walkers: accelerating index traversals for in-memory databases
Yusuf Onur Koçberber
Boris Grot
Javier Picorel
Babak Falsafi
Kevin T. Lim
MICRO (2013), pp. 468-479
(Re)Designing Data-Centric Data Centers
Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management
Justin Meza
HanBin Yoon
Onur Mutlu
Computer Architecture Letters, vol. 11 (2012), pp. 61-64
Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies
Doe Hyun Yoon
Tobin Gonzalez
Robert S. Schreiber
Conf. Computing Frontiers (2012), pp. 95-102
Totally green: evaluating and designing servers for lifecycle environmental impact
Justin Meza
Amip Shah
Rocky Shih
Cullen Bash
ASPLOS (2012), pp. 25-36
A limits study of benefits from nanostore-based future data-centric system architectures
Trevor N. Mudge
David Roberts
Mehul A. Shah
Kevin T. Lim
Conf. Computing Frontiers (2012), pp. 33-42
BOOM: Enabling mobile memory based low-power server DIMMs
Free-p: A Practical End-to-End Nonvolatile Memory Protection Mechanism
Doe Hyun Yoon
Naveen Muralimanohar
Norman P. Jouppi
Mattan Erez
IEEE Micro, vol. 32 (2012), pp. 79-87
System-level implications of disaggregated memory
Kevin T. Lim
Yoshio Turner
Jose Renato Santos
Alvin AuYoung
Thomas F. Wenisch
HPCA (2012), pp. 189-200
Evaluating FPGA-acceleration for real-time unstructured search
Sai Rahul Chalamalasetti
Martin Margala
Wim Vanderbauwhede
Mitch Wright
ISPASS (2012), pp. 200-209
Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems
Vishakha Gupta
Karsten Schwan
Niraj Tolia
Vanish Talwar
USENIX Annual Technical Conference (2011)
Saving the World, One Server at a Time, Together
Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory
From Microprocessors to Nanostores: Rethinking Data-Centric Systems
IEEE Computer, vol. 44 (2011), pp. 39-48
Topology-aware resource allocation for data-intensive workloads
Gunho Lee
Niraj Tolia
Randy H. Katz
Computer Communication Review, vol. 41 (2011), pp. 120-124
On energy efficiency for enterprise and data center networks
Priya Mahadevan
Sujata Banerjee
Puneet Sharma
Amip Shah
IEEE Communications Magazine, vol. 49 (2011), pp. 94-100
Loosely coupled coordinated management in virtualized data centers
Sanjay Kumar
Vanish Talwar
Vibhore Kumar
Karsten Schwan
Cluster Computing, vol. 14 (2011), pp. 259-274
FREE-p: Protecting non-volatile memory against both hard and soft errors
Doe Hyun Yoon
Naveen Muralimanohar
Norman P. Jouppi
Mattan Erez
HPCA (2011), pp. 466-477
Everything as a Service: Powering the New Information Economy
Prith Banerjee
Rich Friedrich
Cullen Bash
P. Goldsack
Bernardo A. Huberman
J. Manley
Chandrakant D. Patel
A. Veitch
IEEE Computer, vol. 44 (2011), pp. 36-43
System-level integrated server architectures for scale-out datacenters
Sheng Li
Kevin T. Lim
Paolo Faraboschi
Norman P. Jouppi
MICRO (2011), pp. 260-271
Guest Editors' Introduction: Datacenter-Scale Computing
Online detection of utility cloud anomalies using metric distributions
Recipe for efficiency: principles of power-aware computing
Commun. ACM, vol. 53 (2010), pp. 60-67
sNICh: efficient last hop networking in the data center
Kaushik Kumar Ram
Jayaram Mudigonda
Alan L. Cox
Scott Rixner
Jose Renato Santos
ANCS (2010), pp. 26
vManage: loosely coupled platform and virtualization management in data centers
Sanjay Kumar
Vanish Talwar
Vibhore Kumar
Karsten Schwan
ICAC (2009), pp. 127-136
Server Designs for Warehouse-Computing Environments
Kevin T. Lim
Chandrakant D. Patel
Trevor N. Mudge
Steven K. Reinhardt
IEEE Micro, vol. 29 (2009), pp. 41-49
Models and Metrics for Energy-Efficient Computing
Suzanne Rivoire
Justin D. Moore
Advances in Computers, vol. 75 (2009), pp. 159-233
Disaggregated memory for expansion and sharing in blade servers
Kevin T. Lim
Trevor N. Mudge
Steven K. Reinhardt
Thomas F. Wenisch
ISCA (2009), pp. 267-278
Sustainable data centers: enabled by supply and demand side management
Industrial perspectives panel
PPOPP (2009), pp. 197
A Power Benchmarking Framework for Network Devices
Priya Mahadevan
Puneet Sharma
Sujata Banerjee
Networking (2009), pp. 795-808
Industrial perspectives panel
HPCA (2009), pp. 325-326
Energy Efficiency: The New Holy Grail of Data Management Systems Research
Stavros Harizopoulos
Mehul A. Shah
Justin Meza
CoRR, vol. abs/0909.1784 (2009)
Tracking the power in an enterprise decision support system
Justin Meza
Mehul A. Shah
Mike Fitzner
Judson Veazey
ISLPED (2009), pp. 261-266
No "power" struggles: coordinated multi-level power management for the data center
Ramya Raghavendra
Vanish Talwar
Zhikui Wang
Xiaoyun Zhu
ASPLOS (2008), pp. 48-59
Fabric convergence implications on systems architecture
Using Asymmetric Single-ISA CMPs to Save Energy on Operating Systems
Jayaram Mudigonda
Nathan L. Binkert
Vanish Talwar
IEEE Micro, vol. 28 (2008), pp. 26-41
Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments
Kevin T. Lim
Chandrakant D. Patel
Trevor N. Mudge
Steven K. Reinhardt
ISCA (2008), pp. 315-326
Delivering Energy Proportionality with Non Energy-Proportional Systems - Optimizing the Ensemble
Niraj Tolia
Zhikui Wang
Manish Marwah
Cullen Bash
Xiaoyun Zhu
HotPower (2008)
Power management from cores to datacenters: where are we going to get the next ten-fold improvements?
ISLPED (2008), pp. 139-140
Active storage revisited: the case for power and performance benefits for unstructured data processing applications
Clinton Wills Smullen IV
Shahrukh Rohinton Tarapore
Sudhanva Gurumurthi
Mustafa Uysal
Conf. Computing Frontiers (2008), pp. 293-304
Implementing high availability memory with a duplication cache
Nidhi Aggarwal
James E. Smith
Kewal K. Saluja
Norman P. Jouppi
MICRO (2008), pp. 71-82
General-purpose blade infrastructure for configurable system architectures
Kevin Leigh
Jaspal Subhlok
Distributed and Parallel Databases, vol. 21 (2007), pp. 115-144
Motivating co-ordination of power management solutions in data centers
Ramya Raghavendra
Vanish Talwar
Xiaoyun Zhu
Zhikui Wang
CLUSTER (2007), pp. 473
Cost-aware scheduling for heterogeneous enterprise machines (CASH'EM)
Isolation in Commodity Multicore Processors
Nidhi Aggarwal
Norman P. Jouppi
James E. Smith
IEEE Computer, vol. 40 (2007), pp. 49-59
Configurable isolation: building high availability systems with commodity multi-core processors
Ensemble-level Power Management for Dense Blade Servers
Energy-Aware User Interfaces and Energy-Adaptive Displays
Erik Geelhoed
Meera Manahan
Ken Nicholas
IEEE Computer, vol. 39 (2006), pp. 31-38
Weatherman: Automated, Online and Predictive Thermal Mapping and Management for Data Centers
IT Infrastructure in Emerging Markets: Arguing for an End-to-End Perspective
Ajay Gupta
Prashant Sarin
Mehul A. Shah
IEEE Pervasive Computing, vol. 5 (2006), pp. 24-31
Enterprise IT Trends and Implications for Architecture Research
Heterogeneous Chip Multiprocessors
Rakesh Kumar
Dean M. Tullsen
Norman P. Jouppi
IEEE Computer, vol. 38 (2005), pp. 32-38
Making Scheduling "Cool": Temperature-Aware Workload Placement in Data Centers
Justin D. Moore
Jeffrey S. Chase
Ratnesh K. Sharma
USENIX Annual Technical Conference, General Track (2005), pp. 61-75
Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance
Rakesh Kumar
Dean M. Tullsen
Norman P. Jouppi
Keith I. Farkas
ISCA (2004), pp. 64-75
Energy-aware user interfaces: an evaluation of user acceptance
Tim Harter
Sander Vroegindeweij
Erik Geelhoed
Meera Manahan
CHI (2004), pp. 199-206
Investigating the Relationship Between Battery Life and User Acceptance of Dynamic, Energy-Aware Interfaces on Handhelds
Lance Bloom
Rachel Eardley
Erik Geelhoed
Meera Manahan
Mobile HCI (2004), pp. 13-24
Energy-Adaptive Display System Designs for Future Mobile Environments
Processor Power Reduction Via Single-ISA Heterogeneous Multi-Core Architectures
Rakesh Kumar
Keith I. Farkas
Norman P. Jouppi
Dean M. Tullsen
Computer Architecture Letters, vol. 2 (2003)
Energy Consumption in Mobile Devices: Why Future Systems Need Requirements-Aware Energy Scale-Down
Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
Rakesh Kumar
Keith I. Farkas
Norman P. Jouppi
Dean M. Tullsen
MICRO (2003), pp. 81-92
Energy-Driven Statistical Sampling: Detecting Software Hotspots
Fay Chang
Keith I. Farkas
Workshop on Power Aware Computing Systems (PACS) (2002), pp. 110-129
Topological navigation and qualitative localization for indoor environment using multi-sensory perception
Jean-Bernard Hayet
Michel Devy
Seth Hutchinson
Frédéric Lerasle
Robotics and Autonomous Systems, vol. 41 (2002), pp. 137-144
Reconfigurable caches and their application to media processing
Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions
The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors
Vijay S. Pai
Hazim Abdel-Shafi
Sarita V. Adve
IEEE Trans. Computers, vol. 48 (1999), pp. 218-226
Performance of database workloads on shared-memory systems with out-of-order processors
Kourosh Gharachorloo
Sarita V. Adve
ASPLOS-VIII: Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, ACM, New York, NY, USA (1998), pp. 307-318
Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap Between Memory Consistency Models
The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems
The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology
An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors