Rama Govindaraju

Rama is a Director of Engineering at Google, where he leads the Systems Infrastructure Architecture team. Before joining Google, Rama was a Distinguished Engineer at IBM, responsible for software architecture at IBM's Supercomputing Lab, where he led the development of five generations of supercomputers. He received his MS and PhD in Computer Science from Rensselaer Polytechnic Institute in New York and his BE in Computer Science from BIT Mesra, Ranchi, India.
Authored Publications
    Cores that don't count
    Workshop on Hot Topics in Operating Systems (HotOS) (2021)
    We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated with specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the correlated disruption they cause, often enough to see them as a distinct problem, one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call to action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
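    As a concrete illustration of the "better detection" direction the abstract mentions, the sketch below re-runs a deterministic computation and compares the results; a mismatch on identical inputs hints at a computational error rather than a logic bug. This is only a minimal sketch of the general idea, not a mechanism from the paper, and every name in it is made up for illustration. In a real fleet, such checks would more likely live in dedicated screening and background-scanning tools than in application code.

    ```python
    # Illustrative only: redundant execution of a pure computation as a crude
    # detector of silent data corruption. Not the paper's mechanism.
    import hashlib
    import os

    def run_twice_and_compare(fn, *args, **kwargs):
        """Run a deterministic function twice and flag mismatching results."""
        first = fn(*args, **kwargs)
        second = fn(*args, **kwargs)
        if first != second:
            # A mismatch on identical inputs points to a computational error
            # (e.g., a miscomputing core) rather than a logic bug.
            raise RuntimeError("possible silent data corruption detected")
        return first

    def checksum_block(data: bytes) -> str:
        """A deterministic computation to double-check."""
        return hashlib.sha256(data).hexdigest()

    if __name__ == "__main__":
        payload = os.urandom(1 << 20)  # 1 MiB of arbitrary test data
        digest = run_twice_and_compare(checksum_block, payload)
        print("checksums agree:", digest[:16])
    ```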
    Kelp: QoS for Accelerators in Machine Learning Platforms
    Haishan Zhu
    Mattan Erez
    International Symposium on High Performance Computer Architecture (2019)
    Development and deployment of machine learning (ML) accelerators in Warehouse Scale Computers (WSCs) demand significant capital investments and engineering efforts. However, even though heavy computation can be offloaded to the accelerators, applications often depend on the host system for various supporting tasks. As a result, contention on host resources, such as memory bandwidth, can significantly reduce the performance and efficiency gains of accelerators. The impact of performance interference is further amplified in distributed learning for large models. In this work, we study the performance of four production machine learning workloads on three accelerator platforms. Our experiments show that these workloads are highly sensitive to host memory bandwidth contention, which can cause an average performance degradation of 40% when left unmanaged. To tackle this problem, we design and implement Kelp, a software runtime that isolates high-priority accelerated ML tasks from memory resource interference. We evaluate Kelp with both production and artificial aggressor workloads, and compare its effectiveness with previously proposed solutions. Our evaluation shows that Kelp is effective in mitigating performance degradation of the accelerated tasks and improves performance by 24% on average. Compared to previous work, Kelp reduces performance degradation of ML tasks by 7% and improves system efficiency by 17%. Our results further expose opportunities for future architecture designs.
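    To make the idea of a host-side QoS runtime more concrete, here is a minimal, hypothetical control loop: it watches host memory bandwidth and shrinks the CPU quota of best-effort tasks when contention is high. It is not Kelp's actual design; the threshold, bandwidth probe, and cgroup path are all assumptions, and measurement and actuation are simulated so the sketch stays self-contained.

    ```python
    # Illustrative sketch in the spirit of a host-side QoS runtime; NOT Kelp's
    # implementation. The bandwidth probe is simulated and the cgroup write is
    # a dry run.
    import random
    import time

    BW_LIMIT_GBPS = 60.0   # hypothetical host memory-bandwidth contention threshold
    CGROUP_CPU_MAX = "/sys/fs/cgroup/best_effort/cpu.max"  # assumed cgroup v2 path

    def read_host_memory_bandwidth_gbps() -> float:
        """Placeholder: real measurements would come from memory-controller counters."""
        return random.uniform(30.0, 90.0)

    def set_best_effort_cpu_quota(fraction: float, period_us: int = 100_000) -> None:
        """Dry run: show the cgroup v2 cpu.max value that would throttle best-effort tasks."""
        quota_us = max(1_000, int(fraction * period_us))
        print(f"would write '{quota_us} {period_us}' to {CGROUP_CPU_MAX}")

    def control_step(quota: float) -> float:
        """Shrink the best-effort CPU quota under contention, grow it back otherwise."""
        bw = read_host_memory_bandwidth_gbps()
        if bw > BW_LIMIT_GBPS:
            quota = max(0.1, quota * 0.8)    # back off the aggressors
        else:
            quota = min(1.0, quota + 0.05)   # slowly give CPU back
        set_best_effort_cpu_quota(quota)
        return quota

    if __name__ == "__main__":
        quota = 1.0
        for _ in range(5):
            quota = control_step(quota)
            time.sleep(0.1)
    ```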
    WSMeter: A Fast, Accurate, and Low-Cost Performance Evaluation for Warehouse-Scale Computers
    Jaewon Lee
    Changkyu Kim
    Jangwoo Kim
    Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018)
    A warehouse-scale computer (WSC) is a vast collection of tightly networked computers that provides modern internet services and is becoming increasingly popular as the most cost-effective way to serve users at global scale. It is, however, extremely difficult to accurately measure the holistic performance of a WSC. Existing load-testing benchmarks are tailored towards a dedicated-machine model and do not address shared infrastructure environments. Evaluating the performance of a live, shared production WSC presents many challenges: the lack of holistic performance metrics, high evaluation costs, and the potential service disruptions an evaluation may cause. WSC providers and customers need a cost-effective methodology to accurately evaluate the holistic performance of their platforms and hosted services. To address these challenges, we propose WSMeter, a cost-effective framework and methodology to accurately evaluate the holistic performance of a WSC in a live production environment. We define a new performance metric that accurately reflects the holistic performance of a WSC running a wide variety of unevenly distributed jobs, and we propose a model that statistically embraces the performance variance amplified by co-located jobs, so that holistic performance can be evaluated at minimal cost. To validate the approach, we analyze two real-world use cases and show that WSMeter accurately discerns 7% and 1% performance improvements using only 0.9% and 6.6% of the machines in the WSC, respectively. A Cloud customer case study further shows how WSMeter helped quantify the performance benefits of a service software optimization at minimal cost.
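    The kind of holistic metric the abstract describes can be illustrated with a small numerical sketch: aggregate per-job performance, weighted by each job's share of resources, over a sampled subset of machines. The specific weighting, job names, and numbers below are assumptions made for illustration, not necessarily WSMeter's published definition.

    ```python
    # A minimal numerical sketch of a resource-weighted "holistic" performance
    # score; the weighting scheme and sample values are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class JobSample:
        name: str
        performance: float   # e.g., throughput normalized to a baseline run
        cpu_share: float     # fraction of the sampled machines' CPU it consumed

    def holistic_performance(samples: list[JobSample]) -> float:
        """Resource-weighted mean of normalized per-job performance."""
        total_share = sum(s.cpu_share for s in samples)
        return sum(s.performance * s.cpu_share for s in samples) / total_share

    if __name__ == "__main__":
        cluster = [
            JobSample("websearch", performance=1.07, cpu_share=0.50),
            JobSample("ads",       performance=1.01, cpu_share=0.30),
            JobSample("batch",     performance=0.98, cpu_share=0.20),
        ]
        print(f"holistic score: {holistic_performance(cluster):.3f}")
    ```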
    Improving Resource Efficiency at Scale with Heracles
    Christos Kozyrakis
    ACM Transactions on Computer Systems (TOCS), vol. 34 (2016), 6:1-6:33
    User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. As technology scaling slows with the end of Moore's law, it becomes increasingly important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
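    A minimal sketch of the feedback-control pattern described here: monitor the latency-critical (LC) service's tail latency, grow the best-effort (BE) share while there is latency slack, and revoke it when the slack disappears. This is an illustration of the general idea only; Heracles itself coordinates several hardware and software isolation mechanisms, and all values and function names below are hypothetical (measurements are simulated, the actuator only prints).

    ```python
    # Simplified latency-slack feedback loop for colocating BE tasks with an
    # LC service; an illustrative sketch, not Heracles' implementation.
    import random

    SLO_LATENCY_MS = 50.0
    TOTAL_CORES = 36

    def measure_lc_tail_latency_ms() -> float:
        """Placeholder: would query the LC service's tail (e.g., 99th-percentile) latency."""
        return random.uniform(30.0, 60.0)

    def set_best_effort_cores(n: int) -> None:
        """Placeholder: would repartition cores, e.g., via cpusets."""
        print(f"best-effort tasks get {n} cores")

    def control_step(be_cores: int) -> int:
        latency = measure_lc_tail_latency_ms()
        slack = (SLO_LATENCY_MS - latency) / SLO_LATENCY_MS
        if slack < 0.0:
            be_cores = 0                                    # SLO violated: revoke everything
        elif slack < 0.10:
            be_cores = max(0, be_cores - 2)                 # little slack: shrink BE
        else:
            be_cores = min(TOTAL_CORES - 2, be_cores + 1)   # comfortable slack: grow BE slowly
        set_best_effort_cores(be_cores)
        return be_cores

    if __name__ == "__main__":
        cores = 0
        for _ in range(10):
            cores = control_step(cores)
    ```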
    Heracles: Improving Resource Efficiency at Scale
    Christos Kozyrakis
    Proceedings of the 42nd Annual International Symposium on Computer Architecture (2015)
    User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
    Towards Energy Proportionality for Large-Scale Latency-Critical Workloads
    Christos Kozyrakis
    Proceedings of the 41st Annual International Symposium on Computer Architecture, ACM (2014)
    Reducing the energy footprint of warehouse-scale computer (WSC) systems is key to their affordability, yet difficult to achieve in practice. The lack of energy proportionality of typical WSC hardware and the fact that important workloads (such as search) require all servers to remain up regardless of traffic intensity render existing power management techniques ineffective at reducing WSC energy use. We present PEGASUS, a feedback-based controller that significantly improves the energy proportionality of WSC systems, as demonstrated by a real implementation in a Google search cluster. PEGASUS uses request latency statistics to dynamically adjust server power management limits in a fine-grain manner, running each server just fast enough to meet global service-level latency objectives. In large cluster experiments, PEGASUS reduces power consumption by up to 20%. We also estimate that a distributed version of PEGASUS can nearly double these savings.
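    The fine-grain power management loop described here can be sketched as a simple latency-driven controller: raise the server's power cap when request latency exceeds the service-level objective and lower it when there is slack. This is an illustrative sketch under assumed numbers, not PEGASUS's implementation; the latency probe is simulated and the actuator only prints (a real controller would program hardware power limits, e.g., RAPL caps, and read the service's actual latency statistics).

    ```python
    # Simplified latency-driven power-capping loop; an illustrative sketch
    # under assumed values, not PEGASUS itself.
    import random

    SLO_LATENCY_MS = 30.0
    MIN_WATTS, MAX_WATTS = 60.0, 120.0

    def measure_request_latency_ms() -> float:
        """Placeholder: would read recent request-latency statistics from the service."""
        return random.uniform(20.0, 40.0)

    def set_power_cap_watts(watts: float) -> None:
        """Placeholder: would write a hardware power limit for this server."""
        print(f"power cap -> {watts:.0f} W")

    def control_step(cap: float) -> float:
        latency = measure_request_latency_ms()
        if latency > SLO_LATENCY_MS:
            cap = min(MAX_WATTS, cap + 5.0)   # behind the SLO: let the server run faster
        else:
            cap = max(MIN_WATTS, cap - 1.0)   # ahead of the SLO: shave power
        set_power_cap_watts(cap)
        return cap

    if __name__ == "__main__":
        cap = MAX_WATTS
        for _ in range(10):
            cap = control_step(cap)
    ```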