WarpFlow: Exploring Petabytes of Space-Time Data

Catalin Teodor Popescu
Deepak Merugu
Giao Nguyen
Shiva Shivakumar
(2019)

Abstract

WarpFlow is a fast, interactive querying and processing system for big data, with special treatment for petabyte-scale spatio-temporal datasets. It processes and transforms rich, hierarchical data end-to-end (e.g., Protocol Buffers, a common data format at Google). WarpFlow speeds up three key metrics for data scientists: time-to-first-result, time-to-full-scale-result, and time-to-trained-model for machine learning (e.g., using TensorFlow). In this paper, we describe the architecture and implementation of WarpFlow. We present a custom data storage format optimized for fast, index-based selection of hierarchical data. We also describe a functional, extensible, pipelined query language (with operators such as map, filter, and aggregate) that greatly simplifies writing queries on big datasets with hierarchical data.
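To make the idea of a pipelined query over hierarchical records concrete, here is a minimal Python sketch. It is not WarpFlow's query language or API, which the abstract does not show. The `Pipeline` class, the `trips` records, and every field name are hypothetical, chosen only to illustrate how map, filter, and aggregate operators can be chained over nested data.

```python
from typing import Any, Callable, Dict, Hashable, Iterable


class Pipeline:
    """Hypothetical lazy pipeline for illustration; not WarpFlow's API."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = source

    def map(self, fn: Callable[[Any], Any]) -> "Pipeline":
        # Lazily transform each record.
        return Pipeline(fn(x) for x in self._source)

    def filter(self, pred: Callable[[Any], Any]) -> "Pipeline":
        # Lazily keep only records for which pred is truthy.
        return Pipeline(x for x in self._source if pred(x))

    def aggregate(
        self,
        key: Callable[[Any], Hashable],
        value: Callable[[Any], Any],
        combine: Callable[[Any, Any], Any],
        initial: Any,
    ) -> Dict[Hashable, Any]:
        # Terminal step: consume the stream and fold values per key.
        totals: Dict[Hashable, Any] = {}
        for x in self._source:
            k = key(x)
            totals[k] = combine(totals.get(k, initial), value(x))
        return totals


# Assumed toy data: each trip nests a list of timestamped points.
trips = [
    {"city": "Oslo", "points": [{"t": 0, "speed_kmh": 30.0},
                                {"t": 60, "speed_kmh": 50.0}]},
    {"city": "Oslo", "points": [{"t": 0, "speed_kmh": 80.0}]},
    {"city": "Lima", "points": [{"t": 0, "speed_kmh": 20.0},
                                {"t": 30, "speed_kmh": 40.0}]},
    {"city": "Lima", "points": []},
]


def summarize(trip: dict) -> dict:
    # Collapse the nested points into one mean speed per trip.
    speeds = [p["speed_kmh"] for p in trip["points"]]
    return {"city": trip["city"], "mean_speed": sum(speeds) / len(speeds)}


def trips_per_city(records: Iterable[dict], min_mean_speed: float) -> dict:
    # Count trips per city whose mean speed reaches the threshold.
    return (
        Pipeline(records)
        .filter(lambda t: t["points"])  # drop trips with no points
        .map(summarize)
        .filter(lambda s: s["mean_speed"] >= min_mean_speed)
        .aggregate(
            key=lambda s: s["city"],
            value=lambda s: 1,
            combine=lambda a, b: a + b,
            initial=0,
        )
    )
```

Calling `trips_per_city(trips, 35.0)` counts only the two Oslo trips, since the one non-empty Lima trip averages 30 km/h. The generator-based stages keep the chain lazy until `aggregate` consumes it, which is one simple way to model pipelining; a system at petabyte scale would distribute these stages rather than run them in a single process.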