ExtDict: Extensible Dictionaries for Data- and Platform-Aware Large-Scale Learning


This paper proposes ExtDict, a novel data- and platform-aware framework for iterative analysis and learning on massive, dense datasets. Iterative execution is prohibitively costly on distributed architectures, where the cost of moving data keeps growing relative to the cost of arithmetic. ExtDict builds a performance model that quantifies the cost of an iterative analysis algorithm on a target platform in terms of FLOPs, communication, and memory, which characterize runtime, energy, and storage, respectively. The core of ExtDict is a novel parametric data projection algorithm, called Extensible Dictionary, that produces versatile, sparse representations of the data in order to minimize this cost. We show that ExtDict can achieve the optimal performance objective, as defined by our quantified cost model, through platform-aware tuning of the Extensible Dictionary parameters. An accompanying API automates the application of ExtDict to various algorithms, datasets, and platforms. Proof-of-concept evaluations on massive and dense data across different platforms demonstrate more than an order of magnitude improvement in performance over the state-of-the-art, within guaranteed user-defined error bounds.
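As a rough illustration of platform-aware tuning, the sketch below scores candidate dictionary sizes with a weighted sum of FLOPs, communication, and memory, and picks the cheapest one. It is not the ExtDict API. The names `Platform`, `iteration_cost`, and `tune_dictionary_size` are hypothetical, as are the per-unit weights. The cost expressions are also assumptions: they suppose that an m x n dataset is replaced by a dense m x l dictionary times a sparse coefficient matrix with `nnz` nonzeros, and that each node exchanges one length-l vector per iteration.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    # Hypothetical per-unit weights; in practice they would be measured
    # on the target platform rather than chosen by hand.
    w_flop: float  # cost per floating-point operation
    w_comm: float  # cost per word moved between nodes
    w_mem: float   # cost per word stored


def iteration_cost(m, l, nnz, nodes, platform):
    """Assumed per-iteration cost of working with a dense m x l dictionary
    and a sparse coefficient matrix holding nnz nonzeros."""
    flops = 2 * (m * l + nnz)             # one multiply-add per stored entry
    comm = l * nodes if nodes > 1 else 0  # assumed: one l-vector per node
    mem = m * l + nnz                     # words needed to hold both factors
    return (platform.w_flop * flops
            + platform.w_comm * comm
            + platform.w_mem * mem)


def tune_dictionary_size(m, nnz_by_l, nodes, platform):
    """Return the candidate dictionary size l with the lowest modeled cost.

    nnz_by_l maps each candidate l to the number of nonzeros observed when
    the data is projected onto a dictionary of that size."""
    return min(
        nnz_by_l,
        key=lambda l: iteration_cost(m, l, nnz_by_l[l], nodes, platform),
    )


# Illustrative numbers only: larger dictionaries give sparser coefficients.
nnz_by_l = {10: 9000, 50: 2000, 200: 1000}
compute_bound = Platform(w_flop=1.0, w_comm=0.0, w_mem=0.0)
network_bound = Platform(w_flop=1.0, w_comm=1000.0, w_mem=0.0)

best_compute = tune_dictionary_size(100, nnz_by_l, nodes=4, platform=compute_bound)
best_network = tune_dictionary_size(100, nnz_by_l, nodes=4, platform=network_bound)
```

Under these assumed weights, the same data yields different choices on the two platforms: the network-bound setting favors the smallest dictionary because communication grows with l, while the compute-bound setting accepts a larger dictionary in exchange for sparser coefficients.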