Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training


Large-scale distributed training requires significant communication bandwidth for gradient exchange. This intensive gradient communication limits the scalability of multi-machine, multi-GPU training and requires expensive high-bandwidth network switches. In this paper, we find that 99.9\% of the gradient exchange is redundant and can be safely removed without affecting convergence accuracy. We propose "Deep Gradient Compression", which reduces communication bandwidth by up to 600$\times$ (after taking metadata into account). Deep Gradient Compression has four components that together fully preserve convergence accuracy: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We evaluated Deep Gradient Compression extensively on multiple types of machine learning tasks, including image classification, speech recognition, and language modeling, and on multiple datasets, including Cifar10, ImageNet, Penn Treebank, and the Librispeech Corpus. In all of these scenarios, Deep Gradient Compression with only 0.1\% gradient exchange achieved the same accuracy and the same learning curves as the conventional dense update. These techniques enable distributed training on inexpensive commodity 1Gbps Ethernet.
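
To make the idea of exchanging only 0.1\% of the gradient concrete, the following is a minimal sketch of gradient sparsification on a single worker. The abstract does not state how the exchanged entries are chosen, so the sketch assumes magnitude-based top-k selection, with all unsent values accumulated locally and added to the next step's gradient. The function name `sparsify_top_fraction` and its parameters are hypothetical, and the four components named above (momentum correction, local gradient clipping, momentum factor masking, and warm-up training) are deliberately left out.

```python
import heapq


def sparsify_top_fraction(grad, residual, keep_fraction=0.001):
    """Split a gradient into a small sparse update and a local residual.

    Assumption (not stated in the abstract): entries are selected by
    magnitude, and unsent entries are accumulated locally.

    grad:          list of floats, this step's gradient
    residual:      list of floats, values held back in earlier steps
    keep_fraction: fraction of entries to send (0.001 means 0.1%)
    """
    # Add this step's gradient to whatever was held back earlier.
    accumulated = [r + g for r, g in zip(residual, grad)]

    # Always send at least one entry.
    k = max(1, int(len(accumulated) * keep_fraction))

    # Indices of the k entries with the largest absolute value.
    top = heapq.nlargest(
        k, range(len(accumulated)), key=lambda i: abs(accumulated[i])
    )

    # The sparse update is what would be communicated: index/value pairs.
    sparse_update = {i: accumulated[i] for i in top}

    # Sent entries are cleared locally; all others wait for a later step.
    new_residual = [
        0.0 if i in sparse_update else v for i, v in enumerate(accumulated)
    ]
    return sparse_update, new_residual


# Usage: a 2000-element gradient with two large entries.
grad = [0.01 * ((i % 7) - 3) for i in range(2000)]
grad[42] = 5.0
grad[1337] = -4.0
update, residual = sparsify_top_fraction(grad, [0.0] * len(grad))
```

In this sketch the communicated payload is a set of index/value pairs, which is why the indices count as metadata when computing a compression ratio. No gradient information is discarded: each entry ends up either in the sparse update or in the residual carried to the next step.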