Large-Scale Deep Learning in Heterogeneous Distributed Systems (Xiaowen Chu et al.)
A key driving force behind the success of deep learning is the growing computing power of multi-core and many-core processors such as GPUs, FPGAs, and ASICs. As training data sets grow and deep neural networks become more complex, efficiently utilizing the limited, expensive, and shared computing and communication resources of a heterogeneous distributed system to support large-scale deep learning tasks from different users becomes an important issue for cloud service providers. Our ultimate goal is to make deep learning tasks run as fast as possible by (1) exploiting the hardware potential to the limit; (2) optimizing the related software components; and (3) designing smart resource allocation and task scheduling for concurrent deep learning tasks from different users.
Objectives:
- To develop performance models that can estimate the execution time of a deep learning task given a set of computing and communication resources (see the first sketch after this list);
- To design efficient parallel algorithms for key computing components in deep learning, such as matrix multiplication, matrix transpose, and convolution (see the second sketch after this list);
- To design highly efficient communication primitives to support distributed training of deep learning models (see the third sketch after this list);
- To design resource management and task scheduling algorithms that can optimize the overall performance of the heterogeneous distributed system.
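To make the first objective concrete, below is a minimal sketch of how per-iteration training time can be estimated from a compute term plus a latency-bandwidth (alpha-beta) model of gradient all-reduce. It is an illustration only, not our published model, and all constants (latency, bandwidth, FLOP counts, overlap ratio) are assumed placeholder values rather than measurements.

```python
# Minimal per-iteration time model for synchronous data-parallel SGD.
# All numeric constants below are illustrative placeholders, not measured values.

def allreduce_time(message_bytes, num_workers, latency_s=20e-6, bandwidth_Bps=10e9):
    """Latency-bandwidth (alpha-beta) cost model of a ring all-reduce."""
    if num_workers == 1:
        return 0.0
    steps = 2 * (num_workers - 1)          # reduce-scatter + all-gather steps
    chunk = message_bytes / num_workers    # bytes exchanged per step
    return steps * (latency_s + chunk / bandwidth_Bps)

def iteration_time(flops_per_iter, gpu_flops, gradient_bytes, num_workers, overlap=0.0):
    """Estimate one training iteration: compute + (partially overlapped) communication."""
    t_comp = flops_per_iter / gpu_flops
    t_comm = allreduce_time(gradient_bytes, num_workers)
    return t_comp + (1.0 - overlap) * t_comm   # overlap in [0, 1] hides part of t_comm

if __name__ == "__main__":
    # Example: a ResNet-50-like workload on 16 GPUs (rough, assumed numbers).
    t = iteration_time(flops_per_iter=3e12,    # fwd+bwd FLOPs for one mini-batch
                       gpu_flops=12e12,        # sustained device throughput
                       gradient_bytes=100e6,   # ~25M parameters in FP32
                       num_workers=16,
                       overlap=0.5)            # assume half the communication is hidden
    print(f"estimated per-iteration time: {t * 1e3:.1f} ms")
```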
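For the second objective, one common way to map convolution onto highly tuned matrix-multiplication kernels is the im2col lowering. The NumPy sketch below is a plain reference implementation of the idea for clarity, not our optimized GPU code; shapes and the helper name are our own choices for illustration.

```python
import numpy as np

def conv2d_im2col(x, w):
    """2-D convolution (no padding, stride 1) lowered to a single matrix multiplication.

    x: input of shape (C, H, W); w: filters of shape (K, C, R, S).
    Lowering convolution to GEMM ("im2col") lets frameworks reuse fast matmul kernels.
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1

    # Build the patch matrix: one column per output position.
    cols = np.empty((C * R * S, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + R, j:j + S].ravel()

    # One GEMM produces all K output channels at once.
    out = w.reshape(K, -1) @ cols          # shape (K, out_h * out_w)
    return out.reshape(K, out_h, out_w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 8, 8))     # 3 input channels, 8x8 image
    w = rng.standard_normal((4, 3, 3, 3))  # 4 filters of size 3x3
    print(conv2d_im2col(x, w).shape)       # (4, 6, 6)
```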
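For the third objective, the dominant communication primitive in synchronous data-parallel training is all-reduce. The sketch below simulates the chunk schedule of a ring all-reduce over in-memory NumPy arrays; real systems exchange these chunks over NCCL or MPI, but the schedule is the same. It is a didactic simulation under these assumptions, not our production communication library.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Simulate a ring all-reduce on a list of equal-length gradient vectors."""
    p = len(worker_grads)
    # Split every worker's gradient into p chunks (same split on every worker).
    chunks = [np.array_split(g.astype(np.float64, copy=True), p) for g in worker_grads]

    # Phase 1: reduce-scatter. After p-1 steps, worker i owns the fully summed chunk (i+1) % p.
    for step in range(p - 1):
        sends = [chunks[i][(i - step) % p].copy() for i in range(p)]   # snapshot before updating
        for i in range(p):
            recv_idx = (i - step - 1) % p
            chunks[i][recv_idx] += sends[(i - 1) % p]                  # receive from left neighbor

    # Phase 2: all-gather. Circulate the reduced chunks so every worker holds all of them.
    for step in range(p - 1):
        sends = [chunks[i][(i + 1 - step) % p].copy() for i in range(p)]
        for i in range(p):
            recv_idx = (i - step) % p
            chunks[i][recv_idx] = sends[(i - 1) % p]

    return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]    # 4 workers, 12 gradient values each
    reduced = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in reduced)
    print("all workers hold the summed gradient")
```

A key property of this schedule is that each worker sends roughly 2(p-1)/p times the gradient size regardless of the number of workers p, which is why ring all-reduce scales well in bandwidth-bound settings.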
Our Impact:
- We have collaborated with Tencent Ltd. to develop a large-scale distributed AI training system that can train AlexNet and ResNet-50 in a few minutes using 1024 to 2048 GPUs. [Media Report]
- We have been maintaining one of the most influential performance benchmarking suites for state-of-the-art deep learning frameworks, https://github.com/hclhkbu/dlbench, which has attracted considerable attention and generous support from industry, including Nvidia, Inspur, Intel, Tencent, the Microsoft CNTK team, and the MXNet team.
- The PI, Dr. Xiaowen Chu, was invited to the AI Computing Conference 2017 to present this work. An interview by the AI media outlet XinZhiYuan is available at [report].
- Our research papers have attracted substantial attention since publication. For example, our GPU memory hierarchy paper [9] has received 100+ citations, the deep learning benchmark paper [4] has received 90+ citations, the GPU DVFS paper [10] has received 50+ citations, and the deep learning performance model paper [3] received the Best Paper Award at IEEE DataCom 2018.
Selected Publications:
- S. Shi, X.-W. Chu, and B. Li, “MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms,” IEEE INFOCOM 2019, Paris, France, May 2019.
- X. Jia, S. Song, S. Shi, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, and X.-W. Chu, “Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes,” Workshop on Systems for ML and Open Source Software, co-located with NeurIPS 2018, Montreal, Canada, Dec 2018.
- S. Shi, Q. Wang, and X.-W. Chu, “Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs,” IEEE DataCom 2018, Athens, Greece, August 2018. (Best Paper Award)
- S. Shi, Q. Wang, P. Xu, and X.-W. Chu, “Benchmarking State-of-the-Art Deep Learning Software Tools,” arXiv:1608.07249, https://arxiv.org/abs/1608.07249.
- Q. Wang and X.-W. Chu, “GPGPU Power Estimation with Core and Memory Frequency Scaling,” GreenMetrics 2017, in conjunction with ACM SIGMETRICS 2017, Urbana-Champaign, IL, USA, June 2017. (Also appeared in ACM SIGMETRICS Performance Evaluation Review.)
- V. Chau, X.-W. Chu, H. Liu, and Y.-W. Leung, “Energy Efficient Job Scheduling with DVFS for CPU-GPU Heterogeneous Systems,” ACM e-Energy 2017, Hong Kong, May 2017.
- X. Mei, X.-W. Chu, Y.-W. Leung, H. Liu, and Z. Li, “Energy Efficient Real-time Task Scheduling on CPU-GPU Hybrid Clusters,” IEEE INFOCOM 2017, Atlanta, GA, USA, May 2017.
- X. Mei, Q. Wang, and X.-W. Chu, “A Survey and Measurement Study of GPU DVFS on Energy Conservation,” Digital Communications and Networks, Vol. 3, No. 2, Pages 89-100, May 2017.
- X. Mei and X.-W. Chu, “Dissecting GPU Memory Hierarchy through Microbenchmarking,” IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 1, Pages 72-86, Jan 2017. (An earlier short version was presented at IFIP NPC 2014.)
- X. Mei, L. Yung, K. Zhao, and X.-W. Chu, “A Measurement Study of GPU DVFS on Energy Conservation,” USENIX HotPower’13, co-located with the 24th ACM Symposium on Operating Systems Principles (SOSP), Pennsylvania, USA, November 2013.
For further information on this research topic, please contact Prof. Xiaowen Chu.