Large-Scale Deep Learning in Heterogeneous Distributed Systems (Xiaowen Chu et al.)
A key driving force behind the success of deep learning is the growing computing power of multi-core and many-core processors such as GPUs, FPGAs, and ASICs. As training data sets grow and deep neural networks become more complex, efficiently utilizing the limited, expensive, and shared computing and communication resources of a heterogeneous distributed system to support large-scale deep learning tasks from different users has become an important issue for cloud service providers. Our ultimate goal is to make deep learning tasks as fast as possible by (1) exploiting the hardware's potential to the limit; (2) optimizing the related software components; and (3) designing smart resource allocation and task scheduling for different simultaneous deep learning tasks.
Objectives:
- To develop performance models that can estimate the execution time of a deep learning task given a set of computing and communication resources;
- To design efficient parallel algorithms for key computing components in deep learning, such as matrix multiplication, matrix transpose, and convolution;
- To design highly efficient communication primitives to support distributed training of deep learning models;
- To design resource management and task scheduling algorithms that can optimize the overall performance of the heterogeneous distributed system.
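To make the first objective concrete, the sketch below illustrates one simple way such a performance model can be built: an alpha-beta (latency-bandwidth) cost model for ring all-reduce combined with a compute-bound estimate, predicting the time of one iteration of data-parallel synchronous SGD. All function names and the numeric constants are illustrative assumptions for this sketch, not measurements or models from the project itself.

```python
# A minimal sketch of a performance model for one iteration of
# data-parallel synchronous SGD. Assumes no compute/communication
# overlap and a standard ring all-reduce cost model; all numbers
# below are hypothetical.

def ring_allreduce_time(msg_bytes, workers, bandwidth, latency):
    """Alpha-beta cost of ring all-reduce: 2*(p-1) communication
    steps, each transferring msg_bytes/p over a link of the given
    bandwidth (bytes/s) with the given per-step latency (s)."""
    p = workers
    if p == 1:
        return 0.0  # no communication needed on a single worker
    steps = 2 * (p - 1)
    return steps * (latency + (msg_bytes / p) / bandwidth)

def iteration_time(flops, peak_flops, grad_bytes, workers,
                   bandwidth, latency):
    """Estimated wall time of one iteration: forward/backward
    compute followed by the gradient all-reduce."""
    t_compute = flops / peak_flops
    t_comm = ring_allreduce_time(grad_bytes, workers, bandwidth, latency)
    return t_compute + t_comm

# Example with ResNet-50-like (illustrative) numbers:
t = iteration_time(
    flops=8e12,          # ~8 TFLOPs of work per iteration
    peak_flops=100e12,   # ~100 TFLOPS effective throughput per GPU
    grad_bytes=100e6,    # ~25M parameters in FP32 gradients
    workers=8,
    bandwidth=10e9,      # 10 GB/s interconnect
    latency=20e-6,       # 20 us per communication step
)
```

A model of this shape makes the scaling trade-off visible: as the number of workers grows, per-iteration compute stays fixed while the all-reduce term grows, which is precisely what motivates the communication-efficient primitives and gradient sparsification methods in the publications below.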
Our Impact:
- In 2020, we collaborated with Tencent Ltd. again to improve the distributed AI training system and broke the DAWNBench world record for training ResNet-50 with 128 Nvidia V100 GPUs. [Media Report]
- In 2018, we collaborated with Tencent Ltd. to develop a large-scale distributed AI training system that can train AlexNet and ResNet-50 in a few minutes using 1024 to 2048 GPUs [Media Report]. The research paper [10] has received 180+ citations.
- We have been maintaining one of the most influential performance benchmarking suites for state-of-the-art deep learning frameworks: https://github.com/hclhkbu/dlbench. It has received great attention and generous support from industry, including Nvidia, Inspur, Intel, Tencent, the Microsoft CNTK team, and the MXNet team. The research paper [12] has received 230+ citations. The PI, Dr. Xiaowen Chu, was invited to the AI Computing Conference 2017 to introduce this work. An interview by the AI media outlet XinZhiYuan can be found at [report].
Selected Publications:
- R. Zeng, S. Zhang, J. Wang, and X.-W. Chu, “FMore: An Incentive Scheme of Multi-dimensional Auction for Federated Learning in MEC,” IEEE ICDCS 2020, Singapore, Dec 2020.
- S. Shi, Z. Tang, Q. Wang, K. Zhao, and X.-W. Chu, “Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees,” the 24th European Conference on Artificial Intelligence (ECAI 2020), Santiago de Compostela, Spain, Aug-Sept 2020.
- S. Shi, Q. Wang, X.-W. Chu, B. Li, Y. Qin, R. Liu, and X. Zhao, “Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs,” IEEE INFOCOM 2020, Canada, July 2020.
- D. Yan, W. Wang, and X.-W. Chu, “Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply,” IEEE IPDPS 2020, New Orleans, USA, May 2020.
- D. Yan, W. Wang, and X.-W. Chu, “Optimizing Batched Winograd Convolution on GPUs,” ACM PPoPP 2020, San Diego, USA, Feb 2020.
- S. Shi, K. Zhao, Q. Wang, Z. Tang, and X.-W. Chu, “A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification,” IJCAI 2019, Macau, P.R.C., August 2019.
- S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X.-W. Chu, “A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks,” IEEE ICDCS 2019, Dallas, Texas, USA, July 2019.
- Z. Tang, Y. Wang, Q. Wang, and X.-W. Chu, “The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study,” ACM e-Energy 2019, Phoenix, AZ, USA, June 2019.
- S. Shi, X.-W. Chu, and B. Li, “MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms,” IEEE INFOCOM 2019, Paris, France, May 2019.
- X. Jia, S. Song, S. Shi, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, and X.-W. Chu, “Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes,” Workshop on Systems for ML and Open Source Software, collocated with NeurIPS 2018, Montreal, Canada, Dec 2018.
- S. Shi, Q. Wang, and X.-W. Chu, “Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs,” IEEE DataCom 2018, Athens, Greece, August 2018. (Best Paper Award)
- S. Shi, Q. Wang, P. Xu, and X.-W. Chu, “Benchmarking State-of-the-Art Deep Learning Software Tools,” arXiv:1608.07249, https://arxiv.org/abs/1608.07249.
- V. Chau, X.-W. Chu, H. Liu, and Y.-W. Leung, “Energy Efficient Job Scheduling with DVFS for CPU-GPU Heterogeneous Systems,” ACM e-Energy 2017, Hong Kong, May 2017.
- X. Mei, X.-W. Chu, Y.-W. Leung, H. Liu, and Z. Li, “Energy Efficient Real-time Task Scheduling on CPU-GPU Hybrid Clusters,” IEEE INFOCOM 2017, Atlanta, GA, USA, 1-4 May, 2017.
- X. Mei and X.-W. Chu, “Dissecting GPU Memory Hierarchy through Microbenchmarking,” IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 1, pages 72-86, Jan 2017.
For further information on this research topic, please contact Prof. Xiaowen Chu.