Large-Scale Deep Learning in Heterogeneous Distributed Systems (Xiaowen Chu et al.)

A key driving force behind the success of deep learning is the growing computing power of multi-core and many-core processors such as GPUs, FPGAs, and ASICs. As training data sets grow larger and deep neural networks become more complex, efficiently utilizing the limited, expensive, and shared computing and communication resources of a heterogeneous distributed system to serve large-scale deep learning tasks from different users has become an important challenge for cloud service providers. Our ultimate goal is to make deep learning tasks run as fast as possible by (1) exploiting the full potential of the hardware; (2) optimizing the related software components; and (3) designing smart resource allocation and task scheduling for concurrent deep learning tasks.
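
A technique that recurs in several of the publications below is communication-efficient distributed SGD via top-k gradient sparsification, in which each worker transmits only the largest-magnitude gradient entries and keeps the remainder as a local residual for later rounds. The following minimal Python/NumPy sketch is our own illustration of that core selection step under simplified assumptions (the function name topk_sparsify and the 1% ratio are chosen for illustration); it is not the exact algorithm of any paper listed below.

    import numpy as np

    def topk_sparsify(grad, k_ratio=0.01):
        """Keep the k largest-magnitude gradient entries; return the rest as a residual.

        Illustrative sketch of top-k gradient sparsification with local error
        accumulation (the residual would be added to next iteration's gradient).
        """
        flat = grad.ravel()
        k = max(1, int(k_ratio * flat.size))
        # Indices of the k entries with the largest absolute value (unordered).
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        values = flat[idx]
        # Residual: everything that was not transmitted this round.
        residual = flat.copy()
        residual[idx] = 0.0
        return idx, values, residual.reshape(grad.shape)

    # Toy usage: each worker sends only ~1% of its gradient entries per iteration.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        g = rng.standard_normal((256, 128))
        idx, vals, res = topk_sparsify(g, k_ratio=0.01)
        print(f"sent {idx.size} of {g.size} entries "
              f"({idx.size / g.size:.2%} of the gradient)")

In practice, such schemes trade a small amount of extra computation (the top-k selection) for a large reduction in communication volume, which is the bottleneck on low-bandwidth networks studied in the papers below.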


Selected Publications:

  1. R. Zeng, S. Zhang, J. Wang, and X.-W. Chu, “FMore: An Incentive Scheme of Multi-dimensional Auction for Federated Learning in MEC,” IEEE ICDCS 2020, Singapore, Dec 2020.
  2. S. Shi, Z. Tang, Q. Wang, K. Zhao, and X.-W. Chu, “Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees,” ECAI 2020, Santiago de Compostela, Spain, Aug-Sept 2020.
  3. S. Shi, Q. Wang, X.-W. Chu, B. Li, Y. Qin, R. Liu, and X. Zhao, “Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs,” IEEE INFOCOM 2020, Canada, July 2020.
  4. D. Yan, W. Wang, and X.-W. Chu, “Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply,” IEEE IPDPS 2020, New Orleans, USA, May 2020.
  5. D. Yan, W. Wang, and X.-W. Chu, “Optimizing Batched Winograd Convolution on GPUs,” ACM PPoPP 2020, San Diego, USA, Feb 2020.
  6. S. Shi, K. Zhao, Q. Wang, Z. Tang, and X.-W. Chu, “A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification,” IJCAI 2019, Macau, P.R.C., August 2019.
  7. S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X.-W. Chu, “A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks,” IEEE ICDCS 2019, Dallas, Texas, USA, July 2019.
  8. Z. Tang, Y. Wang, Q. Wang, and X.-W. Chu, “The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study,” ACM e-Energy 2019, Phoenix, AZ, USA, June 2019.
  9. S. Shi, X.-W. Chu, and B. Li, “MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms,” IEEE INFOCOM 2019, Paris, France, May 2019.
  10. X. Jia, S. Song, S. Shi, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, and X.-W. Chu, “Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes,” Workshop on Systems for ML and Open Source Software, co-located with NeurIPS 2018, Montreal, Canada, Dec 2018.
  11. S. Shi, Q. Wang, and X.-W. Chu, “Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs,” IEEE DataCom 2018, Athens, Greece, August 2018. (Best Paper Award)
  12. S. Shi, Q. Wang, P. Xu, and X.-W. Chu, “Benchmarking State-of-the-Art Deep Learning Software Tools,” arXiv:1608.07249, https://arxiv.org/abs/1608.07249.
  13. V. Chau, X.-W. Chu, H. Liu, and Y.-W. Leung, “Energy Efficient Job Scheduling with DVFS for CPU-GPU Heterogeneous Systems,” ACM e-Energy 2017, Hong Kong, May 2017.
  14. X. Mei, X.-W. Chu, Y.-W. Leung, H. Liu, and Z. Li, “Energy Efficient Real-time Task Scheduling on CPU-GPU Hybrid Clusters,” IEEE INFOCOM 2017, Atlanta, GA, USA, May 2017.
  15. X. Mei and X.-W. Chu, “Dissecting GPU Memory Hierarchy through Microbenchmarking,” IEEE Transactions on Parallel and Distributed Systems, Vol. 28, No. 1, pages 72-86, Jan 2017.

For further information on this research topic, please contact Prof. Xiaowen Chu.