Energy-efficient Task Scheduling for CPU-GPU Hybrid Clusters (Xiaowen Chu et al.)

In the past decade, we have witnessed the proliferation of GPU computing in many scientific and industrial applications. Compared with contemporary CPUs, GPUs can improve computing power and memory bandwidth by an order of magnitude. Therefore, many supercomputers and datacenters have begun to use GPUs as auxiliary or even primary computing resources. For example, Oak Ridge National Laboratory's Titan supercomputer deployed 18,688 Nvidia GPUs to achieve 17.59 petaflops of raw computing power. More recently, Google DeepMind used a large-scale hybrid CPU-GPU cluster to train AlphaGo, the computer program that defeated the world Go champion. Many commercial cloud service providers, including Amazon, Microsoft, Alibaba, and Outscale, already offer pay-as-you-go GPU computing services based on hybrid CPU-GPU clusters.

Although GPUs are much more powerful and energy-efficient than CPUs, they still consume significant power. For example, a single Nvidia DGX-1 GPU server consumes up to 3,200 watts of electricity, 75% of which is used by its eight GPUs. The electricity cost of the Titan supercomputer is a staggering 23 million dollars per year. Thus, energy efficiency has become a leading design constraint for large-scale hybrid CPU-GPU clusters. Many energy optimization solutions have been proposed in the literature for traditional CPU-based clusters, among which dynamic voltage and frequency scaling (DVFS) and task scheduling are the most important.

This project aims to design theoretically sound yet practical energy-efficient task mapping and scheduling solutions for large-scale CPU-GPU clusters. End users submit their tasks to the cluster, and the scheduler allocates appropriate resources with a certain DVFS configuration to each task (i.e., task mapping) and schedules the tasks to execute at the right time (i.e., task scheduling). To this end, we propose to carry out the following major research tasks. First, we will develop quantitative performance and power models for GPUs that incorporate the effect of DVFS, through micro-benchmarking and machine learning techniques. Second, we will consider the offline scheduling case, where all user tasks are known in advance; this setting is common for private clusters with known users and task patterns. Third, we will extend our investigation to the online scenario, where user tasks arrive over time; this model fits public service providers that cannot know future tasks. We plan to design theoretically sound online algorithms through competitive analysis, i.e., the worst-case performance is guaranteed to be within a certain bound of the optimal offline strategy that has complete future task information. Besides theoretical analysis, we also plan to evaluate performance through simulations and real-world experiments.
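To make the idea concrete, the following minimal Python sketch illustrates how DVFS-aware task mapping and greedy online scheduling could interact. The models are illustrative assumptions, not the project's actual models: execution time is taken as t(f) = c/f + m (a frequency-sensitive compute part plus a frequency-insensitive memory part), power as p(f) = p_static + k*f^3 (static plus dynamic power, since dynamic power scales roughly with V^2*f and voltage scales with frequency). Each arriving task is mapped to its energy-minimizing frequency from a discrete DVFS set, then placed on the currently least-loaded GPU, the classic greedy list-scheduling heuristic whose makespan is within a factor of 2 of the offline optimum.

```python
def best_frequency(c, m, p_static, k, freqs):
    """Return (f, energy) minimizing e(f) = p(f) * t(f) over a
    discrete set of DVFS frequencies. All model parameters
    (c, m, p_static, k) are illustrative assumptions."""
    def energy(f):
        t = c / f + m                  # compute-bound + memory-bound time
        p = p_static + k * f ** 3      # static + dynamic (~ V^2 * f) power
        return p * t
    f = min(freqs, key=energy)
    return f, energy(f)

def online_schedule(tasks, n_gpus, p_static, k, freqs):
    """Greedy online scheduler: each arriving task (c, m) gets its
    energy-optimal frequency, then the currently least-loaded GPU.
    Returns the per-task plan and the resulting makespan."""
    loads = [0.0] * n_gpus
    plan = []
    for c, m in tasks:
        f, e = best_frequency(c, m, p_static, k, freqs)
        g = min(range(n_gpus), key=loads.__getitem__)
        t = c / f + m
        loads[g] += t
        plan.append((g, f, t, e))      # (gpu, frequency, time, energy)
    return plan, max(loads)            # makespan = max GPU load

# Usage with made-up task and power parameters:
tasks = [(10.0, 1.0), (4.0, 0.5), (8.0, 2.0)]   # (compute work, memory time)
plan, makespan = online_schedule(tasks, n_gpus=2, p_static=50.0,
                                 k=0.5, freqs=[1.0, 1.5, 2.0])
```

The greedy placement step is exactly Graham's list scheduling, which is known to be 2-competitive for makespan on identical machines; a real design would also have to weigh energy against makespan rather than optimizing them independently as this sketch does.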

Related Publications:

For further information on this research topic, please contact Dr. Xiaowen Chu.