Energy-efficient Task Scheduling for CPU-GPU Hybrid Clusters (Xiaowen Chu et al.)
In the past decade, we have witnessed the proliferation of GPU computing in many
scientific and industrial applications. Compared with contemporary CPUs, GPUs
can improve computing power and memory bandwidth by an order of magnitude.
Therefore, many supercomputers and datacenters have begun to use GPUs as their
auxiliary or even major computing resources. For example, Oak Ridge National
Laboratory's Titan supercomputer deployed 18,688 Nvidia GPUs to achieve 17.59
petaflops of raw computing power. More recently, Google DeepMind used a large-scale
hybrid CPU-GPU cluster to train the computer program AlphaGo, which defeated the
world Go champion. Many commercial cloud service providers, including Amazon,
Microsoft, Alibaba and Outscale, already offer pay-as-you-go GPU computing services
based on hybrid CPU-GPU clusters.
Although GPUs are much more powerful and energy-efficient than CPUs, they still
consume significant power. For example, a single Nvidia DGX-1 GPU server consumes
up to 3,200 watts, 75% of which is used by its eight GPUs. The electricity cost
of the Titan supercomputer is a staggering 23 million US dollars per year. Thus, energy
efficiency becomes a leading design constraint for large-scale hybrid CPU-GPU
clusters. Many energy optimization solutions have been proposed in the literature for
traditional CPU based clusters, among which dynamic voltage and frequency scaling
(DVFS) and task scheduling are the most important ones.
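To see why DVFS is effective, consider the textbook CMOS dynamic-power model, P ≈ C·V²·f: because power scales with the square of voltage, lowering voltage and frequency together can cut energy even though the task runs longer. The sketch below illustrates this trade-off; the specific constants are illustrative only and are not taken from the project.

```python
# Illustrative sketch of the CMOS dynamic-power model behind DVFS
# (constants are made up for illustration, not from the project).

def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power in watts: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

def energy_for_task(cycles, capacitance, voltage, frequency):
    """Energy = power * runtime, where runtime = cycles / frequency."""
    runtime = cycles / frequency
    return dynamic_power(capacitance, voltage, frequency) * runtime

# Halving both voltage and frequency quarters the energy of a
# compute-bound task, at the cost of doubling its runtime.
e_high = energy_for_task(1e9, 1e-9, 1.0, 1e9)    # full voltage/frequency
e_low = energy_for_task(1e9, 1e-9, 0.5, 0.5e9)   # half voltage/frequency
```

Note that for a fixed number of cycles the frequency cancels out of the energy term, so the saving comes entirely from the V² factor; this is why DVFS and task scheduling must be considered jointly when deadlines matter.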
This project aims to design theoretically sound yet practical energy-efficient task
mapping and scheduling solutions for large-scale CPU-GPU clusters. The end users
submit their tasks to the cluster, and the scheduler allocates appropriate resources
with a certain DVFS configuration to each task (i.e., task mapping) and schedules them
to execute at the right time (i.e., task scheduling). To this end, we propose
to carry out the following major research tasks. First, we will develop quantitative
performance and power models for GPUs that incorporate the effect of DVFS,
through micro-benchmarking and machine learning techniques. Second, we consider
the offline scheduling case where all user tasks are known in advance. This is popular
for private clusters with known users and task patterns. Third, we will extend our
investigation to the online scenario where user tasks arrive over time. This model is
popular for public service providers with unknown future tasks. We plan to design
theoretically sound online algorithms through competitive analysis, i.e., the worst-case
performance is guaranteed to be within a certain factor of the optimal offline strategy
with complete future task information. Besides theoretical analysis, we also plan to
evaluate the performance through simulations and real-world experiments.
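As a rough illustration of the task-mapping step described above, the following sketch greedily places each arriving task on the (node, DVFS frequency) pair that minimizes estimated energy while still meeting the task's deadline. This is a hypothetical baseline for intuition, not the project's actual algorithm; the cubic power model and all parameters are assumptions.

```python
# Hypothetical greedy online task mapper (not the project's algorithm):
# place each arriving task on the (node, frequency) pair that minimizes
# estimated energy subject to the task's deadline.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    freqs: list          # available DVFS frequency levels, in GHz (assumed)
    power_coeff: float   # illustrative power model: P = power_coeff * f**3
    busy_until: float = field(default=0.0)

def schedule(task_cycles, deadline, now, nodes):
    """Return (node, freq, finish_time) minimizing energy, or None if infeasible."""
    best = None
    for node in nodes:
        start = max(now, node.busy_until)
        for f in node.freqs:
            runtime = task_cycles / (f * 1e9)   # seconds at f GHz
            finish = start + runtime
            if finish > deadline:
                continue                         # misses deadline, skip
            energy = node.power_coeff * f ** 3 * runtime
            if best is None or energy < best[3]:
                best = (node, f, finish, energy)
    if best is None:
        return None                              # no feasible placement
    node, f, finish, _ = best
    node.busy_until = finish                     # reserve the node
    return node, f, finish
```

Because power grows cubically with frequency here, the mapper prefers the lowest frequency that still meets the deadline; an online algorithm with a proven competitive ratio would refine this greedy rule against worst-case arrival sequences.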
For further information on this research topic, please contact Dr. Xiaowen Chu.