COMP Research Team Breaks World Record for Training ImageNet in Four Minutes

2 Aug 2018
(Right) Dr. Chu Xiaowen and PhD student Shi Shaohuai.
A diagram showing data transmission of a 5-layer model.


Dr Chu Xiaowen, Associate Professor in the Department of Computer Science, and his research group teamed up with Tencent Machine Learning to train AlexNet in just 4 minutes and ResNet-50 in 6.6 minutes on the ImageNet dataset, breaking the world record. The previous international records were 11 minutes for AlexNet and 15 minutes for ResNet-50.

To train neural networks at high speed while keeping accuracy stable, the COMP research team and researchers from Tencent Machine Learning reduced the training time for AlexNet and ResNet-50 by increasing the batch size. In this study, the researchers fed 65,536 ImageNet images per batch through the neural network during training.
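A rough way to see why a larger batch shortens training is to count how many sequential weight updates one pass over the dataset needs. The sketch below is only illustrative arithmetic: the training-set size of 1,281,167 images (the ILSVRC-2012 training split) and the comparison batch size of 256 are assumptions that do not come from the article, and the function name is hypothetical.

```python
import math

# Assumption: size of the ILSVRC-2012 (ImageNet-1k) training split.
# The article does not state this number.
IMAGENET_TRAIN_IMAGES = 1_281_167

# Batch size reported in the article.
LARGE_BATCH = 65_536

# Assumption: a small batch size chosen purely for comparison.
SMALL_BATCH = 256


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of weight updates needed to see every image once."""
    return math.ceil(num_images / batch_size)


small_steps = steps_per_epoch(IMAGENET_TRAIN_IMAGES, SMALL_BATCH)
large_steps = steps_per_epoch(IMAGENET_TRAIN_IMAGES, LARGE_BATCH)

print(f"batch {SMALL_BATCH}: {small_steps} steps per epoch")
print(f"batch {LARGE_BATCH}: {large_steps} steps per epoch")
```

Fewer steps per epoch means fewer rounds of gradient exchange between nodes, provided each large step can be spread across enough hardware to finish quickly.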

The team also came up with a communications technique called “tensor fusion”. When nodes share information over the cluster's network, multiple small tensors are packaged together into a single message to reduce the number of separate transfers, thus reducing latency and increasing throughput. The team also used a mix of 32-bit full-precision and 16-bit half-precision floating-point math (FP32 and FP16) during training, rather than purely FP32, which further reduced the amount of data moved through a node's memory, improving throughput and cutting training time.
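The two ideas in this paragraph can be illustrated with a small, self-contained sketch. It is not the team's implementation: the helper names `fuse` and `unfuse` and the sample gradient values are hypothetical, and Python's standard `struct` module stands in for a real communication library. In the sketch, fusing changes how many messages are sent, while the FP16 encoding changes how many bytes each message carries.

```python
import struct

# struct format codes: "f" is a 32-bit float (FP32), "e" is a 16-bit float (FP16).


def fuse(tensors, fmt="f"):
    """Pack several flat tensors into one contiguous byte buffer.

    Returns the buffer plus each tensor's length, which the receiver
    needs in order to split the buffer again.
    """
    sizes = [len(t) for t in tensors]
    flat = [x for t in tensors for x in t]
    buf = struct.pack(f"<{len(flat)}{fmt}", *flat)
    return buf, sizes


def unfuse(buf, sizes, fmt="f"):
    """Split a fused buffer back into the original list of tensors."""
    flat = struct.unpack(f"<{sum(sizes)}{fmt}", buf)
    tensors, start = [], 0
    for n in sizes:
        tensors.append(list(flat[start:start + n]))
        start += n
    return tensors


# Hypothetical gradients from three small layers. The values are chosen
# to be exactly representable in FP16 so the round trip is lossless here.
grads = [[0.5, -1.25], [2.0], [0.125, 0.75, -3.0]]

buf32, sizes = fuse(grads, "f")   # one FP32 message instead of three
buf16, _ = fuse(grads, "e")       # the same message encoded as FP16

print("messages without fusion:", len(grads))
print("messages with fusion:   ", 1)
print("fused FP32 bytes:", len(buf32))
print("fused FP16 bytes:", len(buf16))
```

Real gradients generally lose some precision when cast to FP16, which is one reason to keep part of the computation in FP32 rather than switching entirely to half precision.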

Having improved training speed on ImageNet, the team plans to apply its accelerated training capability to other AI businesses and services, such as game AI.