We verified the validity and scalability of our system on a range of models and training tasks. In all experiments we used 4-bit quantization with a bucket size of 1024.
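CGX's actual compression kernels run on the GPU; the snippet below is only a minimal PyTorch sketch of the technique these parameters refer to, bucket-wise max-scaled uniform quantization. All function names and details here are illustrative and are not part of CGX's API.

```python
import torch

def quantize_bucketed(grad: torch.Tensor, bits: int = 4, bucket_size: int = 1024):
    """Quantize a gradient tensor bucket-by-bucket with max-abs scaling.

    Each bucket of `bucket_size` consecutive values gets its own scale,
    so one large value only costs precision within its own bucket.
    """
    levels = 2 ** (bits - 1) - 1              # 7 representable magnitudes for 4 bits
    flat = grad.detach().flatten()
    pad = (-flat.numel()) % bucket_size       # pad to a whole number of buckets
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    buckets = flat.view(-1, bucket_size)
    scales = buckets.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    q = torch.round(buckets / scales * levels).to(torch.int8)
    return q, scales, pad

def dequantize_bucketed(q: torch.Tensor, scales: torch.Tensor, pad: int,
                        bits: int = 4) -> torch.Tensor:
    """Reconstruct an approximate flat gradient from quantized buckets."""
    levels = 2 ** (bits - 1) - 1
    flat = (q.to(torch.float32) / levels * scales).flatten()
    return flat[:-pad] if pad else flat

# Round trip: g_hat approximates g with bucket-local quantization error.
g = torch.randn(25_000)
g_hat = dequantize_bucketed(*quantize_bucketed(g)).view_as(g)
```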

Validation

First, we validated that using quantization with the aforementioned parameters does not affect the training quality.

Model quality with and without CGX compression (top-1 accuracy for ResNet50, VGG16, and Vision Transformer; perplexity for Transformer-XL and GPT-2; F1 score for BERT):

           ResNet50   VGG16   Vision Transformer (base)   Transformer-XL   GPT-2   BERT
Baseline   75.8       69.1    79.2                        22.81            14.1    93.12
CGX        75.9       68.9    78.6                        22.9             13.9    93.06

Scalability

Next, we performed weak-scaling experiments on a server with low-bandwidth inter-GPU communication (8 RTX 3090 GPUs). We ran two image classification models, ResNet50 (25M parameters) and Vision Transformer base (86M parameters), on ImageNet, and two language models, Transformer-XL base (192M parameters) on WikiText-103 and BERT (335M parameters) on SQuAD. CGX achieves up to a 100% (2x) speedup over NVIDIA NCCL, reaching up to 90% of ideal scaling on 8 GPUs.
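For concreteness, "90% of ideal scaling" means total throughput on 8 GPUs reaches 90% of 8x the single-GPU throughput. A minimal way to compute this (the throughput figures below are hypothetical, for illustration only):

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float, n: int) -> float:
    """Fraction of ideal (linear) scaling achieved on n GPUs."""
    return throughput_ngpu / (n * throughput_1gpu)

# Hypothetical images/sec numbers: 350 on 1 GPU, 2520 on 8 GPUs -> 0.9 efficiency.
print(scaling_efficiency(throughput_1gpu=350.0, throughput_ngpu=2520.0, n=8))
```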

[Figures: weak-scaling throughput on up to 8 GPUs for ResNet50, Vision Transformer base, Transformer-XL base, and BERT]

For more benchmarks, please refer to the CGX paper.