Welcome to Pytorch-CGX
Pytorch-CGX is a PyTorch extension that adds a torch.distributed backend supporting allreduce of quantized buffers. It supports quantization of float16 and float32 buffers to 1-8 bits.
CGX is based on the MPI torch.distributed backend. The extension essentially replaces only the allreduce primitive.
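To make "quantization to 1-8 bits" concrete, here is a minimal sketch of generic min-max uniform quantization in plain Python. It is an illustration only and is not taken from the CGX source; the function names `quantize` and `dequantize` are hypothetical, and the scheme CGX actually uses may differ.

```python
def quantize(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] (illustrative only)."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant buffer, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


buf = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, lo, scale = quantize(buf, bits=4)
restored = dequantize(codes, lo, scale)
# Rounding to the nearest level bounds the error by half a step.
max_err = max(abs(a - b) for a, b in zip(buf, restored))
```

With 4 bits each value is sent as one of 16 levels instead of a 16- or 32-bit float, at the price of a reconstruction error of at most half a quantization step.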
Install
export MPI_HOME=/path/to/mpi
export NCCL_HOME=/path/to/nccl
pip install pytorch-cgx
We also provide a Dockerfile for running pytorch-cgx.
See install for the installation details.
Introduction
pytorch-cgx is a plug-in extension for torch.distributed that substantially improves scalability on servers with low-bandwidth inter-GPU communication.
Your training scripts stay the same; you only need to replace the torch.distributed backend and the launch script.
The only changes required in a training script that uses torch.distributed are importing the built extension, specifying cgx as the backend parameter of torch.distributed.init_process_group, and setting some environment variables.
See usage for the detailed instructions.
Example:
import torch
import torch.distributed as dist
import torch_cgx
# set `CGX_COMPRESSION_QUANTIZATION_BITS` to 4 to enable compression.
dist.init_process_group('cgx', init_method='env://', rank=args.local_rank)
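The comment in the example mentions the `CGX_COMPRESSION_QUANTIZATION_BITS` environment variable. A minimal sketch of setting it from Python follows; the helper `set_cgx_quantization_bits` is hypothetical and not part of pytorch-cgx, and the sketch assumes the variable is read when the backend is initialized, so it should be set before calling `dist.init_process_group`.

```python
import os


def set_cgx_quantization_bits(bits):
    """Hypothetical helper: validate the bit width and export it."""
    # The README states that 1-8 bits are supported.
    if isinstance(bits, bool) or not isinstance(bits, int) or not 1 <= bits <= 8:
        raise ValueError("bits must be an integer between 1 and 8")
    os.environ["CGX_COMPRESSION_QUANTIZATION_BITS"] = str(bits)


set_cgx_quantization_bits(4)
# Assumption: call dist.init_process_group('cgx', ...) only after this point.
```

Exporting the variable in the launch script with `export CGX_COMPRESSION_QUANTIZATION_BITS=4` has the same effect without touching the training code.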
Speedup of training with CGX on 8xRTX3090 compared to NVIDIA NCCL:
Cloud costs
We estimated the cost of moving training to the cloud. We compared low-bandwidth servers on Genesis Cloud with 8xRTX3090 GPUs against bandwidth-over-provisioned servers on AWS with 8xV100 GPUs. As the benchmark, we chose training BERT on a question-answering task.
Instance | Price per hour ($) | Training cost ($) |
---|---|---|
Genesis Cloud without CGX | 10.4 | 61.7 |
AWS | 24.5 | 49.0 |
Genesis Cloud with CGX | 10.4 | 18.6 |
See benchmarks for further benchmark details.
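The table above also implies how long each run took and how much CGX saves. The short calculation below is a sketch that assumes the training cost is simply the hourly price multiplied by the training time, with no other charges; the variable names are illustrative.

```python
# (price per hour in $, training cost in $), copied from the table above.
rows = {
    "Genesis Cloud without CGX": (10.4, 61.7),
    "AWS": (24.5, 49.0),
    "Genesis Cloud with CGX": (10.4, 18.6),
}

# Implied training time in hours, assuming cost = price * hours.
hours = {name: cost / price for name, (price, cost) in rows.items()}

# Cost of each setup relative to Genesis Cloud with CGX.
cgx_cost = rows["Genesis Cloud with CGX"][1]
relative_cost = {name: cost / cgx_cost for name, (_, cost) in rows.items()}

for name in rows:
    print(f"{name}: {hours[name]:.1f} h, {relative_cost[name]:.1f}x the CGX cost")
```

Under that assumption the runs took roughly 5.9, 2.0 and 1.8 hours respectively, and training with CGX on Genesis Cloud costs about 3.3 times less than the same servers without CGX and about 2.6 times less than AWS.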
Citation
If you use this library in your research, please cite the corresponding paper as follows:
@inproceedings{markov2022cgx,
title={CGX: Adaptive System Support for Communication-Efficient Deep Learning},
author={Markov, Ilia and Ramezanikebrya, Hamidreza and Alistarh, Dan},
booktitle={23rd International Middleware Conference (Middleware'22), November 7--11, 2022, Quebec, QC, Canada},
year={2022}
}