- Overview
- Key Concepts
- Requirements and Environment Setup
- Script Structure
- Command-Line Interface (CLI)
- Features
- Running the Script
- Distributed Training Setup
- Best Practices
- Troubleshooting
- Disclaimers
## Overview

This Python script is provided "as is" to the open-source community to contribute to human advancement and collaborative research in deep learning. It offers a scalable solution for distributed training using PyTorch, featuring Distributed Data Parallel (DDP), mixed precision training, gradient accumulation, and elastic scaling. The script is designed to make efficient use of hardware and resources, enabling training across multiple GPUs or machines.
Please note, this code is provided without warranties, and it is your responsibility to ensure its suitability for your specific use case, particularly in commercial or sensitive environments. Consult your legal advisor if you plan to use this code for commercial purposes.
## Key Concepts

- Distributed Data Parallel (DDP): A PyTorch feature that distributes the training process across multiple GPUs or machines. It splits data across GPUs and synchronizes gradients so that every model replica stays consistent during training.
- Mixed Precision Training: A technique that uses both FP16 (half-precision floating point) and FP32 (single-precision) during training. This reduces memory usage and speeds up computation.
- Gradient Accumulation: A method that simulates larger batch sizes by accumulating gradients over several smaller mini-batches. This is helpful when GPU memory is limited (see the sketch after this list).
- Synchronized Batch Normalization: Ensures that batch statistics are synchronized across all GPUs in multi-GPU setups, which is particularly useful when the per-GPU batch size is small.
- Elastic Scaling: The script can handle dynamic scaling of GPUs or nodes during runtime, making it suitable for environments where the number of available resources can vary.
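Mixed precision and gradient accumulation typically appear together in the inner training loop. The following is a minimal sketch of how the two interact, using `torch.cuda.amp`; the model, data loader, optimizer, and the `accumulation_steps` value are placeholders, and the actual script may structure its loop differently.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(model, loader, optimizer, accumulation_steps=4, device="cuda"):
    """Minimal sketch: mixed precision training with gradient accumulation."""
    scaler = GradScaler()                      # scales the loss to avoid FP16 underflow
    criterion = torch.nn.CrossEntropyLoss()
    optimizer.zero_grad()

    for step, (images, targets) in enumerate(loader):
        images, targets = images.to(device), targets.to(device)

        # Forward pass in mixed precision: FP16 where safe, FP32 elsewhere.
        with autocast():
            loss = criterion(model(images), targets)
            loss = loss / accumulation_steps   # so accumulated grads match one large batch

        scaler.scale(loss).backward()          # accumulate scaled gradients

        # Update weights only every `accumulation_steps` mini-batches.
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```

With `accumulation_steps=4` and a per-GPU batch size of 32, the effective batch size per GPU becomes 128.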
## Requirements and Environment Setup

- One or more GPUs.
- For multi-node setups, ensure each node has access to at least one GPU and that the nodes are connected via a high-speed network (e.g., InfiniBand or 10Gb Ethernet).
- Python 3.7 or later.
- PyTorch 1.8 or later (with `torch.distributed` and `torch.cuda.amp` for mixed precision training).
- NCCL (NVIDIA Collective Communications Library) for efficient communication between GPUs.
Install the required Python packages by running:
```bash
pip install torch torchvision numpy
```

Ensure that NCCL is correctly configured for multi-GPU environments. NCCL is included in most PyTorch installations and optimizes communication between GPUs.
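Before launching a multi-GPU run, it can help to confirm that PyTorch sees your GPUs and that the NCCL backend is available. This short check is not part of the training script; it is just a quick sanity test.

```python
import torch
import torch.distributed as dist

# Quick sanity check of the environment before launching distributed training.
print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
print("NCCL available: ", dist.is_nccl_available())
```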
## Script Structure

- Distributed Initialization: The script initializes distributed training using `torch.distributed` and sets up communication with the NCCL backend (see the sketch after this list).
- Data Loading and Augmentation: The script uses `DataLoader` with a `DistributedSampler` so the dataset is evenly partitioned across GPUs and each process trains on its own shard.
- Model Setup: A pretrained ResNet18 model is used, with its final layer replaced to classify images from the CIFAR-10 dataset. All layers except the last one are frozen for transfer learning.
- Training Loop: The training loop handles mixed precision training, gradient accumulation, learning rate scheduling, and model checkpointing.
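Put together, these pieces look roughly like the sketch below. It is a condensed illustration rather than the script itself: names such as `setup_distributed` and `build_dataloader` are illustrative, and augmentation, logging, and checkpoint handling are omitted.

```python
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import transforms

def setup_distributed(rank, world_size, dist_url="tcp://127.0.0.1:23456"):
    # Initialize the process group over the NCCL backend.
    dist.init_process_group("nccl", init_method=dist_url,
                            rank=rank, world_size=world_size)

def build_dataloader(rank, world_size, batch_size=32):
    # DistributedSampler hands each process a disjoint shard of CIFAR-10.
    dataset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True,
        transform=transforms.ToTensor())
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def build_model(local_rank, num_classes=10):
    # Pretrained ResNet18 with everything frozen except a new final layer.
    model = torchvision.models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    # Synchronize BatchNorm statistics across GPUs.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```

Under `torch.distributed.launch`, each process receives its local rank via the `--local_rank` argument (or the `LOCAL_RANK` environment variable), which would be passed into helpers like these.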
## Command-Line Interface (CLI)

The script offers a range of command-line options to allow for customization. Below are some common CLI arguments (a parsing sketch follows the list):
- `--epochs`: Number of epochs for training (default: 50).
- `--batch_size`: Batch size for training (default: 32).
- `--learning_rate`: Initial learning rate (default: 0.001).
- `--accumulation_steps`: Number of gradient accumulation steps (default: 4).
- `--checkpoint`: File path for saving model checkpoints (default: 'model_checkpoint.pth').
- `--world_size`: Total number of distributed processes (default: 1).
- `--rank`: Rank of the current process in multi-node setups (default: 0).
- `--gpu_ids`: Comma-separated list of GPU IDs to use (default: '0').
- `--dist_url`: URL for initializing distributed training (default: 'tcp://127.0.0.1:23456').
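As a reference for how these flags line up, they could be declared with `argparse` roughly as follows. The defaults mirror the list above; the real script may define, name, or validate them differently.

```python
import argparse

def parse_args():
    # CLI flags mirroring the options documented above (defaults assumed).
    parser = argparse.ArgumentParser(description="Scalable distributed training")
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--accumulation_steps", type=int, default=4)
    parser.add_argument("--checkpoint", type=str, default="model_checkpoint.pth")
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--gpu_ids", type=str, default="0")
    parser.add_argument("--dist_url", type=str, default="tcp://127.0.0.1:23456")
    return parser.parse_args()
```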
Example usage:

- Single-GPU Training:

  ```bash
  python train.py --epochs 50 --batch_size 64 --learning_rate 0.001
  ```

- Multi-GPU Training on a Single Node:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=2 train.py --epochs 50 --batch_size 64 --gpu_ids 0,1 --world_size 2
  ```

- Multi-Node Training:

  On Node 1:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=23456 train.py --epochs 50 --gpu_ids 0,1 --world_size 4
  ```

  On Node 2:

  ```bash
  python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=23456 train.py --epochs 50 --gpu_ids 0,1 --world_size 4
  ```
## Features

- Distributed Data Parallel (DDP): Efficiently splits the workload across multiple GPUs and synchronizes gradients to reduce training time.
- Mixed Precision Training: Uses FP16 computation to reduce memory usage while speeding up model training.
- Gradient Accumulation: Allows training with larger effective batch sizes by accumulating gradients over several smaller batches.
- Synchronized Batch Normalization: Ensures consistent training results by synchronizing batch statistics across GPUs in distributed setups.
- Checkpointing: Periodically saves the model's state and the optimizer's state so that training can resume from where it left off after an interruption (see the sketch after this list).
- Learning Rate Warmup: Gradually increases the learning rate during the initial training epochs to stabilize training early on.
- Elastic Scaling: Supports dynamic adjustment of the number of GPUs or nodes during runtime without restarting the job.
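To illustrate the checkpointing feature, saving and resuming might look like the sketch below. The checkpoint keys and default path are assumptions rather than the script's exact format.

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, path="model_checkpoint.pth"):
    # Save both model and optimizer state so training can resume exactly.
    # model.module unwraps the DDP wrapper so the checkpoint loads outside DDP too.
    target = model.module if hasattr(model, "module") else model
    state = {
        "epoch": epoch,
        "model_state": target.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }
    torch.save(state, path)

def load_checkpoint(model, optimizer, path="model_checkpoint.pth"):
    # Resume from a checkpoint if one exists; returns the epoch to start from.
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    target = model.module if hasattr(model, "module") else model
    target.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1
```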
## Running the Script

For single-GPU training, run the script with the default settings:

```bash
python train.py --epochs 50 --batch_size 64
```

For training on multiple GPUs on the same machine, use PyTorch's `torch.distributed.launch` utility:

```bash
python -m torch.distributed.launch --nproc_per_node=2 train.py --epochs 50 --batch_size 64 --gpu_ids 0,1
```

To train across multiple nodes, you must have SSH access between the nodes. Set the correct `--nnodes`, `--node_rank`, and `--master_addr` parameters:

```bash
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=23456 train.py --epochs 50 --gpu_ids 0,1 --world_size 4
```
## Distributed Training Setup

- SSH Setup for Multi-Node Training: Ensure passwordless SSH access between nodes by setting up SSH keys.
- NCCL Configuration: Configure the NCCL backend for efficient GPU communication. For debugging, use:

  ```bash
  export NCCL_DEBUG=INFO
  ```

- Launching Distributed Training: Use PyTorch's `torch.distributed.launch` utility to initialize the distributed processes.
## Best Practices

- Network Setup for Multi-Node Training: Use high-speed networking, such as InfiniBand or 10Gb Ethernet, for communication between nodes in multi-node training.
- Monitor GPU Utilization: Use tools like `nvidia-smi` (or your system's equivalent) to monitor GPU utilization and ensure each GPU is being used efficiently.
- Learning Rate Warmup: For large-batch distributed training, warm up the learning rate during the first few epochs to avoid instability (see the sketch after this list).
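One common way to implement the warmup mentioned above is a `LambdaLR` schedule that ramps the learning rate linearly over the first few epochs; the five-epoch warmup below is an assumed value, not necessarily what the script uses.

```python
import torch

def build_warmup_scheduler(optimizer, warmup_epochs=5):
    # Linearly ramp the learning rate from ~0 to its full value over
    # `warmup_epochs`, then hold it constant at the base learning rate.
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return float(epoch + 1) / float(warmup_epochs)
        return 1.0
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```

Call `scheduler.step()` once per epoch so the ramp advances; after `warmup_epochs` the multiplier stays at 1.0 and the base learning rate applies.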
## Troubleshooting

- NCCL Configuration: If there are communication issues between GPUs, ensure that the NCCL backend is set up properly.
- Low GPU Utilization: If GPUs are underutilized, check that the batch size is large enough to keep them busy.
- Debugging Distributed Training: For debugging distributed errors, set `NCCL_DEBUG=INFO` and monitor the logs for communication issues.
## Disclaimers

This code and accompanying manual are provided "as is" for open-source use. While the code is intended to facilitate scalable distributed deep learning, there are no warranties or guarantees regarding the accuracy, completeness, or reliability of the code and documentation.
By using this code, you agree that you are responsible for ensuring its suitability for your intended use case, particularly in commercial or high-risk environments. The authors of this script assume no liability for any issues arising from its use.
This script is intended for research and educational purposes. If you plan to use it in a commercial setting, please consult your legal advisor to ensure that you comply with any relevant regulations or licensing requirements.
By following this manual, users should be able to effectively utilize the provided scalable distributed training script for large-scale deep learning tasks on single or multiple nodes. This script has been designed to lower hardware requirements while maintaining performance in deep learning training. However, please ensure that you are fully aware of the legal implications of using this script, especially in commercial contexts.