How to Implement Multi-GPU Training in PyTorch

To train on multiple GPUs in PyTorch, you can use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel. Below is a brief description and an example of each method:

Method 1: Using `torch.nn.DataParallel`

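DataParallel is the simpler of the two methods. It runs in a single process: on each forward pass it replicates the model onto every GPU, splits the input batch along the first dimension into sub-batches, sends one sub-batch to each GPU for computation, and gathers the outputs back on the default GPU.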

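import torch
import torch.nn as nn
from torchvision import models

# dataloader is assumed to be defined elsewhere
model = models.resnet18()
# Wrap the model only when more than one GPU is available
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # DataParallel replicates the model on each GPU during the forward pass
    model = nn.DataParallel(model)
model.to('cuda')  # move the model to the default GPU
criterion = nn.CrossEntropyLoss()
# The optimizer and learning rate are assumptions; use whatever your task needs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Train the model
for inputs, targets in dataloader:
    inputs, targets = inputs.to('cuda'), targets.to('cuda')
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()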
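
To make the batch-splitting step more concrete, here is a minimal plain-Python sketch of how a batch might be divided among GPUs. It does not call PyTorch; `split_batch` is a hypothetical helper that imitates the chunking rule of `torch.chunk`, which, as far as I know, is what DataParallel relies on when scattering inputs. Treat it as an illustration, not as the library's implementation.

```python
import math

def split_batch(batch, num_gpus):
    """Split a list of samples into contiguous chunks, one per GPU.

    Imitates the torch.chunk rule: every chunk holds
    ceil(len(batch) / num_gpus) samples except possibly the last,
    and fewer chunks come back if the batch is too small.
    """
    if not batch:
        return []
    chunk_size = math.ceil(len(batch) / num_gpus)
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]

batch = list(range(10))  # a hypothetical batch of 10 samples
sub_batches = split_batch(batch, num_gpus=4)
for gpu_id, sub in enumerate(sub_batches):
    print(f"GPU {gpu_id}: {sub}")
```

With 10 samples and 4 GPUs the last GPU receives a single sample, which is one reason to pick a batch size that is a multiple of the GPU count.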

Method 2: Using `torch.nn.parallel.DistributedDataParallel`

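DistributedDataParallel (DDP) is the more advanced method. It runs one process per GPU and works on a single machine as well as across multiple machines. It is usually faster than DataParallel because each process keeps its own model replica and loads its own data, so the model is not re-replicated and the batch is not scattered from a single GPU on every iteration. The only regular communication between GPUs is the all-reduce of gradients during the backward pass.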
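The torch.distributed package ships with PyTorch, so no separate installation is needed.

You can then implement multi-GPU training with the following code, which expects to be started by a launcher such as torchrun so that init_process_group can read the rank and world size from environment variables: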

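import os
import torch
import torch.nn as nn
from torchvision import models
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the distributed environment (reads RANK, WORLD_SIZE, etc. from the launcher)
dist.init_process_group(backend='nccl')
# LOCAL_RANK is the GPU index on this machine; dist.get_rank() would return the global rank
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
# Create the model, move it to this process's GPU, and wrap it in DDP
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # torchvision >= 0.13
model.cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
criterion = nn.CrossEntropyLoss()
# The optimizer and learning rate are assumptions; use whatever your task needs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Train the model; dataloader is assumed to be defined elsewhere with a DistributedSampler
for inputs, targets in dataloader:
    inputs, targets = inputs.to(local_rank), targets.to(local_rank)
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()  # DDP averages gradients across processes here
    optimizer.step()
dist.destroy_process_group()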
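
DDP keeps the model replicas identical by averaging gradients across all processes during the backward pass. The plain-Python sketch below simulates that averaging for a tiny case. It does not call PyTorch; `all_reduce_mean` and the gradient values are made up for illustration, and real DDP does this with bucketed collective operations on the GPUs.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks and hand every rank the result."""
    world_size = len(per_rank_grads)
    averaged = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    # Every rank ends up with its own copy of the same averaged gradient
    return [list(averaged) for _ in range(world_size)]

# Hypothetical gradients of a 3-parameter model on 2 processes
grads = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
synced = all_reduce_mean(grads)
print(synced)  # [[2.0, 3.0, 4.0], [2.0, 3.0, 4.0]]
```

Because every process applies the same averaged gradient with the same optimizer settings, the replicas stay in sync without ever copying the weights between GPUs.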

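Note: when using DistributedDataParallel, you must make sure that each process uses a different GPU. The usual way to do this is to start the script with torchrun, which launches one process per GPU and gives each process its own LOCAL_RANK. Setting the CUDA_VISIBLE_DEVICES environment variable alone is not enough, because init_process_group also needs the rank and world size of each process, but you can still use it to control which GPUs the processes can see.
For example, run the following command on the command line:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py

This starts four train.py processes, one each on GPUs 0, 1, 2, and 3.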
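
For reference, a launcher such as torchrun tells each process who it is through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows one way a script might read them. `read_dist_env` is a hypothetical helper, and the fallback values for a non-distributed run are an assumption made for this example.

```python
import os

def read_dist_env(env=None):
    """Return (rank, local_rank, world_size) from launcher-style environment variables."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global index of this process
    local_rank = int(env.get("LOCAL_RANK", 0))  # GPU index on this machine
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

# Simulate what the second of four processes on one machine would see
fake_env = {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"}
rank, local_rank, world_size = read_dist_env(fake_env)
print(f"rank={rank} local_rank={local_rank} world_size={world_size}")
```

On a single machine RANK and LOCAL_RANK coincide. On several machines they differ, so LOCAL_RANK is the safer choice for selecting a GPU.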
