Hi @vfdev-5 ,
I am developing a distributed evaluation feature and have run into a problem: the preds and labels on different GPUs don't have the same length, so idist.all_gather() can't work. For example: GPU0 has 5 images to handle and GPU1 has 4 images, total = 9 images.
Could you please advise how to idist.all_gather() values of different lengths?
I don't want to pad the input data to make it evenly divisible, because that would make the metrics differ between single-GPU and multi-GPU runs.
Thanks in advance.
BTW, I mean that when using ignite.metrics.EpochMetric, the input preds and labels have different lengths on different GPUs.
Thanks.
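To make the concern about padding concrete, here is a minimal sketch in plain Python. The labels and predictions are made-up illustrative values, not data from this issue, and it assumes the padding is done by repeating an existing sample. Padding the 9 samples to 10 shifts the accuracy:

```python
def accuracy(preds, labels):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical data: 9 samples, the first one is misclassified.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1]
preds = [1, 1, 1, 0, 1, 0, 0, 1, 1]

# Metric on the real data: 8 correct out of 9.
exact = accuracy(preds, labels)

# Pad to 10 samples (2 GPUs x 5 images) by repeating the first sample:
# now 8 correct out of 10.
padded = accuracy(preds + preds[:1], labels + labels[:1])

print(f"9 samples: {exact:.4f}, padded to 10: {padded:.4f}")
```

The size and direction of the shift depend on which sample gets repeated, which is why the result can differ from a single-GPU run.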
Hi @Nic-Ma ,
I think this can be done with the broadcast collective op.
This would probably be slow, so it could not be used every time... I'll prototype a code snippet a bit later.
In this issue: https://github.com/pytorch/ignite/issues/1288, we thought about providing a feature to customize the reduction/gather ops for metrics, so that users could plug in their own adapted solution as well.
Hi @vfdev-5 ,
Or maybe we can pad with NaN tensors to make it evenly divisible, and delete the NaN values after all_gather()?
Thanks.
> Or maybe we can pad with NaN tensors to make it evenly divisible, and delete the NaN values after all_gather()?
Well, yes, padding can be an option too. But in the first message you said you did not want to pad...
Hi @vfdev-5 ,
Sorry I didn't make it clear: I meant that we don't want to pad the data for the DistributedSampler before model prediction.
So maybe it's possible to pad with NaN only right before all_gather() and delete the NaN values afterwards.
Thanks.
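The pad-then-delete idea can be sketched in a single process, with a plain concatenation standing in for the real all_gather. The helper name `pad_with_nan` and the tensor values are hypothetical, and the sketch assumes 1-D floating-point data:

```python
import torch


def pad_with_nan(t: torch.Tensor, target_len: int) -> torch.Tensor:
    """Pad a float tensor along dim 0 with NaN entries up to target_len."""
    missing = target_len - t.shape[0]
    if missing <= 0:
        return t
    filler = t.new_full((missing, *t.shape[1:]), float("nan"))
    return torch.cat([t, filler], dim=0)


# Simulated per-rank predictions: 5 items on rank 0, 4 items on rank 1.
per_rank = [
    torch.arange(0, 5, dtype=torch.float32),
    torch.arange(5, 9, dtype=torch.float32),
]
max_len = max(t.shape[0] for t in per_rank)

# Stand-in for all_gather: every rank contributes an equally sized chunk.
gathered = torch.cat([pad_with_nan(t, max_len) for t in per_rank], dim=0)

# Delete the padding by masking out NaN entries.
cleaned = gathered[~torch.isnan(gathered)]
print(cleaned)
```

One caveat of removing the padding with a NaN mask: any genuine NaN in the predictions would be dropped as well, so trimming by the known per-rank lengths is the safer choice.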
Hi @vfdev-5 ,
I am facing an issue with idist.all_gather() on a string:
Here I collect all the filenames of one GPU in the self._filenames list and join them into a single string, then try to gather it, but it always hangs at the all_gather line.
_filenames = "\t".join(self._filenames)
_filenames = idist.all_gather(_filenames)
My PyTorch version is 1.7.0 and the GPU is a V100. Do you know what could cause this issue?
Thanks.
Hi @Nic-Ma
Here is a draft version of all_gather with padding:
# Question : https://github.com/pytorch/ignite/issues/1569
import os
import time
from typing import Optional, List

import torch
import torch.nn.functional as F
import torch.distributed as dist


def compute_padding(shape, new_shape):
    # Build the `pad` argument of F.pad: (left, right) pairs, last dimension first
    padding = []
    for dim, new_dim in zip(shape, new_shape):
        padding.insert(0, new_dim - dim)
        padding.insert(0, 0)
    return padding


def all_gather(tensor: torch.Tensor, fixed_shape: Optional[List] = None) -> List[torch.Tensor]:
    input_shape = tensor.shape
    if fixed_shape is not None:
        # pad the local tensor up to the common shape
        padding = compute_padding(tensor.shape, fixed_shape)
        if sum(padding) > 0:
            tensor = F.pad(tensor, pad=padding, mode='constant', value=0)

    output = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(output, tensor)

    all_input_shapes = None
    if fixed_shape is not None:
        # gather all shapes
        tensor_shape = torch.tensor(input_shape, device=tensor.device)
        all_input_shapes = [torch.zeros_like(tensor_shape) for _ in range(dist.get_world_size())]
        dist.all_gather(all_input_shapes, tensor_shape)
        all_input_shapes = [t.tolist() for t in all_input_shapes]

    if all_input_shapes:
        for i, shape in enumerate(all_input_shapes):
            # negative padding crops each gathered tensor back to its original shape
            padding = compute_padding(output[i].shape, shape)
            if sum(padding) < 0:
                output[i] = F.pad(output[i], pad=padding)

    return output


if __name__ == "__main__":
    dist.init_process_group("nccl", init_method="env://")
    lrank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(lrank)

    rank = dist.get_rank()
    size = 5
    uneven_size = 2
    # the last rank gets a shorter slice than the others
    t = torch.arange(dist.get_world_size() * size - uneven_size, device="cuda")
    input_tensor = t[size * rank: size * (rank + 1)]

    # stagger the prints so the output of different ranks does not interleave
    time.sleep(0.5 * (dist.get_rank() + 1))
    print(rank, " - Input : ", input_tensor)

    result = all_gather(input_tensor, fixed_shape=(size, ))

    time.sleep(0.5 * (dist.get_rank() + 1))
    print(rank, " - Output: ", result)

    dist.destroy_process_group()
Run as
python -u -m torch.distributed.launch --nproc=2 --use_env question-uneven-input-all-gather.py
0 - Input : tensor([0, 1, 2, 3, 4], device='cuda:0')
1 - Input : tensor([5, 6, 7], device='cuda:1')
0 - Output: [tensor([0, 1, 2, 3, 4], device='cuda:0'), tensor([5, 6, 7], device='cuda:0')]
1 - Output: [tensor([0, 1, 2, 3, 4], device='cuda:1'), tensor([5, 6, 7], device='cuda:1')]
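The draft relies on F.pad accepting negative amounts: compute_padding yields positive values to pad before the gather and negative values to crop afterwards. A minimal single-process illustration of that round trip on CPU, using rank 1's values from the output above:

```python
import torch
import torch.nn.functional as F

t = torch.tensor([5, 6, 7])

# Pad the last dimension on the right by 2 so the length becomes 5.
padded = F.pad(t, pad=[0, 2], mode="constant", value=0)

# A negative amount removes elements from that side instead.
restored = F.pad(padded, pad=[0, -2])

print(padded.tolist(), restored.tolist())
```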
As for the idist.all_gather hang with string input, let me try to reproduce it and see.
BTW, what is the best way to reach you by message to discuss a few things?
I sent you a few messages on the MONAIBoot2020 Slack...
@Nic-Ma I couldn't reproduce your issue.
Here is my code:
import os
import time

import torch
import torch.distributed as dist

import ignite
import ignite.distributed as idist


if __name__ == "__main__":
    print(torch.__version__, ignite.__version__)

    dist.init_process_group("nccl", init_method="env://")
    lrank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(lrank)

    rank = dist.get_rank()
    all_filenames = [f"filename_{i}_{rank}" for i in range(10)]
    data = "\t".join(all_filenames)

    time.sleep(0.5 * (dist.get_rank() + 1))
    print(rank, " - Input : ", data)

    result = idist.all_gather(data)

    time.sleep(0.5 * (dist.get_rank() + 1))
    print(rank, " - Output: ", result)

    dist.destroy_process_group()
and output
# python -u -m torch.distributed.launch --nproc=2 --use_env issue-1569-repro-hang-idist-gather-string.py
1.7.1 0.5.0.dev20210121
1.7.1 0.5.0.dev20210121
0 - Input : filename_0_0 filename_1_0 filename_2_0 filename_3_0 filename_4_0 filename_5_0 filename_6_0 filename_7_0 filename_8_0 filename_9_0
1 - Input : filename_0_1 filename_1_1 filename_2_1 filename_3_1 filename_4_1 filename_5_1 filename_6_1 filename_7_1 filename_8_1 filename_9_1
0 - Output: ['filename_0_0\tfilename_1_0\tfilename_2_0\tfilename_3_0\tfilename_4_0\tfilename_5_0\tfilename_6_0\tfilename_7_0\tfilename_8_0\tfilename_9_0', 'filename_0_1\tfilename_1_1\tfilename_2_1\tfilename_3_1\tfilename_4_1\tfilename_5_1\tfilename_6_1\tfilename_7_1\tfilename_8_1\tfilename_9_1']
1 - Output: ['filename_0_0\tfilename_1_0\tfilename_2_0\tfilename_3_0\tfilename_4_0\tfilename_5_0\tfilename_6_0\tfilename_7_0\tfilename_8_0\tfilename_9_0', 'filename_0_1\tfilename_1_1\tfilename_2_1\tfilename_3_1\tfilename_4_1\tfilename_5_1\tfilename_6_1\tfilename_7_1\tfilename_8_1\tfilename_9_1']
Please let me know if my example correctly reflects your issue. Thanks
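As the output shows, the gathered result is a list with one tab-joined string per rank. A small sketch of turning such a list back into a flat list of filenames, with the list rebuilt locally here rather than gathered (it assumes no filename contains a tab character):

```python
# Rebuild the shape of the gathered `result`: one tab-joined string per rank.
result = [
    "\t".join(f"filename_{i}_{rank}" for i in range(10))
    for rank in range(2)
]

# Split every per-rank string and flatten into a single list.
all_filenames = [name for joined in result for name in joined.split("\t")]

print(len(all_filenames), all_filenames[0], all_filenames[-1])
```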
Hi @vfdev-5 ,
Thanks very much for your detailed program!!
I also can't reproduce the issue with your program; let me try to put together a program that reproduces it.
BTW, I don't know why I can't log in to the BootCamp workspace in Slack anymore... So maybe let's communicate by email? My email [email protected] is always online.
Thanks.
I compared your program with mine and found the root cause. Sorry, it was my mistake; all_gather works for my string now.
Thanks very much for your help and example program!!!
@Nic-Ma how about the code in https://github.com/pytorch/ignite/issues/1569#issuecomment-766739042 for all_gather on tensors of different shapes?
We could consider adding more functions like that (e.g. all_reduce) and putting them into a contrib.distributed.utils module...
Hi @vfdev-5 ,
Your example code looks good, and I have now developed an evenly_divisible_all_gather() in MONAI to handle this case:
import torch
import ignite.distributed as idist


def evenly_divisible_all_gather(data: torch.Tensor):
    """
    Utility function for distributed data parallel to pad tensor to make it evenly divisible for all_gather.

    Args:
        data: source tensor to pad and execute all_gather in distributed data parallel.

    """
    if idist.get_world_size() <= 1:
        return data
    # make sure the data is evenly-divisible on multi-GPUs
    length = data.shape[0]
    all_lens = idist.all_gather(length)
    max_len = max(all_lens).item()
    if length < max_len:
        size = [max_len - length] + list(data.shape[1:])
        data = torch.cat([data, data.new_full(size, float("NaN"))], dim=0)
    # all gather across all processes
    data = idist.all_gather(data)
    # delete the padding NaN items
    return torch.cat([data[i * max_len : i * max_len + l, ...] for i, l in enumerate(all_lens)], dim=0)
Thanks.
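The last line of evenly_divisible_all_gather() removes the padding by slicing each rank's chunk to its true length rather than by searching for NaN. A single-process sketch of that trimming step, with a hand-built tensor standing in for the gathered result (the values are hypothetical):

```python
import torch

max_len = 5
all_lens = [5, 4]  # true item counts per rank

# Simulated all_gather output: rank 0 holds a genuine NaN at index 2,
# rank 1 holds 4 real items followed by one NaN used as padding.
nan = float("nan")
gathered = torch.tensor([0.0, 1.0, nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, nan])

# Keep only the first n items of each rank's max_len-sized chunk.
chunks = [gathered[i * max_len : i * max_len + n] for i, n in enumerate(all_lens)]
trimmed = torch.cat(chunks, dim=0)

print(trimmed)
```

Because the cut is driven by the gathered lengths, a NaN that is part of the real data survives and only the padding entry is removed.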