Allennlp: multiple gpu training question

Created on 2 Jan 2019 · 6 Comments · Source: allenai/allennlp

I tried to use the multi-GPU training feature, but it seems my module has not been replicated successfully across the GPUs.
    q = self._projection(query_antecedent)
  File "/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 67, in forward
    return F.linear(input, self.weight, self.bias)
  File "/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1354, in linear
    output = input.matmul(weight.t())
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:253

I checked the source code in trainer.py and found this line:
replicas = replicate(self.model, used_device_ids)
which seems to do the model replication, so I'm confused by the error above.

def _data_parallel(self, batch):
    """
    Do the forward pass using multiple GPUs. This is a simplification
    of torch.nn.parallel.data_parallel to support the allennlp model
    interface.
    """
    inputs, module_kwargs = scatter_kwargs((), batch, self._cuda_devices, 0)
    used_device_ids = self._cuda_devices[:len(inputs)]
    replicas = replicate(self.model, used_device_ids)
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
    # Only the 'loss' is needed.
    # a (num_gpu, ) tensor with loss on each GPU
    losses = gather([output['loss'].unsqueeze(0) for output in outputs], used_device_ids[0], 0)
    return {'loss': losses.mean()}

All 6 comments

Update:
I assume that multi-GPU training only requires setting "cuda_device": [0, 1]. Is this correct?
I read trainer.py, which states that all model parameters are stored on gpu:0 by default and that the input tensors are split equally across the GPU devices. With parallel training enabled, the model parameters should be replicated on each GPU device, so the above error should not occur.

@robbine, that's usually the only change required, but sometimes small changes are required in your model code. In particular, if you have any aggregate data structures like dictionaries or lists in your model that themselves contain PyTorch modules, you should be sure to use ModuleList or ModuleDict. Similarly, if you have a custom class containing modules, be sure to subclass Module.
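To illustrate the point about aggregate data structures, here is a minimal sketch that runs on CPU. The class names `PlainListEncoder` and `ModuleListEncoder` are hypothetical and are not taken from the code in this issue. The sketch shows that layers kept in a plain Python list are not registered with the parent module, so they do not show up in `parameters()`. Anything that walks the registered submodules (moving the model to a device, or replicating it across GPUs) would presumably skip them as well.

```python
import torch
from torch import nn


class PlainListEncoder(nn.Module):
    """Hypothetical encoder that keeps its layers in a plain Python list."""

    def __init__(self, dim: int, num_layers: int) -> None:
        super().__init__()
        # A plain list is opaque to nn.Module, so these layers are NOT registered.
        self.layers = [nn.Linear(dim, dim) for _ in range(num_layers)]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


class ModuleListEncoder(nn.Module):
    """Same encoder, but the layers are wrapped in nn.ModuleList."""

    def __init__(self, dim: int, num_layers: int) -> None:
        super().__init__()
        # nn.ModuleList registers every layer as a submodule of this module.
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


plain = PlainListEncoder(dim=4, num_layers=2)
registered = ModuleListEncoder(dim=4, num_layers=2)

# Count the parameter tensors each model exposes (weight + bias per layer).
num_plain = len(list(plain.parameters()))
num_registered = len(list(registered.parameters()))
print(num_plain, num_registered)
```

Both models compute the same kind of forward pass on a single device, which is why this class of bug tends to stay hidden until the model is moved or replicated.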

Another possible issue could be that your code is directly placing query_antecedent on a particular GPU. To avoid this I suggest using the torch.*_like factory functions. See https://pytorch.org/docs/stable/torch.html#torch.zeros_like.
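As a rough sketch of the `torch.*_like` suggestion (the helper name `attention_bias_like` and the tensor shapes are made up for illustration), new tensors can be created from an existing tensor so that they inherit its device and dtype instead of being pinned to a fixed GPU:

```python
import torch


def attention_bias_like(scores: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: a zero bias with the same shape, dtype and device as `scores`."""
    # Anti-pattern to avoid: torch.zeros(scores.shape).cuda(0) pins the result to GPU 0,
    # which breaks as soon as `scores` lives on another device.
    return torch.zeros_like(scores)


scores = torch.randn(2, 3)  # stands in for an activation on whichever device the replica uses
bias = attention_bias_like(scores)

# When a different shape is needed, Tensor.new_zeros also inherits dtype and device.
extra = scores.new_zeros((2, 1))

print(bias.device == scores.device, extra.device == scores.device)
```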

If none of those apply, it would be great if you could share a minimal repro of the issue. Thanks!

Thanks @brendan-ai2, here is the line: https://github.com/robbine/allennlp-as-a-library-example/blob/dfb08444dcf71850f9a018063a8de206f4ab7d1e/my_library/modules/layers/common_attention.py#L158
It's actually a BERT implementation that I intended to contribute to allennlp, but allennlp already uses pytorch-pretrained-bert. I added the following two lines and found that the input tensor query_antecedent is on gpu:1 whereas query_projection is on gpu:0.
print(util.get_device_of(query_antecedent))
print(util.get_device_of(query_projection.weight))
Also, the aggregate data structures you mentioned are taken care of on this line: https://github.com/robbine/allennlp-as-a-library-example/blob/dfb08444dcf71850f9a018063a8de206f4ab7d1e/my_library/modules/seq2seq_encoders/transformer.py#L128
Moreover, all user-defined modules inherit from Seq2SeqEncoder, which is itself a subclass of Module.

Thanks for the links, but I can't easily reconcile this with your stack trace.

  • Neither of the files you linked matches the line self._projection(query_antecedent) in the stack trace.
  • It's not immediately clear where multihead_attention (which calls compute_qkv which contains the line you linked as being broken) is called.

If you could provide a full stack trace matching the linked code, I might be able to see the problem quickly, but in general we don't have the resources to debug something of this size. A minimal reproduction of the bug (single file, couple hundred lines at most) would be really helpful.

Sorry, my bad. I've made several commits since then; changing self._projection(query_antecedent) to query_projection(query_antecedent) is one of them.
Also, multihead_attention is called from https://github.com/robbine/allennlp-as-a-library-example/blob/5b1e4e74b2ee11d7c8502cc61c8c5d46302ca7c9/my_library/modules/seq2seq_encoders/multi_head_attention.py#L188
The good news is that the problem is solved by using an nn.ModuleList instead of a Python list. Your advice was very helpful. Here is the link to the lines I changed:
https://github.com/robbine/allennlp-as-a-library-example/blob/5b1e4e74b2ee11d7c8502cc61c8c5d46302ca7c9/my_library/modules/seq2seq_encoders/transformer.py#L103
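For anyone chasing the same kind of bug, the following is a small diagnostic sketch. It is not part of allennlp or PyTorch: `find_unregistered_modules` is a hypothetical helper, and `LeakyEncoder` is a made-up model used only to exercise it. It scans each registered submodule for plain lists, tuples or dicts that hold `nn.Module` objects. It only inspects modules reachable through `named_modules()`, so modules nested inside an already hidden module would not be reported.

```python
from typing import List

from torch import nn


def find_unregistered_modules(module: nn.Module) -> List[str]:
    """Return attribute paths where nn.Module objects sit in plain lists, tuples or dicts."""
    hidden = []
    for prefix, sub in module.named_modules():
        for attr, value in vars(sub).items():
            if attr == "_modules":
                continue  # the registry of properly registered submodules
            if isinstance(value, dict):
                items = list(value.values())
            elif isinstance(value, (list, tuple)):
                items = list(value)
            else:
                continue
            if any(isinstance(item, nn.Module) for item in items):
                hidden.append(f"{prefix}.{attr}" if prefix else attr)
    return hidden


class LeakyEncoder(nn.Module):
    """Made-up model with one registered and one unregistered group of layers."""

    def __init__(self) -> None:
        super().__init__()
        self._good_layers = nn.ModuleList([nn.Linear(2, 2)])
        self._bad_layers = [nn.Linear(2, 2)]  # plain list: not registered


print(find_unregistered_modules(LeakyEncoder()))
```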

Thanks a lot!

Excellent, I'm glad it worked out! Marking this as closed, but do let us know if you hit other issues. Cheers!
