Pytorch-lightning: Multi-GPU when using Torchtext iterator for data loading

Created on 16 Sep 2019 · 22 Comments · Source: PyTorchLightning/pytorch-lightning

Hi there,

I have just discovered pytorch-lightning a few days ago and it seems awesome (congratulations!).
I have a question I cannot solve by reading the docs and examples:
is it fully compatible with Torchtext?
I am trying to use a Torchtext iterator to load the data in batches. I have managed to make it work on a single GPU, but when I add additional GPUs to the trainer:

trainer = Trainer(experiment=exp, gpus=[0, 1])

it breaks saying:

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:397

I understand that the problem comes from the model and the data not being placed in the same GPU.
I am following the provided template, replacing the MNIST parts with my own data.
The way I load the training data is:

@pl.data_loader
def tng_dataloader(self):
    print('tng data loader called')
    DEVICE = torch.device(next(self.parameters()).device if torch.cuda.is_available() else "cpu")
    print('Current device ', DEVICE)
    (train_iter,) = BucketIterator.splits(
        (self.train_data,),
        sort=False,
        batch_size=self.hparams.batch_size,
        shuffle=True,
        repeat=False,
        device=DEVICE  # what to do here? what will Lightning do w.r.t. the device?
    )

    return train_iter

I use that little hack to get the current GPU device to parameterize the Torchtext BucketIterator: if I leave the iterator's "device" field empty it defaults to CPU, and training then fails with the corresponding complaint once the model is on the GPU:
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'

But this hack does not work in the more-than-one-GPU setting.
Am I missing something, or am I doing something wrong?
I could also reimplement my data loading with regular PyTorch dataloaders as in the template, but I would like to know whether I can stick with Torchtext and still get the multi-GPU goodies from Lightning :)

Thanks in advance!

question

Most helpful comment

Currently, you need to manually transfer data to the GPU when using torchtext. Take a look at my gist
https://gist.github.com/mateuszpieniak/f290b3a727db7e94b9da0bd3bd2e33c1 and the method transfer_batch_to_device.
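For reference, a minimal sketch of that approach (not the gist verbatim; the class name is mine, it assumes every field on the torchtext Batch is a plain tensor, and it assumes a Lightning version that exposes the transfer_batch_to_device hook):

import torch
import pytorch_lightning as pl


class MyTorchtextModule(pl.LightningModule):
    # ... the usual LightningModule methods ...

    def transfer_batch_to_device(self, batch, device):
        # torchtext's Batch is neither a tensor nor a list/dict, so Lightning's
        # default transfer logic skips it; move each field's tensor by hand.
        if hasattr(batch, 'fields'):
            for name in batch.fields:
                value = getattr(batch, name)
                if torch.is_tensor(value):
                    setattr(batch, name, value.to(device))
            return batch
        # anything else falls back to the default behaviour
        return super().transfer_batch_to_device(batch, device)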

All 22 comments

hi! install from master and try again? i believe we pushed a fix for this on master. if not, i can look at it deeper

@aitor-garcia-p actually, just released a new version with these fixes. Try again? if not we'll take a deeper look at it

Hi again,

After digging a bit (with my limited understanding), I see that in this function,

https://github.com/williamFalcon/pytorch-lightning/blob/4c61d1f30a29db8404606c6c933a5a2f3c0ae1ae/pytorch_lightning/trainer/trainer.py#L1077

if the "batch" parameter is a torchtext.data.Batch object (as it happens when using a Torchtext Iterator) the Trainer function transfer_batch_to_gpu will miss it despite having several conditionals.

I have made a test adding this additional condition:

elif getattr(batch, 'fields', None):
    for f_name in batch.fields:
        setattr(batch, f_name, self.transfer_batch_to_gpu(getattr(batch, f_name), gpu_id))
    return batch

(Or any other condition that catches a torchtext.data.Batch instance.)
With that change, single-GPU training works nicely without any additional hack to set the device on the iterators.
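Written outside the Trainer, the same idea is roughly this (a standalone sketch of the branch above, not Lightning code; the tuple handling for include_lengths=True is my addition):

import torch


def move_torchtext_batch_to_gpu(batch, gpu_id):
    """Move each named field of a torchtext Batch onto the given GPU."""
    device = torch.device(f'cuda:{gpu_id}')
    for f_name in batch.fields:
        value = getattr(batch, f_name)
        if torch.is_tensor(value):
            setattr(batch, f_name, value.to(device))
        elif isinstance(value, (tuple, list)):  # e.g. fields built with include_lengths=True
            setattr(batch, f_name, type(value)(
                v.to(device) if torch.is_tensor(v) else v for v in value))
    return batch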

But I still cannot get multi-GPU working when the batches come from a Torchtext iterator.
I instantiate the Trainer like this:

trainer = Trainer(experiment=exp, gpus=[0, 1], distributed_backend='ddp')

And it complains about the following:

   mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/DATA/agarciap_data/python_stuff/python_envs/venv_for_remotes/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/usr/local/lib/python3.7/pickle.py", line 437, in dump
    self.save(obj)
  File "/usr/local/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/local/lib/python3.7/pickle.py", line 662, in save_reduce
    save(state)
  File "/usr/local/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/pickle.py", line 856, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/lib/python3.7/pickle.py", line 882, in _batch_setitems
    save(v)
  File "/usr/local/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/pickle.py", line 786, in save_tuple
    save(element)
  File "/usr/local/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/pickle.py", line 771, in save_tuple
    save(element)
  File "/usr/local/lib/python3.7/pickle.py", line 549, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/local/lib/python3.7/pickle.py", line 662, in save_reduce
    save(state)
  File "/usr/local/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/pickle.py", line 856, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/lib/python3.7/pickle.py", line 882, in _batch_setitems
    save(v)
  File "/usr/local/lib/python3.7/pickle.py", line 524, in save
    rv = reduce(self.proto)
TypeError: 'generator' object is not callable

It seems that something about the Torchtext iterator prevents proper serialization when the distributed processes are spawned.
Or there is something else missing. For example, I have noticed that for ddp the PyTorch DataLoaders need a DistributedSampler, but I don't know how the Torchtext iterators deal with that.
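For comparison, this is roughly what a plain PyTorch DataLoader setup for ddp looks like (a sketch; my_dataset is a placeholder and the process group is assumed to be initialised already):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Each ddp process gets its own shard of the dataset through the sampler
# (DistributedSampler reads the rank and world size from the initialised process group).
sampler = DistributedSampler(my_dataset)
loader = DataLoader(my_dataset, batch_size=32, sampler=sampler, shuffle=False)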

yeah, looks like torchtext can't be pickled and thus can't be used with DDP. But you should verify that on the torchtext issues. If that's true, then i'd recommend DP, or we can try to come up with a workaround

Also, feel free to submit a PR with your changes so we can enable torchtext support

Hey @williamFalcon ,
Any progress on this, or a preferred workaround?
The ubiquity of torchtext together with this issue makes it difficult to do NLP/seq2seq.

hey! sorry, been busy with deadlines but will look at it this week.

want to take a stab at a PR? can help you finish it once you submit it

@ctlaltdefeat did you still want to submit this PR?
@jeffling @neggert anyone want to take a look at this?

I've been busy too, and I think it may be more of an issue between DistributedDataParallel and torchtext than anything that lightning adds per se.

That's correct, torchtext can't be pickled and you'll want to use DP.

Could you give a full stacktrace of the issue with DP? I'm not sure which step is emitting that error or if it's coming from dataloading or training.

The issue with DP (for me) is that the inability to use mixed-precision training offsets the benefit of multi-GPU training.
I think tomorrow I'll try to work around the issue by either converting the torchtext object to a standard DataLoader or by somehow separating the torchtext object from the model that needs to be pickled.

Any recent updates on this issue?

I am trying to run a torchtext dataset; it works fine with a single GPU, but fails on dp and ddp (ddp2 is out of bounds for me as I have no SLURM). I think the ddp failure may be an issue with another library (wandb.com).

But for dp I am getting the same error as OP.

RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:397

@jeffling this is the error trace for DP with torchtext:

Traceback (most recent call last):
  File "/Siamese_BERT_blogpost/train.py", line 107, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 348, in fit
    self.dp_train(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/dp_mixin.py", line 104, in dp_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 455, in run_pretrain_routine
    self.evaluate(model, self.get_val_dataloaders(), self.nb_sanity_val_steps, self.testing)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 50, in evaluate
    test)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop_mixin.py", line 174, in evaluation_forward
    output = model(*args)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/pt_overrides/override_data_parallel.py", line 65, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/pt_overrides/override_data_parallel.py", line 69, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/pt_overrides/override_data_parallel.py", line 199, in parallel_apply
    raise output
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/pt_overrides/override_data_parallel.py", line 165, in _worker
    output = module.validation_step(*input, **kwargs)
  File "/Siamese_BERT_blogpost/wrapper.py", line 42, in validation_step
    out = self.forward(batch)
  File "/Siamese_BERT_blogpost/wrapper.py", line 35, in forward
    return self.siamese(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/Siamese_BERT_blogpost/models.py", line 46, in forward
    premise = self.language_model(premise)[0]
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 735, in forward
    embedding_output = self.embeddings(input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 186, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:400

I am running train.py from this repository: https://github.com/Genei-Ltd/Siamese_BERT_blogpost/blob/master/train.py

@aced125 it looks like the batches aren't being put onto the right GPU. Could you look at the example code the OP posted regarding the hack with torchtext to place things onto the right GPU?

It also looks like @ctlaltdefeat had this working with DP, but couldn't use DP due to other reasons. Any tips?

@jeffling I've given up on torchtext datasets, to be honest. It was easy enough to switch to a torch.utils.data.DataLoader instead.

I am going to try PL on graph convolutions soon (using the pytorch-geometric library, which has a custom DataLoader that inherits from the torch DataLoader), so I will let you know if that works well.
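For anyone making the same switch, a rough sketch of wrapping a legacy torchtext dataset in a plain torch.utils.data.DataLoader; TEXT and LABEL are assumed to be the torchtext Fields built elsewhere, and .text/.label are assumed to be the example attributes for those fields:

from torch.utils.data import DataLoader, Dataset


class WrappedTorchtextDataset(Dataset):
    """Expose torchtext Examples through the standard Dataset protocol."""

    def __init__(self, torchtext_dataset):
        self.examples = torchtext_dataset.examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def collate(examples):
    # Field.process numericalizes and pads a list of tokenized examples.
    text = TEXT.process([ex.text for ex in examples])
    labels = LABEL.process([ex.label for ex in examples])
    return text, labels


loader = DataLoader(WrappedTorchtextDataset(train_data), batch_size=32,
                    shuffle=True, collate_fn=collate)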

Hello,

I am not able to get a simple toy example running using Torchtext iterators even on a single gpu. I am using a BucketIterator as follows:

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
train_iter, valid_iter, test_iter = BucketIterator.splits((train_data, valid_data, test_data), batch_size=batch_size)

My trainer code is:

model = SegmenterModule(80, 76)
trainer = Trainer(gpus=1, max_nb_epochs=3, default_save_path='checkpoints')
trainer.fit(model)

But I get an error because the batch data is still on cpu and not moved to the gpu.

Stack trace:
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Can someone please help me figure out the problem or share a working example using torchtext iterators? Also, should I open a new issue with this problem or let this question be here?
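A sketch of how the device hack from the original post could be wired into this toy example, assuming self.train_data and self.hparams.batch_size are set inside SegmenterModule (only helps on a single GPU, where the model already lives on the target device when the dataloader is built):

import pytorch_lightning as pl
from torchtext.data import BucketIterator


class SegmenterModule(pl.LightningModule):
    # ... __init__, forward, training_step, configure_optimizers ...

    @pl.data_loader
    def tng_dataloader(self):
        # Build the iterator on whatever device the model parameters live on,
        # so the batches already match the model's device.
        device = next(self.parameters()).device
        (train_iter,) = BucketIterator.splits(
            (self.train_data,),
            batch_size=self.hparams.batch_size,
            shuffle=True,
            repeat=False,
            device=device,
        )
        return train_iter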

As some people mentioned here, I cannot make it work even for a single GPU. I debugged the code, and it seems that the Batch object generated by torchtext.data.Iterator doesn't follow the rules described here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/45d671a4a81788b9d97fd6b47763816926e58e95/pytorch_lightning/trainer/distrib_parts.py#L420

As a result, the data are not moved to the GPU and the code raises the following exception:

RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same

@aitor-garcia-p @mateuszpieniak @jeffling let's close this one and continue the discussion on how to improve the situation in #1245

> Hello,
> I am not able to get a simple toy example running using Torchtext iterators even on a single gpu. [...]
> Can someone please help me figure out the problem or share a working example using torchtext iterators?

Have you found a solution yet?

> Hello,
> I am not able to get a simple toy example running using Torchtext iterators even on a single gpu. [...]
>
> Have you found a solution yet?

I also have the same problem.

Currently, you need to manually transfer data to the GPU when using torchtext. Take a look at my gist
https://gist.github.com/mateuszpieniak/f290b3a727db7e94b9da0bd3bd2e33c1 and the method transfer_batch_to_device.
