I have been using PTL for a month. It is nice and saves a lot of time, and I intend to use it in future projects. That said, I have a list of feature requests and improvements that would be very helpful for supporting a wider set of use cases. I am not sure what the best format for this list is, so I will just write them here.
1) Better support for IterableDataset
- In addition to `val_check_interval`, we also need `num_val_steps` and `num_train_steps`. `num_val_steps` is needed because the validation set is also an `IterableDataset`. `num_train_steps` is needed because you usually need to carefully pick the number of gradient updates, which interacts with the learning rate scheduler (`num_train_steps=inf` is not sufficient). See the sketch after this list.
- Reuse the validation `DataLoader` object instead of instantiating a new one on each validation cycle, because it is costly to construct new workers each time.
- `Dataset`, not `IterableDataset`.
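As a rough illustration of why a step cap is needed, here is a minimal sketch (plain PyTorch, not PTL's API) of a validation pass over an `IterableDataset`-backed loader, where `num_val_steps` is the hypothetical argument requested above:

```python
import itertools
import torch

# val_loader is assumed to wrap an IterableDataset, so len(val_loader) is not
# available and the pass has to be capped by an explicit step count.
def run_validation(model, val_loader, num_val_steps):
    losses = []
    with torch.no_grad():
        for x, y in itertools.islice(val_loader, num_val_steps):
            losses.append(torch.nn.functional.cross_entropy(model(x), y))
    return torch.stack(losses).mean()
```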
2) Step-wise processing

Think of the "gradient update" as the unit of training instead of (or in addition to) an epoch. A typical use case is pretraining a language model, where you want to control the number of gradient updates, not epochs (e.g., check the RoBERTa/BERT papers).
- Call `scheduler.step()` after every gradient update (see the sketch after this list).
- Make `self.trainer.num_train_steps` available to the LR scheduler. The scheduler is usually a function of the number of steps.
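A minimal sketch of what this looks like in plain PyTorch (outside of PTL): the loop runs for a fixed number of gradient updates, and the LR schedule is a function of that step count. All names and numbers here are illustrative.

```python
import itertools
import torch
from torch.utils.data import DataLoader, IterableDataset


class RandomStream(IterableDataset):
    """Toy endless stream standing in for a real pretraining corpus."""
    def __iter__(self):
        while True:
            yield torch.randn(128), torch.randint(0, 2, (1,)).item()


num_train_steps = 1000   # picked up front, as in the BERT/RoBERTa recipes
warmup_steps = 100

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# The schedule depends on the gradient-update count, not the epoch count.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
loader = DataLoader(RandomStream(), batch_size=32)

for step, (x, y) in enumerate(itertools.islice(loader, num_train_steps)):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()     # stepped after every gradient update, not every epoch
```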
3) Misc. These are smaller points, but nice to have.

- `all_grad_norm` (check fairseq).
- `Trainer(gpus=2)` ignores `CUDA_VISIBLE_DEVICES` and always picks the first two GPUs.
- In `validation_end`: `val_loss = torch.distributed.all_reduce(val_loss, op=torch.distributed.ReduceOp.SUM) / self.trainer.world_size` (see the sketch after this list).
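Regarding the last point, a small sketch of the averaging step (note that `torch.distributed.all_reduce` modifies the tensor in place rather than returning the reduced value; `world_size` here is whatever the trainer reports):

```python
import torch
import torch.distributed as dist


def average_val_loss(val_loss: torch.Tensor, world_size: int) -> torch.Tensor:
    # all_reduce sums in place across processes and returns None,
    # so divide by the world size afterwards to get the mean.
    dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)
    return val_loss / world_size
```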
Thanks for the helpful library and sorry for the long list.
> Trainer(gpus=2) ignores CUDA_VISIBLE_DEVICES and always picks the first two gpus.
`Trainer` cannot ignore `CUDA_VISIBLE_DEVICES`, as it is handled by the underlying libraries.
The device IDs in PyTorch/TensorFlow applications are not consistent with the IDs shown in nvidia-smi. E.g., gpu:0 in PyTorch may be gpu:2 in nvidia-smi. I guess that's why you failed to set the device ID to use the card you want.
You may set the environment variable `export CUDA_DEVICE_ORDER=PCI_BUS_ID`; then the order of device IDs will be consistent.
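For reference, a small example of setting both variables from Python; this is purely illustrative and only has an effect if it runs before CUDA is initialized:

```python
import os

# Must be set before the first CUDA call, or they have no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # match nvidia-smi's ordering
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"      # expose only these two cards

import torch

print(torch.cuda.device_count())  # 2; they appear as cuda:0 and cuda:1
```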
@matthew-z, PTL does override my `CUDA_VISIBLE_DEVICES` to `0,1,...`, making it difficult to run multiple jobs on the same machine. Check the code here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/06242c200a318a37d1f882c786e60354ec04533f/pytorch_lightning/trainer/distrib_data_parallel.py#L251
and similarly here: https://github.com/PyTorchLightning/pytorch-lightning/blob/06242c200a318a37d1f882c786e60354ec04533f/pytorch_lightning/trainer/distrib_parts.py#L516
probably linked to #323 and #698
Has `num_val_steps` been added, or is there some alias?
As far as I can tell, no. My workaround is to add

```python
def __len__(self):
    return 1000000  # a large positive constant
```

in the `IterableDataset`, then use `val_percent_check` to run for a smaller number of steps.
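Spelled out, the workaround looks roughly like this (toy stream; the `1000000` and the `val_percent_check` value are just illustrative):

```python
import torch
from torch.utils.data import IterableDataset


class StreamingValSet(IterableDataset):
    def __iter__(self):
        while True:
            yield torch.randn(128), torch.randint(0, 2, (1,)).item()

    def __len__(self):
        # Fake, large length so percentage-based limits have something
        # to multiply against; the underlying stream is unbounded.
        return 1000000


# e.g. 1000000 * 0.0005 = 500 validation batches per check
# trainer = Trainer(..., val_percent_check=0.0005)
```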