Pytorch-lightning: Step-wise processing, better support for `IterableDataset`, and others

Created on 19 Dec 2019  ·  5 Comments  ·  Source: PyTorchLightning/pytorch-lightning

I have been using PTL for a month. It is nice and saves a lot of time, and I intend to use it in future projects. That said, I have a list of feature requests and improvements that would be very helpful for supporting a wider set of use cases. I am not sure what the best format for this list is, so I will just write everything here.

1) Better support for IterableDataset

  • In addition to val_check_interval, we also need num_val_steps and num_train_steps. num_val_steps is needed because the validation set also uses an IterableDataset. num_train_steps is needed because you usually need to pick the number of gradient updates carefully, and that number interacts with the learning rate scheduler (num_train_steps=inf is not sufficient). See the sketch after this list.
  • For validation, keep the same DataLoader object instead of instantiating a new one on each validation cycle, because it is costly to construct new workers each time.
  • Some of the debugging features that run on a small percentage of the training/validation data don't work because they assume a Dataset, not an IterableDataset.
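
For illustration, a wrapper along these lines can cap an infinite stream at a fixed number of items (a rough sketch, not part of the Lightning API; `BoundedStream` and `max_steps` are made-up names, and DataLoader worker sharding is not handled):

    import itertools
    from torch.utils.data import IterableDataset

    class BoundedStream(IterableDataset):
        def __init__(self, source, max_steps):
            self.source = source        # any iterable of examples, possibly infinite
            self.max_steps = max_steps  # hard cap on yielded items

        def __iter__(self):
            # stop after max_steps items even if the source never ends
            return itertools.islice(iter(self.source), self.max_steps)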

2) Step-wise processing
Think of the "gradient update" as the unit of training instead of (or in addition to) an epoch. A typical use case is pretraining a language model, where you want to control the number of gradient updates, not epochs (e.g. check the RoBERTa/BERT papers).

  • Add an option to do scheduler.step() after every gradient update
  • Have self.trainer.num_train_steps be available to the LR scheduler. The scheduler is usually a function of the number of steps (see the sketch after this list).
  • Checkpoint the current step and resume from that step. Again, this is important for getting the right scheduler state, and also for the tensorboard logging. It would be nice to resume from the same training example as well, but this is less important.
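
For concreteness, the kind of schedule used for BERT/RoBERTa pretraining is written purely in terms of step counts, roughly like this (a sketch; `num_warmup_steps` and `num_train_steps` here are plain arguments, not existing Trainer attributes):

    from torch.optim.lr_scheduler import LambdaLR

    def linear_warmup_decay(optimizer, num_warmup_steps, num_train_steps):
        # linear warmup to the base LR, then linear decay to zero
        def lr_lambda(step):
            if step < num_warmup_steps:
                return step / max(1, num_warmup_steps)
            return max(0.0, (num_train_steps - step) / max(1, num_train_steps - num_warmup_steps))
        return LambdaLR(optimizer, lr_lambda)

    # scheduler.step() would then be called after every gradient update, not once per epoch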

3) Misc. These are smaller points, but nice to have.

  • Have the default tensorboard logging include the learning rate, time per step, and the gradient norm (check fairseq).
  • Trainer(gpus=2) ignores CUDA_VISIBLE_DEVICES and always picks the first two gpus.
  • With ddp, sync validation stats across processes. Forgetting to do so is a common mistake, and it would be nice to guard users against it. It amounts to something like the following at the end of validation_end (note that all_reduce modifies the tensor in place; a fuller sketch follows this list):

    torch.distributed.all_reduce(val_loss, op=torch.distributed.ReduceOp.SUM)
    val_loss = val_loss / self.trainer.world_size
  • Various logs refer to "batches", and it is not clear whether a "batch" means a batch or a step. They are usually the same, except with gradient accumulation. Personally, I prefer the word step because it eliminates that confusion.
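
A fuller sketch of the ddp point above (assuming torch.distributed is initialized; validation_end is the hook that aggregates the per-batch outputs, and the dict layout is only illustrative):

    import torch
    import torch.distributed as dist

    def validation_end(self, outputs):  # method on the LightningModule
        val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)  # in-place sum across ranks
            val_loss = val_loss / dist.get_world_size()      # turn the sum into a mean
        return {'val_loss': val_loss}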

Thanks for the helpful library and sorry for the long list.

enhancement help wanted

All 5 comments

> Trainer(gpus=2) ignores CUDA_VISIBLE_DEVICES and always picks the first two gpus.

Trainer cannot ignore CUDA_VISIBLE_DEVICES, as it is handled by the underlying libraries.

The device ids in pytorch/tensorflow applications are not consistent with the ids shown in nvidia-smi. E.g., gpu:0 in pytorch may be gpu:2 in nvidia-smi. I guess that's why you failed to set the device id to use the card you want.

You may set the env variable CUDA_DEVICE_ORDER, e.g. export CUDA_DEVICE_ORDER=PCI_BUS_ID; then the order of device ids will be consistent.
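
For example (a sketch; the variables have to be set before CUDA is initialized, i.e. before the first CUDA call in the process):

    import os

    # make PyTorch's device ids follow the same PCI bus order as nvidia-smi
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # optionally restrict which physical GPUs are visible (example value)
    os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

    import torch  # import torch (and touch CUDA) only after setting the variables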

probably linked to #323 and #698

Has num_val_steps been added, or is there some alias for it?

As far as I can tell, no. My workaround is to add

    def __len__(self):
        return 1000000  # a large positive constant 

in the IterableDataset, then use val_percent_check to run for a smaller number of steps.
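
Putting the workaround together (a rough sketch; `StreamDataset` and `generator_fn` are placeholder names):

    from torch.utils.data import IterableDataset

    class StreamDataset(IterableDataset):
        def __init__(self, generator_fn):
            self.generator_fn = generator_fn  # callable returning an iterator of examples

        def __iter__(self):
            return self.generator_fn()

        def __len__(self):
            return 1000000  # a large positive constant, not the true length

    # val_percent_check then limits validation to a fraction of this fake length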
