I have been using PTL for a month. It is nice and saves a lot of time, and I intend to use it in future projects. That said, I have a list of feature requests and improvements that would be very helpful for supporting a wider set of use cases. I am not sure what the best format for this list is, so I will just write them here.
1) Better support for IterableDataset
- In addition to `val_check_interval`, we also need `num_val_steps` and `num_train_steps`. `num_val_steps` is needed because the validation set is also an `IterableDataset`. `num_train_steps` is needed because you usually need to carefully pick the number of gradient updates, which interacts with the learning rate scheduler (`num_train_steps=inf` is not sufficient). See the sketch after this list.
- Reuse the validation `DataLoader` object instead of instantiating a new one on each validation cycle, because it is costly to construct new workers each time.
- `Dataset`, not `IterableDataset`.
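As a rough illustration of why a step cap is needed, here is a minimal sketch (plain PyTorch, not PTL's API) of a validation pass over an `IterableDataset`-backed loader, where `num_val_steps` is the hypothetical argument requested above:

```python
import itertools
import torch

# val_loader is assumed to wrap an IterableDataset, so len(val_loader) is not
# available and the pass has to be capped by an explicit step count.
def run_validation(model, val_loader, num_val_steps):
    losses = []
    with torch.no_grad():
        for x, y in itertools.islice(val_loader, num_val_steps):
            losses.append(torch.nn.functional.cross_entropy(model(x), y))
    return torch.stack(losses).mean()
```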
2) Step-wise processing

Think of the "gradient update" as the unit of training instead of (or in addition to) an epoch. A typical use case is pretraining a language model, where you want to control the number of gradient updates, not epochs (e.g., check the RoBERTa/BERT papers).
- Call `scheduler.step()` after every gradient update (see the sketch after this list).
- Make `self.trainer.num_train_steps` available to the LR scheduler. The scheduler is usually a function of the number of steps.
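A minimal sketch of what this looks like in plain PyTorch (outside of PTL): the loop runs for a fixed number of gradient updates, and the LR schedule is a function of that step count. All names and numbers here are illustrative.

```python
import itertools
import torch
from torch.utils.data import DataLoader, IterableDataset


class RandomStream(IterableDataset):
    """Toy endless stream standing in for a real pretraining corpus."""
    def __iter__(self):
        while True:
            yield torch.randn(128), torch.randint(0, 2, (1,)).item()


num_train_steps = 1000   # picked up front, as in the BERT/RoBERTa recipes
warmup_steps = 100

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# The schedule depends on the gradient-update count, not the epoch count.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
loader = DataLoader(RandomStream(), batch_size=32)

for step, (x, y) in enumerate(itertools.islice(loader, num_train_steps)):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()     # stepped after every gradient update, not every epoch
```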
3) Misc. These are smaller points, but nice to have.

- `all_grad_norm` (check fairseq).
- `Trainer(gpus=2)` ignores `CUDA_VISIBLE_DEVICES` and always picks the first two GPUs.
- In `validation_end`: `val_loss = torch.distributed.all_reduce(val_loss, op=torch.distributed.ReduceOp.SUM) / self.trainer.world_size` (see the sketch after this list).
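Regarding the last point, a small sketch of the averaging step (note that `torch.distributed.all_reduce` modifies the tensor in place rather than returning the reduced value; `world_size` here is whatever the trainer reports):

```python
import torch
import torch.distributed as dist


def average_val_loss(val_loss: torch.Tensor, world_size: int) -> torch.Tensor:
    # all_reduce sums in place across processes and returns None,
    # so divide by the world size afterwards to get the mean.
    dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)
    return val_loss / world_size
```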
Thanks for the helpful library and sorry for the long list.
> Trainer(gpus=2) ignores CUDA_VISIBLE_DEVICES and always picks the first two gpus.
`Trainer` cannot ignore `CUDA_VISIBLE_DEVICES`, as it is handled by the underlying libraries.
The device IDs in PyTorch/TensorFlow applications are not consistent with the IDs shown in nvidia-smi. E.g., gpu:0 in PyTorch may be gpu:2 in nvidia-smi. I guess that's why you failed to set the device ID to use the card you want.
You may set the environment variable `export CUDA_DEVICE_ORDER=PCI_BUS_ID`; then the order of device IDs will be consistent.
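For reference, a small example of setting both variables from Python; this is purely illustrative and only has an effect if it runs before CUDA is initialized:

```python
import os

# Must be set before the first CUDA call, or they have no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # match nvidia-smi's ordering
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"      # expose only these two cards

import torch

print(torch.cuda.device_count())  # 2; they appear as cuda:0 and cuda:1
```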
@matthew-z, PTL does override my `CUDA_VISIBLE_DEVICES` to `0,1,...`, making it difficult to run multiple jobs on the same machine. Check the code here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/06242c200a318a37d1f882c786e60354ec04533f/pytorch_lightning/trainer/distrib_data_parallel.py#L251
and similarly here: https://github.com/PyTorchLightning/pytorch-lightning/blob/06242c200a318a37d1f882c786e60354ec04533f/pytorch_lightning/trainer/distrib_parts.py#L516
probably linked to #323 and #698
Has `num_val_steps` been added, or is there some alias?
As far as I can tell, no. My workaround is to add

```python
def __len__(self):
    return 1000000  # a large positive constant
```

in the `IterableDataset`, then use `val_percent_check` to run for a smaller number of steps.
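Spelled out, the workaround looks roughly like this (toy stream; the `1000000` and the `val_percent_check` value are just illustrative):

```python
import torch
from torch.utils.data import IterableDataset


class StreamingValSet(IterableDataset):
    def __iter__(self):
        while True:
            yield torch.randn(128), torch.randint(0, 2, (1,)).item()

    def __len__(self):
        # Fake, large length so percentage-based limits have something
        # to multiply against; the underlying stream is unbounded.
        return 1000000


# e.g. 1000000 * 0.0005 = 500 validation batches per check
# trainer = Trainer(..., val_percent_check=0.0005)
```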