Pytorch-lightning: Issue with running multiple models in PyTorch Lightning

Created on 3 Aug 2020  ·  8 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I am developing a system that needs to train dozens of individual models (>50) with Lightning, each with its own TensorBoard plots and logs. My current implementation uses one Trainer object per model, and I run into an error once I go over roughly 90 Trainer objects. Interestingly, the error only appears when I call .test(), not during .fit().

As I just started with Lightning, I am not sure whether having one Trainer per model is the best approach. However, I require individual plots from each model, and it seems that if I use a single trainer for multiple models the results get overridden.
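
For context, here is a minimal sketch (not from the original report) of how each model could keep its own TensorBoard run even with separate trainers, by giving every Trainer a dedicated TensorBoardLogger; the logger name and the args keys are assumptions chosen to match the code sample below:

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

# Hypothetical sketch: one TensorBoardLogger per model so runs land in
# separate subdirectories and do not overwrite each other.
trainer_list_0 = []
for i in range(args["num_users"]):
    logger = TensorBoardLogger(save_dir=args["save_path"], name=f"user_{i}_model_0")
    trainer_list_0.append(Trainer(max_epochs=args["epochs"], gpus=1, logger=logger,
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))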

To Reproduce

Steps to reproduce the behaviour:

1. Define more than 90 Trainer objects, each with its own model.
2. Run training for each model.
3. Run testing for each model.
4. See the error:
Traceback (most recent call last):
  File "lightning/main_2.py", line 193, in <module>
    main()
  File "lightning/main_2.py", line 174, in main
    new_trainer.test(model=new_model, test_dataloaders=te_loader)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1343, in __test_given_model
    self.set_random_port(force=True)
  File "\Anaconda3\envs\pysyft\lib\site-packages\pytorch_lightning\trainer\distrib_data_parallel.py", line 398, in set_random_port
    default_port = RANDOM_PORTS[-1]
IndexError: index -1 is out of bounds for axis 0 with size 0

Code sample

Defining the Trainer objects:

for i in range(args["num_users"]):
    trainer_list_0.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_1.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_2.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))

Training:

for i in range(args["num_users"]):
    trainer_list_0[i].fit(model_list_0[i], train_dataloader=dataloader_list[i],
                          val_dataloaders=val_loader)
    trainer_list_1[i].fit(model_list_1[i], train_dataloader=dataloader_list[i],
                          val_dataloaders=val_loader)
    trainer_list_2[i].fit(model_list_2[i], train_dataloader=dataloader_list[i],
                          val_dataloaders=val_loader)

Testing:

for i in range(args["num_users"]):
    trainer_list_0[i].test(test_dataloaders=te_loader)
    trainer_list_1[i].test(test_dataloaders=te_loader)
    trainer_list_2[i].test(test_dataloaders=te_loader)

Expected behaviour

I expected the code to work without crashing.

Environment

  • PyTorch Version (e.g., 1.0): 1.4
  • OS (e.g., Linux): Windows 10 Pro 2004
  • How you installed PyTorch (conda, pip, source): conda
  • Python version: 3.7.6
  • CUDA/cuDNN version: CUDA 10.1/cuDNN 7.0
  • GPU models and configuration: RTX 2060 Super
bug / fix help wanted

All 8 comments

Hi! thanks for your contribution!, great first issue!

Thanks for the report. Will fix it here #2790 as it is directly related.

That's great, thanks. Looking forward to getting the patch!

@epiicme Just a quick follow-up on why this issue got closed:
In your case, you are calling trainer.fit() multiple times, and even instantiating Trainer multiple times. In DDP mode this cannot work, since the Trainer launches the same script multiple times; when that happens, we have no way of controlling which trainer.fit() is executed. I added a note in the docs.

For your use case it means:

  • if you want to call trainer.fit() multiple times, use distributed_backend="ddp_spawn"
  • if you want to use distributed_backend="ddp", you must make sure your script only calls trainer.fit once (or trainer.test)

It is a tradeoff between these two backends; both have their advantages and disadvantages, as outlined in the docs.
Hope this helps you!
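
For example, a minimal sketch of the first option, using illustrative variable names (models, loaders) that are not from this thread:

from pytorch_lightning import Trainer

# Sketch: with ddp_spawn, each fit()/test() call spawns its own worker processes,
# so one script may create several Trainers and call fit() more than once.
for model, loader in zip(models, loaders):
    trainer = Trainer(max_epochs=10, gpus=2, distributed_backend="ddp_spawn")
    trainer.fit(model, train_dataloader=loader)
    trainer.test(model)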


I was running into the same error as the OP in 0.8.5 when going with the first option in "Using multiple trainers vs. single trainer if max_epochs needs to change". After upgrading to 0.9.0rc16 the issue seems to have gone away for now, but I want to confirm what's going on:

I'm not passing any distributed_backend to Trainer, so it ends up being None. Did "ddp fix for trainer.test() + add basic ddp tests" address this issue only for ddp mode, or also for the case where no distributed backend is set?

Only for ddp mode. But distributed_backend defaults to ddp_spawn if you run multi-GPU (single node), so that case should not be affected by this issue. Does that answer your question?

I am running single node / single gpu (passing gpus=1 to Trainer), would the default backend for that be affected?

No, in this case the backend is nothing special. For running on a single GPU, there is no extra work to do beyond putting the tensors on that device. That's why you see distributed_backend=None: there is nothing distributed about running on one GPU.
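
To illustrate that last point, a rough sketch of what single-GPU execution amounts to (simplified, not Lightning's actual internals; model and batch are placeholders):

import torch
from pytorch_lightning import Trainer

# With gpus=1 there is no distributed backend involved at all.
trainer = Trainer(gpus=1)  # distributed_backend stays None

# Conceptually this is just ordinary device placement in plain PyTorch:
device = torch.device("cuda:0")
model = model.to(device)   # placeholder nn.Module / LightningModule
batch = batch.to(device)   # each batch is moved to the GPU before the forward pass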

