Pytorch-lightning: No error message when distributed_backend = "invalid choice", Trainer runs on CPU

Created on 5 Sep 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug

I'm trying to implement and run a new BERT-based model. As always, I used the gpus option, but strangely my model is still running on the CPU. I know this because (1) training is too slow, (2) print(self.device) prints "cpu", and (3) the logs right below. I've never encountered this before, so I'm confused. I'm using pytorch-lightning==0.9.0.

GPU available: True, used: True
[2020-09-05 08:54:00,565][lightning][INFO] - GPU available: True, used: True
TPU available: False, using: 0 TPU cores
[2020-09-05 08:54:00,565][lightning][INFO] - TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
[2020-09-05 08:54:00,566][lightning][INFO] - CUDA_VISIBLE_DEVICES: [0]

[Screenshot: GPU memory is allocated, but GPU utilization stays at zero]

I'm also attaching a strange warning message that I'm seeing:

...\pytorch_lightning\utilities\distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given

The code for the model I initialize inside my LightningModule is here (ColBERT). Below is how I initialize my LightningModule; ColBERT.from_pretrained() instantiates the model from that link. I print(self.device) at the end of __init__ and it prints "cpu".

class ColBERTLightning(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        # BERT-based sub-module initialized here
        model_params = hparams.model
        self.model = ColBERT.from_pretrained(
            model_params.base,
            query_maxlen=model_params.query_maxlen,
            doc_maxlen=model_params.doc_maxlen,
            dim=model_params.projection_dim,
            similarity_metric=model_params.similarity_metric,
        )
        self.labels = torch.zeros(
            hparams.train.batch_size, dtype=torch.long, device=self.device
        )
        print(self.device) # it prints "cpu" even when I use gpus=1

This is the code for the Trainer. I'm using Hydra and a DataModule, and I load the data with pandas inside the DataModule.

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    hparams = cfg
    # if hparams.train.gpus is not None:
    #     hparams.train.gpus = str(hparams.train.gpus)

    # init model
    model = ColBERTLightning(hparams)

    # init data module
    data_dir = hparams.dataset.dir
    batch_size = hparams.train.batch_size
    dm = TripleTextDataModule(data_dir, batch_size=batch_size)
    # dm.setup("fit")

    # logger
    source_files_path = str(Path(hydra.utils.get_original_cwd()) / "**/*.py")
    ## TODO: Neptune or wandb?

    # # trainer
    trainer = Trainer(
        accumulate_grad_batches=hparams.train.accumulate_grad_batches,
        distributed_backend=hparams.train.distributed_backend,
        fast_dev_run=hparams.train.fast_dev_run,
        gpus=hparams.train.gpus,
        auto_select_gpus=True,
        gradient_clip_val=hparams.train.gradient_clip_val,
        max_steps=hparams.train.max_steps,
        benchmark=True,
        profiler=hparams.train.use_profiler,
        # profiler=AdvancedProfiler(),
        # sync_batchnorm=True,
        # log_gpu_memory="min_max",
    )

    # # fit
    trainer.fit(model, dm)

Environment

The two environments I've tested:

* CUDA:
        - GPU:
                - GeForce GTX 1080 Ti
                - GeForce GTX 1080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.5.1
        - pytorch-lightning: 0.9.0
        - tensorboard:       2.2.0
        - tqdm:              4.48.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                -
        - processor:         x86_64
        - python:            3.7.6
        - version:           #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
* CUDA:
        - GPU:
                - GeForce GTX 1070 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.5
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 0.9.0
        - tensorboard:       2.2.0
        - tqdm:              4.46.1
* System:
        - OS:                Windows
        - architecture:
                - 64bit
                - WindowsPE
        - processor:         AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
        - python:            3.7.7
        - version:           10.0.19041
Labels: bug / fix, help wanted


All 9 comments

  1. You can ignore that warning about the graph; in 0.9.1 it is gone.
  2. If you print self.device in __init__, then of course it will show cpu, because at the time the model is constructed it has not yet been moved to the GPU. The print statement should go into your training_step, for example (see the sketch after this list).
  3. Would it be possible for you to create a Google Colab that reproduces this and share it here? I don't really know how to approach this without seeing the rest of the code. Something is definitely running on your GPU, since it shows a process and allocated memory. Maybe you have a bottleneck and utilization is just spiking to 100 and then dropping to 0 again. Run nvidia-smi with watch -n 0.1 nvidia-smi to observe the continuous output.
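A minimal sketch of point 2, assuming a throwaway module (the layer, loss, and optimizer below are placeholders, not the reporter's model): self.device only reflects the GPU once Lightning has moved the model, i.e. inside hooks such as training_step.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class DeviceCheckModule(pl.LightningModule):
    # Hypothetical minimal module, only to show where self.device is meaningful.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)
        print(self.device)  # "cpu": the module has not been moved to the GPU yet

    def training_step(self, batch, batch_idx):
        print(self.device)  # "cuda:0" once Trainer(gpus=1) has moved the module
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)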

Thanks. I'll be back soon with a Colab.

https://colab.research.google.com/drive/1cyXGqnorxnGwNnM_-SF0tIRLU5QQofoZ#scrollTo=ykojvOS6rfH-

Here. Everything will be downloaded as you run the code.

Or you may try locally.

Download Code

git clone https://github.com/kyoungrok0517/sparse-neural-ranker 

cd sparse-neural-ranker 

pip install torch torchvision && pip install -e . && pip install -r requirements.txt

Download Data

mkdir -p data && cd data
wget https://storage.googleapis.com/kyoungrok-public/msmarco-passage-triple-text-sm/test.parquet
cp test.parquet train.parquet && cp test.parquet val.parquet

Run

python trainer.py dataset.dir="<DATA_DIR>" train.gpus=1

In your config, you have distributed_backend = None. This is fine, since None means PL will select the appropriate backend for you, depending on whether you have GPUs or not. However!!!

print(type(hparams.train.distributed_backend))
# output: <class 'str'>

instead of the proper built-in None type.
Trainer does not throw an error when an invalid choice is selected for the backend. We should change that.
@PyTorchLightning/core-contributors

To solve your issue: convert to the real type at runtime, or tell Hydra the type somehow (I'm not sure whether that's possible).
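A minimal sketch of the runtime conversion, assuming the config delivers the literal string "None" (the helper name is made up); alternatively, writing distributed_backend: null in the YAML should give OmegaConf a real None.

def none_if_str_none(value):
    # Hypothetical helper: map the string "None" coming from the config to a real None.
    return None if value == "None" else value

trainer = Trainer(
    distributed_backend=none_if_str_none(hparams.train.distributed_backend),
    gpus=hparams.train.gpus,
)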
When you do, you will get an error saying:

torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

So somewhere in your code you have an input on the wrong device; move it to, or create it on, the right device.
One other note:
I saw that in your __init__ you have

self.labels = torch.zeros(hparams.train.batch_size, dtype=torch.long)
# but it should be
self.register_buffer("labels", torch.zeros(hparams.train.batch_size, dtype=torch.long))
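For context, a small sketch (in plain PyTorch, independent of the reporter's code) of why register_buffer helps: buffers are part of the module's state, so .to(device)/.cuda() moves them along with the parameters, whereas a plain tensor attribute created in __init__ stays wherever it was created.

import torch

class WithBuffer(torch.nn.Module):
    def __init__(self, batch_size):
        super().__init__()
        # Registered buffer: moved by .to()/.cuda() and saved in the state_dict.
        self.register_buffer("labels", torch.zeros(batch_size, dtype=torch.long))

m = WithBuffer(8)
if torch.cuda.is_available():
    m = m.cuda()
    print(m.labels.device)  # cuda:0 -- the buffer followed the module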

I will leave the rest of the debugging to you, since this is now about your tensors needing to be created on the correct device. If you need additional help, let me know; in the meantime I will make sure we print an error message for a wrong backend selection.

Great! Thanks for the support!

Ah, one more thing. Is the distributed_backend config important even if I don't use multi-GPU? I'm getting the impression that something is wrong with how the latest pytorch-lightning handles distributed training: I'm getting warnings/errors related to distributed training even when I don't use it. Things get much worse on Windows.

You don't have to answer this; I'm just reporting the issue. I hope the next version deals with it more smoothly. Cheers~

@kyoungrok0517
I removed the ReduceOp warnings in #3555 on Windows, since distributed training is not available on this platform anyway.
You won't see these anymore :)

Great! Thanks for your effort 😄

