I'm trying to implement and run a new BERT-based model. As always, I used the `gpus` option, but strangely my model is still running on the CPU. I know this because: 1. training is too slow, 2. `print(self.device)` -> "cpu", 3. the logs (right below). I've never encountered this before, so I'm confused. I'm using pytorch-lightning==0.9.0.
GPU available: True, used: True
[2020-09-05 08:54:00,565][lightning][INFO] - GPU available: True, used: True
TPU available: False, using: 0 TPU cores
[2020-09-05 08:54:00,565][lightning][INFO] - TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
[2020-09-05 08:54:00,566][lightning][INFO] - CUDA_VISIBLE_DEVICES: [0]
[GPU memory is allocated, but GPU utilization stays at zero]

I'm also attaching a strange warning message I'm seeing:
...\pytorch_lightning\utilities\distributed.py:37: UserWarning: Could not log
computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
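As far as I understand, this warning only means the logger cannot trace the computational graph without a sample input. A minimal sketch of setting the attribute in `__init__` (the shape and dtype below are placeholders, not what ColBERT actually takes):

```python
# Placeholder sketch: giving the module an example input lets Lightning log
# the computational graph. Shape/dtype here are illustrative only.
self.example_input_array = torch.zeros(1, 32, dtype=torch.long)
```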
The code for the model I initialize inside my LightningModule is here (ColBERT). Below is how I initialize my LightningModule; ColBERT.from_pretrained() initializes the model from that link. I print(self.device) at the end of __init__ and it shows "cpu".
```python
class ColBERTLightning(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams

        # BERT-based sub-module initialized here
        model_params = hparams.model
        self.model = ColBERT.from_pretrained(
            model_params.base,
            query_maxlen=model_params.query_maxlen,
            doc_maxlen=model_params.doc_maxlen,
            dim=model_params.projection_dim,
            similarity_metric=model_params.similarity_metric,
        )

        self.labels = torch.zeros(
            hparams.train.batch_size, dtype=torch.long, device=self.device
        )

        print(self.device)  # it prints "cpu" even when I use gpus=1
```
This is the code for the trainer. I'm using hydra and a DataModule, and pandas inside the DataModule to load the data.
```python
@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))
    hparams = cfg
    # if hparams.train.gpus is not None:
    #     hparams.train.gpus = str(hparams.train.gpus)

    # init model
    model = ColBERTLightning(hparams)

    # init data module
    data_dir = hparams.dataset.dir
    batch_size = hparams.train.batch_size
    dm = TripleTextDataModule(data_dir, batch_size=batch_size)
    # dm.setup("fit")

    # logger
    source_files_path = str(Path(hydra.utils.get_original_cwd()) / "**/*.py")
    # TODO: Neptune or wandb?

    # trainer
    trainer = Trainer(
        accumulate_grad_batches=hparams.train.accumulate_grad_batches,
        distributed_backend=hparams.train.distributed_backend,
        fast_dev_run=hparams.train.fast_dev_run,
        gpus=hparams.train.gpus,
        auto_select_gpus=True,
        gradient_clip_val=hparams.train.gradient_clip_val,
        max_steps=hparams.train.max_steps,
        benchmark=True,
        profiler=hparams.train.use_profiler,
        # profiler=AdvancedProfiler(),
        # sync_batchnorm=True,
        # log_gpu_memory="min_max",
    )

    # fit
    trainer.fit(model, dm)
```
The two environments I've tested:
* CUDA:
- GPU:
- GeForce GTX 1080 Ti
- GeForce GTX 1080 Ti
- available: True
- version: 10.2
* Packages:
- numpy: 1.18.1
- pyTorch_debug: False
- pyTorch_version: 1.5.1
- pytorch-lightning: 0.9.0
- tensorboard: 2.2.0
- tqdm: 4.48.2
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.6
- version: #46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020
* CUDA:
- GPU:
- GeForce GTX 1070 Ti
- available: True
- version: 10.2
* Packages:
- numpy: 1.18.5
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 0.9.0
- tensorboard: 2.2.0
- tqdm: 4.46.1
* System:
- OS: Windows
- architecture:
- 64bit
- WindowsPE
- processor: AMD64 Family 23 Model 8 Stepping 2, AuthenticAMD
- python: 3.7.7
- version: 10.0.19041
(I use `watch -n 0.1 nvidia-smi` to observe the continuous output.) Thanks. I'll be back soon with a Colab notebook.
https://colab.research.google.com/drive/1cyXGqnorxnGwNnM_-SF0tIRLU5QQofoZ#scrollTo=ykojvOS6rfH-
Here it is. Everything will be downloaded as you run the code.
Or you can try it locally:
git clone https://github.com/kyoungrok0517/sparse-neural-ranker
cd sparse-neural-ranker
pip install torch torchvision && pip install -e . && pip install -r requirements.txt
mkdir -p data && cd data
wget https://storage.googleapis.com/kyoungrok-public/msmarco-passage-triple-text-sm/test.parquet
cp test.parquet train.parquet && cp test.parquet val.parquet
python trainer.py dataset.dir="<DATA_DIR>" train.gpus=1
In your config, you have distributed_backend = None. This is fine, since None means PL will select the appropriate backend for you, depending on whether you have GPUs or not. However!!!
print(type(hparams.train.distributed_backend))
# output: <class 'str'>
instead of the proper built-in None type.
Trainer does not throw an error when we select an invalid choice for backend. We should change that.
@PyTorchLightning/core-contributors
To solve your issue: convert it to the real type at runtime, or tell hydra the type somehow (not sure if that's possible).
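Something along these lines (just a sketch against the names in your trainer code, not tested) does the runtime conversion:

```python
# Sketch: coerce the string "None" that hydra hands over back into a real
# None before the Trainer sees it.
backend = hparams.train.distributed_backend
if isinstance(backend, str) and backend.lower() in ("none", "null", ""):
    backend = None

trainer = Trainer(
    distributed_backend=backend,
    gpus=hparams.train.gpus,
    # ... your other Trainer arguments unchanged ...
)
```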
When you do, you will get an error saying:
torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
So somewhere in your code you have an input on the wrong device; move it or create it on the right device.
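For example (a hypothetical training_step, not your actual code; the batch names and the self.model(...) call are placeholders for whatever your forward really takes), tensors created inside a step should be placed on self.device or derived from the batch that Lightning has already moved:

```python
# Sketch: the batch is already on the right device when gpus=1 is active,
# so new tensors should be created there too (device=self.device or type_as).
def training_step(self, batch, batch_idx):
    query_ids, doc_ids = batch
    labels = torch.zeros(query_ids.size(0), dtype=torch.long, device=self.device)
    scores = self.model(query_ids, doc_ids)
    return torch.nn.functional.cross_entropy(scores, labels)
```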
One other note:
I saw that in your __init__ you have
self.labels = torch.zeros(hparams.train.batch_size, dtype=torch.long)
# but it should be
self.register_buffer("labels", torch.zeros(hparams.train.batch_size, dtype=torch.long))
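A registered buffer is part of the module's state, so it moves to the GPU together with the rest of the model; a plain attribute created in __init__ stays on the CPU. Roughly (a sketch, not tested):

```python
# Sketch of the difference: a registered buffer follows the module across
# devices, a plain attribute does not.
module = ColBERTLightning(hparams)
module.cuda()                    # what Lightning effectively does for gpus=1
print(module.labels.device)      # cuda:0 with register_buffer, cpu without it
```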
I will leave the rest of the debugging to you, since this is now about your tensors needing to be created on the correct device. If you need additional help, let me know; in the meantime I will make sure we print an error message for a wrong backend selection.
Great! Thanks for the support!
Ah, one more thing: is the distributed_backend config important even if I don't use multi-GPU? I'm getting the impression that something is wrong with how the latest pytorch-lightning handles distributed training. I see warnings/errors related to distributed training even when I don't use it, and things get much worse on Windows.
You don't have to answer this; I'm just reporting the issue. I hope the next version handles it more smoothly. Cheers~
@kyoungrok0517
I removed the reduceop warnings on Windows in #3555, since distributed training is not available on that platform anyway.
You won't see these anymore :)
Great! Thanks for your effort 😃