Pytorch-lightning: Very slow training on colab with TPU

Created on 11 Jun 2020  ·  8 Comments  ·  Source: PyTorchLightning/pytorch-lightning

Most helpful comment

@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... it should be fixed now by #2632. Mind trying the actual master? Also, mind sending a PR with some parity speed testing?

All 8 comments

Hi! Thanks for your contribution, great first issue!

@ipyffor mind sharing your PL version and a sample notebook?

[image not displayed]
After running, the progress has not changed.

Sorry, it seems my picture could not be uploaded.

@ipyffor I can't access the Colab file anymore. Are you still facing the issue?

@ipyffor @lezwon Not just on TPU. Even on GPU, it makes the entire browser unresponsive. It doesn't look like it is code-specific.

@Borda The pytorch-lightning version: 0.8.5

I'm running into this issue only when I run the code inline. If I instead put the code in a separate file, say train.py, and just run !python train.py, the problem is non-existent.

!pip install pytorch-lightning
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
from pytorch_lightning.core.lightning import LightningModule
import os, sys

class LitModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
        return loader

    def training_epoch_end(self, outputs):
        # aggregate the per-step 'loss' values returned by training_step
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'train_loss': avg_loss}
        return {'avg_train_loss': avg_loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def val_dataloader(self):
        dataset = MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        return loader

model1 = LitModel()

checkpoint_callback = ModelCheckpoint(filepath='model1/{epoch}', save_last=True, save_top_k=-1)

trainer = Trainer(max_epochs=100, gpus=1, fast_dev_run=False, checkpoint_callback=checkpoint_callback)
trainer.fit(model1)
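For reference, a minimal sketch of the separate-file workaround described above, using Colab's %%writefile cell magic. The magic must be the first line of its own cell, and the !python call goes in a second cell; the file name train.py is simply the one used in the comment, not something from the original code.

%%writefile train.py
# paste the LitModel definition, ModelCheckpoint callback, and Trainer setup from above into this cell

!python train.py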

I am facing the same issue. Even if I run the code on an 8-core TPU, one iteration takes 35s, the same as running on a 1-core TPU.

@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... it should be fixed now by #2632. Mind trying the actual master? Also, mind sending a PR with some parity speed testing?
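For anyone picking up that parity speed test, a rough sketch of the kind of comparison meant here, reusing the LitModel above. The tpu_cores Trainer argument and the one-epoch timing loop are assumptions about how such a check could be written (older PL releases used num_tpu_cores), not a measured result or the actual PR.

import time
from pytorch_lightning import Trainer

for cores in (1, 8):
    # run one epoch on 1 core, then on all 8 cores
    trainer = Trainer(max_epochs=1, tpu_cores=cores)
    start = time.time()
    trainer.fit(LitModel())
    # after the multi-core fix, the 8-core run should be noticeably faster per epoch
    print(f"{cores} TPU core(s): {time.time() - start:.1f} s for one epoch")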
