Pytorch-lightning: Very slow training on colab with TPU

Created on 11 Jun 2020  ·  8 Comments  ·  Source: PyTorchLightning/pytorch-lightning

Most helpful comment

@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... it should be fixed now by #2632. Mind trying the actual master? Also, mind sending a PR with some parity speed testing?

All 8 comments

Hi! Thanks for your contribution, great first issue!

@ipyffor mind sharing your PL version and a sample notebook?

[image not displayed]
After running, the progress has not changed.

Sorry, it seems my picture could not be uploaded.

@ipyffor I can't access the Colab file anymore. Are you still facing the issue?

@ipyffor @lezwon Not just on TPU. Even on GPU, it makes the entire browser unresponsive. It doesn't look like it is code-specific.

@Borda The pytorch-lightning version: 0.8.5

I'm running into this issue only when I run the code inline. If I instead put the code in a separate file, say train.py, and just run !python train.py, the problem is non-existent.

!pip install pytorch-lightning
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
from pytorch_lightning.core.lightning import LightningModule
import os, sys

class LitModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
        return loader

    def training_epoch_end(self, outputs):
        # aggregate the per-step 'loss' values returned by training_step
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'train_loss': avg_loss}
        return {'avg_train_loss': avg_loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def val_dataloader(self):
        dataset = MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        return loader

model1 = LitModel()

checkpoint_callback = ModelCheckpoint(filepath='model1/{epoch}', save_last=True, save_top_k=-1)

trainer = Trainer(max_epochs=100, gpus=1, fast_dev_run=False, checkpoint_callback=checkpoint_callback)
trainer.fit(model1)
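For reference, a minimal sketch of the separate-file workaround described above, using Colab's %%writefile cell magic. The magic must be the first line of its own cell, and the !python call goes in a second cell; the file name train.py is simply the one used in the comment, not something from the original code.

%%writefile train.py
# paste the LitModel definition, ModelCheckpoint callback, and Trainer setup from above into this cell

!python train.py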

I am facing the same issue. Even if I run the code on an 8-core TPU, one iteration takes 35s, the same as running on a 1-core TPU.

@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... it should be fixed now by #2632. Mind trying the actual master? Also, mind sending a PR with some parity speed testing?
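For anyone picking up that parity speed test, a rough sketch of the kind of comparison meant here, reusing the LitModel above. The tpu_cores Trainer argument and the one-epoch timing loop are assumptions about how such a check could be written (older PL releases used num_tpu_cores), not a measured result or the actual PR.

import time
from pytorch_lightning import Trainer

for cores in (1, 8):
    # run one epoch on 1 core, then on all 8 cores
    trainer = Trainer(max_epochs=1, tpu_cores=cores)
    start = time.time()
    trainer.fit(LitModel())
    # after the multi-core fix, the 8-core run should be noticeably faster per epoch
    print(f"{cores} TPU core(s): {time.time() - start:.1f} s for one epoch")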
