Hi! Thanks for your contribution, great first issue!
@ipyffor mind sharing your PL version and a sample notebook?

After running, the progress has not changed.
Sorry, it seemed that my picture could not be uploaded.
@ipyffor I can't access the Colab file anymore. Are you still facing the issue?
@ipyffor @lezwon Not just on TPU. Even on GPU, it makes the entire browser unresponsive. It doesn't look like it is code-specific.
@Borda The pytorch-lightning version: 0.8.5
I am running into this issue only when I run the code inline. If I instead put the code in a separate file, say train.py, and run it with !python train.py, the problem is non-existent (a sketch of that workaround follows the code below).
```python
!pip install pytorch-lightning

import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.core.lightning import LightningModule


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
        return loader

    def training_epoch_end(self, outputs):
        # the hook is `training_epoch_end` (not `train_epoch_end`), and
        # `training_step` returns the loss under 'loss', not 'test_loss'
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        tensorboard_logs = {'train_loss': avg_loss}
        return {'avg_train_loss': avg_loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def val_dataloader(self):
        dataset = MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor())
        loader = DataLoader(dataset, batch_size=32, num_workers=4)
        return loader


model1 = LitModel()
checkpoint_callback = ModelCheckpoint(filepath='model1/{epoch}', save_last=True, save_top_k=-1)
trainer = Trainer(max_epochs=100, gpus=1, fast_dev_run=False, checkpoint_callback=checkpoint_callback)
trainer.fit(model1)  # was `trainer.fit(model)`, which would raise a NameError
```
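For reference, here is a minimal sketch of the separate-file workaround mentioned above, assuming a Colab runtime with one GPU. The model is trimmed to the training pieces only, and the file name train.py is just an example:

```python
%%writefile train.py
# Workaround sketch: run training as a separate process instead of inline.
import os
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
from pytorch_lightning import Trainer
from pytorch_lightning.core.lightning import LightningModule

class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': F.cross_entropy(self(x), y)}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)

    def train_dataloader(self):
        dataset = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
        return DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)

if __name__ == '__main__':
    Trainer(max_epochs=1, gpus=1).fit(LitModel())
```

Then, in the next cell:

```python
!python train.py
```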
I am facing the same issue. Even if I run the code on an 8-core TPU, one iteration takes 35s, the same as on a 1-core TPU.
@iliemihai @rahulvigneswaran we had a bug there, so multi-core was in fact not running... it shall be fixed now by #2632. Mind trying the actual master? Also, mind sending a PR with some parity speed testing?
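For anyone who wants to try it, a rough sketch of installing master and comparing 1-core vs 8-core epoch times, assuming a Colab TPU runtime with torch_xla set up and that the Trainer's tpu_cores argument is available; the loop below is illustrative, not a rigorous benchmark:

```python
!pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master

import time
from pytorch_lightning import Trainer

# Time one epoch on 1 TPU core vs all 8 cores, reusing LitModel from above.
for cores in (1, 8):
    trainer = Trainer(max_epochs=1, tpu_cores=cores)
    start = time.time()
    trainer.fit(LitModel())
    print(f'{cores} TPU core(s): {time.time() - start:.1f}s for one epoch')
```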