Pytorch-lightning: why won't auto resubmit work?

Created on 4 Sep 2020 · 13 Comments · Source: PyTorchLightning/pytorch-lightning

@awaelchli (just tagged based on the previous issue here)

🐛 Bug

auto resubmit doesn't seem to work

To Reproduce

Steps to reproduce the behavior:

```shell
sbatch sample_05_gpu.sh
```

Code sample

Please take a look at my code and job script

Expected behavior

Should checkpoint and restart from the last checkpoint every 2 minutes.
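
(For reference, restarting from a checkpoint manually looks roughly like the sketch below; with SLURM auto-resubmit, Lightning should do the equivalent on its own. `last.ckpt` is a hypothetical path.)

```python
import pytorch_lightning as pl

# manual equivalent of what auto-resubmit should do on requeue;
# 'last.ckpt' is a hypothetical checkpoint path
trainer = pl.Trainer(
    max_epochs=1000,
    gpus=1,
    resume_from_checkpoint="last.ckpt",
)
```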

Environment

torch = 1.3.1
pytorch_lightning = 0.9.0

Labels: help wanted, information needed, question


All 13 comments

After the job finishes, does it completely terminate, or is there any process still running in the background? And did you observe checkpoints being saved?

Yes, the checkpoint is saved. Also, the job terminates completely and the slurm log file says, "Job cancelled because time value exceeded"

@vr25 mind adding the important snippets of your code here, so it also helps others in the future?

@Borda Both code and slurm script are linked in the reported bug description. Please let me know if you cannot access it. Thanks!


```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from torch.utils.data import TensorDataset, Dataset, DataLoader


l_rate = 0.2
mse_loss = nn.MSELoss(reduction = 'mean')

df = pd.read_csv('bike_sharing_hourly.csv')
print(df.head(5))

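# One-hot encode each categorical field and drop the original columns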
onehot_fields = ['season', 'mnth', 'hr', 'weekday', 'weathersit']
for field in onehot_fields:
    dummies = pd.get_dummies(df[field], prefix=field, drop_first=False)
    df = pd.concat([df, dummies], axis=1)
df = df.drop(onehot_fields, axis = 1)
print(df.head(5))

continuous_fields = ['casual', 'registered', 'cnt', 'temp', 'hum', 'windspeed']
# Store scalings in a dictionary so we can convert back later
scaled_features = {}
for field in continuous_fields:
    mean, std = df[field].mean(), df[field].std()
    scaled_features[field] = [mean, std]
    df.loc[:, field] = (df[field] - mean)/std
print(scaled_features)  # the per-field (mean, std) used for scaling

df_backup = df.copy()

fields_to_drop = ['instant', 'dteday', 'atemp', 'workingday']
df.drop(fields_to_drop, axis=1, inplace = True)

# Split of 60 days of data from the end of the df for validation
validation_data = df[-60*24:]
df = df[:-60*24]

# Split of 21 days of data from the end of the df for testing
test_data = df[-21*24:]
df = df[:-21*24]

# The remaining (earlier) data will be used for training
train_data = df

# What have we ended up with?
print(f'''Validation data length: {len(validation_data)}
Test data length: {len(test_data)}
Train data length: {len(train_data)}''')

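# Separate the target columns from the input features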
target_fields = ['cnt', 'casual', 'registered']

train_features, train_targets = train_data.drop(target_fields, axis=1), train_data[target_fields]
test_features, test_targets = test_data.drop(target_fields, axis=1), test_data[target_fields]
validation_features, validation_targets = validation_data.drop(target_fields, axis=1), validation_data[target_fields]


class Regression(pl.LightningModule):

### The Model ### 

    # Question: what will your model architecture look like?
    # Initialize the layers
    # Here we have one input layer (size 56 as we have 56 features), one hidden layer (size 10), 
    # and one output layer (size 1 as we are predicting a single value)
    def __init__(self):
        super(Regression, self).__init__()
        self.fc1 = nn.Linear(56, 10)
        self.fc2 = nn.Linear(10, 1)

    # Question: how should the forward pass be performed, and what will its outputs be?
    # Perform the forward pass
    # We're using the sigmoid activation function on our hidden layer, but our output layer has no activation 
    # function as we're predicting a continuous variable so we want the actual number predicted
    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = self.fc2(x)
        return x

### The Data Loaders ###     

    # Question: How do you want to load your data into the model?
    # Define functions for data loading: train / validate / test
    def train_dataloader(self):
        train_dataset = TensorDataset(torch.tensor(train_features.values).float(), torch.tensor(train_targets[['cnt']].values).float())
        train_loader = DataLoader(dataset = train_dataset, batch_size = 128)
        return train_loader

    def val_dataloader(self):
        validation_dataset = TensorDataset(torch.tensor(validation_features.values).float(), torch.tensor(validation_targets[['cnt']].values).float())
        validation_loader = DataLoader(dataset = validation_dataset, batch_size = 128)
        return validation_loader

    def test_dataloader(self):
        test_dataset = TensorDataset(torch.tensor(test_features.values).float(), torch.tensor(test_targets[['cnt']].values).float())
        test_loader = DataLoader(dataset = test_dataset, batch_size = 128)
        return test_loader

### The Optimizer ### 

    # Question: what optimizer will I use?
    # Define optimizer function: here we are using Stochastic Gradient Descent
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=l_rate)

### Training ### 

    # Question: what should a training step look like?
    # Define training step
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = mse_loss(logits, y)
        # Add logging
        logs = {'loss': loss}
        return {'loss': loss, 'log': logs}

### Validation ### 

    # Question: what should a validation step look like?
    # Define validation step
    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = mse_loss(logits, y)
        return {'val_loss': loss}

    # Define validation epoch end
    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

### Testing ###     

    # Question: what should a test step look like?
    # Define test step
    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = mse_loss(logits, y)
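        # note: exact float equality will almost never hold for continuous
        # predictions, so this count will typically be 0 for regression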
        correct = torch.sum(logits == y.data)

        # I want to visualize my predictions vs my actuals so here I'm going to 
        # add these lines to extract the data for plotting later on
        predictions_pred.append(logits)
        predictions_actual.append(y.data)
        return {'test_loss': loss, 'test_correct': correct, 'logits': logits}

    # Define test end
    def test_epoch_end(self, outputs):
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        logs = {'test_loss': avg_loss}      
        return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs }


predictions_pred = []
predictions_actual = []

model = Regression()
trainer = pl.Trainer(max_epochs=1000, gpus=1) #4, num_nodes=4, distributed_backend='ddp',)
trainer.fit(model)
```

@vr25 I know this is not related to the issue reported, but looking at your script here I noticed that your main script is not guarded by `if __name__ == "__main__"`.
Also, this line at the end is commented out on purpose? I'm pretty sure you need these settings for SLURM. A minimal sketch with both fixes is shown below.
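
A minimal sketch of the guarded entry point with the commented-out DDP settings restored (the `gpus=4, num_nodes=4` values are just the ones from the comment in the script, not a recommendation):

```python
import pytorch_lightning as pl

if __name__ == "__main__":
    model = Regression()
    trainer = pl.Trainer(
        max_epochs=1000,
        gpus=4,                     # GPUs per node, from the commented-out args
        num_nodes=4,                # must match the sbatch allocation
        distributed_backend='ddp',  # multi-node SLURM runs need DDP
    )
    trainer.fit(model)
```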

@awaelchli Yes, you're right, I just got the code from here.

I am sorry, but which last line is commented? In the code or the slurm script?

Here

```python
trainer = pl.Trainer(max_epochs=1000, gpus=1) #4, num_nodes=4, distributed_backend='ddp',)
```

There is a comment there; the ddp settings should be included, I'm pretty sure.

Sorry, you're right, I was just trying w/ and w/o ddp and so I didn't uncomment before uploading the code. Anyway, neither worked.
Also, another issue on total time taken: below is a screenshot of one (left) vs. two (right) GPUs on one node, and the single GPU seems faster in this case.
[image: epoch timing comparison, one GPU (left) vs. two GPUs (right)]

> Yes, the checkpoint is saved. Also, the job terminates completely and the slurm log file says, "Job cancelled because time value exceeded"

@vr25 In your sbatch options, you have this:

```shell
#SBATCH --time=00:00:02
```
Seems to me like you're asking for 2 seconds rather than 2 minutes (the three-field format is HH:MM:SS), which would explain why it's not working. There's no time for the auto-resubmit code to kick in at that speed :stuck_out_tongue:

You should try 00:02:00 for the time instead.

Edit: typo
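
For context on why such a short limit matters: SLURM can send a warning signal shortly before the wall time expires, and the auto-resubmit mechanism has to catch it, save a checkpoint, and requeue the job before the hard kill. A minimal sketch of that general mechanism, not Lightning's actual internals (`save_checkpoint` is a hypothetical stand-in):

```python
import os
import signal
import subprocess

def save_checkpoint(path):
    # hypothetical stand-in for persisting the training state
    ...

def on_sigusr1(signum, frame):
    # SLURM sent the warning signal: save state, then requeue this job
    save_checkpoint("hpc_ckpt.ckpt")
    subprocess.call(["scontrol", "requeue", os.environ["SLURM_JOB_ID"]])

# submit with e.g. `#SBATCH --signal=SIGUSR1@90` so the signal arrives
# 90 seconds before the limit; a 2-second limit leaves no such window
signal.signal(signal.SIGUSR1, on_sigusr1)
```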

oh!! Right, thanks for pointing it out.

@nathanpainchaud you have good eyes!!
@vr25 let us know if it worked!

If the hint by @nathanpainchaud did not work, let us know and we can reopen.
