Dear all,
Thank you for this amazing library. However, there are some features that lead to avoidable boilerplate code.
Let's take an example:
```python
# imports added for completeness (the metrics are assumed to come from
# pytorch_lightning.metrics.functional)
import torch
import torch.nn.functional as F
from pytorch_lightning import LightningModule
from pytorch_lightning.metrics.functional import accuracy, auroc, f1_score


class TestModule(LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(x)

    def get_metrics(self, pred, y):
        return {
            'accuracy': accuracy(pred, y),
            'f1': f1_score(pred, y, num_classes=2),
            'roc_auc': auroc(pred, y)
        }

    def step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        pred = torch.argmax(y_hat, dim=1)
        metrics = self.get_metrics(pred, y)
        loss = F.cross_entropy(y_hat, y)
        return loss, metrics

    def training_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'loss': loss, 'metrics': metrics}

    def validation_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'val_loss': loss, 'metrics': metrics}

    def test_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'test_loss': loss, 'metrics': metrics}
```
`*_step` methods
Most of the time you want to use the same loss and return the same metrics in `validation_step` and `test_step` that you use in `training_step`.
We should have a `.step` method that is used as a fallback whenever one of the `*_step` methods is not implemented; this would remove most of the boilerplate code.
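As a sketch of the requested behavior (hypothetical dispatch, not Lightning's actual loop code, and `run_step` is a made-up name), the trainer would fall back to `step` whenever a `*_step` hook is not overridden:

```python
from pytorch_lightning import LightningModule


# Hypothetical fallback dispatch, not part of Lightning.
def run_step(module, hook_name, batch, batch_idx):
    hook = getattr(type(module), hook_name, None)
    base = getattr(LightningModule, hook_name, None)
    if hook is None or hook is base:
        # the user did not override this *_step: fall back to the shared step()
        return module.step(batch, batch_idx)
    return hook(module, batch, batch_idx)


# e.g. the validation loop would then call:
# run_step(model, 'validation_step', batch, batch_idx)
```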
I was also expecting PyTorch Lightning to compute the average of each metric over the epoch by itself, as most other libraries do. Is there a specific reason for not doing it?
Thank you,
Best Regards,
Francesco
Hi! Thanks for your contribution, great first issue!
See PR #2615, which tries to simplify this.
@awaelchli Thank you for your reply. IMHO the new syntax proposed in #2615 seems even weirder and does not seem to solve the issue.
Why do you have to reinvent the wheel? Most libraries have something like:
```python
tr = Trainer(..., batch_metrics=[Accuracy()], epoch_metrics=[Accuracy(), ROCAUC()])
...
```
This has three advantages:
1) it makes sense
2) no need for the docs to explain how to properly return the metrics, or how to return them so that they get added to the logger, etc.
3) no need to write more code than needed
Then one can also choose to manually return a metric from the *_step functions.
To save memory when computing the epoch metrics, you can give the metrics default names and store the batch values only if a corresponding epoch metric exists.
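For concreteness, here is a minimal sketch of the kind of aggregation I mean (`MetricAverager` is an illustrative helper, not an existing Lightning class): it folds batch values into running sums, so epoch averages come out without storing every batch value.

```python
from collections import defaultdict


class MetricAverager:
    """Running per-metric mean; batch values are folded in, not stored."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, batch_metrics: dict):
        for name, value in batch_metrics.items():
            self._sums[name] += float(value)
            self._counts[name] += 1

    def compute(self) -> dict:
        # epoch metric = mean over all batch values seen this epoch
        return {name: self._sums[name] / self._counts[name] for name in self._sums}
```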
What do you think?
Cheers,
Francesco
> Why do you have to reinvent the wheel? Most libraries have something like:
I disagree. PyTorch Lightning is not trying to be like other libraries. Which parts do you see as "reinvented"?
> it makes sense
Not to me, by a large margin. But your example code perfectly highlights the design concept behind LightningModule: everything that is specific to the training and evaluation of a system goes into the LightningModule. The Trainer can be thought of as the encapsulation of all the "boilerplate" (the PyTorch training loop, etc.) and does not contain any research-specific properties (like metrics or how losses are computed).
A Trainer object can also be used for several different LightningModules. So having these metrics as arguments to the Trainer is not a good thing in my view.
> no need for the docs to explain how to properly return the metrics, or how to return them so that they get added to the logger, etc.
It looks like the docs will become simpler with the structured results feature, but remain backward compatible.
> no need to write more code than needed
With structured results, you won't have to implement the `*_step_end` and `*_epoch_end` hooks unless you want to do crazy things, as I understand it.
Thank you @awaelchli for the fast reply!
> I disagree. PyTorch Lightning is not trying to be like other libraries. Which parts do you see as "reinvented"?
It reminds me of web development some years ago, when everybody claimed they were doing something different... but this is off topic.
Well, in my opinion, if I have to write the same code three times (training_step, validation_step, etc.) it means that something was not well designed and can definitely be improved! I hope you agree with at least one point :) To be honest, you have to rewrite the same code three times for the `*_step` methods and three more times for the `*_end` methods if you also want to log something at the end of each epoch.
> So having these metrics as arguments to the Trainer is not a good thing in my view.
You can pass them to the .fit method :)
Anyway, thank you again for the fast reply. I really appreciate it!
Cheers,
Francesco
> You can pass them to the .fit method :)
The same argument applies there; it's no different from the Trainer-args case. I want the metrics etc. in my LightningModule, not outside it. If I wanted to pass them in from the outside, I would make them a hyperparameter.
@PyTorchLightning/core-contributors any thoughts here?
@awaelchli Thank you for your reply. At this point I don't see any value in continuing the discussion. I will just copy and paste the code over and over again, hoping somebody implements a new library with better practices soon. Thank you.
Be safe!
Francesco
About the code repetition you mention, there is a way to get rid of it: you can define your default validation_step and test_step methods in a separate class and then inherit from it in your various LightningModules (see the sketch below). Btw, I like the idea of the general-purpose step method.
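For concreteness, a minimal sketch of that workaround (class names are illustrative):

```python
from pytorch_lightning import LightningModule


class SharedStepMixin:
    """Default *_step methods that all delegate to a single self.step()."""

    def step(self, batch, batch_idx):
        raise NotImplementedError

    def training_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'loss': loss, 'metrics': metrics}

    def validation_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'val_loss': loss, 'metrics': metrics}

    def test_step(self, batch, batch_idx):
        loss, metrics = self.step(batch, batch_idx)
        return {'test_loss': loss, 'metrics': metrics}


class MyClassifier(SharedStepMixin, LightningModule):
    # only `step` (plus forward / configure_optimizers) is written per model
    def step(self, batch, batch_idx):
        ...
```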
in the latest version on master you don’t need step_end or epoch_end.
to use the same step you can just call validation_step from training_step.
docs will be updated in a few days.
Before
(to be fair, only training_step was required)

```python
def training_step(...):
    return loss

def training_step_end(...):
    ...

def training_epoch_end(...):
    ...
```

Now:

```python
def training_step(...):
    result = TrainResult(minimize=loss)
    return result
```

To use the same loop for val, test and train:

```python
def step(...):
    return something

def training_step(...):
    something = self.step(...)
    result = TrainResult(minimize=something)
    return result

def validation_step(...):
    something = self.step(...)
    result = EvalResult(checkpoint_on=something)
    result.log('val_loss', something)
    return result

def test_step(...):
    result = self.validation_step(...)
    result.log('test_loss', something)
    return result
```
Maybe for your application you use the same loop for all three... however, that's very unusual, and usually what happens in the train step is different from the other loops. Also, the metrics need to be named according to the loop you are in... For instance, in the train loop you log or track things on every batch; in the val/test loops you only care about full-epoch statistics.
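To make that concrete, assuming the `TrainResult`/`EvalResult` API from the snippets above, and assuming their `log()` accepts `on_step`/`on_epoch` flags (as in the 0.9-era structured results):

```python
def training_step(self, batch, batch_idx):
    loss = ...  # computed as usual
    result = TrainResult(minimize=loss)
    # train loop: log the value on every batch
    result.log('train_loss', loss, on_step=True, on_epoch=False)
    return result

def validation_step(self, batch, batch_idx):
    loss = ...  # computed as usual
    result = EvalResult(checkpoint_on=loss)
    # val loop: only the full-epoch aggregate matters
    result.log('val_loss', loss, on_step=False, on_epoch=True)
    return result
```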
So, I'm curious @FrancescoSaverioZuppichini. In what application are you able to use the same loop for everything?
Ok, docs have been updated!
https://pytorch-lightning.readthedocs.io/en/latest/new-project.html
https://pytorch-lightning.readthedocs.io/en/latest/introduction_guide.html#validating
@FrancescoSaverioZuppichini is that more in line with what you were hoping for? :)
Hi @williamFalcon,
I hope you are doing great. Thank you for the reply. I am doing image classification and I want to return the same metrics for all my dataset splits.
I have found it very hard to understand what I have to return from each `*_step`, and I write a lot of code. I am sorry to say, but the new APIs do not look great; to me they seem to have complicated the codebase even more. What I want is to just pass a metric and have it work. I think this is something people expect to be super easy, because it is!
Let me know if I can help or you are fine with the current implementation.
Thanks :)
Francesco
Could you give me an example of how it might work? If it meets the following requirements we could consider it:
I think your metrics-passed-to-the-Trainer suggestion is interesting, but it has a few pitfalls:
Also, what about when I want accuracy calculated in a weird way, or for a different kind of batch structure (i.e. metric learning or something)? Then my model is no longer self-contained.
thanks!
@williamFalcon sorry for the late reply, I was super busy and this topic went under the radar, my bad. I see your point but (with all due respect) I think you are wrong. Unfortunately, I don't have the time now to lay out my counter-arguments, and I suspect the discussion would go on forever. Very quickly: you can have default behavior and still allow users to implement specific methods, as you are doing now; this is classic OOP.
I am down for a quick voice chat if you would like to hear my suggestion in detail, or I will comment here in the not-so-near future. In the meantime, I will drop PyTorch Lightning since it doesn't speed up my productivity and, sorry to say, bugs are everywhere!
Thank you for the discussion ;)
Bests
I hear you!
Unfortunately, Lightning is a professional tool... so it's not meant to abstract away details in the way you're suggesting.
This is why it's used to create new state of the art methods and new research.
(example: https://github.com/mindslab-ai/cotatron).
You might be looking for something else that doesn't require as much PyTorch knowledge or is meant for simpler use cases.
This is unfortunately not the aim of lightning.
Hi @williamFalcon, thank you for the reply! I have a lot of programming experience and I have been using PyTorch for years. But if I have to write tons of code just to do very basic image classification, then probably something is not so well designed in your library (+ bugs). Don't get me wrong, there are lots of good features.
TensorFlow is also used to create SOTA methods, but it is a terrible tool. IMHO a library should be easy to use for easy tasks and hard only for hard tasks, not hard for both. I hope no offense is taken, but because you are the creator you are not so willing to listen :)
Thank you :)
Bests,
P.S. You can close the issue now