Pytorch-lightning: Early Stopping behavior

Created on 7 May 2020 · 10 Comments · Source: PyTorchLightning/pytorch-lightning

Hi there,
thanks for the great library (I am using 0.7.5). I am not following the bug report template because I'm not sure whether this is actually a bug or whether I simply don't understand how early stopping is implemented. My code looks as follows:

    early_stop_callback = EarlyStopping(
        monitor='val_acc',
        min_delta=0.0,
        patience=80,
        verbose=True,
        mode=self.mode
    )

    trainer = Trainer(
        early_stop_callback=early_stop_callback,
        auto_select_gpus=True,
        max_epochs=200,
        terminate_on_nan=True,
        show_progress_bar=True,
        fast_dev_run=False,
        gpus=1
    )

As I understand it, training should stop only after AT LEAST 80 epochs have passed without improvement in validation accuracy. However, in my case, early stopping kicked in at epoch 75. Is this how it should be?

As I said, I am not sure whether this is actually a bug or a design choice (perhaps early stopping is applied at the batch level?). If it is indeed a bug, I will put together a reproducible example. Thank you!
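For reference, here is the patience behavior I am expecting, written as a small standalone sketch (plain Python, not Lightning code; should_stop is just an illustrative helper, assuming one check per epoch):

    def should_stop(val_acc_history, patience=80, min_delta=0.0):
        """Return True if val_acc failed to improve for `patience` consecutive checks."""
        best = float('-inf')
        wait = 0
        for val_acc in val_acc_history:
            if val_acc > best + min_delta:
                best = val_acc
                wait = 0
            else:
                wait += 1
            if wait >= patience:
                return True
        return False

Under this counting, at least 80 consecutive checks without improvement have to accumulate before stopping, so a stop at epoch 75 should not be possible if the check runs once per epoch.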

bug / fix help wanted

All 10 comments

Hi! Thanks for your contribution, great first issue!

I would expect it to iterate for _at least 80 epochs_ too, so to me this looks like a bug or some kind of unexpected behavior. It would be great to figure it out!

OK then, I'll put together a notebook to see if I can reproduce it.

Thanks @mateuszpieniak
Here is a working example. As you can see, it stops at epoch 41 even though patience is set to 80.
https://github.com/marcopodda/pl-es-example/blob/master/ES%20example.ipynb

It is definitely a bug. I discovered that EarlyStopping.on_epoch_end is executed twice within one epoch, which means that setting patience=160 should work around your issue temporarily.

In the file training_loop.py:
First call:

            if self.fast_dev_run or should_check_val:
                self.run_evaluation(test_mode=self.testing)
                self.call_checkpoint_callback()
                self.call_early_stop_callback()

Second call:

                # TODO wrap this logic into the callback
                if self.enable_early_stop:
                    if (met_min_epochs and met_min_steps) or self.fast_dev_run:
                        should_stop = self.early_stop_callback.on_epoch_end(self, self.get_model())
                        # stop training
                        stop = should_stop and met_min_epochs
                        if stop:
                            self.run_training_teardown()
                            return
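In the meantime, the doubled-patience workaround would look like this with the configuration from the original report (this is just a stop-gap sketch and assumes both calls hit the same internal wait counter, so the effective patience is halved; mode='max' because the monitored metric is an accuracy):

    from pytorch_lightning.callbacks import EarlyStopping

    # Temporary workaround: the intended patience is 80, doubled to 160 to offset
    # the duplicate on_epoch_end call per epoch.
    early_stop_callback = EarlyStopping(
        monitor='val_acc',
        min_delta=0.0,
        patience=160,
        verbose=True,
        mode='max'
    )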

I upgraded to the bleeding edge version yesterday and can confirm that this started happening to me too. I didn't have an issue before I upgraded (I think I was on 0.7.3 before?)

Yep, we ran into this as well. It is called once in the trainer loop and once in the on_epoch_end callback hook.
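Until the fix lands, another possible user-side workaround is to subclass EarlyStopping so that repeated calls within the same epoch are ignored. This is only a sketch: it assumes trainer.current_epoch is the same for both calls, and DedupedEarlyStopping is a hypothetical name.

    from pytorch_lightning.callbacks import EarlyStopping

    class DedupedEarlyStopping(EarlyStopping):
        """Sketch of a workaround: honor only the first on_epoch_end call of each epoch."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._last_checked_epoch = None

        def on_epoch_end(self, trainer, pl_module):
            # Ignore the duplicate invocation issued later in the same epoch.
            if trainer.current_epoch == self._last_checked_epoch:
                return False
            self._last_checked_epoch = trainer.current_epoch
            return super().on_epoch_end(trainer, pl_module)

Doubling the patience as suggested above is simpler, but this variant keeps the configured patience value meaningful if the duplicate call is ever removed.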

@Anjum48 @ricpruss mind sending a fix PR?

@Borda Well, I would love to make my first PL PR, if that's okay? :wink:

@mateuszpieniak sure go ahead! :rocket:
