Pytorch-lightning: Early Stopping behavior

Created on 7 May 2020 · 10 Comments · Source: PyTorchLightning/pytorch-lightning

Hi there,
thanks for the great library (I am using 0.7.5). I am not following the bug report template because I'm not sure whether this is actually a bug or whether I simply don't understand how early stopping is implemented. My code looks as follows:

    early_stop_callback = EarlyStopping(
        monitor='val_acc',
        min_delta=0.0,
        patience=80,
        verbose=True,
        mode=self.mode
    )

    trainer = Trainer(
        early_stop_callback=early_stop_callback,
        auto_select_gpus=True,
        max_epochs=200,
        terminate_on_nan=True,
        show_progress_bar=True,
        fast_dev_run=False,
        gpus=1
    )

As I understand it, training should stop only after AT LEAST 80 epochs have passed without improvement in validation accuracy. However, in my case, early stopping kicked in at epoch 75. Is this how it should be?

As I said, I am not sure whether this is actually a bug or a design choice (perhaps early stopping is applied at the batch level?). If it is indeed a bug, I will put together a reproducible example. Thank you!
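For reference, here is the patience behavior I am expecting, written as a small standalone sketch (plain Python, not Lightning code; should_stop is just an illustrative helper, assuming one check per epoch):

    def should_stop(val_acc_history, patience=80, min_delta=0.0):
        """Return True if val_acc failed to improve for `patience` consecutive checks."""
        best = float('-inf')
        wait = 0
        for val_acc in val_acc_history:
            if val_acc > best + min_delta:
                best = val_acc
                wait = 0
            else:
                wait += 1
            if wait >= patience:
                return True
        return False

Under this counting, at least 80 consecutive checks without improvement have to accumulate before stopping, so a stop at epoch 75 should not be possible if the check runs once per epoch.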

bug / fix help wanted

All 10 comments

Hi! Thanks for your contribution, great first issue!

I would expect it to iterate for _at least 80 epochs_ too, so to me this looks like a bug or some kind of unexpected behavior. It would be great to figure it out!

OK then, I'll put together a notebook to see if I can reproduce it.

Thanks @mateuszpieniak
Here is a working example. As you can see, it stops at epoch 41 even though patience is set to 80.
https://github.com/marcopodda/pl-es-example/blob/master/ES%20example.ipynb

It is definitely a bug. I discovered that EarlyStopping.on_epoch_end is executed twice within one epoch, which means that setting patience=160 should work around your issue temporarily.

In the file training_loop.py:
First call:

            if self.fast_dev_run or should_check_val:
                self.run_evaluation(test_mode=self.testing)
                self.call_checkpoint_callback()
                self.call_early_stop_callback()

Second call:

                # TODO wrap this logic into the callback
                if self.enable_early_stop:
                    if (met_min_epochs and met_min_steps) or self.fast_dev_run:
                        should_stop = self.early_stop_callback.on_epoch_end(self, self.get_model())
                        # stop training
                        stop = should_stop and met_min_epochs
                        if stop:
                            self.run_training_teardown()
                            return
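In the meantime, the doubled-patience workaround would look like this with the configuration from the original report (this is just a stop-gap sketch and assumes both calls hit the same internal wait counter, so the effective patience is halved; mode='max' because the monitored metric is an accuracy):

    from pytorch_lightning.callbacks import EarlyStopping

    # Temporary workaround: the intended patience is 80, doubled to 160 to offset
    # the duplicate on_epoch_end call per epoch.
    early_stop_callback = EarlyStopping(
        monitor='val_acc',
        min_delta=0.0,
        patience=160,
        verbose=True,
        mode='max'
    )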

I upgraded to the bleeding edge version yesterday and can confirm that this started happening to me too. I didn't have an issue before I upgraded (I think I was on 0.7.3 before?)

Yep, we ran into this as well. It is called once in the trainer loop and once in the on_epoch_end callback hook.
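Until the fix lands, another possible user-side workaround is to subclass EarlyStopping so that repeated calls within the same epoch are ignored. This is only a sketch: it assumes trainer.current_epoch is the same for both calls, and DedupedEarlyStopping is a hypothetical name.

    from pytorch_lightning.callbacks import EarlyStopping

    class DedupedEarlyStopping(EarlyStopping):
        """Sketch of a workaround: honor only the first on_epoch_end call of each epoch."""

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self._last_checked_epoch = None

        def on_epoch_end(self, trainer, pl_module):
            # Ignore the duplicate invocation issued later in the same epoch.
            if trainer.current_epoch == self._last_checked_epoch:
                return False
            self._last_checked_epoch = trainer.current_epoch
            return super().on_epoch_end(trainer, pl_module)

Doubling the patience as suggested above is simpler, but this variant keeps the configured patience value meaningful if the duplicate call is ever removed.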

@Anjum48 @ricpruss mind sending a fix PR?

@Borda Well, I would love to make my first PL PR, if that's okay? :wink:

@mateuszpieniak sure go ahead! :rocket:
