Pytorch-lightning: Have an example of showing explicitly how to calculate metrics in DDP

Created on 25 Aug 2020  ·  16 Comments  ·  Source: PyTorchLightning/pytorch-lightning

🚀 Feature

Given the new updates in 0.9.0, it would be desirable to have an example showing exactly and explicitly how to calculate metrics in DDP. The metrics of interest are those that require all the labels and predictions for an entire epoch, such as F1 score or average precision.
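For illustration only (not part of the original request), here is a toy sketch of why such metrics cannot simply be averaged over batches or processes; the arrays are made up:

    import numpy as np
    from sklearn.metrics import f1_score

    # two "batches" (or two DDP processes) of labels and predictions
    y1, p1 = np.array([1, 0, 0, 0]), np.array([1, 1, 1, 1])
    y2, p2 = np.array([1, 1, 1, 0]), np.array([1, 0, 0, 0])

    # averaging per-batch F1 scores ...
    mean_of_batch_f1 = (f1_score(y1, p1) + f1_score(y2, p2)) / 2   # 0.45
    # ... is not the same as F1 over all labels/predictions of the epoch
    epoch_f1 = f1_score(np.concatenate([y1, y2]), np.concatenate([p1, p2]))   # ~0.444

    print(mean_of_batch_f1, epoch_f1)   # they differ in general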

Motivation

As a big fan of this project and a data scientist who already uses Lightning at work, I am still not sure whether I have computed metrics in DDP correctly. While it is easy to spot an obvious mistake when the calculated F1 goes above 1 under DDP (by definition, F1 is between 0 and 1 and can never exceed 1), it is hard to know for sure whether a metric is calculated correctly, as there is no official document detailing exactly how to do it. Having the proposed example would greatly boost the adoption of Lightning in the PyTorch community.

Pitch

Have an example of calculating metrics such as F1 in DDP

Alternatives

Additional context

There are many issues related to calculating metrics in DDP, and honestly this could be the most challenging part for further adoption. Having the proposed example would greatly help in this regard.

Labels: Metrics, enhancement, help wanted, tutorial / example


All 16 comments

mind having a look @SkafteNicki @justusschock 🐰

Below is my pseudo-code for DDP with the F1 metric; however, I have seen the F1 metric the code returns for validation/test go above 1, meaning it is somehow aggregated incorrectly.

class Transformer(LightningModule):
    def __init__(self, hparams):
        super().__init__()

        self.hparams = hparams

        # load the pretrained model
        self.model = BertForSequenceClassification.from_pretrained(hparams.model_loc)

        # define metric for binary classification
        self.f1 = pl.metrics.F1(num_classes=2, reduction='none')

    def __calculate(self, batch):
        input, y = batch
        y_hat = self.model(input)
        loss = torch.nn.functional.cross_entropy(y_hat, y)
        _, y_pred = torch.max(y_hat, dim=-1)

        return loss, y, y_pred

    def __compute_metrics(self, y_pred, y):
        f1 = self.f1(y_pred, y)[1]      # need to take f1 for positive label 1
        return f1

    def __train_valid_test_step(self, batch, kind):
        loss, y, y_pred = self.__calculate(batch)
        if kind == 'train':
            result = pl.TrainResult(loss)
        else:
            result = pl.EvalResult(checkpoint_on=loss)

        result.y = y
        result.y_pred = y_pred

        return result

    def __valid_test_epoch_end(self, result, kind):
        f1 = self.__compute_metrics(result.y_pred, result.y)
        result.f1 = f1
        return result

    def validation_step(self, batch, batch_idx):
        return self.__train_valid_test_step(batch, kind='valid')

    def validation_step_end(self, result):
        return self.__valid_test_epoch_end(result, kind='valid')

The reason for having reduction='none' for F1,

self.f1 = pl.metrics.F1(num_classes=2, reduction='none')

is that I noticed the following behavior of F1:

import numpy as np
import torch
import pytorch_lightning as pl
from sklearn.metrics import f1_score

pl.seed_everything(2020)
n = 10000  # number of samples
y = np.random.choice([0, 1], n)
y_hat = np.random.random(n)
threshold = 0.2
y_pred = (y_hat > threshold).astype(int)

y_tensor = torch.tensor(y)
y_hat_tensor = torch.tensor(y_hat)
y_pred_tensor = torch.tensor(y_pred)

print('F1 from sklearn', f1_score(y, y_pred))
print('F1 from lightning functional', pl.metrics.functional.f1_score(y_pred_tensor, y_tensor, num_classes=2, reduction='none'))
print('F1 from lightning tensor', pl.metrics.F1(num_classes=2, reduction='none')(y_pred_tensor, y_tensor))

# printed results are as follows
# F1 from sklearn 0.6127105666156202
# F1 from lightning functional tensor([0.2712, 0.6127])
# F1 from lightning tensor tensor([0.2712, 0.6127])

print('F1 from sklearn', f1_score(y, y_pred))
print('F1 from lightning functional', pl.metrics.functional.f1_score(y_pred_tensor, y_tensor, num_classes=2))
print('F1 from lightning tensor', pl.metrics.F1(num_classes=2)(y_pred_tensor, y_tensor))

# print results are as follows
# F1 from sklearn 0.6127105666156202
# F1 from lightning functional tensor(0.4419)
# F1 from lightning tensor tensor(0.4419)

Therefore, without reduction='none', the F1 score from Lightning automatically averages the per-class F1 scores, which is not consistent with sklearn and not what we want.
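For reference, the reduced value 0.4419 above is simply the unweighted mean of the two per-class scores:

    # the default reduction is just the unweighted (macro-style) mean of the per-class F1 scores
    per_class = [0.2712, 0.6127]
    print(sum(per_class) / len(per_class))   # 0.44195, matching tensor(0.4419) above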

@PyTorchLightning/core-contributors Could some core developer chime in and provide an example? Thanks.

related: #3225, #3230
:+1: for adding an example to docs, that would be useful

@awaelchli I'd like to but currently the code is not working :(

@junwen-austin I am hoping to finish a PR in the following days that changes the reduction parameter for metrics used in classification problems to work similarly to how sklearn does it (micro average as default, with the option to switch to macro, weighted, etc.).
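For reference (not part of the original comment), this is roughly how sklearn exposes those averaging modes; the toy arrays are made up:

    from sklearn.metrics import f1_score

    y_true = [0, 1, 1, 0, 1, 0]
    y_pred = [0, 1, 0, 0, 1, 1]

    # 'binary' scores only the positive class (like taking index [1] above)
    print(f1_score(y_true, y_pred, average='binary'))
    # 'micro' pools TP/FP/FN over all classes before computing F1
    print(f1_score(y_true, y_pred, average='micro'))
    # 'macro' takes the unweighted mean of per-class scores
    print(f1_score(y_true, y_pred, average='macro'))
    # 'weighted' weights per-class scores by their support
    print(f1_score(y_true, y_pred, average='weighted'))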

@justusschock @awaelchli @Borda @SkafteNicki thanks for the heads-up. Meanwhile, I figured out a way to do the DDP sync manually with 0.9.0, based on the code from https://github.com/allenai/longformer/blob/master/scripts/triviaqa.py. Specifically, one can add the following member method to the LightningModule. It is a simple modification of the sync_list_across_gpus function in the link above:

    def sync_across_gpus(self, t):   # t is a tensor
        # gather the tensor from every process (all_gather requires t to have
        # the same shape on every rank) and concatenate the results
        gather_t_tensor = [torch.ones_like(t) for _ in range(self.trainer.world_size)]
        torch.distributed.all_gather(gather_t_tensor, t)
        return torch.cat(gather_t_tensor)

To make it work, one can do the following whenever a metric is needed, for example at the end of the validation/test epoch, using the pseudo-code above:

def __valid_test_epoch_end(self, result, kind):
    y_pred = result.y_pred
    y = result.y

    # sync across gpus
    if self.trainer.use_ddp:
        y_pred = self.sync_across_gpus(y_pred)
        y = self.sync_across_gpus(y)

    f1 = self.__compute_metrics(y_pred, y)
    result.f1 = f1
    return result

In addition, we have to average in the __compute_metrics function, because each GPU now gets the same F1 score and by default they somehow get summed together:

def __compute_metrics(self, y_pred, y):
    f1 = self.f1(y_pred, y)[1] / self.trainer.world_size     # need to take f1 for positive label 1
    return f1

This is definitely not the best thing to do, but at least it is a workaround for those like me who need metrics with DDP working as soon as possible.

@junwen-austin - thanks, this is nice. I am doing it similarly, but worse: pickling the results of each worker and collecting them on worker 0. Instead of doing the final average of the metric, I assume you could also limit this calculation to worker 0, which is anyway the one doing the logging.

@psinger I did not limit the calculation to worker 0, though I did try to find the rank 0 GPU with no success :(

That's why I have to divide the metric result by self.trainer.world_size, as the method above sums the identical metric results from each GPU.

@junwen-austin You can get the rank via self.global_rank

metric = some_unreduced_metric_fn(x, y)
group = torch.distributed.group.WORLD
group_size = torch.distributed.get_world_size(group)
# pre-allocate the gather list on the destination rank only
gather_list = [torch.empty_like(metric) for _ in range(group_size)] if self.global_rank == 0 else None
torch.distributed.barrier(group)
torch.distributed.gather(metric, gather_list=gather_list, dst=0, group=group, async_op=False)
if self.global_rank == 0:
    metric = torch.cat(gather_list)
    metric = some_reduce_fn(metric)
    some_logging_fn(metric)

something like this?


Is this feature in the master branch code?

Class-based metrics have been revamped!
Please check out the documentation for the new interface and see if it solves your problem (fewer metrics are available at the moment, as we are in the process of converting them to the new API).
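For anyone landing here later, a minimal sketch of the revamped class-based interface (around Lightning 1.0; Accuracy is used for illustration since not all metrics had been ported yet, so treat the exact names as an assumption rather than the official example):

    import torch
    import pytorch_lightning as pl

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # class-based metric: accumulates state across batches and syncs it across DDP processes
            self.val_acc = pl.metrics.Accuracy()

        def validation_step(self, batch, batch_idx):
            x, y = batch
            preds = torch.argmax(self(x), dim=-1)
            self.val_acc.update(preds, y)   # accumulate state; no reduction yet

        def validation_epoch_end(self, outputs):
            # compute() reduces the accumulated (DDP-synced) state once per epoch
            self.log('val_acc', self.val_acc.compute())
            self.val_acc.reset()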

@SkafteNicki Thanks I'll test it out with the new version!

@junwen-austin feel free to reopen if still needed 🐰

