Pytorch-lightning: 'Trainer' object has no attribute 'proc_rank'

Created on 19 Jun 2020 · 8 Comments · Source: PyTorchLightning/pytorch-lightning

🐛 Bug


The first epoch runs to completion, then the error shown below is thrown in the is_logger() method.

To Reproduce

AttributeError                            Traceback (most recent call last)

<ipython-input-14-1b9ebf437115> in <module>()
      3 trainer = pl.Trainer(**train_params)
      4 
----> 5 trainer.fit(model)

8 frames

<ipython-input-3-bb983543bb31> in is_logger(self)
      8 
      9   def is_logger(self):
---> 10     return self.trainer.proc_rank <= 0
     11 
     12   def forward(

AttributeError: 'Trainer' object has no attribute 'proc_rank'

Code sample

import torch
import torch.nn.functional
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5Tokenizer,
    get_linear_schedule_with_warmup,
)
# import torch_xla.core.xla_model as xm  # only needed when training on TPU
# get_dataset(tokenizer, type_path, args) is a helper defined elsewhere in the notebook


class T5FineTuner(pl.LightningModule):
  def __init__(self, hparams):
    super(T5FineTuner, self).__init__()
    self.hparams = hparams

    self.model = T5ForConditionalGeneration.from_pretrained(hparams.model_name_or_path)
    self.tokenizer = T5Tokenizer.from_pretrained(hparams.tokenizer_name_or_path)

  def is_logger(self):
    # this is the line that raises the AttributeError on the latest release
    return self.trainer.proc_rank <= 0

  def forward(
      self, input_ids, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, lm_labels=None
  ):
    return self.model(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        lm_labels=lm_labels,
    )

  def _step(self, batch):
    lm_labels = batch["target_ids"]
    lm_labels[lm_labels[:, :] == self.tokenizer.pad_token_id] = -100

    outputs = self(
        input_ids=batch["source_ids"],
        attention_mask=batch["source_mask"],
        lm_labels=lm_labels,
        decoder_attention_mask=batch['target_mask']
    )

    loss = outputs[0]

    return loss

  def training_step(self, batch, batch_idx):
    loss = self._step(batch)

    tensorboard_logs = {"train_loss": loss}
    return {"loss": loss, "log": tensorboard_logs}

  def training_epoch_end(self, outputs):
    avg_train_loss = torch.stack([x["loss"] for x in outputs]).mean()
    tensorboard_logs = {"avg_train_loss": avg_train_loss}
    return {"avg_train_loss": avg_train_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def validation_step(self, batch, batch_idx):
    loss = self._step(batch)
    return {"val_loss": loss}

  def validation_epoch_end(self, outputs):
    avg_loss = torch.stack([x["val_loss"] for x in outputs]).mean()
    tensorboard_logs = {"val_loss": avg_loss}
    return {"avg_val_loss": avg_loss, "log": tensorboard_logs, 'progress_bar': tensorboard_logs}

  def configure_optimizers(self):
    "Prepare optimizer and schedule (linear warmup and decay)"

    model = self.model
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": self.hparams.weight_decay,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=self.hparams.learning_rate, eps=self.hparams.adam_epsilon)
    self.opt = optimizer
    return [optimizer]

  def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, second_order_closure=None):
    if self.trainer.use_tpu:
      xm.optimizer_step(optimizer)
    else:
      optimizer.step()
    optimizer.zero_grad()
    self.lr_scheduler.step()

  def get_tqdm_dict(self):
    tqdm_dict = {"loss": "{:.3f}".format(self.trainer.avg_loss), "lr": self.lr_scheduler.get_last_lr()[-1]}

    return tqdm_dict

  def train_dataloader(self):
    train_dataset = get_dataset(tokenizer=self.tokenizer, type_path="train", args=self.hparams)
    dataloader = DataLoader(train_dataset, batch_size=self.hparams.train_batch_size, drop_last=True, shuffle=True, num_workers=4)
    t_total = (
        (len(dataloader.dataset) // (self.hparams.train_batch_size * max(1, self.hparams.n_gpu)))
        // self.hparams.gradient_accumulation_steps
        * float(self.hparams.num_train_epochs)
    )
    scheduler = get_linear_schedule_with_warmup(
        self.opt, num_warmup_steps=self.hparams.warmup_steps, num_training_steps=t_total
    )
    self.lr_scheduler = scheduler
    return dataloader

  def val_dataloader(self):
    val_dataset = get_dataset(tokenizer=self.tokenizer, type_path="val", args=self.hparams)
    return DataLoader(val_dataset, batch_size=self.hparams.eval_batch_size, num_workers=4)

Expected behavior

Training continues past the first epoch without raising an AttributeError.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: NVIDIA K80
  • Any other relevant information:

Additional context


This code runs on version 0.7.6, but it breaks in the latest release.

Labels: help wanted, question

All 8 comments

Hi! Thanks for your contribution, great first issue!

proc_rank was renamed to global_rank in #2166
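
A minimal sketch of the corresponding fix (assuming pytorch-lightning >= 0.8.0, where the attribute is called global_rank):

  def is_logger(self):
    # global_rank replaces the removed proc_rank attribute
    return self.trainer.global_rank <= 0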

EDIT: it is also recommended to use the rank_zero_only decorator.
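
A hedged sketch of that approach (the exact import path may differ between versions):

from pytorch_lightning.utilities import rank_zero_only

@rank_zero_only
def log_on_rank_zero(message):
    # the decorator turns this into a no-op on every process except global rank 0,
    # so no manual proc_rank / global_rank comparison is needed
    print(message)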

but also, why do you need to check the proc rank?

Actually I borrowed most of the code from HuggingFace's T5 finetuning script, but I guess if it's not needed then I will remove it.

@vishal-burman do you have ddp working with transformers and pytorch-lightning==0.8.1?
I am struggling. I fixed this proc_rank bug, but now have nan loss in both fp16 and fp32. Very strange.

@sshleifer I am also struggling to make DDP work. According to the PyTorch documentation, there is a warning not to change model parameters after DDP construction. I wonder if that could be the cause.
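
For illustration (this is not from the issue, just the general pattern the docs describe): any parameter changes have to happen before the DistributedDataParallel wrap, because DDP registers its gradient-reduction hooks against the parameters it sees at construction time. A minimal sketch:

import torch.distributed as dist
import torch.nn as nn

model = nn.Linear(10, 2)

# 1) do any parameter surgery first, e.g. freezing a parameter
model.bias.requires_grad_(False)

# 2) only then wrap for distributed training (requires an initialized process group)
if dist.is_available() and dist.is_initialized():
    model = nn.parallel.DistributedDataParallel(model)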

Can you guys post a minimal example that is breaking? In Lightning we don't change model parameters once DDP starts; maybe transformers is doing that?

But either way, the best thing is for us to have a model or test to test against.
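
For what it's worth, a sketch of the kind of minimal, transformers-free example being asked for (random data, hypothetical names), which could be pointed at ddp to check whether the NaN loss reproduces:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class MinimalModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        return DataLoader(ds, batch_size=8)


if __name__ == "__main__":
    # on 0.8.x, add gpus=2, distributed_backend="ddp" to exercise the ddp path
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(MinimalModel())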

QUICK FIX: if you are training your model on a single GPU, set:

  def is_logger(self):
    return True