Allennlp: cuda runtime error (59) on training allennlp BiDAF

Created on 18 Jun 2018  ·  13 Comments  ·  Source: allenai/allennlp

Edit: Sorry forgot to include system specs

  • allennlp: 0.5.0
  • pytorch: '0.4.0'
  • gpu devices 'TITAN Xp'

So I'm trying to retrain the allennlp BiDAF model on the Squad 2.0 dataset. I've modified the Squad 2.0 dataset so that it works with the Squad 1.1 dataset reader. For unanswerable questions, I've fixed the answer to the first char in the context paragraph and set the answer start to 0.

I was able to get BiDAF to train successfully on a small subset (the first 45 topics) of the Squad 2.0 train dataset, using the full dev dataset for validation. When I increase the number of topics to 46, regardless of which topic I include, I run into a weird CUDA runtime error (below). It consistently happens midway through training the first epoch.

2018-06-18 16:00:42,404 - INFO - allennlp.training.trainer - Beginning training.
2018-06-18 16:00:42,404 - INFO - allennlp.training.trainer - Epoch 0/19
2018-06-18 16:00:42,404 - INFO - allennlp.training.trainer - Peak CPU memory usage MB: 3518.464
2018-06-18 16:00:42,500 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 12
2018-06-18 16:00:42,501 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 705
  0%|          | 0/369 [00:00<?, ?it/s]2018-06-18 16:00:42,502 - INFO - allennlp.training.trainer - Training
start_acc: 0.2136, end_acc: 0.1931, span_acc: 0.1876, em: 0.0145, f1: 0.0220, loss: 8.0499 ||:  14%|#3        | 50/369 [00:26<02:51,  1.86it/s]/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorCopy.c line=70 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/run.py", line 18, in <module>
    main(prog="allennlp")
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 70, in main
    args.func(args)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/commands/train.py", line 101, in train_model_from_args
    args.recover)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/commands/train.py", line 131, in train_model_from_file
    return train_model(params, serialization_dir, file_friendly_logging, recover)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/commands/train.py", line 296, in train_model
    metrics = trainer.train()
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/training/trainer.py", line 679, in train
    train_metrics = self._train_epoch(epoch)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/training/trainer.py", line 463, in _train_epoch
    loss = self._batch_loss(batch, for_training=True)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/training/trainer.py", line 398, in _batch_loss
    output_dict = self._model(**batch)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/models/reading_comprehension/bidaf.py", line 263, in forward
    self._span_start_accuracy(span_start_logits, span_start.squeeze(-1))
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/training/metrics/categorical_accuracy.py", line 36, in __call__
    predictions, gold_labels, mask = self.unwrap_to_tensors(predictions, gold_labels, mask)
  File "/home/dhairya/.conda/envs/nlp/lib/python3.6/site-packages/allennlp/training/metrics/metric.py", line 58, in <genexpr>
    return (x.detach().cpu() if isinstance(x, torch.Tensor) else x for x in tensors)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorCopy.c:70

My initial experiment is just to see if BiDAF with no modification can handle Squad 2.0 with a fixed answer for unanswerable questions.

Here's my code for modifying Squad 2.0:

from copy import deepcopy
import json

break_idx = 46

def convert_squad(data):
    sample_train = { "version":"2.0", "data":list() }

    for i,item in enumerate(data["data"]):

        if i == break_idx:
            break     

        ic = deepcopy(item)

        for j,p in enumerate(ic["paragraphs"]):

            # 1. Unanswerable questions point at the first char of the context
            answer_start = 0

            # 2. Loop over qa dicts and reduce them to fit the squad 1.1 spec
            for q in p["qas"]:
                if not q["is_impossible"]:
                    q.pop('is_impossible')
                else:
                    q.pop('plausible_answers')
                    q.pop('is_impossible')
                    q["answers"].append( { 'text':  p["context"][answer_start],
                                           'answer_start': answer_start } )
        sample_train["data"].append(ic)
    return sample_train

# Use context managers so the input files are closed after reading
with open("train-v.2.0.json", "r") as infile:
    train = json.load(infile)
with open("dev-v.2.0.json", "r") as infile:
    dev = json.load(infile)

train_cv = convert_squad(train)
dev_cv = convert_squad(dev)

with open('squad_fc2/full_train.json', 'w') as outfile:
    json.dump(train_cv, outfile)

with open('squad_fc2/full_dev.json', 'w') as outfile:
    json.dump(dev_cv, outfile)

And here is the config.json file for bidaf (I only modified the paths for the data). I'm intentionally validating against the training set to see if there is an issue with the dataset itself. Training BiDAF on dev and validating against dev (total of 35 topics) worked fine.

{
  "dataset_reader": {
    "type": "squad",
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "token_characters": {
        "type": "characters",
        "character_tokenizer": {
          "byte_encoding": "utf-8",
          "start_tokens": [259],
          "end_tokens": [260]
        }
      }
    }
  },
  "train_data_path": "full_train.json",
  "validation_data_path": "full_train.json",
  "model": {        
    "type": "bidaf",
    "text_field_embedder": {
      "tokens": {
        "type": "embedding",
        "pretrained_file": "../glove.6B.100d.txt.gz",
        "embedding_dim": 100,
        "trainable": false
      },
      "token_characters": {
        "type": "character_encoding",
        "embedding": {
          "num_embeddings": 262,
          "embedding_dim": 16
        },
        "encoder": {
          "type": "cnn",
          "embedding_dim": 16,
          "num_filters": 100,
          "ngram_filter_sizes": [5]
        },
        "dropout": 0.2
      }
    },
    "num_highway_layers": 2,
    "phrase_layer": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 200,
      "hidden_size": 100,
      "num_layers": 1,
      "dropout": 0.2
    },
    "similarity_function": {
      "type": "linear",
      "combination": "x,y,x*y",
      "tensor_1_dim": 200,
      "tensor_2_dim": 200
    },
    "modeling_layer": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 800,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.2
    },
    "span_end_encoder": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 1400,
      "hidden_size": 100,
      "num_layers": 1,
      "dropout": 0.2
    },
    "dropout": 0.2
  },
  "iterator": {
    "type": "bucket",
    "sorting_keys": [["passage", "num_tokens"], ["question", "num_tokens"]],
    "batch_size": 40
  },

  "trainer": {
    "num_epochs": 20,
    "grad_norm": 5.0,
    "patience": 10,
    "validation_metric": "+em",
    "cuda_device": 0,
    "learning_rate_scheduler": {
      "type": "reduce_on_plateau",
      "factor": 0.5,
      "mode": "max",
      "patience": 2
    },
    "optimizer": {
      "type": "adam",
      "betas": [0.9, 0.9]
    }
  }
}

All 13 comments

It looks like the gold/predicted span indices are out of range.

In order to debug this better, I suggest:

  1. Use the allennlp dry-run command on your new dataset. This will print out statistics about your labels etc., notifying you if any of the values look incorrect.
  2. Reproduce the error by running the same script with CUDA_LAUNCH_BLOCKING=1 python ..../your/script. This will give you a more precise stack trace: CUDA calls are asynchronous, so the python stack trace doesn't necessarily correspond to where the error happened (and given the error is in the cross entropy loss kernel, it's almost certainly happening in the loss function, and not the metric).
  3. Print out all the values of the gold span indices in the bidaf model as you run it to check they satisfy 0 <= gold_index < max_paragraph_length.
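
A minimal sketch of the range check in step 3, in plain Python. The helper name `find_out_of_range_spans` and the `(span_start, span_end, passage_length)` tuple layout are assumptions for illustration and are not part of allennlp; inside the model the same condition would apply to the gold span tensors.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int, int]  # (span_start, span_end, passage_length), in tokens


def find_out_of_range_spans(spans: Iterable[Span]) -> List[Tuple[int, Span]]:
    """Return (position, span) pairs whose indices fall outside the passage.

    A valid span satisfies 0 <= span_start <= span_end < passage_length,
    which mirrors the `t >= 0 && t < n_classes` assertion in the CUDA kernel.
    """
    bad = []
    for position, (start, end, length) in enumerate(spans):
        if not (0 <= start <= end < length):
            bad.append((position, (start, end, length)))
    return bad


# Hypothetical values: the second span is negative, the third runs past the end.
example = [(0, 2, 10), (-1, -1, 10), (4, 12, 10)]
bad_spans = find_out_of_range_spans(example)
print(bad_spans)
```

Any entry reported here would trip the loss kernel's assertion once the batch containing it reaches the GPU.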

Is there an easy way to do number 3 when using the allennlp implementations? Or do I need to reimplement it and manually put in the print statements?

Nothing stood out in the allennlp dry-run.

----Dataset Statistics----

Statistics for passage.num_tokens:
Lengths: Mean: 140.71698126476878, Standard Dev: 58.79612917186057, Max: 809, Min: 22
Statistics for passage.num_token_characters:
Lengths: Mean: 15.24082930122651, Standard Dev: 2.194584132541406, Max: 62, Min: 9
Statistics for question.num_tokens:
Lengths: Mean: 11.221158714976932, Standard Dev: 3.614018563223931, Max: 60, Min: 1
Statistics for question.num_token_characters:
Lengths: Mean: 11.294369584786768, Standard Dev: 2.0488910861808285, Max: 29, Min: 3

----Vocabulary Statistics----

Vocabulary with namespaces:
Non Padded Namespaces: ('tags', 'labels')
Namespace: tokens, Size: 97621

Top 10 most frequent tokens in namespace 'tokens':
Token: the Frequency: 1435094
Token: , Frequency: 1082813
Token: of Frequency: 733148
Token: . Frequency: 701132
Token: and Frequency: 555806
Token: in Frequency: 517621
Token: to Frequency: 407954
Token: a Frequency: 335960
Token: is Frequency: 190312
Token: " Frequency: 186709

Top 10 longest tokens in namespace 'tokens':
Token: history.:60–63,67–68,75,78–79,215–216 length: 37 Frequency: 5
Token: www.skillsinstitute.tas.edu.au length: 30 Frequency: 1
Token: coincidentally,[clarification length: 29 Frequency: 4
Token: http://www.voicefmradio.co.uk length: 29 Frequency: 2
Token: norepinephrine[clarification length: 28 Frequency: 15
Token: turbosuperchargers.[citation length: 28 Frequency: 10
Token: constitution,[clarification length: 27 Frequency: 10
Token: privatisation,[verification length: 27 Frequency: 5
Token: wiedergutmachungsinitiative length: 27 Frequency: 3
Token: intelligence.[clarification length: 27 Frequency: 3

Top 10 shortest tokens in namespace 'tokens':
Token: \ length: 1 Frequency: 1
Token: ƿ length: 1 Frequency: 1
Token: þ length: 1 Frequency: 1
Token: 韓 length: 1 Frequency: 3
Token: 한 length: 1 Frequency: 3
Token: 父 length: 1 Frequency: 3
Token: 부 length: 1 Frequency: 3
Token: 下 length: 1 Frequency: 3
Token: 하 length: 1 Frequency: 3
Token: 小 length: 1 Frequency: 3

>>> import allennlp
>>> from allennlp.models import BidirectionalAttentionFlow
>>> import inspect
>>> inspect.getfile(BidirectionalAttentionFlow)
'/Users/markn/allen_ai/allennlp/allennlp/models/reading_comprehension/bidaf.py'

Then just add a print statement to that file.

Thanks @DeNeutoy for your help with debugging this!

This can be closed out. The issue was that a small set of paragraphs (.001%) have a leading white space or '\n' as the first character. Removing those paragraphs resolved the issue. Digging into the code, the nll_loss function expects only non-negative target values when running on CUDA (the kernel asserts `t >= 0 && t < n_classes`). White space and '\n' were encoded to -1, which caused the CUDA assertion error.
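
For anyone hitting the same thing, here is a rough sketch of the filtering step described above. The function name `drop_leading_whitespace_paragraphs` is made up for illustration, and it assumes the standard SQuAD layout (`data` -> `paragraphs` -> `context`); it is not the exact code used for the experiments in this thread.

```python
from copy import deepcopy


def drop_leading_whitespace_paragraphs(squad):
    """Return (cleaned_copy, removed_count) for a SQuAD-style dict.

    Paragraphs whose context is empty or starts with whitespace
    (a space, a newline, etc.) are dropped.
    """
    cleaned = deepcopy(squad)
    removed = 0
    for article in cleaned["data"]:
        kept = []
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            if context and not context[0].isspace():
                kept.append(paragraph)
            else:
                removed += 1
        article["paragraphs"] = kept
    return cleaned, removed


# Tiny hypothetical sample: one bad paragraph, one clean paragraph.
sample = {
    "version": "2.0",
    "data": [{
        "title": "example",
        "paragraphs": [
            {"context": "\nLeading newline.", "qas": []},
            {"context": "Clean paragraph.", "qas": []},
        ],
    }],
}
cleaned, removed = drop_leading_whitespace_paragraphs(sample)
print(removed, len(cleaned["data"][0]["paragraphs"]))
```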

Also if anyone is curious here's the vanilla bidaf w/ glove embeddings performance on Squad 2.0 with the naive strategy of fixing unanswerable questions to the first char in the paragraph. I'll update this comment w/ the second experiment where we also fixed unanswerable questions to a dummy character appended to the end of the passage.

2018-06-20 22:36:22,231 - INFO - allennlp.commands.train - Metrics: {
  "training_duration": "04:13:05",
  "training_start_epoch": 0,
  "training_epochs": 20,
  "training_start_acc": 0.7761044794486832,
  "training_end_acc": 0.8029396381983755,
  "training_span_acc": 0.7138429116416442,
  "training_em": 0.45988955205513166,
  "training_f1": 0.5296586244709655,
  "training_loss": 1.2709182788136921,
  "validation_start_acc": 0.5493135686010275,
  "validation_end_acc": 0.5472079508127685,
  "validation_span_acc": 0.4913669670681378,
  "validation_em": 0.29040680535669167,
  "validation_f1": 0.322344423727657,
  "validation_loss": 4.038828652314465,
  "best_validation_em": 0.3043881074707319,
  "best_epoch": 13
}

If anyone is curious, here is bidaf + glove trained on Squad 2.0. We fixed unanswerable questions to the & token, which was appended to the end of the context paragraph.

2018-06-21 15:04:35,018 - INFO - allennlp.commands.train - Metrics: {
  "training_duration": "04:03:31",
  "training_start_epoch": 0,
  "training_epochs": 18,
  "training_start_acc": 0.7819913233438971,
  "training_end_acc": 0.8075828436048121,
  "training_span_acc": 0.7225008461278114,
  "training_em": 0.757630534445094,
  "training_f1": 0.5361028669992506,
  "training_loss": 1.2408861718116193,
  "validation_start_acc": 0.5429538721963606,
  "validation_end_acc": 0.5454083791790098,
  "validation_span_acc": 0.49614896318239526,
  "validation_em": 0.554126110876005,
  "validation_f1": 0.3202905788967037,
  "validation_loss": 4.1162775263593,
  "best_validation_em": 0.5855268726195514,
  "best_epoch": 8
}

(Sorry to be slow; I was on vacation.) How do you have F1 lower than EM? Is something funny going on with your metrics when you have the extra dummy token?
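
For context on why that looks odd: per example, an exact match implies a token-level F1 of 1, so an averaged F1 should not fall below the averaged EM when both are computed over the same examples. A simplified sketch of SQuAD-style scoring illustrates this. It uses whitespace tokens only and skips the official script's answer normalization, and the function names are illustrative rather than allennlp's.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    if not pred_tokens or not gold_tokens:
        # Empty answers score 1.0 only when both sides are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (prediction, gold) pairs: exact, partial, and no overlap.
pairs = [("in France", "in France"), ("France", "in France"), ("Paris", "in France")]
mean_em = sum(exact_match(p, g) for p, g in pairs) / len(pairs)
mean_f1 = sum(token_f1(p, g) for p, g in pairs) / len(pairs)
print(mean_em, mean_f1)
```

With this definition F1 is at least EM on every pair, so a reported EM above F1 suggests the two metrics are being computed over different predictions or gold answers.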

@matt-gardner No worries. That's a good point, I'll double check what's going on and report back if it's helpful to the community. I encoded both the training and dev set in the same way. I add the dummy token to each passage before it gets passed to the model. It's worth noting that I haven't modified the last layer of BiDAF like the Zero shot paper, since we were curious if the base implementation can handle unanswerable questions.

For our purposes, BiDAF trained on SQUAD 2.0 was way too aggressive in predicting the dummy token. We ended up actually creating a simpler dataset with mismatched questions that were obviously unanswerable. Interestingly enough, your observation holds here too. We'll revisit this after the 4th and post our findings.

2018-06-25 21:09:27,857 - INFO - allennlp.commands.train - Metrics: {
  "training_duration": "03:31:20",
  "training_start_epoch": 0,
  "training_epochs": 20,
  "training_start_acc": 0.8362958661896646,
  "training_end_acc": 0.8720710258667516,
  "training_span_acc": 0.768632115229729,
  "training_em": 0.8118526288186705,
  "training_f1": 0.7273442133657387,
  "training_loss": 0.8864690384838019,
  "validation_start_acc": 0.6778763817530749,
  "validation_end_acc": 0.7101821578701542,
  "validation_span_acc": 0.6065701385645337,
  "validation_em": 0.7129067413981006,
  "validation_f1": 0.6215613452574806,
  "validation_loss": 2.74923387780693,
  "best_validation_em": 0.721469718200218,
  "best_epoch": 10
}

Hey @dhairyadalal,
I hope this does not come across as cheeky but I was just wondering is there any chance you would make your changes public as I would love to train and use this model?

@murphp15 Sorry, just saw this in my email.

There were no model changes. I modified how squad 2.0 was structured so that empty answer spans for unanswerable questions pointed to the last character in the passage. I also appended a special character '&' at the end of all passages sent to the model, just to standardize the model's output for unanswerable questions. Note that the most recent model results above are not based on squad 2.0. They're based on generating an easier set of unanswerable questions (basically mismatched questions from different passages).
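
A rough sketch of that restructuring for a single paragraph, in case it helps. The helper name `add_null_answer` is made up for illustration, and the single space placed before '&' is an assumption, so this is not necessarily identical to the preprocessing used for the numbers in this thread.

```python
from copy import deepcopy

NULL_TOKEN = "&"  # dummy answer for unanswerable questions


def add_null_answer(paragraph):
    """Append NULL_TOKEN to the context and point unanswerable questions at it.

    Assumption: one space separates the passage text from the dummy token.
    Answerable questions keep their answers; only the squad 2.0 flag is removed.
    """
    p = deepcopy(paragraph)
    p["context"] = p["context"] + " " + NULL_TOKEN
    null_start = len(p["context"]) - len(NULL_TOKEN)
    for qa in p["qas"]:
        if qa.pop("is_impossible", False):
            qa.pop("plausible_answers", None)
            qa["answers"] = [{"text": NULL_TOKEN, "answer_start": null_start}]
    return p


# Hypothetical paragraph with one answerable and one unanswerable question.
para = {
    "context": "Paris is in France.",
    "qas": [
        {"question": "Where is Paris?", "is_impossible": False,
         "answers": [{"text": "France", "answer_start": 12}]},
        {"question": "Who built it?", "is_impossible": True,
         "plausible_answers": [], "answers": []},
    ],
}
converted = add_null_answer(para)
print(converted["context"])
print(converted["qas"][1]["answers"])
```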

BiDAF's performance on Squad 2.0 data using the fixed end character strategy was:
2018-06-21 15:04:35,018 - INFO - allennlp.commands.train - Metrics: {
  "training_duration": "04:03:31",
  "training_start_epoch": 0,
  "training_epochs": 18,
  "training_start_acc": 0.7819913233438971,
  "training_end_acc": 0.8075828436048121,
  "training_span_acc": 0.7225008461278114,
  "training_em": 0.757630534445094,
  "training_f1": 0.5361028669992506,
  "training_loss": 1.2408861718116193,
  "validation_start_acc": 0.5429538721963606,
  "validation_end_acc": 0.5454083791790098,
  "validation_span_acc": 0.49614896318239526,
  "validation_em": 0.554126110876005,
  "validation_f1": 0.3202905788967037,
  "validation_loss": 4.1162775263593,
  "best_validation_em": 0.5855268726195514,
  "best_epoch": 8
}

If it's helpful, I can create a new dataset reader class that will read the squad 2.0 dataset and make it compatible w/ bidaf. I'm also happy to publish our trained model for squad 2.0. Not sure if there is a way to share directly w/ allennlp (@matt-gardner), but I can set up a repo on my github. But like I said, it's pretty terrible, since we are hacking BiDAF.

@dhairyadalal, we love contributions. We're just not sure yet what the right way is to handle contributed models like this. Our current idea is to put up a page with links to third-party models, but we'd also like to be able to host demos of them, and such. For now, if you get it up in a github repo, we can link to it, and think about better ways of handling contributed models in the future.

@dhairyadalal I think we should collaborate on Squad2.0 as Squad1.1 is largely a done deal. Squad2.0 is actively being worked on to beat human performance. https://rajpurkar.github.io/SQuAD-explorer/

I don't think a new reader is enough. We would need to change the underlying model as well. An EM of 0.58 is nowhere close to where we would need to be to beat human performance.

@matt-gardner We would love to get guidance from the allennlp team on what model changes you think might be needed. Happy to allocate Dev, GPU resources to make this happen.

@muni2773 Happy to collaborate. You'll find me more responsive over email dhairya.b.dalal [at] gmail. There are a couple of models already on the Squad 2.0 leaderboard, but none of them have associated papers ready. I'd be curious how folks are thinking of representing the null answer. My strategy of fixing it to a char at the end of the passage is not ideal. Empirically we've found that this tends to shorten the attention and truncate returned answers.

This one has a paper, and it seems like it'd be a good starting place: https://arxiv.org/abs/1808.05759. Unfortunately, we don't have the bandwidth to help on this problem. Best of luck working on it!
