Allennlp: "TypeError: not a sequence" on simple coreference resolution

Created on 25 May 2020  ·  20 Comments  ·  Source: allenai/allennlp

Describe the bug

I get a TypeError: not a sequence when trying to predict this simple string: "Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States and as a major global university."

Full stacktrace:

Traceback (most recent call last):
  File "/home/arthur/question-generation/models/sg_dqg.py", line 156, in <module>
    preprocess(args.ds)
  File "/home/arthur/question-generation/models/sg_dqg.py", line 124, in preprocess
    coreferences = coreference_resolution(evidences_list)
  File "/home/arthur/question-generation/models/sg_dqg.py", line 74, in coreference_resolution
    document="Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States  and as a major global university."
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp_models/coref/coref_predictor.py", line 65, in predict
    return self.predict_json({"document": document})
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 48, in predict_json
    return self.predict_instance(instance)
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 171, in predict_instance
    outputs = self._model.forward_on_instance(instance)
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/models/model.py", line 142, in forward_on_instance
    return self.forward_on_instances([instance])[0]
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/models/model.py", line 167, in forward_on_instances
    model_input = util.move_to_device(dataset.as_tensor_dict(), cuda_device)
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/batch.py", line 139, in as_tensor_dict
    for field, tensors in instance.as_tensor_dict(lengths_to_use).items():
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/instance.py", line 99, in as_tensor_dict
    tensors[field_name] = field.as_tensor(padding_lengths[field_name])
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/fields/text_field.py", line 103, in as_tensor
    self._indexed_tokens[indexer_name], indexer_lengths[indexer_name]
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 96, in as_padded_tensor_dict
    offsets_tokens, offsets_padding_lengths, default_value=lambda: (0, 0)
TypeError: not a sequence

To Reproduce
Run this simple piece of code:

from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz")
predictor.predict(
    document="Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States  and as a major global university."
)

Expected behavior
The function should return the proper resolved coreferences.

System (please complete the following information):

  • OS: Linux
  • Python version: 3.7.7
  • AllenNLP version: 1.0.0rc4
  • PyTorch version: 1.5.0
bug

All 20 comments

Hi @arthurdeschamps, I did some debugging and found that the root of the issue is that there are some None values in the offsets_tokens list from your stacktrace. torch fails when trying to turn this list into a tensor because it doesn't know how to handle None. These None values come from this line:

https://github.com/allenai/allennlp/blob/master/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py#L360

@dirkgr any ideas how we should fix this?

This is something that has to be handled in the model (or in the original tokenizer).

Recall: _intra_word_tokenize takes an existing tokenization of a string, and cuts the tokens into word pieces suitable for a transformer. What should happen when the existing tokenization produces tokens that have zero word pieces? I can't answer that in general. It depends on your case.

If your answer is "That should never happen.", then look at the cases where it happens anyways and find out why. Maybe the original tokenizer produces tokens that are nothing but spaces? Zero-length tokens? But sometimes there is a legitimate reason for this, and you need to handle it somehow downstream.
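To make the "zero word pieces" case concrete, here is a minimal sketch in plain Python. It is not the AllenNLP implementation: toy_word_pieces and toy_intra_word_tokenize are hypothetical stand-ins, and the inclusive (start, end) offset convention is an assumption made for illustration. It only shows how a whitespace-only token can end up with a None offset.

```python
from typing import List, Optional, Tuple

Offset = Optional[Tuple[int, int]]


def toy_word_pieces(token: str, max_len: int = 4) -> List[str]:
    """Hypothetical word-piece splitter: strips whitespace, then cuts fixed-size chunks."""
    text = token.strip()
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def toy_intra_word_tokenize(tokens: List[str]) -> Tuple[List[str], List[Offset]]:
    """Cut pre-tokenized tokens into pieces and record inclusive (start, end) offsets.

    A token that yields no pieces gets None instead of an offset pair.
    """
    pieces: List[str] = []
    offsets: List[Offset] = []
    for token in tokens:
        token_pieces = toy_word_pieces(token)
        if token_pieces:
            offsets.append((len(pieces), len(pieces) + len(token_pieces) - 1))
            pieces.extend(token_pieces)
        else:
            # Nothing left after stripping, e.g. a token that is only a space.
            offsets.append(None)
    return pieces, offsets


pieces, offsets = toy_intra_word_tokenize(["United", "States", " ", "and"])
print(pieces)
print(offsets)
```

Any code that later turns the offsets list into a tensor would have to decide what to do with the None entry.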

@dirkgr in any case, I think we need to improve our error message here. Seems to me like _intra_word_tokenize should raise an exception when a token results in zero word pieces?

@arthurdeschamps in this case you have extra whitespace in your document:

...he United States  and as...
                   ^^

Nevertheless, the error message should be improved here. See https://github.com/allenai/allennlp/pull/4291
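As a possible workaround on the caller's side, the document can be cleaned before it is passed to the predictor. The sketch below assumes that runs of whitespace are the only trigger in the input. normalize_whitespace is a hypothetical helper, not part of AllenNLP.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


document = (
    "ranked among the top twenty universities in the United States  "
    "and as a major global university."
)
cleaned = normalize_whitespace(document)
print(cleaned)
```

The cleaned string could then be passed as predictor.predict(document=cleaned). Character positions shift relative to the original text, which matters if results are mapped back to the raw string.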

Seems to me like _intra_word_tokenize should raise an exception when a token results in zero word pieces?

Sometimes this is a case you want to handle specifically. If we throw an exception, you could never deal with that case.

Sometimes this is a case you want to handle specifically. If we throw an exception, you could never deal with that case.

My issue with not throwing an exception is that there are several places that rely on this function returning all non-Nonetype offsets. This issue brought to light one of these places.

mypy should really have caught this before in our CI checks because this is a typing error on our part.

That said, the other option would be to raise an exception downstream in all of the places that use these offsets when None is encountered. I didn't go that route with #4291 because I couldn't find a single example where a None offset was actually handled.
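For illustration, a downstream fail-fast check could look roughly like the following. This is only a sketch of the idea, not what #4291 does. check_offsets is a hypothetical helper, and the offsets are assumed to be a list of optional (start, end) pairs aligned with the tokens.

```python
from typing import List, Optional, Tuple


def check_offsets(tokens: List[str], offsets: List[Optional[Tuple[int, int]]]) -> None:
    """Raise a descriptive error if any token ended up without word pieces."""
    bad = [(i, tok) for i, (tok, off) in enumerate(zip(tokens, offsets)) if off is None]
    if bad:
        raise ValueError(f"Tokens with no word pieces (index, text): {bad!r}")


# Passes silently when every token has an offset.
check_offsets(["United", "States"], [(0, 1), (2, 3)])

# Raises with the offending token listed, instead of an opaque TypeError later on.
try:
    check_offsets(["States", " ", "and"], [(0, 1), None, (2, 2)])
except ValueError as err:
    print(err)
```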

What are all of those places that expect it?

Can we fix it by adding a fixed embedding to the mismatched embedder for these cases, which gets substituted when there is nothing else to add?

I'm looking at the other case right now.

For the coref case, it would be pretty messed up if the mention starts or ends in a token that has no word pieces. I'd be OK just letting it crash in that case, or just throwing an exception right there.

We're losing a little bit of generality if we do this. The proper fix would be to scan forward from the start token until we find one that's not None, and scan backwards from the end token. But I think the case where that's necessary is quite rare, and possibly not worth the complexity.
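The scanning idea could be sketched as below. This is an illustration under assumptions, not code from AllenNLP: span_to_word_pieces is a hypothetical helper, token spans and offsets are both treated as inclusive, and a span made up only of tokens without word pieces maps to None.

```python
from typing import List, Optional, Tuple

Offset = Optional[Tuple[int, int]]


def span_to_word_pieces(offsets: List[Offset], start: int, end: int) -> Offset:
    """Map an inclusive token span to an inclusive word-piece span.

    Tokens whose offset is None are skipped by scanning forward from the
    start and backward from the end.
    """
    while start <= end and offsets[start] is None:
        start += 1
    while end >= start and offsets[end] is None:
        end -= 1
    if start > end:
        # Every token in the span had zero word pieces.
        return None
    return offsets[start][0], offsets[end][1]


offsets = [(0, 1), (2, 3), None, (4, 4)]
print(span_to_word_pieces(offsets, 2, 3))  # span starts on the empty token
print(span_to_word_pieces(offsets, 1, 2))  # span ends on the empty token
print(span_to_word_pieces(offsets, 2, 2))  # span is only the empty token
```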

Might be relevant: #3779, esp. the part where it was re-opened. Specifically see https://github.com/allenai/allennlp/pull/3808/files#r381036253

#4301 fixes this issue, in the sense that you don't get an exception anymore. But instead it puts a zero vector into your embeddings. If the model isn't trained for that possibility, anything could happen. There is no guarantee it will give good performance.
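The zero-vector behaviour can be pictured with a small plain-Python sketch. It assumes token embeddings are formed by averaging the word-piece vectors in each token's inclusive offset range. pool_token_embeddings is a hypothetical function for illustration, not the code from the pull request.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def pool_token_embeddings(
    piece_vectors: List[Vector],
    offsets: List[Optional[Tuple[int, int]]],
) -> List[Vector]:
    """Average word-piece vectors per token; tokens without pieces get zeros."""
    dim = len(piece_vectors[0])
    pooled: List[Vector] = []
    for offset in offsets:
        if offset is None:
            # No word pieces to average, so fall back to a zero vector.
            pooled.append([0.0] * dim)
            continue
        start, end = offset
        span = piece_vectors[start:end + 1]
        pooled.append([sum(column) / len(span) for column in zip(*span)])
    return pooled


piece_vectors = [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]]
offsets = [(0, 1), None, (2, 2)]
print(pool_token_embeddings(piece_vectors, offsets))
```

A model that never saw an all-zero token embedding during training has no learned behaviour for it, which is the caveat above.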

I found that upgrading to huggingface transformers 3.0.0 facilitates this error

What do you mean by "facilitates"?

Sorry I should have given you more to go off - really big fan of you guys!

I upgraded to transformers==3.0.1 without changing any code and got the same type error. When I reverted to transformers<3.0.0, the same code worked fine.

I just tried running the code from the issue description at the top on the latest master of allennlp and allennlp-models, and transformers==3.0.2. I had no problems. What are you running?

really big fan of you guys!

Thanks! We appreciate it :-)

Actually, if you could put repro steps into the description of a new issue, that would be great. Then we don't have to have a discussion on a closed issue, which might get lost easily.

That's strange. I think transformers introduced a few tokenizer fixes between 3.0.1 and 3.0.2, but it might also be something else. I can try to repro this in the next week or two, as I have some hard deadlines approaching and did not record all the steps I took to fix this error as well as I could have...
