Allennlp: "TypeError: not a sequence" on simple coreference resolution

Created on 25 May 2020 · 20 Comments · Source: allenai/allennlp

Describe the bug

I get a TypeError: not a sequence when trying to predict this simple string: "Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States and as a major global university."

Full stacktrace:

Traceback (most recent call last):
  File "/home/arthur/question-generation/models/sg_dqg.py", line 156, in <module>
    preprocess(args.ds)
  File "/home/arthur/question-generation/models/sg_dqg.py", line 124, in preprocess
    coreferences = coreference_resolution(evidences_list)
  File "/home/arthur/question-generation/models/sg_dqg.py", line 74, in coreference_resolution
    document="Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States  and as a major global university."
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp_models/coref/coref_predictor.py", line 65, in predict
    return self.predict_json({"document": document})
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 48, in predict_json
    return self.predict_instance(instance)
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 171, in predict_instance
    outputs = self._model.forward_on_instance(instance)
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/models/model.py", line 142, in forward_on_instance
    return self.forward_on_instances([instance])[0]
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/models/model.py", line 167, in forward_on_instances
    model_input = util.move_to_device(dataset.as_tensor_dict(), cuda_device)
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/batch.py", line 139, in as_tensor_dict
    for field, tensors in instance.as_tensor_dict(lengths_to_use).items():
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/instance.py", line 99, in as_tensor_dict
    tensors[field_name] = field.as_tensor(padding_lengths[field_name])
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/fields/text_field.py", line 103, in as_tensor
    self._indexed_tokens[indexer_name], indexer_lengths[indexer_name]
  File "/home/arthur/anaconda3/envs/tensorflow_env/lib/python3.6/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 96, in as_padded_tensor_dict
    offsets_tokens, offsets_padding_lengths, default_value=lambda: (0, 0)
TypeError: not a sequence

To Reproduce
Run this simple piece of code:

from allennlp.predictors.predictor import Predictor

# Load the pretrained SpanBERT coreference model.
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz")
# Note the two consecutive spaces after "United States" in the document.
predictor.predict(
    document="Besides its prominence in sports, Notre Dame is also a large, four-year, highly residential research University, and is consistently ranked among the top twenty universities in the United States  and as a major global university."
)

Expected behavior
The function should return the proper resolved coreferences.

System (please complete the following information):

OS: Linux
Python version: 3.7.7
AllenNLP version: 1.0.0rc4
PyTorch version: 1.5.0

bug

arthurdeschamps

All 20 comments

Hi @arthurdeschamps, I did some debugging and found that the root of the issue is that there are some None values in the offsets_tokens list from your stacktrace. torch fails when trying to turn this list into a tensor because it doesn't know how to handle the None values. These None values come from this line:

https://github.com/allenai/allennlp/blob/master/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py#L360

@dirkgr any ideas how we should fix this?

epwalsh on 26 May 2020

This is something that has to be handled in the model (or in the original tokenizer).

Recall: _intra_word_tokenize takes an existing tokenization of a string, and cuts the tokens into word pieces suitable for a transformer. What should happen when the existing tokenization produces tokens that have zero word pieces? I can't answer that in general. It depends on your case.

If your answer is "That should never happen.", then look at the cases where it happens anyways and find out why. Maybe the original tokenizer produces tokens that are nothing but spaces? Zero-length tokens? But sometimes there is a legitimate reason for this, and you need to handle it somehow downstream.

dirkgr on 26 May 2020

@dirkgr in any case, I think we need to improve our error message here. Seems to me like _intra_word_tokenize should raise an exception when a token results in zero word pieces?
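A minimal sketch of the kind of check suggested here, in plain Python. The helper name `check_offsets` and the wording of the error are hypothetical and not part of AllenNLP. It only assumes that offsets arrive as a list with one `(start, end)` pair per token, with `None` for a token that produced no word pieces.

```python
from typing import List, Optional, Sequence, Tuple

Offset = Optional[Tuple[int, int]]


def check_offsets(tokens: Sequence[str], offsets: Sequence[Offset]) -> List[Tuple[int, int]]:
    """Return the offsets unchanged, or raise if any token has no word pieces.

    Hypothetical helper for illustration; not an AllenNLP function.
    """
    if len(tokens) != len(offsets):
        raise ValueError("tokens and offsets must have the same length")
    # Collect the positions of tokens that produced zero word pieces.
    missing = [i for i, offset in enumerate(offsets) if offset is None]
    if missing:
        details = ", ".join(f"{i}: {tokens[i]!r}" for i in missing)
        raise ValueError(
            f"These tokens produced no word pieces (index: token): {details}"
        )
    return [offset for offset in offsets if offset is not None]


# Example: the third token is a lone space and has no word pieces.
example_tokens = ["United", "States", " ", "and"]
example_offsets = [(1, 1), (2, 2), None, (3, 3)]
try:
    check_offsets(example_tokens, example_offsets)
except ValueError as err:
    print(err)
```

An error like this names the offending token, which is easier to act on than the "not a sequence" message raised later during tensor conversion.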

epwalsh on 26 May 2020

@arthurdeschamps in this case you have extra whitespace in your document:

...he United States  and as...
                   ^^

Nevertheless, the error message should be improved here. See https://github.com/allenai/allennlp/pull/4291
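As a workaround on the caller's side, the repeated whitespace can be collapsed before the text reaches the predictor. This is only a sketch: `normalize_whitespace` is a hypothetical helper name, and it assumes that collapsing every run of whitespace (including newlines and tabs) into a single space is acceptable for your documents.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, so joining with " " removes doubled spaces.
    return " ".join(text.split())


snippet = "in the United States  and as a major global university."
cleaned = normalize_whitespace(snippet)
print(cleaned)
```

The cleaned string can then be passed as `document=` to `predictor.predict`. Keep in mind that character positions in the cleaned text no longer line up with the original text, which matters if you map results back to the raw input.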

epwalsh on 26 May 2020

Seems to me like _intra_word_tokenize should raise an exception when a token results in zero word pieces?

Sometimes this is a case you want to handle specifically. If we throw an exception, you could never deal with that case.

dirkgr on 26 May 2020

Sometimes this is a case you want to handle specifically. If we throw an exception, you could never deal with that case.

My issue with not throwing an exception is that there are several places that rely on this function returning all non-Nonetype offsets. This issue brought to light one of these places.

mypy should really have caught this before in our CI checks because this is a typing error on our part.

That said, the other option would be to raise an exception downstream in all of the places that use these offsets when None is encountered. I didn't go that route with #4291 because I couldn't find a single example where a None offset was actually handled.

epwalsh on 27 May 2020

What are all of those places that expect it?

dirkgr on 27 May 2020

What are all of those places that expect it?

epwalsh on 27 May 2020

Can we fix it by adding a fixed embedding to the mismatched embedder for these cases, which gets substituted when there is nothing else to add?

I'm looking at the other case right now.

dirkgr on 27 May 2020

For the coref case, it would be pretty messed up if the mention starts or ends in a token that has no word pieces. I'd be OK just letting it crash in that case, or just throwing an exception right there.

We're losing a little bit of generality if we do this. The proper fix would be to scan forward from the start token until we find one that's not None, and scan backwards from the end token. But I think the case where that's necessary is quite rare, and possibly not worth the complexity.
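The scan described above could look roughly like the following. This is a sketch under stated assumptions, not code from AllenNLP: `span_to_wordpieces` is a hypothetical name, token spans and word-piece offsets are both assumed to be inclusive, and a token without word pieces is assumed to have the offset `None`.

```python
from typing import Optional, Sequence, Tuple

Offset = Optional[Tuple[int, int]]


def span_to_wordpieces(
    offsets: Sequence[Offset], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Map an inclusive token span to an inclusive word-piece span.

    Tokens whose offset is None are skipped by scanning forward from
    `start` and backward from `end`. Returns None when no token in the
    span has any word pieces.
    """
    # Scan forward from the start token until one has word pieces.
    while start <= end and offsets[start] is None:
        start += 1
    # Scan backward from the end token until one has word pieces.
    while end >= start and offsets[end] is None:
        end -= 1
    if start > end:
        return None
    first, last = offsets[start], offsets[end]
    assert first is not None and last is not None  # guaranteed by the loops
    return first[0], last[1]


# Token 1 and token 3 have no word pieces, so the span (1, 3) shrinks to token 2.
example_offsets = [(1, 1), None, (2, 3), None]
print(span_to_wordpieces(example_offsets, 1, 3))
```

Returning `None` for a span made up entirely of empty tokens leaves the decision (skip the mention or raise) to the caller.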

dirkgr on 27 May 2020

Might be relevant: #3779, esp. the part where it was re-opened. Specifically see https://github.com/allenai/allennlp/pull/3808/files#r381036253

ZhaofengWu on 27 May 2020

https://github.com/allenai/allennlp/blob/8ff47d34a5368fb85f27816aefa5739910a9e3e4/tests/data/tokenizers/pretrained_transformer_tokenizer_test.py#L225 was supposed to test this behavior.

ZhaofengWu on 27 May 2020

I proposed a fix at https://github.com/allenai/allennlp/pull/4301.

dirkgr on 29 May 2020

#4301 fixes this issue in the sense that you no longer get an exception. Instead, it puts a zero vector into your embeddings. If the model isn't trained for that possibility, anything could happen, and there is no guarantee it will give good performance.
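To make the zero-vector behaviour concrete, here is a small illustration in plain Python lists. It is not the code from the pull request: `pool_wordpieces` is a hypothetical name, and the sketch assumes that each token vector is the average of its word-piece vectors and that offsets are inclusive `(start, end)` pairs, with `None` for a token that has no word pieces.

```python
from typing import List, Optional, Sequence, Tuple

Offset = Optional[Tuple[int, int]]


def pool_wordpieces(
    wordpiece_vectors: Sequence[Sequence[float]], offsets: Sequence[Offset]
) -> List[List[float]]:
    """Average the word-piece vectors of each token.

    A token with offset None has no word pieces and gets a zero vector.
    """
    dim = len(wordpiece_vectors[0])
    pooled = []
    for offset in offsets:
        if offset is None:
            # Nothing to average, so fall back to an all-zero vector.
            pooled.append([0.0] * dim)
            continue
        start, end = offset
        pieces = wordpiece_vectors[start : end + 1]
        # zip(*pieces) walks the vectors dimension by dimension.
        pooled.append([sum(column) / len(pieces) for column in zip(*pieces)])
    return pooled


vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
token_offsets = [(0, 1), None, (2, 2)]
print(pool_wordpieces(vectors, token_offsets))
```

The middle token ends up with a vector the model may never have seen during training, which is why the output quality is not guaranteed.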

dirkgr on 3 Jun 2020

I found that upgrading to huggingface transformers 3.0.0 facilitates this error

chrisdoyleIE on 10 Jul 2020

What do you mean by "facilitates"?

dirkgr on 10 Jul 2020

Sorry I should have given you more to go off - really big fan of you guys!

I upgraded to transformers==3.0.1 without changing any code and got the same TypeError. When I reverted to transformers<3.0.0, the same code worked fine.

chrisdoyleIE on 10 Jul 2020

I just tried running the code from the issue description at the top on the latest master of allennlp and allennlp-models, and transformers==3.0.2. I had no problems. What are you running?

really big fan of you guys!

Thanks! We appreciate it :-)

dirkgr on 13 Jul 2020

Actually, if you could put repro steps into the description of a new issue, that would be great. Then we don't have to have a discussion on a closed issue, which might get lost easily.

dirkgr on 13 Jul 2020

That's strange. I think transformers introduced a few tokenizer fixes between 3.0.1 and 3.0.2, but it might also be something else. I can try to repro this in the next week or two, as I have some hard deadlines approaching and did not record all the steps I took to fix this error as well as I could have...

chrisdoyleIE on 14 Jul 2020
