Allennlp: PretrainedTransformerTokenizer doesn't work with seq2seq dataset reader

Created on 17 Nov 2020 · 8Comments · Source: allenai/allennlp

Checklist

[X] I have verified that the issue exists against the master branch of AllenNLP.
[X] I have read the relevant section in the contribution guide on reporting bugs.
[X] I have checked the issues list for similar or identical bug reports.
[X] I have checked the pull requests list for existing proposed fixes.
[X] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
[X] I have included in the "Description" section below a traceback from any exceptions related to this bug.
[X] I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
[X] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
[X] I have included in the "Environment" section below the output of pip freeze.
[X] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

As far as I can tell, PretrainedTransformerTokenizer is not compatible with the seq2seq dataset reader of allennlp-models when it is used as the source_tokenizer. The same error, contained in this try/except block here is triggered in multiple cases.

When allennlp.common.util.START_SYMBOL and allennlp.common.util.END_SYMBOL are not in the pretrained transformers vocabulary. I was able to solve this in the config as follows:

    "dataset_reader": {
        "type": "copynet_seq2seq",
        "target_namespace": "target_tokens",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "distilroberta-base",
            "tokenizer_kwargs": {
                "additional_special_tokens": {
                    "allennlp_start_symbol": "@start@",
                    "allennlp_end_symbol": "@end@",
                },
            }
        },

If PretrainedTransformerTokenizer.add_special_tokens is True (the default) for wordpiece-based tokenizers.
For any BPE-based tokenizer I tried.

The error arises because there are more than two tokens in the list returned by self._source_tokeniser.tokenizer in the try/except block for all cases listed above:

try:
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
        start_symbol + " " + end_symbol
    )
except ValueError:
    raise ValueError(
        f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
        f"for tokenizer {self._source_tokenizer}"
    )

Python traceback:

2020-11-16 16:53:24,760 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 98, in __init__
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/johnmg/seq2rel/bin/allennlp", line 33, in <module>
    sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
  File "/project/6006286/johnmg/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/project/6006286/johnmg/allennlp/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 110, in train_model_from_args
    train_model_from_file(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 170, in train_model_from_file
    return train_model(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 236, in train_model
    model = _train_worker(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 453, in _train_worker
    train_loop = TrainModel.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 627, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 198, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 305, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 339, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 629, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 102, in __init__
    raise ValueError(
ValueError: Bad start or end symbol ('@start@', '@end@') for tokenizer <allennlp.data.tokenizers.pretrained_transformer_tokenizer.PretrainedTransformerTokenizer object at 0x2b6503248970>

Related issues or possible duplicates

None

Environment

OS: Linux

Python version: 3.8.0

Output of pip freeze:

-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/generic
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
-e git+https://github.com/allenai/allennlp.git@0d8873cfef628eaf0457bee02422bbf8dae475a2#egg=allennlp
-e git+https://github.com/allenai/allennlp-models.git@236034ff54ac3197ec4d710438cebdfa919c5a45#egg=allennlp_models
attrs==20.2.0
blis==0.4.1
boto3==1.16.2
botocore==1.19.2
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.2.1
cymem==2.0.2
filelock==3.0.12
ftfy==5.5.1
future==0.18.2
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
iniconfig==1.0.1
jmespath==0.10.0
joblib==0.17.0
jsonnet==0.14.0
jsonpickle==1.4.1
more-itertools==8.5.0
murmurhash==1.0.2
nltk==3.5
numpy==1.19.1
overrides==3.1.0
packaging==20.4
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
protobuf==3.13.0
py==1.9.0
py-rouge==1.1
pyparsing==2.4.7
pytest==6.0.1
python-dateutil==2.8.1
regex==2019.11.1
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.0
scipy==1.5.2
sentencepiece==0.1.91
six==1.15.0
spacy==2.2.2
srsly==0.2.0
tensorboardX==2.1
thinc==7.3.1
threadpoolctl==2.1.0
tokenizers==0.9.3
toml==0.10.1
torch==1.7.0
tqdm==4.51.0
transformers==3.5.1
typing-extensions==3.7.4.3
urllib3==1.25.10
wasabi==0.6.0
wcwidth==0.2.5
word2number==1.1
zipp==3.4.0

Steps to reproduce

Example source:

The proximate cause of the error can be reproduced as follows:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}

# Case 1, don't add AllenNLPs start/end symbols to vocabulary
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 2, set add_special_tokens=True (the default) in PretrainedTransformerTokenizer for a wordpiece based tokenizer
# this WON'T fail
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 3, BPE-based tokenizers fail regardless
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

bug

Source

JohnGiorgi

All 8 comments

What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67

If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].

epwalsh on 17 Nov 2020

What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67

At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right? If thats correct, then the ValueError is still raised even if source_add_start_token and source_add_end_token are False (I confirmed this).

If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].

Is there a clean way to change AllenNLPs START_SYMBOL AND END_SYMBOL? i.e. in a config or otherwise? I went looking but couldn't find it.

JohnGiorgi on 17 Nov 2020

At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right?

Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.

Is there a clean way to change AllenNLPs START_SYMBOL AND END_SYMBOL? i.e. in a config or otherwise?

Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69

epwalsh on 17 Nov 2020

Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.

Gotcha. That is probably the best solution for me.

Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69

Thanks. Not sure how I missed that.

It still seems like there is a bug here. BPE-based tokenizers will throw an error if at least one of source_add_start_token,
source_add_end_token, target_add_start_token, or target_add_end_token is True, even if you took care to add AllenNLPs start and end symbols to its vocabulary, and you disable the tokenizer from adding its own special tokens with add_special_tokens=False:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# >> [@start@, Ġ, @end@]

This is because the try/except block of Seq2SeqDatasetReader will fail as the tokenizer will not return a list of length two (it encodes the space as "Ġ").

What if the check was more explict. Something like:

tokens = self._source_tokenizer.tokenize(start_symbol + " " + end_symbol)
if tokens[0] != self._start_token or tokens[-1] != self._end_token:
raise ValueError(
  f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
  f"for tokenizer {self._source_tokenizer}"
 )

JohnGiorgi on 17 Nov 2020

I think that's a good idea. Want to make a PR? Feel free to tag me as a reviewer. I also just realized the start/end_symbol parameters are missing from the docstring. Would be good to update the docstring as well.

epwalsh on 17 Nov 2020

Sounds good, I'll do both!

JohnGiorgi on 17 Nov 2020

🎉1

@epwalsh I took a closer look at the start_symbol and end_symbol arguments in Seq2SeqDatasetReader, and it looks like they are never used (besides in the try/except)? Either this is a bug, or their purpose is not to update the default choices of START_SYMBOL and END_SYMBOL?

JohnGiorgi on 17 Nov 2020

Hmm, it's a weird way of doing it, but the result of https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L95-L97 is essentially the same as

self._start_symbol = start_symbol
self._end_symbol = end_symbol

epwalsh on 17 Nov 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

How to continue training the model after it is out of patience?

MeiqiGuo · 4Comments

ModuleNotFoundError

silencemaker · 4Comments

Metrics for Multi-label Classification Task

shounakpaul95 · 4Comments

Configuration error on coreference resolution while using model coref-bert-lstm-2020.02.12.

lighteternal · 4Comments

token_character_encoder dies with a sentence with short tokens

masashi-y · 4Comments