Allennlp: PretrainedTransformerTokenizer doesn't work with seq2seq dataset reader

Created on 17 Nov 2020  路  8Comments  路  Source: allenai/allennlp

Checklist

  • [X] I have verified that the issue exists against the master branch of AllenNLP.
  • [X] I have read the relevant section in the contribution guide on reporting bugs.
  • [X] I have checked the issues list for similar or identical bug reports.
  • [X] I have checked the pull requests list for existing proposed fixes.
  • [X] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • [X] I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • [X] I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
  • [X] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • [X] I have included in the "Environment" section below the output of pip freeze.
  • [X] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

As far as I can tell, PretrainedTransformerTokenizer is not compatible with the seq2seq dataset reader of allennlp-models when it is used as the source_tokenizer. The same error, contained in this try/except block here is triggered in multiple cases.

  1. When allennlp.common.util.START_SYMBOL and allennlp.common.util.END_SYMBOL are not in the pretrained transformers vocabulary. I was able to solve this in the config as follows:
    "dataset_reader": {
        "type": "copynet_seq2seq",
        "target_namespace": "target_tokens",
        "source_tokenizer": {
            "type": "pretrained_transformer",
            "model_name": "distilroberta-base",
            "tokenizer_kwargs": {
                "additional_special_tokens": {
                    "allennlp_start_symbol": "@start@",
                    "allennlp_end_symbol": "@end@",
                },
            }
        },
  1. If PretrainedTransformerTokenizer.add_special_tokens is True (the default) for wordpiece-based tokenizers.
  2. For any BPE-based tokenizer I tried.

The error arises because there are more than two tokens in the list returned by self._source_tokeniser.tokenizer in the try/except block for all cases listed above:

try:
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
        start_symbol + " " + end_symbol
    )
except ValueError:
    raise ValueError(
        f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
        f"for tokenizer {self._source_tokenizer}"
    )


Python traceback:

2020-11-16 16:53:24,760 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 98, in __init__
    self._start_token, self._end_token = self._source_tokenizer.tokenize(
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/johnmg/seq2rel/bin/allennlp", line 33, in <module>
    sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
  File "/project/6006286/johnmg/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/project/6006286/johnmg/allennlp/allennlp/commands/__init__.py", line 118, in main
    args.func(args)
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 110, in train_model_from_args
    train_model_from_file(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 170, in train_model_from_file
    return train_model(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 236, in train_model
    model = _train_worker(
  File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 453, in _train_worker
    train_loop = TrainModel.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 627, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 198, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 305, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 339, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
    return retyped_subclass.from_params(
  File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 629, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 102, in __init__
    raise ValueError(
ValueError: Bad start or end symbol ('@start@', '@end@') for tokenizer <allennlp.data.tokenizers.pretrained_transformer_tokenizer.PretrainedTransformerTokenizer object at 0x2b6503248970>

Related issues or possible duplicates

  • None

Environment


OS: Linux


Python version: 3.8.0


Output of pip freeze:

-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/generic
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
-e git+https://github.com/allenai/allennlp.git@0d8873cfef628eaf0457bee02422bbf8dae475a2#egg=allennlp
-e git+https://github.com/allenai/allennlp-models.git@236034ff54ac3197ec4d710438cebdfa919c5a45#egg=allennlp_models
attrs==20.2.0
blis==0.4.1
boto3==1.16.2
botocore==1.19.2
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.2.1
cymem==2.0.2
filelock==3.0.12
ftfy==5.5.1
future==0.18.2
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
iniconfig==1.0.1
jmespath==0.10.0
joblib==0.17.0
jsonnet==0.14.0
jsonpickle==1.4.1
more-itertools==8.5.0
murmurhash==1.0.2
nltk==3.5
numpy==1.19.1
overrides==3.1.0
packaging==20.4
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
protobuf==3.13.0
py==1.9.0
py-rouge==1.1
pyparsing==2.4.7
pytest==6.0.1
python-dateutil==2.8.1
regex==2019.11.1
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.0
scipy==1.5.2
sentencepiece==0.1.91
six==1.15.0
spacy==2.2.2
srsly==0.2.0
tensorboardX==2.1
thinc==7.3.1
threadpoolctl==2.1.0
tokenizers==0.9.3
toml==0.10.1
torch==1.7.0
tqdm==4.51.0
transformers==3.5.1
typing-extensions==3.7.4.3
urllib3==1.25.10
wasabi==0.6.0
wcwidth==0.2.5
word2number==1.1
zipp==3.4.0

Steps to reproduce


Example source:

The proximate cause of the error can be reproduced as follows:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}

# Case 1, don't add AllenNLPs start/end symbols to vocabulary
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 2, set add_special_tokens=True (the default) in PretrainedTransformerTokenizer for a wordpiece based tokenizer
# this WON'T fail
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

# Case 3, BPE-based tokenizers fail regardless
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)

bug

All 8 comments

What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67

If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].

What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67

At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right? If thats correct, then the ValueError is still raised even if source_add_start_token and source_add_end_token are False (I confirmed this).

If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].

Is there a clean way to change AllenNLPs START_SYMBOL AND END_SYMBOL? i.e. in a config or otherwise? I went looking but couldn't find it.

At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right?

Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.

Is there a clean way to change AllenNLPs START_SYMBOL AND END_SYMBOL? i.e. in a config or otherwise?

Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69

Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.

Gotcha. That is probably the best solution for me.

Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69

Thanks. Not sure how I missed that.


It still seems like there is a bug here. BPE-based tokenizers will throw an error if at least one of source_add_start_token,
source_add_end_token, target_add_start_token, or target_add_end_token is True, even if you took care to add AllenNLPs start and end symbols to its vocabulary, and you disable the tokenizer from adding its own special tokens with add_special_tokens=False:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL

tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# >> [@start@, 臓, @end@]

This is because the try/except block of Seq2SeqDatasetReader will fail as the tokenizer will not return a list of length two (it encodes the space as "臓").

What if the check was more explict. Something like:

tokens = self._source_tokenizer.tokenize(start_symbol + " " + end_symbol)
if tokens[0] != self._start_token or tokens[-1] != self._end_token:
raise ValueError(
  f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
  f"for tokenizer {self._source_tokenizer}"
 )

I think that's a good idea. Want to make a PR? Feel free to tag me as a reviewer. I also just realized the start/end_symbol parameters are missing from the docstring. Would be good to update the docstring as well.

Sounds good, I'll do both!

@epwalsh I took a closer look at the start_symbol and end_symbol arguments in Seq2SeqDatasetReader, and it looks like they are never used (besides in the try/except)? Either this is a bug, or their purpose is not to update the default choices of START_SYMBOL and END_SYMBOL?

Hmm, it's a weird way of doing it, but the result of https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L95-L97 is essentially the same as

self._start_symbol = start_symbol
self._end_symbol = end_symbol
Was this page helpful?
0 / 5 - 0 ratings