master branch of AllenNLP.pip freeze.As far as I can tell, PretrainedTransformerTokenizer is not compatible with the seq2seq dataset reader of allennlp-models when it is used as the source_tokenizer. The same error, contained in this try/except block here is triggered in multiple cases.
allennlp.common.util.START_SYMBOL and allennlp.common.util.END_SYMBOL are not in the pretrained transformers vocabulary. I was able to solve this in the config as follows: "dataset_reader": {
"type": "copynet_seq2seq",
"target_namespace": "target_tokens",
"source_tokenizer": {
"type": "pretrained_transformer",
"model_name": "distilroberta-base",
"tokenizer_kwargs": {
"additional_special_tokens": {
"allennlp_start_symbol": "@start@",
"allennlp_end_symbol": "@end@",
},
}
},
PretrainedTransformerTokenizer.add_special_tokens is True (the default) for wordpiece-based tokenizers.The error arises because there are more than two tokens in the list returned by self._source_tokeniser.tokenizer in the try/except block for all cases listed above:
try:
self._start_token, self._end_token = self._source_tokenizer.tokenize(
start_symbol + " " + end_symbol
)
except ValueError:
raise ValueError(
f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
f"for tokenizer {self._source_tokenizer}"
)
Python traceback:
2020-11-16 16:53:24,760 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 98, in __init__
self._start_token, self._end_token = self._source_tokenizer.tokenize(
ValueError: too many values to unpack (expected 2)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/johnmg/seq2rel/bin/allennlp", line 33, in <module>
sys.exit(load_entry_point('allennlp', 'console_scripts', 'allennlp')())
File "/project/6006286/johnmg/allennlp/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/project/6006286/johnmg/allennlp/allennlp/commands/__init__.py", line 118, in main
args.func(args)
File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 110, in train_model_from_args
train_model_from_file(
File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 170, in train_model_from_file
return train_model(
File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 236, in train_model
model = _train_worker(
File "/project/6006286/johnmg/allennlp/allennlp/commands/train.py", line 453, in _train_worker
train_loop = TrainModel.from_params(
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
return retyped_subclass.from_params(
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 627, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 198, in create_kwargs
constructed_arg = pop_and_construct_arg(
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 305, in pop_and_construct_arg
return construct_arg(class_name, name, popped_params, annotation, default, **extras)
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 339, in construct_arg
return annotation.from_params(params=popped_params, **subextras)
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 595, in from_params
return retyped_subclass.from_params(
File "/project/6006286/johnmg/allennlp/allennlp/common/from_params.py", line 629, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/project/6006286/johnmg/allennlp-models/allennlp_models/generation/dataset_readers/seq2seq.py", line 102, in __init__
raise ValueError(
ValueError: Bad start or end symbol ('@start@', '@end@') for tokenizer <allennlp.data.tokenizers.pretrained_transformer_tokenizer.PretrainedTransformerTokenizer object at 0x2b6503248970>
OS: Linux
Python version: 3.8.0
Output of
pip freeze:
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/avx2
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/nix/generic
-f /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/generic
-e git+https://github.com/allenai/allennlp.git@0d8873cfef628eaf0457bee02422bbf8dae475a2#egg=allennlp
-e git+https://github.com/allenai/allennlp-models.git@236034ff54ac3197ec4d710438cebdfa919c5a45#egg=allennlp_models
attrs==20.2.0
blis==0.4.1
boto3==1.16.2
botocore==1.19.2
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.2.1
cymem==2.0.2
filelock==3.0.12
ftfy==5.5.1
future==0.18.2
h5py==2.10.0
idna==2.10
importlib-metadata==2.0.0
iniconfig==1.0.1
jmespath==0.10.0
joblib==0.17.0
jsonnet==0.14.0
jsonpickle==1.4.1
more-itertools==8.5.0
murmurhash==1.0.2
nltk==3.5
numpy==1.19.1
overrides==3.1.0
packaging==20.4
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
protobuf==3.13.0
py==1.9.0
py-rouge==1.1
pyparsing==2.4.7
pytest==6.0.1
python-dateutil==2.8.1
regex==2019.11.1
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.0
scipy==1.5.2
sentencepiece==0.1.91
six==1.15.0
spacy==2.2.2
srsly==0.2.0
tensorboardX==2.1
thinc==7.3.1
threadpoolctl==2.1.0
tokenizers==0.9.3
toml==0.10.1
torch==1.7.0
tqdm==4.51.0
transformers==3.5.1
typing-extensions==3.7.4.3
urllib3==1.25.10
wasabi==0.6.0
wcwidth==0.2.5
word2number==1.1
zipp==3.4.0
Example source:
The proximate cause of the error can be reproduced as follows:
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL
tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
# Case 1, don't add AllenNLPs start/end symbols to vocabulary
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# Case 2, set add_special_tokens=True (the default) in PretrainedTransformerTokenizer for a wordpiece based tokenizer
# this WON'T fail
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# Case 3, BPE-based tokenizers fail regardless
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# this WILL fail
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=True)
start_token, end_token = tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67
If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].
What if you just disable adding start or end symbols? https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L64-L67
At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right? If thats correct, then the ValueError is still raised even if source_add_start_token and source_add_end_token are False (I confirmed this).
If that doesn't work for your use case, you could also change the start and end symbols to something that you know is in your transformer's vocab, like [SEP] or [CLS].
Is there a clean way to change AllenNLPs START_SYMBOL AND END_SYMBOL? i.e. in a config or otherwise? I went looking but couldn't find it.
At the very least, you would need target_add_end_token to be True so that the decoder can stop before it hits the max_decoding_length, right?
Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.
Is there a clean way to change AllenNLPs START_SYMBOL AND END_SYMBOL? i.e. in a config or otherwise?
Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69
Not necessarily. If your tokenizer adds its own end symbol, then the decoder can just go off of that.
Gotcha. That is probably the best solution for me.
Yes, just set start_symbol and end_symbol in your dataset reader: https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L68-L69
Thanks. Not sure how I missed that.
It still seems like there is a bug here. BPE-based tokenizers will throw an error if at least one of source_add_start_token,
source_add_end_token, target_add_start_token, or target_add_end_token is True, even if you took care to add AllenNLPs start and end symbols to its vocabulary, and you disable the tokenizer from adding its own special tokens with add_special_tokens=False:
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.common.util import START_SYMBOL, END_SYMBOL
tokenizer_kwargs = {"additional_special_tokens": [START_SYMBOL, END_SYMBOL]}
tokenizer = PretrainedTransformerTokenizer("distilroberta-base", tokenizer_kwargs=tokenizer_kwargs, add_special_tokens=False)
tokenizer.tokenize(START_SYMBOL + " " + END_SYMBOL)
# >> [@start@, 臓, @end@]
This is because the try/except block of Seq2SeqDatasetReader will fail as the tokenizer will not return a list of length two (it encodes the space as "臓").
What if the check was more explict. Something like:
tokens = self._source_tokenizer.tokenize(start_symbol + " " + end_symbol)
if tokens[0] != self._start_token or tokens[-1] != self._end_token:
raise ValueError(
f"Bad start or end symbol ('{start_symbol}', '{end_symbol}') "
f"for tokenizer {self._source_tokenizer}"
)
I think that's a good idea. Want to make a PR? Feel free to tag me as a reviewer. I also just realized the start/end_symbol parameters are missing from the docstring. Would be good to update the docstring as well.
Sounds good, I'll do both!
@epwalsh I took a closer look at the start_symbol and end_symbol arguments in Seq2SeqDatasetReader, and it looks like they are never used (besides in the try/except)? Either this is a bug, or their purpose is not to update the default choices of START_SYMBOL and END_SYMBOL?
Hmm, it's a weird way of doing it, but the result of https://github.com/allenai/allennlp-models/blob/236034ff54ac3197ec4d710438cebdfa919c5a45/allennlp_models/generation/dataset_readers/seq2seq.py#L95-L97 is essentially the same as
self._start_symbol = start_symbol
self._end_symbol = end_symbol