AllenNLP: Train a model with transformer embeddings and additional_special_tokens

Created on 1 Oct 2020 · 24 Comments · Source: allenai/allennlp

Checklist

[x] I have verified that the issue exists against the master branch of AllenNLP.
[x] I have read the relevant section in the contribution guide on reporting bugs.
[x] I have checked the issues list for similar or identical bug reports.
[x] I have checked the pull requests list for existing proposed fixes.
[x] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
[x] I have included in the "Description" section below a traceback from any exceptions related to this bug.
[x] I have included in the "Related issues or possible duplicates" section below all related issues and possible duplicate issues (If there are none, check this box anyway).
[x] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
[x] I have included in the "Environment" section below the output of pip freeze.
[x] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

Hi there! I'm trying to train a transformer-based text classifier model in AllenNLP, but I need to add 5 additional special tokens in a way that is compatible with the tokenizers library. I tried adding them to the jsonnet AllenNLP config file and then to the transformer's model path, but neither worked. Each approach failed with a different problem, described below.

Python traceback:

2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Epoch 0/9
2020-09-30 23:56:17,398 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 10065.304
2020-09-30 23:56:17,484 - WARNING - allennlp.common.util - unable to check gpu_memory_mb() due to occasional failure, continuing
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/common/util.py", line 415, in gpu_memory_mb
    encoding="utf-8",
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/subprocess.py", line 1482, in _execute_child
    restore_signals, start_new_session, preexec_fn)
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_monkey.py", line 526, in new_fork_exec
    return getattr(_posixsubprocess, original_name)(args, *patch_fork_exec_executable_list(args, other_args))
OSError: [Errno 12] Cannot allocate memory
2020-09-30 23:56:17,489 - INFO - allennlp.training.trainer - Training
  0%|          | 0/11817 [00:00<?, ?it/s]/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [69,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
...
...
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [102,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:272: indexSelectLargeIndex: block: [102,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  0%|          | 0/11817 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 872, in train
    train_metrics = self._train_epoch(epoch)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 594, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/media/discoD/repositorios/allennlp/allennlp/training/trainer.py", line 479, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/models/basic_classifier.py", line 121, in forward
    embedded_text = self._text_field_embedder(tokens)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/repositorios/allennlp/allennlp/modules/token_embedders/pretrained_transformer_embedder.py", line 184, in forward
    transformer_output = self.transformer_model(**parameters)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_bert.py", line 221, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
python-BaseException
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered

Related issues or possible duplicates

None

Environment

OS: Linux

Python version: 3.7.7

Output of pip freeze:

allennlp==1.1.0
allennlp-models==1.1.0
-e git+git@github.com:allenai/allennlp-server.git@bc56288b9295391051f7b7b042fe34219bfe33ab#egg=allennlp_server
attrs==19.3.0
backcall==0.2.0
bleach==3.1.5
blis==0.4.1
boto3==1.14.31
botocore==1.17.31
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
conllu==4.1
cycler==0.10.0
cymem==2.0.3
cytoolz==0.10.1
decorator==4.4.2
defusedxml==0.6.0
docutils==0.15.2
eland==7.7.0a1
elasticsearch-dsl==7.2.1
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
entrypoints==0.3
filelock==3.0.12
fire==0.3.1
Flask==1.1.2
Flask-Cors==3.0.8
ftfy==5.8
future==0.18.2
gevent==20.6.2
greenlet==0.4.16
h5py==2.10.0
idna==2.10
importlib-metadata==1.7.0
iniconfig==1.0.1
ipykernel==5.3.4
ipython==7.16.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.17.2
jellyfish==0.8.2
Jinja2==2.11.2
jmespath==0.10.0
joblib==0.16.0
jsonnet==0.16.0
jsonpickle==1.4.1
jsonschema==3.2.0
jupyter-client==6.1.6
jupyter-core==4.6.3
Keras==2.4.3
kiwisolver==1.2.0
MarkupSafe==1.1.1
matplotlib==3.3.0
mistune==0.8.4
mkl-fft==1.1.0
mkl-random==1.1.1
mkl-service==2.3.0
more-itertools==8.4.0
murmurhash==1.0.2
nbconvert==5.6.1
nbformat==5.0.7
networkx==2.4
nltk==3.5
notebook==6.0.3
numpy==1.18.5
olefile==0.46
overrides==3.1.0
packaging==20.4
pandas==1.1.0
pandocfilters==1.4.2
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.2.0
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prometheus-client==0.8.0
prompt-toolkit==3.0.5
protobuf==3.12.4
ptyprocess==0.6.0
py==1.9.0
py-rouge==1.1
pydot==1.4.1
pyemd==0.5.1
Pygments==2.6.1
pyparsing==2.4.7
Pyphen==0.9.5
pyrsistent==0.16.0
pytest==6.0.1
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.1
regex==2020.7.14
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.2
seaborn==0.11.0
Send2Trash==1.5.0
sentencepiece==0.1.91
seqeval==0.0.12
six==1.15.0
spacy==2.3.2
srsly==1.0.2
tensorboardX==2.1
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
thinc==7.4.1
threadpoolctl==2.1.0
tokenizers==0.8.1rc1
toml==0.10.1
toolz==0.10.0
torch==1.6.0+cu101
torchvision==0.7.0+cu101
tornado==6.0.4
tqdm==4.48.0
traitlets==4.3.3
transformers==3.0.2
urllib3==1.25.10
visualise-spacy-tree==0.0.6
wasabi==0.7.1
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.5.1
word2number==1.1
zipp==3.1.0
zope.event==4.4
zope.interface==5.1.0

Steps to reproduce

First I tried adding the 5 additional special tokens directly in the jsonnet model config, like this:

    "token_indexers": {
            "tokens": {
                "type": "pretrained_transformer",
                "model_name": transformer_model,
                "max_length": transformer_dim,
                "tokenizer_kwargs": {"additional_special_tokens": [['<REL_SEP>'], ['[['], [']]'], ['<<'], ['>>']], "max_len": transformer_dim}
            }
     },

But I ran into a problem in allennlp.common.cached_transformers.get_tokenizer: cache_key = (model_name, frozenset(kwargs.items())) uses the "tokenizer_kwargs" items as part of a cache key, but the additional_special_tokens value is a list, which is not hashable, so it throws the following exception:

TypeError: unhashable type: 'list'

Traceback (most recent call last):
  File "/media/discoD/pycharm-community-2019.2/plugins/python-ce/helpers/pydev/pydevd.py", line 1465, in _exec
    runpy._run_module_as_main(module_name, alter_argv=False)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/media/discoD/anaconda3/envs/allennlp/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 38, in <module>
    run()
  File "/media/discoD/repositorios/allennlp/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/media/discoD/repositorios/allennlp/allennlp/commands/__init__.py", line 94, in main
    args.func(args)
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 238, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/media/discoD/repositorios/allennlp/allennlp/commands/train.py", line 433, in _train_worker
    local_rank=process_rank,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 626, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 197, in create_kwargs
    cls.__name__, param_name, annotation, param.default, params, **extras
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 306, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 387, in construct_arg
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 340, in construct_arg
    return annotation.from_params(params=popped_params, **subextras)
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 599, in from_params
    **extras,
  File "/media/discoD/repositorios/allennlp/allennlp/common/from_params.py", line 628, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/media/discoD/repositorios/allennlp/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 58, in __init__
    model_name, tokenizer_kwargs=tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 71, in __init__
    model_name, add_special_tokens=False, **tokenizer_kwargs
  File "/media/discoD/repositorios/allennlp/allennlp/common/cached_transformers.py", line 101, in get_tokenizer
    cache_key = (model_name, frozenset(kwargs.items()))
TypeError: unhashable type: 'list'
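
To make the failure concrete, here is a minimal sketch that reproduces the cache-key expression from the traceback in plain Python, then shows one possible way to build a hashable key. The `freeze` and `make_cache_key` helpers are hypothetical and are not AllenNLP's API; the model name and `max_len` value are placeholders.

```python
from typing import Any, Dict, Hashable


def freeze(value: Any) -> Hashable:
    """Recursively convert lists and dicts into hashable tuples/frozensets.

    Hypothetical helper for illustration only; not part of AllenNLP.
    """
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    if isinstance(value, dict):
        return frozenset((k, freeze(v)) for k, v in value.items())
    return value


def make_cache_key(model_name: str, kwargs: Dict[str, Any]) -> Hashable:
    # Same shape as the key in the traceback, but with frozen values.
    return (model_name, freeze(kwargs))


tokenizer_kwargs = {
    "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"],
    "max_len": 512,  # placeholder value
}

# The original expression fails because one of the dict values is a list.
try:
    frozenset(tokenizer_kwargs.items())
    original_failed = False
except TypeError:
    original_failed = True

# The frozen key can be hashed, so it works as a dictionary key.
key = make_cache_key("bert-base-cased", tokenizer_kwargs)
cache = {key: "tokenizer placeholder"}
```

Freezing the values keeps two configs with the same tokens mapped to the same cache entry, which is the behavior a cache key needs.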

I couldn't get this approach to work, so I ended up downloading the BERT model to my local disk and adding the tokenizer config files to the same path (the vocab size of my BERT model is 29794, so the last index is 29793). The contents of the files I changed are in the "Example source" section below.

After debugging, it looks like this config was at least enough to get the BERT tokenizer to recognize the 5 tokens and tokenize the training data accordingly, but then I ran into another issue once training actually began (the one pasted in the "Python traceback" section of this issue).

It looks like this error occurs because the transformer model's embedding layer wasn't resized to the new vocabulary size, which would be accomplished with code like this: model.resize_token_embeddings(len(tokenizer)). I didn't find any code in the AllenNLP library that does something like this, so I think this is the cause of the issue.
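
The device-side assertion `srcIndex < srcSelectDimSize` in the traceback is consistent with this explanation: an embedding lookup received an index beyond the number of rows in the table. A minimal plain-Python sketch of the arithmetic, using the vocab size and token ids given in this issue (the helper name `out_of_range_ids` is made up for illustration):

```python
from typing import List


def out_of_range_ids(token_ids: List[int], embedding_rows: int) -> List[int]:
    """Return the ids that a table with `embedding_rows` rows cannot look up."""
    return [i for i in token_ids if not 0 <= i < embedding_rows]


base_vocab_size = 29794  # valid ids are 0..29793
added_token_ids = [29794, 29795, 29796, 29797, 29798]

# Every added id falls outside the original embedding table.
bad_ids = out_of_range_ids(added_token_ids, base_vocab_size)

# The table needs one row per id, so the required size is max id + 1.
required_rows = max(added_token_ids) + 1
still_bad = out_of_range_ids(added_token_ids, required_rows)
```

With five added tokens the embedding table needs 29799 rows, which is the value `len(tokenizer)` would be expected to report after the tokens are added.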

Is there another way to accomplish this using AllenNLP that I'm not aware of? Looks like both ways to expand the vocab size should be possible.

Example source:

`added_tokens.json`: `{"<REL_SEP>": 29794, "[[": 29795, "]]": 29796, "<<": 29797, ">>": 29798}`

`special_tokens_map.json`: `{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}`

`tokenizer_config.json`: `{"do_lower_case": false, "additional_special_tokens": ["<REL_SEP>", "[[", "]]", "<<", ">>"]}`
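
Rather than editing the three files by hand, they could be generated with a short script. This is only a sketch: it assumes the first token is `<REL_SEP>` (as in the jsonnet config above) and that new ids start right after the base vocab size of 29794. It writes to a temporary directory so it runs anywhere; in practice the target would be the directory holding the downloaded model. The function name is hypothetical.

```python
import json
import tempfile
from pathlib import Path

BASE_VOCAB_SIZE = 29794
NEW_TOKENS = ["<REL_SEP>", "[[", "]]", "<<", ">>"]


def write_tokenizer_files(model_dir: Path) -> None:
    """Write the three tokenizer config files described in this issue."""
    # New ids continue directly after the last pretrained id (29793).
    added = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(NEW_TOKENS)}
    special_map = {
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "additional_special_tokens": NEW_TOKENS,
    }
    config = {"do_lower_case": False, "additional_special_tokens": NEW_TOKENS}
    files = {
        "added_tokens.json": added,
        "special_tokens_map.json": special_map,
        "tokenizer_config.json": config,
    }
    for name, payload in files.items():
        (model_dir / name).write_text(json.dumps(payload), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    write_tokenizer_files(Path(tmp))
    # Read one file back to confirm the id assignment.
    added_tokens = json.loads(
        (Path(tmp) / "added_tokens.json").read_text(encoding="utf-8")
    )
```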

Thanks!

Contributions welcome bug

pvcastro

👍3

All 24 comments

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

github-actions[bot] on 16 Oct 2020

@dirkgr ? 😅

pvcastro on 27 Oct 2020

@pvcastro I've managed to approximately get something working like this by writing a wrapper Model class for the huggingface model that calls the HF function model.resize_token_embeddings(new_vocab_size) at the end of the constructor as well as forcibly creating a custom AllenNLP vocab and a Huggingface Vocab. My use case is Bart so I've modified the bart.py file in allennlp-models. I'm now just verifying if this doesn't remove/break the pretrained weight initialisation. I'm not sure if this is useful for you - but let me know if it helps!

tomsherborne on 27 Oct 2020
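
On the concern about breaking the pretrained weights: the property to verify is that resizing keeps every existing row unchanged and only appends new rows. The sketch below is a plain-Python stand-in for that behavior, not Hugging Face's implementation; `resize_rows`, the 0.02 standard deviation, and the toy table are assumptions for illustration.

```python
import random
from typing import List

Row = List[float]


def resize_rows(table: List[Row], new_size: int, seed: int = 0) -> List[Row]:
    """Return a table with `new_size` rows, keeping existing rows unchanged."""
    dim = len(table[0])
    rng = random.Random(seed)
    # Copy the rows that survive the resize (all of them when growing).
    kept = [list(row) for row in table[:new_size]]
    # Append freshly initialized rows for the newly added token ids.
    extra = [
        [rng.gauss(0.0, 0.02) for _ in range(dim)]
        for _ in range(new_size - len(kept))
    ]
    return kept + extra


pretrained = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
resized = resize_rows(pretrained, new_size=5)

# The check to run after any resize: the old rows must match exactly.
weights_preserved = resized[: len(pretrained)] == pretrained
```

The same comparison can be applied to a real model by saving a copy of the embedding weights before the resize and comparing the leading rows afterwards.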

hi @tomsherborne !
So you had to hardcode tokenizer_kwargs directly in .py to pass additional_special_tokens?

pvcastro on 27 Oct 2020

@pvcastro yes you could do that. i found it more stable/manageable to manually create an instance of the relevant HF tokenizer, use add_tokens to manually add each extra token i needed and save this to disk. Then in the ANLP config file I reference a local path under the "model_name" argument for tokenizer/indexer. I also converted this HF vocabulary into an ANLP vocabulary and reference this in the config as:

    "vocabulary": {
        "type": "from_files", 
        "directory": "./path/to/allennlp/version/of/extended/vocab",
        "oov_token": "<unk>",
        "padding_token": "<pad>"
    }

tomsherborne on 27 Oct 2020

I see. I guess I'll do the same then, until this is supported by the library.
Thanks! @tomsherborne !

pvcastro on 27 Oct 2020

@pvcastro just FYI, Dirk is currently on vacation and won't be back for another week.

@tomsherborne it seems like it would be useful to have a from_pretrained_transformer vocab type? Then you could just give it the model name or path:

"vocabulary": {
    "type": "from_pretrained_transformer",
    "model_name": "bert-base-cased"
}

This feature request has come up elsewhere as well (see here, for example).

epwalsh on 27 Oct 2020

@epwalsh i may have some aspects of the vocabulary API wrong here but I understood that a specific type was unnecessary right now because the PretrainedTransformerIndexer will force a copy of the pretrained vocabulary into the AllenNLP vocab namespace here. Would you also need a way of forcing the correct OOV/Pad tokens from the pretrained version when they aren't the default AllenNLP strings?

tomsherborne on 27 Oct 2020

@tomsherborne thanks for pointing that out, I actually wasn't aware of that.

Would you also need a way of forcing the correct OOV/Pad tokens from the pretrained version when they aren't the default AllenNLP strings?

Hmm do you need that for your use case? Right now the Vocabulary object just has a single padding_token field, but we would really need to be able to specify a different padding_token for each namespace in the Vocabulary. I think that's very doable. More generally, we could have optional namespace-specific settings which could override the default. I'm just wondering if there's a need for that.

epwalsh on 27 Oct 2020
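
The idea of optional namespace-specific settings that override a default could look roughly like the sketch below. `NamespaceSettings` is a hypothetical class, not part of AllenNLP's `Vocabulary`; the default strings are assumed to match AllenNLP's default padding and OOV tokens, and the namespace names are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NamespaceSettings:
    """Global defaults plus optional per-namespace overrides (hypothetical)."""

    padding_token: str = "@@PADDING@@"
    oov_token: str = "@@UNKNOWN@@"
    overrides: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def padding_for(self, namespace: str) -> str:
        # Fall back to the global default when the namespace has no override.
        return self.overrides.get(namespace, {}).get(
            "padding_token", self.padding_token
        )

    def oov_for(self, namespace: str) -> str:
        return self.overrides.get(namespace, {}).get("oov_token", self.oov_token)


settings = NamespaceSettings(
    overrides={"transformer_tokens": {"padding_token": "<pad>", "oov_token": "<unk>"}}
)
transformer_pad = settings.padding_for("transformer_tokens")
labels_pad = settings.padding_for("labels")
```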

Hi,
I just want to add my interest in seeing this issue resolved :)

I have the exact same issue as originally stated: trying to add special tokens to the T5 pretrained tokenizer with this extra argument in my tokenizer config:

"tokenizer_kwargs": { "additional_special_tokens": ["##START##", "##END##", ...] }

but that fails because the list type is unhashable.

However, I am not sure I understand the trick of creating a Model wrapper, a new Vocab file, etc... as suggested by @tomsherborne

I guess another solution could be to:
(1) manually add the special tokens when creating the data reader:

special_tokens_dict = {'additional_special_tokens': ['##START##', '##END##', ...]}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
logger.info(f'We have added {num_added_toks} tokens')

and then (2) when creating the model we resize the BasicTextFieldEmbedder.token_embedder dimension like this:

text_embedder.t5.resize_token_embeddings(len(vocab))

this is because I have the following config for my model in which I use the pretrained t5 encoder as an embedder:

"source_text_embedder": {
    "type": "basic",
    "token_embedders": {
        "t5": { "type": "pretrained_transformer", "model_name": model_name, .... }
    }
}

Would that work? will the vocab passed to the model constructor have the additional tokens....?
Anyway, I'll try it out and see how I can work around this issue, but I'm also really looking forward to this being properly supported by the ANLP library :)

NicolasAG on 6 Nov 2020

Hi @NicolasAG . I ended up doing something similar. In my custom reader class I did this in init:

        self._tokenizer = PretrainedTransformerTokenizer(
            model_name=transformer_model_name,
            tokenizer_kwargs=tokenizer_kwargs,
        )
        self._token_indexers = {
            "tokens": PretrainedTransformerIndexer(
                transformer_model_name, tokenizer_kwargs=tokenizer_kwargs
            )
        }
        special_tokens_dict = {'additional_special_tokens': ['<REL_SEP>']}
        self._tokenizer.tokenizer.add_special_tokens(special_tokens_dict)
        self._token_indexers["tokens"]._allennlp_tokenizer.tokenizer.add_special_tokens(special_tokens_dict)

In the model I did this:

        default_vocab_size = self._embedder.token_embedder_tokens.config.vocab_size
        self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings(default_vocab_size + 1)

pvcastro on 6 Nov 2020

👍2

Thanks for sharing! I'll try it out :)

NicolasAG on 6 Nov 2020

Can you see if #4781 fixes your original issue?

dirkgr on 11 Nov 2020

Hi @dirkgr , thanks. I'll give it a try and get back to you.

pvcastro on 11 Nov 2020

@dirkgr I think it would still be necessary to do something like was done here

self._embedder.token_embedder_tokens.transformer_model.resize_token_embeddings

otherwise we'll end up with other errors due to the transformer model having a different dimension, not adjusted to the new vocabulary

pvcastro on 11 Nov 2020

I did confirm that #4781 fixes the parsing of additional tokens from config, but, as I stated in the previous comment, it's still necessary to resize the embeddings layer to accommodate the new vocab.

pvcastro on 11 Nov 2020

@dirkgr would a PR which handles the special case of extending a pretrained HuggingFace model embedder be useful? The AllenNLP model class already does this (here I think) but this doesn't handle the HuggingFace model case since the HF embeddings don't have the "extend_vocab" attribute. Could this function call pretrained_model.resize_token_embeddings(new_vocab_size) as a special case?

tomsherborne on 11 Nov 2020
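
The special case proposed here amounts to a dispatch on which method a module exposes. The sketch below uses two stub classes as stand-ins so it runs without AllenNLP or transformers installed; their signatures are simplified assumptions, and `extend_all` is a hypothetical name rather than the actual method on the AllenNLP model class.

```python
from typing import Iterable, List


class NativeEmbedder:
    """Stand-in for a module that exposes `extend_vocab` (simplified)."""

    def __init__(self, size: int) -> None:
        self.size = size

    def extend_vocab(self, new_size: int) -> None:
        self.size = max(self.size, new_size)


class HuggingFaceLikeModel:
    """Stand-in for a model that only exposes `resize_token_embeddings`."""

    def __init__(self, size: int) -> None:
        self.size = size

    def resize_token_embeddings(self, new_num_tokens: int) -> None:
        self.size = new_num_tokens


def extend_all(modules: Iterable[object], new_vocab_size: int) -> List[str]:
    """Extend each module with whichever method it supports."""
    handled = []
    for module in modules:
        if hasattr(module, "extend_vocab"):
            module.extend_vocab(new_vocab_size)
            handled.append("extend_vocab")
        elif hasattr(module, "resize_token_embeddings"):
            # Proposed special case for Hugging Face models.
            module.resize_token_embeddings(new_vocab_size)
            handled.append("resize_token_embeddings")
    return handled


modules = [NativeEmbedder(10), HuggingFaceLikeModel(29794)]
calls = extend_all(modules, new_vocab_size=29799)
```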

@tomsherborne, yes, that would be a welcome addition. I'd be happy to review that PR. I think that's not enough by itself, since we still have to call those new methods then, but it would be easy to do so.

dirkgr on 12 Nov 2020

This is not closed yet. #4781 only goes part of the way.

dirkgr on 12 Nov 2020

@dirkgr just to let you know that I will have time for this after EMNLP this week. To get ahead of things - where should I be writing tests for an addition to the Model class?

tomsherborne on 16 Nov 2020

I think the best place is here: https://github.com/allenai/allennlp/blob/master/tests/models/model_test.py

There is already one test in there about extending vocabs!

dirkgr on 25 Nov 2020

@dirkgr this is just a friendly ping to make sure you haven't forgotten about this issue 😜

github-actions[bot] on 10 Dec 2020

@tomsherborne, did you get anywhere with this?

dirkgr on 10 Dec 2020

@dirkgr Yes apologies - it's in the works and I'll open a PR when it's done.

tomsherborne on 11 Dec 2020
