I am trying to train a Swedish model with tagging, parsing and embedded word vectors for similarity scoring. Training data comes from https://github.com/UniversalDependencies/UD_Swedish-Talbanken and word vectors are trained using gensim.
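For anyone reproducing this: the vectors file passed to `init-model` is a plain word2vec-style text file. A minimal sketch of that format (the tokens and values below are made up; in practice gensim's `KeyedVectors.save_word2vec_format(..., binary=False)` writes this file for you):

```python
# Sketch of the word2vec text format that `spacy init-model -v` accepts:
# first line is "<num_vectors> <dimensions>", then one token per line
# followed by its vector values, space-separated.
vectors = {
    "hus": [0.1, 0.2, 0.3],
    "bil": [0.4, 0.5, 0.6],
}

dim = len(next(iter(vectors.values())))
with open("word2vec.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(vectors)} {dim}\n")
    for token, values in vectors.items():
        f.write(token + " " + " ".join(f"{v:.6f}" for v in values) + "\n")
```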
Training a small model with pretraining but without embedded vectors works from the command line. Likewise, a large model with embedded vectors but without pretraining also trains fine. The problem arises when trying to train a large model with both embedded vectors and pretraining.
As I understand it "spacy pretrain" uses the command --use-vectors argument is used if you want the word model to include features from the word vectors.
However, although pretrain works fine with the --use-vectors command the "spacy train" command fails.
python -m spacy init-model sv ./init_models/word2vec -v ./vectors/word2vec.txt
python -m spacy pretrain './corpus/sents_webbnyheter2013.jsonl' './init_models/word2vec' './pretrained/ud_w2v' --use-vectors -i 100
python -m spacy train sv models_temp ./corpus/ud_swedish_talbanken_json_sent10/sv_talbanken-ud-train.json ./corpus/ud_swedish_talbanken_json_sent10/sv_talbanken-ud-dev.json -p 'tagger,parser' -t2v ./pretrained/ud_w2v/model99.bin -g 0 -n 55
Training pipeline: ['tagger', 'parser']
Starting with blank model 'sv'
Counting training words (limit=0)
Loaded pretrained tok2vec for: ['tagger', 'parser']
Itn Tag Loss Tag % Dep Loss UAS LAS Token % CPU WPS GPU WPS
✔ Saved model to output directory
models_temp/model-final
⠙ Creating best model...
Traceback (most recent call last):
File "/home/gustav/anaconda3/lib/python3.7/site-packages/spacy/cli/train.py", line 365, in train
losses=losses,
File "/home/gustav/anaconda3/lib/python3.7/site-packages/spacy/language.py", line 516, in update
proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
File "nn_parser.pyx", line 424, in spacy.syntax.nn_parser.Parser.update
File "_parser_model.pyx", line 214, in spacy.syntax._parser_model.ParserModel.begin_update
File "_parser_model.pyx", line 262, in spacy.syntax._parser_model.ParserStepModel.__init__
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/api.py", line 295, in begin_update
X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 60, in begin_update
X, backprop_child = self.child.begin_update(X, drop=0.0)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 76, in begin_update
output__boc = self.ops.gemm(X__bi, W, trans2=True)
File "ops.pyx", line 860, in thinc.neural.ops.CupyOps.gemm
File "/home/gustav/anaconda3/lib/python3.7/site-packages/cupy/linalg/product.py", line 35, in dot
return a.dot(b, out)
File "cupy/core/core.pyx", line 1306, in cupy.core.core.ndarray.dot
File "cupy/core/core.pyx", line 1940, in cupy.core.core.dot
ValueError: Axis dimension mismatch
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gustav/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/gustav/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/spacy/__main__.py", line 35, in
plac.call(commands[command], sys.argv[1:])
File "/home/gustav/anaconda3/lib/python3.7/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/spacy/cli/train.py", line 486, in train
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
File "/home/gustav/anaconda3/lib/python3.7/site-packages/spacy/cli/train.py", line 554, in _collate_best_model
path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'
Hi @gustavengstrom , this functionality is still a bit experimental so you could definitely be hitting a non-happy path that hasn't been tested before. I hope we can find a way to fix it though!
So just to summarize: you're storing the pretrained models in ./pretrained/ud_w2v, taking the last one (99) and using that for training the tagger and the parser, right?
Are you sure all settings are the same between pretraining and training?
Is it an option for you to update spaCy to the latest version (2.2.1) and check whether the error is still there? Please note though that upgrading will also require updating/retraining your models, cf. https://github.com/explosion/spaCy/releases/tag/v2.2.0 .
What happens in the code is that the train command reads the file from pretraining and uses it as the weights for the tok2vec layer of both the tagger and the parser. From your error message (only the first traceback block is relevant), it looks like the dimensions of the layers in the parser's neural net are not matching up.
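Schematically (this is not spaCy's actual code, just an illustration of the failure mode, with made-up widths): the maxout layer in the traceback does a matrix multiply between its input and its weights, and the inner dimensions must agree. If the pretrained tok2vec was built to expect a wider input (because `--use-vectors` added vector features) than the tok2vec the training run constructs, the multiply fails with exactly this kind of error:

```python
import numpy as np

# Hypothetical widths: the tok2vec output the parser produces at train
# time vs. the input width the pretrained maxout weights expect.
tok2vec_output = np.zeros((10, 96))   # training run built without vector features
W_pretrained = np.zeros((128, 300))   # pretrained with --use-vectors: wider input

try:
    tok2vec_output.dot(W_pretrained.T)  # inner dims 96 vs 300 do not align
except ValueError as e:
    print("dimension mismatch:", e)
```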
Thanks! I tried with the updated version as well. The settings should be the same, since I am not specifying any model-specific attributes in either pretrain or train. I agree that it looks like a dimension mismatch... I will try to post a minimal working example ASAP so that the error can be reproduced...
That would be very helpful!
@svlandeg I have created a git repository that reproduces the error. Should be reproducible if you simply clone the repository and run the commands in the readme file. This seems like a bug to me... I have reproduced the error both on my ubuntu server and my macbook.
@svlandeg Have you had a chance to look at this problem? Alternatively is there a way to feed in the word vectors post training?
@gustavengstrom : apologies for the late follow-up.
Looking at your commands, I noticed you specify --use-vectors during pretrain, so you'll also have to pass -v to train. Otherwise the train script will not accommodate the pretrained vectors in its Tok2Vec component, and there will indeed be a dimension mismatch when loading your pretrained model from file. See also my comment here.
Thanks! This worked. For reference, the final train command was amended as follows:
python -m spacy train sv models_temp sv_talbanken-ud-train.json sv_talbanken-ud-dev.json -p 'tagger,parser' -t2v ./pretrained/model9.bin -v init_models -n 10