All I want for Christmas is a core nb model for spacy. And we're getting close!
Last week the Language Technology Group at the University of Oslo released NER annotations on top of the Norwegian Dependency Treebank. The nb tag map was made for it, thanks to @katarkor. So there's now both UD/POS and NER data.
TODO:
- Pass -n 10 to spacy convert and remove --gold-preproc from spacy train.

@jarib I managed to publish an (unpolished) repo with the model I trained for Nudge (Tagbox.ai):
https://github.com/ohenrik/nb_news_ud_sm
I will hopefully get to clean this up a bit more.
Also regarding: "How come sentence segmentation is not working?":
This is because sentence segmentation needs to be activated for a new model. You can make this automatic by adding "sbd" to the pipeline list in the model's meta.json. See example here.
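If you'd rather wire this up in code than edit meta.json, here's a minimal sketch (assuming spaCy v2's built-in sentencizer factory; the model path is hypothetical):

```python
import spacy

# Hypothetical path to the trained Norwegian model directory
nlp = spacy.load("data/nb_news_ud_sm")

# Add a rule-based sentence boundary detector so doc.sents works
# even when the dependency parse isn't used for segmentation.
sbd = nlp.create_pipe("sentencizer")
nlp.add_pipe(sbd, first=True)

doc = nlp("Dette er en setning. Dette er en annen setning.")
print([sent.text for sent in doc.sents])
```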
@jarib Also, the model I link to above does not appear to diverge after X iterations. So I think the dataset used to train the NER (+ DEP) model did not have this problem.
Ah, great! I'll have a look.
Was the NER/DEP dataset that didn't diverge based on an earlier version of https://github.com/ltgoslo/norne/, or something else?
I don't know if the dataset has changed since I got access to it; I got it by email earlier this summer, so the repo might have newer data. However, I had to combine the train, dev and test data, shuffle it, and then split it into train, dev and test again. This was to avoid the training getting stuck and producing weird results.
Ah, I see. The training doesn't get stuck with the latest version, but the NER F score is ~79. I might be doing something else wrong though. I'll publish the steps I've taken soon.
Also, it seems that existing models (e.g. en_core_web_sm) are able to do sentence segmentation without sbd in the pipeline, apparently based on the dependency parse… I guess the same should be possible for Norwegian?
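A quick way to check this once the parser is trained (a sketch; the model path is hypothetical):

```python
import spacy

# Load the trained Norwegian model (hypothetical path) and check whether
# the dependency parser alone provides sentence boundaries.
nlp = spacy.load("data/nb_model")
doc = nlp("Dette er en setning. Dette er en annen setning.")

print(doc.is_parsed)             # True if the parser ran
for sent in doc.sents:           # expect two sentences if segmentation works
    print(sent.text)
```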
@jarib Thanks for your work on this! We're still on holidays, but some quick answers in the meantime:
> Do I need to retrain word vectors?
Maybe not. If you run the tokenizer, have a look at which words end up without a vector. Also check word vector entries which never occur in your data, after tokenizing a bunch of text. For instance if you were doing this for English and you found you didn't have a vector for "n't", and you had a vectors-table entry for "can't", you'd know you have problems. Also, check that the word vectors are case sensitive. Case sensitive vectors generally work better with spaCy.
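A rough way to check coverage and case sensitivity (a sketch; `sample.txt` and the model path are hypothetical):

```python
import spacy
from collections import Counter

# Hypothetical model path and raw-text sample file
nlp = spacy.load("data/nb_model")

missing = Counter()   # tokens in the data that have no vector
seen = set()          # all token types observed after tokenization
with open("sample.txt", encoding="utf8") as f:
    for doc in nlp.pipe(f):
        for token in doc:
            seen.add(token.text)
            if not token.has_vector:
                missing[token.text] += 1

print("Most frequent tokens without a vector:", missing.most_common(20))

# Vector entries that never occur in the tokenized text may point to a
# tokenization mismatch between the vectors and spaCy's tokenizer.
vector_words = {nlp.vocab.strings[key] for key in nlp.vocab.vectors}
print("Unused vector entries:", len(vector_words - seen))

# Case sensitivity: ideally both of these have vectors.
print(nlp.vocab["oslo"].has_vector, nlp.vocab["Oslo"].has_vector)
```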
Finally, and perhaps most importantly, check that the vectors have a compatible license. We'd like the vectors to be CC-BY, MIT or BSD licensed. If they're CC-BY-SA, CC-BY-NC or GPL licensed, they'll restrict how we can license the final model.
> Tagger/NER diverges
This seems to be resolved? About the accuracy plateauing, well, it has to top out somewhere right? You can try fiddling with the hyper-parameters maybe, to see if you can get better performance. Another trick is to run cross-fold training, and record the accuracy of each sentence in the training data across the different folds. Have a look at sentences in the training data which are consistently predicted poorly. Sometimes you'll find these sentences are just...bad, and accuracy might go up if these bad sentences are removed.
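A sketch of that cross-fold idea, where `sentences`, `train_model` and `score_sentence` are hypothetical placeholders for your training data and your own train/score routines:

```python
from collections import defaultdict
from sklearn.model_selection import KFold

per_sentence_scores = defaultdict(list)

for seed in range(3):  # repeat cross-validation with different splits
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, dev_idx in kfold.split(sentences):
        # train_model / score_sentence are hypothetical helpers: train on one
        # split, then record per-sentence accuracy on the held-out fold.
        model = train_model([sentences[i] for i in train_idx])
        for i in dev_idx:
            per_sentence_scores[i].append(score_sentence(model, sentences[i]))

# Sentences that are consistently predicted poorly across folds
worst = sorted(per_sentence_scores,
               key=lambda i: sum(per_sentence_scores[i]) / len(per_sentence_scores[i]))
for i in worst[:20]:
    print(per_sentence_scores[i], sentences[i])
```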
> Sentence boundary detection
Make sure you're setting the -n flag when you're converting the data, so that you have some documents with multiple sentences in them. This allows the parser to learn to divide the sentences. If you only have one sentence per document in the training data, the parser never learns to use the Break transition, and so accuracy on multi-sentence documents is very low.
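One way to sanity-check the converted training data (a sketch, assuming spaCy v2's JSON training format; the file path is hypothetical):

```python
import json

# Count sentences per paragraph in the converted training file; if -n/--n-sents
# was passed to spacy convert, many paragraphs should contain several sentences.
with open("data/norne-spacy/ud/nob/no-ud-train-ner.json", encoding="utf8") as f:
    docs = json.load(f)

sent_counts = [len(par["sentences"]) for doc in docs for par in doc["paragraphs"]]
print("max sentences per doc:", max(sent_counts))
print("docs with more than one sentence:", sum(c > 1 for c in sent_counts))
```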
@honnibal Thanks for the suggestions!
The problem with the divergence / tag accuracy dropping appears to be gone after I switched to spacy 2.1.0a4.
I have some code running right now that goes through all the Norwegian Bokmål word embedding models from the NLPL repository to see which one gives the best results. I haven't found much license information for those, but I'll look into it.
The results look promising to my eyes. Here's the output from one of the large (>2GB) models. I've also put up the scripts I've used.

```
=========
Model 0
=========
Path: data/vectors-all/11-100
Vectors
-------
Algorithm: Gensim Continuous Skipgram
Corpus : Norsk Aviskorpus + NoWaC + NBDigital (lemmatized=False, case preserved=True, tokens=3028545953)
Vectors : dimensions=100, window=5, iterations=5, vocab size=4480046
Training
--------
UAS NER P. NER R. NER F. Tag % Token %
86.532 86.734 87.253 86.993 94.617 100.000
90.355 89.247 89.408 89.327 96.446 100.000
89.914 89.653 89.707 89.680 96.361 100.000
90.173 89.928 89.767 89.847 96.427 100.000
90.593 88.889 89.048 88.969 96.641 100.000
90.581 88.611 88.929 88.769 96.646 100.000
89.569 89.467 89.467 89.467 96.210 100.000
89.274 89.408 89.408 89.408 96.076 100.000
90.267 88.949 89.108 89.028 96.564 100.000
88.150 88.590 88.749 88.670 95.462 100.000
90.543 89.121 89.228 89.175 96.638 100.000
90.593 88.989 88.989 88.989 96.616 100.000
90.638 88.439 88.809 88.623 96.624 100.000
88.635 89.348 89.348 89.348 95.870 100.000
90.618 89.241 89.348 89.294 96.613 100.000
Best
----
Path: data/vectors-all/11-100/training/model-best
Size: 2282 MB
UAS NER P. NER R. NER F. Tag % Token %
90.618 89.241 89.348 89.294 96.613 100.000
Evaluate
--------
Time 3.72 s
Words 30034
Words/s 8079
TOK 100.00
POS 96.01
UAS 90.48
LAS 88.16
NER P 85.24
NER R 86.73
NER F 85.98
```

The vectors are licensed CC-BY. For attribution, these publications can be cited:
Stadsnes, Øvrelid, & Velldal (2018)
http://ojs.bibsys.no/index.php/NIK/article/view/490/
Fares, Kutuzov, Oepen, & Velldal (2017)
http://www.ep.liu.se/ecp/article.asp?issue=131&article=037
Numbers look great! I wonder exactly which change made the difference... Possibly just the different hyper-parameters, especially the narrower widths (which decrease overfitting).
It sounds like there's no barrier to adding this. I just have to add the data files to our corpora image.
Sounds good.
I'll have the same output from training all the 61 Norwegian Bokmål vector models in the NLPL repository soon. Should be finished in a day or two. That should make it easier to decide which ones make sense to use for spacy models.
Yay, super excited about this. Once the nb model is added, I'll link this thread in the master thread in #3056 as a great example of end-to-end community-driven model development 💖
> Write more tests for Norwegian
For the upcoming v2.1.x, we've moved all model-related tests out of spaCy's regular test suite and over to spacy-models/tests. This makes it easier to run them independently, e.g. as part of our automated model training process.
Edit: The latest commit totally gives the wrong impression 😅

The test suite includes a bunch of general "sanity checks" that all models should pass – so in case you haven't run those yet, it'd definitely be nice to check that there are no deeper issues.
And of course it'd also be cool to have some very basic Norwegian tests for the individual components and vocab (e.g. lexical attributes, see here for an example). Writing tests for the statistical components can be a bit tricky, because predictions can change and it doesn't necessarily mean that the model is worse if it performs worse on some arbitrary test case.
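For example, a couple of tokenizer sanity checks in the style of spaCy's language tests might look like this (a sketch; the `nb_tokenizer` fixture is assumed to exist, as in spaCy's own test suite):

```python
import pytest

@pytest.mark.parametrize("text,length", [("Dette er en setning.", 5)])
def test_nb_tokenizer_splits_final_punct(nb_tokenizer, text, length):
    tokens = nb_tokenizer(text)
    assert len(tokens) == length
    assert tokens[-1].is_punct

@pytest.mark.parametrize("text", ["10", "10.000"])
def test_nb_like_num(nb_tokenizer, text):
    # Numbers, including the Norwegian thousands separator, should be like_num
    assert nb_tokenizer(text)[0].like_num
```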
(Slightly OT, but I've been thinking about adding a test helper that lets you assert that at least X% of results are correct. So we could use this to make sure that the accuracy of all POS tags or per-token BILUO entity labels in a longer text doesn't fall below a certain threshold.)
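Something like this, perhaps (a sketch of the proposed helper; all names are hypothetical):

```python
def assert_min_accuracy(predicted, gold, min_acc=0.9):
    """Assert that at least min_acc of the predictions match the gold labels."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    accuracy = correct / len(gold)
    assert accuracy >= min_acc, "accuracy %.2f below threshold %.2f" % (accuracy, min_acc)

# Usage idea: compare predicted coarse POS tags on a longer text
# against a hand-checked gold list.
# assert_min_accuracy([t.pos_ for t in nlp(text)], gold_pos, min_acc=0.95)
```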
Here is the result of spacy evaluate on the test split from NORNE and the 59 Norwegian Bokmål vector models from the NLPL repository (scroll right for scores):
| Corpus | Lemmatized | Algorithm | Dimensions | Window | Vocab size | Model size | TOK | POS | UAS | LAS | NER P | NER R | NER F |
| ------ | ---------- | --------- | ---------- | ------ | ---------- | ---------- | --- | --- | --- | --- | ----- | ----- | ----- |
| Norsk Aviskorpus + NoWaC + NBDigital | False | Gensim Continuous Skipgram | 100 | 5 | 4480046 | 2282MB | 100.0 | 96.01 | 90.48 | 88.16 | 85.24 | 86.73 | 85.98 |
| Norsk Aviskorpus | False | Gensim Continuous Skipgram | 100 | 5 | 1728100 | 890MB | 100.0 | 95.83 | 90.43 | 88.08 | 84.94 | 86.3 | 85.61 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 100 | 5 | 2551820 | 1307MB | 100.0 | 95.79 | 90.4 | 88.14 | 85.88 | 86.44 | 86.16 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | Gensim Continuous Bag-of-Words | 100 | 5 | 4480046 | 2282MB | 100.0 | 95.76 | 90.15 | 87.75 | 85.2 | 86.01 | 85.6 |
| Norsk Aviskorpus + NoWaC | False | Gensim Continuous Bag-of-Words | 100 | 5 | 2551819 | 1307MB | 100.0 | 95.74 | 90.44 | 88.09 | 85.01 | 84.77 | 84.89 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 50 | 5 | 2551820 | 820MB | 100.0 | 95.75 | 90.21 | 87.81 | 84.7 | 85.57 | 85.13 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | fastText Skipgram | 100 | 5 | 4428648 | 2256MB | 100.0 | 95.97 | 90.26 | 87.85 | 85.26 | 86.88 | 86.06 |
| Norsk Aviskorpus + NoWaC | False | Gensim Continuous Skipgram | 100 | 5 | 2551819 | 1307MB | 100.0 | 96.0 | 90.14 | 87.82 | 85.63 | 86.44 | 86.04 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 300 | 5 | 1487994 | 1904MB | 100.0 | 95.39 | 90.08 | 87.51 | 83.17 | 82.51 | 82.84 |
| Norsk Aviskorpus + NoWaC | False | fastText Continuous Bag-of-Words | 100 | 5 | 2551820 | 1307MB | 100.0 | 95.8 | 90.26 | 88.05 | 84.79 | 84.91 | 84.85 |
| Norsk Aviskorpus | False | Gensim Continuous Bag-of-Words | 100 | 5 | 1728100 | 890MB | 100.0 | 95.73 | 90.43 | 88.07 | 85.32 | 84.69 | 85.0 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | Gensim Continuous Skipgram | 100 | 5 | 4031460 | 2055MB | 100.0 | 95.41 | 89.82 | 87.36 | 84.47 | 84.84 | 84.65 |
| Norsk Aviskorpus | False | fastText Skipgram | 100 | 5 | 1728101 | 890MB | 100.0 | 95.78 | 90.23 | 87.8 | 84.33 | 85.13 | 84.73 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | fastText Skipgram | 100 | 5 | 3998140 | 2038MB | 100.0 | 95.37 | 89.75 | 87.26 | 83.3 | 85.06 | 84.17 |
| Norsk Aviskorpus + NoWaC | True | Gensim Continuous Skipgram | 100 | 5 | 2239664 | 1149MB | 100.0 | 95.37 | 90.1 | 87.64 | 83.5 | 84.11 | 83.81 |
| Norsk Aviskorpus | True | fastText Skipgram | 600 | 5 | 1487995 | 3607MB | 100.0 | 95.44 | 89.69 | 87.22 | 83.31 | 84.77 | 84.03 |
| Norsk Aviskorpus | True | fastText Skipgram | 300 | 5 | 1487995 | 1904MB | 100.0 | 95.36 | 89.7 | 87.29 | 83.79 | 85.13 | 84.45 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | Gensim Continuous Bag-of-Words | 100 | 5 | 4031460 | 2055MB | 100.0 | 95.32 | 89.68 | 87.14 | 81.98 | 82.58 | 82.28 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 300 | 5 | 2551820 | 3254MB | 100.0 | 96.11 | 90.31 | 87.93 | 84.9 | 86.08 | 85.49 |
| Norsk Aviskorpus | False | fastText Continuous Bag-of-Words | 100 | 5 | 1728101 | 890MB | 100.0 | 95.85 | 90.34 | 87.99 | 84.11 | 84.48 | 84.29 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | fastText Continuous Bag-of-Words | 100 | 5 | 4428648 | 2256MB | 100.0 | 95.96 | 90.26 | 87.96 | 83.12 | 83.97 | 83.54 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 600 | 5 | 1487994 | 3607MB | 100.0 | 95.44 | 89.54 | 86.75 | 84.26 | 83.89 | 84.08 |
| Norsk Aviskorpus + NoWaC | True | Gensim Continuous Bag-of-Words | 100 | 5 | 2239664 | 1149MB | 100.0 | 95.34 | 89.78 | 87.23 | 83.69 | 83.02 | 83.35 |
| Norsk Aviskorpus | True | Gensim Continuous Skipgram | 100 | 5 | 1487994 | 768MB | 100.0 | 95.31 | 89.78 | 87.25 | 84.45 | 84.33 | 84.39 |
| NoWaC | True | Gensim Continuous Bag-of-Words | 100 | 5 | 1199274 | 619MB | 100.0 | 95.27 | 89.61 | 86.99 | 79.36 | 79.88 | 79.62 |
| Norsk Aviskorpus + NoWaC | True | fastText Continuous Bag-of-Words | 100 | 5 | 2239665 | 1149MB | 100.0 | 95.21 | 89.7 | 87.19 | 83.88 | 82.65 | 83.26 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 600 | 5 | 2551820 | 6175MB | 100.0 | 96.24 | 90.48 | 88.16 | 84.73 | 85.35 | 85.04 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 100 | 5 | 1487994 | 768MB | 100.0 | 95.2 | 89.78 | 87.2 | 81.53 | 80.76 | 81.14 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | Global Vectors | 100 | 15 | 4480047 | 2282MB | 100.0 | 95.64 | 90.06 | 87.6 | 84.39 | 84.33 | 84.36 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | fastText Continuous Bag-of-Words | 100 | 5 | 3998140 | 2038MB | 100.0 | 95.23 | 89.71 | 87.27 | 82.47 | 81.92 | 82.19 |
| Norsk Aviskorpus | True | fastText Skipgram | 50 | 5 | 1487995 | 485MB | 100.0 | 95.32 | 89.6 | 87.05 | 81.69 | 82.29 | 81.99 |
| NoWaC | False | fastText Skipgram | 100 | 5 | 1356633 | 699MB | 100.0 | 95.79 | 90.22 | 87.76 | 84.18 | 85.35 | 84.76 |
| Norsk Aviskorpus | True | fastText Skipgram | 100 | 5 | 1487995 | 768MB | 100.0 | 95.32 | 89.84 | 87.37 | 81.8 | 82.87 | 82.33 |
| Norsk Aviskorpus | True | fastText Continuous Bag-of-Words | 100 | 5 | 1487995 | 768MB | 100.0 | 95.12 | 89.7 | 87.25 | 81.61 | 82.14 | 81.87 |
| NoWaC | False | fastText Continuous Bag-of-Words | 100 | 5 | 1356633 | 699MB | 100.0 | 95.73 | 89.86 | 87.43 | 82.16 | 83.24 | 82.69 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 50 | 5 | 1487994 | 485MB | 100.0 | 95.13 | 89.53 | 86.96 | 81.3 | 81.12 | 81.21 |
| NoWaC | False | Gensim Continuous Skipgram | 100 | 5 | 1356632 | 699MB | 100.0 | 95.79 | 89.84 | 87.45 | 83.24 | 84.33 | 83.78 |
| NoWaC | False | Gensim Continuous Bag-of-Words | 100 | 5 | 1356632 | 699MB | 100.0 | 95.79 | 90.29 | 88.0 | 80.01 | 80.54 | 80.28 |
| Norsk Aviskorpus + NoWaC | False | Global Vectors | 100 | 15 | 2551820 | 1307MB | 100.0 | 95.56 | 89.67 | 87.21 | 83.19 | 84.77 | 83.97 |
| Norsk Aviskorpus + NoWaC | True | fastText Skipgram | 100 | 5 | 2239665 | 1149MB | 100.0 | 95.55 | 89.74 | 87.17 | 84.38 | 85.42 | 84.9 |
| NBDigital | False | fastText Skipgram | 100 | 5 | 2390584 | 1221MB | 100.0 | 95.63 | 89.79 | 87.37 | 79.94 | 81.63 | 80.78 |
| Norsk Aviskorpus | False | Global Vectors | 100 | 15 | 1728101 | 890MB | 100.0 | 95.5 | 89.76 | 87.26 | 83.47 | 84.26 | 83.86 |
| NoWaC | True | fastText Skipgram | 100 | 5 | 1199275 | 619MB | 100.0 | 95.3 | 89.6 | 87.05 | 80.73 | 81.85 | 81.29 |
| Norsk Aviskorpus + NoWaC | True | Global Vectors | 100 | 15 | 2239665 | 1149MB | 100.0 | 95.09 | 89.18 | 86.56 | 80.56 | 81.56 | 81.06 |
| NBDigital | False | Gensim Continuous Skipgram | 100 | 5 | 2390583 | 1221MB | 100.0 | 95.7 | 89.91 | 87.55 | 80.39 | 81.56 | 80.97 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | Global Vectors | 100 | 15 | 4031461 | 2055MB | 100.0 | 95.06 | 89.19 | 86.55 | 81.86 | 82.22 | 82.04 |
| Norsk Aviskorpus | True | Global Vectors | 100 | 15 | 1487995 | 768MB | 100.0 | 95.08 | 89.27 | 86.68 | 81.94 | 82.65 | 82.29 |
| NoWaC | True | Gensim Continuous Skipgram | 100 | 5 | 1199274 | 619MB | 100.0 | 95.37 | 89.64 | 87.18 | 82.3 | 83.38 | 82.84 |
| NBDigital | False | fastText Continuous Bag-of-Words | 100 | 5 | 2390584 | 1221MB | 100.0 | 95.39 | 89.75 | 87.21 | 77.67 | 79.37 | 78.51 |
| NoWaC | False | Global Vectors | 100 | 15 | 1356633 | 699MB | 100.0 | 95.43 | 89.51 | 86.98 | 81.4 | 82.94 | 82.17 |
| NBDigital | False | Gensim Continuous Bag-of-Words | 100 | 5 | 2390583 | 1221MB | 100.0 | 95.57 | 89.71 | 87.35 | 78.22 | 78.79 | 78.5 |
| NBDigital | True | Gensim Continuous Skipgram | 100 | 5 | 2187702 | 1119MB | 100.0 | 95.36 | 89.65 | 87.04 | 78.49 | 79.01 | 78.75 |
| NoWaC | True | Global Vectors | 100 | 15 | 1199275 | 619MB | 100.0 | 94.89 | 88.91 | 86.22 | 79.3 | 79.59 | 79.45 |
| NBDigital | True | fastText Skipgram | 100 | 5 | 2187703 | 1119MB | 100.0 | 95.36 | 89.31 | 86.82 | 79.08 | 79.88 | 79.48 |
| NoWaC | True | fastText Continuous Bag-of-Words | 100 | 5 | 1199275 | 619MB | 100.0 | 95.43 | 89.76 | 87.22 | 80.99 | 82.0 | 81.49 |
| NBDigital | False | Global Vectors | 100 | 15 | 2390584 | 1221MB | 100.0 | 95.41 | 89.38 | 86.91 | 77.7 | 79.23 | 78.46 |
| NBDigital | True | fastText Continuous Bag-of-Words | 100 | 5 | 2187703 | 1119MB | 100.0 | 95.04 | 89.64 | 87.15 | 79.1 | 79.45 | 79.27 |
| NBDigital | True | Global Vectors | 100 | 15 | 2187703 | 1119MB | 100.0 | 95.0 | 89.31 | 86.56 | 78.08 | 77.62 | 77.85 |
| NBDigital | True | Gensim Continuous Bag-of-Words | 100 | 5 | 2187702 | 1119MB | 100.0 | 95.1 | 89.26 | 86.67 | 77.35 | 77.92 | 77.63 |
Full output: https://gist.github.com/jarib/f0da63fbe338ae3dac0559032cc2e1fd
Output as JSON: https://gist.github.com/jarib/6048712165290a13179c5cd47157f1bd
I'm not sure what accuracy/size tradeoffs make the most sense for spaCy. I also haven't tried to do any pruning of the vectors. And this was trained using data converted before I was aware of the --n-sents option to spacy convert.
Thanks for running all this! Just to be clear, did you retrain with the new vectors? You can't really compare the accuracy by swapping in the vectors at runtime, because then all you're really measuring is how similar each set of vectors is to the ones the model was trained with.
I think it would also be very useful to train an sm model to check how much improvement the vectors really make. Generally using --prune-vectors 20000 (this is the setting we've been using for md models) gives a pretty good balance of accuracy and size.
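For reference, the pruning can also be done on an existing pipeline via the vocab (a sketch, assuming spaCy v2's Vocab.prune_vectors; the paths are hypothetical):

```python
import spacy

# Load the lg-style model trained with the full vectors table (hypothetical path)
nlp = spacy.load("data/nb-lg/training/model-best")

# Keep only the 20,000 most frequent vectors; the remaining words are remapped
# to their nearest surviving vector.
remapped = nlp.vocab.prune_vectors(20000)
print(len(remapped), "entries remapped")

nlp.to_disk("data/nb-md")  # save the smaller, md-style package
```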
I did retrain with the new vectors. I don't know spacy's internals well enough to dare swapping anything out :)
I'll train one model without vectors (= sm?) and one with pruning, and post the results here.
Okay, great!
Here are the results without vectors. The size is 15MB on disk, uncompressed / unpackaged.

```
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
--- ---------- ---------- ------- ------- ------- ------- ------- ------- ------- -------
0 179886.936 8866.197 82.367 72.354 69.539 70.919 91.280 100.000 9177 9234
1 132272.499 4886.268 84.810 76.308 74.207 75.243 93.018 100.000 4203 4394
2 116446.641 3512.800 85.919 78.156 75.583 76.848 93.767 100.000 8864 8857
3 106976.626 2606.215 86.395 78.886 77.139 78.003 94.099 100.000 9001 8975
4 99785.032 2114.693 86.876 80.331 78.456 79.382 94.436 100.000 8807 8927
5 94071.978 1743.858 87.448 79.648 78.456 79.047 94.606 100.000 8910 8902
6 88821.250 1449.010 87.793 80.024 78.396 79.202 94.850 100.000 8907 8886
7 84949.003 1302.102 87.984 80.049 78.276 79.153 95.009 100.000 8939 8957
8 81243.756 1196.975 88.084 79.376 77.618 78.487 95.045 100.000 8974 8900
9 78191.758 986.522 88.165 78.545 77.558 78.049 95.124 100.000 9015 8962
10 75810.611 923.147 88.266 78.235 77.439 77.835 95.149 100.000 8969 8930
11 73314.405 795.656 88.314 78.100 77.259 77.677 95.237 100.000 8797 8833
12 70981.429 729.826 88.528 78.463 77.618 78.039 95.251 100.000 8981 8951
13 68370.635 664.590 88.682 79.162 77.977 78.565 95.286 100.000 9059 9006
14 66972.303 661.691 88.624 79.101 77.917 78.505 95.294 100.000 8872 8867
15 65480.245 614.919 88.660 79.358 78.456 78.905 95.371 100.000 8894 8804
16 63636.544 538.836 88.827 79.613 78.755 79.182 95.385 100.000 8745 8727
17 61809.648 510.238 88.980 79.649 78.695 79.169 95.355 100.000 8676 8743
18 60794.229 498.642 88.974 79.406 78.456 78.928 95.404 100.000 8777 8787
19 58516.986 474.259 89.156 79.030 78.037 78.531 95.448 100.000 8838 8928
20 57729.559 465.683 89.090 78.434 77.917 78.175 95.437 100.000 8697 8777
21 56607.341 434.805 89.166 78.627 78.157 78.391 95.451 100.000 8877 8983
22 55707.805 408.971 89.309 79.275 78.516 78.894 95.492 100.000 8817 8750
23 54313.256 397.411 89.309 79.024 78.456 78.739 95.484 100.000 8745 8929
24 53104.471 393.764 89.245 78.688 78.217 78.451 95.478 100.000 8789 8771
25 52802.164 361.336 89.292 79.287 78.576 78.930 95.475 100.000 8866 8775
26 51530.342 384.859 89.315 79.577 78.815 79.194 95.475 100.000 8810 8697
27 50466.753 293.543 89.293 79.217 78.695 78.955 95.473 100.000 8866 8920
28 50368.632 346.166 89.270 79.207 78.875 79.040 95.492 100.000 8910 8949
29 49231.304 353.097 89.331 78.790 78.695 78.743 95.536 100.000 8825 8796
Time 3.40 s
Words 29847
Words/s 8781
TOK 100.00
POS 94.66
UAS 88.97
LAS 86.30
NER P 71.64
NER R 70.54
NER F 71.08
```

Seems like the vectors improve the accuracy quite a lot. Note that @ohenrik said he improved the accuracy without vectors by merging and re-splitting the data: https://github.com/explosion/spaCy/issues/3082#issuecomment-449637286
Here I've pruned the vectors from the third row in the table above (fastText skipgram on the "Norsk Aviskorpus + NoWaC" corpus) to 20 000 words. The final model ends up at 454 MB.
When I use `--prune-vectors 20000` I get this warning: `Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (20000, 100))`. That doesn't happen if I remove the flag. Not sure if it matters.

```
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
--- ---------- ---------- ------- ------- ------- ------- ------- ------- ------- -------
0 159537.223 6402.911 85.781 82.008 82.107 82.057 93.723 100.000 8352 8415
1 114886.809 3331.395 87.663 84.703 84.500 84.602 94.790 100.000 8268 8287
2 102142.415 2465.607 88.314 84.886 85.039 84.963 95.278 100.000 8473 8422
3 93845.905 1858.885 88.637 85.561 85.817 85.689 95.544 100.000 8630 8464
4 87807.250 1485.389 88.958 85.697 86.056 85.876 95.728 100.000 8603 8586
5 82416.232 1216.918 89.311 85.791 85.996 85.894 95.785 100.000 8559 8559
6 78299.269 1105.943 89.598 85.629 85.577 85.603 95.900 100.000 8644 8627
7 74927.845 946.399 89.746 85.117 85.218 85.167 95.914 100.000 8620 8647
8 72173.234 775.656 89.912 84.455 84.859 84.657 95.974 100.000 8555 8532
9 69231.955 732.922 90.027 83.750 84.201 83.975 96.068 100.000 8597 8571
10 66741.832 681.178 90.077 84.223 84.979 84.599 96.114 100.000 8800 8773
11 64863.437 592.628 90.235 85.296 85.398 85.347 96.114 100.000 8757 8763
12 62994.504 544.747 90.391 85.084 85.338 85.211 96.150 100.000 8752 8711
13 60421.970 527.930 90.435 85.305 85.458 85.381 96.155 100.000 8696 8727
14 59108.179 483.949 90.481 84.123 84.979 84.549 96.172 100.000 8767 8711
15 57507.565 414.018 90.446 84.734 85.697 85.213 96.240 100.000 8747 8665
16 55917.333 445.512 90.446 84.811 85.877 85.340 96.229 100.000 8567 8583
17 54140.529 401.599 90.405 84.734 85.697 85.213 96.235 100.000 8735 8656
18 53235.531 345.128 90.460 84.479 85.338 84.906 96.262 100.000 8581 8534
19 51425.662 370.436 90.510 84.438 85.398 84.915 96.271 100.000 8464 8546
20 50373.646 343.464 90.591 84.830 85.338 85.084 96.262 100.000 8549 8555
21 49415.185 337.757 90.644 84.551 85.159 84.854 96.224 100.000 8573 8564
22 48312.131 274.937 90.756 83.589 84.740 84.160 96.221 100.000 8560 8563
23 47543.626 361.788 90.738 84.360 85.218 84.787 96.224 100.000 8586 8547
24 47092.949 330.405 90.780 85.459 85.817 85.638 96.210 100.000 8524 8487
25 45720.017 282.577 90.744 85.629 85.937 85.783 96.232 100.000 8560 8591
26 44934.282 250.263 90.656 84.816 85.577 85.195 96.260 100.000 8533 8538
27 43830.520 255.366 90.576 84.456 85.518 84.984 96.284 100.000 8520 8543
28 43344.429 284.323 90.686 84.834 85.697 85.263 96.295 100.000 8682 8589
29 43504.905 286.349 90.699 84.775 85.637 85.204 96.287 100.000 8699 8710
Time 3.46 s
Words 29847
Words/s 8620
TOK 100.00
POS 95.64
UAS 90.25
LAS 87.93
NER P 82.96
NER R 84.79
NER F 83.87
```

So, comparing size and accuracy from `spacy evaluate`:
| Name | Size | TOK | POS | UAS | LAS | NER P | NER R | NER F |
| ---- | ---- |-----|----| --- | --- | ------| ----- | ----- |
| sm | 15 MB | 100.00 | 94.66 | 88.97 |86.30|71.64|70.54|71.08|
| md | 454 MB | 100.00 | 95.64 | 90.25 | 87.93 | 82.96| 84.79| 83.87 |
| lg | 1308 MB | 100.00 | 95.96 | 90.42 | 88.30| 85.29 | 86.48 | 85.88 |
I'm unable to get sentence segmentation to work, even after changing to spacy convert -n 10 [...]. I do the conversion here.
What happens? You're not passing -G during training, are you? Edit: Just saw your command. Yeah, the --gold-preproc argument is the problem. Is that inherited from some example we give? Someone else has had this problem too, so maybe we have some bad instructions somewhere?
The gap between the sm and md models is interesting. If you want, you could experiment with spacy pretrain for this... It takes a JSONL file, formatted with each line being a dict like {"text": "..."}. This outputs pre-trained weights for the CNN. On small datasets, it can give a big improvement in accuracy. It's best to run it with at least 1 billion words of text, but even if you only have 50-100 million it should still help.
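For reference, building the JSONL input could look something like this (a sketch; the corpus directory and output path are hypothetical):

```python
import json
from pathlib import Path

# Write one {"text": ...} object per line, as expected by spacy pretrain.
# "data/raw" is a hypothetical directory of plain-text documents.
with open("data/pretraining/texts.jsonl", "w", encoding="utf8") as out:
    for path in Path("data/raw").glob("*.txt"):
        text = path.read_text(encoding="utf8").strip()
        if text:
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
```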
You're right, I was passing -G! Not sure where I got that from. I guess it could be clearer in the docs when it is or isn't appropriate.
I'll try to do the sm/md/lg training again without -G. The big table above with all the pre-trained vectors was done without -G, luckily (command here).
I'll try spacy pretrain as well.
Updated results, without -G. Sentence segmentation now works well with all of them.
| Name | Size | TOK | POS | UAS | LAS | NER P | NER R | NER F |
| ---- | ---- |-----|----| --- | --- | ------| ----- | ----- |
| sm | 15 MB | 100.00 | 94.60 | 88.59| 86.10 | 71.96 | 70.54 | 71.24 |
| md | 454 MB | 100.00 | 95.59 | 89.88 | 87.65 | 83.15 | 84.50 | 83.82 |
| lg | 1308 MB | 100.00 | 95.83 | 90.44 | 88.20 | 83.92 | 85.89 | 84.89 |
sm training output

```
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
--- ---------- ---------- ------- ------- ------- ------- ------- ------- ------- -------
0 140328.007 9713.144 81.386 69.818 66.727 68.237 91.036 100.000 6500 6424
1 95301.408 5064.493 84.296 76.306 75.165 75.731 92.887 100.000 6197 6254
2 81858.447 3635.719 85.439 77.488 76.421 76.951 93.682 100.000 10615 10428
3 74299.650 2738.703 86.115 77.778 77.080 77.427 94.140 100.000 10497 10431
4 68887.638 2227.522 86.737 77.544 77.080 77.311 94.392 100.000 10404 10339
5 64298.027 1766.455 87.304 77.858 77.858 77.858 94.617 100.000 7621 6241
6 61905.430 1505.741 87.480 78.138 77.858 77.998 94.719 100.000 10538 9815
7 59213.529 1273.619 87.572 77.798 77.798 77.798 94.814 100.000 6155 6155
8 56951.210 1104.528 87.714 77.564 77.379 77.472 94.930 100.000 10223 10270
9 54658.723 1079.417 87.897 77.624 77.439 77.531 94.927 100.000 6292 6227
10 52977.966 932.116 88.002 77.379 77.379 77.379 95.017 100.000 6189 6162
11 51874.855 800.744 88.200 77.738 77.319 77.528 95.135 100.000 6233 6244
12 50026.206 736.797 88.311 77.281 76.541 76.909 95.176 100.000 10411 7271
13 49049.945 712.090 88.436 77.221 76.481 76.849 95.229 100.000 10293 10285
14 47388.001 647.277 88.573 76.960 76.361 76.660 95.253 100.000 7194 6375
15 47097.322 628.981 88.525 76.354 75.943 76.148 95.292 100.000 6091 6088
16 44984.461 606.129 88.667 76.881 76.421 76.651 95.256 100.000 10292 10445
17 44119.642 525.098 88.698 77.114 76.421 76.766 95.286 100.000 6647 6485
18 43658.052 525.569 88.771 76.891 76.062 76.474 95.330 100.000 10247 10207
19 42470.731 474.503 88.788 76.584 75.943 76.262 95.294 100.000 6089 6114
20 41664.124 553.536 88.923 76.584 75.943 76.262 95.297 100.000 10200 10213
21 41395.161 510.137 88.807 77.549 76.481 77.011 95.325 100.000 10207 10296
22 40159.752 407.179 88.863 77.529 76.601 77.062 95.308 100.000 6658 6288
23 39051.428 430.688 89.041 77.556 76.721 77.136 95.346 100.000 10146 10207
24 38466.165 380.242 89.031 77.300 76.421 76.858 95.399 100.000 9764 10136
25 37894.168 376.259 89.055 77.300 76.421 76.858 95.429 100.000 7419 10290
26 37924.853 404.705 89.030 76.770 75.943 76.354 95.407 100.000 6100 6221
27 36738.349 431.608 89.048 77.039 76.302 76.669 95.412 100.000 7700 6379
28 36492.987 394.110 89.025 77.179 76.302 76.738 95.442 100.000 10228 10311
29 35964.157 323.550 89.177 76.784 76.002 76.391 95.442 100.000 6265 6719
```

md training output

```
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
--- ---------- ---------- ------- ------- ------- ------- ------- ------- ------- -------
0 125910.684 7581.375 84.740 82.388 81.747 82.067 93.613 100.000 6890 6078
1 82290.664 3598.582 86.767 83.106 83.902 83.502 94.697 100.000 6495 6439
2 70896.668 2724.962 87.885 85.312 85.159 85.235 95.204 100.000 9421 5991
3 65209.235 2009.910 88.452 85.203 85.458 85.330 95.467 100.000 5999 6069
4 59814.261 1610.226 88.626 85.646 86.056 85.851 95.582 100.000 5974 6042
5 56224.491 1380.671 88.951 85.774 86.595 86.182 95.747 100.000 10059 6041
6 54198.211 1154.143 89.329 85.207 86.176 85.689 95.878 100.000 6044 10186
7 52108.301 985.541 89.417 86.012 86.475 86.243 95.903 100.000 6044 7271
8 49958.169 845.080 89.650 85.859 86.116 85.987 95.988 100.000 9654 6023
9 48384.992 719.396 89.924 85.748 86.056 85.902 96.016 100.000 6327 10027
10 46380.225 676.546 89.960 85.510 85.817 85.663 96.024 100.000 10218 9840
11 45604.224 698.995 89.960 85.723 85.877 85.800 96.057 100.000 10239 10192
12 44398.147 574.467 90.036 85.603 85.757 85.680 96.024 100.000 10287 10000
13 43337.924 481.058 90.052 85.569 85.877 85.723 96.090 100.000 10231 10321
14 42124.992 494.542 90.032 86.139 86.655 86.396 96.131 100.000 10270 10062
15 40621.568 416.546 90.129 85.841 86.715 86.276 96.136 100.000 10062 10072
16 39310.532 471.311 90.213 85.697 86.774 86.233 96.139 100.000 9945 10052
17 39134.968 373.413 90.343 85.883 87.014 86.445 96.109 100.000 10016 9996
18 38455.480 439.943 90.360 85.816 86.894 86.351 96.095 100.000 10251 10142
19 37404.740 390.323 90.429 85.782 86.655 86.216 96.139 100.000 10185 10144
20 36686.148 394.971 90.478 85.833 86.655 86.242 96.131 100.000 10234 10144
21 36009.123 364.551 90.485 85.917 86.894 86.403 96.169 100.000 10290 10552
22 34603.331 320.468 90.496 85.630 86.655 86.139 96.153 100.000 10210 10540
23 34493.734 321.966 90.537 85.529 86.655 86.088 96.180 100.000 10087 10498
24 33816.817 337.003 90.561 84.979 85.996 85.485 96.164 100.000 10153 10104
25 33440.944 295.308 90.430 85.604 86.475 86.038 96.199 100.000 10097 10169
26 32788.472 311.987 90.421 85.258 86.176 85.714 96.166 100.000 10606 10346
27 32166.413 275.568 90.515 85.264 85.877 85.569 96.112 100.000 10572 10243
28 31914.538 270.512 90.584 85.110 85.518 85.313 96.081 100.000 10511 10122
29 31567.351 249.013 90.507 85.348 85.757 85.552 96.101 100.000 10185 9963
```

lg training output

```
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
0 124258.144 7236.655 85.636 83.533 83.483 83.508 94.173 100.000 6226 6147
1 80057.797 3228.932 87.450 86.283 86.954 86.617 95.429 100.000 6193 6244
2 69706.663 2411.335 88.292 87.195 87.612 87.403 95.840 100.000 6194 6421
3 63253.803 1862.183 88.854 88.067 88.330 88.198 96.142 100.000 8149 7202
4 58398.192 1489.227 89.274 87.709 87.971 87.840 96.216 100.000 9753 9844
5 55081.437 1233.046 89.604 88.067 88.330 88.198 96.317 100.000 5905 6208
6 53114.583 1059.013 89.955 88.365 88.630 88.497 96.435 100.000 5708 5975
7 50720.677 910.304 90.040 88.148 88.570 88.358 96.408 100.000 10395 9472
8 48330.692 782.987 90.223 87.969 88.390 88.179 96.446 100.000 9343 10140
9 47155.682 705.529 90.446 88.109 88.689 88.398 96.441 100.000 5594 10007
10 45750.049 672.297 90.535 87.344 87.971 87.657 96.452 100.000 6098 6251
11 44437.575 649.625 90.549 87.011 87.792 87.399 96.512 100.000 10378 6162
12 43094.576 514.990 90.624 87.034 87.971 87.500 96.512 100.000 6111 5891
13 41935.703 485.979 90.673 87.537 88.270 87.902 96.534 100.000 6170 5911
14 40887.881 468.094 90.713 87.722 88.510 88.114 96.556 100.000 6033 6344
15 39608.918 409.788 90.678 87.864 88.390 88.126 96.564 100.000 10087 7881
16 38723.232 419.580 90.671 87.969 88.390 88.179 96.559 100.000 9331 10132
17 37933.653 394.874 90.613 87.388 87.911 87.649 96.545 100.000 5948 5941
18 37256.559 369.788 90.584 87.329 87.852 87.589 96.578 100.000 9515 9652
19 36489.335 305.808 90.660 87.440 87.911 87.675 96.572 100.000 9344 9432
20 35347.381 346.268 90.610 87.716 88.031 87.873 96.586 100.000 9378 9389
21 34922.059 353.310 90.769 87.657 87.971 87.814 96.591 100.000 10213 9547
22 34012.171 336.559 90.788 88.088 88.510 88.299 96.594 100.000 10272 9369
23 33653.177 345.451 90.734 88.186 88.450 88.318 96.556 100.000 10148 9369
24 33334.910 293.602 90.861 88.119 88.330 88.225 96.559 100.000 10321 9225
25 32328.676 294.129 90.823 88.158 88.211 88.184 96.575 100.000 10163 9312
26 31813.067 250.583 90.742 88.112 88.270 88.191 96.586 100.000 9302 10124
27 31284.447 262.157 90.755 88.450 88.450 88.450 96.589 100.000 5931 10207
28 31013.342 284.829 90.707 88.084 88.031 88.057 96.594 100.000 9286 9425
29 30368.747 252.931 90.773 87.926 88.031 87.978 96.600 100.000 10066 9395
```
I've tried to improve the sm model with pretraining on a subset (~250 million words) of the Norwegian News Corpus.
| Name | Pretrained | Size | POS | UAS | LAS | NER P | NER R | NER F |
| ---- | ---------- | ---- |----| --- | --- | ------| ----- | ----- |
| sm | No | 15 MB | 94.60 | 88.59 | 86.10 | 71.96 | 70.54 | 71.24 |
| sm | Yes | 15 MB | 95.07 | 90.14 | 87.82 | 78.92 | 78.69 | 78.81 |
I chose a subset of the corpus that was easy to convert into the correct format, so it can probably be further improved by pretraining on the full corpus, possibly in combination with NoWaC (700 million tokens). The vector model used for md and lg is trained on the combination of these two corpora.
I ran spacy pretrain with the default settings. Each iteration takes about 2.5 hours (30k words per second) on my hardware, so the default 1000 iterations seems a bit much for me at the moment. The scores above are from the 7th iteration, randomly chosen.
Full output of training

```
$ python -m spacy train nb data/nb-sm-pretrained data/norne-spacy/ud/nob/no-ud-train-ner.json data/norne-spacy/ud/nob/no-ud-dev-ner.json --n-iter 30 -t2v data/pretraining/model7.bin
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'nb'
Counting training words (limit=0)
Loaded pretrained tok2vec for: ['tagger', 'parser', 'ner']
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
--- ---------- ---------- ------- ------- ------- ------- ------- ------- ------- -------
0 108561.528 7789.699 86.292 79.152 78.157 78.651 93.029 100.000 10325 0
1 74733.323 4058.799 87.773 80.417 80.850 80.633 94.126 100.000 10518 0
2 66452.071 2935.499 88.482 80.699 81.568 81.131 94.505 100.000 10571 0
3 60779.625 2301.898 88.892 80.969 81.987 81.475 94.834 100.000 10583 0
4 57622.801 1824.278 89.237 81.883 82.226 82.054 95.105 100.000 10615 0
5 54880.576 1503.344 89.409 82.181 82.525 82.353 95.281 100.000 10646 0
6 52967.322 1309.013 89.502 81.726 82.166 81.946 95.426 100.000 10595 0
7 51169.098 1048.847 89.616 81.981 82.226 82.103 95.492 100.000 10635 0
8 49863.342 944.649 89.741 82.409 82.705 82.557 95.582 100.000 10723 0
9 48321.433 857.080 89.842 81.992 82.286 82.139 95.632 100.000 10622 0
10 46733.436 743.655 89.958 82.283 82.825 82.553 95.662 100.000 10664 0
11 45430.614 710.145 90.128 82.627 82.825 82.726 95.687 100.000 10610 0
12 44267.610 646.346 90.256 82.188 82.286 82.237 95.717 100.000 10571 0
13 43798.329 594.519 90.298 81.910 82.107 82.008 95.725 100.000 10471 0
14 42466.530 543.666 90.331 82.019 82.166 82.093 95.752 100.000 10493 0
15 41715.754 542.159 90.237 81.530 81.628 81.579 95.777 100.000 10531 0
16 40556.998 544.410 90.320 82.030 82.226 82.128 95.835 100.000 10534 0
17 40201.797 387.598 90.367 81.618 82.107 81.862 95.815 100.000 10607 0
18 39513.345 436.006 90.438 82.083 82.525 82.304 95.835 100.000 10542 0
19 38447.469 410.579 90.442 81.797 82.286 82.041 95.835 100.000 10632 0
20 37960.295 407.407 90.472 81.742 81.987 81.864 95.867 100.000 10585 0
21 37349.251 416.729 90.553 81.672 81.867 81.769 95.851 100.000 10587 0
22 36711.216 354.900 90.625 81.557 81.508 81.532 95.818 100.000 10507 0
23 35945.099 436.860 90.519 81.243 81.388 81.315 95.824 100.000 10456 0
24 35678.236 366.608 90.523 81.818 81.867 81.843 95.840 100.000 10545 0
25 35094.173 357.058 90.544 82.074 81.927 82.001 95.859 100.000 10475 0
26 34586.409 347.660 90.572 81.976 81.927 81.952 95.884 100.000 10513 0
27 34059.730 313.799 90.528 82.085 81.987 82.036 95.859 100.000 10496 0
28 34028.352 284.917 90.503 82.156 82.107 82.131 95.884 100.000 10482 0
29 33059.099 302.737 90.595 81.829 81.927 81.878 95.911 100.000 10446 0
✔ Saved model to output directory
data/nb-sm-pretrained/model-final
✔ Created best model
data/nb-sm-pretrained/model-best
$ python -m spacy evaluate data/nb-sm-pretrained/model-best data/norne-spacy/ud/nob/no-ud-test-ner.json
================================== Results ==================================
Time 2.97 s
Words 29847
Words/s 10035
TOK 100.00
POS 95.07
UAS 90.14
LAS 87.82
NER P 78.92
NER R 78.69
NER F 78.81
```

@jarib What type of hardware do you have? I have an idle 1080 Ti graphics card that might speed things up a bit, but I'm not sure if you're already using something similar. Let me know and I can set it up and run it for you on my machine.
@ohenrik I've been using a p2.xlarge instance from AWS. I can share the .jsonl file used for pretraining if you want to give it a try.
I can give it a try :) Just share what I need to set it up.
@ohenrik To try pretraining:
You need `data/nnc.jsonl` and `data/nb-lg`, then run `python -m spacy pretrain data/nnc.jsonl data/nb-lg data/pretraining`. However, I think we should eventually try a larger corpus in order to improve the sm model.
@honnibal Can spacy pretrain improve the lg model as well, or is that redundant?
@jarib Should I convert the contents of the ner_files folder I downloaded to JSONL using spaCy? I did not get any files named nnc.jsonl (or .jsonl files in general).
@ohenrik This file contains data/nnc.jsonl.
Strange... I'm not sure what happened, but I somehow mixed up the zip files with some old NER data I downloaded back on December 17 :p So no wonder nothing made sense. I started training it now and will give an update soon, when I know more about the speed.
The NoWaC corpus is 700 million tokens and as such a good candidate for pretraining. But it's licensed CC BY-NC-SA. Will pretraining on this corpus affect the licensing of the final model?
> Will pretraining on this corpus affect the licensing of the final model?
Unfortunately, yes. The model will include embeddings based on that corpus, which would count as derivative works. It's not always 100% clear what the "SA" (share-alike) part of the license means for statistical models, but if a data resource is published as "NC" (non-commercial), we definitely won't be able to release a spaCy model based on it without that restriction. And since most spaCy users do commercial work, that'd be pretty limiting.
Depending on who published a resource, it's sometimes possible to negotiate special terms, so it might be worth reaching out to the authors to ask 🙂
Still really keen to get full Norwegian support :). And also quite pleased with the pretraining performance!
We can merge this with the "Adding models" master thread, #3056 . Could you update once you have the licensing figured out?
@honnibal AFAICT there's only a licensing issue if we want to use the NoWaC corpus for pretraining. The group who published the NLPL vectors used above confirmed to me that they are licensed CC-BY.
I can try to negotiate special terms for NoWaC, or look for an alternative corpus that can be used for pretraining, if that’s a requirement for getting the Norwegian model included.
Almost finished with the v2.1 release. After that I'll be updating the datasets in the model training pipeline, which should let us publish the official nb models 🎉