spaCy: Core Norwegian (nb) model

Created on 21 Dec 2018 · 36 comments · Source: explosion/spaCy

All I want for Christmas is a core nb model for spacy. And we're getting close!

Last week the Language Technology Group at the University of Oslo released NER annotations on top of the Norwegian Dependency Treebank. The nb tag map has already been created for it, thanks to @katarkor. So there's now both UD/POS and NER data.

TODO:

Labels: enhancement, lang / nb, models

Most helpful comment

Almost finished with the v2.1 release. After that I'll be updating the datasets in the model training pipeline, which should let us publish the official nb models 🎉

All 36 comments

@jarib I managed to publish an (unpolished) repo with the model I trained for Nudge (Tagbox.ai):
https://github.com/ohenrik/nb_news_ud_sm

I will hopefully get to clean this up a bit more.

Also, regarding "How come sentence segmentation is not working?":

This is because sentence boundary detection needs to be activated for a new model. You can make this automatic by adding "sbd" to the pipeline list in the model's meta.json file. See example here
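Roughly, the relevant part of meta.json would look something like this (other fields omitted; the values are just illustrative):

```json
{
  "lang": "nb",
  "name": "news_ud_sm",
  "pipeline": ["sbd", "tagger", "parser", "ner"]
}
```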

@jarib Also, the model I link to above does not appear to diverge after X iterations, so I think the dataset used to train the NER (+ DEP) model did not have this problem.

Ah, great! I'll have a look.

Was the NER/DEP dataset that didn't diverge based on an earlier version of https://github.com/ltgoslo/norne/, or something else?

I don't know if the dataset has changed since I got access to it; I received it by email earlier this summer, so the repo might have newer data. However, I had to combine the train, dev and test data, randomize it, and then split it into train, dev and test again. This was to keep the training from getting stuck and producing weird results.

Ah, I see. The training doesn't get stuck with the latest version, but the NER F score is ~79. I might be doing something else wrong though. I'll publish the steps I've taken soon.

Also, it seems that existing models (e.g. en_core_web_sm) are able to do sentence segmentation without sbd in the pipeline, apparently based on the dependency parse… I guess the same should be possible for Norwegian?
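For example (a quick check, assuming en_core_web_sm is installed; the boundaries come from the parse even though there's no explicit sentencizer in the pipeline):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # no "sbd"/sentencizer component listed

doc = nlp("This is a sentence. This is another one.")
for sent in doc.sents:  # sentence boundaries derived from the dependency parse
    print(sent.text)
```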

@jarib Thanks for your work on this! We're still on holidays, but some quick answers in the meantime:

Do I need to retrain word vectors?

Maybe not. If you run the tokenizer, have a look at which words end up without a vector. Also check for word-vector entries that never occur in your data after tokenizing a bunch of text. For instance, if you were doing this for English and found you didn't have a vector for "n't" but did have a vectors-table entry for "can't", you'd know you have problems. Also, check that the word vectors are case-sensitive; case-sensitive vectors generally work better with spaCy.
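A rough sketch of that coverage check (the model and text paths are placeholders for whatever holds your vectors and sample data):

```python
import spacy

nlp = spacy.load("data/nb-lg")  # placeholder: a model directory with vectors loaded
text = open("sample.txt", encoding="utf-8").read()

doc = nlp.make_doc(text)  # tokenizer only, no pipeline components
missing = sorted({tok.text for tok in doc if tok.is_alpha and not tok.has_vector})
print(f"{len(missing)} alphabetic token types without a vector")
print(missing[:20])
```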

Finally, and perhaps most importantly, check that the vectors have a compatible license. We'd like the vectors to be CC-BY, MIT or BSD licensed. If they're CC-BY-SA, CC-BY-NC or GPL licensed, they'll restrict how we can license the final model.

Tagger/NER diverges

This seems to be resolved? As for the accuracy plateauing: well, it has to top out somewhere, right? You could try fiddling with the hyper-parameters to see if you can get better performance. Another trick is to run cross-fold training and record the accuracy of each sentence in the training data across the different folds (see the sketch below). Have a look at the sentences that are consistently predicted poorly. Sometimes you'll find these sentences are just...bad, and accuracy might go up if they're removed.
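A very rough sketch of that bookkeeping; `train_model` and `score_sentence` are hypothetical placeholders for your own training and per-sentence evaluation code:

```python
import random
from collections import defaultdict

def cross_fold_sentence_scores(sentences, k=5, seed=0):
    """Record each training sentence's score across K held-out folds.

    `sentences` can be any hashable items (e.g. raw sentence strings or IDs).
    """
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    folds = [sentences[i::k] for i in range(k)]
    scores = defaultdict(list)
    for i, held_out in enumerate(folds):
        train_data = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_model(train_data)  # hypothetical: train tagger/parser/NER
        for sent in held_out:
            scores[sent].append(score_sentence(model, sent))  # hypothetical
    return scores

# Sentences whose scores are consistently low across folds are candidates
# for closer inspection or removal from the training data.
```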

Sentence boundary detection

Make sure you're setting the -n flag when you're converting the data, so that you have some documents with multiple sentences in them. This allows the parser to learn to divide the sentences. If you only have one sentence per document in the training data, the parser never learns to use the Break transition, and so accuracy on multi-sentence documents is very low.
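Roughly like this (paths are placeholders; -n / --n-sents groups that many sentences into one training document):

```bash
python -m spacy convert data/norne/no-ud-train-ner.conllu data/converted -n 10
```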

@honnibal Thanks for the suggestions!

The problem with the divergence / tag accuracy dropping appears to be gone after I switched to spacy 2.1.0a4.

I have some code running right now that goes through all the Norwegian Bokmål word embedding models from the NLPL repository to see which one gives the best results. I haven't found much license information for those, but I'll look into it.

The results look promising to my eyes. Here's the output from one of the large (>2GB) models. I've also put up the scripts I've used.

=========
Model   0
=========
        Path: data/vectors-all/11-100

Vectors
-------
        Algorithm: Gensim Continuous Skipgram
        Corpus   : Norsk Aviskorpus + NoWaC + NBDigital (lemmatized=False, case preserved=True, tokens=3028545953)
        Vectors  : dimensions=100, window=5, iterations=5, vocab size=4480046

Training
--------
        UAS    NER P. NER R. NER F. Tag %  Token %
        86.532 86.734 87.253 86.993 94.617 100.000
        90.355 89.247 89.408 89.327 96.446 100.000
        89.914 89.653 89.707 89.680 96.361 100.000
        90.173 89.928 89.767 89.847 96.427 100.000
        90.593 88.889 89.048 88.969 96.641 100.000
        90.581 88.611 88.929 88.769 96.646 100.000
        89.569 89.467 89.467 89.467 96.210 100.000
        89.274 89.408 89.408 89.408 96.076 100.000
        90.267 88.949 89.108 89.028 96.564 100.000
        88.150 88.590 88.749 88.670 95.462 100.000
        90.543 89.121 89.228 89.175 96.638 100.000
        90.593 88.989 88.989 88.989 96.616 100.000
        90.638 88.439 88.809 88.623 96.624 100.000
        88.635 89.348 89.348 89.348 95.870 100.000
        90.618 89.241 89.348 89.294 96.613 100.000

Best
----
        Path: data/vectors-all/11-100/training/model-best
        Size: 2282 MB

        UAS    NER P. NER R. NER F. Tag %  Token %
        90.618 89.241 89.348 89.294 96.613 100.000

Evaluate
--------


         Time      3.72 s
         Words     30034
         Words/s   8079
         TOK       100.00
         POS       96.01
         UAS       90.48
         LAS       88.16
         NER P     85.24
         NER R     86.73
         NER F     85.98

The vectors are licensed CC-BY. For attribution, these publications can be cited:

Stadsnes, Øvrelid, & Velldal (2018)
http://ojs.bibsys.no/index.php/NIK/article/view/490/

Fares, Kutuzov, Oepen, & Velldal (2017)
http://www.ep.liu.se/ecp/article.asp?issue=131&article=037

Numbers look great! I wonder exactly which change made the difference... Possibly just the different hyper-parameters, especially the narrower widths (which reduce overfitting).

It sounds like there's no barrier to adding this. I just have to add the data files to our corpora image.

Sounds good.

I'll have the same output from training all the 61 Norwegian Bokmål vector models in the NLPL repository soon. Should be finished in a day or two. That should make it easier to decide which ones make sense to use for spacy models.

Yay, super excited about this. Once the nb model is added, I'll link this thread in the master thread in #3056 as a great example of end-to-end community-driven model development 💖

Write more tests for Norwegian

For the upcoming v2.1.x, we've moved all model-related tests out of spaCy's regular test suite and over to spacy-models/tests. This makes it easier to run them independently, e.g. as part of our automated model training process.

Edit: The latest commit totally gives the wrong impression 😅
(screenshot: 2018-12-27 at 23:30)

The test suite includes a bunch of general "sanity checks" that all models should pass – so in case you haven't run those yet, it'd definitely be nice to check that there are no deeper issues.

And of course it'd also be cool to have some very basic Norwegian tests for the individual components and vocab (e.g. lexical attributes, see here for an example). Writing tests for the statistical components can be a bit tricky, because predictions can change and it doesn't necessarily mean that the model is worse if it performs worse on some arbitrary test case.
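Something along these lines, as a sketch (this assumes an nb_tokenizer fixture in conftest.py following the conventions used for the other languages; the expected token counts are just illustrative):

```python
import pytest

@pytest.mark.parametrize("text,length", [
    ("Dette er en setning.", 5),
    ("Hva heter du?", 4),
])
def test_nb_tokenizer_handles_basic_text(nb_tokenizer, text, length):
    tokens = nb_tokenizer(text)
    assert len(tokens) == length
```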

(Slightly OT, but I've been thinking about adding a test helper that lets you assert that at least X% of results are correct. So we could use this to make sure that the accuracy of all POS tags or per-token BILUO entity labels in a longer text doesn't fall below a certain threshold.)
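Roughly what I have in mind, as a sketch rather than an existing spaCy test utility:

```python
def assert_min_accuracy(predicted, expected, threshold=0.9):
    """Pass if at least `threshold` of the predictions match the expected values."""
    assert len(predicted) == len(expected)
    correct = sum(p == e for p, e in zip(predicted, expected))
    accuracy = correct / len(expected)
    assert accuracy >= threshold, f"accuracy {accuracy:.2%} is below {threshold:.0%}"

# e.g. comparing predicted coarse POS tags on a longer gold-annotated text:
# assert_min_accuracy([t.pos_ for t in doc], gold_pos_tags, threshold=0.9)
```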

Here is the result of spacy evaluate on the test split from NORNE and the 59 Norwegian Bokmål vector models from the NLPL repository (scroll right for scores):

| Corpus | Lemmatized | Algorithm | Dimensions | Window | Vocab size | Model size | TOK | POS | UAS | LAS | NER P | NER R | NER F |
| ------ | ---------- | --------- | ---------- | ------ | ---------- | ---------- | --- | --- | --- | --- | ----- | ----- | ----- |
| Norsk Aviskorpus + NoWaC + NBDigital | False | Gensim Continuous Skipgram | 100 | 5 | 4480046 | 2282MB | 100.0 | 96.01 | 90.48 | 88.16 | 85.24 | 86.73 | 85.98 |
| Norsk Aviskorpus | False | Gensim Continuous Skipgram | 100 | 5 | 1728100 | 890MB | 100.0 | 95.83 | 90.43 | 88.08 | 84.94 | 86.3 | 85.61 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 100 | 5 | 2551820 | 1307MB | 100.0 | 95.79 | 90.4 | 88.14 | 85.88 | 86.44 | 86.16 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | Gensim Continuous Bag-of-Words | 100 | 5 | 4480046 | 2282MB | 100.0 | 95.76 | 90.15 | 87.75 | 85.2 | 86.01 | 85.6 |
| Norsk Aviskorpus + NoWaC | False | Gensim Continuous Bag-of-Words | 100 | 5 | 2551819 | 1307MB | 100.0 | 95.74 | 90.44 | 88.09 | 85.01 | 84.77 | 84.89 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 50 | 5 | 2551820 | 820MB | 100.0 | 95.75 | 90.21 | 87.81 | 84.7 | 85.57 | 85.13 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | fastText Skipgram | 100 | 5 | 4428648 | 2256MB | 100.0 | 95.97 | 90.26 | 87.85 | 85.26 | 86.88 | 86.06 |
| Norsk Aviskorpus + NoWaC | False | Gensim Continuous Skipgram | 100 | 5 | 2551819 | 1307MB | 100.0 | 96.0 | 90.14 | 87.82 | 85.63 | 86.44 | 86.04 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 300 | 5 | 1487994 | 1904MB | 100.0 | 95.39 | 90.08 | 87.51 | 83.17 | 82.51 | 82.84 |
| Norsk Aviskorpus + NoWaC | False | fastText Continuous Bag-of-Words | 100 | 5 | 2551820 | 1307MB | 100.0 | 95.8 | 90.26 | 88.05 | 84.79 | 84.91 | 84.85 |
| Norsk Aviskorpus | False | Gensim Continuous Bag-of-Words | 100 | 5 | 1728100 | 890MB | 100.0 | 95.73 | 90.43 | 88.07 | 85.32 | 84.69 | 85.0 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | Gensim Continuous Skipgram | 100 | 5 | 4031460 | 2055MB | 100.0 | 95.41 | 89.82 | 87.36 | 84.47 | 84.84 | 84.65 |
| Norsk Aviskorpus | False | fastText Skipgram | 100 | 5 | 1728101 | 890MB | 100.0 | 95.78 | 90.23 | 87.8 | 84.33 | 85.13 | 84.73 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | fastText Skipgram | 100 | 5 | 3998140 | 2038MB | 100.0 | 95.37 | 89.75 | 87.26 | 83.3 | 85.06 | 84.17 |
| Norsk Aviskorpus + NoWaC | True | Gensim Continuous Skipgram | 100 | 5 | 2239664 | 1149MB | 100.0 | 95.37 | 90.1 | 87.64 | 83.5 | 84.11 | 83.81 |
| Norsk Aviskorpus | True | fastText Skipgram | 600 | 5 | 1487995 | 3607MB | 100.0 | 95.44 | 89.69 | 87.22 | 83.31 | 84.77 | 84.03 |
| Norsk Aviskorpus | True | fastText Skipgram | 300 | 5 | 1487995 | 1904MB | 100.0 | 95.36 | 89.7 | 87.29 | 83.79 | 85.13 | 84.45 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | Gensim Continuous Bag-of-Words | 100 | 5 | 4031460 | 2055MB | 100.0 | 95.32 | 89.68 | 87.14 | 81.98 | 82.58 | 82.28 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 300 | 5 | 2551820 | 3254MB | 100.0 | 96.11 | 90.31 | 87.93 | 84.9 | 86.08 | 85.49 |
| Norsk Aviskorpus | False | fastText Continuous Bag-of-Words | 100 | 5 | 1728101 | 890MB | 100.0 | 95.85 | 90.34 | 87.99 | 84.11 | 84.48 | 84.29 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | fastText Continuous Bag-of-Words | 100 | 5 | 4428648 | 2256MB | 100.0 | 95.96 | 90.26 | 87.96 | 83.12 | 83.97 | 83.54 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 600 | 5 | 1487994 | 3607MB | 100.0 | 95.44 | 89.54 | 86.75 | 84.26 | 83.89 | 84.08 |
| Norsk Aviskorpus + NoWaC | True | Gensim Continuous Bag-of-Words | 100 | 5 | 2239664 | 1149MB | 100.0 | 95.34 | 89.78 | 87.23 | 83.69 | 83.02 | 83.35 |
| Norsk Aviskorpus | True | Gensim Continuous Skipgram | 100 | 5 | 1487994 | 768MB | 100.0 | 95.31 | 89.78 | 87.25 | 84.45 | 84.33 | 84.39 |
| NoWaC | True | Gensim Continuous Bag-of-Words | 100 | 5 | 1199274 | 619MB | 100.0 | 95.27 | 89.61 | 86.99 | 79.36 | 79.88 | 79.62 |
| Norsk Aviskorpus + NoWaC | True | fastText Continuous Bag-of-Words | 100 | 5 | 2239665 | 1149MB | 100.0 | 95.21 | 89.7 | 87.19 | 83.88 | 82.65 | 83.26 |
| Norsk Aviskorpus + NoWaC | False | fastText Skipgram | 600 | 5 | 2551820 | 6175MB | 100.0 | 96.24 | 90.48 | 88.16 | 84.73 | 85.35 | 85.04 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 100 | 5 | 1487994 | 768MB | 100.0 | 95.2 | 89.78 | 87.2 | 81.53 | 80.76 | 81.14 |
| Norsk Aviskorpus + NoWaC + NBDigital | False | Global Vectors | 100 | 15 | 4480047 | 2282MB | 100.0 | 95.64 | 90.06 | 87.6 | 84.39 | 84.33 | 84.36 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | fastText Continuous Bag-of-Words | 100 | 5 | 3998140 | 2038MB | 100.0 | 95.23 | 89.71 | 87.27 | 82.47 | 81.92 | 82.19 |
| Norsk Aviskorpus | True | fastText Skipgram | 50 | 5 | 1487995 | 485MB | 100.0 | 95.32 | 89.6 | 87.05 | 81.69 | 82.29 | 81.99 |
| NoWaC | False | fastText Skipgram | 100 | 5 | 1356633 | 699MB | 100.0 | 95.79 | 90.22 | 87.76 | 84.18 | 85.35 | 84.76 |
| Norsk Aviskorpus | True | fastText Skipgram | 100 | 5 | 1487995 | 768MB | 100.0 | 95.32 | 89.84 | 87.37 | 81.8 | 82.87 | 82.33 |
| Norsk Aviskorpus | True | fastText Continuous Bag-of-Words | 100 | 5 | 1487995 | 768MB | 100.0 | 95.12 | 89.7 | 87.25 | 81.61 | 82.14 | 81.87 |
| NoWaC | False | fastText Continuous Bag-of-Words | 100 | 5 | 1356633 | 699MB | 100.0 | 95.73 | 89.86 | 87.43 | 82.16 | 83.24 | 82.69 |
| Norsk Aviskorpus | True | Gensim Continuous Bag-of-Words | 50 | 5 | 1487994 | 485MB | 100.0 | 95.13 | 89.53 | 86.96 | 81.3 | 81.12 | 81.21 |
| NoWaC | False | Gensim Continuous Skipgram | 100 | 5 | 1356632 | 699MB | 100.0 | 95.79 | 89.84 | 87.45 | 83.24 | 84.33 | 83.78 |
| NoWaC | False | Gensim Continuous Bag-of-Words | 100 | 5 | 1356632 | 699MB | 100.0 | 95.79 | 90.29 | 88.0 | 80.01 | 80.54 | 80.28 |
| Norsk Aviskorpus + NoWaC | False | Global Vectors | 100 | 15 | 2551820 | 1307MB | 100.0 | 95.56 | 89.67 | 87.21 | 83.19 | 84.77 | 83.97 |
| Norsk Aviskorpus + NoWaC | True | fastText Skipgram | 100 | 5 | 2239665 | 1149MB | 100.0 | 95.55 | 89.74 | 87.17 | 84.38 | 85.42 | 84.9 |
| NBDigital | False | fastText Skipgram | 100 | 5 | 2390584 | 1221MB | 100.0 | 95.63 | 89.79 | 87.37 | 79.94 | 81.63 | 80.78 |
| Norsk Aviskorpus | False | Global Vectors | 100 | 15 | 1728101 | 890MB | 100.0 | 95.5 | 89.76 | 87.26 | 83.47 | 84.26 | 83.86 |
| NoWaC | True | fastText Skipgram | 100 | 5 | 1199275 | 619MB | 100.0 | 95.3 | 89.6 | 87.05 | 80.73 | 81.85 | 81.29 |
| Norsk Aviskorpus + NoWaC | True | Global Vectors | 100 | 15 | 2239665 | 1149MB | 100.0 | 95.09 | 89.18 | 86.56 | 80.56 | 81.56 | 81.06 |
| NBDigital | False | Gensim Continuous Skipgram | 100 | 5 | 2390583 | 1221MB | 100.0 | 95.7 | 89.91 | 87.55 | 80.39 | 81.56 | 80.97 |
| Norsk Aviskorpus + NoWaC + NBDigital | True | Global Vectors | 100 | 15 | 4031461 | 2055MB | 100.0 | 95.06 | 89.19 | 86.55 | 81.86 | 82.22 | 82.04 |
| Norsk Aviskorpus | True | Global Vectors | 100 | 15 | 1487995 | 768MB | 100.0 | 95.08 | 89.27 | 86.68 | 81.94 | 82.65 | 82.29 |
| NoWaC | True | Gensim Continuous Skipgram | 100 | 5 | 1199274 | 619MB | 100.0 | 95.37 | 89.64 | 87.18 | 82.3 | 83.38 | 82.84 |
| NBDigital | False | fastText Continuous Bag-of-Words | 100 | 5 | 2390584 | 1221MB | 100.0 | 95.39 | 89.75 | 87.21 | 77.67 | 79.37 | 78.51 |
| NoWaC | False | Global Vectors | 100 | 15 | 1356633 | 699MB | 100.0 | 95.43 | 89.51 | 86.98 | 81.4 | 82.94 | 82.17 |
| NBDigital | False | Gensim Continuous Bag-of-Words | 100 | 5 | 2390583 | 1221MB | 100.0 | 95.57 | 89.71 | 87.35 | 78.22 | 78.79 | 78.5 |
| NBDigital | True | Gensim Continuous Skipgram | 100 | 5 | 2187702 | 1119MB | 100.0 | 95.36 | 89.65 | 87.04 | 78.49 | 79.01 | 78.75 |
| NoWaC | True | Global Vectors | 100 | 15 | 1199275 | 619MB | 100.0 | 94.89 | 88.91 | 86.22 | 79.3 | 79.59 | 79.45 |
| NBDigital | True | fastText Skipgram | 100 | 5 | 2187703 | 1119MB | 100.0 | 95.36 | 89.31 | 86.82 | 79.08 | 79.88 | 79.48 |
| NoWaC | True | fastText Continuous Bag-of-Words | 100 | 5 | 1199275 | 619MB | 100.0 | 95.43 | 89.76 | 87.22 | 80.99 | 82.0 | 81.49 |
| NBDigital | False | Global Vectors | 100 | 15 | 2390584 | 1221MB | 100.0 | 95.41 | 89.38 | 86.91 | 77.7 | 79.23 | 78.46 |
| NBDigital | True | fastText Continuous Bag-of-Words | 100 | 5 | 2187703 | 1119MB | 100.0 | 95.04 | 89.64 | 87.15 | 79.1 | 79.45 | 79.27 |
| NBDigital | True | Global Vectors | 100 | 15 | 2187703 | 1119MB | 100.0 | 95.0 | 89.31 | 86.56 | 78.08 | 77.62 | 77.85 |
| NBDigital | True | Gensim Continuous Bag-of-Words | 100 | 5 | 2187702 | 1119MB | 100.0 | 95.1 | 89.26 | 86.67 | 77.35 | 77.92 | 77.63 |

Full output: https://gist.github.com/jarib/f0da63fbe338ae3dac0559032cc2e1fd
Output as JSON: https://gist.github.com/jarib/6048712165290a13179c5cd47157f1bd

I'm not sure which accuracy/size tradeoff makes the most sense for spaCy. I also haven't tried to do any pruning of the vectors. And this was trained using data converted before I was aware of the --n-sents option to spacy convert.

Thanks for running all this! Just to be clear, did you retrain with the new vectors? You can't really compare the accuracy by swapping in the vectors at runtime, because then all you're really measuring is how similar each set of vectors is to the one the model was trained with.

I think it would also be very useful to train an sm model to check how much improvement the vectors really make. Generally using --prune-vectors 20000 (this is the setting we've been using for md models) gives a pretty good balance of accuracy and size.
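If it helps, the pruning happens when the vocab and vectors are built, roughly along these lines with spaCy v2.1's init-model command (paths are placeholders):

```bash
python -m spacy init-model nb data/nb-md-vocab \
    --vectors-loc data/vectors/model.txt.gz --prune-vectors 20000
```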

I did retrain with the new vectors. I don't know spacy's internals well enough to dare swapping anything out :)

I'll train one model without vectors (= sm?) and one with pruning, and post the results here.

Okay, great!

Here are the results without vectors. The size is 15MB on disk, uncompressed / unpackaged.

Training
Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
0   179886.936    8866.197   82.367   72.354   69.539   70.919   91.280  100.000     9177     9234
1   132272.499    4886.268   84.810   76.308   74.207   75.243   93.018  100.000     4203     4394
2   116446.641    3512.800   85.919   78.156   75.583   76.848   93.767  100.000     8864     8857
3   106976.626    2606.215   86.395   78.886   77.139   78.003   94.099  100.000     9001     8975
4    99785.032    2114.693   86.876   80.331   78.456   79.382   94.436  100.000     8807     8927
5    94071.978    1743.858   87.448   79.648   78.456   79.047   94.606  100.000     8910     8902
6    88821.250    1449.010   87.793   80.024   78.396   79.202   94.850  100.000     8907     8886
7    84949.003    1302.102   87.984   80.049   78.276   79.153   95.009  100.000     8939     8957
8    81243.756    1196.975   88.084   79.376   77.618   78.487   95.045  100.000     8974     8900
9    78191.758     986.522   88.165   78.545   77.558   78.049   95.124  100.000     9015     8962
10   75810.611     923.147   88.266   78.235   77.439   77.835   95.149  100.000     8969     8930
11   73314.405     795.656   88.314   78.100   77.259   77.677   95.237  100.000     8797     8833
12   70981.429     729.826   88.528   78.463   77.618   78.039   95.251  100.000     8981     8951
13   68370.635     664.590   88.682   79.162   77.977   78.565   95.286  100.000     9059     9006
14   66972.303     661.691   88.624   79.101   77.917   78.505   95.294  100.000     8872     8867
15   65480.245     614.919   88.660   79.358   78.456   78.905   95.371  100.000     8894     8804
16   63636.544     538.836   88.827   79.613   78.755   79.182   95.385  100.000     8745     8727
17   61809.648     510.238   88.980   79.649   78.695   79.169   95.355  100.000     8676     8743
18   60794.229     498.642   88.974   79.406   78.456   78.928   95.404  100.000     8777     8787
19   58516.986     474.259   89.156   79.030   78.037   78.531   95.448  100.000     8838     8928
20   57729.559     465.683   89.090   78.434   77.917   78.175   95.437  100.000     8697     8777
21   56607.341     434.805   89.166   78.627   78.157   78.391   95.451  100.000     8877     8983
22   55707.805     408.971   89.309   79.275   78.516   78.894   95.492  100.000     8817     8750
23   54313.256     397.411   89.309   79.024   78.456   78.739   95.484  100.000     8745     8929
24   53104.471     393.764   89.245   78.688   78.217   78.451   95.478  100.000     8789     8771
25   52802.164     361.336   89.292   79.287   78.576   78.930   95.475  100.000     8866     8775
26   51530.342     384.859   89.315   79.577   78.815   79.194   95.475  100.000     8810     8697
27   50466.753     293.543   89.293   79.217   78.695   78.955   95.473  100.000     8866     8920
28   50368.632     346.166   89.270   79.207   78.875   79.040   95.492  100.000     8910     8949
29   49231.304     353.097   89.331   78.790   78.695   78.743   95.536  100.000     8825     8796
Evaluation
Time      3.40 s
Words     29847
Words/s   8781
TOK       100.00
POS       94.66
UAS       88.97
LAS       86.30
NER P     71.64
NER R     70.54
NER F     71.08

Seems like the vectors improve the accuracy quite a lot. Note that @ohenrik said he improved the accuracy without vectors by merging and re-splitting the data: https://github.com/explosion/spaCy/issues/3082#issuecomment-449637286

Here I've pruned the vectors from the third row in the table above (fastText skipgram on the "Norsk Aviskorpus + NoWaC" corpus) to 20 000 words. The final model ends up at 454 MB.

When I use --prune-vectors 20000 I get this warning: Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (20000, 100)). That doesn't happen if I remove the flag. Not sure if it matters.
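My guess (not verified against this setup) is that the warning is about the vectors table having no name, so something like the following before saving might silence it:

```python
import spacy

nlp = spacy.load("data/nb-md/training/model-best")  # placeholder path
nlp.vocab.vectors.name = "nb_model.vectors"         # any unique name should do
nlp.to_disk("data/nb-md/training/model-best")
```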

Training
Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
  0  159537.223    6402.911   85.781   82.008   82.107   82.057   93.723  100.000     8352     8415
  1  114886.809    3331.395   87.663   84.703   84.500   84.602   94.790  100.000     8268     8287
  2  102142.415    2465.607   88.314   84.886   85.039   84.963   95.278  100.000     8473     8422
  3   93845.905    1858.885   88.637   85.561   85.817   85.689   95.544  100.000     8630     8464
  4   87807.250    1485.389   88.958   85.697   86.056   85.876   95.728  100.000     8603     8586
  5   82416.232    1216.918   89.311   85.791   85.996   85.894   95.785  100.000     8559     8559
  6   78299.269    1105.943   89.598   85.629   85.577   85.603   95.900  100.000     8644     8627
  7   74927.845     946.399   89.746   85.117   85.218   85.167   95.914  100.000     8620     8647
  8   72173.234     775.656   89.912   84.455   84.859   84.657   95.974  100.000     8555     8532
  9   69231.955     732.922   90.027   83.750   84.201   83.975   96.068  100.000     8597     8571
 10   66741.832     681.178   90.077   84.223   84.979   84.599   96.114  100.000     8800     8773
 11   64863.437     592.628   90.235   85.296   85.398   85.347   96.114  100.000     8757     8763
 12   62994.504     544.747   90.391   85.084   85.338   85.211   96.150  100.000     8752     8711
 13   60421.970     527.930   90.435   85.305   85.458   85.381   96.155  100.000     8696     8727
 14   59108.179     483.949   90.481   84.123   84.979   84.549   96.172  100.000     8767     8711
 15   57507.565     414.018   90.446   84.734   85.697   85.213   96.240  100.000     8747     8665
 16   55917.333     445.512   90.446   84.811   85.877   85.340   96.229  100.000     8567     8583
 17   54140.529     401.599   90.405   84.734   85.697   85.213   96.235  100.000     8735     8656
 18   53235.531     345.128   90.460   84.479   85.338   84.906   96.262  100.000     8581     8534
 19   51425.662     370.436   90.510   84.438   85.398   84.915   96.271  100.000     8464     8546
 20   50373.646     343.464   90.591   84.830   85.338   85.084   96.262  100.000     8549     8555
 21   49415.185     337.757   90.644   84.551   85.159   84.854   96.224  100.000     8573     8564
 22   48312.131     274.937   90.756   83.589   84.740   84.160   96.221  100.000     8560     8563
 23   47543.626     361.788   90.738   84.360   85.218   84.787   96.224  100.000     8586     8547
 24   47092.949     330.405   90.780   85.459   85.817   85.638   96.210  100.000     8524     8487
 25   45720.017     282.577   90.744   85.629   85.937   85.783   96.232  100.000     8560     8591
 26   44934.282     250.263   90.656   84.816   85.577   85.195   96.260  100.000     8533     8538
 27   43830.520     255.366   90.576   84.456   85.518   84.984   96.284  100.000     8520     8543
 28   43344.429     284.323   90.686   84.834   85.697   85.263   96.295  100.000     8682     8589
 29   43504.905     286.349   90.699   84.775   85.637   85.204   96.287  100.000     8699     8710
Evaluation
Time      3.46 s
Words     29847
Words/s   8620
TOK       100.00
POS       95.64
UAS       90.25
LAS       87.93
NER P     82.96
NER R     84.79
NER F     83.87

So, comparing size and accuracy from spacy evaluate:

| Name | Size | TOK | POS | UAS | LAS | NER P | NER R | NER F |
| ---- | ---- |-----|----| --- | --- | ------| ----- | ----- |
| sm | 15 MB | 100.00 | 94.66 | 88.97 |86.30|71.64|70.54|71.08|
| md | 454 MB | 100.00 | 95.64 | 90.25 | 87.93 | 82.96| 84.79| 83.87 |
| lg | 1308 MB | 100.00 | 95.96 | 90.42 | 88.30| 85.29 | 86.48 | 85.88 |

I'm unable to get sentence segmentation to work, even after changing to spacy convert -n 10 [...]. I do the conversion here.

What happens? You're not passing -G during training, are you? Edit: Just saw your command. Yeah, the --gold-preproc argument is the problem. Is that inherited from some example we give? Someone else has had this problem too, so maybe we have some bad instructions somewhere?

The gap between the sm and md models is interesting. If you want, you could experiment with spacy pretrain for this... It takes a jsonl file, formatted with each line being a dict like {"text": "..."}. This outputs pre-trained weights for the CNN. On small datasets, it can give a big improvement in accuracy. It's best to run it with at least 1 billion words of text, but even if you only have 50-100 million it should still help.
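For reference, a small sketch of producing that JSONL from plain-text files (paths are placeholders):

```python
import json
from pathlib import Path

with open("data/nnc.jsonl", "w", encoding="utf-8") as out:
    for path in Path("data/nnc-raw").glob("*.txt"):
        for line in path.open(encoding="utf-8"):
            line = line.strip()
            if line:  # one {"text": "..."} object per non-empty line
                out.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")
```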

You're right, I was passing -G! Not sure where I got that from. I guess it could be clearer in the docs when it is or isn't appropriate.

I'll try to do the sm/md/lg training again without -G. The big table above with all the pre-trained vectors was done without -G, luckily (command here).

I'll try spacy pretrain as well.

Updated results, without -G. Sentence segmentation now works well with all of them.

| Name | Size | TOK | POS | UAS | LAS | NER P | NER R | NER F |
| ---- | ---- |-----|----| --- | --- | ------| ----- | ----- |
| sm | 15 MB | 100.00 | 94.60 | 88.59| 86.10 | 71.96 | 70.54 | 71.24 |
| md | 454 MB | 100.00 | 95.59 | 89.88 | 87.65 | 83.15 | 84.50 | 83.82 |
| lg | 1308 MB | 100.00 | 95.83 | 90.44 | 88.20 | 83.92 | 85.89 | 84.89 |


sm training output

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
  0  140328.007    9713.144   81.386   69.818   66.727   68.237   91.036  100.000     6500     6424
  1   95301.408    5064.493   84.296   76.306   75.165   75.731   92.887  100.000     6197     6254
  2   81858.447    3635.719   85.439   77.488   76.421   76.951   93.682  100.000    10615    10428
  3   74299.650    2738.703   86.115   77.778   77.080   77.427   94.140  100.000    10497    10431
  4   68887.638    2227.522   86.737   77.544   77.080   77.311   94.392  100.000    10404    10339
  5   64298.027    1766.455   87.304   77.858   77.858   77.858   94.617  100.000     7621     6241
  6   61905.430    1505.741   87.480   78.138   77.858   77.998   94.719  100.000    10538     9815
  7   59213.529    1273.619   87.572   77.798   77.798   77.798   94.814  100.000     6155     6155
  8   56951.210    1104.528   87.714   77.564   77.379   77.472   94.930  100.000    10223    10270
  9   54658.723    1079.417   87.897   77.624   77.439   77.531   94.927  100.000     6292     6227
 10   52977.966     932.116   88.002   77.379   77.379   77.379   95.017  100.000     6189     6162
 11   51874.855     800.744   88.200   77.738   77.319   77.528   95.135  100.000     6233     6244
 12   50026.206     736.797   88.311   77.281   76.541   76.909   95.176  100.000    10411     7271
 13   49049.945     712.090   88.436   77.221   76.481   76.849   95.229  100.000    10293    10285
 14   47388.001     647.277   88.573   76.960   76.361   76.660   95.253  100.000     7194     6375
 15   47097.322     628.981   88.525   76.354   75.943   76.148   95.292  100.000     6091     6088
 16   44984.461     606.129   88.667   76.881   76.421   76.651   95.256  100.000    10292    10445
 17   44119.642     525.098   88.698   77.114   76.421   76.766   95.286  100.000     6647     6485
 18   43658.052     525.569   88.771   76.891   76.062   76.474   95.330  100.000    10247    10207
 19   42470.731     474.503   88.788   76.584   75.943   76.262   95.294  100.000     6089     6114
 20   41664.124     553.536   88.923   76.584   75.943   76.262   95.297  100.000    10200    10213
 21   41395.161     510.137   88.807   77.549   76.481   77.011   95.325  100.000    10207    10296
 22   40159.752     407.179   88.863   77.529   76.601   77.062   95.308  100.000     6658     6288
 23   39051.428     430.688   89.041   77.556   76.721   77.136   95.346  100.000    10146    10207
 24   38466.165     380.242   89.031   77.300   76.421   76.858   95.399  100.000     9764    10136
 25   37894.168     376.259   89.055   77.300   76.421   76.858   95.429  100.000     7419    10290
 26   37924.853     404.705   89.030   76.770   75.943   76.354   95.407  100.000     6100     6221
 27   36738.349     431.608   89.048   77.039   76.302   76.669   95.412  100.000     7700     6379
 28   36492.987     394.110   89.025   77.179   76.302   76.738   95.442  100.000    10228    10311
 29   35964.157     323.550   89.177   76.784   76.002   76.391   95.442  100.000     6265     6719


md training output

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
  0  125910.684    7581.375   84.740   82.388   81.747   82.067   93.613  100.000     6890     6078
  1   82290.664    3598.582   86.767   83.106   83.902   83.502   94.697  100.000     6495     6439
  2   70896.668    2724.962   87.885   85.312   85.159   85.235   95.204  100.000     9421     5991
  3   65209.235    2009.910   88.452   85.203   85.458   85.330   95.467  100.000     5999     6069
  4   59814.261    1610.226   88.626   85.646   86.056   85.851   95.582  100.000     5974     6042
  5   56224.491    1380.671   88.951   85.774   86.595   86.182   95.747  100.000    10059     6041
  6   54198.211    1154.143   89.329   85.207   86.176   85.689   95.878  100.000     6044    10186
  7   52108.301     985.541   89.417   86.012   86.475   86.243   95.903  100.000     6044     7271
  8   49958.169     845.080   89.650   85.859   86.116   85.987   95.988  100.000     9654     6023
  9   48384.992     719.396   89.924   85.748   86.056   85.902   96.016  100.000     6327    10027
 10   46380.225     676.546   89.960   85.510   85.817   85.663   96.024  100.000    10218     9840
 11   45604.224     698.995   89.960   85.723   85.877   85.800   96.057  100.000    10239    10192
 12   44398.147     574.467   90.036   85.603   85.757   85.680   96.024  100.000    10287    10000
 13   43337.924     481.058   90.052   85.569   85.877   85.723   96.090  100.000    10231    10321
 14   42124.992     494.542   90.032   86.139   86.655   86.396   96.131  100.000    10270    10062
 15   40621.568     416.546   90.129   85.841   86.715   86.276   96.136  100.000    10062    10072
 16   39310.532     471.311   90.213   85.697   86.774   86.233   96.139  100.000     9945    10052
 17   39134.968     373.413   90.343   85.883   87.014   86.445   96.109  100.000    10016     9996
 18   38455.480     439.943   90.360   85.816   86.894   86.351   96.095  100.000    10251    10142
 19   37404.740     390.323   90.429   85.782   86.655   86.216   96.139  100.000    10185    10144
 20   36686.148     394.971   90.478   85.833   86.655   86.242   96.131  100.000    10234    10144
 21   36009.123     364.551   90.485   85.917   86.894   86.403   96.169  100.000    10290    10552
 22   34603.331     320.468   90.496   85.630   86.655   86.139   96.153  100.000    10210    10540
 23   34493.734     321.966   90.537   85.529   86.655   86.088   96.180  100.000    10087    10498
 24   33816.817     337.003   90.561   84.979   85.996   85.485   96.164  100.000    10153    10104
 25   33440.944     295.308   90.430   85.604   86.475   86.038   96.199  100.000    10097    10169
 26   32788.472     311.987   90.421   85.258   86.176   85.714   96.166  100.000    10606    10346
 27   32166.413     275.568   90.515   85.264   85.877   85.569   96.112  100.000    10572    10243
 28   31914.538     270.512   90.584   85.110   85.518   85.313   96.081  100.000    10511    10122
 29   31567.351     249.013   90.507   85.348   85.757   85.552   96.101  100.000    10185     9963

lg training output

Itn  Dep Loss  NER Loss  UAS  NER P  NER R  NER F  Tag %  Token %  CPU WPS  GPU WPS
0 124258.144 7236.655 85.636 83.533 83.483 83.508 94.173 100.000 6226 6147
1 80057.797 3228.932 87.450 86.283 86.954 86.617 95.429 100.000 6193 6244
2 69706.663 2411.335 88.292 87.195 87.612 87.403 95.840 100.000 6194 6421
3 63253.803 1862.183 88.854 88.067 88.330 88.198 96.142 100.000 8149 7202
4 58398.192 1489.227 89.274 87.709 87.971 87.840 96.216 100.000 9753 9844
5 55081.437 1233.046 89.604 88.067 88.330 88.198 96.317 100.000 5905 6208
6 53114.583 1059.013 89.955 88.365 88.630 88.497 96.435 100.000 5708 5975
7 50720.677 910.304 90.040 88.148 88.570 88.358 96.408 100.000 10395 9472
8 48330.692 782.987 90.223 87.969 88.390 88.179 96.446 100.000 9343 10140
9 47155.682 705.529 90.446 88.109 88.689 88.398 96.441 100.000 5594 10007
10 45750.049 672.297 90.535 87.344 87.971 87.657 96.452 100.000 6098 6251
11 44437.575 649.625 90.549 87.011 87.792 87.399 96.512 100.000 10378 6162
12 43094.576 514.990 90.624 87.034 87.971 87.500 96.512 100.000 6111 5891
13 41935.703 485.979 90.673 87.537 88.270 87.902 96.534 100.000 6170 5911
14 40887.881 468.094 90.713 87.722 88.510 88.114 96.556 100.000 6033 6344
15 39608.918 409.788 90.678 87.864 88.390 88.126 96.564 100.000 10087 7881
16 38723.232 419.580 90.671 87.969 88.390 88.179 96.559 100.000 9331 10132
17 37933.653 394.874 90.613 87.388 87.911 87.649 96.545 100.000 5948 5941
18 37256.559 369.788 90.584 87.329 87.852 87.589 96.578 100.000 9515 9652
19 36489.335 305.808 90.660 87.440 87.911 87.675 96.572 100.000 9344 9432
20 35347.381 346.268 90.610 87.716 88.031 87.873 96.586 100.000 9378 9389
21 34922.059 353.310 90.769 87.657 87.971 87.814 96.591 100.000 10213 9547
22 34012.171 336.559 90.788 88.088 88.510 88.299 96.594 100.000 10272 9369
23 33653.177 345.451 90.734 88.186 88.450 88.318 96.556 100.000 10148 9369
24 33334.910 293.602 90.861 88.119 88.330 88.225 96.559 100.000 10321 9225
25 32328.676 294.129 90.823 88.158 88.211 88.184 96.575 100.000 10163 9312
26 31813.067 250.583 90.742 88.112 88.270 88.191 96.586 100.000 9302 10124
27 31284.447 262.157 90.755 88.450 88.450 88.450 96.589 100.000 5931 10207
28 31013.342 284.829 90.707 88.084 88.031 88.057 96.594 100.000 9286 9425
29 30368.747 252.931 90.773 87.926 88.031 87.978 96.600 100.000 10066 9395

I've tried to improve the sm model with pretraining on a subset (~250 million words) of the Norwegian News Corpus.

| Name | Pretrained | Size | POS | UAS | LAS | NER P | NER R | NER F |
| ---- | ---------- | ---- |----| --- | --- | ------| ----- | ----- |
| sm | No | 15 MB | 94.60 | 88.59 | 86.10 | 71.96 | 70.54 | 71.24 |
| sm | Yes | 15 MB | 95.07 | 90.14 | 87.82 | 78.92 | 78.69 | 78.81 |

I chose a subset of the corpus that was easy to convert into the correct format, so it can probably be further improved by pretraining on the full corpus, possibly in combination with NoWaC (700 million tokens). The vector model used for md and lg is trained on the combination of these two corpora.

I ran spacy pretrain with the default settings. Each iteration takes about 2.5 hours (30k words per second) on my hardware, so the default 1000 iterations seems a bit much for me at the moment. The scores above are from the 7th iteration, chosen arbitrarily.


Full output of training

$ python -m spacy train nb data/nb-sm-pretrained data/norne-spacy/ud/nob/no-ud-train-ner.json data/norne-spacy/ud/nob/no-ud-dev-ner.json --n-iter 30 -t2v data/pretraining/model7.bin
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'nb'
Counting training words (limit=0)
Loaded pretrained tok2vec for: ['tagger', 'parser', 'ner']

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
  0  108561.528    7789.699   86.292   79.152   78.157   78.651   93.029  100.000    10325        0
  1   74733.323    4058.799   87.773   80.417   80.850   80.633   94.126  100.000    10518        0
  2   66452.071    2935.499   88.482   80.699   81.568   81.131   94.505  100.000    10571        0
  3   60779.625    2301.898   88.892   80.969   81.987   81.475   94.834  100.000    10583        0
  4   57622.801    1824.278   89.237   81.883   82.226   82.054   95.105  100.000    10615        0
  5   54880.576    1503.344   89.409   82.181   82.525   82.353   95.281  100.000    10646        0
  6   52967.322    1309.013   89.502   81.726   82.166   81.946   95.426  100.000    10595        0
  7   51169.098    1048.847   89.616   81.981   82.226   82.103   95.492  100.000    10635        0
  8   49863.342     944.649   89.741   82.409   82.705   82.557   95.582  100.000    10723        0
  9   48321.433     857.080   89.842   81.992   82.286   82.139   95.632  100.000    10622        0
 10   46733.436     743.655   89.958   82.283   82.825   82.553   95.662  100.000    10664        0
 11   45430.614     710.145   90.128   82.627   82.825   82.726   95.687  100.000    10610        0
 12   44267.610     646.346   90.256   82.188   82.286   82.237   95.717  100.000    10571        0
 13   43798.329     594.519   90.298   81.910   82.107   82.008   95.725  100.000    10471        0
 14   42466.530     543.666   90.331   82.019   82.166   82.093   95.752  100.000    10493        0
 15   41715.754     542.159   90.237   81.530   81.628   81.579   95.777  100.000    10531        0
 16   40556.998     544.410   90.320   82.030   82.226   82.128   95.835  100.000    10534        0
 17   40201.797     387.598   90.367   81.618   82.107   81.862   95.815  100.000    10607        0
 18   39513.345     436.006   90.438   82.083   82.525   82.304   95.835  100.000    10542        0
 19   38447.469     410.579   90.442   81.797   82.286   82.041   95.835  100.000    10632        0
 20   37960.295     407.407   90.472   81.742   81.987   81.864   95.867  100.000    10585        0
 21   37349.251     416.729   90.553   81.672   81.867   81.769   95.851  100.000    10587        0
 22   36711.216     354.900   90.625   81.557   81.508   81.532   95.818  100.000    10507        0
 23   35945.099     436.860   90.519   81.243   81.388   81.315   95.824  100.000    10456        0
 24   35678.236     366.608   90.523   81.818   81.867   81.843   95.840  100.000    10545        0
 25   35094.173     357.058   90.544   82.074   81.927   82.001   95.859  100.000    10475        0
 26   34586.409     347.660   90.572   81.976   81.927   81.952   95.884  100.000    10513        0
 27   34059.730     313.799   90.528   82.085   81.987   82.036   95.859  100.000    10496        0
 28   34028.352     284.917   90.503   82.156   82.107   82.131   95.884  100.000    10482        0
 29   33059.099     302.737   90.595   81.829   81.927   81.878   95.911  100.000    10446        0
✔ Saved model to output directory
data/nb-sm-pretrained/model-final
✔ Created best model
data/nb-sm-pretrained/model-best
$ python -m spacy evaluate data/nb-sm-pretrained/model-best data/norne-spacy/ud/nob/no-ud-test-ner.json

================================== Results ==================================

Time      2.97 s
Words     29847
Words/s   10035
TOK       100.00
POS       95.07
UAS       90.14
LAS       87.82
NER P     78.92
NER R     78.69
NER F     78.81

@jarib What type of hardware do you have? I have an idle 1080 Ti graphics card that might speed things up a bit, but I'm not sure if you're already using something similar. Let me know, and I can set it up and run it for you on my machine.

@ohenrik I've been using p2.xlarge instance from AWS. I can share the .jsonl file used for pretraining if you want to give it a try.

I can give it a try :) Just share what I need to set it up.

@ohenrik To try pretraining:

  1. Install spacy-nightly
  2. Download and extract the NNC subset to e.g. data/nnc.jsonl
  3. Download and extract the large model (vectors) to e.g. data/nb-lg
  4. Run python -m spacy pretrain data/nnc.jsonl data/nb-lg data/pretraining

However I think we should eventually try a larger corpus in order to improve the sm model.

@honnibal Can spacy pretrain improve the lg model as well, or is that redundant?

@jarib Should I convert the contents of the ner_files folder I downloaded to JSONL using spaCy? I did not get any files named nnc.jsonl (or .jsonl files in general).

@ohenrik This file contains data/nnc.jsonl.

Strange... I'm not sure what happened, but I somehow mixed up the zip files with some old NER data I downloaded back on 17 December :p So no wonder nothing made sense. I've started training now and will give an update soon, once I know more about the speed.

The NoWaC corpus is 700 million tokens and as such a good candidate for pretraining. But it's licensed CC BY-NC-SA. Will pretraining on this corpus affect the licensing of the final model?

Will pretraining on this corpus affect the licensing of the final model?

Unfortunately, yes. The model will include embeddings based on that corpus, which would count as derivative works. It's not always 100% clear what the "SA" (share-alike) part of the license means for statistical models, but if a data resource is published as "NC" (non-commercial), we definitely won't be able to release the resulting spaCy model for anything other than non-commercial use. And since most spaCy users do commercial work, that'd be pretty limiting.

Depending on who published a resource, it's sometimes possible to negotiate special terms, so it might be worth reaching out to the authors to ask 🙂

Still really keen to get full Norwegian support :). And also quite pleased with the pretraining performance!

We can merge this with the "Adding models" master thread, #3056. Could you update once you have the licensing figured out?

@honnibal AFAICT there's only a licensing issue if we want to use the NoWaC corpus for pretraining. The group who published the NLPL vectors used above confirmed to me that they are licensed CC-BY.

I can try to negotiate special terms for NoWaC, or look for an alternative corpus that can be used for pretraining, if that’s a requirement for getting the Norwegian model included.

Almost finished with the v2.1 release. After that I'll be updating the datasets in the model training pipeline, which should let us publish the official nb models 🎉

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
