spaCy: Is there a plan to include a Finnish model, or is it already in development?

Created on 23 Jul 2018 · 6 Comments · Source: explosion/spaCy

Feature description

The area of the library is the models. There is indeed a description of how to add a new language to spaCy, but because I am quite new to spaCy, the process seems quite daunting to me. So I wonder if there's a plan to add Finnish to spaCy's existing models, or if it's already under development.

Could the feature be a custom component or spaCy plugin?

Yes, if others work on it as well.

Labels: enhancement, help wanted, lang / fi, models

omitobi

All 6 comments

The Nordic languages are definitely high on our list. The Finnish language data in spaCy is still a bit sparse, so there might be a few things that need to be improved before we can train a model.

The process requires the following steps and components:

language data: shipped with spaCy, see here for Finnish. The tokenization should be reliable, and there should be a tag map that maps the tags used in the training data to coarse-grained tags like NOUN and optional morphological features.
training corpus: the model needs to be trained on a suitable corpus, e.g. an existing Universal Dependencies treebank. Commercial-friendly treebank licenses are always a plus. Data for tagging and parsing is usually easier to find than data for named entity recognition – in the long term, we want to do more data annotation ourselves using Prodigy, but that's obviously a much bigger project. In the meantime, we have to use other available resources (academic etc.).
data conversion: spaCy comes with a range of built-in converters via the spacy convert command that take .conllu files and output spaCy's JSON format. See here for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
training pipeline: if we have language data plus a suitable training corpus plus a conversion pipeline, we can run spacy train to train a new model.
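Before running a converter, the first three points above can be sanity-checked with a few lines of plain Python. The following is only a minimal sketch, not part of spaCy: it assumes the standard CoNLL-U layout of 10 tab-separated columns per token, uses a tiny made-up sample instead of the real Finnish treebank, and the helper name `check_and_map` as well as the `{"pos": ...}` dictionary shape are hypothetical stand-ins for a real tag map.

```python
from collections import Counter, defaultdict

# Tiny made-up sample in CoNLL-U layout (10 tab-separated columns per token).
# It is NOT taken from the actual Finnish treebank.
SAMPLE = (
    "# text = Koira juoksee.\n"
    "1\tKoira\tkoira\tNOUN\tN\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tjuoksee\tjuosta\tVERB\tV\tNumber=Sing|Person=3\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\tPunct\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def check_and_map(conllu_text):
    """Return (tag_map, problems) for a CoNLL-U string.

    tag_map maps each fine-grained tag (XPOS column) to the coarse-grained
    tag (UPOS column) it most often appears with. problems lists formatting
    issues that might trip up a converter.
    """
    problems = []
    seen = defaultdict(Counter)
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append((line_no, f"expected 10 columns, got {len(cols)}"))
            continue
        token_id, upos, xpos = cols[0], cols[3], cols[4]
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        seen[xpos][upos] += 1

    tag_map = {}
    for xpos, counts in seen.items():
        if len(counts) > 1:
            # One fine-grained tag with several coarse tags needs a manual decision.
            problems.append(
                (None, f"tag {xpos!r} maps to several coarse tags: {sorted(counts)}")
            )
        tag_map[xpos] = {"pos": counts.most_common(1)[0][0]}
    return tag_map, problems


tag_map, problems = check_and_map(SAMPLE)
print(tag_map)
print(problems)
```

An empty `problems` list only means the file has the expected column layout and an unambiguous fine-to-coarse tag mapping; it does not guarantee that a particular converter will accept it.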

With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models. In order to train and distribute "official" spaCy models, we need to be able to integrate and reproduce the full training pipeline whenever we release a new version of spaCy that requires new models (so we can't just upload a model trained by someone else).

But this also means that users can contribute by sharing their data conversion and training commands. So if you end up experimenting with the Finnish Universal Dependencies treebank and find an approach that works, that'd be super cool 🙂
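One lightweight way to share conversion and training commands is to write them down as a small script. The sketch below only builds and prints the command lines, it does not run them. It assumes the spaCy 2.x-era CLI, where `spacy convert` takes an input file and an output directory and `spacy train` takes a language code, an output directory, and the training and development data; newer spaCy versions use a different interface. All file and directory names, and the helper `build_commands`, are hypothetical placeholders.

```python
import shlex


def build_commands(lang, train_conllu, dev_conllu, corpus_dir,
                   train_json, dev_json, model_dir):
    """Build (but do not run) the shell commands for one training recipe.

    Assumes the spaCy 2.x-era argument order; check `python -m spacy --help`
    for the version you actually have installed.
    """
    commands = [
        # Convert both treebank splits to spaCy's JSON training format.
        ["python", "-m", "spacy", "convert", train_conllu, corpus_dir],
        ["python", "-m", "spacy", "convert", dev_conllu, corpus_dir],
        # Train a model for `lang` on the converted data.
        ["python", "-m", "spacy", "train", lang, model_dir, train_json, dev_json],
    ]
    # shlex.join quotes arguments so the lines can be pasted into a shell.
    return [shlex.join(cmd) for cmd in commands]


# Hypothetical paths; replace them with the files you actually use.
recipe = build_commands(
    lang="fi",
    train_conllu="fi-train.conllu",
    dev_conllu="fi-dev.conllu",
    corpus_dir="corpus",
    train_json="corpus/fi-train.json",
    dev_json="corpus/fi-dev.json",
    model_dir="models/fi",
)
for line in recipe:
    print(line)
```

Keeping the recipe in one place like this makes it easier for someone else to reproduce the same steps on their own copy of the treebank.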

ines on 23 Jul 2018

Thanks for your response to my question. Like I said, I am very new to models, NLP, training, and pipeline terminology, but I have checked the Universal Dependencies treebank and also a demo page that shows word similarities. If I manage to get my hands on this, I'll report here or open a new issue. Otherwise, it'd be nice if someone is already working on it.

omitobi on 30 Jul 2018

Just keeping an eye on this. Just in case an update pops up, I'll be glad to know =)

omitobi on 6 Nov 2018

Me too 💃
I don't know if you are aware (I suppose you are), but there is a Universal Dependencies treebank for Finnish under a Creative Commons license. I think it was developed by the University of Turku. Could that be used?