from spacy.lang.sv import Swedish

nlp = Swedish()
doc = nlp(u"Provar att tokenisera en mening med ord i.")
# English translation: "Trying to tokenize a sentence with words in it."
width = 15
print(f"{'Token': <{width}} {'Lemma': <{width}}")
print(f"{'':-<{width}} {'':-<{width}}")
for token in doc:
    print(f"{token.text: <{width}} {token.lemma_: <{width}}")
Output:
Token           Lemma
--------------- ---------------
Provar          Provar
att             att
tokenisera      tokenisera
en              man
mening          mening
med             mede
ord             ord
i.              i.
In general, I would like to improve the quality of the Swedish tokenization and lemmatization. It seems lookup.py is in really bad shape, with three lemmatization errors in a simple eight-token sentence. I'm wondering whether fixing errors one by one is the right move here, or whether I should instead try to figure out a separate way to do lemmatization.
Maybe one way to get this right would be to set up statistical tests against a big pre-tagged corpus like SUC 3.0 and try to improve the score using another method? Any ideas on what a good next step for me would be?
Thanks for the super detailed report! 👍
Since this error has to do with tokenization, I should probably look in spacy/lang/sv/tokenizer_exceptions.py, right? But "i." is not listed as an abbreviation there, so shouldn't it be split into two tokens by default?
That's correct, yes. And the example is an interesting edge case: I think it's caused by the language-independent base exceptions, which include every single lowercase letter followed by a period ("a.", "b.", ..., "i."). So we should either rethink including those by default, override them in Swedish, or have each language specify them separately, so languages can make their own exceptions here.
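For concreteness, here is a rough sketch of the per-language override idea, assuming a v2-style spacy/lang/sv/tokenizer_exceptions.py. The filter and the "t.ex." entry are purely illustrative, and exactly where the base exceptions get merged in differs between languages and versions, so treat this as the idea rather than a drop-in file.

# Sketch: drop the language-independent single-letter abbreviations ("a.",
# "b.", ..., "i.") before merging in genuine Swedish exceptions.
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.symbols import ORTH, LEMMA
from spacy.util import update_exc

# Keep every base exception except "one lowercase letter + period".
_filtered_base = {
    orth: tokens
    for orth, tokens in BASE_EXCEPTIONS.items()
    if not (len(orth) == 2 and orth[0].islower() and orth[1] == ".")
}

# Illustrative Swedish abbreviation that really should stay one token.
_exc = {
    "t.ex.": [{ORTH: "t.ex.", LEMMA: "till exempel"}],
}

TOKENIZER_EXCEPTIONS = update_exc(_filtered_base, _exc)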
Now we're dealing with lemmatization, so I think I should be looking in spacy/lang/sv/lemmatizer/lookup.py. I find "provar" (verb, translated: "testing") in that list, but it points to "provare" (noun, translated: "tester"), which is not only incorrect, but also not the unchanged token I see above. Why doesn't it return the token it's mapped to?
I think what's happening here is that by default, the lookup table is case-sensitive, so it finds "provar" but not "Provar". The simplest workaround would be to just provide both variations for tokens that can occur at the beginning of a sentence (not very satisfying), or to allow languages to specify whether they should use case-insensitive or case-sensitive lookup lemmatization (e.g. Swedish vs. German).
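To illustrate the case-insensitive idea (this is just the fallback logic, not an existing spaCy hook): try the exact form first, then the lowercased form, and give the token back unchanged if neither is in the table. The LOOKUP entry is illustrative, not the real table.

# Sketch of case-insensitive lookup lemmatization with a lowercase fallback.
def lookup_lemma(lookup_table, word):
    if word in lookup_table:
        return lookup_table[word]
    lowered = word.lower()
    if lowered in lookup_table:
        return lookup_table[lowered]
    return word  # unknown form: return it unchanged

LOOKUP = {"provar": "prova"}           # illustrative entry
print(lookup_lemma(LOOKUP, "Provar"))  # "prova", found via the lowercased form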
However, the ideal solution we'd love to transition to in the future would be rule-based lemmatization. We will be working with someone to help us move this forward for Spanish and German, and we've also been getting more contributions in this direction recently (see Norwegian and Greek). So if you want to help out with Swedish, the most valuable thing to do would probably be to look into writing lemmatizer rules. For inspiration, you can check out the English and Norwegian lemmatizers. It'd be really nice if we could move towards a rule-based system and train a statistical model, and then retire those giant lookup lists! 🎉
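To make "lemmatizer rules" concrete, here's a hypothetical sketch of what a Swedish rule file could contain, patterned on the English and Norwegian lemmatizer data. The suffix rules below are illustrative examples only, not a vetted rule set.

# Hypothetical Swedish lemma rules: each entry rewrites a suffix, and the
# result is accepted if it matches a known lemma in the index.
VERB_RULES = [
    ["ade", "a"],   # "provade" -> "prova"
    ["ar", "a"],    # "provar"  -> "prova"
    ["at", "a"],    # "provat"  -> "prova"
]

NOUN_RULES = [
    ["arna", "e"],  # "pojkarna" -> "pojke"
    ["en", ""],     # "meningen" -> "mening"
    ["ar", "e"],    # "pojkar"   -> "pojke"
]

LEMMA_RULES = {"verb": VERB_RULES, "noun": NOUN_RULES}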
Thanks for the detailed answer!
I'll fix the tokenizer bug in a separate PR. Since this is my first attempt at hacking on spaCy, I'll just see if I can make Swedish an exception, not change the format for all the languages. I'll leave that call to the pros :)
As for the lemmas, I'll see if I can get some kind of quality measure going based on Universal Dependencies corpora. Right now I only have a vague bad feeling about the general quality; with numbers I could substantiate that, and know whether a new solution I devise works any better. Also, if I use UD, maybe that code could be used for other languages as well. The lemmas part will probably take a while, so don't let this block someone else from working on it!
Thanks again, looking forward to full support for Swedish down the line! :)
I'd also like to do Swedish lemmatization and would be happy to help out with creating a rule-based Swedish lemmatizer. I skimmed through some parts of lookup.py and the sad fact is that more than 50% of what I saw was questionable.
@spindelmanne I'm currently not working on this, so feel free to give it a go. I'm not sure where to start when creating a rule-based lemmatizer... I see the one for Norwegian, but I don't understand where to get the rules from, trial and error on real data?
@spindelmanne, @EmilStenstrom I just spotted that there is an ongoing conversation about Swedish rule-based lemmatization and thought I could be of some help, since I wrote a rule-based lemmatizer for Greek from scratch after getting disappointed with the results of the lookup.
So, let's begin. First of all, we have to clarify that when it comes to lemmatization, you currently choose between the rule-based lemmatizer and the lookup. As you can see here, lookup lemmatization is only applied when no rules are specified. So, if you specify rules, you have to go with them.
When it comes to the question of whether you should choose rule-based lemmatization for your language, you can ask yourself: do you trust your data? Lookup tables are a dirty approach to the problem, because you have so much data that a human cannot check it by hand. Even worse, there are words whose lemmas differ depending on their PoS tags (I don't know if that applies to Swedish, but keep it in mind). Added to this, the data is often scraped from Wiktionary or other sources and thus not well preprocessed; you may find duplicates in the list, or words that don't even exist (if you ever manage to read them all and find them). But even if you forget all the bad things I mentioned above, the thing with lookups is that they don't guess; they can't understand that the lemma for plays is the same as the lemma for playing or play, you have to hardcode it yourself. All those reasons lead to a hard-to-maintain, non-scalable approach to the problem.
So, is rule-based lemmatization far better? The answer is: it depends. The advantages of rule-based lemmatization are pretty much obvious; you write some general rules and those rules then generalize to thousands of words. And what if a word's lemma is formed in a completely unique way? Then you have to write it as an exception. However, this method isn't ideal either. The reason is that it is sensitive to mistakes of the PoS tagger. The way it works is that you specify rules for each PoS category; the PoS tagger decides the tag for the word you are lemmatizing, and the lemmatizer then tries to apply a number of transformations to the suffix of the word (based on the rules you have provided) in order to match a known lemma. Now you can pretty much guess the problems that arise: a mistake in the PoS tag will probably ruin the lemmatization. A rarer (but still possible) problem arises when the morphological tagger fails to understand that a word is already a lemma and tries to transform it. See this issue for further insights into that. One last thing to mention: you do still have to provide a list of lemmas for this approach. But compared to the lookup approach, you now only need to provide the lemmas, not all the forms that can lead to them.
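Here's a rough sketch of how those pieces (index, exceptions, rules) fit together in spaCy v2's rule-based Lemmatizer. The Swedish data below is a tiny made-up sample, and the constructor/call shown is the v2.0/2.1 API.

# Sketch: tiny made-up Swedish data plugged into spaCy v2's rule-based Lemmatizer.
from spacy.lemmatizer import Lemmatizer

index = {"verb": {"prova"}, "noun": {"mening"}}  # known lemmas per PoS
exc = {"verb": {"var": ("vara",)}}               # irregular forms per PoS
rules = {"verb": [["ar", "a"], ["ade", "a"]],    # suffix rewrites per PoS
         "noun": [["en", ""], ["ar", ""]]}

lemmatizer = Lemmatizer(index, exc, rules)
print(lemmatizer("provar", "verb"))  # ['prova'], via the "ar" -> "a" rule
print(lemmatizer("var", "verb"))     # ['vara'], via the exceptions table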
To sum it up, if you trust your PoS tagger, I would say you should go for the rule-based approach. Now, let's say you decide to try it; there are some things to keep in mind.
I really hope it helped! Go for it and I think the Swedish lemmatizer will be amazing 💯
According to Swedish Treebank, the Stockholm-Umeå Corpus has human-annotated lemmas. Maybe those can be extracted into a sort of test suite for any rule-based lemmatization created here.
I wrote a benchmarking script for comparing tokenization results for Swedish previously. Shouldn't be too hard to adapt it for lemmatization instead: https://github.com/explosion/spaCy/issues/2608
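Not the #2608 script itself, but here's a rough sketch of what a lemma-accuracy check against a CoNLL-U treebank (UD or a converted SUC) could look like. The filename is a placeholder, and sentences whose tokenization doesn't line up with the gold tokens are skipped so tokenization errors don't get mixed into the lemma score.

# Sketch: lemma accuracy of the current Swedish lookup against a CoNLL-U file.
from spacy.lang.sv import Swedish

nlp = Swedish()

def read_conllu(path):
    """Yield (words, gold_lemmas) per sentence from a CoNLL-U file."""
    words, lemmas = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if words:
                    yield words, lemmas
                words, lemmas = [], []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multi-word/empty tokens
                continue
            words.append(cols[1])   # FORM column
            lemmas.append(cols[2])  # LEMMA column
    if words:
        yield words, lemmas

correct = total = skipped = 0
for words, gold in read_conllu("sv-ud-dev.conllu"):  # placeholder path
    doc = nlp(" ".join(words))
    if len(doc) != len(words):  # tokenization mismatch; not what we're measuring
        skipped += 1
        continue
    for token, gold_lemma in zip(doc, gold):
        total += 1
        correct += token.lemma_.lower() == gold_lemma.lower()
print(f"Lemma accuracy: {correct / total:.2%} ({skipped} sentences skipped)")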
Merging with #3052, the model inaccuracies master thread.
I've made a lot of progress recently on the rich morphology support: https://github.com/explosion/spaCy/pull/2807. I don't want to merge this for the v2.1 release, as it's a big patch and it's still a bit rough. But it's coming along.
In the meantime, if runtime efficiency isn't a problem, you could also try the StanfordNLP system, which you can use through Ines's wrapper: https://github.com/explosion/spacy-stanfordnlp
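A minimal usage sketch for that wrapper, following its README; it assumes the Swedish StanfordNLP models have already been fetched with stanfordnlp.download('sv').

# Sketch: Swedish lemmas/PoS via StanfordNLP behind a spaCy-compatible API.
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="sv")  # requires stanfordnlp.download("sv") first
nlp = StanfordNLPLanguage(snlp)

doc = nlp("Provar att tokenisera en mening med ord i.")
for token in doc:
    print(token.text, token.lemma_, token.pos_)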
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.