spaCy: Low quality of Swedish tokenization and lemmatization

Created on 22 Jul 2018 · 9 comments · Source: explosion/spaCy

How to reproduce the behaviour (requires Python 3.6+ because of f-strings)

from spacy.lang.sv import Swedish

nlp = Swedish()
doc = nlp(u"Provar att tokenisera en mening med ord i.")
# English translation: "Trying to tokenize a sentence with words in it."

width = 15
print(f"{'Token': <{width}} {'Lemma': <{width}}")
print(f"{'':-<{width}} {'':-<{width}}")
for token in doc:
    print(f"{token.text: <{width}} {token.lemma_: <{width}}")

Output:

Token           Lemma          
--------------- ---------------
Provar          Provar         
att             att            
tokenisera      tokenisera     
en              man            
mening          mening         
med             mede           
ord             ord            
i.              i.

Problems with this output:

  1. "i" and "." should be separate tokens.
  2. Provar -> Prova (not "Provar"). Translation: "Trying -> Try"
  3. en -> en (not "man"). Translation: "A -> A"
  4. med -> med (mede is not a word in Swedish).

More details on each incorrect tokenization or lemmatization:

  1. Since this error has to do with tokenization, I should probably look in spacy/lang/sv/tokenizer_exceptions.py, right? But "i." is not listed as an abbreviation there, so shouldn't it be split up into two tokens by default? (A quick way to check this is sketched right after this list.)
  2. Now we're dealing with lemmatization, so I think I should be looking in spacy/lang/sv/lemmatizer/lookup.py. I find "provar" (verb, translated: "testing") in that list, but it points to "provare" (noun, translated: "tester"), which is not only incorrect but also not the unchanged token I see above. Why doesn't it return the lemma it's mapped to?
  3. In lookup.py, "en" (determiner, translated: "a") is mapped to "man" (noun, translated: "man", or pronoun, translated: "you"). This is incorrect and should just be changed to "en". There's another case where "en" is a noun, but it has the same lemma then.
  4. In lookup.py, "med" (adposition, translated: "with") is mapped to "mede" (not a real word).
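
For reference, point 1 can be verified quickly, assuming the spaCy v2 layout where the merged tokenizer exceptions are exposed on the language's Defaults class:

# A quick check (assumes the spaCy v2 internals, where the merged tokenizer
# exceptions live on the language's Defaults): if "i." shows up here, that
# explains why it is kept as one token instead of being split into "i" + ".".
from spacy.lang.sv import Swedish

print("i." in Swedish.Defaults.tokenizer_exceptions)  # True -> treated as an exception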

In general, I would like to improve the quality of the Swedish tokenization and lemmatization. It seems lookup.py is in really bad shape, with three lemmatization errors in a simple eight-token sentence. I'm wondering whether fixing errors one by one is the right move here. Should I instead try to figure out a separate way to do lemmatization?

Maybe one way to get this right would be to set up statistical tests against a big pre-tagged corpus like SUC 3.0 and try to improve the score using another method. Any ideas on what a good next step for me would be?
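
To make that idea concrete, here is a rough sketch of the kind of evaluation I have in mind: run the current lookup lemmatizer over a pre-lemmatized treebank in CoNLL-U format and report token-level lemma accuracy. The file path below is a placeholder for whichever Swedish treebank ends up being used.

# A rough sketch, not a finished benchmark: compare the lookup lemmatizer's
# output against gold lemmas from a CoNLL-U file (the path is a placeholder).
from spacy.lang.sv import Swedish

nlp = Swedish()
correct = total = 0

with open("sv-ud-train.conllu", encoding="utf-8") as f:  # placeholder path
    for line in f:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multi-word tokens and empty nodes
        form, gold_lemma = cols[1], cols[2]
        doc = nlp(form)  # lemmatize the surface form in isolation
        pred_lemma = doc[0].lemma_ if len(doc) == 1 else form
        total += 1
        correct += int(pred_lemma.lower() == gold_lemma.lower())

print(f"Lemma accuracy: {correct / total:.2%} ({correct}/{total})")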

Your Environment

  • spaCy version: Latest nightly version, on master branch.
  • Platform: Darwin-17.6.0-x86_64-i386-64bit
  • Python version: 3.6.5
Labels: feat / lemmatizer · help wanted · lang / sv · perf / accuracy

All 9 comments

Thanks for the super detailed report! 👍

Since this error has to do with tokenization, I should probably look in spacy/lang/sv/tokenizer_exceptions.py right? But "i." is not listed as an abbreviation, so shouldn't it by default be split up into two tokens?

That's correct, yes. And the example is an interesting edge case: I think it might be caused by the language-independent base exceptions, which include all single lowercase letters plus ".". So we should either rethink including those by default, overwrite them in Swedish, or have each language specify them separately, so languages can make their own exceptions here.
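
Just to illustrate what that override could look like (a sketch only, assuming the current module layout with BASE_EXCEPTIONS in spacy.lang.tokenizer_exceptions and update_exc in spacy.util; not the final patch):

# Illustration only: build the Swedish exceptions from the shared base list,
# but drop the generic single-lowercase-letter-plus-period entries that clash
# with Swedish, where "i" is an ordinary preposition. Module paths assume the
# spaCy v2 layout.
import string

from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
from spacy.lang.sv.tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from spacy.util import update_exc

single_letter_abbrevs = {c + "." for c in string.ascii_lowercase}
filtered_base = {
    orth: exc for orth, exc in BASE_EXCEPTIONS.items()
    if orth not in single_letter_abbrevs
}

exceptions = update_exc(filtered_base, TOKENIZER_EXCEPTIONS)
print("i." in exceptions)  # expected: False, so "i." would be split into "i" + "."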

So now we're dealing with lemmatization, so I think I should be looking in spacy/lang/sv/lemmatizer/lookup.py. I find "provar" (verb, translated: "testing") in that list, but it points to "provare" (noun, translated: "tester"), which is not only incorrect, but also not the unchanged token I see above. Why doesn't it return the token that it's mapped to?

I think what's happening here is that, by default, the lookup table is case-sensitive, so it finds "provar" but not "Provar". The simplest workaround would be to just provide both variations of tokens that can occur at the beginning of a sentence (not very satisfying), or to allow languages to specify whether they should use case-insensitive or case-sensitive lookup lemmatization (e.g. Swedish vs. German).
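
A minimal illustration of the case-insensitive fallback idea (a standalone sketch, not spaCy's actual lemmatizer code; the table entry assumes the "provar" mapping has been corrected to point to "prova"):

# Standalone sketch: try the exact form first, then the lowercased form, and
# return the form itself if neither is in the table.
def lookup_lemma(table, string):
    if string in table:
        return table[string]
    lowered = string.lower()
    if lowered in table:
        return table[lowered]
    return string

table = {"provar": "prova"}  # hypothetical, corrected lookup entry
print(lookup_lemma(table, "Provar"))  # -> "prova", even sentence-initially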

However, the ideal solution we'd love to transition to in the future would be rule-based lemmatization. We will be working with someone to help us get this moving forward for Spanish and German, and we've also been getting more contributions in this direction recently (see Norwegian and Greek). So if you want to help out with Swedish, the most valuable thing to do would probably be to look into writing lemmatizer rules. For inspiration, you can check out the English and Norwegian lemmatizers. It'd be really nice if we could move towards a rule-based system, train a statistical model, and then retire those giant lookup lists! 🎉
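
For a rough idea of what those rule tables look like: they are essentially per-PoS suffix rewrites plus explicit exception lists. A hypothetical Swedish-flavoured sketch (the rules below are made-up examples to show the format, not a vetted rule set) might be:

# Hypothetical sketch of the rule-table format used by the rule-based
# lemmatizers: per-PoS suffix rewrites plus irregular exceptions. The Swedish
# rules below are illustrative only, not a reviewed rule set.
LEMMA_RULES = {
    "verb": [
        ["ar", "a"],   # provar -> prova (present tense -> infinitive)
        ["ade", "a"],  # provade -> prova (past tense -> infinitive)
    ],
    "noun": [
        ["en", ""],    # meningen -> mening (definite singular -> base form)
        ["ar", ""],    # meningar -> mening (plural -> base form)
    ],
}

LEMMA_EXC = {
    "verb": {
        "var": ("vara",),  # irregular: "var" (was) -> "vara" (to be)
    },
}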

Thanks for the detailed answer!

I'll fix the tokenizer bug in a separate PR. Since this is my first attempt at hacking on spaCy, I'll just see if I can make Swedish an exception rather than change the format for all the languages. I'll leave that call to the pros :)

As for the lemmas, I'll see if I can get some kind of quality measure going based on Universal Dependencies corpora. Right now I only have a vague bad feeling about the general quality; with numbers I could substantiate that and know whether any new solution I devise works better. Also, if I use UD, maybe that code could be reused for other languages as well. The lemmas part will probably take a while, so don't let this block someone else from working on it!

Thanks again, looking forward to full support for Swedish down the line! :)

I'd also like to do Swedish lemmatization and would be happy to help out with creating a rule-based Swedish lemmatizer. I skimmed through some parts of lookup.py and the sad fact is that more than 50% of what I saw was questionable.

@spindelmanne I'm currently not working on this, so feel free to give it a go. I'm not sure where to start when creating a rule-based lemmatizer... I see the one for Norwegian, but I don't understand where the rules come from. Trial and error on real data?

@spindelmanne, @EmilStenstrom I just spotted that there is an ongoing conversation about Swedish rule-based lemmatization and thought I could be of some help, since I wrote a rule-based lemmatizer for Greek from scratch after getting disappointed with the results of the lookup.

So, let's begin. First of all, we have to clarify that when it comes to lemmatization, you currently choose between the rule-based lemmatizer and the lookup. As you can see here, lookup lemmatization is only applied when no rules are specified. So, if you specify rules, you have to go with them.

When it comes to the question of whether you should choose rule-based lemmatization for your language, ask yourself: do you trust your data? Lookup tables are a dirty approach to the problem, because there is so much data that a human cannot check it by hand. Even worse, there are words whose lemmas differ depending on their PoS tags (I don't know if that applies to Swedish, but keep it in mind). On top of that, the data are often scraped from Wiktionary or other sources and thus not well preprocessed; you may find duplicates in the list, or words that don't even exist (if you ever manage to read them all and find them). And even if you set all of that aside, the thing with lookups is that they don't guess; they can't understand that the lemma for "plays" is the same as the lemma for "playing" or "play", you have to hardcode it yourself. All of this leads to a hard-to-maintain, non-scalable approach to the problem.

So, is rule-based lemmatization far better? The answer is: it depends. The advantages of rule-based lemmatization are pretty obvious: you write some general rules, and those rules then generalize to thousands of words. And what if a word's lemma is formed in a completely unique way? Then you write it as an exception. However, this method is not ideal either, because it is sensitive to mistakes by the PoS tagger. The way it works is that you specify rules for each PoS category; the PoS tagger decides the tag for the word you are looking at, and the lemmatizer then applies a series of transformations to the suffix of the word (based on the rules you have provided) in order to match a known lemma. Now you can pretty much guess the problems that arise: a mistake in the PoS tag will probably ruin the lemmatization. A rarer (but still possible) problem arises when the morphological tagger fails to recognize that a word is already a lemma and tries to transform it anyway. See this issue for further insights into that. One last thing to mention: you definitely have to provide a list of lemmas for this approach. But compared to the lookup approach, you now only need to provide the lemmas, not all the words that can lead to them.
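
To make the mechanics concrete, here is a stripped-down sketch of that process (my own simplification, not spaCy's implementation): for a given PoS, apply each suffix rule and accept the first candidate that appears in a set of known lemmas.

# Simplified sketch of rule-based lemmatization (not spaCy's actual code):
# apply the suffix rules for the predicted PoS and keep the first candidate
# that is a known lemma; fall back to the word itself otherwise.
RULES = {"verb": [["ar", "a"], ["ade", "a"]]}      # illustrative Swedish rules
KNOWN_LEMMAS = {"verb": {"prova", "tokenisera"}}   # the "index" of valid lemmas
EXCEPTIONS = {"verb": {"var": "vara"}}             # irregular forms

def lemmatize(word, pos):
    word = word.lower()
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    if word in KNOWN_LEMMAS.get(pos, set()):
        return word  # already a lemma, leave it alone
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in KNOWN_LEMMAS.get(pos, set()):
                return candidate
    return word  # no rule produced a known lemma

print(lemmatize("Provar", "verb"))  # -> "prova"
print(lemmatize("var", "verb"))     # -> "vara" (via the exception table)

Note how a wrong PoS tag (say, "noun" instead of "verb" for "provar") immediately derails the result; that is exactly the sensitivity described above.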

To sum up: if you trust your PoS tagger, I would say go for the rule-based approach. Now, let's say you decide to try it; some things to keep in mind:

  • For each category, search for the required rules exhaustively. What I mean is: grab a grammar book and check, for example, all the different forms a verb can take, and write down all the appropriate rules. It will be a pain at the beginning, but I don't think there is any other viable way to proceed; otherwise you will forget something and then struggle to figure out what it is.
  • Write tests. I think this is the most valuable advice I can give you. When you add a rule you can fix one thing but break something else. It is important to have a gold dataset of words and their correct lemmas and test your lemmatizer against it (see the test sketch after this list).
  • Test frequently. A good habit is to run your lemmatizer against these tests often, so you can easily pinpoint which newly added rule is wrong. I believe this will save you some time.
  • Always check whether the PoS tagger is mistaken. Don't assume too quickly that your lemmatizer is wrong and delete correct rules.
  • Try to construct rules that are specific, but not too specific. I know this sounds a bit weird, but overly general rules will probably lead to wrong lemmatizations, while overly specific rules will 1. increase the computational cost, 2. get confusing, 3. not generalize. There is a real trade-off here.
  • Write a comment for each rule. An example for each rule is probably the best way to remember why you added it.
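
Here is the kind of gold-data test I mean, as a pytest-style sketch built on the hypothetical lemmatize(word, pos) helper from the earlier snippet (the gold tuples are invented examples; a real suite would load a larger, human-checked list, e.g. extracted from SUC or a UD treebank):

# Pytest-style sketch of a gold-data regression test for the lemmatizer.
# `lemmatize` is the hypothetical helper from the earlier snippet, imported
# from a hypothetical module; the gold tuples are invented examples.
import pytest

from my_lemmatizer import lemmatize  # hypothetical module with the rule-based sketch

GOLD = [
    ("provar", "verb", "prova"),
    ("provade", "verb", "prova"),
    ("var", "verb", "vara"),
]

@pytest.mark.parametrize("form,pos,expected", GOLD)
def test_lemmatizer_against_gold(form, pos, expected):
    assert lemmatize(form, pos) == expected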

I really hope this helps! Go for it, and I think the Swedish lemmatizer will be amazing :100:

According to the Swedish Treebank, the Stockholm-Umeå Corpus has human-annotated lemmas. Maybe those can be extracted into a sort of test suite for any rule-based lemmatizer created here.

I previously wrote a benchmarking script for comparing tokenization results for Swedish. It shouldn't be too hard to adapt it for lemmatization instead: https://github.com/explosion/spaCy/issues/2608

Merging with #3052, the model inaccuracies master thread.

I've made a lot of progress recently on the rich morphology support: https://github.com/explosion/spaCy/pull/2807. I don't want to merge it for the v2.1 release, as it's a big patch and still a bit rough. But it's coming along.

In the meantime, if runtime efficiency isn't a problem, you could also try the StanfordNLP system, which you can use through Ines's wrapper: https://github.com/explosion/spacy-stanfordnlp

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
