Hi, I tried to do French lemmatization with spaCy.
I tried this code for testing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.fr import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u'yeux')
print(lemmas)
But the trouble is that I get the following error:
ImportError: No module named lang.fr
So how can I solve this?
Which version of spaCy are you using? Could you run python -m spacy info --markdown and post the result here?
spaCy v2.0 moved all language data to a submodule lang (see here for details and other backwards incompatibilities). Before, the language data lived in spacy.[lang]. So if it turns out you're using v1.x, you'd either have to change it to spacy.fr, or upgrade to v2.x.
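For example (the first import is the v1.x style, the second the v2.x style):
# spaCy v1.x - language data lives directly under spacy.[lang]
from spacy.fr import French
# spaCy v2.x - the same data moved to the lang submodule
from spacy.lang.fr import French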
Using spaCy v2.0 doesn't solve the problem, it only gives a new one:
from spacy.lang.fr import LEMMA_INDEX
ImportError: cannot import name 'LEMMA_INDEX'
I feel like these LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES only exist for English.
Can you help with that?
Ah yes - I hadn't looked at that part in detail, but that's correct. The English language data currently defines rules, while the French data only uses lookup-based lemmatization (i.e. only comes with a lookup table).
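So with the French model installed (python -m spacy download fr), the lookup table is applied automatically - a minimal sketch:
import spacy

nlp = spacy.load('fr')
doc = nlp(u'yeux')
# the lemma comes straight from the lookup table - no POS-sensitive rules involved
print(doc[0].lemma_)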
However, you can take inspiration from the English lemmatizer rules, adapt those for French and add them to the French class (or load them in from somewhere else).
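As a rough illustration of what that could look like - the FR_LEMMA_* data below is made up to mirror the shape of the English lemmatizer data, not real linguistic rules:
from spacy.lemmatizer import Lemmatizer

# illustrative entries only, in the same format as the English
# LEMMA_INDEX, LEMMA_EXC and LEMMA_RULES
FR_LEMMA_INDEX = {'noun': set(['œil'])}
FR_LEMMA_EXC = {'noun': {'yeux': ('œil',)}}
FR_LEMMA_RULES = {'noun': [['s', '']]}  # e.g. strip a plural -s

lemmatizer = Lemmatizer(FR_LEMMA_INDEX, FR_LEMMA_EXC, FR_LEMMA_RULES)
print(lemmatizer(u'yeux', 'noun'))  # should yield 'œil' via the exceptions table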
I have been using the Lefff lemmatizer for some time and I think it is the best. It has large coverage and takes the POS tag into account when lemmatizing. I do not know how hard it would be to integrate it... or just take inspiration. :)
Because if I understood correctly, the way lemmatization is done in spaCy now does not take the POS tag into account, it only looks up words in the table, right? (I kind of mentioned it in another issue)
Otherwise, yes, it would be great to adapt the rules for French.
@JonathanBonnaud Yes, that's one of the problems with lookup tables - they do work okay for simple, general purpose use cases, but they'll never be as good as more explicit rules and a statistical model.
If you want to play around with integrating the Lefff Lemmatizer, a good starting point would be to write a simple custom pipeline component. This lets you test the functionality in an isolated environment (without having to worry about spaCy's internals). See here for examples of other spaCy pipeline extensions developed by users. spaCy v2.0 now also allows adding your own custom attributes to the Doc, Span and Token - so you could start by adding the lemma to a custom token._.lefff_lemma attribute.
A pipeline component is a function that takes a Doc object, modifies it and returns it. Here's a simple, pseudocode example:
from spacy.tokens import Token
# register your new attribute token._.lefff_lemma
Token.set_extension('lefff_lemma', default=None)
def french_lemmatizer(doc):
    for token in doc:
        # compute the lemma based on the token's text, POS tag and whatever else you need -
        # you'll have to write your own wrapper for the Lefff Lemmatizer here
        lemma = GET_LEMMA(token.text, token.pos_, token.tag_)
        token._.lefff_lemma = lemma
    return doc
You can then add your component to the pipeline using nlp.add_pipe - and set after='parser' to make sure it's added after the dependency parser, so your Doc object will already have POS tags and dependency labels available:
nlp = spacy.load('fr')
nlp.add_pipe(french_lemmatizer, name='lefff', after='parser')
doc = nlp(u"avions")
assert doc[0]._.lefff_lemma == 'avion'
# sorry, my French isn't good enough to come up with a context-sensitive example :)
The pipeline component docs also have some more advanced code examples. For instance, you could also wrap your component in a class and allow initialising it with settings:
french_lemmatizer = FrenchLemmatizer(some_setting=True, some_path='/path')
nlp.add_pipe(french_lemmatizer, after='parser')
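A minimal sketch of such a class - FrenchLemmatizer, the settings and the load_lefff loader are all hypothetical placeholders:
class FrenchLemmatizer(object):
    name = 'lefff'  # component name in the pipeline

    def __init__(self, some_setting=False, some_path=None):
        # hypothetical settings, e.g. a path to load the Lefff data from
        self.some_setting = some_setting
        self.lemmatize = load_lefff(some_path)  # your own Lefff wrapper

    def __call__(self, doc):
        # assumes token._.lefff_lemma was registered via Token.set_extension above
        for token in doc:
            token._.lefff_lemma = self.lemmatize(token.text, token.pos_)
        return doc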