Hi, I tried to do French lemmatization with spaCy.
I tried this code for testing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.fr import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmas = lemmatizer(u'yeux')
print(lemmas)
But the trouble is that I get the following error:
ImportError: No module named lang.fr
So how can I solve this?
Which version of spaCy are you using? Could you run python -m spacy info --markdown and post the result here?
spaCy v2.0 moved all language data to a submodule lang (see here for details and other backwards incompatibilities). Before, the language data lived in spacy.[lang]. So if it turns out you're using v1.x, you'd either have to change it to spacy.fr, or upgrade to v2.x.
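For example (the first import is the v1.x style, the second the v2.x style):
# spaCy v1.x - language data lives directly under spacy.[lang]
from spacy.fr import French
# spaCy v2.x - the same data moved to the lang submodule
from spacy.lang.fr import French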
Using spaCy v2.0 doesn't solve the problem, it only gives a new one:
from spacy.lang.fr import LEMMA_INDEX
ImportError: cannot import name 'LEMMA_INDEX'
I feel like these LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES only exist for English.
Can you help with that?
Ah yes - I hadn't looked at that part in detail, but that's correct. The English language data currently defines rules, while the French data only uses lookup-based lemmatization (i.e. only comes with a lookup table).
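So with the French model installed (python -m spacy download fr), the lookup table is applied automatically - a minimal sketch:
import spacy

nlp = spacy.load('fr')
doc = nlp(u'yeux')
# the lemma comes straight from the lookup table - no POS-sensitive rules involved
print(doc[0].lemma_)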
However, you can take inspiration from the English lemmatizer rules, adapt those for French and add them to the French class (or load them in from somewhere else).
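As a rough illustration of what that could look like - the FR_LEMMA_* data below is made up to mirror the shape of the English lemmatizer data, not real linguistic rules:
from spacy.lemmatizer import Lemmatizer

# illustrative entries only, in the same format as the English
# LEMMA_INDEX, LEMMA_EXC and LEMMA_RULES
FR_LEMMA_INDEX = {'noun': set(['œil'])}
FR_LEMMA_EXC = {'noun': {'yeux': ('œil',)}}
FR_LEMMA_RULES = {'noun': [['s', '']]}  # e.g. strip a plural -s

lemmatizer = Lemmatizer(FR_LEMMA_INDEX, FR_LEMMA_EXC, FR_LEMMA_RULES)
print(lemmatizer(u'yeux', 'noun'))  # should yield 'œil' via the exceptions table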
I have been using the Lefff lemmatizer for some time and I think it is the best. It has large coverage and takes the POS tag into account when lemmatizing. I do not know how hard it would be to integrate it... or just take inspiration. :)
Because if I understood correctly, the way lemmatization is done in spaCy now does not take the POS tag into account, it only looks up words in the table, right? (I kind of mentioned it in another issue)
Otherwise, yes, it would be great to adapt the rules for French.
@JonathanBonnaud Yes, that's one of the problems with lookup tables - they do work okay for simple, general purpose use cases, but they'll never be as good as more explicit rules and a statistical model.
If you want to play around with integrating the Lefff Lemmatizer, a good starting point would be to write a simple custom pipeline component. This lets you test the functionality in an isolated environment (without having to worry about spaCy's internals). See here for examples of other spaCy pipeline extensions developed by users. spaCy v2.0 now also allows adding your own custom attributes to the Doc, Span and Token - so you could start by adding the lemma to a custom token._.lefff_lemma attribute.
A pipeline component is a function that takes a Doc object, modifies it and returns it. Here's a simple, pseudocode example:
from spacy.tokens import Token
# register your new attribute token._.lefff_lemma
Token.set_extension('lefff_lemma', default=None)
def french_lemmatizer(doc):
    for token in doc:
        # compute the lemma based on the token's text, POS tag and whatever else you need -
        # you'll have to write your own wrapper for the Lefff Lemmatizer here
        lemma = GET_LEMMA(token.text, token.pos_, token.tag_)
        token._.lefff_lemma = lemma
    return doc
You can then add your component to the pipeline using nlp.add_pipe - and set after='parser' to make sure it's added after the dependency parser, so your Doc object will already have POS tags and dependency labels available:
nlp = spacy.load('fr')
nlp.add_pipe(french_lemmatizer, name='lefff', after='parser')
doc = nlp(u"avions")
assert doc[0]._.lefff_lemma == 'avion'
# sorry, my French isn't good enough to come up with a context-sensitive example :)
The pipeline component docs also have some more advanced code examples. For instance, you could also wrap your component in a class and allow initialising it with settings:
french_lemmatizer = FrenchLemmatizer(some_setting=True, some_path='/path')
nlp.add_pipe(french_lemmatizer, after='parser')
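A minimal sketch of such a class - FrenchLemmatizer, the settings and the load_lefff loader are all hypothetical placeholders:
class FrenchLemmatizer(object):
    name = 'lefff'  # component name in the pipeline

    def __init__(self, some_setting=False, some_path=None):
        # hypothetical settings, e.g. a path to load the Lefff data from
        self.some_setting = some_setting
        self.lemmatize = load_lefff(some_path)  # your own Lefff wrapper

    def __call__(self, doc):
        # assumes token._.lefff_lemma was registered via Token.set_extension above
        for token in doc:
            token._.lefff_lemma = self.lemmatize(token.text, token.pos_)
        return doc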