Rasa: Lemmatization and CountVectorFeaturizers

Created on 1 Sep 2020 · 4 comments · Source: RasaHQ/rasa

Rasa version:

1.8 onwards

Python version:

3.6 or 3.7

Operating system (windows, osx, ...):

All

Issue:

Currently, if you use spaCy components in your pipeline, the behaviour of the CountVectorFeaturizer changes: the lemma that spaCy provides is used in place of the token text when the "word" analyser is configured. There is potentially merit to this idea, but currently:

  1. The user does not know this is happening because the behaviour is undocumented. At the very least we need to update the docs to reflect this.
  2. The user cannot configure this; the lemma tokens are used whenever spaCy is in the pipeline.

I think the best way forward is to implement a configuration option in the CountVectorFeaturizer that makes it possible to not use the lemma. Once this is implemented, we can update the documentation.
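A minimal sketch of what such an option could look like; the `use_lemma` flag, the `Token` stand-in and the helper function below are hypothetical illustrations, not part of Rasa's current API:

```python
# Hypothetical sketch: how a `use_lemma` option could control whether the
# featurizer counts the lemma or the surface text of each token.
# `Token`, `use_lemma` and `token_text_for_counting` are illustrative names.

from typing import Optional


class Token:
    """Stand-in for Rasa's token: surface text plus an optional spaCy lemma."""

    def __init__(self, text: str, lemma: Optional[str] = None) -> None:
        self.text = text
        self.lemma = lemma


def token_text_for_counting(token: Token, use_lemma: bool = True) -> str:
    """Return the string the CountVectorFeaturizer would count.

    With use_lemma=True (the current, implicit behaviour) the lemma is
    preferred whenever spaCy provides one; with use_lemma=False the surface
    text is always used.
    """
    if use_lemma and token.lemma:
        return token.lemma
    return token.text


tokens = [Token("ports", lemma="port"), Token("opened", lemma="open")]
print([token_text_for_counting(t, use_lemma=True) for t in tokens])   # ['port', 'open']
print([token_text_for_counting(t, use_lemma=False) for t in tokens])  # ['ports', 'opened']
```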


Most helpful comment

I think the CVF should only featurize tokens/words. The rest is the job of the LexicalSyntacticFeaturizer.

All 4 comments

A related discussion needs to take place here too. Is this how we want to deal with lemmatisation in Rasa?

My impression is that we've always used the spaCy lemma features inside of the CountVectorFeaturizer and never really considered how we think about lemmatisation in general. The reason I want to bring it up is related to a feature request in rasa-nlu-examples. There are other tools for lemmatization/tokenization that offer support for languages that spaCy currently does not cover. I'd like to add support for them but it might be good to formalise how we want these components to behave.

@Ghostvv might be in favour of decoupling lemmatisation and the CountVectorFeaturizer.

As soon as we add an option to the CountVectorFeaturizer to use the lemma or not, we are decoupling the lemmatisation from the CountVectorFeaturizer, aren't we? I think if users want to add another component to the pipeline that does lemmatisation, that is fine, as they would just need to reuse our interface, e.g. by updating the lemma attribute of the tokens.
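A rough sketch of what "reusing the interface" could mean for a third-party lemmatiser. The `Token` dataclass is a stand-in for Rasa's token class and `LOOKUP` is a toy lemmatiser; the only point illustrated is filling the lemma attribute that downstream featurizers read:

```python
# Hypothetical sketch of a custom lemmatisation step that plugs into the
# existing interface by updating each token's `lemma` attribute.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Stand-in for Rasa's token class."""
    text: str
    lemma: Optional[str] = None


# Toy lookup table standing in for a real third-party lemmatisation tool.
LOOKUP = {"ran": "run", "mice": "mouse", "better": "good"}


def apply_custom_lemmas(tokens: List[Token]) -> List[Token]:
    """Fill the lemma attribute so downstream featurizers
    (e.g. the CountVectorFeaturizer) pick it up instead of the raw text."""
    for token in tokens:
        token.lemma = LOOKUP.get(token.text.lower(), token.text)
    return tokens


tokens = apply_custom_lemmas([Token("Mice"), Token("ran")])
print([(t.text, t.lemma) for t in tokens])  # [('Mice', 'mouse'), ('ran', 'run')]
```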

@tabergma is the lemma attribute the only attribute that we want to "countvectorize"? The reason I'm asking is related to (yet another) feature request. In spaCy there are many attributes that could be interesting to featurize. There are flags for stopwords, sentiment and out-of-vocabulary terms. These are all interesting, but there are two paths towards an implementation.

  1. We can have a tokenizer that adds all of these attributes to the token and then we can have the CountVectorFeaturizer turn them all into sparse vectors for DIET.
  2. We can have a separate featurizer that handles this directly without the need to attach anything to the tokens.

I'm currently leaning towards implementing #2 but since there's an option to do something "more general" here I figured I'd at least mention it.
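For a sense of what option 2 could compute, here is a standalone sketch that reads spaCy token flags directly and produces a sparse per-token matrix; the chosen flags and the scipy layout are purely illustrative and do not follow any Rasa featurizer interface:

```python
# Illustrative sketch of option 2: a separate featurizer that turns spaCy
# token flags into a sparse per-token feature matrix, without attaching
# anything extra to the tokens. Assumes `spacy` and `scipy` are installed,
# plus a spaCy model such as `en_core_web_sm`.

import scipy.sparse
import spacy

nlp = spacy.load("en_core_web_sm")

# Example boolean spaCy token attributes to featurize.
FLAGS = ["is_stop", "is_oov", "like_num"]


def flag_features(text: str) -> scipy.sparse.coo_matrix:
    """Return a (num_tokens x num_flags) sparse matrix of boolean token flags."""
    doc = nlp(text)
    rows, cols, data = [], [], []
    for i, token in enumerate(doc):
        for j, flag in enumerate(FLAGS):
            if getattr(token, flag):
                rows.append(i)
                cols.append(j)
                data.append(1)
    return scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(doc), len(FLAGS)))


print(flag_features("the 2 quick zorblaxes").toarray())
```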

I think the CVF should only featurize tokens/words. The rest is the job of the LexicalSyntacticFeaturizer.
