Rasa: Lemmatization and CountVectorFeaturizers

Created on 1 Sep 2020 · 4 comments · Source: RasaHQ/rasa

Rasa version:

1.8 onwards

Python version:

3.6 or 3.7

Operating system (windows, osx, ...):

All

Issue:

Currently, if you use spaCy components in your pipeline, the behaviour of the CountVectorFeaturizer changes: the lemma that spaCy provides is used in place of the token text when the "word" analyser is configured. There is potentially merit to this idea, but currently:

  1. The user does not know this is happening because the behaviour is undocumented. At the very least we need to update the docs to reflect this.
  2. The user cannot configure this; the lemma tokens are used whenever spaCy is in the pipeline.

I think the best way forward is to implement a configuration option in the CountVectorFeaturizer that makes it possible to not use the lemma. Once this is implemented, we can update the documentation.
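A minimal sketch of what such an option could look like; the `use_lemma` flag, the `Token` stand-in and the helper function below are hypothetical illustrations, not part of Rasa's current API:

```python
# Hypothetical sketch: how a `use_lemma` option could control whether the
# featurizer counts the lemma or the surface text of each token.
# `Token`, `use_lemma` and `token_text_for_counting` are illustrative names.

from typing import Optional


class Token:
    """Stand-in for Rasa's token: surface text plus an optional spaCy lemma."""

    def __init__(self, text: str, lemma: Optional[str] = None) -> None:
        self.text = text
        self.lemma = lemma


def token_text_for_counting(token: Token, use_lemma: bool = True) -> str:
    """Return the string the CountVectorFeaturizer would count.

    With use_lemma=True (the current, implicit behaviour) the lemma is
    preferred whenever spaCy provides one; with use_lemma=False the surface
    text is always used.
    """
    if use_lemma and token.lemma:
        return token.lemma
    return token.text


tokens = [Token("ports", lemma="port"), Token("opened", lemma="open")]
print([token_text_for_counting(t, use_lemma=True) for t in tokens])   # ['port', 'open']
print([token_text_for_counting(t, use_lemma=False) for t in tokens])  # ['ports', 'opened']
```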


Most helpful comment

I think the CVF should only featurize tokens/words. The rest is the job of the LexicalSyntacticFeaturizer.

All 4 comments

A related discussion needs to take place here too. Is this how we want to deal with lemmatisation in Rasa?

My impression is that we've always used the spaCy lemma features inside of the CountVectorFeaturizer and never really considered how we think about lemmatisation in general. The reason I want to bring it up is related to a feature request in rasa-nlu-examples. There are other tools for lemmatization/tokenization that offer support for languages that spaCy currently does not cover. I'd like to add support for them but it might be good to formalise how we want these components to behave.

@Ghostvv might be in favour of decoupling lemmatisation and the CountVectorFeaturizer.

As soon as we add an option to the CountVectorFeaturizer to use the lemma or not, we are decoupling the lemmatisation from the CountVectorFeaturizer, aren't we? I think if users want to add another component to the pipeline that does lemmatisation, that is fine, as they would just need to reuse our interface, e.g. by updating the lemma attribute of the tokens.
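A rough sketch of what "reusing the interface" could mean for a third-party lemmatiser. The `Token` dataclass is a stand-in for Rasa's token class and `LOOKUP` is a toy lemmatiser; the only point illustrated is filling the lemma attribute that downstream featurizers read:

```python
# Hypothetical sketch of a custom lemmatisation step that plugs into the
# existing interface by updating each token's `lemma` attribute.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Stand-in for Rasa's token class."""
    text: str
    lemma: Optional[str] = None


# Toy lookup table standing in for a real third-party lemmatisation tool.
LOOKUP = {"ran": "run", "mice": "mouse", "better": "good"}


def apply_custom_lemmas(tokens: List[Token]) -> List[Token]:
    """Fill the lemma attribute so downstream featurizers
    (e.g. the CountVectorFeaturizer) pick it up instead of the raw text."""
    for token in tokens:
        token.lemma = LOOKUP.get(token.text.lower(), token.text)
    return tokens


tokens = apply_custom_lemmas([Token("Mice"), Token("ran")])
print([(t.text, t.lemma) for t in tokens])  # [('Mice', 'mouse'), ('ran', 'run')]
```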

@tabergma is the lemma attribute the only attribute that we want to "countvectorize"? The reason I'm asking is related to (yet another) feature request. In spaCy there are many attributes that could be interesting to featurize. There are flags for stopwords, sentiment and out-of-vocabulary terms. These are all interesting, but there are two paths towards an implementation.

  1. We can have a tokenizer that adds all of these attributes to the token and then we can have the CountVectorFeaturizer turn them all into sparse vectors for DIET.
  2. We can have a separate featurizer that handles this directly without the need to attach anything to the tokens.

I'm currently leaning towards implementing #2 but since there's an option to do something "more general" here I figured I'd at least mention it.
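For a sense of what option 2 could compute, here is a standalone sketch that reads spaCy token flags directly and produces a sparse per-token matrix; the chosen flags and the scipy layout are purely illustrative and do not follow any Rasa featurizer interface:

```python
# Illustrative sketch of option 2: a separate featurizer that turns spaCy
# token flags into a sparse per-token feature matrix, without attaching
# anything extra to the tokens. Assumes `spacy` and `scipy` are installed,
# plus a spaCy model such as `en_core_web_sm`.

import scipy.sparse
import spacy

nlp = spacy.load("en_core_web_sm")

# Example boolean spaCy token attributes to featurize.
FLAGS = ["is_stop", "is_oov", "like_num"]


def flag_features(text: str) -> scipy.sparse.coo_matrix:
    """Return a (num_tokens x num_flags) sparse matrix of boolean token flags."""
    doc = nlp(text)
    rows, cols, data = [], [], []
    for i, token in enumerate(doc):
        for j, flag in enumerate(FLAGS):
            if getattr(token, flag):
                rows.append(i)
                cols.append(j)
                data.append(1)
    return scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(doc), len(FLAGS)))


print(flag_features("the 2 quick zorblaxes").toarray())
```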

I think the CVF should only featurize tokens/words. The rest is the job of the LexicalSyntacticFeaturizer.
