Rasa version:
1.8 onwards
Python version:
3.6 or 3.7
Operating system (windows, osx, ...):
All
Issue:
If you use spaCy components in your pipeline, the behaviour of the CountVectorFeaturizer changes: with the "word" analyser, we use the lemma that spaCy provides in place of the token text. There's potentially merit to this idea but currently:
lemma tokens at all times the user wants to use spaCy. I think the best way forward is to implement a configuration in the CountVectorFeaturizer that makes it possible to not use the lemma. Once this is implemented we can update the documentation.
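To make the proposal concrete, here is a minimal sketch of what such a switch might look like. The `Token` class, the `use_lemma` flag and the `count_tokens` helper are hypothetical names used for illustration only, not Rasa's actual API, and the counting is a plain bag-of-words stand-in for the real featurizer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a pipeline token (not Rasa's Token class)."""

    text: str
    lemma: Optional[str] = None  # filled in by a lemmatising component, if any


def count_tokens(tokens: List[Token], use_lemma: bool = True) -> Counter:
    """Count lemmas when use_lemma is set and a lemma exists, else the raw text."""
    words = [
        (t.lemma if use_lemma and t.lemma is not None else t.text).lower()
        for t in tokens
    ]
    return Counter(words)


tokens = [Token("Dogs", "dog"), Token("were", "be"), Token("barking", "bark")]
lemma_counts = count_tokens(tokens, use_lemma=True)   # counts "dog", "be", "bark"
text_counts = count_tokens(tokens, use_lemma=False)   # counts "dogs", "were", "barking"
```

With the flag off, the spaCy tokens still flow through the pipeline but the featurizer only ever sees the surface forms.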
A related discussion needs to take place here too. Is this how we want to deal with lemmatisation in Rasa?
My impression is that we've always used the spaCy lemma features inside of the CountVectorFeaturizer and never really considered how we think about lemmatisation in general. The reason I want to bring it up is related to a feature request in rasa-nlu-examples. There are other tools for lemmatization/tokenization that offer support for languages that spaCy currently does not cover. I'd like to add support for them but it might be good to formalise how we want these components to behave.
@Ghostvv might be in favour of decoupling lemmatisation and the CountVectorFeaturizer.
As soon as we add an option to the CountVectorFeaturizer to use the lemma or not, we are decoupling the lemmatisation from the CountVectorFeaturizer, aren't we? I think if users want to add another component to the pipeline that does lemmatisation, that is fine as they would need to reuse our interface, e.g. updating the lemma attribute of the tokens.
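As a rough illustration of that idea, a separate lemmatising component could simply write to the lemma attribute of each token and leave featurization to whatever comes next. Everything named below (`Token`, `DictionaryLemmatizer`, `process`) is an assumption made for this sketch and does not mirror Rasa's real component interface; the lookup table stands in for any non-spaCy lemmatiser.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    """Hypothetical stand-in for a pipeline token (not Rasa's Token class)."""

    text: str
    lemma: Optional[str] = None


class DictionaryLemmatizer:
    """Hypothetical component that sets token.lemma from a lookup table."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table

    def process(self, tokens: List[Token]) -> List[Token]:
        for token in tokens:
            # Fall back to the original text when the word is not in the table.
            token.lemma = self.table.get(token.text.lower(), token.text)
        return tokens


lemmatizer = DictionaryLemmatizer({"honden": "hond", "liepen": "lopen"})
processed = lemmatizer.process([Token("Honden"), Token("liepen"), Token("snel")])
lemmas = [t.lemma for t in processed]
```

A downstream featurizer would then only need to decide whether to read `lemma` or `text`, regardless of which component produced the lemma.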
@tabergma is the lemma attribute the only attribute that we want to "countvectorize"? The reason I'm asking is related to (yet another) feature request. In spaCy there are many attributes that could be interesting to featurize. There are flags for stopwords, sentiment and out-of-vocabulary terms. These are all interesting but there are two paths towards an implementation.
I'm currently leaning towards implementing #2 but since there's an option to do something "more general" here I figured I'd at least mention it.
I think CVF should only featurize tokens/words. The rest is the job of LexicalSyntacticFeaturizer.