Rasa: Capitalization throwing off the tensorflow_embedding classifier

Created on 28 Apr 2018 · 16 comments · Source: RasaHQ/rasa

Rasa NLU version: 0.12.2

Operating system: Windows 10

Content of model configuration file:

language: "en"

pipeline: "tensorflow_embedding"

Issue:
Capitalization is seriously messing up the intent classification for a model I trained using the new tensorflow_embedding pipeline.
Example (I'm just posting the relevant output from the parser):

'text': 'no'
'intent': {'confidence': 0.9569746255874634, 'name': 'disagree'}
'text': 'No'
'intent': {'confidence': 0.6564008593559265, 'name': 'disagree'}
# See the lower confidence
#----
'text': 'yes'
'intent': {'confidence': 0.9270809888839722, 'name': 'agree'}
'text': 'Yes'
'intent': {'confidence': 0.6564008593559265, 'name': 'disagree'}
# It's classifying it completely wrongly.
# (variations like 'yEs', 'yES', and 'YES' also give the exact same confidences as 'Yes')
#----
'text': 'hi'
'intent': {'confidence': 0.8774316310882568, 'name': 'greet'}
'text': 'Hi'
'intent': {'confidence': 0.6564008593559265, 'name': 'disagree'}
# Again completely wrong!

I have no capital letters in any of my training data utterances.
I have also trained a model on the same data with the spacy_sklearn pipeline, and it gives me exactly the same intent confidences, down to the last digit, however I capitalize my input.


All 16 comments

Tensorflow embedding builds the word vectors from scratch. spaCy has pre-trained word vectors, which may be why it can handle 'Yes' even when your training utterances don't contain it.
So, if I understand correctly, you have not provided any training utterances containing 'Yes'?
Did you try adding them to your training data?

I agree with @souvikg10. The intent confidence is relatively low when you test utterances with capital letters, which means the model is unsure which intent is best because the utterance is new to it.

Yes, I also suspect that the way the new pipeline creates embeddings for words, instead of using pre-trained ones, is what is causing the issue. Unless it is a bug introduced when the feature was added, or I am doing something wrong somewhere.

But the point is, I think it is a major drawback if the training data has to contain examples that vary only in their capitalization. Capitalization doesn't change the meaning of the text, so this is a failure on the part of the NLU, or of the tensorflow_embedding approach.

I tried adding one more utterance, YES, to my data, and it started correctly detecting all the variations of yes as the agree intent, but the confidence levels are still an issue.

'yes' - 'confidence': 0.9103827476501465
'YES' - 'confidence': 0.8896434307098389
'Yes','yEs','yeS','YEs','YeS','yES' - 'confidence': 0.47380155324935913
# This could still be an issue if, say,
# I set a minimum acceptable confidence of 0.5.

I tried adding all the variations, and then of course there wasn't an issue anymore.
But you can see how impractical it is to do that manually. It could be automated, but that would inflate the training data size multiple-fold (unnecessarily), and brute-forcing the problem is not elegant, I'd say. Yet another option would be to use something like lower() to change all user input to lowercase before parsing, as in the sketch below.
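A minimal sketch of that last workaround (assuming a rasa_nlu 0.12-style Interpreter; the model path here is a placeholder):

from rasa_nlu.model import Interpreter

# Load a trained model; "./models/current" is a hypothetical path.
interpreter = Interpreter.load("./models/current")

def parse_normalized(text):
    # Lowercase before parsing so 'Yes', 'YES', and 'yes' all hit
    # the same learned features.
    return interpreter.parse(text.lower())

print(parse_normalized("YES")["intent"])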

I'm not just looking for a practical solution, but to make sense of what's happening and to find an elegant solution to it. 😃

If you guys are also trying the new pipeline can you just check if capitalization is messing up the classification for your models?

I think spaCy's pre-trained vectors are trained the same way. spaCy uses a Wikipedia corpus (at least for Dutch), which also doesn't do well on certain non-traditional words. It completely depends on your corpus, so I would really suggest assessing your use case properly. Tensorflow embedding works really well for a narrow-domain chatbot, which, by the way, describes 95% of the bots in the market. Do you really have users typing 'Yes' and 'yes', or 'yeah I understand', but also 'yEs'? If that's the case, then even your spaCy model might fail, or you will have to crawl public forums and retrieve such examples. And capitalization can indeed change the meaning of a word.

"I want to ship a Book" or "I want to book a Ship". there is a difference..

For our work, I switched to tensorflow for the moment because it gives better results compared to spaCy's default model. But it is a narrow-domain chatbot focused on a fixed set of questions. There are some niche edge cases we see, but those are usually handled by the art of interrogation: asking your user the proper question.

This is an interesting usability issue. There are a number of ways you can remedy this, and we should document this better.

The simplest is to pass a preprocessor to the CountVectorFeaturizer which just lowercases everything. Then "Hi" and "hi" get mapped to the same feature.
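For example, a minimal sketch of the idea using scikit-learn's CountVectorizer, which is what backs the count-vectors featurizer (the exact component and parameter names may differ between rasa_nlu versions):

from sklearn.feature_extraction.text import CountVectorizer

# The preprocessor runs on each raw string before tokenization,
# so "Hi", "hi", and "HI" collapse into a single feature.
vectorizer = CountVectorizer(preprocessor=lambda text: text.lower())
vectorizer.fit(["hi", "Hi", "HI"])
print(sorted(vectorizer.vocabulary_))  # ['hi']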

Another approach is to add nlp_spacy and tokenizer_spacy to the pipeline, because if spaCy is present, we will actually replace each token with its lemma. We didn't do that by default because then you would still have to load a spaCy model.
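To see the lemma substitution at work (a sketch; it assumes the en_core_web_sm model is installed, and exact lemmas vary with the spaCy version):

import spacy

nlp = spacy.load("en_core_web_sm")
# Lemmas are typically lowercased base forms, so variants such as
# "Books"/"books" or "YES"/"yes" tend to fold together.
print([token.lemma_ for token in nlp("YES the Books are here")])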

Here is my pipeline. Indeed, I load spaCy's language model for tokenization, which could be why I am getting better results with tensorflow. Wasn't aware of that 👍

pipeline:
# this is using the spacy sklearn pipeline, adding duckling
# all components will use their default values

- name: "nlp_spacy"
- name: "tokenizer_spacy"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_spacy"
- name: "ner_spacy"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "ner_duckling_http"
  locale: "nl_Nothing"
  url: "http://duckling:8000"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"

Hi @amn41
Good to know why this was happening only in tensorflow_embedding and not the spacy_sklearn pipeline.

I think I had asked on Gitter for the components of the tensorflow_embedding pipeline template, as that was also missing from the documentation. Is there a way to contribute to the documentation as well? (I'd like to make it more descriptive and friendly for naive users like myself, whenever I come across something I can add.)

Of the two solutions, the first one sounds better, because, as you mentioned in the blog post, the greatest advantage of shifting to the new pipeline for me was the low memory load. The spaCy-based interpreter takes a couple of minutes to load on my laptop, whereas the tensorflow interpreter takes just a couple of seconds. It would feel bad to give that up. 😅

Cool! The documentation is just part of this repo, so you can create PRs the same way as for code changes. You will need to install a few more dependencies to build the docs; check out the docs section of dev-requirements.txt.

Agreed, lemmatization is probably overkill, so just lowercasing should do the job! We should consider making this the default. Currently the only preprocessing we do is mapping all numbers to the token NUMBER: https://github.com/RasaHQ/rasa_nlu/pull/981
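That mapping amounts to something like the following (an illustrative sketch; the actual regex and replacement token live in the linked PR):

import re

def map_numbers(text):
    # Collapse every digit sequence into one shared token so the
    # featurizer doesn't learn a separate feature per number.
    return re.sub(r"\b\d+\b", "NUMBER", text)

print(map_numbers("table for 4 at 19:30"))  # table for NUMBER at NUMBER:NUMBER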

@souvikg10
Okay, there might be some use case where we want to distinguish words with capitals from those without. But the issue I was pointing out is use-case independent.
And I know I don't really expect someone to type "YeS", but if there is a typo I don't want to penalize the user with a random response just because it isn't in my training data set. 😄

Also, the difference in meaning between "I want to ship a Book" and "I want to book a Ship" is not due to capitalization. 😄
You've changed the word order, and they are two completely different statements.

@amn41
Okay! I will look at how to add to the documentation and try to keep making it more descriptive.

Yeah. Adding a case_sensitive flag to control case sensitivity in the tensorflow_embedding pipeline's preprocessing would be a great and necessary addition, I think.

That was just an example, as I couldn't think of a better one, but I am sure there are words that act as both verbs and nouns, though you are right that it is not necessarily due to capitalisation. I am not sure it is wise to ignore capitalisation and lowercase everything; I would indeed enrich my training data instead.

You can load spaCy without loading word embeddings, if only to keep all preprocessing in one place. I think matching the spaCy YAML kwargs with the function signature when initializing the nlp object (along with case sensitivity) would allow spaCy to be used with and without word vectors, entity models, dependency parsing models, etc. A sketch follows.
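For instance, a tokenizer-only load (spacy.blank creates a bare pipeline with no word vectors or statistical models; available since spaCy 2.x):

import spacy

# Tokenizer only: no vectors, tagger, parser, or NER are loaded.
nlp = spacy.blank("en")
print([t.text for t in nlp("I want to book a Ship")])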

Seems this was closed with #1053
