Rasa version: 1.9.2
Rasa SDK version (if used & relevant):
Rasa X version (if used & relevant):
Python version: 3.6
Operating system (windows, osx, ...): Linux
Issue:
When training Rasa NLU (i.e. rasa train nlu), an error is raised from rasa/utils/tensorflow/model_data.py line 107.
Error (including full traceback):
2020-03-27 13:13:06 INFO rasa.nlu.model - Starting to train component tokenizer_whitespace
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component RegexFeaturizer
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component DIETClassifier
Traceback (most recent call last):
  File "/home/gunsu/diet/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 445, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 474, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/train.py", line 86, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/model.py", line 191, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 622, in train
    model_data = self.preprocess_train_data(training_data)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 601, in preprocess_train_data
    label_attribute=label_attribute,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 549, in _create_model_data
    model_data.add_features(LABEL_FEATURES, [Y_sparse, Y_dense])
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 145, in add_features
    self.num_examples = self.number_of_examples()
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 107, in number_of_examples
    f"Number of examples differs for keys '{data.keys()}'. Number of "
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.
Command or request that led to error:
rasa train nlu -c ./bots/lib/config.yml -u ./bots/nlu_train.md --out ./models
Content of configuration file (config.yml) (if relevant):
language: "xx"
pipeline:
- name: "component.KoreanTokenizer"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_count_vectors"
"token_pattern": '(?u)\b\w+\b' # 1개의 character도 인식하도록 regex 변경
- name: DIETClassifier
intent_classification: True
entity_recognition: False
use_masked_language_model: False
BILOU_flag: False
number_of_transformer_layers: 0
epochs: 100
Content of domain file (domain.yml) (if relevant):
It looks like some examples don't have intent labels
@Ghostvv
Hi, thanks for the reply.
I had a look at my nlu.md file and didn't find any issues.
I trained Rasa NLU with the same nlu.md on a lower version of rasa-nlu (0.14.1) and the training was successful, so I don't think it has anything to do with nlu.md.
Otherwise, it could be that some examples couldn't be featurized for some reason.
Version 0.14.1 didn't have this check.
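For context, a rough sketch of the invariant behind that check (not Rasa's actual implementation): every feature array added to the model data must cover the same number of examples, i.e. share the same first dimension. If a component silently drops examples while featurizing, the counts diverge and this ValueError fires.

```python
import numpy as np

def number_of_examples(data: dict) -> int:
    # Collect the first dimension of every feature array under every key.
    counts = {len(features) for arrays in data.values() for features in arrays}
    if len(counts) > 1:
        raise ValueError(
            f"Number of examples differs for keys '{data.keys()}'. "
            f"Number of examples should be the same for all data."
        )
    return counts.pop()

data = {
    "text_features": [np.zeros((10, 5))],   # 10 featurized utterances
    "label_features": [np.zeros((8, 3))],   # only 8 examples got label features
}
number_of_examples(data)  # raises the ValueError seen in the traceback above
```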
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any update on that? I'm getting the same issue here, using rasa 1.10.5.
Me too! Using rasa 1.10.5:
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.
Hi!
As a temporary solution, I managed to train my bot by downgrading to rasa 1.10.1. At least here, it issues some warnings but finishes training and works correctly.
Hi!
When I use rasa 1.10.1, I still get the same error.
@shfshf @JoaoVFelipe Is one of you able to share your NLU data + config.yml so that I can take a closer look at the problem? Without the data to reproduce the issue it is hard to tell what is going wrong. Thanks.
@robinsongh381 @JoaoVFelipe @tabergma @Ghostvv I am the colleague of @shfshf who provides the custom tokenizer component for his pipeline. I finally found that there are two root causes of this issue:

1. The default token_pattern of CountVectorsFeaturizer, (?u)\b\w\w+\b, only matches tokens of two or more word characters, so it silently drops single-character tokens. For East Asian languages such as Chinese, Japanese and Korean, this removes a large share of the tokens.
2. The custom tokenizer does not add the special __CLS__ token. As the docs put it: "By default all tokenizer add a special token (__CLS__) to the end of the list of tokens. This token will be used to capture the features of the whole utterance."

Solutions:

1. Set token_pattern to "(?u)\b\w+\b" for CountVectorsFeaturizer if you are using an East Asian language (I will try to make a PR to make it the default option for the East Asian language setting); see the sketch below.
2. Rewrite the custom tokenizer so that it adds the __CLS__ token (the jieba tokenizer is a good one to reference).
Thanks @howl-anderson for the comment. We actually tackle problem 1 already in https://github.com/RasaHQ/rasa/issues/5905; it is already merged into master.
Just to be sure: if you update your custom tokenizer and solve the token_pattern issue, is the problem gone?
@tabergma It's good to see that the official team has already taken action on problem 1. For problem 2, I am still working on rewriting the tokenizer, but since all problems are gone when we use jieba as the tokenizer, there is definitely something wrong with the custom tokenizer. I will keep you informed on whether updating the custom tokenizer works or not.
Thanks @tabergma and @howl-anderson for the help, setting the token_pattern for CountVectorsFeaturizer solved the problem. I'm actually not training a bot in any Asian language, but some of my training data for recognizing out-of-scope languages contains Chinese, Japanese and Korean characters, and I hadn't noticed.
By the way, sorry for not sharing the NLU data before. It is pretty big, and I was instructed not to share it since some of it is enterprise sensitive. Thank you very much.
@tabergma It's been confirmed by @shfshf that updating the custom tokenizer indeed works! So I think at least part of @robinsongh381's issue is related to the custom tokenizer too, since his tokenizer works in v0.14.1 but doesn't work in v1.9.2. I hope this message can help him. If @robinsongh381 has trouble rewriting his custom tokenizer, I can try my best to help him.
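For anyone rewriting such a tokenizer, here is a minimal sketch assuming the Rasa 1.x Tokenizer API; the class name KoreanTokenizer and the whitespace split are placeholders, not the actual component from this thread. Subclassing Tokenizer and only overriding tokenize lets the base class append the __CLS__ token for you (as the quote above describes):

```python
from typing import List, Text

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class KoreanTokenizer(Tokenizer):  # placeholder name
    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)

        # Placeholder: swap this whitespace split for a real Korean
        # morpheme analyzer. Only `tokenize` needs to be overridden;
        # the base Tokenizer adds the __CLS__ token itself.
        words = text.split()

        tokens = []
        offset = 0
        for word in words:
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens
```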
Thanks to my colleague @howl-anderson!
@robinsongh381 @JoaoVFelipe @tabergma, I successfully fixed this bug with his solutions, for a Chinese-language custom tokenizer.
Great, glad to hear that it works for you! I will close the issue, as there is nothing code-wise we can do. If you have trouble rewriting your tokenizers, feel free to ask a question on our forum. We are happy to help.