Rasa version: 1.9.2
Rasa SDK version (if used & relevant):
Rasa X version (if used & relevant):
Python version: 3.6
Operating system (windows, osx, ...): Linux
Issue:
When training Rasa NLU (i.e. rasa train nlu), an error is raised from rasa/utils/tensorflow/model_data.py line 107.
Error (including full traceback):
2020-03-27 13:13:06 INFO rasa.nlu.model - Starting to train component tokenizer_whitespace
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component RegexFeaturizer
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component CountVectorsFeaturizer
2020-03-27 13:13:09 INFO rasa.nlu.model - Finished training component.
2020-03-27 13:13:09 INFO rasa.nlu.model - Starting to train component DIETClassifier
Traceback (most recent call last):
  File "/home/gunsu/diet/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 445, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 474, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/train.py", line 86, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/model.py", line 191, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 622, in train
    model_data = self.preprocess_train_data(training_data)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 601, in preprocess_train_data
    label_attribute=label_attribute,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 549, in _create_model_data
    model_data.add_features(LABEL_FEATURES, [Y_sparse, Y_dense])
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 145, in add_features
    self.num_examples = self.number_of_examples()
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 107, in number_of_examples
    f"Number of examples differs for keys '{data.keys()}'. Number of "
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.
Command or request that led to error:
rasa train nlu -c ./bots/lib/config.yml -u ./bots/nlu_train.md --out ./models
Content of configuration file (config.yml) (if relevant):
language: "xx"
pipeline:
- name: "component.KoreanTokenizer"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_count_vectors"
"token_pattern": '(?u)\b\w+\b' # 1개의 character도 인식하도록 regex 변경
- name: DIETClassifier
intent_classification: True
entity_recognition: False
use_masked_language_model: False
BILOU_flag: False
number_of_transformer_layers: 0
epochs: 100
Content of domain file (domain.yml) (if relevant):
It looks like some examples don't have intent labels
@Ghostvv
Hi, thanks for the reply.
I had a look at my nlu.md file and didn't find any issues.
I trained Rasa NLU with the same nlu.md on a lower version of rasa-nlu (0.14.1) and the training was successful, so I don't think it has anything to do with nlu.md.
Otherwise, it could be that some examples couldn't be featurized for some reason.
Version 0.14.1 didn't have this check.
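For context, a rough sketch of the invariant behind that check (not Rasa's actual implementation): every feature array added to the model data must cover the same number of examples, i.e. share the same first dimension. If a component silently drops examples while featurizing, the counts diverge and this ValueError fires.

```python
import numpy as np

def number_of_examples(data: dict) -> int:
    # Collect the first dimension of every feature array under every key.
    counts = {len(features) for arrays in data.values() for features in arrays}
    if len(counts) > 1:
        raise ValueError(
            f"Number of examples differs for keys '{data.keys()}'. "
            f"Number of examples should be the same for all data."
        )
    return counts.pop()

data = {
    "text_features": [np.zeros((10, 5))],   # 10 featurized utterances
    "label_features": [np.zeros((8, 3))],   # only 8 examples got label features
}
number_of_examples(data)  # raises the ValueError seen in the traceback above
```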
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any update on that? I'm getting the same issue here, using rasa 1.10.5.
Me too! Using rasa 1.10.5:
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.
Hi!
As a temporary solution, I managed to train my bot by downgrading to rasa 1.10.1. At least here, it issues some warnings but finishes training and works correctly.
Hi!
When I use rasa 1.10.1, I still get the same error.
@shfshf @JoaoVFelipe Is one of you able to share your NLU data + config.yml so that I can take a closer look at the problem? Without the data to reproduce the issue it is hard to tell what is going wrong. Thanks.
@robinsongh381 @JoaoVFelipe @tabergma @Ghostvv I am the colleague of @shfshf who provides the custom tokenizer component for his pipeline. I finally found that there are two root causes of this issue:

1. The default token_pattern of CountVectorsFeaturizer, (?u)\b\w\w+\b, only matches tokens of two or more word characters, so it silently drops single-character tokens. For East Asian languages such as Chinese, Japanese and Korean, this removes a large share of the tokens.
2. The custom tokenizer does not add the special __CLS__ token. As the docs put it: "By default all tokenizer add a special token (__CLS__) to the end of the list of tokens. This token will be used to capture the features of the whole utterance."

Solutions:

1. Set token_pattern to "(?u)\b\w+\b" for CountVectorsFeaturizer if you are using an East Asian language (I will try to make a PR to make it the default option for the East Asian language setting); see the sketch below.
2. Rewrite the custom tokenizer so that it adds the __CLS__ token (the jieba tokenizer is a good one to reference).
Thanks @howl-anderson for the comment. We actually tackle problem 1 already in https://github.com/RasaHQ/rasa/issues/5905; it is already merged into master.
Just to be sure: if you update your custom tokenizer and solve the token_pattern issue, is the problem gone?
@tabergma It's good to see that the official team has already taken action on problem 1. For problem 2, I am still working on rewriting the tokenizer, but since all problems are gone when we use jieba as the tokenizer, there is definitely something wrong with the custom tokenizer. I will keep you informed on whether updating the custom tokenizer works or not.
Thanks @tabergma and @howl-anderson for the help, setting the token_pattern for CountVectorsFeaturizer solved the problem. I'm actually not training a bot in any Asian language, but some of my training data for recognizing out-of-scope languages contains Chinese, Japanese and Korean characters, and I hadn't noticed.
By the way, sorry for not sharing the NLU data before. It is pretty big, and I was instructed not to share it since some of it is enterprise sensitive. Thank you very much.
@tabergma It's been confirmed by @shfshf that updating the custom tokenizer indeed works! So I think at least part of @robinsongh381's issue is related to the custom tokenizer too, since his tokenizer works in v0.14.1 but doesn't work in v1.9.2. I hope this message can help him. If @robinsongh381 has trouble rewriting his custom tokenizer, I can try my best to help him.
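For anyone rewriting such a tokenizer, here is a minimal sketch assuming the Rasa 1.x Tokenizer API; the class name KoreanTokenizer and the whitespace split are placeholders, not the actual component from this thread. Subclassing Tokenizer and only overriding tokenize lets the base class append the __CLS__ token for you (as the quote above describes):

```python
from typing import List, Text

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class KoreanTokenizer(Tokenizer):  # placeholder name
    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)

        # Placeholder: swap this whitespace split for a real Korean
        # morpheme analyzer. Only `tokenize` needs to be overridden;
        # the base Tokenizer adds the __CLS__ token itself.
        words = text.split()

        tokens = []
        offset = 0
        for word in words:
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens
```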
Thanks to my colleague @howl-anderson!
@robinsongh381 @JoaoVFelipe @tabergma, I successfully fixed this bug with his solutions, for a Chinese-language custom tokenizer.
Great, glad to hear that it works for you! I will close the issue, as there is nothing code-wise we can do. If you have trouble rewriting your tokenizers, feel free to ask a question on our forum. We are happy to help.