Rasa Core version: 0.12.2 (also can repro against master)
Python version: 3.6
Operating system (windows, osx, ...): Ubuntu 18.04
Issue:
I am migrating an existing Rasa bot from 0.11 to 0.12. The bot itself works just as before, yet I am running into one issue with interactive training: the TrainingDataGenerator created by training.load_data before the interactive learning I/O starts will hang indefinitely when generate is called here: https://github.com/RasaHQ/rasa_core/blob/c82d9107021d7d681f4beb60b7dbdb7ff60a7031/rasa_core/training/__init__.py#L57
When I look at the logging output I can see that the analysis assumes it has found unused checkpoints and does another iteration. Yet this iteration will still not finish, and by the 3rd or 4th round my computer usually freezes:
2018-11-23 20:20:08 DEBUG rasa_core.training.generator - Starting data generation round 0 ... (with 1 trackers)
Processed Story Blocks: 100%|██████████| 198/198 [00:37<00:00, 5.25it/s, # trackers=2]
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Finished phase (66670 training samples found).
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Found 3 unused checkpoints in current phase.
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Found 1280 active trackers for these checkpoints.
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Starting data generation round 1 ... (with 1280 trackers)
Processed Story Blocks: 100%|██████████| 198/198 [01:21<00:00, 2.43it/s, # trackers=8192]
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Finished phase (199790 training samples found).
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Found 29 unused checkpoints in current phase.
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Found 121344 active trackers for these checkpoints.
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Starting data generation round 2 ... (with 121344 trackers)
... and so on
This leaves me with a couple of questions:
Thanks for raising this issue, @Ghostvv will get back to you about it soon.
@m90 thank you for the detailed issue. Do you have heavily checkpointed stories? The problem is that checkpointed stories can create loops, which we break by creating different source/sink checkpoints. If there are a lot of loops, the number of unused checkpoints can grow from phase to phase. This logic ensures that the algorithm makes at least one pass through each loop. If you bail out earlier, it means not all of your stories will be scanned (which ones get dropped is chosen randomly). In that case there is no point in keeping all of them anyway, because not all of them will be seen during training (and you cannot control which ones).
My suggestion would be to inspect your stories and try to reduce the number and complexity of checkpoints.
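For reference, here is roughly what a heavily checkpointed story file looks like in the 0.12 markdown story format (story and checkpoint names below are invented for illustration). Every block that ends in a checkpoint another block starts from keeps trackers alive across generation rounds, and when checkpoints reference each other both ways, the loops described above appear:

```
## greet user
* greet
  - utter_greet
> check_user_known

## known user path
> check_user_known
* request_info
  - utter_provide_info
> check_user_known
```

Note how the second block both starts and ends at `check_user_known` — that is exactly the kind of cycle the generator has to break with synthetic source/sink checkpoints.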
Hey @Ghostvv, thanks very much for the detailed explanation. The stories in question do indeed contain quite a few checkpoints, so that is likely the root cause of the issue in this case. We (@hendr-ik rather) will try to reduce them and see if we can start using interactive training again.
Which leaves me with another question, though: if checkpoints are potentially an issue with interactive training, should their usage be actively discouraged or the feature be deprecated in a future version?
Some background: while debugging I hacked together a bail-out mechanism that would exit after 2 rounds in the scenario above to see what happens, only to find out that my computer would crash on the sheer number of trackers it then had to save. So even if I were willing to wait very long for the process to finish, I still couldn't continue, because the next step would fail for similar reasons.
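The growth visible in the log above (1 → 1280 → 121344 active trackers) is roughly geometric. A toy sketch (plain Python, not Rasa code; the branching factor is a made-up illustration, not a value taken from the generator) shows why even a few rounds are enough to exhaust memory:

```python
def trackers_after_rounds(start, branching_factor, rounds):
    """Multiply the active tracker count by a fixed branching factor
    per generation round, mimicking trackers fanning out at unused
    checkpoints that several story blocks flow into."""
    trackers = start
    for _ in range(rounds):
        trackers *= branching_factor
    return trackers

# Even a modest branching factor explodes within a handful of rounds:
for r in range(5):
    print(r, trackers_after_rounds(1, 64, r))
```

With a branching factor of 64 the count already passes 16 million by round 4, which is why bailing out after round 2 still left far too many trackers to persist.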
Interactive training is a key feature of rasa_core, though, so I wonder whether it makes sense to keep a potentially conflicting feature like checkpoints?
Checkpoints are a controversial feature: the problem is that they are very powerful and often quite useful, but as with any powerful technique, overuse is quite dangerous. Personally, I use them quite a lot, but I try to keep the overall graph structure in mind. You can see this graph by passing the --debug_plots flag to the train.py script
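For reference, an invocation along these lines should produce the graph (a sketch assuming the 0.12-era rasa_core CLI; the exact flags may differ in your version, so check the script's `-h` output):

```shell
# Train the dialogue model and additionally plot the story graph.
# Paths below are placeholders for your own project layout.
python -m rasa_core.train \
    -d domain.yml \
    -s data/stories.md \
    -o models/dialogue \
    --debug_plots
```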
So after applying the fixes from #1410 I can run training with --debug_plots, but I am still seeing a few things I don't fully understand: most nodes map to elements of the given stories, but I also see nodes like GENR_OR_xxxx and GENR_CYCL_xxxx (the latter seems suspicious). Is there any document or place in the codebase that explains what these mean?
GENR_OR_xxxx checkpoints are created when you use the OR keyword between intents in your stories.
GENR_CYCL_xxxx checkpoints are created to break graph loops introduced by your checkpoints - these are the loops I was talking about above
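To make the GENR_OR case concrete: a story block like the following (names invented for illustration) is internally split into one path per intent, and a generated GENR_OR_xxxx checkpoint joins the branches back together:

```
## thanks or affirmation
* thankyou OR affirm
  - utter_youre_welcome
```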
Thanks for the explanation @Ghostvv, we'll try to use this as a guideline for getting the bot in question working on 0.12.
Feel free to close this issue if you want; otherwise I'll do so once the bot is back up.
I'll close this; you can reopen if you encounter more issues