Rasa Core version: 0.12.2 (also can repro against master)
Python version: 3.6
Operating system (windows, osx, ...): Ubuntu 18.04
Issue:
I am migrating an existing Rasa bot from 0.11 to 0.12. The bot itself works just as before, yet I am running into one issue with interactive training: the TrainingDataGenerator created by training.load_data before the interactive learning I/O starts will hang indefinitely when generate is called here: https://github.com/RasaHQ/rasa_core/blob/c82d9107021d7d681f4beb60b7dbdb7ff60a7031/rasa_core/training/__init__.py#L57
When I look at the logging output I can see that the analysis assumes it has found unused checkpoints and does another iteration. Yet this iteration will still not finish, and by the 3rd or 4th round my computer usually freezes:
2018-11-23 20:20:08 DEBUG rasa_core.training.generator - Starting data generation round 0 ... (with 1 trackers)
Processed Story Blocks: 100%|██████████| 198/198 [00:37<00:00, 5.25it/s, # trackers=2]
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Finished phase (66670 training samples found).
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Found 3 unused checkpoints in current phase.
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Found 1280 active trackers for these checkpoints.
2018-11-23 20:20:46 DEBUG rasa_core.training.generator - Starting data generation round 1 ... (with 1280 trackers)
Processed Story Blocks: 100%|██████████| 198/198 [01:21<00:00, 2.43it/s, # trackers=8192]
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Finished phase (199790 training samples found).
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Found 29 unused checkpoints in current phase.
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Found 121344 active trackers for these checkpoints.
2018-11-23 20:22:08 DEBUG rasa_core.training.generator - Starting data generation round 2 ... (with 121344 trackers)
... and so on
This leaves me with a couple of questions:
Thanks for raising this issue, @Ghostvv will get back to you about it soon.
@m90 thank you for the detailed issue. Do you have heavily checkpointed stories? The problem is that checkpointed stories can create loops, which we break by creating different source/sink checkpoints. If there are a lot of loops, the number of unused checkpoints can grow from phase to phase. This logic ensures that the algorithm makes at least one pass through each loop. If you bail out earlier, it means not all of your stories will be scanned (which ones get dropped is chosen randomly). In that case there is no point in keeping all of them anyway, because not all of them will be seen during training (and you cannot control which ones).
My suggestion would be to inspect your stories and try to reduce the number and complexity of checkpoints.
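For reference, here is roughly what a heavily checkpointed story file looks like in the 0.12 markdown story format (story and checkpoint names below are invented for illustration). Every block that ends in a checkpoint another block starts from keeps trackers alive across generation rounds, and when checkpoints reference each other both ways, the loops described above appear:

```
## greet user
* greet
  - utter_greet
> check_user_known

## known user path
> check_user_known
* request_info
  - utter_provide_info
> check_user_known
```

Note how the second block both starts and ends at `check_user_known` — that is exactly the kind of cycle the generator has to break with synthetic source/sink checkpoints.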
Hey @Ghostvv, thanks very much for the detailed explanation. The stories in question do indeed contain quite a few checkpoints, so that is likely the root cause of the issue in this case. We (@hendr-ik rather) will try to reduce them and see if we can start using interactive training again.
Which leaves me with another question, though: if checkpoints are potentially an issue with interactive training, should their usage be actively discouraged or the feature be deprecated in a future version?
Some background: while debugging I hacked together a bail-out mechanism that would exit after 2 rounds in the scenario above to see what happens, only to find out that my computer would crash on the sheer number of trackers it then had to save. So even if I were willing to wait very long for the process to finish, I still couldn't continue, because the next step would fail for similar reasons.
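The growth visible in the log above (1 → 1280 → 121344 active trackers) is roughly geometric. A toy sketch (plain Python, not Rasa code; the branching factor is a made-up illustration, not a value taken from the generator) shows why even a few rounds are enough to exhaust memory:

```python
def trackers_after_rounds(start, branching_factor, rounds):
    """Multiply the active tracker count by a fixed branching factor
    per generation round, mimicking trackers fanning out at unused
    checkpoints that several story blocks flow into."""
    trackers = start
    for _ in range(rounds):
        trackers *= branching_factor
    return trackers

# Even a modest branching factor explodes within a handful of rounds:
for r in range(5):
    print(r, trackers_after_rounds(1, 64, r))
```

With a branching factor of 64 the count already passes 16 million by round 4, which is why bailing out after round 2 still left far too many trackers to persist.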
Interactive training is a key feature of rasa_core, though, so I wonder whether it makes sense to keep a potentially conflicting feature like checkpoints?
Checkpoints are a controversial feature: the problem is that they are very powerful and often quite useful, but as with any powerful technique, overuse is quite dangerous. Personally, I use them quite a lot, but I try to keep the overall graph structure in mind. You can see this graph by passing the --debug_plots flag to the train.py script
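For reference, an invocation along these lines should produce the graph (a sketch assuming the 0.12-era rasa_core CLI; the exact flags may differ in your version, so check the script's `-h` output):

```shell
# Train the dialogue model and additionally plot the story graph.
# Paths below are placeholders for your own project layout.
python -m rasa_core.train \
    -d domain.yml \
    -s data/stories.md \
    -o models/dialogue \
    --debug_plots
```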
So after applying the fixes from #1410 I can run training with --debug_plots, but I am still seeing a few things I don't fully understand: most nodes map to elements of the given stories, but I also see nodes like GENR_OR_xxxx and GENR_CYCL_xxxx (the latter seems suspicious). Is there any document or place in the codebase that explains what these mean?
GENR_OR_xxxx checkpoints are created when you use the OR keyword between intents in your stories.
GENR_CYCL_xxxx checkpoints are created to break graph loops introduced by your checkpoints - these are the loops I was talking about above
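To make the GENR_OR case concrete: a story block like the following (names invented for illustration) is internally split into one path per intent, and a generated GENR_OR_xxxx checkpoint joins the branches back together:

```
## thanks or affirmation
* thankyou OR affirm
  - utter_youre_welcome
```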
Thanks for the explanation @Ghostvv, we'll try to use this as a guideline for getting the bot in question working on 0.12.
Feel free to close this issue if you want; otherwise I'll do so once the bot is back up.
I'll close this; you can reopen if you encounter more issues