Description of Problem:
Once a model is trained, it cannot be updated: it is not possible to continue training the model on new data that comes in. Instead, the model needs to be retrained from scratch, which takes a lot of time.
Overview of the Solution:
It should be possible to load a model from a previous checkpoint and continue training with new data added.
@tabergma how urgent is this one?
@evgeniiaraz We want to tackle this issue this quarter. @dakshvar22 is leading this topic. Why do you ask?
@tabergma I wanted to work on it not to lose shape :) but if it is urgent, I'll pick something non-essential
Based on the discussion in the document, here are more fine-grained implementation tasks that are needed -
Changes to CLI and rasa/train.py
- A new argument for rasa train called finetune_previous_model which starts training in finetuning mode.
- A new argument for rasa train called finetune_model_path which lets you specify the path to a previous model which should be finetuned.
- rasa.train_async_internal should be refactored to check if training should proceed in finetuning mode (if finetune_previous_model is set to True). If yes, it should then check whether finetuning is possible at all (check the doc for the constraints under which finetuning is possible). A sketch of this flow follows this list.
- rasa.nlu.train should be refactored to create the Trainer object in fine-tune mode, which means each component should be loaded with the model to be finetuned from. This will involve building the pipeline in a way similar to how it is built during inference, i.e. when rasa shell or rasa test is run.
- rasa.core.train should be refactored in the same way.
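To make the CLI piece above concrete, here is a minimal sketch of how such a flag could be threaded into the training entry point. The flag names, the can_finetune helper and the train wrapper are illustrative assumptions, not the actual Rasa CLI or rasa.train_async_internal code:

```python
# Sketch only: how a finetuning flag could be threaded through `rasa train`.
# Argument names and helpers are assumptions, not the actual Rasa CLI/API.
import argparse


def add_finetune_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--finetune-previous-model",
        action="store_true",
        help="Continue training from a previously trained model instead of "
        "training from scratch.",
    )
    parser.add_argument(
        "--finetune-model-path",
        default=None,
        help="Path to the previous model that should be finetuned "
        "(defaults to the latest trained model).",
    )


def can_finetune(previous_model_path: str) -> bool:
    # Placeholder for the constraint check from the doc, e.g. no new or removed
    # intents, actions, slots or entities and an unchanged pipeline config.
    return previous_model_path is not None


def train(args: argparse.Namespace) -> None:
    if args.finetune_previous_model:
        if not can_finetune(args.finetune_model_path):
            raise ValueError("Finetuning is not possible; please train from scratch.")
        # ...load each component from the previous model and continue training...
    else:
        pass  # ...train from scratch as before...
```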
Changes to ML components

CountVectorsFeaturizer (CVF)
- A new parameter max_additional_vocabulary_size which lets users specify the additional buffer size that CVF should keep to accommodate new vocabulary tokens during fine-tuning.
- _train_with_independent_vocab should be refactored to construct the vocabulary with the additional buffer specified above (see the sketch after this list). Things to keep in mind here -
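A rough sketch of the vocabulary-buffer idea, using standalone helper functions rather than the actual CountVectorsFeaturizer code; build_buffered_vocabulary and absorb_new_tokens are hypothetical names for illustration:

```python
# Sketch of the buffer idea only; not the CountVectorsFeaturizer implementation.
# Placeholder entries reserve indices for tokens that only appear in data added
# later, so the feature dimension stays fixed across finetuning runs.
from typing import Dict, List


def build_buffered_vocabulary(
    tokens: List[str], max_additional_vocabulary_size: int
) -> Dict[str, int]:
    # Vocabulary from the initial training data, plus reserved buffer slots.
    vocabulary = {token: index for index, token in enumerate(sorted(set(tokens)))}
    for i in range(max_additional_vocabulary_size):
        vocabulary[f"__buffer_{i}__"] = len(vocabulary)
    return vocabulary


def absorb_new_tokens(vocabulary: Dict[str, int], new_tokens: List[str]) -> None:
    # During finetuning, unseen tokens take over unused buffer slots instead of
    # growing the vocabulary (and therefore the input feature vector).
    free_slots = [t for t in vocabulary if t.startswith("__buffer_")]
    for token in set(new_tokens) - set(vocabulary):
        if not free_slots:
            raise ValueError("Vocabulary buffer exhausted; retrain from scratch.")
        vocabulary[token] = vocabulary.pop(free_slots.pop())
```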
DIETClassifier, ResponseSelector and TEDPolicy

- load() should be refactored to load the models with weights in training mode and not in prediction mode. Currently, _load_model() builds the TF graph in predict mode, which should be changed if the classifier is being loaded for finetuning. So instead of calling _get_tf_call_model_function(), _get_tf_train_functions() should be reused to build the graph for training (see the sketch after this list).
- RasaModelData in finetune mode is the same as what is constructed during training from scratch.
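A simplified sketch of the load-for-finetuning behaviour described above; the FinetunableModel class and should_finetune flag are stand-ins for illustration, not the actual DIETClassifier/RasaModel implementation:

```python
# Simplified sketch of loading a model for finetuning instead of prediction.
# The method names echo the ones discussed above, but this is not the actual
# DIETClassifier/RasaModel code.
class FinetunableModel:
    def __init__(self, should_finetune: bool = False) -> None:
        self.should_finetune = should_finetune

    def _load_model(self, model_data) -> None:
        if self.should_finetune:
            # Build the graph with the training functions so that gradients,
            # optimizer state and dropout behave as during normal training.
            self.train_fn = self._get_tf_train_functions(model_data)
        else:
            # Default behaviour today: build the graph for inference only.
            self.predict_fn = self._get_tf_call_model_function(model_data)

    def _get_tf_train_functions(self, model_data):
        ...  # stub for the sketch

    def _get_tf_call_model_function(self, model_data):
        ...  # stub for the sketch
```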
A working version (very much a draft) of the above steps is implemented on this branch. From early observations, what needs to be improved/additionally done to make this mergeable as a feature -

- TEDPolicy is straightforward and identical to what is done for DIETClassifier. Loading up the instance of the Agent class with the old model in fine-tune mode is what needs to be implemented.
- Support for rasa train nlu and rasa train core as well. Currently it works for rasa train.
- Of course, docs, code quality and tests also need to be added.
Next steps based on the call with @dakshvar22 @joejuzl
Other things to keep in mind:
- Should we branch off master, or does it make more sense to branch off the e2e branch?

I ran some initial experiments using the working version on this branch -
Setup
Data: Financial Bot NLU data, split 80:20 into train and test sets. The train split is further divided 80:20 into 2 sets. The first set is used for training an initial model from scratch; the second set is used for finetuning that first model. Consider the second set as new annotations that a user added to their training data.
Size of Set 1: 233
Size of Set 2: 59
Size of held-out test set: 73
Training: We train the first model from scratch for 100 epochs. Then add the second set to the training data and further train the first model for 30 more epochs.
Note: Finetuning is done by mixing the new data with the old data and then training on batches from the combined data.
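To make the splits above concrete, here is roughly how they could be produced; the use of scikit-learn and the fixed seed are assumptions, only the ratios and counts come from the setup description:

```python
# Rough reproduction of the data splits described above; scikit-learn and the
# seed are assumptions, the ratios match the setup.
from sklearn.model_selection import train_test_split

# Stand-in for the 365 Financial Bot NLU examples (233 + 59 + 73); in practice
# these would be the loaded NLU training examples.
examples = list(range(365))

# 80:20 split into training data and the held-out test set.
train_examples, test_examples = train_test_split(examples, test_size=0.2, random_state=42)

# The training data is split 80:20 again: set 1 trains the initial model from
# scratch (100 epochs), set 2 plays the role of newly added annotations.
set_1, set_2 = train_test_split(train_examples, test_size=0.2, random_state=42)

# Finetuning mixes the new data with the old data and trains on the combined
# set for 30 more epochs.
finetuning_data = set_1 + set_2
```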
Results:
Initial Model|Training data|Number of epochs|Intent F1 (held-out test set)|Entity F1 (held-out test set)|Time for training
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Randomly initialized| Set 1| 100| 0.753| 0.9| 48s
Model trained on set 1| Set 1 + Set 2| 30| 0.861| 0.927| 16s
Randomly initialized| Set 1 + Set 2| 130| 0.876| 0.911| 1 min 16s
Experiments on Sara data -
Size of Set 1: 3166
Size of Set 2: 792
Size of held-out test set: 990
Note: additional_vocabulary_size was set to 1000 for char based CVF and 100 for word based CVF.
Results:
Initial Model|Training data|Number of epochs|Intent F1|Entity F1|Response F1|Time for training
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Randomly initialized| Set 1| 40| 0.789| 0.832| 0.927| 4m 10s
Model trained on set 1| Set 1 + Set 2| 10| 0.823| 0.861| 0.935| 1m 39s
Randomly initialized| Set 1 + Set 2| 50| 0.818| 0.854| 0.938| 6m 2s
@dakshvar22 Do I understand it correctly that incremental training is faster in total than training everything at once? This seems somewhat counterintuitive to me, as I'd expect overhead from loading training data / pipelines etc.
@wochinge The times mentioned above are for training DIETClassifier alone and do not include the time for loading the pipeline and training data. We should measure that too, but it would be much smaller compared to the time required to train DIETClassifier for an additional 40/70 epochs, as shown in the examples above.
Thanks for clarifying! Even if we measure the DIETClassifier on its own - shouldn't the total time of incremental training be greater than when training everything in one go?
@wochinge The small overhead (11s) that you see when training in one go is because of the increase in input feature vector size and hence bigger matrix multiplications. The first two experiments on the Sara data have an input feature vector of size 11752 (actual vocabulary size + buffer added). The third experiment has an input feature vector of size 12752 (actual vocabulary size + buffer added). The additional 1000 dimensions are present because the model is trained from scratch and hence new buffer space is added in CountVectorsFeaturizer. I did run an additional experiment to validate this with additional_vocabulary_size set to 0 in CountVectorsFeaturizer, and the training times were then comparable, with a small stochastic overhead (+-2 secs) either side. Does that help clarify?
Thanks a lot for digging into and clarifying this! 🙌
I had a short look at the e2e branch, and at least for the engineering changes we don't need to branch off e2e. However, DIETClassifier has huge changes on e2e, @dakshvar22, so you probably want to branch off e2e for your changes - what do you think?
@wochinge The only change we need for incremental training inside DIETClassifier is a change in the load method, which isn't touched on e2e. So we should be fine branching off master. I'd like to decouple it from e2e as much as possible.
@wochinge @joejuzl Created a shared branch named continuous_training for us to merge our respective PRs into.
@dakshvar22 cc @joejuzl Can we finetune a core model when NLU was finetuned previously? Or do we have to train Core from scratch as the featurization of messages will change?
Not sure if I understand the case completely. Do you mean that rasa train nlu finetune was run and then rasa train core finetune was run?
rasa train --finetune

Ohh, we can finetune the core model as long as we are inside our current constraints, i.e. no change to labels (intents, actions, slots, entities, etc.). Why do you think we would need to train it from scratch?
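For illustration, the constraint above could be checked roughly like this; the attribute names on the domain objects are assumptions, not the actual Rasa Domain API:

```python
# Rough sketch of the "no label changes" constraint; the attribute names are
# assumptions for illustration, not the actual Rasa Domain API.
def labels_unchanged(old_domain, new_domain) -> bool:
    label_attributes = ("intents", "actions", "slots", "entities")
    return all(
        set(getattr(old_domain, attr, [])) == set(getattr(new_domain, attr, []))
        for attr in label_attributes
    )
```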