Description of Problem:
Once a model is trained, it cannot be updated: it is not possible to continue training the model on new data that comes in. Instead, the model needs to be retrained from scratch, which takes a lot of time.
Overview of the Solution:
It should be possible to load a model from a previous checkpoint and continue training with new data added.
@tabergma how urgent is this one?
@evgeniiaraz We want to tackle this issue this quarter. @dakshvar22 is leading this topic. Why do you ask?
@tabergma I wanted to work on it not to lose shape :) but if it is urgent, I'll pick something non-essential
Based on the discussion in the document, here are more fine-grained implementation tasks that are needed -
Changes to CLI and rasa/train.py
- A new argument for rasa train called finetune_previous_model which starts training in finetuning mode.
- A new argument for rasa train called finetune_model_path which lets you specify the path to a previous model which should be finetuned.
- rasa.train_async_internal should be refactored to check if training should proceed in finetuning mode (if finetune_previous_model is set to True). If yes, it should then check whether finetuning is possible at all (check the doc for the constraints under which finetuning is possible). A sketch of this flow follows this list.
- rasa.nlu.train should be refactored to create the Trainer object in fine-tune mode, which means each component should be loaded with the model to be finetuned from. This will involve building the pipeline in a way similar to how it is built during inference, i.e. when rasa shell or rasa test is run.
- rasa.core.train should be refactored in the same way.
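To make the CLI piece above concrete, here is a minimal sketch of how such a flag could be threaded into the training entry point. The flag names, the can_finetune helper and the train wrapper are illustrative assumptions, not the actual Rasa CLI or rasa.train_async_internal code:

```python
# Sketch only: how a finetuning flag could be threaded through `rasa train`.
# Argument names and helpers are assumptions, not the actual Rasa CLI/API.
import argparse


def add_finetune_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--finetune-previous-model",
        action="store_true",
        help="Continue training from a previously trained model instead of "
        "training from scratch.",
    )
    parser.add_argument(
        "--finetune-model-path",
        default=None,
        help="Path to the previous model that should be finetuned "
        "(defaults to the latest trained model).",
    )


def can_finetune(previous_model_path: str) -> bool:
    # Placeholder for the constraint check from the doc, e.g. no new or removed
    # intents, actions, slots or entities and an unchanged pipeline config.
    return previous_model_path is not None


def train(args: argparse.Namespace) -> None:
    if args.finetune_previous_model:
        if not can_finetune(args.finetune_model_path):
            raise ValueError("Finetuning is not possible; please train from scratch.")
        # ...load each component from the previous model and continue training...
    else:
        pass  # ...train from scratch as before...
```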
Changes to ML components

CountVectorsFeaturizer (CVF)
- A new parameter max_additional_vocabulary_size which lets users specify the additional buffer size that CVF should keep to accommodate new vocabulary tokens during fine-tuning.
- _train_with_independent_vocab should be refactored to construct the vocabulary with the additional buffer specified above (see the sketch after this list). Things to keep in mind here -
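A rough sketch of the vocabulary-buffer idea, using standalone helper functions rather than the actual CountVectorsFeaturizer code; build_buffered_vocabulary and absorb_new_tokens are hypothetical names for illustration:

```python
# Sketch of the buffer idea only; not the CountVectorsFeaturizer implementation.
# Placeholder entries reserve indices for tokens that only appear in data added
# later, so the feature dimension stays fixed across finetuning runs.
from typing import Dict, List


def build_buffered_vocabulary(
    tokens: List[str], max_additional_vocabulary_size: int
) -> Dict[str, int]:
    # Vocabulary from the initial training data, plus reserved buffer slots.
    vocabulary = {token: index for index, token in enumerate(sorted(set(tokens)))}
    for i in range(max_additional_vocabulary_size):
        vocabulary[f"__buffer_{i}__"] = len(vocabulary)
    return vocabulary


def absorb_new_tokens(vocabulary: Dict[str, int], new_tokens: List[str]) -> None:
    # During finetuning, unseen tokens take over unused buffer slots instead of
    # growing the vocabulary (and therefore the input feature vector).
    free_slots = [t for t in vocabulary if t.startswith("__buffer_")]
    for token in set(new_tokens) - set(vocabulary):
        if not free_slots:
            raise ValueError("Vocabulary buffer exhausted; retrain from scratch.")
        vocabulary[token] = vocabulary.pop(free_slots.pop())
```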
DIETClassifier, ResponseSelector and TEDPolicy

- load() should be refactored to load the models with weights in training mode and not in prediction mode. Currently, _load_model() builds the TF graph in predict mode, which should be changed if the classifier is being loaded for finetuning. So instead of calling _get_tf_call_model_function(), _get_tf_train_functions() should be reused to build the graph for training (see the sketch after this list).
- RasaModelData in finetune mode is the same as what is constructed during training from scratch.
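A simplified sketch of the load-for-finetuning behaviour described above; the FinetunableModel class and should_finetune flag are stand-ins for illustration, not the actual DIETClassifier/RasaModel implementation:

```python
# Simplified sketch of loading a model for finetuning instead of prediction.
# The method names echo the ones discussed above, but this is not the actual
# DIETClassifier/RasaModel code.
class FinetunableModel:
    def __init__(self, should_finetune: bool = False) -> None:
        self.should_finetune = should_finetune

    def _load_model(self, model_data) -> None:
        if self.should_finetune:
            # Build the graph with the training functions so that gradients,
            # optimizer state and dropout behave as during normal training.
            self.train_fn = self._get_tf_train_functions(model_data)
        else:
            # Default behaviour today: build the graph for inference only.
            self.predict_fn = self._get_tf_call_model_function(model_data)

    def _get_tf_train_functions(self, model_data):
        ...  # stub for the sketch

    def _get_tf_call_model_function(self, model_data):
        ...  # stub for the sketch
```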
A working version (very much a draft) of the above steps is implemented on this branch. From early observations, what needs to be improved/additionally done to make this mergeable as a feature -

- TEDPolicy is straightforward and identical to what is done for DIETClassifier. Loading up the instance of the Agent class with the old model in fine-tune mode is what needs to be implemented.
- Support for rasa train nlu and rasa train core as well. Currently it works for rasa train.
- Of course, docs, code quality and tests also need to be added.
Next steps based on the call with @dakshvar22 @joejuzl
Other things to keep in mind:
- Should we branch off master, or does it make more sense to branch off the e2e branch?

I ran some initial experiments using the working version on this branch -
Setup
Data: Financial Bot NLU data, split 80:20 into train and test sets. The train split is further divided 80:20 into 2 sets. The first set is used for training an initial model from scratch; the second set is used for finetuning that first model. Consider the second set as new annotations that a user added to their training data.
Size of Set 1: 233
Size of Set 2: 59
Size of held-out test set: 73
Training: We train the first model from scratch for 100 epochs. Then add the second set to the training data and further train the first model for 30 more epochs.
Note: Finetuning is done by mixing the new data with the old data and then training on batches from the combined data.
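To make the splits above concrete, here is roughly how they could be produced; the use of scikit-learn and the fixed seed are assumptions, only the ratios and counts come from the setup description:

```python
# Rough reproduction of the data splits described above; scikit-learn and the
# seed are assumptions, the ratios match the setup.
from sklearn.model_selection import train_test_split

# Stand-in for the 365 Financial Bot NLU examples (233 + 59 + 73); in practice
# these would be the loaded NLU training examples.
examples = list(range(365))

# 80:20 split into training data and the held-out test set.
train_examples, test_examples = train_test_split(examples, test_size=0.2, random_state=42)

# The training data is split 80:20 again: set 1 trains the initial model from
# scratch (100 epochs), set 2 plays the role of newly added annotations.
set_1, set_2 = train_test_split(train_examples, test_size=0.2, random_state=42)

# Finetuning mixes the new data with the old data and trains on the combined
# set for 30 more epochs.
finetuning_data = set_1 + set_2
```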
Results:
Initial Model|Training data|Number of epochs|Intent F1 (held-out test set)|Entity F1 (held-out test set)|Time for training
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Randomly initialized| Set 1| 100| 0.753| 0.9| 48s
Model trained on set 1| Set 1 + Set 2| 30| 0.861| 0.927| 16s
Randomly initialized| Set 1 + Set 2| 130| 0.876| 0.911| 1 min 16s
Experiments on Sara data -
Size of Set 1: 3166
Size of Set 2: 792
Size of held-out test set: 990
Note: additional_vocabulary_size was set to 1000 for char based CVF and 100 for word based CVF.
Results:
Initial Model|Training data|Number of epochs|Intent F1|Entity F1|Response F1|Time for training
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Randomly initialized| Set 1| 40| 0.789| 0.832| 0.927| 4m 10s
Model trained on set 1| Set 1 + Set 2| 10| 0.823| 0.861| 0.935| 1m 39s
Randomly initialized| Set 1 + Set 2| 50| 0.818| 0.854| 0.938| 6m 2s
@dakshvar22 Do I understand it correctly that incremental training is faster in total than training everything at once? This seems somewhat counterintuitive to me, as I'd expect overhead from loading training data / pipelines etc.
@wochinge The times mentioned above are for training DIETClassifier alone and do not include the time for loading the pipeline and training data. We should measure that too, but it would be much smaller compared to the time required to train DIETClassifier for an additional 40/70 epochs, as shown in the examples above.
Thanks for clarifying! Even if we measure the DIETClassifier on its own - shouldn't the total time of incremental training be greater than when training everything in one go?
@wochinge The small overhead (11s) that you see when training in one go is because of the increase in input feature vector size and hence bigger matrix multiplications. The first two experiments on the Sara data have an input feature vector of size 11752 (actual vocabulary size + buffer added). The third experiment has an input feature vector of size 12752 (actual vocabulary size + buffer added). The additional 1000 dimensions are present because the model is trained from scratch and hence new buffer space is added in CountVectorsFeaturizer. I did run an additional experiment to validate this with additional_vocabulary_size set to 0 in CountVectorsFeaturizer, and the training times were then comparable, with a small stochastic overhead (+-2 secs) either side. Does that help clarify?
Thanks a lot for digging into and clarifying this! 🙌
I had a short look at the e2e branch, and at least for the engineering changes we don't need to branch off e2e. However, DIETClassifier has huge changes on e2e, @dakshvar22, so you probably want to branch off e2e for your changes - what do you think?
@wochinge The only change we need for incremental training inside DIETClassifier is a change in the load method, which isn't touched on e2e. So we should be fine branching off master. I'd like to decouple it from e2e as much as possible.
@wochinge @joejuzl Created a shared branch named continuous_training for us to merge our respective PRs into.
@dakshvar22 cc @joejuzl Can we finetune a core model when NLU was finetuned previously? Or do we have to train Core from scratch as the featurization of messages will change?
Not sure if I understand the case completely. Do you mean that rasa train nlu finetune was run and then rasa train core finetune was run?
rasa train --finetune

Ohh, we can finetune the core model as long as we are inside our current constraints, i.e. no change to labels (intents, actions, slots, entities, etc.). Why do you think we would need to train it from scratch?
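For illustration, the constraint above could be checked roughly like this; the attribute names on the domain objects are assumptions, not the actual Rasa Domain API:

```python
# Rough sketch of the "no label changes" constraint; the attribute names are
# assumptions for illustration, not the actual Rasa Domain API.
def labels_unchanged(old_domain, new_domain) -> bool:
    label_attributes = ("intents", "actions", "slots", "entities")
    return all(
        set(getattr(old_domain, attr, [])) == set(getattr(new_domain, attr, []))
        for attr in label_attributes
    )
```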