Description of Problem:
DIET uses a Transformer encoder; however, it is perfectly possible to use a different encoder instead. In my experiments (see below) both encoders achieve comparable performance, with the LSTM typically being slightly weaker than the Transformer. Nonetheless, our experiments are not exhaustive, and some users might find the LSTM encoder performing better than the Transformer. In terms of training time, the Transformer-based model trains slightly faster than the LSTM-based one, but the difference isn't dramatic. A minimal sketch of what such an encoder swap could look like is below.
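For context, here is a rough sketch of an LSTM encoder that could stand in for the Transformer, assuming DIET's TensorFlow stack; `build_lstm_encoder` and its parameters are illustrative names, not Rasa's actual API:

```python
import tensorflow as tf

# Hypothetical drop-in for DIET's Transformer encoder: a stack of
# bidirectional LSTMs that, like the Transformer, maps a
# [batch, seq_len, units] sequence of featurized tokens to a
# [batch, seq_len, units] sequence of contextual embeddings, so the
# downstream intent/entity heads can stay unchanged.
def build_lstm_encoder(units: int, num_layers: int, dropout: float) -> tf.keras.Model:
    return tf.keras.Sequential(
        [
            tf.keras.layers.Bidirectional(
                # units // 2 per direction so the concatenated forward/backward
                # output matches the Transformer encoder's output size
                tf.keras.layers.LSTM(
                    units // 2, return_sequences=True, dropout=dropout
                )
            )
            for _ in range(num_layers)
        ]
    )
```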
Preliminary experimental results:
| Dataset           | Metric       | LSTM             | Transformer      |
|-------------------|--------------|------------------|------------------|
| ATIS - Intent     | Accuracy     | 95.53            | 96.61            |
| ATIS - Entities   | micro-avg F1 | 94.83            | 95.37            |
| SNIPS - Intent    | Accuracy     | 97.71            | 98.03            |
| SNIPS - Entities  | micro-avg F1 | 95.62            | 95.10            |
| HERMIT - Intent   | micro-avg F1 | 90.77 (+/- 0.72) | 89.89 (+/- 0.43) |
| HERMIT - Entities | micro-avg F1 | 81.47 (+/- 1.27) | 87.38 (+/- 0.64) |
Overview of the Solution:
There are some upsides and some downsides to adding this option to rasa OSS.

_Pro_:

- Config keys with shared semantics (e.g. `num_transformer_layers` vs. `num_lstm_layers`) could become `num_encoder_layers` (see the sketch after this list).

_Con_:

- More encoder-specific config keys to maintain (`num_transformer_layers` vs. `num_lstm_layers`).

An initial implementation is done; integration into rasa OSS will involve at least a little bit of refactoring in the DIET model, plus optionally harmonising some of the config keys.
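To make the harmonisation concrete, a sketch of what a DIET component config with shared keys could look like; `encoder_type` and `num_encoder_layers` are assumed new keys, not existing Rasa options:

```python
# Hypothetical DIET component config with harmonised keys: "encoder_type"
# selects the encoder, and "num_encoder_layers" replaces the encoder-specific
# "num_transformer_layers" / "num_lstm_layers" pair discussed above.
DIET_CONFIG = {
    "encoder_type": "lstm",   # assumed option; "transformer" would be the default
    "num_encoder_layers": 2,  # shared semantics across both encoders
    "epochs": 300,
}
```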
What are your thoughts, @tabergma @Ghostvv @amn41 @tmbo?
Definition of Done:
Which version of Rasa is it? Is it from before I reintroduced scale_loss?
Yes, it's branched off 1.8, so it doesn't include that fix yet.
@tttthomasssss The current results above don't give a lot of motivation to include the LSTM as a configurable option IMO. Do you have any results with pipelines that don't have the ConveRT featurizer in them? Curious whether the LSTM is better for other feature combinations.
Results with sparse features have been substantially worse. I haven't run many experiments with GloVe, though I can certainly give it a go.
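For reference, a sketch of a ConveRT-free pipeline (sparse count vectors plus spaCy's pre-trained dense vectors) that could be used for such a comparison, written as the Python equivalent of the YAML pipeline; the component names are real Rasa components, while `encoder_type` is the hypothetical option from above:

```python
# Pipeline without the ConveRT featurizer: sparse bag-of-words features from
# CountVectorsFeaturizer plus dense pre-trained word vectors from spaCy.
pipeline = [
    {"name": "SpacyNLP", "model": "en_core_web_md"},
    {"name": "SpacyTokenizer"},
    {"name": "SpacyFeaturizer"},         # dense GloVe-style word vectors
    {"name": "CountVectorsFeaturizer"},  # sparse features
    {"name": "DIETClassifier", "encoder_type": "lstm", "epochs": 300},
]
```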