I'm trying to benchmark the performance of the language model example against TensorFlow on the PTB dataset.
With the same parameter settings I cannot reproduce the same result, so I modified the parameters (optimizer: Adam, batch_size=64, lr=0.01, wd=0.0001), which only gets close: validation PPL = 134 (TF gets 121). When I tried the medium setting in MXNet, the result was even worse.
Can anyone figure out the reason?
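For reference, the modified setting above written out with the MXNet optimizer API (a sketch of the configuration I described, not the example's actual code):
```
import mxnet as mx

# Adam with the settings mentioned above; batch_size=64 is set on the data iterator.
opt = mx.optimizer.Adam(learning_rate=0.01, wd=0.0001)
```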
Have you used norm_clipping? If not, you can try the script in the PR https://github.com/dmlc/mxnet/pull/5861
@sxjscience Do you mean the clip_gradient (float, optional) parameter of mxnet.optimizer.Optimizer? If so, yes, I have used it, but it doesn't seem to help. Or am I using it incorrectly?
No, it's different. The clip_gradient parameter clips each gradient element to a maximum absolute value, while norm_clipping refers to clipping the global L2 norm.
Setting clip_gradient is like clipping each element by its infinity norm, whereas norm_clipping rescales the gradients based on their global L2 norm, which is less restrictive.
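To make the distinction concrete, here is a rough sketch of the two clipping schemes on a plain list of NDArray gradients (illustrative helper functions, not the example's code):
```
import mxnet as mx

# Element-wise value clipping (what clip_gradient does): every gradient entry
# is clamped to the range [-clip, clip].
def clip_by_value(grads, clip):
    return [mx.nd.clip(g, -clip, clip) for g in grads]

# Global L2-norm clipping (what norm_clipping / --max-norm refers to): all
# gradients are rescaled together so their joint L2 norm is at most max_norm.
def clip_by_global_norm(grads, max_norm):
    total_norm = mx.nd.sqrt(sum((g ** 2).sum() for g in grads)).asscalar()
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```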
@tangjiasheng @piiswrong I find that the revised script still cannot reproduce the PTB result. The reason is that the training split is different from the standard approach. In the current example, we split the PTB data into multiple sentences and group different sentences into a minibatch, while in the original version the raw data is split into multiple continuous segments and truncated BPTT is used within each segment.
See https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/reader.py and https://github.com/wojzaremba/lstm/blob/master/data.lua#L48
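For comparison, a minimal sketch of the standard PTB batching used in those two references: the whole token stream is cut into batch_size continuous segments, and each minibatch is a bptt-length window sliced from every segment, so the hidden state can be carried between windows (illustrative code, not the example's):
```
import numpy as np

def ptb_batches(raw_data, batch_size, bptt):
    # Reshape the single token stream into batch_size continuous rows,
    # then yield bptt-wide windows of inputs and next-word targets.
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    for i in range((batch_len - 1) // bptt):
        x = data[:, i * bptt:(i + 1) * bptt]
        y = data[:, i * bptt + 1:(i + 1) * bptt + 1]  # shifted by one token
        yield x, y
```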
@sxjscience Yes, I'm running your new script and see the same performance. Besides the data processing, is the convergence rate comparable when using SGD? I found it also needs more epochs to converge to a low PPL.
When using Adam, the convergence rate seems better. But as you know, TF only uses SGD in this example.
@tangjiasheng I'm trying to reproduce the TF version exactly, so I'm revising it to use the same optimization parameters. But it turns out that the way we train and evaluate the model differs from TF's; I will need to correct it.
@sxjscience Thanks for your efforts. Could you please update this issue when you make progress?
@tangjiasheng Yes, I'll let you know once I've got the result.
Following the discussion. :)
@zihaolucky @tangjiasheng The PTB validation and testing results are much more reasonable now. You can try lstm_ptb.py in the PR.
@sxjscience Cool. Should we add more relevant documentation/discussion for later users, such as the differences between the two implementations?
@sxjscience Nice! Besides gradient clipping, what else led to this significant reduction in PPL?
I think truncated BPTT helps improve the performance.
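Roughly, what truncated BPTT means in the training loop: the LSTM state is carried from one bptt-length window to the next, but detached so gradients stop at the window boundary. A small self-contained Gluon sketch (shapes and hyperparameters are illustrative only):
```
import mxnet as mx
from mxnet import gluon, autograd

batch_size, bptt, hidden = 20, 35, 200
model = gluon.rnn.LSTM(hidden, num_layers=2)
model.initialize()
states = model.begin_state(batch_size=batch_size)

for _ in range(3):  # a few dummy windows
    x = mx.nd.random.uniform(shape=(bptt, batch_size, hidden))
    states = [s.detach() for s in states]   # keep the state values, drop the graph
    with autograd.record():
        output, states = model(x, states)
        loss = output.sum()                  # placeholder loss
    loss.backward()                          # gradients stop at the window boundary
```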
@tangjiasheng Would you have time to test the code with other parameters?
@sxjscience I will do it in the daytime.
@sxjscience I just tested with the same parameter settings as TF's medium and large configs. The convergence now looks right; see below.
@tangjiasheng I'll look at the medium and large cases now.
I've performed a quick test of the medium case
```
# Medium
python lstm_ptb.py --num-hidden 650 --lr-decay 0.8 --gpus 0
```
(The scale of the initializer is not the same as in TF; I need to fix it later.)
The network can get a better PPL than the small version.
For the large case, I manually changed the initialization scale to mx.init.Uniform(0.04) and it gets a better result:
```
python lstm_ptb.py --num-hidden 1500 --lr-decay 0.87 --dropout 0.35 --max-norm 10 --gpus 0
```
@sxjscience Sorry, I used a wrong lr_decay setting. After resetting it, the result is right.
Additionally, num_embed should also be set to the same value as num_hidden, but it has little bearing on the result.
Updated previous response:
```
2017-04-19 16:38:25,920 Epoch:[9], Batch: [200]/[1327], lr: 0.640000, nll: 4.33644, ppl: 76.4347, grad_norm: 10.8647, ms/batch: 28.82
2017-04-19 16:38:31,655 Epoch:[9], Batch: [400]/[1327], lr: 0.640000, nll: 4.33113, ppl: 76.0302, grad_norm: 11.7073, ms/batch: 28.67
2017-04-19 16:38:37,393 Epoch:[9], Batch: [600]/[1327], lr: 0.640000, nll: 4.31457, ppl: 74.7812, grad_norm: 11.812, ms/batch: 28.69
2017-04-19 16:38:43,129 Epoch:[9], Batch: [800]/[1327], lr: 0.640000, nll: 4.25376, ppl: 70.3692, grad_norm: 11.4512, ms/batch: 28.68
2017-04-19 16:38:48,880 Epoch:[9], Batch: [1000]/[1327], lr: 0.640000, nll: 4.31899, ppl: 75.1131, grad_norm: 11.3126, ms/batch: 28.76
2017-04-19 16:38:54,623 Epoch:[9], Batch: [1200]/[1327], lr: 0.640000, nll: 4.19484, ppl: 66.343, grad_norm: 11.7858, ms/batch: 28.71
2017-04-19 16:39:01,640 Epoch:[9], valid nll: 4.66385, valid ppl: 106.043
2017-04-19 16:39:01,640 Epoch:[9], test nll: 4.62765, test ppl: 102.274
```
@tangjiasheng Thanks a lot! I will revise the script.
@sxjscience These lines of code may be convenient for you; they include several parameters that aren't (and maybe never will be, haha) used. The default setting is left as the small config.
```
def load_medium_config(args):
    args.num_layers = 2
    args.num_hidden = 650
    args.num_embed = 650
    args.lr_decay = 0.8
    args.lr = 1.0
    args.max_norm = 5
    args.dropout = 0.5
    args.batch_size = 20
    args.bptt = 35
    args.init_scale = 0.05

def load_large_config(args):
    args.num_layers = 2
    args.num_hidden = 1500
    args.num_embed = 1500
    args.lr_decay = 1 / 1.15
    args.lr = 1.0
    args.max_norm = 10
    args.dropout = 0.65
    args.batch_size = 20
    args.bptt = 35
    args.init_scale = 0.04
```
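A possible way to hook these into the script's argparse args (the --config flag here is an assumption, not part of lstm_ptb.py):
```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--config', default='small', choices=['small', 'medium', 'large'])
args, _ = parser.parse_known_args()
if args.config == 'medium':
    load_medium_config(args)
elif args.config == 'large':
    load_large_config(args)
```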
@sxjscience I fixed a minor bug in your code that shows up when running on multiple GPUs: the evaluation and testing phases should be allocated to a single GPU.
File "lstm_ptb.py", line 165, in evaluate
axis=-1).asnumpy()
mxnet.base.MXNetError: Shape inconsistent, Provided=(20,10), inferred shape=(20,5)
I fixed the code in a lazy (not clean) way. Since the change is so small, there's no need to send a separate PR; you can just fold it into your PR:
```
# Add this for the eval and test phase: bind the evaluation module to a single GPU.
test_contexts = [contexts[0]]
test_net = mx.mod.Module(mx.sym.Group([logits_sym] +
                                      [mx.sym.BlockGrad(ele) for ele in next_states]),
                         data_names=["data"],
                         label_names=None,
                         state_names=[state.name for state in prev_state_list],
                         context=test_contexts)
```
I'm new to MXNet; maybe we could perform distributed evaluation instead? Not sure.
Close it for now as we have examples in Gluon: https://github.com/apache/incubator-mxnet/tree/master/example/gluon/word_language_model
@sxjscience Can you tell me what the differences are between Gluon's word_language_model and lstm_bucketing.py? The word_language_model's PPL is 70+ while lstm_bucketing.py's PPL is 150+ on the PTB data. Thanks.
@lyblsgo I think the difference is that one uses the Gluon API and the other uses the Module API.
@lyblsgo the Gluon version uses gradient clipping by global norm
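For reference, a small self-contained sketch of that scheme, clipping all parameter gradients jointly by their global L2 norm with gluon.utils.clip_global_norm (the network and numbers here are illustrative, not the example's code):
```
import mxnet as mx
from mxnet import gluon, autograd

net = gluon.nn.Dense(10)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 1.0})

x = mx.nd.random.uniform(shape=(20, 5))
with autograd.record():
    loss = net(x).sum()
loss.backward()

# Rescale every gradient together so their joint L2 norm is at most 0.25,
# then take the optimizer step.
grads = [p.grad() for p in net.collect_params().values() if p.grad_req != 'null']
gluon.utils.clip_global_norm(grads, 0.25)
trainer.step(batch_size=20)
```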
@eric-haibin-lin Thanks