I'm trying to benchmark the performance of the language model example against TensorFlow on the PTB dataset.
With the same parameter settings I cannot reproduce the same result, so I modified the parameters (optimizer: Adam, batch_size=64, lr=0.01, wd=0.0001), which only gets close: validation PPL = 134 (TF gets 121). When I tried the medium setting in MXNet, the result was even worse.
Can anyone figure out the reason?
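For reference, the modified setting above written out with the MXNet optimizer API (a sketch of the configuration I described, not the example's actual code):
```
import mxnet as mx

# Adam with the settings mentioned above; batch_size=64 is set on the data iterator.
opt = mx.optimizer.Adam(learning_rate=0.01, wd=0.0001)
```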
Have you used norm_clipping? If not, you can try the script in the PR https://github.com/dmlc/mxnet/pull/5861
@sxjscience Do you mean the clip_gradient (float, optional) parameter of mxnet.optimizer.Optimizer? If so, yes, I have used it, but it doesn't seem to help. Or am I using it incorrectly?
No, it's different. The clip_gradient parameter clips each gradient element to a maximum absolute value, while norm_clipping refers to clipping the global L2 norm.
Setting clip_gradient is like clipping each element by its infinity norm, whereas norm_clipping rescales the gradients based on their global L2 norm, which is less restrictive.
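To make the distinction concrete, here is a rough sketch of the two clipping schemes on a plain list of NDArray gradients (illustrative helper functions, not the example's code):
```
import mxnet as mx

# Element-wise value clipping (what clip_gradient does): every gradient entry
# is clamped to the range [-clip, clip].
def clip_by_value(grads, clip):
    return [mx.nd.clip(g, -clip, clip) for g in grads]

# Global L2-norm clipping (what norm_clipping / --max-norm refers to): all
# gradients are rescaled together so their joint L2 norm is at most max_norm.
def clip_by_global_norm(grads, max_norm):
    total_norm = mx.nd.sqrt(sum((g ** 2).sum() for g in grads)).asscalar()
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```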
@tangjiasheng @piiswrong I find that the revised script still cannot reproduce the PTB result. The reason is that the training split is different from the standard approach. In the current example, we split the PTB data into multiple sentences and group different sentences into a minibatch, while in the original version the raw data is split into multiple continuous segments and truncated BPTT is used within each segment.
See https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/reader.py and https://github.com/wojzaremba/lstm/blob/master/data.lua#L48
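For comparison, a minimal sketch of the standard PTB batching used in those two references: the whole token stream is cut into batch_size continuous segments, and each minibatch is a bptt-length window sliced from every segment, so the hidden state can be carried between windows (illustrative code, not the example's):
```
import numpy as np

def ptb_batches(raw_data, batch_size, bptt):
    # Reshape the single token stream into batch_size continuous rows,
    # then yield bptt-wide windows of inputs and next-word targets.
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    for i in range((batch_len - 1) // bptt):
        x = data[:, i * bptt:(i + 1) * bptt]
        y = data[:, i * bptt + 1:(i + 1) * bptt + 1]  # shifted by one token
        yield x, y
```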
@sxjscience Yes, I'm running your new script and see the same performance. Besides the data processing, is the convergence rate comparable when using SGD? I found it also needs more epochs to converge to a low PPL.
When using Adam, the convergence rate seems better. But as you know, TF only uses SGD in this example.
@tangjiasheng I'm trying to reproduce the TF version exactly, so I'm revising it to use the same optimization parameters. But it turns out that the way we train and evaluate the model differs from TF's; I will need to correct it.
@sxjscience Thanks for your efforts. Could you please update this issue when you make progress?
@tangjiasheng Yes, I'll let you know once I've got the result.
Following the discussion. :)
@zihaolucky @tangjiasheng The PTB validation and testing results are much more reasonable now. You can try lstm_ptb.py in the PR.
@sxjscience Cool. Should we add more relevant documentation/discussion for later users, such as the differences between the two implementations?
@sxjscience Nice! Besides gradient clipping, what else led to this significant reduction in PPL?
I think truncated BPTT helps improve the performance.
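Roughly, what truncated BPTT means in the training loop: the LSTM state is carried from one bptt-length window to the next, but detached so gradients stop at the window boundary. A small self-contained Gluon sketch (shapes and hyperparameters are illustrative only):
```
import mxnet as mx
from mxnet import gluon, autograd

batch_size, bptt, hidden = 20, 35, 200
model = gluon.rnn.LSTM(hidden, num_layers=2)
model.initialize()
states = model.begin_state(batch_size=batch_size)

for _ in range(3):  # a few dummy windows
    x = mx.nd.random.uniform(shape=(bptt, batch_size, hidden))
    states = [s.detach() for s in states]   # keep the state values, drop the graph
    with autograd.record():
        output, states = model(x, states)
        loss = output.sum()                  # placeholder loss
    loss.backward()                          # gradients stop at the window boundary
```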
@tangjiasheng Would you have time to test the code with other parameters?
@sxjscience I will do it in the daytime.
@sxjscience I just tested with the same parameter settings as TF's medium and large configs. The convergence now looks right; see below.
@tangjiasheng I'll look at the medium and large cases now.
I've performed a quick test of the medium case
```
# Medium
python lstm_ptb.py --num-hidden 650 --lr-decay 0.8 --gpus 0
```
(The scale of the initializer is not the same as in TF; I need to fix it later.)
The network can get a better PPL than the small version.
For the large case, I manually changed the initialization scale to mx.init.Uniform(0.04) and it gets a better result:
```
python lstm_ptb.py --num-hidden 1500 --lr-decay 0.87 --dropout 0.35 --max-norm 10 --gpus 0
```
@sxjscience Sorry, I used a wrong lr_decay setting. After resetting it, the result is right.
Additionally, num_embed should also be set to the same value as num_hidden, but it has little bearing on the result.
Updated previous response:
```
2017-04-19 16:38:25,920 Epoch:[9], Batch: [200]/[1327], lr: 0.640000, nll: 4.33644, ppl: 76.4347, grad_norm: 10.8647, ms/batch: 28.82
2017-04-19 16:38:31,655 Epoch:[9], Batch: [400]/[1327], lr: 0.640000, nll: 4.33113, ppl: 76.0302, grad_norm: 11.7073, ms/batch: 28.67
2017-04-19 16:38:37,393 Epoch:[9], Batch: [600]/[1327], lr: 0.640000, nll: 4.31457, ppl: 74.7812, grad_norm: 11.812, ms/batch: 28.69
2017-04-19 16:38:43,129 Epoch:[9], Batch: [800]/[1327], lr: 0.640000, nll: 4.25376, ppl: 70.3692, grad_norm: 11.4512, ms/batch: 28.68
2017-04-19 16:38:48,880 Epoch:[9], Batch: [1000]/[1327], lr: 0.640000, nll: 4.31899, ppl: 75.1131, grad_norm: 11.3126, ms/batch: 28.76
2017-04-19 16:38:54,623 Epoch:[9], Batch: [1200]/[1327], lr: 0.640000, nll: 4.19484, ppl: 66.343, grad_norm: 11.7858, ms/batch: 28.71
2017-04-19 16:39:01,640 Epoch:[9], valid nll: 4.66385, valid ppl: 106.043
2017-04-19 16:39:01,640 Epoch:[9], test nll: 4.62765, test ppl: 102.274
```
@tangjiasheng Thanks a lot! I will revise the script.
@sxjscience These lines of code may be convenient for you; they include several parameters that aren't (and maybe never will be, haha) used. The default setting is left as the small config.
```
def load_medium_config(args):
    args.num_layers = 2
    args.num_hidden = 650
    args.num_embed = 650
    args.lr_decay = 0.8
    args.lr = 1.0
    args.max_norm = 5
    args.dropout = 0.5
    args.batch_size = 20
    args.bptt = 35
    args.init_scale = 0.05

def load_large_config(args):
    args.num_layers = 2
    args.num_hidden = 1500
    args.num_embed = 1500
    args.lr_decay = 1 / 1.15
    args.lr = 1.0
    args.max_norm = 10
    args.dropout = 0.65
    args.batch_size = 20
    args.bptt = 35
    args.init_scale = 0.04
```
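A possible way to hook these into the script's argparse args (the --config flag here is an assumption, not part of lstm_ptb.py):
```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--config', default='small', choices=['small', 'medium', 'large'])
args, _ = parser.parse_known_args()
if args.config == 'medium':
    load_medium_config(args)
elif args.config == 'large':
    load_large_config(args)
```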
@sxjscience I fixed a minor bug in your code that shows up when running on multiple GPUs: the evaluation and testing phases should be allocated to a single GPU.
File "lstm_ptb.py", line 165, in evaluate
axis=-1).asnumpy()
mxnet.base.MXNetError: Shape inconsistent, Provided=(20,10), inferred shape=(20,5)
I fixed the code in a lazy (not clean) way. Since the change is so small, there's no need to send a separate PR; you can just fold it into your PR:
```
# Add this for the eval and test phase: bind the evaluation module to a single GPU.
test_contexts = [contexts[0]]
test_net = mx.mod.Module(mx.sym.Group([logits_sym] +
                                      [mx.sym.BlockGrad(ele) for ele in next_states]),
                         data_names=["data"],
                         label_names=None,
                         state_names=[state.name for state in prev_state_list],
                         context=test_contexts)
```
I'm new to MXNet; maybe we could perform distributed evaluation instead? Not sure.
Close it for now as we have examples in Gluon: https://github.com/apache/incubator-mxnet/tree/master/example/gluon/word_language_model
@sxjscience Can you tell me what the differences are between Gluon's word_language_model and lstm_bucketing.py? The word_language_model's PPL is 70+ while lstm_bucketing.py's PPL is 150+ on the PTB data. Thanks.
@lyblsgo I think the difference is that one uses the Gluon API and the other uses the Module API.
@lyblsgo the Gluon version uses gradient clipping by global norm
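For reference, a small self-contained sketch of that scheme, clipping all parameter gradients jointly by their global L2 norm with gluon.utils.clip_global_norm (the network and numbers here are illustrative, not the example's code):
```
import mxnet as mx
from mxnet import gluon, autograd

net = gluon.nn.Dense(10)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 1.0})

x = mx.nd.random.uniform(shape=(20, 5))
with autograd.record():
    loss = net(x).sum()
loss.backward()

# Rescale every gradient together so their joint L2 norm is at most 0.25,
# then take the optimizer step.
grads = [p.grad() for p in net.collect_params().values() if p.grad_req != 'null']
gluon.utils.clip_global_norm(grads, 0.25)
trainer.step(batch_size=20)
```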
@eric-haibin-lin Thanks