Models: Clarifications/suggestions for models/tutorials/rnn/ptb

Created on 6 Oct 2017 · 27 Comments · Source: tensorflow/models

I have been trying to adapt models/tutorials/rnn/ptb to my needs. Along the way I ran across some questions and suggestions for improvement, so this classifies as a feature request :)

  • Why are we exporting and then importing the metagraph? This needs comments, as it adds complexity to already complex code (hey, it's a tutorial).
  • Since Supervisor is on its way to deprecation (wisely so; we have countless APIs for managed sessions, and it only adds to the learning curve), shouldn't we change the session to a MonitoredTrainingSession? See the sketch after this list.
  • The comment on using a static rnn is badly outdated, as it refers to rnn() in tensorflow_models/tutorials/rnn/rnn.py. How should this be rephrased as of now?
  • Some variables need renaming (max_max_epoch, input_, etc.) to more descriptive names.
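
For the Supervisor point, a minimal sketch of what the switch to MonitoredTrainingSession could look like; the toy loss, step limit and checkpoint path below are placeholders, not the tutorial's actual code:

    import tensorflow as tf

    # Toy model standing in for the tutorial's PTB model.
    x = tf.Variable(5.0)
    loss = tf.square(x)
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)

    # MonitoredTrainingSession covers the init, checkpointing and
    # recovery duties the tutorial currently gets from Supervisor.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir="/tmp/ptb_mts",  # hypothetical save path
            hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)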

I can take this up as a pull request, but I need some feedback on the points above that I can then incorporate into the pull.

Thanks

All 27 comments

And a bug: if run on a machine with no GPUs, the code raises for num_gpus == 1 (the default value). However, if num_gpus is set to 0, it silently zeroes the tuples in ptb.util.import_state_tuples. This can lead to nasty goose chases. Here is a fix:

https://github.com/Utumno/models/commit/7d9ddc794006c7823cf59bce3f66da36bb269616
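
The core of such a guard, in sketch form (not the linked commit verbatim; FLAGS as in the tutorial, error wording mine):

    from tensorflow.python.client import device_lib

    # Fail loudly when more GPUs are requested than are available,
    # instead of silently zeroing the imported state tuples.
    gpus = [d for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]
    if FLAGS.num_gpus > len(gpus):
        raise ValueError("Requested --num_gpus=%d but only %d GPU(s) found"
                         % (FLAGS.num_gpus, len(gpus)))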

I switched to argparse as is done in cifar and corrected some minor control flow issues.

I would make data_path a required positional argument; should I? Also, we force the BASIC config if num_gpus != 1; is this correct?

@nealwu it sounds like @Utumno would like to help; can you give him any guidance about this code?

@lukaszkaiser should know better about this model. What are your thoughts on the fixes above Lukasz?

It looks to me like this model should go the same way as the seq2seq tutorial and become either deprecated or replaced by a more modern tutorial. I don't know of anyone working on that now, and the above points are a great summary of what a more modern tutorial would entail :). So if @Utumno wants to do it, I'd say sure, let's improve it! While on it, I'd also suggest replacing static_rnn with dynamic_rnn and using tf.layers.dense for the final softmax instead of the nasty reshapes. What do you think, @Utumno?
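
A minimal sketch of those two changes, with illustrative sizes rather than the tutorial's configs:

    import tensorflow as tf

    batch_size, num_steps, hidden_size, vocab_size = 20, 35, 650, 10000
    inputs = tf.placeholder(tf.float32, [batch_size, num_steps, hidden_size])

    cell = tf.nn.rnn_cell.MultiRNNCell(
        [tf.nn.rnn_cell.BasicLSTMCell(hidden_size) for _ in range(2)])
    initial_state = cell.zero_state(batch_size, tf.float32)

    # dynamic_rnn consumes the whole [batch, time, depth] tensor, so no
    # tf.unstack into a Python list of per-step tensors is needed.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs, initial_state=initial_state)

    # tf.layers.dense acts on the last axis of a 3-D tensor, so the
    # reshape to [batch * time, hidden] and back disappears as well.
    logits = tf.layers.dense(outputs, vocab_size)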

Good points @lukaszkaiser - will come back to that ASAP. Meanwhile, pull https://github.com/tensorflow/models/pull/2403 must be merged for this to even run on Python 3. Re: importing and exporting the metagraph, this seems to be related to the CUDNN cell, introduced in c705568b5b64c2af6222017ab1b153647fe65c83; unfortunately there is not a single comment as to what this procedure serves.
Since this is a tutorial it should be easily extendible, and that is not the case with the different graphs created. Moreover, tutorials are for people new to TF and should focus on demonstrating particular uses of the API, not all of the API at once; that's too steep a learning curve.

Comments are also needed on which TensorFlow version supports which rnn_mode; TensorFlow 1.1 is not supported anyway due to the reuse parameter, see https://github.com/tensorflow/tensorflow/issues/8191#issuecomment-285092629. I would suggest we go all the way and only support 1.3+, with a nice check when parsing the main args that raises for lower versions.
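
A sketch of such a check (where exactly it runs during arg parsing is up to the PR):

    import tensorflow as tf
    from distutils.version import LooseVersion

    # Fail fast on unsupported TensorFlow versions instead of dying
    # later with an obscure 'reuse' error.
    if LooseVersion(tf.__version__) < LooseVersion("1.3.0"):
        raise ValueError("This tutorial requires TensorFlow 1.3.0 or newer; "
                         "found %s" % tf.__version__)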

For anybody working with this code, please note that the current version no longer achieves the perplexities that earlier versions did.

The comments at the head of the script suggest that the "medium" model should achieve a validation perplexity of about 86 and a test perplexity of about 82. This was true in earlier versions, but after two recent large changes, (1) the addition of the block and cudnn LSTM cell types and (2) the option to auto-parallelize via Grappler, the medium model now only achieves a validation perplexity of about 92 and a test perplexity of about 88.

I've run the code a few times and the differences can't be explained by variations in the initial random parameters.

This is disappointing because it is no longer a robust TF implementation of a standard neural language modelling benchmark.

@qtdaniel @Utumno Is there any other implementation that can serve as a reliable baseline?

Edit: the above comment is correct. Testing both the BLOCK and BASIC versions yields perplexities worse by 3-4 points on both the validation and test sets for the medium configuration.

@bignamehyp Could you take a look at @qtdaniel's comment above? Looks like there are issues with the PTB model's perplexity.

I realised that all the trials I had performed were using 2 GPUs. I've tried again with only a single GPU and the results are not quite as bad. However, I still think some changes have been made that impair the model's capability compared to earlier versions of this code.

Using the most recent code with settings that most closely resemble those used in earlier versions of this script (i.e. basic LSTM cell trained on a single GPU) the validation perplexity is 89.7 and test perplexity is 85.7. Using older versions of the script the validation perplexity is 87.6 and test perplexity is 84.0.

All trials used TensorFlow 1.3.0 compiled from source running on Ubuntu Linux 14.04 with Anaconda Python 2.7.14. When one GPU was used it was an NVIDIA GeForce GTX 970 and when two GPUs were used they were the same GTX 970 plus an NVIDIA GeForce GTX Titan X.

Here are some results using the RNN LM code at commit 5025711 (the latest version at the time of writing), i.e. https://github.com/tensorflow/models/tree/50257111493008ea8daaff9e5e9fe48213f7a5ab/tutorials/rnn/ptb

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/ --rnn_mode basic

Epoch: 39 Train Perplexity: 53.191
Epoch: 39 Valid Perplexity: 89.658
Test Perplexity: 85.718

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/ --rnn_mode block

Epoch: 39 Train Perplexity: 45.668
Epoch: 39 Valid Perplexity: 87.864
Test Perplexity: 84.120

python ptb_word_lm.py --model medium --data_path ../simple-examples/data --save_path `pwd`/ --rnn_mode basic --num_gpus 2

Epoch: 39 Train Perplexity: 49.052
Epoch: 39 Valid Perplexity: 92.500
Test Perplexity: 88.109

Using the code at commit 983b7d0, i.e. https://github.com/tensorflow/models/tree/983b7d08b6e98c60c4016ac9d4b647ea7935928d/tutorials/rnn/ptb

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/

Epoch: 39 Train Perplexity: 45.764
Epoch: 39 Valid Perplexity: 87.693
Test Perplexity: 83.965

Using the code at commit f6e23e5, i.e. https://github.com/tensorflow/models/tree/f6e23e5618ef18625966c6668f1f90dca25dbc56/tutorials/rnn/ptb

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/

Epoch: 39 Train Perplexity: 45.761
Epoch: 39 Valid Perplexity: 87.559
Test Perplexity: 83.956

I don't know why tf.train.Supervisor is being used in this repo, but it should not be used because it is deprecated. I opened an issue thread.

Hi @donghwicha, what do you mean? I don't see any indication that it's deprecated here: https://www.tensorflow.org/api_docs/python/tf/train/Supervisor

@nealwu This was news to me too, but the docs just haven't been updated yet. Here are a couple of pieces of evidence:

@nealwu Please update all of models if you are the one managing this model.

We will not update research/, but we will make sure to update official/.

We may update this model or replace it with a different one in official/.

Hi @qtdaniel and @Utumno, could you try pulling the model and running it again now to check the perplexity? I believe the regressions that occurred earlier should be fixed by this commit: https://github.com/tensorflow/models/commit/25a16a2940b952d3f899abc55d107a5106d7790c

I had already fixed that here: https://github.com/Utumno/models/commit/34e9041f196b412a1b0930ee74580786a6de8dae which is part of https://github.com/tensorflow/models/pull/2524

I still can't run the perplexity test, as I don't have access to GPUs at the moment.

Yes; unfortunately we had a regression on that bug as a result of some recent changes.

I can confirm that this problem has been resolved.

Using the RNN tutorial code as of commit 77cab72 I get the following results using the default BLOCK RNN mode:

Epoch: 13 Train Perplexity: 40.726
Epoch: 13 Valid Perplexity: 119.508
Test Perplexity: 113.303

And using the BASIC RNN mode I get something very similar (just trains a bit slower):

Epoch: 13 Train Perplexity: 40.518
Epoch: 13 Valid Perplexity: 119.574
Test Perplexity: 114.397

It fails in CUDNN RNN mode but this appears to be a known problem when using TensorFlow 1.4: https://github.com/tensorflow/models/issues/2709

Great! I got perplexity improvements on my end as well, including for the medium and large models.

The CudnnLSTM version of the PTB tutorial still does not work. There are a few problems with it:
1) The PTB code uses the cudnn_rnn_ops.CudnnLSTM API, but TensorFlow 1.5 provides tf.contrib.cudnn_rnn.CudnnLSTM() as the default API, so I updated the related code for TF 1.5:

    self._cell = tf.contrib.cudnn_rnn.CudnnLSTM(
        num_layers=config.num_layers,
        num_units=config.hidden_size,
        # input_size is no longer a constructor argument in TF 1.5
        dropout=1 - config.keep_prob if is_training else 0)

2) The tf.contrib.cudnn_rnn.CudnnLSTM object does not have the params_size() API.

3) I think the cudnn_rnn_ops.CudnnLSTMSaveable() API is not needed, because the following call to self._cell.build([config.num_layers, self.batch_size, config.hidden_size]) or to __call__() will do this implicitly.

So I just commented out the code mentioned in 2) and 3), and added the build() and __call__() calls as follows:

    self._cell.build([config.num_layers, self.batch_size, config.hidden_size])

    outputs, output_state = self._cell(
        inputs, initial_state=self._initial_state[0], training=is_training)

But I encounter another error when calling tf.train.import_meta_graph(metagraph):
"The name 'Model/cudnn_lstm/opaque_kernel_saveable' refers to an Operation not in the graph."

Are there any tips about this?

I face a similar problem:
The name 'classifier/main/encoding/fw_0/cudnn_gru/opaque_kernel_saveable' refers to an Operation not in the graph.

@xingjinglu One workaround is to use a frozen graph.
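
A rough sketch of that workaround, assuming a hypothetical build_model() that rebuilds the original graph and "pred_y" as the output node name: restore the checkpoint into the freshly built graph, fold the variables into constants, and let later consumers load the resulting GraphDef via tf.import_graph_def instead of import_meta_graph. Whether the Cudnn opaque kernel freezes cleanly will depend on the graph; this is unverified.

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        build_model()  # hypothetical: your original graph-construction code
        saver = tf.train.Saver()
        with tf.Session() as sess:
            saver.restore(sess, tf.train.latest_checkpoint("checkpoints/"))
            # Bake variables into constants; a frozen GraphDef carries
            # no saveables for import_meta_graph to trip over.
            frozen = tf.graph_util.convert_variables_to_constants(
                sess, graph.as_graph_def(), ["pred_y"])

    with tf.gfile.GFile("frozen_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())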

KeyError: "The name 'QueryLSTM/cudnn_lstm/opaque_kernel_saveable' refers to an Operation not in the graph."

@xingjinglu @qiaohaijun @chenghuige
Hi guys,
I'm using CudnnLSTM now, and I hit exactly the same problem as yours. If I follow this link, I can manage to restore the checkpoint, but I can't use saver = tf.train.import_meta_graph("{}.meta".format(checkpoint)) to restore operations like placeholders.

Besides, I don't have to restore to CPU, but I didn't find a way to restore to GPU yet. Any ideas on how to restore the CudnnLSTM checkpoint to GPU while keeping tf.train.import_meta_graph() working?

TensorFlow version: 1.10.1

Thank You!
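
For what it's worth, the commonly cited CPU-side recovery (and presumably what the link above describes) is to rebuild the network with tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell, which computes the same math as CudnnLSTM, and restore the checkpoint into that graph rather than importing the metagraph. A minimal, unverified sketch with illustrative sizes:

    import tensorflow as tf

    num_layers, num_units = 2, 36  # must match the trained model
    with tf.Graph().as_default(), tf.device("/cpu:0"):
        cells = [tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(num_units)
                 for _ in range(num_layers)]
        cell = tf.nn.rnn_cell.MultiRNNCell(cells)
        # Rebuild the rest of the model exactly as in training, with
        # matching variable scopes, then restore with tf.train.Saver();
        # tf.train.import_meta_graph is not used on this path.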

Like @xingjinglu and @SysuJayce, I'm struggling with the same problem. I built my network using:

cudnn_lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=2, num_units=36, dtype=tf.float32)

I saved it to a checkpoint file. After that, using the code below, I'm trying to restore tensors like the input placeholder or accuracy.

    graph = tf.Graph()
    with graph.as_default(), tf.device('/cpu:0'):
        with tf.Session() as sess:
            saver = tf.train.import_meta_graph('checkpoints_gpu/lstm.ckpt.meta')
            saver.restore(sess, tf.train.latest_checkpoint('checkpoints/.'))

            # Look the tensors up in the same graph the metagraph was imported into.
            input = graph.get_tensor_by_name('inputs:0')
            output = graph.get_tensor_by_name('labels:0')
            pred = graph.get_tensor_by_name('pred_y:0')
            accuracy = graph.get_tensor_by_name('accuracy:0')
            keep_ = graph.get_tensor_by_name('keep:0')

But I get this error: "The name 'cudnn_lstm/opaque_kernel_saveable' refers to an Operation not in the graph."
Is there any solution to this problem?

@aslihn
Were you able to solve this issue?
I am also facing the same issue.

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

