Models: Clarifications/suggestions for models/tutorials/rnn/ptb

Created on 6 Oct 2017 · 27 Comments · Source: tensorflow/models

I have been trying to adapt models/tutorials/rnn/ptb to my needs. Along the way I ran across some questions and suggestions for improvement, so this classifies as a feature request :)

  • Why are we exporting and then importing the metagraph? This needs comments, as it adds complexity to already complex code (hey, it's a tutorial).
  • Since Supervisor is on its way to deprecation (wisely so; we have countless APIs for managed sessions, and it only adds to the learning curve), shouldn't we change the session to a MonitoredTrainingSession? See the sketch after this list.
  • The comment on using a static rnn is badly outdated, as it refers to rnn() in tensorflow_models/tutorials/rnn/rnn.py. How should this be rephrased as of now?
  • Some variables need renaming (max_max_epoch, input_, etc.) to more descriptive names.
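
For the Supervisor point, a minimal sketch of what the switch to MonitoredTrainingSession could look like; the toy loss, step limit and checkpoint path below are placeholders, not the tutorial's actual code:

    import tensorflow as tf

    # Toy model standing in for the tutorial's PTB model.
    x = tf.Variable(5.0)
    loss = tf.square(x)
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)

    # MonitoredTrainingSession covers the init, checkpointing and
    # recovery duties the tutorial currently gets from Supervisor.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir="/tmp/ptb_mts",  # hypothetical save path
            hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)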

I can take this up as a pull request, but I need some feedback on the points above that I can then incorporate into the pull.

Thanks

All 27 comments

And a bug: if run on a machine with no GPUs, the code raises for num_gpus == 1 (the default value). However, if num_gpus is set to 0, it silently zeroes the tuples in ptb.util.import_state_tuples. This can lead to nasty goose chases. Here is a fix:

https://github.com/Utumno/models/commit/7d9ddc794006c7823cf59bce3f66da36bb269616
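
The core of such a guard, in sketch form (not the linked commit verbatim; FLAGS as in the tutorial, error wording mine):

    from tensorflow.python.client import device_lib

    # Fail loudly when more GPUs are requested than are available,
    # instead of silently zeroing the imported state tuples.
    gpus = [d for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]
    if FLAGS.num_gpus > len(gpus):
        raise ValueError("Requested --num_gpus=%d but only %d GPU(s) found"
                         % (FLAGS.num_gpus, len(gpus)))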

I switched to argparse as is done in cifar and corrected some minor control flow issues.

I would make data_path a required positional argument; should I? Also, we force the BASIC config if num_gpus != 1; is this correct?

@nealwu it sounds like @Utumno would like to help; can you give him any guidance about this code?

@lukaszkaiser should know better about this model. What are your thoughts on the fixes above Lukasz?

It looks to me like this model should go the same way as the seq2seq tutorial and become either deprecated or replaced by a more modern tutorial. I don't know of anyone working on that now, and the above points are a great summary of what a more modern tutorial would entail :). So if @Utumno wants to do it, I'd say sure, let's improve it! While on it, I'd also suggest replacing static_rnn with dynamic_rnn and using tf.layers.dense for the final softmax instead of the nasty reshapes. What do you think, @Utumno?
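
A minimal sketch of those two changes, with illustrative sizes rather than the tutorial's configs:

    import tensorflow as tf

    batch_size, num_steps, hidden_size, vocab_size = 20, 35, 650, 10000
    inputs = tf.placeholder(tf.float32, [batch_size, num_steps, hidden_size])

    cell = tf.nn.rnn_cell.MultiRNNCell(
        [tf.nn.rnn_cell.BasicLSTMCell(hidden_size) for _ in range(2)])
    initial_state = cell.zero_state(batch_size, tf.float32)

    # dynamic_rnn consumes the whole [batch, time, depth] tensor, so no
    # tf.unstack into a Python list of per-step tensors is needed.
    outputs, final_state = tf.nn.dynamic_rnn(
        cell, inputs, initial_state=initial_state)

    # tf.layers.dense acts on the last axis of a 3-D tensor, so the
    # reshape to [batch * time, hidden] and back disappears as well.
    logits = tf.layers.dense(outputs, vocab_size)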

Good points @lukaszkaiser - will come back to that ASAP. Meanwhile, pull https://github.com/tensorflow/models/pull/2403 must be merged for this to even run on Python 3. Re: importing and exporting the metagraph, this seems to be related to the CUDNN cell, introduced in c705568b5b64c2af6222017ab1b153647fe65c83; unfortunately there is not a single comment as to what this procedure serves.
Since this is a tutorial it should be easily extendible, and that is not the case with the different graphs created. Moreover, tutorials are for people new to TF and should focus on demonstrating particular uses of the API, not all of the API at once; that's too steep a learning curve.

Comments are also needed on which TensorFlow version supports which rnn_mode; TensorFlow 1.1 is not supported anyway due to the reuse parameter, see https://github.com/tensorflow/tensorflow/issues/8191#issuecomment-285092629. I would suggest we go all the way and only support 1.3+, with a nice check when parsing the main args that raises for lower versions.
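
A sketch of such a check (where exactly it runs during arg parsing is up to the PR):

    import tensorflow as tf
    from distutils.version import LooseVersion

    # Fail fast on unsupported TensorFlow versions instead of dying
    # later with an obscure 'reuse' error.
    if LooseVersion(tf.__version__) < LooseVersion("1.3.0"):
        raise ValueError("This tutorial requires TensorFlow 1.3.0 or newer; "
                         "found %s" % tf.__version__)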

For anybody working with this code, please note that the current version no longer achieves the perplexities that earlier versions did.

The comments at the head of the script suggest that the "medium" model should achieve a validation perplexity of about 86 and a test perplexity of about 82. This was true in earlier versions, but after two recent large changes, (1) the addition of the block and cudnn LSTM cell types and (2) the option to auto-parallelize via Grappler, the medium model now only achieves a validation perplexity of about 92 and a test perplexity of about 88.

I've run the code a few times and the differences can't be explained by variations in the initial random parameters.

This is disappointing because it is no longer a robust TF implementation of a standard neural language modelling benchmark.

@qtdaniel @Utumno Is there any other implementation that can serve as a reliable baseline?

Edit: the above comment is correct. Testing both the BLOCK and BASIC versions yields perplexities worse by 3-4 points on both the validation and test sets for the medium configuration.

@bignamehyp Could you take a look at @qtdaniel's comment above? Looks like there are issues with the PTB model's perplexity.

I realised that all the trials I had performed were using 2 GPUs. I've tried again with only a single GPU and the results are not quite as bad. However, I still think some changes have been made that impair the model's capability compared to earlier versions of this code.

Using the most recent code with settings that most closely resemble those used in earlier versions of this script (i.e. basic LSTM cell trained on a single GPU) the validation perplexity is 89.7 and test perplexity is 85.7. Using older versions of the script the validation perplexity is 87.6 and test perplexity is 84.0.

All trials used TensorFlow 1.3.0 compiled from source running on Ubuntu Linux 14.04 with Anaconda Python 2.7.14. When one GPU was used it was an NVIDIA GeForce GTX 970 and when two GPUs were used they were the same GTX 970 plus an NVIDIA GeForce GTX Titan X.

Here are some results using the RNN LM code at commit 5025711 (the latest version at the time of writing), i.e. https://github.com/tensorflow/models/tree/50257111493008ea8daaff9e5e9fe48213f7a5ab/tutorials/rnn/ptb

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/ --rnn_mode basic

Epoch: 39 Train Perplexity: 53.191
Epoch: 39 Valid Perplexity: 89.658
Test Perplexity: 85.718

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/ --rnn_mode block

Epoch: 39 Train Perplexity: 45.668
Epoch: 39 Valid Perplexity: 87.864
Test Perplexity: 84.120

python ptb_word_lm.py --model medium --data_path ../simple-examples/data --save_path `pwd`/ --rnn_mode basic --num_gpus 2

Epoch: 39 Train Perplexity: 49.052
Epoch: 39 Valid Perplexity: 92.500
Test Perplexity: 88.109

Using the code at commit 983b7d0, i.e. https://github.com/tensorflow/models/tree/983b7d08b6e98c60c4016ac9d4b647ea7935928d/tutorials/rnn/ptb

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/

Epoch: 39 Train Perplexity: 45.764
Epoch: 39 Valid Perplexity: 87.693
Test Perplexity: 83.965

Using the code at commit f6e23e5, i.e. https://github.com/tensorflow/models/tree/f6e23e5618ef18625966c6668f1f90dca25dbc56/tutorials/rnn/ptb

CUDA_VISIBLE_DEVICES=1 python ptb_word_lm.py --model medium --data_path ../simple-examples/data/ --save_path `pwd`/

Epoch: 39 Train Perplexity: 45.761
Epoch: 39 Valid Perplexity: 87.559
Test Perplexity: 83.956

I don't know why tf.train.Supervisor is being used in this repo, but it should not be used because it is deprecated. I opened an issue thread.

Hi @donghwicha, what do you mean? I don't see any indication that it's deprecated here: https://www.tensorflow.org/api_docs/python/tf/train/Supervisor

@nealwu This was news to me too, but the docs just haven't been updated yet. Here are a couple of pieces of evidence:

@nealwu Please update all of models if you are the one managing this model.

We will not update research/, but we will make sure to update official/.

We may update this model or replace it with a different one in official/.

Hi @qtdaniel and @Utumno, could you try pulling the model and running it again now to check the perplexity? I believe the regressions that occurred earlier should be fixed by this commit: https://github.com/tensorflow/models/commit/25a16a2940b952d3f899abc55d107a5106d7790c

I had already fixed that here: https://github.com/Utumno/models/commit/34e9041f196b412a1b0930ee74580786a6de8dae which is part of https://github.com/tensorflow/models/pull/2524

I still can't run the perplexity test, as I don't have access to GPUs at the moment.

Yes; unfortunately we had a regression on that bug as a result of some recent changes.

I can confirm that this problem has been resolved.

Using the RNN tutorial code as of commit 77cab72 I get the following results using the default BLOCK RNN mode:

Epoch: 13 Train Perplexity: 40.726
Epoch: 13 Valid Perplexity: 119.508
Test Perplexity: 113.303

And using the BASIC RNN mode I get something very similar (just trains a bit slower):

Epoch: 13 Train Perplexity: 40.518
Epoch: 13 Valid Perplexity: 119.574
Test Perplexity: 114.397

It fails in CUDNN RNN mode but this appears to be a known problem when using TensorFlow 1.4: https://github.com/tensorflow/models/issues/2709

Great! I got perplexity improvements on my end as well, including for the medium and large models.

The CudnnLSTM version of the PTB tutorial still does not work. There are a few problems with it:
1) The PTB code uses the cudnn_rnn_ops.CudnnLSTM API, but TensorFlow 1.5 provides tf.contrib.cudnn_rnn.CudnnLSTM() as the default API, so I updated the related code for TF 1.5:

    self._cell = tf.contrib.cudnn_rnn.CudnnLSTM(
        num_layers=config.num_layers,
        num_units=config.hidden_size,
        # input_size is no longer a constructor argument in TF 1.5
        dropout=1 - config.keep_prob if is_training else 0)

2) The tf.contrib.cudnn_rnn.CudnnLSTM object does not have the params_size() API.

3) I think the cudnn_rnn_ops.CudnnLSTMSaveable() API is not needed, because the following call to self._cell.build([config.num_layers, self.batch_size, config.hidden_size]) or to __call__() will do this implicitly.

So I just commented out the code mentioned in 2) and 3), and added the build() and __call__() calls as follows:

    self._cell.build([config.num_layers, self.batch_size, config.hidden_size])

    outputs, output_state = self._cell(
        inputs, initial_state=self._initial_state[0], training=is_training)

But I encounter another error when calling tf.train.import_meta_graph(metagraph):
"The name 'Model/cudnn_lstm/opaque_kernel_saveable' refers to an Operation not in the graph."

Are there any tips about this?

I face a similar problem:
The name 'classifier/main/encoding/fw_0/cudnn_gru/opaque_kernel_saveable' refers to an Operation not in the graph.

@xingjinglu One workaround is to use a frozen graph.
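
A rough sketch of that workaround, assuming a hypothetical build_model() that rebuilds the original graph and "pred_y" as the output node name: restore the checkpoint into the freshly built graph, fold the variables into constants, and let later consumers load the resulting GraphDef via tf.import_graph_def instead of import_meta_graph. Whether the Cudnn opaque kernel freezes cleanly will depend on the graph; this is unverified.

    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        build_model()  # hypothetical: your original graph-construction code
        saver = tf.train.Saver()
        with tf.Session() as sess:
            saver.restore(sess, tf.train.latest_checkpoint("checkpoints/"))
            # Bake variables into constants; a frozen GraphDef carries
            # no saveables for import_meta_graph to trip over.
            frozen = tf.graph_util.convert_variables_to_constants(
                sess, graph.as_graph_def(), ["pred_y"])

    with tf.gfile.GFile("frozen_model.pb", "wb") as f:
        f.write(frozen.SerializeToString())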

KeyError: "The name 'QueryLSTM/cudnn_lstm/opaque_kernel_saveable' refers to an Operation not in the graph."

@xingjinglu @qiaohaijun @chenghuige
Hi guys,
I'm using CudnnLSTM now, and I hit exactly the same problem as yours. If I follow this link, I can manage to restore the checkpoint, but I can't use saver = tf.train.import_meta_graph("{}.meta".format(checkpoint)) to restore operations like placeholders.

Besides, I don't have to restore to CPU, but I didn't find a way to restore to GPU yet. Any ideas on how to restore the CudnnLSTM checkpoint to GPU while keeping tf.train.import_meta_graph() working?

TensorFlow version: 1.10.1

Thank You!
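
For what it's worth, the commonly cited CPU-side recovery (and presumably what the link above describes) is to rebuild the network with tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell, which computes the same math as CudnnLSTM, and restore the checkpoint into that graph rather than importing the metagraph. A minimal, unverified sketch with illustrative sizes:

    import tensorflow as tf

    num_layers, num_units = 2, 36  # must match the trained model
    with tf.Graph().as_default(), tf.device("/cpu:0"):
        cells = [tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(num_units)
                 for _ in range(num_layers)]
        cell = tf.nn.rnn_cell.MultiRNNCell(cells)
        # Rebuild the rest of the model exactly as in training, with
        # matching variable scopes, then restore with tf.train.Saver();
        # tf.train.import_meta_graph is not used on this path.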

Like @xingjinglu and @SysuJayce, I'm struggling with the same problem. I built my network using:

cudnn_lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=2, num_units=36, dtype=tf.float32)

I saved it to a checkpoint file. After that, using the code below, I'm trying to restore tensors like the input placeholder or accuracy.

    graph = tf.Graph()
    with graph.as_default(), tf.device('/cpu:0'):
        with tf.Session() as sess:
            saver = tf.train.import_meta_graph('checkpoints_gpu/lstm.ckpt.meta')
            saver.restore(sess, tf.train.latest_checkpoint('checkpoints/.'))

            # Look the tensors up in the same graph the metagraph was imported into.
            input = graph.get_tensor_by_name('inputs:0')
            output = graph.get_tensor_by_name('labels:0')
            pred = graph.get_tensor_by_name('pred_y:0')
            accuracy = graph.get_tensor_by_name('accuracy:0')
            keep_ = graph.get_tensor_by_name('keep:0')

But I get this error: "The name 'cudnn_lstm/opaque_kernel_saveable' refers to an Operation not in the graph."
Is there any solution to this problem?

@aslihn
Were you able to solve this issue?
I am also facing the same issue.

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.

