models/video_prediction @cbfinn
Thank you for generously sharing the code! I have three questions about the released code:
Are the hyperparameters used in the paper the same as the default options in prediction_train.py? In particular, the number of training steps.
For the most part, yes. There are a few differences.
I observed a strange val_loss trend line, so I wonder if I made a mistake.
That curve is about what I would expect. It looks strange because of scheduled sampling, a curriculum which stochastically passes in ground-truth frames at some timesteps during the beginning of training. The curriculum ends around 12k steps (see citation [2] in the paper for details). To turn off scheduled sampling, you can set --schedsamp_k=-1.
Alternatively, you could change the code to set schedsamp_k=-1 for the validation model, regardless of what's used for the training model. This might be nice.
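A rough sketch of that change, assuming the Model constructor is extended to accept a schedsamp_k override (in the released script the value is read from FLAGS, so the exact plumbing below is hypothetical):

```python
# Hypothetical sketch only: Model, FLAGS, and the image/action/state tensors
# come from prediction_train.py; the schedsamp_k constructor argument is an
# assumed extension, not part of the released code.
train_model = Model(train_images, train_actions, train_states,
                    schedsamp_k=FLAGS.schedsamp_k)

# Force -1 for the validation model so val_loss always reflects the model's
# own predictions, regardless of where the training curriculum currently is.
val_model = Model(val_images, val_actions, val_states,
                  schedsamp_k=-1)
```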
can you share some figures on the expected performance of the trained model over the val/train sets?
I did this work when I was an intern at Google Brain, and I no longer have access to data/code/training curves that I used for the paper.
is there a plan to also release the evaluation/visualization script for the model?
I'm not planning on doing this in the immediate future, but I would love to have something like this added to the released code. I'd be happy to help review code for this, and potentially add to it. For example, I think that tiling animated gifs is a great way to visualize the model's predictions, as seen here: https://sites.google.com/site/robotprediction/ (scroll down about halfway). I have the code for tiling predictions together and saving them into a gif, which I'd be happy to share.
It would also be really useful to visualize the gifs during training, e.g., in tensorboard (https://github.com/tensorflow/tensorflow/issues/3936)
Thanks for the response @cbfinn.
@falcondai: it seems you got the answers you were seeking.
Closing this out. If you have more concerns, please file a new issue or check with @cbfinn.
@cbfinn Thanks for the clarifications and pointers! I will follow up with more specific issues should they arise.
@cbfinn
Regarding your previous comment, how can I create tiled animated gifs visualizing the model's predictions? I have tried to analyze and modify the input and training files, but I couldn't get it working. Could I get some help with that?
Here's an example script that loads images from the pushing dataset and exports them to gifs using the moviepy package (though it does not tile them).
grab_train_images.py.zip
It is straightforward to use moviepy to stack gifs side-by-side, to form a tiling.
http://zulko.github.io/moviepy/getting_started/compositing.html#stacking-and-concatenating-clips
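For example, a minimal sketch of tiling two predicted sequences side by side with moviepy (the frame arrays here are random placeholders standing in for real predictions):

```python
import numpy as np
from moviepy.editor import ImageSequenceClip, clips_array

# Placeholder data: two sequences of ten 64x64 RGB frames in [0, 255].
seq_a = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
seq_b = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]

clip_a = ImageSequenceClip(seq_a, fps=5)
clip_b = ImageSequenceClip(seq_b, fps=5)

# Stack the two clips into one row of a grid and write a single gif.
tiled = clips_array([[clip_a, clip_b]])
tiled.write_gif("tiled_predictions.gif", fps=5)
```

clips_array takes a nested list of rows, so ground truth and predictions could also go in separate rows of the same grid.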
@cbfinn @falcondai
Thanks for your generous reply! I will work through the rest of the code starting from the included scripts :)
@tegg89 I ended up using imageio for creating GIFs. Its API is pretty straightforward. For an example (IPython notebook): https://gist.github.com/falcondai/1e22919e6ce8d6a8e3dd3da5a6a0ad94
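For reference, a minimal version of that approach (the frames below are placeholders; in practice they would come from gen_images):

```python
import numpy as np
import imageio

# Placeholder frames: ten random 64x64 RGB images standing in for predictions.
frames = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]

# Write the whole sequence to an animated gif in one call; frame timing can be
# tuned with the duration keyword, depending on the imageio version.
imageio.mimsave("prediction.gif", frames)
```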
@cbfinn @falcondai
When I put data through the network for evaluation, the resulting GIF file is not sequential.
I have already disabled the shuffle option in prediction_input.py.
@tegg89 Make sure you are only calling session.run() once for the entire sequence, rather than once for each frame. The script grab_train_images.py shows how to extract a sequence of images in order, with a single sess.run() per sequence.
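To illustrate the difference, a small sketch (model.gen_images and model.iter_num are from the released script; the surrounding setup is assumed):

```python
# Problematic pattern: one sess.run() per frame. Each call advances the input
# queue, so successive "frames" actually come from different batches.
# frames = [sess.run(model.gen_images[t], feed_dict={model.iter_num: -1})
#           for t in range(sequence_length)]

# Correct pattern: a single sess.run() returns the whole predicted sequence
# for one batch, so the frames stay in temporal order.
gen_images = sess.run(model.gen_images, feed_dict={model.iter_num: -1})
```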
@cbfinn @falcondai
Sorry to keep bothering you with questions, but I am still having trouble visualizing the test data.
Referring to grab_train_images.py, I changed the input file so that it returns sequential video frames. However, when I fed this through the network model, gen_images did not come out in sequential order. The modified code is here. The steps I ran through are as follows:
1. Load the data with prediction_input.py (I already checked that the images are returned in sequential order).
2. Build the Model class from prediction_train.py.
3. Run gen_images = sess.run([model.gen_images], feed_dict={model.iter_num: -1}) (the learning_rate term is deleted from feed_dict because it is not used).
4. Convert gen_images to a gif.
The result is not in sequential order. It seems to me that the network model itself is making the frames non-sequential. How did you visualize the model's output for evaluation?
@cbfinn Thanks for your paper and code. And sorry to bother you with a small detail.
train_val_split is 0.95, and I saw that this TF version also uses the same setting by default. The validation PSNR (which I use for evaluation) does not always agree with the test PSNR. I choose the best model by selecting the best validation PSNR during training, but sometimes the PSNR of some periodic checkpoint models is higher than that of the selected best model (by a gap of up to 0.5). Is train_val_split == 0.95 not enough in practice?
@carsonDB The percentage is the same, but the actual videos used for training and validation are different (as they are randomized).
@cbfinn Thanks for your quick reply!