hi @cshallue did you run your training for 3 million global steps? Mine is currently at 300k steps (after two and a half days on a Tesla K40), but I get weird results for the test image you mentioned in the README. What are your thoughts?
mona@pascal:~/computer_vision/mscoco/models/im2txt$ CHECKPOINT_DIR="${HOME}/im2txt/model/train"
mona@pascal:~/computer_vision/mscoco/models/im2txt$ VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"
mona@pascal:~/computer_vision/mscoco/models/im2txt$ IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"
mona@pascal:~/computer_vision/mscoco/models/im2txt$ bazel build -c opt im2txt/run_inference
..
INFO: Found 1 target...
Target //im2txt:run_inference up-to-date:
bazel-bin/im2txt/run_inference
INFO: Elapsed time: 3.577s, Critical Path: 0.02s
mona@pascal:~/computer_vision/mscoco/models/im2txt$ export CUDA_VISIBLE_DEVICES=""
mona@pascal:~/computer_vision/mscoco/models/im2txt$ bazel-bin/im2txt/run_inference \
> --checkpoint_path=${CHECKPOINT_DIR} \
> --vocab_file=${VOCAB_FILE} \
> --input_files=${IMAGE_FILE}
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.48.0
Captions for image COCO_val2014_000000224477.jpg:
0) two people are standing on rocks , holding a frisbee . (p=0.311219)
1) the people are surfing the waves with each other (p=0.222864)
2) two people are standing on the shore next to a body of water . (p=0.091812)

@monajalal this seems normal to me. The captions are not perfect yet, but your model is certainly better than random already. It will continue improving with time. See the README for my suggested training phases and numbers of steps.
@cshallue Thanks for the response. That makes sense. I was mostly confused because the model insists that there are two people in the picture. Regardless, before making any major mistakes due to time constraints, I would like to know whether I should stop the current training, which was started with this command:
# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"
# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"
# Build the model.
bazel build -c opt im2txt/...
# Run the training script.
bazel-bin/im2txt/train \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=false \
--number_of_steps=1000000
or should I wait until it reaches 1,000,000 steps and then continue fine-tuning with this script?
# Restart the training script with --train_inception=true.
bazel-bin/im2txt/train \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=true \
--number_of_steps=3000000 # Additional 2M steps (assuming 1M in initial training).
My evaluation script was working fine, but at some point it stopped showing any output on stdout, so I am not sure what's going on. This is the last item shown on stdout:
INFO:tensorflow:Loading model from checkpoint: /home/mona/im2txt/model/train/model.ckpt-5769
INFO:tensorflow:Successfully loaded model.ckpt-5769 at global step = 5770.
However, the process is still running and has not stopped, so I assume it is working in the background, based on what I see here:
mona@pascal:~/im2txt/model/train$ ls
checkpoint graph.pbtxt model.ckpt-370433.meta model.ckpt-371441.meta model.ckpt-372450.meta model.ckpt-373459.meta model.ckpt-374467.meta
events.out.tfevents.1475286077.pascal model.ckpt-370433 model.ckpt-371441 model.ckpt-372450 model.ckpt-373459 model.ckpt-374467
My other very important question is: how many weeks, exactly, would it take on a Tesla K40 GPU? We have deadlines coming up in a month, so we really need to make some strategic decisions. Should we buy a Tesla K80 or a Tesla P100 to make things faster? (Would the difference really be noticeable?) Do you by any chance have any benchmarking that shows how fast this training phase is across various Tesla GPUs?
Also, is training time approximately linear? For example, given 360k steps in two and a half days, can I extrapolate how many days 2M steps would take?
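For anyone doing the same back-of-the-envelope math: if the steps/sec rate stays roughly constant (an assumption; checkpointing and I/O overhead can shift it), a linear extrapolation from my numbers above looks like this:

```shell
# Linear extrapolation of training time, assuming a constant steps/sec
# rate (the numbers are my own measurements, not a benchmark).
awk 'BEGIN {
  steps_done = 360000   # steps completed so far
  days_done  = 2.5      # wall-clock days for those steps
  steps_goal = 2000000  # remaining fine-tuning steps
  printf "%.1f days\n", days_done * steps_goal / steps_done
}'
# → 13.9 days
```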
Sorry for flooding you with so many questions, but given that you are the expert in this field, sharing the answers will be very valuable to the community.
P.S.: If I have to choose between a Tesla K80 and a Tesla P100, considering that the K80 has ~2x the GPU RAM, would it be better for training with your code? Or should I stick with the Tesla P100?
Thanks,
Mona
@cshallue Thanks a lot for the valuable information. So after the 3M steps are done, if we get a wrong label, or labels with low confidence scores (say, all three below 40%), can I change the distribution model of the TensorFlow im2txt and correct it myself?
Essentially, I would like to know whether there is a distribution map that tells whether a man (70%), a surfboard (80%), or sea (60%) is present in the image at all. I know the image goes through a CNN and is then fed to an RNN with LSTM cells, but I would like to know how a developer can change the im2txt code to improve the training based on new captions the user provides. If you could guide me in that direction, it would be very valuable. I read the Show and Tell paper, but I was wondering which features you use for object detection. For example, do you use GIST or HOG? I saw that the im2text (2011) paper says it uses them.
Also, I am still rather unclear on whether you detect only 89 categories of objects. Do these 89 categories cover all of MSCOCO? I am asking because I only saw objects in MSCOCO, and no actions or scenes, yet both im2text and Show and Tell use scene and action recognition. This is what I am referring to: https://github.com/pdollar/coco/blob/master/PythonAPI/pycocoDemo.ipynb
The reason I asked about the number of categories was to find out whether the model can caption any random object, including things unseen in the training dataset, or whether it is limited to something like <100 categories, scenes, and actions. Here's an example:

Captions for image random2.jpg:
0) a cat is standing on the street between two parked motorcycles (p=0.148223)
1) a bunch of motorcycles in india somewhere parked near some shops . (p=0.089205)
2) a child is riding a turkey over in ramp . (p=0.011156)
Also, while it might sound irrelevant, I wonder if you could share your checkpoints after 3M global steps? For me, the checkpoints for the first 1M global steps total 1.6 GB. In general, would it make sense to use your 3M-step checkpoints, if they are shareable? That would make things faster. I am asking because of precedents like Faster R-CNN, where I could use their pre-trained models.
Also, how can I tell the inference script to use a specific checkpoint? For example, how can I tell it to use model.ckpt-1000000 and ignore the checkpoints created after it?
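Partially answering my own question, here is a sketch of what I would expect to work, assuming `--checkpoint_path` accepts a specific checkpoint file as well as a training directory (I have not verified this; paths follow the setup earlier in this thread):

```shell
# Point --checkpoint_path at one specific checkpoint file instead of the
# training directory, so later checkpoints are ignored (assumption: a
# file path is accepted here, not just a directory).
CHECKPOINT_FILE="${HOME}/im2txt/model/train/model.ckpt-1000000"
bazel-bin/im2txt/run_inference \
  --checkpoint_path=${CHECKPOINT_FILE} \
  --vocab_file=${VOCAB_FILE} \
  --input_files=${IMAGE_FILE}
```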
On a side note, it took me exactly 7 days to run the 1M initial global steps on a Tesla K40 GPU.
Also, I am leaving this here for people who might wonder like me. It seems that after 1M global steps the sentences didn't change much, but the confidence scores did (this is just a guess, so I might be wrong):
mona@pascal:~/computer_vision/mscoco/models/im2txt$ ./test_script.sh
INFO: Found 1 target...
Target //im2txt:run_inference up-to-date:
bazel-bin/im2txt/run_inference
INFO: Elapsed time: 0.103s, Critical Path: 0.00s
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.48.0
Captions for image COCO_val2014_000000224477.jpg:
0) two people are standing on rocks , holding a frisbee . (p=0.473335)
1) two people are wind surfing in the ocean . (p=0.137358)
2) two people are standing on the shore next to a body of water . (p=0.120553)
@cshallue yes that was the result after 1M initial steps.
As I mentioned above, my evaluation script stopped showing output after the first ~5,000 global steps.
I ran it again now, while my fine-tuning script is running, but nothing is saved in the eval directory, and I am not sure how to find the validation perplexity. What's the command for that?
mona@pascal:~/computer_vision/mscoco/models/im2txt$ ./evaluation.sh
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
INFO:tensorflow:Prefetching values from 1 files matching /home/mona/im2txt/data/mscoco/val-?????-of-00004
INFO:tensorflow:Starting evaluation at 2016-10-10-10:52:29
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.48.0
INFO:tensorflow:Loading model from checkpoint: /home/mona/im2txt/model/train/model.ckpt-1122482
INFO:tensorflow:Successfully loaded model.ckpt-1122482 at global step = 1122483.
Nothing is saved in the eval dir here:
mona@pascal:~/im2txt/model/eval$ ls
mona@pascal:~/im2txt/model/eval$
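For reference, my understanding is that perplexity is written as a TensorBoard summary into the eval directory, so once files do appear there I would view it with (assuming TensorBoard is installed and on PATH):

```shell
# Validation perplexity is logged as a TensorBoard summary by the
# evaluation script; point TensorBoard at the eval directory to see it.
tensorboard --logdir="${HOME}/im2txt/model/eval"
```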
In evaluation.sh I have:
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
MODEL_DIR="${HOME}/im2txt/model"
# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""
# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/im2txt/evaluate \
--input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
--checkpoint_dir="${MODEL_DIR}/train" \
--eval_dir="${MODEL_DIR}/eval"
Can you please advise on what might have gone wrong and how to fix it?
It's difficult to know what might have gone wrong in your training / evaluation.
Are you sure your training data is correct? The following line suggests something is wrong with your validation data (there should be 4 files):
INFO:tensorflow:Prefetching values from 1 files matching /home/mona/im2txt/data/mscoco/val-?????-of-00004
What is the output of ls -l ${HOME}/im2txt/data/mscoco?
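A quick sanity check you could also run (a sketch; DATA_DIR is assumed to match the scripts above) to count how many validation shards the evaluator's glob actually matches:

```shell
# Count the validation shards the evaluator's glob will match; the
# preprocessing script should have produced exactly 4 of them.
DATA_DIR="${HOME}/im2txt/data/mscoco"
count=$(ls "$DATA_DIR"/val-?????-of-00004 2>/dev/null | wc -l)
echo "found $count of 4 expected validation shards"
```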
Perhaps you could try using the checkpoint shared in #466 and see if that gives you better results?
Closing this issue for now. Feel free to reopen if you encounter any further issues with the code.