hi @cshallue did you run your training for 3 million global steps? Mine is currently at 300k steps (after two and a half days on a Tesla K40), but I get weird results for the test image you mentioned in the README. What are your thoughts?
mona@pascal:~/computer_vision/mscoco/models/im2txt$ CHECKPOINT_DIR="${HOME}/im2txt/model/train"
mona@pascal:~/computer_vision/mscoco/models/im2txt$ VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"
mona@pascal:~/computer_vision/mscoco/models/im2txt$ IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"
mona@pascal:~/computer_vision/mscoco/models/im2txt$ bazel build -c opt im2txt/run_inference
..
INFO: Found 1 target...
Target //im2txt:run_inference up-to-date:
bazel-bin/im2txt/run_inference
INFO: Elapsed time: 3.577s, Critical Path: 0.02s
mona@pascal:~/computer_vision/mscoco/models/im2txt$ export CUDA_VISIBLE_DEVICES=""
mona@pascal:~/computer_vision/mscoco/models/im2txt$ bazel-bin/im2txt/run_inference \
> --checkpoint_path=${CHECKPOINT_DIR} \
> --vocab_file=${VOCAB_FILE} \
> --input_files=${IMAGE_FILE}
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.48.0
Captions for image COCO_val2014_000000224477.jpg:
0) two people are standing on rocks , holding a frisbee . (p=0.311219)
1) the people are surfing the waves with each other (p=0.222864)
2) two people are standing on the shore next to a body of water . (p=0.091812)

@monajalal this seems normal to me. The captions are not perfect yet, but your model is certainly better than random already. It will continue improving with time. See the README for my suggested training phases and numbers of steps.
@cshallue Thanks for the response. That makes sense. I was mostly confused because the model insists that there are two people in the picture. Regardless, before making any major mistakes due to time constraints, I would like to know whether I should stop the current training, which was started with this command:
# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"
# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"
# Build the model.
bazel build -c opt im2txt/...
# Run the training script.
bazel-bin/im2txt/train \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=false \
--number_of_steps=1000000
or should I wait until it reaches 1,000,000 steps and then continue fine-tuning with this script?
# Restart the training script with --train_inception=true.
bazel-bin/im2txt/train \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=true \
--number_of_steps=3000000 # Additional 2M steps (assuming 1M in initial training).
My evaluation script was working fine, but at some point it stopped showing any output on stdout, so I am not sure what's going on. This is the last item shown on stdout:
INFO:tensorflow:Loading model from checkpoint: /home/mona/im2txt/model/train/model.ckpt-5769
INFO:tensorflow:Successfully loaded model.ckpt-5769 at global step = 5770.
However, the process is still running and has not stopped, so I assume it is working in the background, based on what I see here:
mona@pascal:~/im2txt/model/train$ ls
checkpoint graph.pbtxt model.ckpt-370433.meta model.ckpt-371441.meta model.ckpt-372450.meta model.ckpt-373459.meta model.ckpt-374467.meta
events.out.tfevents.1475286077.pascal model.ckpt-370433 model.ckpt-371441 model.ckpt-372450 model.ckpt-373459 model.ckpt-374467
My other very important question is: how many weeks, exactly, would it take on a Tesla K40 GPU? We have deadlines coming up in a month, so we really need to make some strategic decisions. Should we buy a Tesla K80 or a Tesla P100 to make things faster? (Would the difference really be noticeable?) Do you by any chance have any benchmarking that shows how fast this training phase is across various Tesla GPUs?
Also, is training time approximately linear? For example, given 360k steps in two and a half days, can I extrapolate how many days 2M steps would take?
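For anyone doing the same back-of-the-envelope math: if the steps/sec rate stays roughly constant (an assumption; checkpointing and I/O overhead can shift it), a linear extrapolation from my numbers above looks like this:

```shell
# Linear extrapolation of training time, assuming a constant steps/sec
# rate (the numbers are my own measurements, not a benchmark).
awk 'BEGIN {
  steps_done = 360000   # steps completed so far
  days_done  = 2.5      # wall-clock days for those steps
  steps_goal = 2000000  # remaining fine-tuning steps
  printf "%.1f days\n", days_done * steps_goal / steps_done
}'
# → 13.9 days
```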
Sorry for flooding you with so many questions, but given that you are the expert in this field, sharing the answers will be very valuable to the community.
P.S.: If I have to choose between a Tesla K80 and a Tesla P100, considering that the K80 has ~2x the GPU RAM, would it be better for training with your code? Or should I stick with the Tesla P100?
Thanks,
Mona
@cshallue Thanks a lot for the valuable information. So after the 3M steps are done, if we get a wrong label, or labels with low confidence scores (say, all three below 40%), can I change the distribution model of the TensorFlow im2txt and correct it myself?
Essentially, I would like to know whether there is a distribution map that tells whether a man (70%), a surfboard (80%), or sea (60%) is present in the image at all. I know the image goes through a CNN and is then fed to an RNN with LSTM cells, but I would like to know how a developer can change the im2txt code to improve the training based on new captions the user provides. If you could guide me in that direction, it would be very valuable. I read the Show and Tell paper, but I was wondering which features you use for object detection. For example, do you use GIST or HOG? I saw that the im2text (2011) paper says it uses them.
Also, I am still rather unclear on whether you detect only 89 categories of objects. Do these 89 categories cover all of MSCOCO? I am asking because I only saw objects in MSCOCO, and no actions or scenes, yet both im2text and Show and Tell use scene and action recognition. This is what I am referring to: https://github.com/pdollar/coco/blob/master/PythonAPI/pycocoDemo.ipynb
The reason I asked about the number of categories was to find out whether the model can caption any random object, including things unseen in the training dataset, or whether it is limited to something like <100 categories, scenes, and actions. Here's an example:

Captions for image random2.jpg:
0) a cat is standing on the street between two parked motorcycles (p=0.148223)
1) a bunch of motorcycles in india somewhere parked near some shops . (p=0.089205)
2) a child is riding a turkey over in ramp . (p=0.011156)
Also, while it might sound irrelevant, I wonder if you could share your checkpoints after 3M global steps? For me, the checkpoints for the first 1M global steps total 1.6 GB. In general, would it make sense to use your 3M-step checkpoints, if they are shareable? That would make things faster. I am asking because of precedents like Faster R-CNN, where I could use their pre-trained models.
Also, how can I tell the inference script to use a specific checkpoint? For example, how can I tell it to use model.ckpt-1000000 and ignore the checkpoints created after it?
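Partially answering my own question, here is a sketch of what I would expect to work, assuming `--checkpoint_path` accepts a specific checkpoint file as well as a training directory (I have not verified this; paths follow the setup earlier in this thread):

```shell
# Point --checkpoint_path at one specific checkpoint file instead of the
# training directory, so later checkpoints are ignored (assumption: a
# file path is accepted here, not just a directory).
CHECKPOINT_FILE="${HOME}/im2txt/model/train/model.ckpt-1000000"
bazel-bin/im2txt/run_inference \
  --checkpoint_path=${CHECKPOINT_FILE} \
  --vocab_file=${VOCAB_FILE} \
  --input_files=${IMAGE_FILE}
```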
On a side note, it took me exactly 7 days to run the 1M initial global steps on a Tesla K40 GPU.
Also, I am leaving this here for people who might wonder like me. It seems that after 1M global steps the sentences didn't change much, but the confidence scores did (this is just a guess, so I might be wrong):
mona@pascal:~/computer_vision/mscoco/models/im2txt$ ./test_script.sh
INFO: Found 1 target...
Target //im2txt:run_inference up-to-date:
bazel-bin/im2txt/run_inference
INFO: Elapsed time: 0.103s, Critical Path: 0.00s
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.48.0
Captions for image COCO_val2014_000000224477.jpg:
0) two people are standing on rocks , holding a frisbee . (p=0.473335)
1) two people are wind surfing in the ocean . (p=0.137358)
2) two people are standing on the shore next to a body of water . (p=0.120553)
@cshallue yes that was the result after 1M initial steps.
As I mentioned above, my evaluation script stopped showing output after the first ~5,000 global steps.
I ran it again now, while my fine-tuning script is running, but nothing is saved in the eval directory, and I am not sure how to find the validation perplexity. What's the command for that?
mona@pascal:~/computer_vision/mscoco/models/im2txt$ ./evaluation.sh
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so.5.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so.8.0 locally
INFO:tensorflow:Prefetching values from 1 files matching /home/mona/im2txt/data/mscoco/val-?????-of-00004
INFO:tensorflow:Starting evaluation at 2016-10-10-10:52:29
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: pascal
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 367.48.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 367.48.0
INFO:tensorflow:Loading model from checkpoint: /home/mona/im2txt/model/train/model.ckpt-1122482
INFO:tensorflow:Successfully loaded model.ckpt-1122482 at global step = 1122483.
Nothing is saved in the eval dir here:
mona@pascal:~/im2txt/model/eval$ ls
mona@pascal:~/im2txt/model/eval$
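For reference, my understanding is that perplexity is written as a TensorBoard summary into the eval directory, so once files do appear there I would view it with (assuming TensorBoard is installed and on PATH):

```shell
# Validation perplexity is logged as a TensorBoard summary by the
# evaluation script; point TensorBoard at the eval directory to see it.
tensorboard --logdir="${HOME}/im2txt/model/eval"
```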
In evaluation.sh I have:
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
MODEL_DIR="${HOME}/im2txt/model"
# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""
# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/im2txt/evaluate \
--input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
--checkpoint_dir="${MODEL_DIR}/train" \
--eval_dir="${MODEL_DIR}/eval"
Can you please advise on what might have gone wrong and how to fix it?
It's difficult to know what might have gone wrong in your training / evaluation.
Are you sure your training data is correct? The following line suggests something is wrong with your validation data (there should be 4 files):
INFO:tensorflow:Prefetching values from 1 files matching /home/mona/im2txt/data/mscoco/val-?????-of-00004
What is the output of ls -l ${HOME}/im2txt/data/mscoco?
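A quick sanity check you could also run (a sketch; DATA_DIR is assumed to match the scripts above) to count how many validation shards the evaluator's glob actually matches:

```shell
# Count the validation shards the evaluator's glob will match; the
# preprocessing script should have produced exactly 4 of them.
DATA_DIR="${HOME}/im2txt/data/mscoco"
count=$(ls "$DATA_DIR"/val-?????-of-00004 2>/dev/null | wc -l)
echo "found $count of 4 expected validation shards"
```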
Perhaps you could try using the checkpoint shared in #466 and see if that gives you better results?
Closing this issue for now. Feel free to reopen if you encounter any further issues with the code.