I am working on a project using models/slim, and I have my own image dataset, which currently has 8 classes and will soon grow to hundreds.
My model is Inception v3. I prepared my own dataset and converted it to TFRecord format successfully using _download_and_convert_data.py_.
My custom dataset has 8 classes, and I've trained on one of them for 1,000 steps with _inception_v3.ckpt_ as the starting checkpoint using _train_image_classifier.py_. As a result, I got a new checkpoint (_model.ckpt-1000_), and I've also evaluated it by running _eval_image_classifier.py_.
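For reference, here is a rough sketch of the restore step that _train_image_classifier.py_ performs when it is given `--checkpoint_path` and `--checkpoint_exclude_scopes` (this is not the actual script; the checkpoint path is a placeholder, and it assumes TensorFlow 1.x with the models/slim directory on `PYTHONPATH`):

```python
# Hedged sketch of restoring Inception v3 weights from a checkpoint, roughly
# the way train_image_classifier.py does with --checkpoint_path and
# --checkpoint_exclude_scopes. The checkpoint path below is a placeholder.
import tensorflow as tf
from nets import inception

slim = tf.contrib.slim

images = tf.placeholder(tf.float32, [None, 299, 299, 3])
num_classes = 8  # my current number of classes

with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, _ = inception.inception_v3(images, num_classes=num_classes,
                                        is_training=True)

# The pretrained inception_v3.ckpt was trained with 1001 ImageNet classes, so
# the class-specific logits layers are excluded and re-initialized.
variables_to_restore = slim.get_variables_to_restore(
    exclude=['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
init_fn = slim.assign_from_checkpoint_fn('/path/to/inception_v3.ckpt',
                                         variables_to_restore)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    init_fn(sess)  # the restored variables now hold the pretrained weights
```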
Now, I want to train on my second dataset starting from the newly obtained checkpoint (_model.ckpt-1000_), but it fails with a data loss error.
As mentioned above, the same checkpoint (_model.ckpt-1000_) worked fine for evaluation, but it doesn't work as the starting point for new training.
**Starting from _inception_v3.ckpt_, I plan to train on each of my datasets (8 in total now, increasing to hundreds) one after another, each time using the checkpoint obtained in the previous training session. Does this plan have any problems?**
Here is the error message.
Traceback (most recent call last):
File "/Users/andymac/dev/tf_python2/code/models/slim/train_image_classifier.py", line 600, in
tf.app.run()
............................
............................
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /Users/andymac/dev/tf_python2/code/models/slim/food/checkpoint/model.ckpt-1: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[Node: save/RestoreV2_80 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_80/tensor_names, save/RestoreV2_80/shape_and_slices)]]
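In case it helps with the diagnosis, a checkpoint prefix can be sanity-checked with something like the sketch below (the path is simply the one from the error message; note that `--checkpoint_path` should point at the prefix such as `model.ckpt-1000`, not at one of the individual files it produces):

```python
# Hedged sketch: check whether a checkpoint prefix is readable at all.
# The path below is just the one from the error message; adjust as needed.
import tensorflow as tf

ckpt_prefix = ('/Users/andymac/dev/tf_python2/code/models/slim/'
               'food/checkpoint/model.ckpt-1')

# Raises an error (e.g. DataLossError) if the checkpoint cannot be parsed;
# otherwise lists every variable stored in it.
reader = tf.train.NewCheckpointReader(ckpt_prefix)
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
```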
It's hard to say what's happening, since the details depend a lot on the exact steps that were followed. For example, I notice that your report says the checkpoint is named model.ckpt-1000, but the error message is about a failure to open model.ckpt-1.
Have you made any modifications to the code? What does the full stack trace look like?
You're right. I actually used model.ckpt-1 for my test, and that is the checkpoint I tried to train from.
I just mentioned model.ckpt-1000 as an example.
I am not sure whether I can train on each of my own datasets (eventually hundreds of classes) one after another, each time using the checkpoint obtained from the previous training run. This plan doesn't have any problems, right?
If I understand your intent correctly, then yes, that strategy should be fine.
Though perhaps you want to post this on Stack Overflow, which is a more suitable forum for usage advice. We try to keep the GitHub issues focused on bugs and feature requests.
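For what it's worth, each follow-up training run just needs its `--checkpoint_path` pointed at the checkpoint produced by the previous run (or at that run's `train_dir`). Roughly (a sketch; the directory below is a placeholder):

```python
# Hedged sketch: resolving the checkpoint produced by a previous training run,
# similar to what train_image_classifier.py does when --checkpoint_path is a
# directory rather than a checkpoint prefix.
import tensorflow as tf

previous_train_dir = '/path/to/previous/train_dir'

# Returns e.g. '/path/to/previous/train_dir/model.ckpt-1000' (the prefix),
# or None if no checkpoint has been written there yet.
ckpt_prefix = tf.train.latest_checkpoint(previous_train_dir)
print('Next run will fine-tune from:', ckpt_prefix)
```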
@kyoseoksong I have run into the same problem as you. Once I stop the fine-tuning process, I can never restart it, either with the same command as in the instructions or by doing what you did and using the newly obtained model as the checkpoint. Have you solved it?
@kyoseoksong I may have figured it out. Now it works fine when starting from the newly obtained model.
@kyoseoksong Can you please share how you did it? Thanks.