Models: tutorial image cifar10 estimator generate TFRecord error

Created on 21 Jan 2018 · 3 Comments · Source: tensorflow/models

Describe the problem

I ran the tutorial in models/tutorials/image/cifar10_estimator/ with the following command:

python3 generate_cifar10_tfrecords.py --data-dir="/home/ros/data/DeepLearning/test/cifar-10-data"

But I got the following error:

Download from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and extract.
Generating /home/ros/data/DeepLearning/test/cifar-10-data/train.tfrecords
Traceback (most recent call last):
  File "generate_cifar10_tfrecords.py", line 120, in <module>
    main(args.data_dir)
  File "generate_cifar10_tfrecords.py", line 107, in main
    convert_to_tfrecord(input_files, output_file)
  File "generate_cifar10_tfrecords.py", line 79, in convert_to_tfrecord
    data_dict = read_pickle_from_file(input_file)
  File "generate_cifar10_tfrecords.py", line 69, in read_pickle_from_file
    data_dict = cPickle.load(f)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 126, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 94, in _prepare_value
    return compat.as_str_any(val)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 106, in as_str_any
    return as_str(value)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

System information

== cat /etc/issue ===============================================
Linux 747ea2e33a16 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
Yes

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux 747ea2e33a16 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.13.3)
numpy-stl (2.3.2)
protobuf (3.5.0.post1)
tensorflow-gpu (1.4.1)
tensorflow-tensorboard (0.4.0rc3)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.4.1
tf.GIT_VERSION = v1.4.0-19-ga52c8d9
tf.COMPILER_VERSION = v1.4.0-19-ga52c8d9
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

== cuda libs ===================================================
/usr/local/lib/python3.5/dist-packages/torch/lib/libcudart.so.8.0
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.61
/usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart_static.a

awaiting review

jacknlliu

Most helpful comment

I solved this problem.
It seems that the CIFAR-10 data website has updated its data file format, but this script has not been updated to match.

First, it seems that you are using Python 3, so changing cPickle to pickle is still the right thing to do.
Second, open the file with the flag 'rb' instead of just 'r'. The code looks like this:

    import pickle as cPickle
    with tf.gfile.Open(filename, 'rb') as f:
        data_dict = cPickle.load(f, encoding='bytes')

Lastly, modify the _bytes_feature(value) function:

    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
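
To illustrate why both the 'rb' flag and encoding='bytes' matter, here is a minimal sketch for Python 3 that does not need TensorFlow. The PY2_PICKLE payload is a small hand-written stand-in for a pickle written by Python 2, not real CIFAR-10 data, and the helper names are made up for this example. It shows that such a pickle starts with byte 0x80, which is not valid UTF-8 (matching the error in the traceback), and that loading it from a binary stream with encoding='bytes' works and yields bytes keys.

```python
import io
import pickle

# Hand-written stand-in for a protocol-2 pickle produced by Python 2,
# equivalent to pickling {'labels': [1, 2]}. This is NOT real CIFAR-10 data.
PY2_PICKLE = b'\x80\x02}q\x00U\x06labelsq\x01]q\x02(K\x01K\x02es.'


def first_byte_is_proto_opcode(payload):
    """Protocol 2+ pickles begin with the PROTO opcode, byte 0x80."""
    return payload[:1] == b'\x80'


def decodes_as_utf8(payload):
    """Return True if the raw pickle bytes can be decoded as UTF-8 text."""
    try:
        payload.decode('utf-8')
        return True
    except UnicodeDecodeError:
        # This is the failure a text-mode ('r') read runs into.
        return False


def load_as_bytes(payload):
    """Unpickle from a binary stream, keeping Python 2 str values as bytes."""
    return pickle.load(io.BytesIO(payload), encoding='bytes')


if __name__ == '__main__':
    print(first_byte_is_proto_opcode(PY2_PICKLE))
    print(decodes_as_utf8(PY2_PICKLE))
    print(load_as_bytes(PY2_PICKLE))
```

With encoding='bytes' the dictionary keys come back as bytes, so any code that reads the loaded dictionary needs to index it with keys such as b'labels' rather than 'labels'.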

Here, I also attached the whole file for your convenience(modify the file's suffix to py for usage).

generate_cifar10_tfrecords.txt

raven1989 on 22 Jan 2018

👍3

All 3 comments

Thanks for the fix @raven1989! Would you like to submit a PR to fix this up for good?

tatatodd on 26 Jan 2018

I think this is the same issue as #3428?
I tried cPickle.load() with encoding='bytes', which works fine in Python 3 but not in Python 2.7.
cPickle.load() does not accept an encoding argument in Python 2.7.
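
One possible way around this, as a rough sketch rather than the fix that was actually merged, is to pass the encoding argument only on Python 3. The helper name load_pickle_compat is made up for this example, and the sample payload is a hand-written stand-in for a Python 2 pickle, not real CIFAR-10 data.

```python
import io
import sys

try:
    import cPickle as pickle  # Python 2: the C implementation
except ImportError:
    import pickle  # Python 3: cPickle no longer exists


def load_pickle_compat(fileobj):
    """Load a pickle from a binary file object on Python 2 or 3.

    Hypothetical helper: the 'encoding' keyword exists only on Python 3,
    so it is passed only there.
    """
    if sys.version_info[0] >= 3:
        return pickle.load(fileobj, encoding='bytes')
    return pickle.load(fileobj)


if __name__ == '__main__':
    # Hand-written stand-in for a protocol-2 pickle of {'labels': [1, 2]}.
    sample = b'\x80\x02}q\x00U\x06labelsq\x01]q\x02(K\x01K\x02es.'
    data_dict = load_pickle_compat(io.BytesIO(sample))
    # b'labels' is a plain str on Python 2 and bytes on Python 3,
    # so the same lookup works on both.
    print(data_dict[b'labels'])
```

The file still has to be opened in binary mode ('rb') on both versions for this to work.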