caffe.io.array_to_datum fails with some dtypes

Created on 29 Apr 2015 · 7 Comments · Source: BVLC/caffe

I think there may be a problem with caffe.io.array_to_datum. It seems to convert only arrays of type uint8 or float64. I would have expected it to handle arrays of type int32 and float32 as well.

This is what I saw:

>>> import caffe
>>> from caffe.proto import caffe_pb2 
>>> import numpy as np
>>> ar = np.array([[[1., 2., 3., 4.,],[5., 6., 7., 8.,],[11., 22., 33., 44.]]])
>>> ar.shape
(1, 3, 4)

>>> arU8 = ar.astype(np.uint8)
>>> arI32 = ar.astype(np.int32) 
>>> arF32 = ar.astype(np.float32)
>>> arF64 = ar.astype(np.float64)

>>> a = caffe.io.array_to_datum(arU8)

>>> b = caffe.io.array_to_datum(arI32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/smgutstein/Caffe/caffe/python/caffe/io.py", line 89, in array_to_datum
    datum.float_data.extend(arr.flat)  #SG sub
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/containers.py", line 128, in extend
    new_values.append(self._type_checker.CheckValue(elem))
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/type_checkers.py", line 103, in CheckValue
    raise TypeError(message)
TypeError: 1 has type <type 'numpy.int32'>, but expected one of: (<type 'float'>, <type 'int'>, <type 'long'>)

>>> c = caffe.io.array_to_datum(arF32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/smgutstein/Caffe/caffe/python/caffe/io.py", line 89, in array_to_datum
    datum.float_data.extend(arr.flat)  
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/containers.py", line 128, in extend
    new_values.append(self._type_checker.CheckValue(elem))
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/type_checkers.py", line 103, in CheckValue
    raise TypeError(message)
TypeError: 1.0 has type <type 'numpy.float32'>, but expected one of: (<type 'float'>, <type 'int'>, <type 'long'>)

>>> d = caffe.io.array_to_datum(arF64) 
>>>

The problem, I believe, lies in using extend to store any non-uint8 data in the float_data field of a Datum object. On a plain list, extend accepts anything, but the float_data field of a Datum object is a google.protobuf.internal.containers.RepeatedScalarFieldContainer, whose extend type-checks every element, as can be seen in the error trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/smgutstein/Caffe/caffe/python/caffe/io.py", line 89, in array_to_datum
    datum.float_data.extend(arr.flat)  
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/containers.py", line 128, in extend
    new_values.append(self._type_checker.CheckValue(elem))
  File "/usr/local/lib/python2.7/dist-packages/google/protobuf/internal/type_checkers.py", line 103, in CheckValue
    raise TypeError(message)
TypeError: 1.0 has type <type 'numpy.float32'>, but expected one of: (<type 'float'>, <type 'int'>, <type 'long'>)
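
To illustrate why float64 arrays pass while int32 and float32 arrays do not, here is a minimal sketch that needs only NumPy. It assumes the protobuf check behaves like an isinstance test against the listed types. The helper passes_float_check is hypothetical and is not part of protobuf, and the sketch targets Python 3, which has no long type.

```python
import numpy as np

def passes_float_check(value):
    """Hypothetical stand-in for the float-field type check (an assumption).

    Python 3 has no `long`, so only float and int are listed here.
    """
    return isinstance(value, (float, int))

# Build one scalar of each NumPy type from the report and test it.
results = {
    name: passes_float_check(np.dtype(name).type(1))
    for name in ("int32", "float32", "float64")
}
print(results)
```

numpy.float64 is a subclass of Python's float, which is why it is the only one of the three that gets through.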

I see two ways to fix this:

Either change protobuf's type checking so that it accepts NumPy scalar types. I believe this would involve modifying the following section of code (lines 196-208 in type_checkers.py):

# Type-checkers for all scalar CPPTYPEs.
_VALUE_CHECKERS = {
    _FieldDescriptor.CPPTYPE_INT32: Int32ValueChecker(),
    _FieldDescriptor.CPPTYPE_INT64: Int64ValueChecker(),
    _FieldDescriptor.CPPTYPE_UINT32: Uint32ValueChecker(),
    _FieldDescriptor.CPPTYPE_UINT64: Uint64ValueChecker(),
    _FieldDescriptor.CPPTYPE_DOUBLE: TypeChecker(
        float, int, long),
    _FieldDescriptor.CPPTYPE_FLOAT: TypeChecker(
        float, int, long),
    _FieldDescriptor.CPPTYPE_BOOL: TypeChecker(bool, int),
    _FieldDescriptor.CPPTYPE_STRING: TypeChecker(bytes),
    }

Or modify array_to_datum so that, when it flattens an array, the returned iterator casts each value to a type that protobuf accepts:

def array_to_datum(arr, label=0):
    """Converts a 3-dimensional array to datum. If the array has dtype uint8,
    the output data will be encoded as a string. Otherwise, the output data
    will be stored in float format.
    """
    class npIterCast():                                 # +                  
      '''Class added so that iterator                   # +
         over numpy array will return                   # +
         values of float or int type,                   # +
         not np.float or np.int'''                      # +
      def __init__(self, myIter, myCast):               # +
        self.myIter = myIter                            # +
        self.myCast = myCast                            # +
      def __iter__(self):                               # +
        return self                                     # +
      def next(self):                                   # +
        return self.myCast(self.myIter.next())          # +

    if not isinstance(arr,np.ndarray):                  # +
         raise TypeError('Expecting a numpy array')     # +
    if arr.ndim != 3:
        raise ValueError('Incorrect array shape.')
    datum = caffe_pb2.Datum()
    datum.channels, datum.height, datum.width = arr.shape
    if arr.dtype == np.uint8:
        datum.data = arr.tostring()
    else:
        if np.issubdtype(arr.dtype, np.int) or \                                     # +
           np.issubdtype(arr.dtype, np.unsignedinteger):                             # +
            myCast = int                                                             # +
        elif np.issubdtype(arr.dtype, np.float):                                     # +
            myCast = float                                                           # +
        else:                                                                        # +
            raise TypeError('Expecting a numpy array of either a float or int type') # +

        castIter = npIterCast(arr.flat, myCast)  # +
        datum.float_data.extend(castIter)        # +
        # datum.float_data.extend(arr.flat)      # -
    datum.label = label
    return datum

This gave me the following results, so it appears to be working fine:

>>> import caffe 
>>> from caffe.proto import caffe_pb2
>>> import numpy as np
>>> ar = np.array([[[1., 2., 3., 4.,],[5., 6., 7., 8.,],[11., 22., 33., 44.]]])
>>>
>>> arU8 = ar.astype(np.uint8)
>>> arI32 = ar.astype(np.int32)
>>> arF32 = ar.astype(np.float32)
>>> arF64 = ar.astype(np.float64)
>>>
>>> a = caffe.io.array_to_datum(arU8)
>>> b = caffe.io.array_to_datum(arI32)
>>> c = caffe.io.array_to_datum(arF32)
>>> d = caffe.io.array_to_datum(arF64)
>>>
>>> a.float_data
[]
>>> a.data
'\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x16!,'
>>>
>>> b.float_data
[1, 2, 3, 4, 5, 6, 7, 8, 11, 22, 33, 44]
>>> c.float_data
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 11.0, 22.0, 33.0, 44.0]
>>> d.float_data
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 11.0, 22.0, 33.0, 44.0]
>>>

One issue may remain: apart from uint8, all signed and unsigned ints are converted to the system's default int type (either 32 or 64 bit), and all floats to the default float type. But for now, unless there is a better solution, I think it's OK to use this. Am I correct in believing this? Is there a better approach?
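
A more compact variant of the same idea, offered as a sketch rather than as what Caffe does: ndarray.tolist() converts NumPy scalars to built-in Python ints and floats, so no custom iterator class is needed. The helper name flat_python_values is hypothetical.

```python
import numpy as np

def flat_python_values(arr):
    """Return the array's elements in C order as built-in Python scalars."""
    # ravel() flattens in row-major order, the same order arr.flat uses;
    # tolist() converts each element to a Python int or float.
    return np.asarray(arr).ravel().tolist()

values_i32 = flat_python_values(np.arange(6, dtype=np.int32).reshape(1, 2, 3))
values_f32 = flat_python_values(np.ones((1, 2, 3), dtype=np.float32))
print(type(values_i32[0]), type(values_f32[0]))
```

The resulting list could then be passed to extend in place of arr.flat, at the cost of materialising the whole list in memory.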

Thanks,

Steven

Python bug

tubaybb321

All 7 comments

Thanks for reporting the issue. I've changed the title to be a little more specific. We should probably just change arr.flat to arr.astype(float).flat; you're welcome to PR that.
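
A minimal sketch of what the suggested cast does, using only NumPy and the array from the report; the dictionary name accepted is just for illustration. After astype(float), every element yielded by .flat is a numpy.float64, which is also an instance of Python's float.

```python
import numpy as np

ar = np.array([[[1., 2., 3., 4.], [5., 6., 7., 8.], [11., 22., 33., 44.]]])

accepted = {}
for dtype in (np.uint8, np.int32, np.float32, np.float64):
    cast = ar.astype(dtype).astype(float)  # always a float64 array
    # Record whether every flattened element is an instance of Python float.
    accepted[np.dtype(dtype).name] = all(isinstance(x, float) for x in cast.flat)

print(accepted)
```

Note that the cast allocates a float64 copy, which for a float32 input is twice the size of the original array.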

longjon on 8 May 2015

@longjon - Thanks. I like your fix better. I suppose I got too fussy about maintaining variable type, which may not be truly important for this function.

I'd be happy to "PR" this, but I don't know what it means to "PR" something....

tubaybb321 on 18 May 2015

@tubaybb321 PR stands for pull request (docs).

seanbell on 23 May 2015

Hi @longjon, @seanbell, @tubaybb321! Pull request #4182 contains the suggested fix, with a little additional discussion in a commit message there as well. I'll copy in that message for those finding this thread:

As recommended by @longjon, [PR #4182] will allow caffe.io.array_to_datum to handle, for example, numpy.float32 arrays.

It might be worth noting that datum.float_data is stored as protobuf type 2, which is float32, as opposed to protobuf type 1, which is float64. It is a little unintuitive that caffe currently requires data to be passed in as float64 but then writes float32 to LMDB. To demonstrate this:

datum = caffe.io.array_to_datum(np.array([[[0.9]]]))
caffe.io.datum_to_array(datum)
# array([[[ 0.9]]])
datum_str = datum.SerializeToString()
new_datum = caffe.proto.caffe_pb2.Datum()
new_datum.ParseFromString(datum_str)
caffe.io.datum_to_array(new_datum)
# array([[[ 0.89999998]]])

This behavior is somewhat hidden because datum_to_array returns type float64, even though the data doesn't actually have that resolution if it has been serialized through protobuf anywhere (for example in LMDB).
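
The same loss of resolution can be reproduced without Caffe or protobuf. The following is a sketch using only NumPy, with np.float32 standing in for the 32-bit float field; the variable names are illustrative.

```python
import numpy as np

original = 0.9                     # Python float, 64-bit
as_float32 = np.float32(original)  # what a 32-bit float field can hold
round_tripped = float(as_float32)  # widened back to 64-bit

# The widened value is close to, but not equal to, the original.
error = abs(round_tripped - original)
print(round_tripped == original, error)
```

Widening back to float64 restores the type but not the digits that were dropped when the value was narrowed.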

Alternative solutions:

Require and return float32, consistent with the protobuf representation.
Change the protobuf to allow float32 or float64 and update surrounding code to support this.

ajschumacher on 19 May 2016

Is this fixed in newer builds? I'm facing the same issue here: I'm stuck with float64 and I run out of memory because of it (I can't save the normalized data into a leveldb, and I have 32G of RAM, by the way!)

Coderx7 on 16 Oct 2016

Fixed in #2391

shelhamer on 13 Apr 2017

@longjon Thanks, your solution worked for me; it is very simple and helpful. I just cast the array to float before passing it to caffe.io.array_to_datum:

arr=arr.astype(float)
arr_dat=caffe.io.array_to_datum(arr)