I have a training database stored in HDF5 format. However, Caffe crashes immediately when it tries to train on it. Error message:
I0820 16:56:50.634572 15886 hdf5_data_layer.cpp:80] Loading list of HDF5 filenames from: /home/Databases/train.txt
I0820 16:56:50.634627 15886 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
F0820 16:56:50.655230 15886 blob.cpp:101] Check failed: data_
*** Check failure stack trace: ***
@ 0x7f5f7eebcdaa (unknown)
@ 0x7f5f7eebcce4 (unknown)
@ 0x7f5f7eebc6e6 (unknown)
@ 0x7f5f7eebf687 (unknown)
@ 0x7f5f7f2b63ce caffe::Blob<>::mutable_cpu_data()
@ 0x7f5f7f20e85d caffe::hdf5_load_nd_dataset<>()
@ 0x7f5f7f2575ae caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x7f5f7f2563d8 caffe::HDF5DataLayer<>::LayerSetUp()
@ 0x7f5f7f2d0332 caffe::Net<>::Init()
@ 0x7f5f7f2d1df2 caffe::Net<>::Net()
@ 0x7f5f7f2ddec0 caffe::Solver<>::InitTrainNet()
@ 0x7f5f7f2defd3 caffe::Solver<>::Init()
@ 0x7f5f7f2df1a6 caffe::Solver<>::Solver()
@ 0x40c4b0 caffe::GetSolver<>()
@ 0x406481 train()
@ 0x404a21 main
@ 0x7f5f7e3cdec5 (unknown)
@ 0x404fcd (unknown)
@ (nil) (unknown)
When I split my training database into a smaller chunk (~13 GB), everything works fine (all other parameters remained unchanged).
So I guess Caffe has a problem with large HDF5 files?
You need to compile Caffe in debug mode, run it with gdb, and send the stack trace.
So I compiled Caffe in debug mode. This is the output:
I0828 11:55:43.010573 9445 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
I0828 11:55:43.010650 9445 hdf5_data_layer.cpp:29] Loading HDF5 file: /path/to/data/trainDataset.h5
F0828 11:55:43.055294 9445 blob.cpp:29] Check failed: shape[i] <= 2147483647 / count_ (833 vs. 715) blob size exceeds INT_MAX
*** Check failure stack trace: ***
@ 0x7f51df2f6daa (unknown)
@ 0x7f51df2f6ce4 (unknown)
@ 0x7f51df2f66e6 (unknown)
@ 0x7f51df2f9687 (unknown)
@ 0x7f51dfb254dd caffe::Blob<>::Reshape()
@ 0x7f51dfa7132d caffe::hdf5_load_nd_dataset_helper<>()
@ 0x7f51dfa70006 caffe::hdf5_load_nd_dataset<>()
@ 0x7f51dfab2e9f caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x7f51dfab25e0 caffe::HDF5DataLayer<>::LayerSetUp()
@ 0x7f51dfacf4ba caffe::Layer<>::SetUp()
@ 0x7f51dfb32602 caffe::Net<>::Init()
@ 0x7f51dfb30779 caffe::Net<>::Net()
@ 0x7f51dfb4fe43 caffe::Solver<>::InitTrainNet()
@ 0x7f51dfb4f665 caffe::Solver<>::Init()
@ 0x7f51dfb4f15a caffe::Solver<>::Solver()
@ 0x41b9e3 caffe::SGDSolver<>::SGDSolver()
@ 0x419363 caffe::GetSolver<>()
@ 0x41503b train()
@ 0x4173fa main
@ 0x7f51de807ec5 (unknown)
@ 0x413fd9 (unknown)
@ (nil) (unknown)
Unfortunately, I don't know how to use gdb to help in that case.
Never mind, that is enough.
There is an intrinsic limit on the blob shape size: CHECK_LE(shape[i], INT_MAX / count_)
So the blob has a limit of 2 GB minus 1 byte, and you are over it.
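For illustration, here is a quick way to check ahead of time whether a dataset will trip that check; the file path and the dataset name "data" are placeholders for your own:

```python
import h5py
import numpy as np

INT_MAX = 2147483647  # element-count limit enforced in caffe::Blob::Reshape

# Placeholder path and dataset name -- adjust to your file layout.
with h5py.File("/path/to/data/trainDataset.h5", "r") as f:
    shape = f["data"].shape
    count = int(np.prod(shape, dtype=np.int64))
    print(shape, count, "fits" if count <= INT_MAX else "exceeds INT_MAX")
```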
OK, but what should I do then? I guess the number of training samples should not matter; I'm sure there are people with more than 2 GB of training data.
I could cut my training data into chunks of less than 2 GB, train on the first chunk, save the caffemodel file, then load the next chunk and fine-tune the caffemodel on that chunk, and so on...
Or is there a more elegant way?
Thanks for your help so far.
This is not a bug. You need to close this ticket and continue the discussion on the caffe-users mailing list.
@mgarbade I believe you can have multiple HDF5 files, each with fewer than 2GB of data, but where the combination of all of them is above 2GB. You specify all the files in a list. The data layer will then cycle through the list of files. You can also get it to shuffle the list of files itself. See: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/hdf5_data_layer.cpp#L138
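To illustrate that approach, here is a rough sketch that splits one oversized HDF5 file into several smaller ones and writes the list file the HDF5Data layer reads. The dataset names "data" and "label", the output file names, and the chunk size are assumptions to adapt to your own setup:

```python
import h5py

SRC = "/path/to/data/trainDataset.h5"  # assumed source file holding 'data' and 'label'
ROWS_PER_FILE = 5000                   # pick so each chunk stays well under the blob limit

with h5py.File(SRC, "r") as src, open("train.txt", "w") as listing:
    n = src["data"].shape[0]
    for i, start in enumerate(range(0, n, ROWS_PER_FILE)):
        stop = min(start + ROWS_PER_FILE, n)
        out_name = "train_part_%03d.h5" % i
        with h5py.File(out_name, "w") as dst:
            dst.create_dataset("data", data=src["data"][start:stop])
            dst.create_dataset("label", data=src["label"][start:stop])
        listing.write(out_name + "\n")
```

The HDF5Data layer's source parameter then points at train.txt, and the layer cycles through (and can shuffle) the listed files.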
I just ran into this issue as well.
My batch size is 100, so my blob shape should be (100,3,256,256). Altogether, that's 19,660,800 floats. Should be fine.
19M << 2147M (INT_MAX)
The whole dataset, however, would be (35676,3,256,256) as a single blob. Altogether that's 7,014,187,008 floats.
7014M > 2147M (INT_MAX)
Is the HDF5Data layer trying to read the whole HDF5 dataset into a single blob? Why?
I just verified that a batch size of 10,923 fails (10923 * 3 * 256 * 256 = 2147549184) and a batch size of 10,922 doesn't (10922 * 3 * 256 * 256 = 2147352576). That is true whether the HDF5 dataset dtype is float32 (8.1G file) or uint8 (2.1G file) (requires #2978 to test). So the actual file size doesn't matter; what matters is the product of the dimensions.
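For reference, that break-even point follows directly from the element-count limit; a back-of-the-envelope check, assuming 3x256x256 samples:

```python
INT_MAX = 2147483647
per_sample = 3 * 256 * 256               # 196,608 elements per sample

print(INT_MAX // per_sample)             # 10922 -> largest blob that still fits
print(10922 * per_sample <= INT_MAX)     # True  (2147352576)
print(10923 * per_sample <= INT_MAX)     # False (2147549184)
```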
UINT_MAX?
@lukeyeager, for 1: I always talked of a blob limit of 2 GB minus 1 byte, so 2147483647.
For 2, see https://github.com/BVLC/caffe/issues/1470
(1) Yeah, but if it's 4 bytes per number (for the float32 dtype), isn't that an (implicit) 8 GB file limit? That's what I'm seeing.
(2,3) Aha, so the HDF5Data layer doesn't prefetch? That's vexing. I still don't see a need for the INT_MAX limit, but it won't matter after #2892.
Hi, like @lukeyeager, I can't see the need for using a signed integer for the count variable in blobs. Is there a particular reason for this, instead of using an unsigned int? I am having issues with big 3D data.
Thanks!
Closing as duplicate of #1470.