I have a training database stored in HDF5 format. However, Caffe crashes immediately when it tries to train on it. Error message:
I0820 16:56:50.634572 15886 hdf5_data_layer.cpp:80] Loading list of HDF5 filenames from: /home/Databases/train.txt
I0820 16:56:50.634627 15886 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
F0820 16:56:50.655230 15886 blob.cpp:101] Check failed: data_
*** Check failure stack trace: ***
@ 0x7f5f7eebcdaa (unknown)
@ 0x7f5f7eebcce4 (unknown)
@ 0x7f5f7eebc6e6 (unknown)
@ 0x7f5f7eebf687 (unknown)
@ 0x7f5f7f2b63ce caffe::Blob<>::mutable_cpu_data()
@ 0x7f5f7f20e85d caffe::hdf5_load_nd_dataset<>()
@ 0x7f5f7f2575ae caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x7f5f7f2563d8 caffe::HDF5DataLayer<>::LayerSetUp()
@ 0x7f5f7f2d0332 caffe::Net<>::Init()
@ 0x7f5f7f2d1df2 caffe::Net<>::Net()
@ 0x7f5f7f2ddec0 caffe::Solver<>::InitTrainNet()
@ 0x7f5f7f2defd3 caffe::Solver<>::Init()
@ 0x7f5f7f2df1a6 caffe::Solver<>::Solver()
@ 0x40c4b0 caffe::GetSolver<>()
@ 0x406481 train()
@ 0x404a21 main
@ 0x7f5f7e3cdec5 (unknown)
@ 0x404fcd (unknown)
@ (nil) (unknown)
When I split my training database into a smaller chunk (~13 GB), everything works fine (all other parameters remained unchanged).
So I guess Caffe has a problem with large HDF5 files?
You need to compile Caffe in debug mode, run it with gdb, and send the stack trace.
So I compiled Caffe in debug mode. This is the output:
I0828 11:55:43.010573 9445 hdf5_data_layer.cpp:94] Number of HDF5 files: 1
I0828 11:55:43.010650 9445 hdf5_data_layer.cpp:29] Loading HDF5 file: /path/to/data/trainDataset.h5
F0828 11:55:43.055294 9445 blob.cpp:29] Check failed: shape[i] <= 2147483647 / count_ (833 vs. 715) blob size exceeds INT_MAX
*** Check failure stack trace: ***
@ 0x7f51df2f6daa (unknown)
@ 0x7f51df2f6ce4 (unknown)
@ 0x7f51df2f66e6 (unknown)
@ 0x7f51df2f9687 (unknown)
@ 0x7f51dfb254dd caffe::Blob<>::Reshape()
@ 0x7f51dfa7132d caffe::hdf5_load_nd_dataset_helper<>()
@ 0x7f51dfa70006 caffe::hdf5_load_nd_dataset<>()
@ 0x7f51dfab2e9f caffe::HDF5DataLayer<>::LoadHDF5FileData()
@ 0x7f51dfab25e0 caffe::HDF5DataLayer<>::LayerSetUp()
@ 0x7f51dfacf4ba caffe::Layer<>::SetUp()
@ 0x7f51dfb32602 caffe::Net<>::Init()
@ 0x7f51dfb30779 caffe::Net<>::Net()
@ 0x7f51dfb4fe43 caffe::Solver<>::InitTrainNet()
@ 0x7f51dfb4f665 caffe::Solver<>::Init()
@ 0x7f51dfb4f15a caffe::Solver<>::Solver()
@ 0x41b9e3 caffe::SGDSolver<>::SGDSolver()
@ 0x419363 caffe::GetSolver<>()
@ 0x41503b train()
@ 0x4173fa main
@ 0x7f51de807ec5 (unknown)
@ 0x413fd9 (unknown)
@ (nil) (unknown)
Unfortunately, I don't know how to use gdb to help in that case.
Never mind, that is enough.
There is an intrinsic limit on the blob shape size: CHECK_LE(shape[i], INT_MAX / count_)
So the blob has a limit of 2 GB minus 1 byte, and you are over it.
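For illustration, here is a quick way to check ahead of time whether a dataset will trip that check; the file path and the dataset name "data" are placeholders for your own:

```python
import h5py
import numpy as np

INT_MAX = 2147483647  # element-count limit enforced in caffe::Blob::Reshape

# Placeholder path and dataset name -- adjust to your file layout.
with h5py.File("/path/to/data/trainDataset.h5", "r") as f:
    shape = f["data"].shape
    count = int(np.prod(shape, dtype=np.int64))
    print(shape, count, "fits" if count <= INT_MAX else "exceeds INT_MAX")
```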
OK, but what should I do then? I guess the number of training samples should not matter; I'm sure there are people with more than 2 GB of training data.
I could cut my training data into chunks of less than 2 GB, train on the first chunk, save the caffemodel file, then load the next chunk and fine-tune the caffemodel on that chunk, and so on...
Or is there a more elegant way?
Thanks for your help so far.
This is not a bug. You need to close this ticket and continue the discussion on the caffe-users mailing list.
@mgarbade I believe you can have multiple HDF5 files, each with fewer than 2GB of data, but where the combination of all of them is above 2GB. You specify all the files in a list. The data layer will then cycle through the list of files. You can also get it to shuffle the list of files itself. See: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/hdf5_data_layer.cpp#L138
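To illustrate that approach, here is a rough sketch that splits one oversized HDF5 file into several smaller ones and writes the list file the HDF5Data layer reads. The dataset names "data" and "label", the output file names, and the chunk size are assumptions to adapt to your own setup:

```python
import h5py

SRC = "/path/to/data/trainDataset.h5"  # assumed source file holding 'data' and 'label'
ROWS_PER_FILE = 5000                   # pick so each chunk stays well under the blob limit

with h5py.File(SRC, "r") as src, open("train.txt", "w") as listing:
    n = src["data"].shape[0]
    for i, start in enumerate(range(0, n, ROWS_PER_FILE)):
        stop = min(start + ROWS_PER_FILE, n)
        out_name = "train_part_%03d.h5" % i
        with h5py.File(out_name, "w") as dst:
            dst.create_dataset("data", data=src["data"][start:stop])
            dst.create_dataset("label", data=src["label"][start:stop])
        listing.write(out_name + "\n")
```

The HDF5Data layer's source parameter then points at train.txt, and the layer cycles through (and can shuffle) the listed files.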
I just ran into this issue as well.
My batch size is 100, so my blob shape should be (100,3,256,256). Altogether, that's 19,660,800 floats. Should be fine.
19M << 2147M (INT_MAX)
The whole dataset, however, would be (35676,3,256,256) as a single blob. Altogether that's 7,014,187,008 floats.
7014M > 2147M (INT_MAX)
Is the HDF5Data layer trying to read the whole HDF5 dataset into a single blob? Why?
I just verified that a batch size of 10,923 fails (10923 * 3 * 256 * 256 = 2147549184) and a batch size of 10,922 doesn't (10922 * 3 * 256 * 256 = 2147352576). That is true whether the HDF5 dataset dtype is float32 (8.1G file) or uint8 (2.1G file) (requires #2978 to test). So the actual file size doesn't matter; what matters is the product of the dimensions.
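For reference, that break-even point follows directly from the element-count limit; a back-of-the-envelope check, assuming 3x256x256 samples:

```python
INT_MAX = 2147483647
per_sample = 3 * 256 * 256               # 196,608 elements per sample

print(INT_MAX // per_sample)             # 10922 -> largest blob that still fits
print(10922 * per_sample <= INT_MAX)     # True  (2147352576)
print(10923 * per_sample <= INT_MAX)     # False (2147549184)
```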
UINT_MAX?
@lukeyeager, for 1: I always talked of a blob limit of 2 GB minus 1 byte, so 2147483647.
For 2, see https://github.com/BVLC/caffe/issues/1470
(1) Yeah, but if it's 4 bytes per number (for the float32 dtype), isn't that an (implicit) 8 GB file limit? That's what I'm seeing.
(2,3) Aha, so the HDF5Data layer doesn't prefetch? That's vexing. I still don't see a need for the INT_MAX limit, but it won't matter after #2892.
Hi, like @lukeyeager, I can't see the need for using a signed integer for the count variable in blobs. Is there a particular reason for this, instead of using an unsigned int? I am having issues with big 3D data.
Thanks!
Closing as duplicate of #1470.