I trained a network on a small dataset (14604 images in the training set) using batch size=128.
Then I trained the same network with the same batch size on a larger dataset (688512 images in the training set).
During training, I noticed that a single iteration takes ~2.5 sec on the smaller dataset but ~80 sec on the larger one.
I expected an epoch to take longer with the larger dataset, not a single iteration. Am I wrong?
I loaded the dataset in this way:
https://gist.github.com/amaraga/a80922e2abb64e0e1297c46255bdab2e
Following your suggestion, I modified the loading in this way, but I got the same result: the duration of one iteration depends on the size of the dataset:
https://gist.github.com/amaraga/2321ae9c44300ac9057589f71d282990
I must fix this problem, because at the moment I can't train my CNNs on large datasets.
Can you post the network configuration you are using, as well as the backend (nd4j-native vs. cuda), OS and hardware details (CPU, RAM, GPU etc)?
Can you also post the XMX setting you are using, if you are setting that?
First of all, this problem is independent of the network configuration, but as an example I also tried with:
https://gist.github.com/n-gregori/d0450e7dbc1f163ae62ae1a81a370c04
backend info:
CUDA version: 8.0
DL4J version: 0.7.1
hardware details:
Red Hat Enterprise Linux Server release 7.2
on aws amazon instance p2.xlarge
1 GPU Tesla k80 with 12g of memory
4 CPU 61g of memory (Intel Xeon E5-2686 v4)
XMX setting:
-J-Xmx60g
@n-gregori Thanks for the info.
@raver119 I'm thinking GC load from the 688k images?
https://github.com/deeplearning4j/DataVec/blob/master/datavec-data/datavec-data-image/src/main/java/org/datavec/image/recordreader/BaseImageRecordReader.java#L101
Specifically BaseImageRecordReader has a Collection<File>, with one entry in memory for each image in the training set.
Right, that looks like the same GC-related issue, which is planned to be solved in next major release.
Related issue: https://github.com/deeplearning4j/nd4j/issues/1559
@n-gregori one approach in the meantime: you can save your DataSet objects first, using DataSet.save(File)
Then re-load them for training using this: https://github.com/deeplearning4j/nd4j/blob/master/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/linalg/dataset/ExistingMiniBatchDataSetIterator.java
If we're right about that, as long as you don't have the ImageRecordReader in memory, you should have significantly reduced GC load and hence better performance.
@AlexDBlack @raver119
I've tried this now, but the solution is unworkable because of the size of the saved files. For example:
A dataset of 2 million images at 150x150 (so not particularly big) takes approximately 2 TB.
This renders DL4J completely unusable for real implementations!
Is there any other workaround?
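For context, a back-of-the-envelope estimate shows why pre-saved, uncompressed tensors grow this large compared with compressed image files. This is only a sketch: the channel count (3) and the value widths (4-byte float, 8-byte double) are assumptions that are not stated in this thread, and the class name is made up for illustration. It ignores labels and per-file overhead, so it is a lower bound rather than a reproduction of the ~2 TB figure.

```java
/** Hypothetical helper for illustration only; not part of DL4J/ND4J. */
class DatasetSizeEstimate {

    /** Raw size in bytes of the feature tensors alone (no labels, no file headers). */
    static long rawBytes(long images, int height, int width, int channels, int bytesPerValue) {
        // 'images' is a long, so the whole product is computed in 64-bit arithmetic.
        return images * height * width * channels * bytesPerValue;
    }

    public static void main(String[] args) {
        // ASSUMPTION: 3 colour channels; the thread does not say.
        long asFloat = rawBytes(2_000_000L, 150, 150, 3, 4);
        long asDouble = rawBytes(2_000_000L, 150, 150, 3, 8);
        System.out.println("float32 bytes: " + asFloat);
        System.out.println("float64 bytes: " + asDouble);
    }
}
```

Under these assumptions the features alone come to 540 GB as floats or 1.08 TB as doubles, which is within a small factor of the size reported above.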
Obviously you could write your own iterator that'll produce DataSets on the fly, without storing references.
Or just wait for next release, this issue is one of two ultra-high-priority issues for upcoming release.
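A minimal sketch of the "on the fly" idea, assuming all images sit in one flat directory. It is plain Java with no DL4J dependency: the class name OnTheFlyBatchIterator and the loader function are hypothetical, and in a real project the class would presumably implement DataSetIterator and the loader would decode an image into a DataSet. The point is only that the directory is enumerated lazily, so no per-image File or URI object is kept alive for the whole dataset.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical helper: streams a directory and builds one mini-batch at a time. */
class OnTheFlyBatchIterator<T> implements Iterator<List<T>>, AutoCloseable {
    private final DirectoryStream<Path> stream; // enumerates entries lazily
    private final Iterator<Path> files;
    private final Function<Path, T> loader;     // ASSUMPTION: turns one file into one example
    private final int batchSize;

    OnTheFlyBatchIterator(Path dir, int batchSize, Function<Path, T> loader) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.stream = Files.newDirectoryStream(dir);
        this.files = stream.iterator();
        this.loader = loader;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return files.hasNext();
    }

    @Override
    public List<T> next() {
        if (!files.hasNext()) {
            throw new NoSuchElementException();
        }
        // Only the current batch is held in memory; the last batch may be smaller.
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && files.hasNext()) {
            batch.add(loader.apply(files.next()));
        }
        return batch;
    }

    @Override
    public void close() throws IOException {
        stream.close();
    }
}
```

Note that this sketch does not shuffle, and the order in which a directory stream returns entries is unspecified. It also makes a single pass, so a new iterator has to be opened for each epoch.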
Thanks @raver119 !
In any case, using DataSet.save(File) for huge numbers of images takes a very long time: approximately 3 to 4 weeks for a dataset of 2 million images at 150x150 with my hardware configuration.
Obviously you could write your own iterator that'll produce DataSets on the fly, without storing references.
That's one more possible workaround I mentioned above.
Yeah, I want to look at this as a matter of priority. We'll keep you posted.
@raver119 @AlexDBlack I'm trying to write my own iterator. So far I have managed to produce DataSets on the fly, so that the per-iteration time is independent of the number of images in the dataset.
However, at random points during training, this CUDA exception is thrown:
23:06:11.611 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 16443 is 3.698580161540436
23:06:13.876 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 16444 is 3.7974990748615935
23:06:16.152 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 16445 is 3.4000109563753087
23:06:18.423 [main] INFO o.d.o.l.ScoreIterationListener - Score at iteration 16446 is 3.2241867896000396
CUDA error at /projects/skymind/deploy/linux-x86_64/libnd4j/blas/cuda/NativeOps.cu:5202 code=77(<unknown>) "result"
CUDA error at /projects/skymind/deploy/linux-x86_64/libnd4j/blas/cuda/NativeOps.cu:5160 code=77(<unknown>) "result"
Failed on [140103466419744] -> [8667232256], size: [256], direction: [0], result: [77]
java.lang.IllegalStateException: MemcpyAsync H2H failed: [140103466419744] -> [8667232256]
at org.nd4j.jita.handler.impl.CudaZeroHandler.memcpyAsync(CudaZeroHandler.java:530)
at org.nd4j.jita.handler.impl.CudaZeroHandler.memcpyBlocking(CudaZeroHandler.java:631)
at org.nd4j.jita.allocator.impl.AtomicAllocator.memcpyBlocking(AtomicAllocator.java:820)
at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.concat(JCublasNDArrayFactory.java:617)
at org.nd4j.linalg.factory.Nd4j.concat(Nd4j.java:5000)
at org.nd4j.linalg.factory.BaseNDArrayFactory.vstack(BaseNDArrayFactory.java:1214)
at org.nd4j.linalg.factory.Nd4j.vstack(Nd4j.java:4859)
Do you have a suggestion?
Show full stack trace please, yours is truncated.
Yeah, we'd need to see the rest of that; your record reader code (where you create/return INDArrays) might help too.
Some progress here, but more to do:
GC pressure with ImageRecordReader comes from 2 places:
1. Collection<File>
2. URI[] from FileInputSplit (Java URI objects seem to have 5 separate String objects each, so are contributing a lot to this problem)
First of these issues is fixed in the following PRs:
https://github.com/deeplearning4j/nd4j/pull/1603
https://github.com/deeplearning4j/DataVec/pull/179
The second issue is still WIP
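To illustrate the general idea behind cutting this kind of GC pressure, here is a sketch of a list that packs all path strings into one byte array plus one offset array, so the garbage collector sees two large arrays instead of several objects per image. This is a hypothetical illustration in plain Java (the name PackedStringList is made up); it is not the code from the PRs above, and it is limited to about 2 GB of string data because it uses int offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration; not the DataVec/ND4J implementation. */
class PackedStringList {
    private byte[] data = new byte[1024]; // all strings as UTF-8, stored back to back
    private int[] offsets = new int[17];  // offsets[i] = start of string i, offsets[size] = end of data
    private int size = 0;

    void add(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int start = offsets[size];
        int end = start + bytes.length;
        // Grow the backing arrays geometrically when they run out of room.
        if (end > data.length) {
            data = Arrays.copyOf(data, Math.max(end, data.length * 2));
        }
        if (size + 2 > offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        System.arraycopy(bytes, 0, data, start, bytes.length);
        offsets[++size] = end;
    }

    /** Decodes string i on demand; the returned String is short-lived garbage. */
    String get(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return new String(data, offsets[i], offsets[i + 1] - offsets[i], StandardCharsets.UTF_8);
    }

    int size() {
        return size;
    }
}
```

The trade-off is a small decoding cost on every get(i) in exchange for far fewer long-lived objects on the heap.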
Both issues mentioned in the previous comment have been fixed, and the PRs merged.
Testing on MNIST (from PNGs, with duplicates) - 335k total images, Windows + Titan X
Before:
o.d.o.l.PerformanceListener - iteration 101; iteration time: 1094 ms; samples/sec: 58.501; batches/sec: 0.914;
o.d.o.l.PerformanceListener - iteration 201; iteration time: 3904 ms; samples/sec: 16.393; batches/sec: 0.256;
After:
o.d.o.l.PerformanceListener - iteration 101; iteration time: 49 ms; samples/sec: 1306.122; batches/sec: 20.408;
o.d.o.l.PerformanceListener - iteration 201; iteration time: 47 ms; samples/sec: 1361.702; batches/sec: 21.277;
@n-gregori You might be able to drop in the new ImageRecordReader (and dependencies - FileSplit, InputSplit interface, CompactHeapStringList) to your project (presumably based on 0.7.2). Otherwise you'll have to build from source to access this. https://deeplearning4j.org/buildinglocally https://gitter.im/deeplearning4j/deeplearning4j/earlyadopters
Of course, GC is still an issue any time we're dealing with large numbers of objects. This is a known issue and will be looked at separately. But for now, I'll close this.