Deeplearning4j: per-iteration time increases with the size of the dataset

Created on 18 Jan 2017 · 17 Comments · Source: eclipse/deeplearning4j

I trained a network on a small dataset (14604 images in the training set) using batch size=128.
Then I trained the same network with the same batch size on a larger dataset (688512 images in the training set).

During training, I noticed that a single iteration takes ~2.5 sec on the smaller dataset, but ~80 sec on the larger one.

I expected an epoch to take longer on the larger dataset, but not a single iteration. Am I wrong?
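The expectation above can be checked with simple arithmetic. The following is a minimal sketch, not DL4J code: the class and method names are hypothetical, and the only inputs are the dataset sizes, batch size, and per-iteration times reported in this issue.

```java
// Sketch: iterations per epoch and epoch duration, using the figures from this issue.
// EpochEstimate is a hypothetical helper, not part of DL4J.
final class EpochEstimate {

    /** Number of minibatches needed to see every example once (the last batch may be partial). */
    static long iterationsPerEpoch(long examples, int batchSize) {
        return (examples + batchSize - 1) / batchSize; // ceiling division
    }

    /** Epoch duration in seconds for a given per-iteration time. */
    static double epochSeconds(long examples, int batchSize, double secondsPerIteration) {
        return iterationsPerEpoch(examples, batchSize) * secondsPerIteration;
    }

    public static void main(String[] args) {
        int batch = 128;
        System.out.println(iterationsPerEpoch(14_604, batch));    // small dataset
        System.out.println(iterationsPerEpoch(688_512, batch));   // large dataset
        System.out.println(epochSeconds(688_512, batch, 2.5));    // if iteration time stayed constant
        System.out.println(epochSeconds(688_512, batch, 80.0));   // with the observed iteration time
    }
}
```

With a constant 2.5 sec per iteration, the larger dataset would need 5379 iterations and roughly 3.7 hours per epoch. At the observed 80 sec per iteration, the same epoch takes about 5 days.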

I loaded the dataset in this way:
https://gist.github.com/amaraga/a80922e2abb64e0e1297c46255bdab2e

Following your suggestion, I modified the loading in this way, but I got the same result: the duration of one iteration depends on the size of the dataset:
https://gist.github.com/amaraga/2321ae9c44300ac9057589f71d282990

I need to fix this problem, because right now I can't train my CNNs on large datasets.

Performance

n-gregori

All 17 comments

Can you post the network configuration you are using, as well as the backend (nd4j-native vs. cuda), OS and hardware details (CPU, RAM, GPU etc.)?
Can you also post the XMX setting you are using, if you are setting that?

AlexDBlack on 18 Jan 2017

First of all, this problem is independent of the network configuration, but as an example I also tried it with:
https://gist.github.com/n-gregori/d0450e7dbc1f163ae62ae1a81a370c04

backend info:

CUDA version: 8.0
DL4J version: 0.7.1

hardware details:

Red Hat Enterprise Linux Server release 7.2
on aws amazon instance p2.xlarge
1 GPU Tesla k80 with 12g of memory
4 CPU 61g of memory (Intel Xeon E5-2686 v4)

XMX setting:

-J-Xmx60g

n-gregori on 18 Jan 2017

@n-gregori Thanks for the info.

@raver119 I'm thinking GC load from the 688k images?
https://github.com/deeplearning4j/DataVec/blob/master/datavec-data/datavec-data-image/src/main/java/org/datavec/image/recordreader/BaseImageRecordReader.java#L101
Specifically BaseImageRecordReader has a Collection<File>, with one entry in memory for each image in the training set.

AlexDBlack on 18 Jan 2017

Right, that looks like the same GC-related issue, which is planned to be solved in the next major release.

raver119 on 18 Jan 2017

@n-gregori one approach in the meantime: you can save your DataSet objects first, using DataSet.save(File)
Then re-load them for training using this: https://github.com/deeplearning4j/nd4j/blob/master/nd4j-backends/nd4j-api-parent/nd4j-api/src/main/java/org/nd4j/linalg/dataset/ExistingMiniBatchDataSetIterator.java

If we're right about that, as long as you don't have the ImageRecordReader in memory, you should have significantly reduced GC load and hence better performance.

AlexDBlack on 18 Jan 2017

@AlexDBlack @raver119

I've now tried this approach, but it is unworkable because of the size of the saved files. For example:

A dataset of 2 million images at 150x150 (so not particularly large) takes approximately 2 TB.

This makes DL4J completely unusable for real-world applications!

Is there any other workaround?
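A rough size estimate shows why saving raw, uncompressed tensors gets so large. This is only a sketch under stated assumptions: 3 colour channels, 4-byte floats or 8-byte doubles, no compression, and no allowance for labels or file headers. None of these are confirmed in the thread, and the sketch does not try to reproduce the 2 TB figure exactly.

```java
// Sketch: raw storage needed for image tensors saved without compression.
// RawTensorSize is a hypothetical helper; channel count and value width are assumptions.
final class RawTensorSize {

    /** Bytes needed to store the pixel values only (labels and file headers are ignored). */
    static long bytes(long images, int height, int width, int channels, int bytesPerValue) {
        return images * height * width * channels * bytesPerValue;
    }

    public static void main(String[] args) {
        long asFloat = bytes(2_000_000L, 150, 150, 3, 4);   // assuming 32-bit floats
        long asDouble = bytes(2_000_000L, 150, 150, 3, 8);  // assuming 64-bit doubles
        System.out.println(asFloat / 1e9 + " GB as float");
        System.out.println(asDouble / 1e9 + " GB as double");
    }
}
```

Under these assumptions the pixel data alone comes to 540 GB as floats or 1080 GB as doubles, far more than the same images stored as compressed files.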

n-gregori on 18 Jan 2017

Obviously you could write your own iterator that'll produce DataSets on the fly, without storing references.

Or just wait for the next release; this issue is one of two ultra-high-priority issues for the upcoming release.
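The "on the fly" idea can be sketched with the JDK alone. The class below is hypothetical and is not a DL4J DataSetIterator: it only shows how to walk a directory lazily and hand out batches of paths without ever holding the full file list in memory. It assumes a single flat directory, does no shuffling, and leaves out image decoding and label extraction; in a real iterator each batch of paths would be decoded and wrapped in a DataSet.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch: yields batches of file paths lazily, keeping at most one batch referenced at a time.
// LazyBatchIterator is a hypothetical name, not a DL4J or DataVec class.
final class LazyBatchIterator implements Iterator<List<Path>>, AutoCloseable {
    private final DirectoryStream<Path> stream; // reads directory entries incrementally
    private final Iterator<Path> files;
    private final int batchSize;

    LazyBatchIterator(Path dir, int batchSize) throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.stream = Files.newDirectoryStream(dir);
        this.files = stream.iterator();
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return files.hasNext();
    }

    @Override
    public List<Path> next() {
        if (!files.hasNext()) {
            throw new NoSuchElementException();
        }
        // Only this batch is referenced; earlier batches become garbage once the caller drops them.
        List<Path> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && files.hasNext()) {
            batch.add(files.next());
        }
        return batch;
    }

    @Override
    public void close() throws IOException {
        stream.close();
    }
}
```

The design point is that the number of live objects is bounded by the batch size rather than by the dataset size, which is what keeps GC work per iteration constant.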

raver119 on 18 Jan 2017

Thanks @raver119 !
In any case, using DataSet.save(File) for a huge number of images obviously takes a very long time: approximately 3 to 4 weeks for a dataset of 2 million images at 150x150 with my hardware configuration.

n-gregori on 18 Jan 2017

"Obviously you could write your own iterator that'll produce DataSets on the fly, without storing references."

That's one more possible workaround I've mentioned above.

raver119 on 18 Jan 2017

Yeah, I want to look at this as a matter of priority. We'll keep you posted.

AlexDBlack on 19 Jan 2017

@raver119 @AlexDBlack I'm trying to write my own iterator. So far I have managed to produce DataSets on the fly, so that the per-iteration time no longer depends on the number of images in the dataset.
However, this CUDA exception occurs at random points during training:

23:06:11.611 [main] INFO  o.d.o.l.ScoreIterationListener - Score at iteration 16443 is 3.698580161540436
23:06:13.876 [main] INFO  o.d.o.l.ScoreIterationListener - Score at iteration 16444 is 3.7974990748615935
23:06:16.152 [main] INFO  o.d.o.l.ScoreIterationListener - Score at iteration 16445 is 3.4000109563753087
23:06:18.423 [main] INFO  o.d.o.l.ScoreIterationListener - Score at iteration 16446 is 3.2241867896000396
CUDA error at /projects/skymind/deploy/linux-x86_64/libnd4j/blas/cuda/NativeOps.cu:5202 code=77(<unknown>) "result"
CUDA error at /projects/skymind/deploy/linux-x86_64/libnd4j/blas/cuda/NativeOps.cu:5160 code=77(<unknown>) "result"
Failed on [140103466419744] -> [8667232256], size: [256], direction: [0], result: [77]
java.lang.IllegalStateException: MemcpyAsync H2H failed: [140103466419744] -> [8667232256]
        at org.nd4j.jita.handler.impl.CudaZeroHandler.memcpyAsync(CudaZeroHandler.java:530)
        at org.nd4j.jita.handler.impl.CudaZeroHandler.memcpyBlocking(CudaZeroHandler.java:631)
        at org.nd4j.jita.allocator.impl.AtomicAllocator.memcpyBlocking(AtomicAllocator.java:820)
        at org.nd4j.linalg.jcublas.JCublasNDArrayFactory.concat(JCublasNDArrayFactory.java:617)
        at org.nd4j.linalg.factory.Nd4j.concat(Nd4j.java:5000)
        at org.nd4j.linalg.factory.BaseNDArrayFactory.vstack(BaseNDArrayFactory.java:1214)
        at org.nd4j.linalg.factory.Nd4j.vstack(Nd4j.java:4859)

Do you have a suggestion?

n-gregori on 20 Jan 2017

Show full stack trace please, yours is truncated.

raver119 on 20 Jan 2017

Yeah, we'd need to see the rest of that; your record reader code (where you create/return INDArrays) might help too.

AlexDBlack on 20 Jan 2017

Some progress here, but more to do:
GC pressure with ImageRecordReader comes from 2 places:

Collection<File>
URI[] from FileInputSplit (Java URI objects seem to have 5 separate String objects each, so are contributing a lot to this problem)

First of these issues is fixed in the following PRs:
https://github.com/deeplearning4j/nd4j/pull/1603
https://github.com/deeplearning4j/DataVec/pull/179

The second issue is still WIP
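The general way to relieve this kind of GC pressure is to replace many small objects with a few large arrays. The class below is a hypothetical illustration of that idea, not the actual ND4J or DataVec implementation: all paths are packed into one byte array plus one offset array, and a String is only created when an entry is read.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: stores many path strings in two arrays instead of one object graph per path.
// CompactPathList is a hypothetical name; total size is limited to what an int offset can address.
final class CompactPathList {
    private byte[] data = new byte[1024];   // UTF-8 bytes of all entries, back to back
    private int[] offsets = new int[16];    // offsets[i] = start of entry i, offsets[size] = end of data
    private int size = 0;

    void add(String path) {
        byte[] bytes = path.getBytes(StandardCharsets.UTF_8);
        int start = offsets[size];
        if (start + bytes.length > data.length) {
            data = Arrays.copyOf(data, Math.max(data.length * 2, start + bytes.length));
        }
        if (size + 2 > offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        System.arraycopy(bytes, 0, data, start, bytes.length);
        offsets[++size] = start + bytes.length;
    }

    /** Decodes entry {@code index} into a fresh String; nothing is cached. */
    String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int start = offsets[index];
        return new String(data, start, offsets[index + 1] - start, StandardCharsets.UTF_8);
    }

    int size() {
        return size;
    }
}
```

However many entries are added, the collector only has to trace two arrays, whereas a Collection<File> or URI[] keeps several objects alive per image.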

AlexDBlack on 23 Jan 2017

Both issues mentioned in the previous comment have been fixed, and the PRs merged.

Testing on MNIST (from PNGs, with duplicates) - 335k total images, Windows + Titan X
Before:
o.d.o.l.PerformanceListener - iteration 101; iteration time: 1094 ms; samples/sec: 58.501; batches/sec: 0.914;
o.d.o.l.PerformanceListener - iteration 201; iteration time: 3904 ms; samples/sec: 16.393; batches/sec: 0.256;

After:
o.d.o.l.PerformanceListener - iteration 101; iteration time: 49 ms; samples/sec: 1306.122; batches/sec: 20.408;
o.d.o.l.PerformanceListener - iteration 201; iteration time: 47 ms; samples/sec: 1361.702; batches/sec: 21.277;
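The logged figures are related by simple arithmetic, sketched below. The batch size of 64 is an inference from the numbers in the log (58.501 samples/sec × 1.094 sec ≈ 64) and is not stated anywhere in the thread; the class name is hypothetical.

```java
// Sketch: reproduces the samples/sec and batches/sec figures from the iteration times above.
// Throughput is a hypothetical helper; batch size 64 is inferred from the log, not confirmed.
final class Throughput {

    static double samplesPerSec(int batchSize, long iterationMillis) {
        return batchSize * 1000.0 / iterationMillis;
    }

    static double batchesPerSec(long iterationMillis) {
        return 1000.0 / iterationMillis;
    }

    public static void main(String[] args) {
        int batch = 64; // assumption, see above
        System.out.println(samplesPerSec(batch, 3904)); // "before", iteration 201
        System.out.println(samplesPerSec(batch, 47));   // "after", iteration 201
        System.out.println(3904.0 / 47);                // per-iteration speedup at iteration 201
    }
}
```

At iteration 201 the per-iteration time drops from 3904 ms to 47 ms, a speedup of roughly 83x. The "after" time also stays flat between iterations 101 and 201, where the "before" time was still growing.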

@n-gregori You might be able to drop in the new ImageRecordReader (and dependencies - FileSplit, InputSplit interface, CompactHeapStringList) to your project (presumably based on 0.7.2). Otherwise you'll have to build from source to access this. https://deeplearning4j.org/buildinglocally https://gitter.im/deeplearning4j/deeplearning4j/earlyadopters

Of course, GC is still an issue any time we're dealing with large numbers of objects. This is a known issue and will be looked at separately. But for now, I'll close this.

AlexDBlack on 24 Jan 2017
