Deeplearning4j: java.lang.OutOfMemoryError: Cannot allocate new DoublePointer(10000000): totalBytes = 4G, physicalBytes = 7G

Created on 11 Sep 2018 · 25 Comments · Source: eclipse/deeplearning4j

My code refreshes vector embeddings every night via a Linux crontab job. It does the following:

  • Builds the DL4J embedding files for 2,500 records using the GoogleNews Word2Vec model.
  • Loads the embeddings into an in-JVM HashMap cache for faster access by the API servers.

I get the following exception during the second step, when I read the DL4J embeddings file using WordVectorSerializer.readWord2VecModel().

This does not happen every time. After I start the server and it runs for 4-5 days, with the cron job invoking the process above once a day, I see the following OutOfMemoryError on the 5th or 6th day. My current JVM option is -Xmx5G, so within 4-5 days this kills my server. What is the real solution to this problem, rather than continually increasing -Xmx?

This is the code that triggers the error after 4-5 days of being invoked once a day:

File embeddings_file = new File("/Users/legalizenet/embeddings/dl4j_embeddings.txt");
Word2Vec entity_embeddings_w2v = WordVectorSerializer.readWord2VecModel(embeddings_file.getAbsolutePath());

Exception:

java.lang.OutOfMemoryError: Cannot allocate new DoublePointer(10000000): totalBytes = 4G, physicalBytes = 7G
at org.bytedeco.javacpp.DoublePointer.<init>(DoublePointer.java:76)
at org.nd4j.linalg.api.buffer.BaseDataBuffer.<init>(BaseDataBuffer.java:584)
at org.nd4j.linalg.api.buffer.BaseDataBuffer.<init>(BaseDataBuffer.java:570)
at org.nd4j.linalg.api.buffer.DoubleBuffer.<init>(DoubleBuffer.java:51)
at org.nd4j.linalg.api.buffer.factory.DefaultDataBufferFactory.createDouble(DefaultDataBufferFactory.java:237)
at org.nd4j.rng.NativeRandom.<init>(NativeRandom.java:77)
at org.nd4j.rng.NativeRandom.<init>(NativeRandom.java:66)
at org.nd4j.rng.NativeRandom.<init>(NativeRandom.java:62)
at org.nd4j.linalg.cpu.nativecpu.rng.CpuNativeRandom.<init>(CpuNativeRandom.java:14)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at org.nd4j.linalg.factory.RandomFactory.getRandom(RandomFactory.java:28)
at org.nd4j.linalg.factory.Nd4j.getRandom(Nd4j.java:542)
at org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable.<init>(InMemoryLookupTable.java:62)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2208)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2160)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2176)
...............
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Native allocator returned address == 0
at org.bytedeco.javacpp.DoublePointer.<init>(DoublePointer.java:70)

All 25 comments

Sounds like something that memory fragmentation can cause. Make sure to use "workspaces" to reduce this effect: https://deeplearning4j.org/workspaces

I'm not totally clear what you are doing here.
You mention building embedding files from the Google News embeddings - is that done in a separate JVM or in the same JVM as the one hitting the OOM exception?

Another question: how is threading done here? Is the application multi-threaded (and if so - how many threads and how are they created)?

One possible cause of these sorts of exceptions is array references being kept in memory (i.e., a memory 'leak' where all arrays are put in a collection and can't be garbage collected, even when you're finished with them). Could this be a possibility too?

And yes, workspaces are worth a look.
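
The collection-based 'leak' described above can be illustrated without any DL4J types. This is only a minimal sketch, assuming the nightly refresh builds a fresh map and publishes it in one step. EmbeddingCache, publish, and lookup are hypothetical names, and plain double[] stands in for whatever vector type the real cache stores:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical cache holder (not part of DL4J): readers always see one complete snapshot.
final class EmbeddingCache {
    private final AtomicReference<Map<String, double[]>> current =
            new AtomicReference<>(Collections.<String, double[]>emptyMap());

    // Replace the whole snapshot. The previous map is no longer referenced
    // from this cache, so it becomes eligible for garbage collection once
    // in-flight readers are done with it.
    void publish(Map<String, double[]> fresh) {
        current.set(Collections.unmodifiableMap(new HashMap<>(fresh)));
    }

    // Returns the vector for a word, or null if the word is unknown.
    double[] lookup(String word) {
        return current.get().get(word);
    }

    int size() {
        return current.get().size();
    }
}
```

The point of the design is that each refresh replaces the old entries instead of appending to a long-lived collection. Whether the real application accumulates references this way is not established in this thread.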

java.lang.OutOfMemoryError: Cannot allocate new DoublePointer(10000000): totalBytes = 4G, physicalBytes = 7G

I think his question is: physicalBytes is 7G, which looks like enough to satisfy the 4G requirement, so why does it fail?

That's not related to the problem, the actual error is this:

Caused by: java.lang.OutOfMemoryError: Native allocator returned address == 0

@AlexDBlack, I have a SparkJava-based service with a REST API that does word2vec-based similarity matching. Everything runs in the same JVM. There are 2 parts:

  1. The GoogleNews binary Word2Vec model is read and loaded (WordVectorSerializer.readBinaryModel()) into JVM memory at server startup, so that any REST API service can use it to compute a vector embedding for each sentence. There are about 2,500 sentences, and each gets a 300-dimensional embedding. Only one Word2Vec instance is kept for the whole JVM (static).
  2. The embedding file created in the step above is converted to a Word2Vec model so that I can get its VocabCache (see the code below) and build a HashMap cache (entity_embeddings_vocab_vector_map below) in the JVM for each embeddings file. That way, when I run a dot product against the input text, I can do it quickly using the cached HashMap and process the REST API request accordingly.

Word2Vec entity_embeddings_w2v = WordVectorSerializer.readWord2VecModel(embeddings_file.getAbsolutePath());
WeightLookupTable<?> entity_embeddings_wlt = entity_embeddings_w2v.lookupTable();
VocabCache<?> entity_embeddings_vocab = entity_embeddings_wlt.getVocabCache();
for (int i = 0; i < entity_embeddings_vocab.numWords(); ++i)
{
    String word = entity_embeddings_vocab.wordAtIndex(i);
    entity_embeddings_vocab_vector_map.put(word, entity_embeddings_wlt.vector(word));
}

To answer your second question: the process that loads the embedding file into the JVM HashMap cache runs as an async operation (on a separate thread).

About keeping arrays in memory: are you talking about the DL4J library code or my custom code? If it's my code, why would the DL4J code complain about memory?

@saudet, I have not explored workspaces so far. Can I achieve what I'm doing above using workspaces? The reason I do all of this in the same JVM is that both operations (computing embeddings from text, and loading embeddings into a VocabCache) can share one memory space: I allocate at most 5G to the JVM, so a single 8GB box can handle both, and I can use the in-JVM cache for better response times.

One thing I'm still not sure about is why I run into the OutOfMemoryError after a few runs of the WordVectorSerializer.readWord2VecModel() call. That suggests memory is not being freed or garbage collected somewhere.

Also, I have not seen this error on macOS; the OutOfMemoryError happens on Ubuntu Linux on AWS. Both have -Xmx5G set at server startup.

If you're only using nd4j-native and not nd4j-cuda, instead of workspaces, it might be easier to use PointerScope, which is a bit more generic: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
You could enclose all your code related to DL4J in a try-with-resources statement like that and see what happens.

ok, that helps.

Im allocating 5G max memory for JVM and I can use single 8GB box

So you're keeping the google news vectors in memory.
That's about 3.6GB off-heap, plus about 1GB-ish on-heap (IIRC).
So that's a big chunk of your total 8GB (or the 7GB physicalBytes it seems you're setting/getting).

Keep in mind physicalBytes is the sum of off-heap and on-heap. With Xmx5G that means up to 5GB of on-heap memory can be allocated (in theory).
With the 3.6GB off-heap for the word vectors, that plus a large on-heap could be leaving nothing for other off-heap arrays.
My guess here is that your java heap is growing enough to consume all available memory.

I'd recommend decreasing your xmx and tweaking your javacpp maxbytes. Basically xmx + maxbytes should equal the maximum amount of memory you want to allocate to your program.
https://deeplearning4j.org/docs/latest/deeplearning4j-config-memory
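
The budgeting rule above (xmx + maxbytes should equal the total you want the program to use) can be checked with simple arithmetic. A rough sketch, assuming an 8GB box and the 3.6GB off-heap estimate quoted above; MemoryBudget and fits are hypothetical names, and the second split (2GB heap, 5GB off-heap, 1GB headroom) is purely illustrative:

```java
// Hypothetical helper: checks a heap + off-heap budget against physical RAM.
final class MemoryBudget {
    static final long GB = 1024L * 1024L * 1024L;

    // True when the heap limit plus the off-heap limit fits in RAM while
    // leaving the requested headroom for the OS and other processes.
    static boolean fits(long xmxBytes, long offHeapBytes, long ramBytes, long headroomBytes) {
        return xmxBytes + offHeapBytes + headroomBytes <= ramBytes;
    }

    public static void main(String[] args) {
        long ram = 8 * GB;
        // Setup described in the thread: -Xmx5G plus roughly 3.6GB of word vectors off-heap.
        long vectors = (long) (3.6 * GB);
        System.out.println("Xmx5G + 3.6G off-heap fits: " + fits(5 * GB, vectors, ram, 0));
        // Illustrative alternative: a smaller heap with an explicit off-heap cap.
        System.out.println("Xmx2G + 5G off-heap fits:   " + fits(2 * GB, 5 * GB, ram, GB));
    }
}
```

With these numbers the first configuration can over-commit the machine if the heap grows to its limit, which matches the guess above that the Java heap is crowding out off-heap allocations.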

How many threads does your API server's pool allow?

Well, without going into the details of the cores vs. threads madness: all I see is this in the Spark async code, which I use to trigger the async request that refreshes the embeddings and causes the exception above.

private static final ExecutorService executorService = Executors.newFixedThreadPool(100);

I'm running this on an AWS t2.large (8GB/2 vCPU).

Executors.newFixedThreadPool(100);

OK, there we go. Note that there's a small RNG state buffer (10MB per thread I think, lazily initialized? cc @raver119), so that's 1GB of off-heap memory used right there.
And yes, that buffer is cleaned up when the thread is no longer in use (I tested this yesterday actually: I can spam a new thread every 5ms without net memory accumulation, and confirmed its RNG memory is cleaned up on a GC).
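
The arithmetic behind the 1GB figure can be sketched as follows, assuming the worst case where every pool thread has initialized its own buffer. PerThreadOffHeap and worstCaseBytes are hypothetical names. The 10MB value is the estimate quoted above; if the DoublePointer(10000000) in the stack trace counts 8-byte doubles, the per-thread buffer would be closer to 80MB, so both cases are shown:

```java
// Hypothetical estimator for off-heap memory held by per-thread buffers.
final class PerThreadOffHeap {
    // Worst case: every thread in the pool has initialized its own buffer.
    static long worstCaseBytes(int threads, long bytesPerThread) {
        return (long) threads * bytesPerThread;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tenMb = 10L * mb;                     // estimate quoted in the comment above
        long tenMillionDoubles = 10_000_000L * 8L; // assumption: element count of 8-byte doubles
        System.out.println("100 threads x 10MB:       " + worstCaseBytes(100, tenMb) / mb + " MB");
        System.out.println("100 threads x 10M doubles: " + worstCaseBytes(100, tenMillionDoubles) / mb + " MB");
        System.out.println("4 threads x 10MB:         " + worstCaseBytes(4, tenMb) / mb + " MB");
    }
}
```

Either way, the total scales linearly with the pool size, which is why the thread count matters on a box with only 8GB.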

I'm running this on an AWS t2.large (8GB/2 vCPU).

That's 2 vCPUs... why run with 100 threads? Unless your threads are blocked on say I/O most of the time and aren't doing anything numerically intense (i.e., few/no ND4J math ops), that's probably a bad idea from a performance point of view - i.e., you could be seeing tons of cache thrashing, etc.
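
One way to size a pool for numerically heavy work is to tie it to the core count instead of a fixed 100. This is a sketch only, assuming the tasks are CPU-bound; PoolSizing, cpuBoundThreads, and newCpuBoundPool are hypothetical names, and the cap of 4 is an arbitrary illustration:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sizing helper for a pool that runs numerically heavy tasks.
final class PoolSizing {
    // One thread per available core, clamped to the range [1, cap].
    static int cpuBoundThreads(int availableCores, int cap) {
        return Math.max(1, Math.min(availableCores, cap));
    }

    static ExecutorService newCpuBoundPool(int cap) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cpuBoundThreads(cores, cap));
    }

    public static void main(String[] args) {
        ExecutorService pool = newCpuBoundPool(4);
        pool.execute(() -> System.out.println("embedding refresh would run here"));
        pool.shutdown();
    }
}
```

If the threads really are blocked on I/O most of the time, a larger pool can still be reasonable, as noted above. The sketch only covers the CPU-bound case.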

So what's the final verdict on this? How do I fix this issue?

  1. Reduce -Xmx and set javacpp maxbytes so that together they add up to the maximum memory I want to allocate to the server.
  2. Reduce the number of threads in the thread pool, e.g. Executors.newFixedThreadPool(4)?

Anything else I missed?

Yep, try these steps, and tell us what happens after that.

I tried the suggested config changes on my Mac:

  • Reduced xmx and increased javacpp maxbytes.
  • Reduced the threadpool size.

The application runs fine at first but ultimately fails with an OOM (the same exception as mentioned earlier in the thread), even after the suggested config changes.
I also noticed that increasing the javacpp maxbytes only delayed the OOM. This led me to suspect a memory allocation issue somewhere in the C++ layer.

The OOM issue goes away if I bump the org.bytedeco.javacpp-presets:openblas dependency from the current version "0.2.20-1.4.1" to "0.3.0-1.4.2".

Do you see any issue with this workaround?

I am using the following JVM options: -Xmx2G -Dorg.bytedeco.javacpp.maxbytes=5G -Dorg.bytedeco.javacpp.maxphysicalbytes=5G

This looks like an acceptable solution for us until a higher version of org.bytedeco.javacpp-presets:openblas is officially supported.
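
When experimenting with options like the ones above, it can help to log at startup which limits the JVM actually received. A small sketch using only the standard library; MemoryLimits and parseSize are hypothetical names, and parseSize is an illustrative parser written for this example, not JavaCPP's own:

```java
import java.util.Locale;

// Hypothetical startup check that logs the memory limits the JVM was given.
final class MemoryLimits {
    // Parses sizes such as "512M" or "5G" as binary multiples; plain digits are bytes.
    static long parseSize(String text) {
        String s = text.trim().toUpperCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long unit = 1L;
        char last = s.charAt(s.length() - 1);
        if (last == 'K') unit = 1L << 10;
        else if (last == 'M') unit = 1L << 20;
        else if (last == 'G') unit = 1L << 30;
        if (unit != 1L) {
            s = s.substring(0, s.length() - 1).trim();
        }
        return Long.parseLong(s) * unit;
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory(); // reflects -Xmx
        String offHeap = System.getProperty("org.bytedeco.javacpp.maxbytes");
        System.out.println("max heap bytes: " + heap);
        System.out.println("javacpp.maxbytes: "
                + (offHeap == null ? "not set" : parseSize(offHeap) + " bytes"));
    }
}
```

Running this with the options quoted above should print a heap limit near 2GB and an off-heap limit of 5GB, which makes it easier to confirm that a cron or service wrapper is really passing the flags through.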

The OOM messages come from the Java layer, not the C++ layer.

Can you create a full dump, including memory reports, with a Java profiler such as YourKit?

The current version of openblas is 0.3.0-1.4.2, so you're using an old version of DL4J. Try again with 1.0.0-beta2.

I tried DL4J 1.0.0-beta2 earlier. However, it looks like ND4J 1.0.0-beta is not compatible with DL4J 1.0.0-beta2, as I get the following exception at runtime.

Exception in thread "main" java.lang.NoClassDefFoundError: org/nd4j/linalg/api/complex/IComplexDouble
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:5499)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5447)
at org.nd4j.linalg.factory.Nd4j.<init>(Nd4j.java:213)
at org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable.<init>(InMemoryLookupTable.java:60)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2244)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2196)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2212)
...
Caused by: java.lang.ClassNotFoundException: org.nd4j.linalg.api.complex.IComplexDouble
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 17 more

@rizwan-talentsky

I tried DL4J 1.0.0-beta2 earlier. However, it looks like ND4J 1.0.0-beta is not compatible with DL4J 1.0.0-beta2, as I get the following exception at runtime.

That error is because you are mixing versions. Everything (DL4J, ND4J, DataVec, etc.) needs to be the same version, 1.0.0-beta2 in this case. You're likely getting a big warning about the mixed versions too when you run your program.

We have discovered some interesting facts about this problem, and a solution, in terms of combinations of hardware, memory, and configuration settings around Java on-heap vs. off-heap memory. @rizwan-talentsky (who did the hard work of exploring the different options) will share the details with you at some point, so that other developers can benefit and not spend so much time dealing with this.

cc @legalizenet

@legalizenet @rizwan-talentsky
It sounds like it would make a great addition to https://deeplearning4j.org/memory
Please consider sending a pull request containing your findings! Thanks

@legalizenet @rizwan-talentsky Pinging in case you feel kind enough to send us a PR with your findings.

@eraly, there are no code changes, so there won't be a PR specifically. This is more about memory configuration (-Xmx, javacpp settings) and hardware swap space to get around the OOM issue. We tested with multiple EC2 server types and it works. @rizwan-talentsky, please share the details when you can.

@rizwan-talentsky Yes, please do. We can add them to our docs and it will be helpful to other people as well.

This problem is resolved in the upcoming DataTypes PR: we've replaced the RNG with a buffer-less one, so it's now around 256 bytes per thread (the RNG instance itself).

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
