_From @schrum2 on September 14, 2017 14:24_
This issue is a follow-up on what was already discussed here: https://github.com/deeplearning4j/rl4j/issues/61
(all version details were worked out there; thank you, @saudet)
Although the code can successfully save movies and load saved models (thanks to fixes in the previous issue), and it does successfully train for a while, it eventually crashes with the following exception.
09:15:48.896 [main] ERROR org.deeplearning4j.rl4j.learning.sync.SyncLearning - Training failed.
java.lang.NullPointerException: null
at org.deeplearning4j.rl4j.learning.sync.Transition.dup(Transition.java:50)
at org.deeplearning4j.rl4j.learning.sync.Transition.dup(Transition.java:38)
at org.deeplearning4j.rl4j.learning.sync.ExpReplay.getBatch(ExpReplay.java:45)
at org.deeplearning4j.rl4j.learning.sync.ExpReplay.getBatch(ExpReplay.java:52)
at org.deeplearning4j.rl4j.learning.sync.qlearning.discrete.QLearningDiscrete.trainStep(QLearningDiscrete.java:159)
at org.deeplearning4j.rl4j.learning.sync.qlearning.QLearning.trainEpoch(QLearning.java:91)
at org.deeplearning4j.rl4j.learning.sync.SyncLearning.train(SyncLearning.java:38)
at org.deeplearning4j.examples.rl4j.Doom.doomLearn(Doom.java:100)
at org.deeplearning4j.examples.rl4j.Doom.main(Doom.java:78)
High, level 3.1
[libx264 @ 00000000435bb9e0] 264 - core 148 - H.264/MPEG-4 AVC codec - Copyleft 2003-2016 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=12 lookahead_threads=2 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=30.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'C:\Users\he_de\rl4j-data\1/video/video-1528-130755.mp4':
Metadata:
encoder : Lavf57.56.100
Stream #0:0: Video: h264 (Constrained Baseline) ([33][0][0][0] / 0x0021), yuv420p, 800x600, q=2-31, 400 kb/s, 15360 tbn
[libx264 @ 00000000435bb9e0] frame I:1 Avg QP:26.61 size: 29808
[libx264 @ 00000000435bb9e0] frame P:36 Avg QP:32.52 size: 4993
[libx264 @ 00000000435bb9e0] frame B:78 Avg QP:34.95 size: 1270
[libx264 @ 00000000435bb9e0] consecutive B-frames: 5.2% 10.4% 7.8% 76.5%
[libx264 @ 00000000435bb9e0] mb I I16..4: 1.5% 76.5% 21.9%
[libx264 @ 00000000435bb9e0] mb P I16..4: 1.1% 7.0% 1.4% P16..4: 45.7% 5.0% 2.5% 0.0% 0.0% skip:37.3%
[libx264 @ 00000000435bb9e0] mb B I16..4: 0.2% 0.2% 0.0% B16..8: 35.8% 1.3% 0.1% direct: 0.4% skip:62.0% L0:50.7% L1:49.1% BI: 0.2%
[libx264 @ 00000000435bb9e0] 8x8 transform intra:73.2% inter:87.6%
[libx264 @ 00000000435bb9e0] coded y,uvDC,uvAC intra: 65.3% 56.1% 30.3% inter: 6.3% 3.6% 0.6%
[libx264 @ 00000000435bb9e0] i16 v,h,dc,p: 46% 30% 13% 12%
[libx264 @ 00000000435bb9e0] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 18% 15% 24% 6% 8% 7% 9% 6% 7%
[libx264 @ 00000000435bb9e0] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 32% 25% 18% 5% 5% 4% 5% 3% 4%
[libx264 @ 00000000435bb9e0] i8c dc,h,v,p: 61% 15% 21% 3%
[libx264 @ 00000000435bb9e0] Weighted P-Frames: Y:16.7% UV:13.9%
[libx264 @ 00000000435bb9e0] ref P L0: 59.6% 16.1% 14.3% 6.1% 3.9%
[libx264 @ 00000000435bb9e0] ref B L0: 90.8% 7.6% 1.6%
[libx264 @ 00000000435bb9e0] ref B L1: 96.2% 3.8%
[libx264 @ 00000000435bb9e0] kb/s:643.97
_Copied from original issue: deeplearning4j/rl4j#64_
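One plausible reading of the trace above (an assumption, not something confirmed in this thread) is that a transition stored in the replay buffer holds a null observation, which only blows up later when `ExpReplay.getBatch` calls `Transition.dup`. A minimal sketch of how that could be caught at store time instead is below. `Step` and `GuardedReplay` are hypothetical stand-ins written for illustration, with `double[][]` in place of RL4J's observation arrays; they are not RL4J classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for a stored transition; not RL4J's Transition class. */
final class Step {
    final double[][] frames; // stand-in for the stacked observation history
    final int action;
    final double reward;

    Step(double[][] frames, int action, double reward) {
        this.frames = frames;
        this.action = action;
        this.reward = reward;
    }
}

/** Replay buffer that rejects incomplete steps when they are stored, not when they are sampled. */
final class GuardedReplay {
    private final List<Step> storage = new ArrayList<>();
    private final int capacity;
    private final Random rng;
    private int rejected = 0;

    GuardedReplay(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.rng = new Random(seed);
    }

    /** A step is complete when it, its frame array, and every frame are non-null. */
    static boolean isComplete(Step step) {
        if (step == null || step.frames == null) {
            return false;
        }
        for (double[] frame : step.frames) {
            if (frame == null) {
                return false;
            }
        }
        return true;
    }

    /** Stores the step and returns true, or counts it as rejected and returns false. */
    boolean store(Step step) {
        if (!isComplete(step)) {
            rejected++;
            return false;
        }
        if (storage.size() == capacity) {
            storage.remove(0); // evict the oldest step
        }
        storage.add(step);
        return true;
    }

    /** Samples n steps uniformly with replacement; returns an empty list if nothing is stored. */
    List<Step> sample(int n) {
        List<Step> batch = new ArrayList<>();
        for (int i = 0; i < n && !storage.isEmpty(); i++) {
            batch.add(storage.get(rng.nextInt(storage.size())));
        }
        return batch;
    }

    int size() {
        return storage.size();
    }

    int rejectedCount() {
        return rejected;
    }
}
```

If the rejected count ever becomes non-zero during a run, that would point at the environment handing back an incomplete observation rather than at the sampling code.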
_From @saudet on September 14, 2017 22:17_
Define "for a while", how much time does it take?
_From @schrum2 on September 15, 2017 12:9_
I ran the program overnight and it crashed at some point. I can confirm that it took well over 5 hours, running the CPU backend. Also, if it helps, the name of the last movie that was created was video-9403-828046.mp4, and the name of the last model that was saved was 800415.model.
I'm pretty sure that the crash time is variable though ... I don't think it has always taken this long to crash in the past, but it has eventually crashed every time I've run it.
_From @schrum2 on September 15, 2017 12:11_
It is also worth noting that the behavior achieved by the crash point is not yet impressive. The agent looks around a bit before getting shot and dying, and that is it.
_From @saudet on September 15, 2017 12:34_
@rubenfiszel Have you ever experienced that?
_From @rubenfiszel on September 16, 2017 18:4_
Yes, I remember similar issues, and I might have reported an issue about it on the VizDoom project or its Java port directly.
_From @saudet on September 16, 2017 21:50_
Ah yes, thank you, here is the relevant thread: https://github.com/mwydmuch/ViZDoom/issues/106
Unfortunately, we're not getting any special exceptions...
_From @saudet on November 15, 2017 2:30_
@schrum2 Does the same thing happen with ALE? Or just with VizDoom? If you could check this out, it would help narrow down this issue.
Hi. I haven't had time to look at this in a while, but I am still experiencing the error and am interested in looking into the issue again. I've never been able to successfully compile/run ALE, though I might try that soon. However, I wondered if in the meantime I can/should simply try Doom again with the latest version of your code. It looks like DL4J has changed a lot since I last looked at it, so I'm not sure how complicated this transition would be. Ideally, I could just change the version number in my POM and go from there. What would the latest/most appropriate version number be?
@schrum2 ALE is now all fixed up and binaries are available for Linux, Mac, and Windows on Maven, so you shouldn't have any problems with that anymore. VizDoom is another matter though... In any case, yes, please try version 1.0.0-beta. There shouldn't be much of anything to change in your code, but if there is let us know and we'll help out. Thanks!
There were a few small changes just to get it to compile. For example, the learningRate is not specified in the builder anymore, but instead in the Adam constructor. I was able to figure out these issues by looking at your code. However, after resolving all of the compile errors, the Maven install succeeded, but running the program produced this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/nd4j/tools/PropertyParser
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:6327)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:6300)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:210)
at org.deeplearning4j.rl4j.space.ArrayObservationSpace.<init>(ArrayObservationSpace.java:25)
at org.deeplearning4j.rl4j.mdp.vizdoom.VizDoom.<init>(VizDoom.java:67)
at org.deeplearning4j.rl4j.mdp.vizdoom.DeadlyCorridor.<init>(DeadlyCorridor.java:14)
at org.deeplearning4j.examples.rl4j.Doom.getMDP(Doom.java:84)
at org.deeplearning4j.examples.rl4j.Doom.doomLearn(Doom.java:95)
at org.deeplearning4j.examples.rl4j.Doom.main(Doom.java:79)
Caused by: java.lang.ClassNotFoundException: org.nd4j.tools.PropertyParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
Granted, I'm still using my code rather than starting from scratch with yours, so there may be some problem there. However, if you can think of any reason why PropertyParser is not being found, I would appreciate it. Here is the relevant portion of my pom.xml:
<nd4j.version>1.0.0-beta</nd4j.version>
<dl4j.version>1.0.0-beta</dl4j.version>
<datavec.version>1.0.0-beta</datavec.version>
<arbiter.version>1.0.0-beta</arbiter.version>
<rl4j.version>1.0.0-beta</rl4j.version>
I would assume that bringing in the latest version of ND4J would resolve this issue.
Note: I changed the pom again to this:
https://github.com/schrum2/MM-NEAT/blob/dev/pom.xml
This seemed to change ND4J somehow, in that several places that used int for INDArray sizes now use long.
I still got the same PropertyParser error, though. This does look like a missing Maven dependency/library.
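One way to test the missing-dependency theory is to ask the JVM whether it can find the class at all, and if so, which jar it comes from. The following is a minimal sketch using only the standard library; `ClasspathProbe` and `locate` are hypothetical names chosen for illustration, not part of ND4J.

```java
import java.security.CodeSource;

/** Hypothetical helper that reports where a class would be loaded from. */
final class ClasspathProbe {

    /**
     * Returns the jar or directory that provides the class, a placeholder string
     * if the class is found but has no code source (e.g. JDK classes), or null
     * if the class cannot be found or linked.
     */
    static String locate(String className) {
        try {
            // initialize=false: look the class up without running its static initializers
            Class<?> found = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = found.getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap or unknown source)" : source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }
}
```

Calling `ClasspathProbe.locate("org.nd4j.tools.PropertyParser")` from the same project that fails would return null if no jar on the runtime classpath provides the class. Running `mvn dependency:tree` on the project should then show which ND4J artifacts and versions are actually being resolved.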
Stick with 1.0.0-beta for now. I still see some 0.9.1 here that should be upgraded to 1.0.0-beta:
https://github.com/schrum2/MM-NEAT/blob/d5e6c4a2965d0caee3cb47c560fcf84a0b84824a/pom.xml#L138
I changed the pom again ( https://github.com/schrum2/MM-NEAT/blob/dev/pom.xml ) and everything runs now! Of course, the original issue I had was that the code would run for a long time (overnight) and then eventually crash. I'm running VizDoom right now, and will let you know tomorrow in this issue what happened.
However, making all of the DL4J upgrades did cause some additional unrelated problems with ImageNet, which I've created a new issue for here: #5402
Unfortunately, after the upgrade, the crash still happens. It seems to be basically the same error.
java.lang.NullPointerException: null
at org.deeplearning4j.rl4j.learning.sync.Transition.dup(Transition.java:50) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.rl4j.learning.sync.Transition.dup(Transition.java:38) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.rl4j.learning.sync.ExpReplay.getBatch(ExpReplay.java:45) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.rl4j.learning.sync.ExpReplay.getBatch(ExpReplay.java:52) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.rl4j.learning.sync.qlearning.discrete.QLearningDiscrete.trainStep(QLearningDiscrete.java:162) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.rl4j.learning.sync.qlearning.QLearning.trainEpoch(QLearning.java:93) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.rl4j.learning.sync.SyncLearning.train(SyncLearning.java:38) ~[rl4j-core-1.0.0-beta.jar:na]
at org.deeplearning4j.examples.rl4j.Doom.doomLearn(Doom.java:101) [classes/:na]
at org.deeplearning4j.examples.rl4j.Doom.main(Doom.java:79) [classes/:na]
0072118160] 8x8 transform intra:77.1% inter:50.5%
[libx264 @ 0000000072118160] coded y,uvDC,uvAC intra: 87.9% 69.5% 36.3% inter: 0.2% 0.1% 0.0%
[libx264 @ 0000000072118160] i16 v,h,dc,p: 27% 33% 15% 24%
[libx264 @ 0000000072118160] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 15% 18% 17% 7% 8% 7% 10% 6% 11%
[libx264 @ 0000000072118160] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 25% 26% 14% 7% 7% 5% 7% 4% 5%
[libx264 @ 0000000072118160] i8c dc,h,v,p: 58% 19% 20% 3%
[libx264 @ 0000000072118160] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0000000072118160] ref P L0: 88.7% 5.1% 6.2%
[libx264 @ 0000000072118160] ref B L0: 79.6% 20.4%
[libx264 @ 0000000072118160] ref B L1: 77.3% 22.7%
[libx264 @ 0000000072118160] kb/s:487.47
I'll try using the new ALE binary to see if that works any better. I'll also explore loading the saved Doom model and resuming training, which I recall wanting to do before but being unable to because it wasn't supported in an older version of RL4J.
Loading the Doom model and training further seems to work. I suppose it will crash again, but I should eventually be able to get good behavior. I'll follow up once I know how this turns out.
Still need to try ALE though.
I ran the Doom experiment for several days, re-loading the model every time it crashed (about once a day), but I still didn't really get noticeably good performance. In the movies, the model sometimes shoots one enemy, but never reaches the end, and always dies. I'm not sure how long I should expect this to need to train for, though I admit I am using the CPU backend.
I peeked at the stats being saved in raw text, and the returns seem to fluctuate wildly; the agent is not getting noticeably better in terms of the numbers either. I'll admit I haven't plotted the returns over time to get a better view of their behavior, though.
I'll try with ALE too, and also try to get an actual plot of performance over time for VizDoom to post here for reference.
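Since raw per-episode returns are noisy, a trailing moving average can make any trend easier to see before plotting. Below is a minimal sketch; `ReturnSmoother` is a hypothetical helper, and it assumes the per-episode returns have already been parsed from the stats file into a `double[]`.

```java
/** Hypothetical helper for smoothing a series of per-episode returns. */
final class ReturnSmoother {

    /**
     * Trailing moving average: element i is the mean of the last `window`
     * returns up to and including i (fewer at the start of the series).
     */
    static double[] movingAverage(double[] returns, int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        double[] smoothed = new double[returns.length];
        double sum = 0.0;
        for (int i = 0; i < returns.length; i++) {
            sum += returns[i];
            if (i >= window) {
                sum -= returns[i - window]; // drop the value that left the window
            }
            smoothed[i] = sum / Math.min(i + 1, window);
        }
        return smoothed;
    }
}
```

For example, `ReturnSmoother.movingAverage(new double[] {1, 3, 5, 7}, 2)` gives `{1.0, 2.0, 4.0, 6.0}`. A window of around 100 episodes is a common starting point, but the right size depends on how noisy the returns are.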
You'll need to tune the hyperparameters: https://deeplearning4j.org/troubleshootingneuralnets . This is hard for typical models, and very hard for reinforcement learning. CNNs are way too slow to get anything done on CPU, so you'll probably need to use CUDA with cuDNN to get anywhere anytime soon. @rubenfiszel Have you saved a set of parameters somewhere that worked for that scenario?
Anyway, this issue is about the crash occurring. I'm pretty sure it's caused by VizDoom, so let's make sure that's at least the case by trying it out with ALE. If it doesn't happen there, we'll have narrowed down at least that. Thanks!