Loading a small dataset works fine:

iris_tbl <- copy_to(sc, iris)

but loading a slightly larger one gives an OutOfMemoryError, even though there should be more than enough memory.
Configuration (1 master + 3 workers):

config <- spark_config()
config$spark.executor.memory <- "16G"
config$spark.driver.memory <- "16G"

sparkmaster <- Sys.getenv("sparkmaster")
sc <- spark_connect(master = sparkmaster, config = config)
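To confirm which of these settings actually reach the cluster, one option (a minimal sketch, assuming a sparklyr version that exports spark_context_config) is to inspect the runtime configuration after connecting:

# Inspect the configuration the SparkContext ended up with; the
# returned list can be indexed by setting name.
conf <- spark_context_config(sc)
conf[["spark.executor.memory"]]
conf[["spark.driver.memory"]]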
> babynames_tbl <- copy_to(sc, babynames, "babynames")
|================================================================================| 100% 65 MB
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Double.valueOf(Double.java:521)
at scala.runtime.BoxesRunTime.boxToDouble(BoxesRunTime.java:79)
at sparklyr.Utils$$anonfun$5$$anonfun$6.apply(utils.scala:224)
at sparklyr.Utils$$anonfun$5$$anonfun$6.apply(utils.scala:218)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:234)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:234)
at sparklyr.Utils$$anonfun$5.apply(utils.scala:218)
at sparklyr.Utils$$anonfun$5.apply(utils.scala:216)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at sparklyr.Utils$.createDataFrameFromText(utils.scala:216)
at sparklyr.Utils.createDataFrameFromText(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sparklyr.Invoke$.invoke(invoke.scala:94)
at sparklyr.StreamHandler$.handleMethodCall(stream.scala:89)
at sparklyr.StreamHandler$.read(stream.scala:55)
at sparklyr.BackendHandler.channelRead0(handler.scala:49)
at sparklyr.BackendHandler.channelRead0(handler.scala:14)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>
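As a side note, copy_to serializes the entire local data frame through the driver, so it is only meant for modest amounts of data. A common workaround for larger tables (a sketch, assuming a staging path that every node can read, such as a shared filesystem) is to write the data to disk and let the executors read it directly:

library(babynames)

# Hypothetical staging path; on a standalone cluster it must be visible
# to all workers (shared filesystem, HDFS, S3, ...).
path <- "/tmp/babynames.csv"
write.csv(babynames, path, row.names = FALSE)

# spark_read_csv() parses the file on the executors instead of pushing
# every row through the driver JVM the way copy_to() does.
babynames_tbl <- spark_read_csv(sc, name = "babynames", path = path)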
@eijoac, you can try executing the same operation after setting the following additional parameters before creating the Spark context object.
config <- spark_config()

# sparklyr.shell.* settings are passed to spark-submit as command-line
# flags (here --driver-memory and --executor-memory); names containing
# hyphens need [[ ]] rather than $:
config[["sparklyr.shell.driver-memory"]] <- "4G"
config[["sparklyr.shell.executor-memory"]] <- "4G"
config[["spark.yarn.executor.memoryOverhead"]] <- "1g"

sc <- spark_connect(master = "local", config = config)
It works for me.
Thanks. Adding sparklyr.shell.driver-memory and sparklyr.shell.executor-memory worked (in my case, it is a standalone cluster). I wonder why it worked? Does copy_to use the Spark shell?
copy_to does use memory proportional to the object being copied: the whole data frame is serialized and sent through the driver. The sparklyr.shell.* options are handed to spark-submit when the connection is created, so they size the driver and executor JVMs before they start; spark.driver.memory set only in the session config arrives too late to resize a driver JVM that is already running. Glad the right settings worked in this case.
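Putting the thread together, a minimal sketch of the working setup for the standalone cluster described above (heap sizes are the ones reported to work here, not general recommendations):

library(sparklyr)

config <- spark_config()
# These map to spark-submit's --driver-memory and --executor-memory,
# so they take effect before the driver JVM starts:
config[["sparklyr.shell.driver-memory"]] <- "4G"
config[["sparklyr.shell.executor-memory"]] <- "4G"

sc <- spark_connect(master = Sys.getenv("sparkmaster"), config = config)
babynames_tbl <- copy_to(sc, babynames::babynames, "babynames")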