Docker-stacks: How to add custom jars to jupyter notebook?

Created on 13 Mar 2016 · 7 Comments · Source: jupyter/docker-stacks

Hi, I would like to run a Spark Streaming application in the all-spark notebook, consuming from Kafka. This requires spark-submit with custom parameters (--jars and the kafka-consumer jar). I do not completely understand how I could do this from the Jupyter notebook. Has any of you tried this? The alternative is to add it with --packages. Is this easier?

I just submitted the same question to stackoverflow if you'd like more details: http://stackoverflow.com/questions/35946868/adding-custom-jars-to-pyspark-in-jupyter-notebook/35971594#35971594

drdwitte

Most helpful comment

@parente Seems to be working, maybe something interesting to add to the recipes...
I've been doing some elimination on the possible problems. The spark csv example you provided was actually working but that was present in the spark packages repository while the kafka consumer wasn't. This seemed to imply that I had to add the kafka consumer jar to the environment via the --jars flag.
As far as I can see I have something working right now: (note that the pyspark-shell is also very important!)

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
sc = pyspark.SparkContext()
ssc = StreamingContext(sc,1)
broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

And this seems to be working. Probably in my previous try the SPARK_HOME might not have been resolved?
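The key detail in the snippet above is that PYSPARK_SUBMIT_ARGS ends with pyspark-shell and is set before the SparkContext is created. Below is a minimal sketch of a helper that assembles the value in that order. The function name build_submit_args is hypothetical (it is not part of PySpark), and the sketch assumes paths without spaces; --jars and --packages both take comma-separated lists.

```python
import os


def build_submit_args(jars=(), packages=()):
    """Assemble a PYSPARK_SUBMIT_ARGS value (hypothetical helper, not a PySpark API).

    Options come first and 'pyspark-shell' is always the last token.
    """
    parts = []
    if jars:
        # --jars takes a comma-separated list of local jar paths
        parts += ["--jars", ",".join(jars)]
    if packages:
        # --packages takes comma-separated Maven coordinates
        parts += ["--packages", ",".join(packages)]
    parts.append("pyspark-shell")
    return " ".join(parts)


# Path taken from the comment above; it must exist inside the container.
jar = "/home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar"
if not os.path.exists(jar):
    print("warning: jar not found at", jar)

# Set this before creating the SparkContext.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jars=[jar])
```

After this cell runs, the rest of the notebook can import pyspark and create the context as shown above.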

drdwitte on 29 Mar 2016

👍15

All 7 comments

The technique described under "Using spark packages" on the docker-stacks recipes page might work. Can you give that approach a shot?

parente on 14 Mar 2016

@parente Unfortunately that doesn't seem to work (apologies for the big delay, I had an in-between project).

I was having a look in $SPARK_HOME/bin, since PYSPARK_SUBMIT_ARGS can be set there. Basically I have to run the notebook with some custom flags to ./bin/spark-submit, but it's not entirely clear to me when this command gets executed. I assume it is executed the moment you start a new notebook? In that case adding jars or a mvn ref won't work.

I tried the following in my notebook:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --jars $SPARK_HOME/spark-streaming-kafka-assembly_2.10-1.6.1.jar'

and then create a context, but later on I get the error:
Spark Streaming's Kafka libraries not found in class path

Another try was to use the --packages flag:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages org.apache.spark:spark-streaming-kafka:1.6.0'

But also no success.

Might it be the right way to modify the pyspark.cmd files in the $SPARK_HOME/bin directory?
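A possible explanation for the two failed attempts above, offered as a sketch rather than a confirmed diagnosis: in both, the option comes after pyspark-shell, and spark-submit treats what follows the primary resource as application arguments, not as its own options. In addition, Python does not expand $SPARK_HOME inside a string literal, and PySpark does not appear to pass the value through a shell, so the path probably reached spark-submit unexpanded. The --packages coordinate may also need the Scala version suffix (spark-streaming-kafka_2.10). The helper name options_after_shell below is hypothetical.

```python
import os
import shlex


def options_after_shell(submit_args):
    """Return option-like tokens that appear after 'pyspark-shell' (hypothetical check)."""
    tokens = shlex.split(submit_args)
    if "pyspark-shell" not in tokens:
        return []
    tail = tokens[tokens.index("pyspark-shell") + 1:]
    return [t for t in tail if t.startswith("--")]


# The first attempt from the comment above, unchanged.
attempt = ("--master local[*] pyspark-shell "
           "--jars $SPARK_HOME/spark-streaming-kafka-assembly_2.10-1.6.1.jar")
print(options_after_shell(attempt))  # ['--jars']

# Expand the variable in Python before building the argument string.
# If SPARK_HOME is not set, expandvars leaves the text unchanged.
jar = os.path.expandvars("$SPARK_HOME/spark-streaming-kafka-assembly_2.10-1.6.1.jar")
fixed = "--master local[*] --jars " + jar + " pyspark-shell"
print(options_after_shell(fixed))  # []
```

An empty list from the check means every option precedes pyspark-shell, which matches the ordering of the version that was later reported to work.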

drdwitte on 29 Mar 2016

❤1 👍1

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
sc = pyspark.SparkContext()
ssc = StreamingContext(sc,1)
broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

And this seems to be working. Probably in my previous try the SPARK_HOME might not have been resolved?

drdwitte on 29 Mar 2016

👍15

It would be great to get this on the recipes page if you did not hit any further problems with the approach you took above.

parente on 5 May 2016

No issues so far, but I have not worked on this any further since then. I will probably resume in June; if I encounter any new issues, you'll be the first to know!

drdwitte on 5 May 2016

👍1

https://github.com/jupyter/docker-stacks/wiki/Docker-recipes#using-local-spark-jars has the recipe for posterity. Closing this one as resolved.

parente on 6 May 2016

Hi @drdwitte, in the '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar' part, what does /home refer to? Is it /home inside the Docker container? If so, do you need to build the Docker image with the .jar file in it?
Thanks, Xingsheng