Testing in Docker on a single 32-core Amazon EC2 c3.8xlarge instance, started as:
docker run -d -P -e PASSWORD=something jupyter/all-spark-notebook
Start a new python2 notebook; Jupyter displays the new notebook.
The first cell runs fine:
import pyspark
import matplotlib
sc = pyspark.SparkContext()
The second cell raises an error:
myfile = sc.textFile("s3://bucketname/filename.csv")
myfile.count()
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe: java.io.IOException: No FileSystem for scheme: s3
Of course, the error points at the count, as sc.textFile() will return an RDD and does not attempt to access the file, deferring that to an action like .count().
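The deferred failure can be illustrated without Spark at all. This is a plain-Python analogy, not PySpark code: `LazyTextFile` is a made-up class that only stores the path when constructed and touches the file when `count()` is called, and the path used is a hypothetical one that does not exist.

```python
class LazyTextFile(object):
    """Toy stand-in for an RDD: remembers a path, opens it only on an action."""

    def __init__(self, path):
        # Nothing is read here, so a bad path goes unnoticed for now.
        self.path = path

    def count(self):
        # The "action": this is the first point where the path is touched.
        with open(self.path) as handle:
            return sum(1 for _ in handle)


# Construction succeeds, much like sc.textFile() does.
lazy = LazyTextFile("/no/such/dir/filename.csv")

try:
    lazy.count()  # the failure surfaces here, much like myfile.count()
    failed_at_action = False
except IOError:
    failed_at_action = True
```

So the traceback pointing at the count says nothing about the count itself; the problem is whatever the deferred read needs, here the missing s3 filesystem class.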
AFAIK Hadoop provides Spark's S3 reader, and in 2015 there was an SO question about a similar error on Hadoop.
The solution given there was to include an omitted jar file in the classpath defined in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Looking around in the container with docker exec, I don't see any HADOOP env vars set, and I ran a find from root but could not find a hadoop-env.sh.
I assume a workaround may be to run a full Spark cluster (say, with either the Amazon GUI or the scripts provided in Spark) and connect the Jupyter Docker container to it using the provided instructions, but I have not had time to try that. It would be nice if the s3 reader worked in the standalone case.
https://github.com/jupyter/docker-stacks/blob/master/all-spark-notebook/Dockerfile#L18
The Spark 1.6.0 package used in the stack is prebuilt against Hadoop 2.6. It looks like someone filed a defect against Spark 1.4.1 about the missing jars/config:
https://issues.apache.org/jira/browse/SPARK-7442
It was closed stating that the problem is in hadoop upstream:
https://issues.apache.org/jira/browse/HADOOP-11863
which refers to a mailing list post.
So either the s3 classes are in the image already but not configured for use (e.g., not on the classpath), or they're not in there at all and need to be added and configured for use.
That sounds consistent with some articles I found addressing use of s3 with spark.
Dec 2015
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/
A Stack Overflow answer from Jun 2015 suggests that the issue was with the author's Spark build against Hadoop 2.6, and that using a Spark built against Hadoop 2.4 fixed the problem for him.
I may not have time to look into this any time soon; but maybe this info will be useful to others.
I got this to work:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
import pyspark
sc = pyspark.SparkContext("local[*]")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input()
mySecretKey = input()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
df = sqlContext.read.parquet("s3://myBucket/myKey")
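A minimal sketch of the same setup with the configuration pulled out into plain data, which makes it easier to change package versions later. The helper names `build_submit_args` and `s3n_settings` are made up for illustration; the package coordinates and the `fs.s3.*` keys are the ones from the snippet above. Reading the keys from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables instead of `input()` is an assumption, not something tested in this thread.

```python
import os


def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value."""
    return "--packages {0} pyspark-shell".format(",".join(packages))


def s3n_settings(access_key, secret_key):
    """Return the Hadoop settings used above for s3:// URLs, as a dict."""
    return {
        "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3.awsAccessKeyId": access_key,
        "fs.s3.awsSecretAccessKey": secret_key,
    }


# Set before the SparkContext is created, as in the snippet above.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([
    "com.amazonaws:aws-java-sdk:1.10.34",
    "org.apache.hadoop:hadoop-aws:2.6.0",
])

# Assumption: keys come from environment variables instead of input().
settings = s3n_settings(
    os.environ.get("AWS_ACCESS_KEY_ID", ""),
    os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
)

# With a live SparkContext `sc`, the settings would be applied like this:
#     hadoopConf = sc._jsc.hadoopConfiguration()
#     for key, value in settings.items():
#         hadoopConf.set(key, value)
```

Keeping the keys out of the notebook cells also avoids saving them by accident along with the notebook.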
@ksindi Thanks for sharing! Would you mind putting it on the recipes wiki page?
@parente https://github.com/jupyter/docker-stacks/wiki/Docker-recipes#using-pyspark-with-aws-s3
If you want to create a Docker image that loads this config automatically, you can use a Dockerfile like:
FROM jupyter/all-spark-notebook:latest
COPY script.py script.py
RUN python script.py
ENV PYSPARK_SUBMIT_ARGS '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
COPY hdfs-site.xml /usr/local/spark/conf
With hdfs-site.xml
<configuration>
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
</configuration>
and script.py (to trigger the relevant files being pre-installed in the image; a couple of lines in this are probably superfluous):
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
import pyspark
sc = pyspark.SparkContext("local[*]")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
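If more properties end up in hdfs-site.xml, the file could be generated rather than edited by hand. This is only a sketch: `hadoop_site_xml` is a hypothetical helper, and the single property passed in is the one from the hdfs-site.xml shown above.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> document."""
    root = ET.Element("configuration")
    # Sort the names so the output is stable between runs.
    for name in sorted(properties):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = properties[name]
    # tostring() returns bytes on Python 3 and str on Python 2.
    return ET.tostring(root).decode("utf-8")


xml_text = hadoop_site_xml({
    "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
})

# To write it next to the Dockerfile before building:
#     with open("hdfs-site.xml", "w") as handle:
#         handle.write(xml_text)
```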
@RobinL I've been hitting my head against this issue for a few hours and decided to put together your solution since it seems to be the most robust approach. I built the image and have got it running, but when I try to access anything on S3 I get a "Socket not created by this factory" error.
I was wondering whether your solution is still working for you or if you have found another way around this problem?
@DataWookie I had the same issue and could fix it by:
1) Using s3a:// URLs instead of s3:// or s3n://
2) Upgrading the AWS SDK version to aws-java-sdk:1.11.95 (https://github.com/aws/aws-sdk-java/issues/1032)
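A rough sketch of what those two changes look like in code. `to_s3a` is a made-up helper for rewriting existing paths; the SDK version is the one from the comment above, while keeping `hadoop-aws:2.6.0` is carried over from the earlier snippets as an assumption.

```python
import re


def to_s3a(url):
    """Rewrite s3:// and s3n:// URLs to s3a://; leave other URLs unchanged."""
    return re.sub(r"^s3n?://", "s3a://", url)


# SDK version from the comment above; hadoop-aws:2.6.0 is an assumption
# carried over from the earlier snippets.
SUBMIT_ARGS = (
    "--packages com.amazonaws:aws-java-sdk:1.11.95,"
    "org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell"
)

path = to_s3a("s3://myBucket/myKey")
```

As far as I know, s3a reads its credentials from `fs.s3a.access.key` and `fs.s3a.secret.key` rather than the `fs.s3.*` keys used earlier; that detail comes from the Hadoop documentation, not from this thread.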
Another update on this:
In addition to running spark in jupyter lab, I have been trying to set up two additional docker commands that allow us to run spark scripts in the same environment:
(1) docker run -it myimage /bin/bash
(2) docker run myimage python myfile.py
myimage is the Dockerfile specified above that builds on all-spark but adds additional config.
The script.py above is a hack that makes sure that a bunch of required spark packages (jars) are pre-installed.
This works in jupyter but not for (1) and (2) above.
After lots of digging, I figured out why: the jars are installed at paths like ./home/jovyan/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar.
Jupyter ‘knows’ about the jovyan folder, but the bash prompt and python script don’t. So (1) and (2) can’t see the files in /home/jovyan and therefore go and download them again.
The solution is to include this:
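from pyspark import SparkContext
from pyspark.conf import SparkConf
# Point Ivy at the directory where the jars were pre-installed in the image
conf = SparkConf().set("spark.jars.ivy", "/home/jovyan/.ivy2/")
sc = SparkContext(conf=conf)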
in any scripts which are running outside of jupyter