Docker-stacks: jupyter/all-spark-notebook pyspark sc.textFile() can not access files stored on Amazon s3

Created on 20 Feb 2016 · 9 Comments · Source: jupyter/docker-stacks

Testing in Docker on a single 32-core Amazon EC2 c3.8xlarge instance as:

     docker run -d -P -e PASSWORD=something jupyter/all-spark-notebook

Start new python2 notebook. Jupyter displays new notebook.

First Cell runs fine

 import pyspark
 import matplotlib
 sc = pyspark.SparkContext()

Second cell has error:

 myfile = sc.textFile("s3://bucketname/filename.csv")
 myfile.count()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe: java.io.IOException: No FileSystem for scheme: s3

Of course, the error points at the count, as sc.textFile() will return an RDD and does not attempt to access the file, deferring that to an action like .count().
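The deferred failure can be illustrated without Spark. The following is only a plain-Python analogy, not Spark code: `LazyTextFile` is a hypothetical toy class that records a path on construction and touches the file only when `count()` is called, which is roughly how an RDD defers work until an action runs.

```python
class LazyTextFile(object):
    """Toy stand-in for an RDD: remembers the path, reads nothing yet."""

    def __init__(self, path):
        self.path = path  # no I/O happens here

    def count(self):
        # The file is opened only when an action is requested.
        with open(self.path) as handle:
            return sum(1 for _ in handle)


# Construction succeeds even though the path does not exist.
lazy = LazyTextFile("/no/such/file.csv")

try:
    lazy.count()  # the failure surfaces here, like myfile.count()
    failed_at_action = False
except (IOError, OSError):
    failed_at_action = True
```

In the same way, the traceback in the second cell names the action rather than the `sc.textFile()` call that supplied the unusable URL.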

AFAIK hadoop provides spark's s3 reader, and in 2015 there was a SO question about a similar error on hadoop:

http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation

The solution given there was to add an omitted jar file to the classpath defined in $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Looking around in the container with docker exec I don't see any HADOOP env vars set, and ran a find in root but could not find a hadoop-env.sh

I assume a workaround may be to run a full spark cluster (say, with either the Amazon GUI or the scripts provided in spark) and connect the jupyter docker to it using the provided instructions, but I have not had time to try that. It would be nice if the s3 reader worked in the standalone case.
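The two checks described above (no HADOOP variables, no hadoop-env.sh) could also be repeated from a notebook cell rather than through docker exec. This is a minimal sketch; the helper names `hadoop_env` and `find_files` are made up for illustration and are not part of the image.

```python
import os


def hadoop_env(environ=None):
    """Return the environment variables whose names mention HADOOP."""
    environ = os.environ if environ is None else environ
    return {name: value for name, value in environ.items()
            if "HADOOP" in name.upper()}


def find_files(filename, root="/"):
    """Walk `root` and return every path whose basename equals `filename`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return matches
```

Calling `hadoop_env()` and `find_files("hadoop-env.sh")` in the container should reproduce the observation; walking from `/` can be slow, so a narrower `root` may be preferable.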

Enhancement Question

DrPaulBrewer

👍2

All 9 comments

https://github.com/jupyter/docker-stacks/blob/master/all-spark-notebook/Dockerfile#L18

The Spark 1.6.0 package used in the stack is prebuilt against Hadoop 2.6. It looks like someone filed a defect against Spark 1.4.1 about the missing jars/config:

https://issues.apache.org/jira/browse/SPARK-7442

It was closed stating that the problem is in hadoop upstream:

https://issues.apache.org/jira/browse/HADOOP-11863

which refers to this mailing list post:

http://mail-archives.apache.org/mod_mbox/hadoop-user/201504.mbox/%3CCA+XUwYxPxLkfhOxn1jNkoUKEQQMcPWFzvXJ=u+kP28KDEjO4GQ@mail.gmail.com%3E

So either the s3 classes are already in the image but not configured for use (e.g., missing from the classpath), or they are not in there at all and need to be both added and configured.

parente on 22 Feb 2016

That sounds consistent with some articles I found addressing use of s3 with spark.

Dec 2015
http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/

A Stack Overflow answer from Jun 2015 suggests that the issue was with the author's Spark build against Hadoop 2.6, and that switching to a Spark build against Hadoop 2.4 fixed the problem for him.

I may not have time to look into this any time soon; but maybe this info will be useful to others.

DrPaulBrewer on 9 Mar 2016

I got this to work:

import os
# Must be set before the SparkContext is created so that --packages takes effect when the JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Register a filesystem for the s3:// scheme and hand the credentials to Hadoop.
hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input()
mySecretKey = input()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

df = sqlContext.read.parquet("s3://myBucket/myKey")

ksindi on 26 Apr 2016

👍10 🎉3
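A small variation on the snippet above, offered as a sketch rather than as part of the original answer: the package list is joined by a helper, and the keys are read with `getpass` so they are not echoed into the notebook. In a Python 2 kernel this also avoids `input()`, which evaluates whatever is typed. The names `submit_args`, `PACKAGES` and `read_credentials` are hypothetical.

```python
import os
from getpass import getpass


def submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value."""
    return "--packages {} pyspark-shell".format(",".join(packages))


# Same coordinates as in the comment above.
PACKAGES = [
    "com.amazonaws:aws-java-sdk:1.10.34",
    "org.apache.hadoop:hadoop-aws:2.6.0",
]

# Has to happen before pyspark.SparkContext(...) is called.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(PACKAGES)


def read_credentials(ask=getpass):
    """Prompt for the access key and secret key without echoing them."""
    access_key = ask("AWS access key id: ")
    secret_key = ask("AWS secret access key: ")
    return access_key, secret_key
```

The returned pair would then be passed to the same `hadoopConf.set(...)` calls shown above.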

@ksindi Thanks for sharing! Would you mind putting it on the recipes wiki page?

parente on 26 Apr 2016

@parente https://github.com/jupyter/docker-stacks/wiki/Docker-recipes#using-pyspark-with-aws-s3

ksindi on 26 Apr 2016

🎉2

If you want to create a Docker image that loads this config automatically, you can do:

FROM jupyter/all-spark-notebook:latest

COPY script.py script.py
RUN python script.py

ENV PYSPARK_SUBMIT_ARGS  '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

COPY hdfs-site.xml /usr/local/spark/conf

With hdfs-site.xml

<configuration>
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>
</configuration>

and script.py (to trigger the relevant files being pre-installed in the Docker image; a couple of lines in this are probably superfluous):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

RobinL on 9 Jun 2018

👍4
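If more properties than `fs.s3.impl` are needed, the hdfs-site.xml shown above could be generated rather than written by hand. This is a sketch using only the standard library; `hadoop_site_xml` is a hypothetical helper, and the single property passed to it is the one from the comment above.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = hadoop_site_xml({
    "fs.s3.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
})
```

The resulting string can be written to a file named hdfs-site.xml next to the Dockerfile before building the image.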

@RobinL I've been hitting my head against this issue for a few hours and decided to put together your solution since it seems to be the most robust approach. I built the image and have got it running, but when I try to access anything on S3 I get a "Socket not created by this factory" error.

I was wondering whether your solution is still working for you or if you have found another way around this problem?

datawookie on 30 Aug 2018

@DataWookie I had the same issue and could fix it by:
1) Using s3a:// URLs instead of s3:// or s3n://
2) Upgrading the AWS SDK version to aws-java-sdk:1.11.95 (https://github.com/aws/aws-sdk-java/issues/1032)

jawadst on 4 Nov 2018

👍2
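The two changes above might look like this in a notebook. This is a sketch under assumptions: `to_s3a` and `s3a_settings` are hypothetical helpers, the property names `fs.s3a.access.key` and `fs.s3a.secret.key` are the ones the Hadoop S3A connector reads, and the hadoop-aws version is carried over unchanged from the earlier comments.

```python
def to_s3a(url):
    """Rewrite s3:// or s3n:// URLs to s3a://; leave anything else unchanged."""
    scheme, separator, rest = url.partition("://")
    if separator and scheme in ("s3", "s3n"):
        return "s3a://" + rest
    return url


def s3a_settings(access_key, secret_key):
    """Hadoop configuration entries for S3A credentials."""
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }


# Package list with the AWS SDK version suggested above.
PACKAGES_S3A = [
    "com.amazonaws:aws-java-sdk:1.11.95",
    "org.apache.hadoop:hadoop-aws:2.6.0",
]

# With a running SparkContext `sc`, the settings would be applied as:
#   hadoopConf = sc._jsc.hadoopConfiguration()
#   for key, value in s3a_settings(myAccessKey, mySecretKey).items():
#       hadoopConf.set(key, value)
```

For example, `to_s3a("s3://myBucket/myKey")` gives `"s3a://myBucket/myKey"`, which is the form to pass to `sqlContext.read.parquet(...)`.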

Another update on this:

In addition to running spark in jupyter lab, I have been trying to set up two additional docker commands that allow us to run spark scripts in the same environment:

1) Get a bash command prompt within the Docker container with docker run -it myimage /bin/bash
2) Run a python script in the spark environment with docker run myimage python myfile.py

where myimage is the image built from the Dockerfile above, which builds on all-spark-notebook and adds additional config.

The script.py above is a hack that makes sure that a bunch of required spark packages (jars) are pre-installed.

This works in jupyter but not for (1) and (2) above.

After lots of digging, I figured out why: the jars are installed under /home/jovyan/.ivy2/jars/, for example /home/jovyan/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar.

Jupyter ‘knows’ about the jovyan folder, but the bash prompt and python script don’t. So (1) and (2) can’t see the jars in /home/jovyan and therefore download them again.

The solution is to include this:

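from pyspark import SparkContext
from pyspark.conf import SparkConf

# Point Ivy at the directory where the jars were cached during the image build.
conf = SparkConf().set(
    "spark.jars.ivy", "/home/jovyan/.ivy2/")

sc = SparkContext(conf=conf)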

in any scripts which are running outside of jupyter

RobinL on 2 Oct 2019

👍1
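To confirm that a script sees the pre-installed jars before starting Spark, the cache directory can be listed first. This is a sketch only; `cached_jars` is a hypothetical helper, and the default path is the one mentioned in the comment above.

```python
import glob
import os


def cached_jars(ivy_dir="/home/jovyan/.ivy2"):
    """Return the sorted file names of the jars cached under `ivy_dir`/jars."""
    pattern = os.path.join(ivy_dir, "jars", "*.jar")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```

An empty list from `cached_jars()` would suggest that the script is running where the cache is not visible, in which case Spark is likely to download the packages again.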
