Docker-stacks: Correctly set the pyspark python version for the Spark driver

Created on 8 Mar 2016 · 12 Comments · Source: jupyter/docker-stacks

In all-spark-notebook/Dockerfile, use PYSPARK_DRIVER_PYTHON instead of PYSPARK_PYTHON to set the Python version of the Spark driver. PYSPARK_PYTHON also sets the interpreter for all executors, so the Python path from the notebook container is sent to them, which causes "python not found" errors on any executor where that path does not exist.
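
To make the distinction concrete, here is a minimal sketch of the environment this issue asks for: the driver is pinned to the container's interpreter, while the executors fall back to Spark's default lookup unless a worker path is given. The helper name `spark_python_env` is hypothetical, and the container path is the one quoted later in this thread; treat both as assumptions.

```python
import os
import sys


def spark_python_env(driver_python, worker_python=None, base_env=None):
    """Return a copy of base_env with the PySpark interpreter variables set.

    driver_python: interpreter for the driver only (PYSPARK_DRIVER_PYTHON).
    worker_python: interpreter for the executors (PYSPARK_PYTHON); when None,
                   the variable is removed so Spark uses its default lookup.
    base_env: mapping to start from; defaults to the current os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = driver_python
    if worker_python is None:
        # Do not leak a container-only path to the executors.
        env.pop("PYSPARK_PYTHON", None)
    else:
        env["PYSPARK_PYTHON"] = worker_python
    return env


# Start from an environment that still carries the container-only path.
env = spark_python_env(
    sys.executable,
    base_env={"PYSPARK_PYTHON": "/opt/conda/envs/python2/bin/python"},
)
print(sorted(env))
```

The returned mapping could be passed as the `env` argument of `subprocess.run` when launching Spark from Python. In a Dockerfile, the equivalent would be an ENV line for PYSPARK_DRIVER_PYTHON only.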

Bug

doctapp

Most helpful comment

The examples in the all-spark-notebook and pyspark-notebook READMEs give an explicit way to set the path:

import os
# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

import pyspark
conf = pyspark.SparkConf()

...

Of course, it would be better if the path didn't default to the driver's Python version and path, as this issue states. Still, the recipe above holds for telling Spark where Python resides on the workers.

parente on 9 Mar 2016

🎉10 👍3

All 12 comments

Ref: https://github.com/jupyter/docker-stacks/blob/master/all-spark-notebook/Dockerfile#L97

Also in pyspark-notebook.

parente on 8 Mar 2016

@boechat107, do you have any thoughts on this change?

jakirkham on 8 Mar 2016

Also, could you please share some kind of traceback @doctapp for the error(s) you are seeing?

jakirkham on 8 Mar 2016

I don't have the exact error, but it referred to not finding /opt/conda/.../python2/bin/python on our cluster, since that path exists only in the container.

doctapp on 9 Mar 2016

It would also be consistent with the pyspark console, which uses the defaults.

doctapp on 9 Mar 2016

If we make this change, using Python 2 still requires manually setting PYSPARK_PYTHON in the notebook before creating the Spark context:

import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'

Otherwise, the worker uses the default Python 3.x, which is incompatible with the Python 2 driver.

Unless there's some other way to tackle this that appeases both use cases, local and remote, we need to put a stake in the ground about which use case we want to work out-of-the-box.

I'm leaning toward local since remote requires additional configuration anyway.
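
One way to reduce that manual step is to derive the worker interpreter from the major version of the running kernel. This is a minimal sketch, assuming the two paths quoted in this thread are where Python 2 and Python 3 live on the workers. `worker_python_for` is a hypothetical helper, not part of pyspark. PySpark also compares minor versions between driver and worker, so matching on the major version alone is a simplification.

```python
import os
import sys

# Assumed worker-side interpreter locations, taken from the examples in this thread.
WORKER_PYTHONS = {
    2: "/opt/conda/envs/python2/bin/python",
    3: "/usr/bin/python3",
}


def worker_python_for(major=None):
    """Return the worker interpreter matching the given (or current) major version."""
    major = sys.version_info[0] if major is None else major
    try:
        return WORKER_PYTHONS[major]
    except KeyError:
        raise ValueError("no worker interpreter configured for Python %d" % major)


# This has to run before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = worker_python_for()
```

Run in the first cell of a notebook, this would point a python2 kernel at the Python 2 path and a Python 3 kernel at the Python 3 path without editing the cell per kernel.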

parente on 14 Mar 2016

I have both notebook versions working fine using the driver env. I don't know why you would need to make this change to get it working with Python 2...

doctapp on 14 Mar 2016

"I have both notebook versions working fine using the driver env. I don't know why you would need to do this change to make it work with Python 2"

This notebook demonstrates the problem: https://gist.github.com/parente/70e8798aa22c1cc7b7eaa71c1b73bf4c

parente on 10 Apr 2016

I believe this can now be solved with environment variables set when the conda environment is activated. We already activate the conda environment when launching the python2 kernel (https://github.com/jupyter/docker-stacks/blob/master/scipy-notebook/Dockerfile#L99), so this should amount to putting a bash script in the right conda location: $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
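
A minimal sketch of what creating that hook could look like, written in Python so it can be tried against a throwaway directory. The helper `write_activate_hook` is hypothetical, and the interpreter path is the Python 2 path quoted earlier in this thread; whether that is the right value for a given deployment is an assumption.

```python
import os
import tempfile


def write_activate_hook(conda_prefix, worker_python):
    """Create etc/conda/activate.d/env_vars.sh under conda_prefix and return its path."""
    hook_dir = os.path.join(conda_prefix, "etc", "conda", "activate.d")
    os.makedirs(hook_dir, exist_ok=True)
    hook_path = os.path.join(hook_dir, "env_vars.sh")
    with open(hook_path, "w") as f:
        f.write("#!/bin/sh\n")
        # conda sources this file whenever the environment is activated.
        f.write('export PYSPARK_PYTHON="%s"\n' % worker_python)
    return hook_path


# Demo against a temporary directory standing in for $CONDA_PREFIX.
demo_prefix = tempfile.mkdtemp()
hook = write_activate_hook(demo_prefix, "/opt/conda/envs/python2/bin/python")
with open(hook) as f:
    print(f.read())
```

A matching script under etc/conda/deactivate.d that unsets the variable is the usual companion, so the setting does not linger after the environment is deactivated.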

parente on 22 Apr 2017

Or better yet, we don't need to do anything: https://gist.github.com/parente/8740dfa9b932f7bdd8edcf0f63865860

We took the PYSPARK_PYTHON env var out of the Dockerfiles a while back, and Spark now appears to pick up the correct Python interpreter when working in local mode. For cluster mode, the user or sysadmin will still need to set the proper path to Python on the executor side via the env vars or a Spark config option, but at least we're not getting in the way now.
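
For the config-option route, Spark has the properties `spark.pyspark.python` and `spark.pyspark.driver.python`. As far as I know they are read at launch time, so the sketch below renders them as `--conf` arguments for spark-submit rather than setting them from an already-running driver. `python_conf_args` is a hypothetical helper, and the worker path is an assumption.

```python
def python_conf_args(worker_python, driver_python=None):
    """Return spark-submit style arguments selecting the Python interpreters."""
    args = ["--conf", "spark.pyspark.python=%s" % worker_python]
    if driver_python is not None:
        # Only needed when the driver should differ from the workers.
        args += ["--conf", "spark.pyspark.driver.python=%s" % driver_python]
    return args


# Assumed executor-side path; replace with the interpreter on your cluster.
print(" ".join(python_conf_args("/usr/bin/python3")))
```

The same keys could instead go into spark-defaults.conf on the cluster, which keeps notebooks free of cluster-specific paths.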

parente on 22 Apr 2017

No need to change any file; just run this in a terminal:
export PYSPARK_PYTHON=python3
and it will be done :)
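
Note that a bare name like python3 is looked up on PATH wherever the interpreter is launched, so it only helps if every worker has a python3 on its PATH. A minimal sketch for checking what the name resolves to on the current machine; `resolve_interpreter` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil


def resolve_interpreter(command="python3"):
    """Return the absolute path that command resolves to on this machine's PATH, or None."""
    return shutil.which(command)


path = resolve_interpreter("python3")
print(path if path else "python3 is not on PATH here")
```

Running the same check on a worker node would show whether the bare name resolves there too, and to which installation.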