Docker-stacks: Correctly set the pyspark python version for the Spark driver

Created on 8 Mar 2016 · 12 Comments · Source: jupyter/docker-stacks

In all-spark-notebook/Dockerfile, use PYSPARK_DRIVER_PYTHON instead of PYSPARK_PYTHON to set the Python version of the Spark driver. PYSPARK_PYTHON also changes the interpreter for all executors: the notebook's Python path is sent to the executors, which causes "python not found" errors when that path does not exist on them.
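To make the difference between the two variables concrete, here is a minimal sketch that models the precedence described in the Spark documentation: PYSPARK_PYTHON applies to both driver and workers, and PYSPARK_DRIVER_PYTHON overrides it for the driver only. The function name `resolve_pyspark_pythons` and the `default` parameter are hypothetical, and this is only a model of the behaviour, not Spark's actual code (the built-in default interpreter name also differs between Spark versions).

```python
def resolve_pyspark_pythons(environ, default="python"):
    """Model which interpreter the driver and the workers would use.

    `environ` is any mapping of environment variables. `default` stands in
    for Spark's built-in fallback, which is an assumption here.
    """
    # PYSPARK_PYTHON applies to the workers (and to the driver by default).
    workers = environ.get("PYSPARK_PYTHON", default)
    # PYSPARK_DRIVER_PYTHON overrides the driver only.
    driver = environ.get("PYSPARK_DRIVER_PYTHON", workers)
    return {"driver": driver, "workers": workers}


# Path that exists only inside the container (taken from this thread).
container_python = "/opt/conda/envs/python2/bin/python"

# Setting PYSPARK_PYTHON pushes the container path to the workers too.
print(resolve_pyspark_pythons({"PYSPARK_PYTHON": container_python}))

# Setting only PYSPARK_DRIVER_PYTHON leaves the workers on their default.
print(resolve_pyspark_pythons({"PYSPARK_DRIVER_PYTHON": container_python}))
```

In the second case the workers keep their own default interpreter, which is the behaviour this issue asks for.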

Bug

All 12 comments

@boechat107, do you have any thoughts on this change?

Also, could you please share some kind of traceback @doctapp for the error(s) you are seeing?

I don't have the exact error, but it referred to not finding /opt/conda/.../python2/bin/python on our cluster, since that path exists only in the container.
The example in the all-spark-notebook and pyspark-notebook READMEs gives an explicit way to set the path:

import os
# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

import pyspark
conf = pyspark.SparkConf()

...

Of course, it would be better if the path didn't default to the driver version / path of Python like this issue states. But still, the above recipe holds true for telling Spark where Python resides on the workers.

It would also be consistent with the pyspark console, which uses the defaults.
If we make this change, using Python 2 still requires manually setting PYSPARK_PYTHON in the notebook before creating the SparkContext:

import os
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'

Otherwise, the workers use the default Python 3.x, which is incompatible with a Python 2 driver.

Unless there's some other way to tackle this that appeases both use cases, local and remote, we need to put a stake in the ground about which use case we want to work out-of-the-box.

I'm leaning toward local since remote requires additional configuration anyway.

I have both notebook versions working fine using the driver env. I don't know why you would need to do this change to make it work with Python 2...
"I have both notebook versions working fine using the driver env. I don't know why you would need to do this change to make it work with Python 2"

This notebook demonstrates the problem: https://gist.github.com/parente/70e8798aa22c1cc7b7eaa71c1b73bf4c

I believe this can now be solved with environment variables that are set when the conda environment is activated. We already activate the conda environment when launching the python2 kernel (https://github.com/jupyter/docker-stacks/blob/master/scipy-notebook/Dockerfile#L99), so this should amount to putting a bash script in the right conda location: $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
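As a rough illustration of that idea, the sketch below writes such an activation script from Python. The helper name `write_activate_script` is hypothetical, and the interpreter path is simply the one quoted earlier in this thread. It assumes the environment is activated by a shell that sources files in activate.d, and it is not something the docker-stacks images are known to ship.

```python
import os
import shlex


def write_activate_script(conda_prefix, worker_python):
    """Write etc/conda/activate.d/env_vars.sh under the given conda prefix.

    The script exports PYSPARK_PYTHON whenever the environment is activated.
    Returns the path of the file that was written.
    """
    activate_dir = os.path.join(conda_prefix, "etc", "conda", "activate.d")
    os.makedirs(activate_dir, exist_ok=True)
    script_path = os.path.join(activate_dir, "env_vars.sh")
    with open(script_path, "w") as script:
        script.write("#!/bin/sh\n")
        # shlex.quote guards against spaces or shell metacharacters in the path.
        script.write("export PYSPARK_PYTHON={}\n".format(shlex.quote(worker_python)))
    return script_path


# Example call (hypothetical prefix taken from the paths in this thread):
# write_activate_script("/opt/conda/envs/python2",
#                       "/opt/conda/envs/python2/bin/python")
```

A matching script under etc/conda/deactivate.d could unset the variable again when the environment is deactivated.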

Or better yet, we don't need to do anything: https://gist.github.com/parente/8740dfa9b932f7bdd8edcf0f63865860

We took the PYSPARK_PYTHON env var out of the Dockerfiles a while back, and Spark now appears to pick up the correct Python interpreter when working in local mode. For cluster mode, the user or sysadmin will still need to set the proper path to Python on the executor side via the env vars or the Spark config option, but at least we're not getting in the way now.
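For the cluster-mode case, the two routes mentioned above might look roughly like the sketch below. Both helper names are hypothetical. The config key `spark.pyspark.python` is, to my knowledge, available from Spark 2.1 onward and applies to both driver and executors, so check the configuration docs for your Spark version. The example path `/usr/bin/python3` is the one used earlier in this thread.

```python
import os


def use_executor_python_env(executor_python, environ=None):
    """Option 1: set PYSPARK_PYTHON; call this before creating the SparkContext."""
    env = os.environ if environ is None else environ
    env["PYSPARK_PYTHON"] = executor_python
    return env


def executor_python_conf(executor_python):
    """Option 2: (key, value) pairs that could be passed to SparkConf.setAll()."""
    # Assumption: the cluster runs a Spark version that supports this key.
    return [("spark.pyspark.python", executor_python)]


# Usage sketch (requires pyspark, so it is left as comments):
#   use_executor_python_env("/usr/bin/python3")
#   import pyspark
#   conf = pyspark.SparkConf().setAll(executor_python_conf("/usr/bin/python3"))
```

Either option alone should be enough; the path has to be valid on the executors, not just inside the notebook container.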

No need to change any file. Just run this in a terminal before starting PySpark:
export PYSPARK_PYTHON=python3
and it will be done :)
