We have basic support for PySpark workflows detailed here, but we find that our users configure Kedro in many different ways. This is a research theme to determine a few things:

- ParallelRunner?
- kedro-airflow, a Kedro plugin?

This task precedes our support for Dask specified in #144. Users should have a similar experience setting up Spark and Dask.
Hi everyone, to follow up on the questions in this issue:

What does a best-practice Spark configuration for Kedro look like?

We have updated our documentation to recommend initialising SparkSession in your ProjectContext but _not storing it on the context instance_. Instead, call the global SparkSession.builder.getOrCreate() to retrieve your session whenever you need it in your pipeline.
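As a rough sketch of the recommendation above: one function creates the session once, and pipeline code fetches the same session through `getOrCreate()` instead of reading it from the context. The function names `init_spark_session` and `count_rows`, the application name, and the use of a Parquet path are all hypothetical choices for illustration, not taken from the Kedro documentation.

```python
from typing import Dict, Optional


def init_spark_session(conf: Optional[Dict[str, str]] = None) -> None:
    """Create the global SparkSession once, e.g. from ProjectContext.

    Hypothetical helper: it deliberately returns nothing, so callers
    are not tempted to store the session on the context instance.
    """
    # Imported lazily so that importing this module does not load pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("kedro-spark-example")
    for key, value in (conf or {}).items():
        builder = builder.config(key, value)
    builder.getOrCreate()


def count_rows(path: str) -> int:
    """Example node: fetch the already-running session when it is needed."""
    from pyspark.sql import SparkSession

    # getOrCreate() returns the existing global session if there is one.
    spark = SparkSession.builder.getOrCreate()
    return spark.read.parquet(path).count()
```

Because `getOrCreate()` reuses an active session, nodes that call it after `init_spark_session` has run pick up the same session and its configuration.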
What challenges do our users face while using Kedro and PySpark?

Unnecessary SparkSession is created when running non-essential commands (Issue)

A side effect of creating a Spark session inside ProjectContext is that every plugin initialises a Spark session, even if it doesn't need one. The upside is that if you use an IPython notebook, you don't have to initialise the Spark session yourself.

In Kedro 0.16, with the introduction of Hooks, a much better place to initialise the Spark session is the before_pipeline_run hook. Note that if you choose to do this, you will have to remember to initialise the session manually when using an IPython notebook.
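A minimal sketch of the hook-based approach might look like the following. The class name `SparkHooks` and the application name are hypothetical. The `try`/`except` fallback is only there so the snippet can be imported outside a Kedro environment; in a real project the plain `from kedro.framework.hooks import hook_impl` import is what you would use.

```python
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so this sketch can be imported without Kedro installed.
    def hook_impl(func):
        return func


class SparkHooks:
    """Hypothetical hooks class that starts Spark just before a pipeline run."""

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        # Imported lazily so that importing this module does not load pyspark.
        from pyspark.sql import SparkSession

        # Create the global session; nodes retrieve it later with
        # SparkSession.builder.getOrCreate() rather than from this object.
        SparkSession.builder.appName("kedro-spark-example").getOrCreate()
```

With this arrangement, commands that never run a pipeline never trigger the hook, so no session is created for them. How the hooks object is registered depends on the Kedro version, so check the Hooks documentation for the release you are on.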
Parallel runner doesn't work with Spark (Issue)

Kedro's ParallelRunner uses process-based parallelism, which doesn't work with a Spark session. Instead, in the next release we will ship a ThreadRunner, which offers thread-based parallelism, for users to use alongside Spark.
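The difference between the two kinds of parallelism can be illustrated with the standard library alone. This sketch assumes (the thread does not say so) that the core obstacle is that a live session cannot be serialised and handed to another process. `FakeSession` is a hypothetical stand-in that holds a non-picklable handle; it is not a real SparkSession, and the code does not use Kedro's runners.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Hypothetical stand-in for a session that wraps a live, unpicklable handle."""

    def __init__(self) -> None:
        self._handle = threading.Lock()  # locks cannot be pickled
        self.calls = 0

    def run(self, query: str) -> str:
        with self._handle:
            self.calls += 1
        return f"ran {query}"


session = FakeSession()

# Process-based parallelism has to serialise objects sent to worker
# processes; for this object that step fails.
try:
    pickle.dumps(session)
    picklable = True
except TypeError:
    picklable = False

# Threads share the parent's memory, so every task uses the same session.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(session.run, ["a", "b", "c"]))
```

Here the thread pool completes all three calls against the single shared object, while the pickling attempt fails, which is the kind of failure a process-based runner would hit.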
Users want to take advantage of Spark's lazy evaluation and optimiser as much as possible

We will be updating our documentation to show examples, as well as principles, of how to do this. We will also provide a CLI flag to generate a pyspark project template, with an option to generate example pipelines as well. Both pieces of work are tracked on our internal Kanban board.
Running pyspark with kedro-airflow

We are still trying to replicate the issue. Any actual error report would be greatly appreciated.
I'm going to close this ticket, thanks to @limdauto. These changes were shipped to our docs and as a Kedro PySpark starter.