Kedro: [KED-1010] Optimal Spark Configuration

Created on 25 Oct 2019 · 2 Comments · Source: quantumblacklabs/kedro

Description

We have basic support for PySpark workflows, detailed here, but find that our users configure Kedro in many different ways. This is a research theme to determine a few things:

What does a best-practice Spark configuration for Kedro look like?
What challenges do our users face while using Kedro and PySpark?
How does Spark interact with our ParallelRunner?
How does Spark interact with kedro-airflow, a Kedro plugin?

Context

This task precedes our support for Dask specified in #144. Users should have a similar user experience setting up Spark and Dask.

Opportunity Roadmap

yetudada

👍2 ❤1

Most helpful comment

Hi everyone, to follow up on the questions in this issue:

What does a best-practice Spark configuration for Kedro look like?

We have updated our documentation to recommend initialising SparkSession in your ProjectContext but _not storing it on the context instance_. Instead, call the global SparkSession.builder.getOrCreate() to retrieve your session whenever you need it in your pipeline.
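As a rough illustration of that recommendation, here is a minimal sketch, not the code from the Kedro documentation. The function names (`spark_conf_pairs`, `init_spark_session`, `count_rows`) and the flat settings mapping are assumptions made for this example; only the `SparkConf` and `SparkSession.builder` calls are real PySpark API. PySpark is imported lazily inside the functions so the module itself can be imported without Spark available.

```python
from typing import Any, Dict, List, Tuple


def _to_spark_value(value: Any) -> str:
    """Render a Python value the way Spark settings are usually written."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def spark_conf_pairs(parameters: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Turn a flat settings mapping (hypothetical, e.g. loaded from a config
    file) into sorted (key, value) string pairs for SparkConf.setAll."""
    return [(str(key), _to_spark_value(parameters[key])) for key in sorted(parameters)]


def init_spark_session(app_name: str, parameters: Dict[str, Any]) -> None:
    """Create the global session once. Returns nothing on purpose, so callers
    are not tempted to store the session on a context object."""
    # Lazy imports: assumption made so this module imports without pyspark.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf().setAll(spark_conf_pairs(parameters))
    SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()


def count_rows(path: str) -> int:
    """Example node: fetch the already initialised session on demand."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    return spark.read.parquet(path).count()
```

In this sketch `init_spark_session` would be called once at start-up, and any node that needs the session retrieves it with `getOrCreate()` instead of receiving it as an attribute or argument.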

What challenges do our users face while using Kedro and PySpark?

Unnecessary SparkSession is created when running non-essential commands (Issue)

A side effect of creating a Spark session inside ProjectContext is that every plugin will initialise a Spark session, even if it doesn't need one. The upside is that if you use an IPython notebook, you don't have to initialise the Spark session yourself.

In Kedro 0.16, with the introduction of Hooks, a much better place to initialise the Spark session is the before_pipeline_run hook. Note that if you choose to do this, you will have to remember to initialise the session manually when using an IPython notebook.

Parallel runner doesn't work with Spark (Issue)

Kedro's ParallelRunner uses process-based parallelism, which doesn't work with a Spark session. Instead, in the next release we will ship a ThreadRunner that offers thread-based parallelism for use alongside Spark.
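The difference between the two kinds of parallelism can be sketched with the standard library alone. `FakeSession` below is a hypothetical stand-in, not a real SparkSession: it simply owns a handle that cannot be serialised, which is one plausible way to picture why handing a session to worker processes fails while sharing it between threads works. The exact failure mode with a real Spark session may differ.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class FakeSession:
    """Hypothetical stand-in for a session object that holds a live,
    non-serialisable handle (here a lock)."""

    def __init__(self) -> None:
        self._handle = threading.Lock()
        self.calls = 0

    def run(self, node_name: str) -> str:
        # The lock makes the shared counter safe to update from many threads.
        with self._handle:
            self.calls += 1
        return f"{node_name}: done"


def can_pickle(obj: object) -> bool:
    """Process-based parallelism has to serialise what it sends to workers."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError):
        return False


def run_nodes_in_threads(session: FakeSession, node_names: List[str]) -> List[str]:
    """Threads live in one process, so they can all share the same session."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(session.run, node_names))
```

Here `can_pickle(FakeSession())` is false, so the object could not be shipped to another process, whereas `run_nodes_in_threads` uses the single shared instance from every worker thread.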

Users want to take advantage of spark's lazy evaluation and optimiser as much as possible

We will be updating our documentation to show examples as well as principles on how to do this. We will also provide a CLI flag to generate a PySpark project template, with an option to generate example pipelines as well. Both pieces of work are tracked on our internal Kanban board.
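Until those examples exist, one common reading of "taking advantage of lazy evaluation" is to keep each node a pure DataFrame transformation and avoid actions such as `collect()` or `toPandas()` between nodes, so Spark can optimise the whole chain when the result is finally written. The sketch below assumes that reading; the node names and the `year` column are hypothetical, and the functions only use the standard `DataFrame.filter` and `DataFrame.groupBy(...).count()` transformations.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type hints; the nodes themselves just call DataFrame methods.
    from pyspark.sql import DataFrame


def filter_recent(df: "DataFrame", min_year: int) -> "DataFrame":
    """Node 1: a transformation only. Nothing is computed here."""
    return df.filter(f"year >= {min_year}")


def yearly_counts(df: "DataFrame") -> "DataFrame":
    """Node 2: still lazy. groupBy(...).count() returns a DataFrame and
    does not trigger a Spark job by itself."""
    return df.groupBy("year").count()
```

Because neither node triggers an action, the filter and the aggregation end up in the same query plan, and the work only happens when a later step saves or displays the result.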

Running `pyspark` with `kedro-airflow`

We are still trying to replicate the issue. Any actual error report would be greatly appreciated.

limdauto on 21 Apr 2020

🚀4

I'm going to close this ticket, thanks to @limdauto 🥇 These changes were shipped to our docs and as a Kedro PySpark starter.

yetudada on 22 Oct 2020
