Kedro: [KED-1010] Optimal Spark Configuration

Created on 25 Oct 2019 · 2 Comments · Source: quantumblacklabs/kedro

Description

We have basic support for PySpark workflows detailed here, but find that our users configure Kedro in many different ways. This is a research theme to determine a few things:

  • What does a best-practice Spark configuration for Kedro look like?
  • What challenges do our users face while using Kedro and PySpark?
  • How does Spark interact with our ParallelRunner?
  • How does Spark interact with kedro-airflow, a Kedro plugin?

Context

This task precedes our support for Dask specified in #144. Users should have a similar user experience setting up Spark and Dask.

Opportunity Roadmap

Most helpful comment

Hi everyone, to follow up on the questions in this issue:

What does a best-practice Spark configuration for Kedro look like?

We have updated our documentation to recommend initialising a SparkSession in your ProjectContext but _not storing it on the context instance_. Instead, call the global SparkSession.builder.getOrCreate() to retrieve your session whenever you need it in your pipeline.
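As a rough illustration of that recommendation, the sketch below fetches the session on demand rather than reading it from the context. It assumes PySpark is installed; `get_spark` and `rows_as_dataframe` are hypothetical names chosen for this example, not part of Kedro or PySpark.

```python
def get_spark():
    """Return the current SparkSession, creating one only if none exists yet."""
    # Imported inside the function so that merely importing this module
    # does not need PySpark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.getOrCreate()


def rows_as_dataframe(rows, columns):
    """Hypothetical node: asks for the session when it needs it, so nothing
    has to be stored on (or passed around from) the project context."""
    spark = get_spark()
    return spark.createDataFrame(rows, columns)
```

Because getOrCreate() returns the already-active session when there is one, every node that calls `get_spark()` ends up sharing the session that was configured at start-up.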

What challenges do our users face while using Kedro and PySpark?

Unnecessary SparkSession is created when running non-essential commands (Issue)

A side effect of creating a Spark session inside ProjectContext is that every plugin ends up initialising a Spark session, even if it doesn't need one. The upside is that if you use an IPython notebook, you don't have to initialise the Spark session yourself.

Since Kedro 0.16 introduced Hooks, a much better place to initialise the Spark session is the before_pipeline_run hook. Note that if you choose to do this, you will have to remember to initialise the session manually when using an IPython notebook.
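A minimal sketch of such a hook, assuming Kedro 0.16+ and PySpark are installed. The class name `SparkHooks` and the app name are hypothetical. The import fallback is only there so the snippet can be imported on a machine without Kedro; a real project would import `hook_impl` directly.

```python
try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Kedro is not installed: use a no-op marker so the sketch still imports.
    def hook_impl(func):
        return func


class SparkHooks:
    """Hypothetical hook class that starts Spark just before a pipeline run."""

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        # Imported here so commands that never run a pipeline do not need Spark.
        from pyspark.sql import SparkSession

        # The session is deliberately not stored anywhere; nodes retrieve it
        # later with SparkSession.builder.getOrCreate().
        SparkSession.builder.appName("kedro-spark-sketch").getOrCreate()
```

The hook still has to be registered with the project as described in the Hooks documentation for your Kedro version. In a notebook, where no pipeline run triggers the hook, you would call SparkSession.builder.getOrCreate() yourself.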

Parallel runner doesn't work with Spark (Issue)

Kedro's ParallelRunner uses process-based parallelism, which doesn't work with a Spark session. Instead, in the next release we will ship a ThreadRunner, which offers thread-based parallelism, for users to run alongside Spark.
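The thread does not spell out why processes are a problem. One general difference is that work sent to another process has to be serialised, while threads can share a live object directly. The standard-library sketch below illustrates only that general point; `FakeSession` is a hypothetical stand-in, and none of this is Kedro's or Spark's actual implementation.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSession:
    """Stand-in for a live session: it owns a handle that cannot be pickled."""

    def __init__(self):
        self._lock = threading.Lock()  # locks, like live connections, are not picklable
        self.calls = 0

    def run(self, name):
        with self._lock:
            self.calls += 1
        return f"{name} done"


session = FakeSession()

# Process-based parallelism has to serialise whatever it ships to workers.
try:
    pickle.dumps(session)
    picklable = True
except TypeError:
    picklable = False

# Threads live in one process, so they can all use the same object directly.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(session.run, ["node_a", "node_b"]))
```

Here `picklable` ends up False, while both thread tasks complete against the single shared `session` object.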

Users want to take advantage of spark's lazy evaluation and optimiser as much as possible

We will be updating our documentation with examples and principles for doing this. We will also provide a CLI flag to generate a PySpark project template, with an option to generate example pipelines as well. Both pieces of work are tracked on our internal Kanban board.

Running pyspark with kedro-airflow

We are still trying to replicate the issue. Any actual error report would be greatly appreciated.

All 2 comments

I'm going to close this ticket, thanks to @limdauto. These changes were shipped to our docs and as a Kedro PySpark starter.