Spark Structured Streaming is the latest programming model for Spark streaming; see https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
@yjshen has implemented a comprehensive connector for Spark, which includes schema integration. He will share the details at the Pulsar meetup in Shenzhen this Saturday. If you are interested, come and join us.
He will contribute it back to the community after the meetup.
For people who are interested in using Spark Structured Streaming with Pulsar, @yjshen has implemented a decent Spark connector here: https://github.com/streamnative/pulsar-spark
The Spark connector supports both streaming and batch jobs, and can write data back to Pulsar as well. The implementation is fully integrated with Pulsar schema. You can also use Spark SQL to query the data in Pulsar.
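To make the streaming, batch, and SQL paths a bit more concrete, here is a minimal PySpark sketch. It assumes the connector jar is on the Spark classpath and that the source is registered under the format name `pulsar` with the options `service.url`, `admin.url`, and `topic`, as I understand them from the connector README; these may differ between connector versions, so check the README for the one you use. The helper names, URLs, topic, and view name are hypothetical.

```python
def pulsar_source_options(service_url, admin_url, topic):
    """Build the option map for the Pulsar source.

    The option keys are assumptions taken from the pulsar-spark README
    and may vary by connector version.
    """
    return {
        "service.url": service_url,  # Pulsar broker service URL
        "admin.url": admin_url,      # Pulsar admin (HTTP) URL
        "topic": topic,              # single topic to read from
    }


def read_pulsar_stream(spark, options):
    """Return a streaming DataFrame backed by a Pulsar topic."""
    reader = spark.readStream.format("pulsar")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


def read_pulsar_batch(spark, options):
    """Return a batch DataFrame over the same Pulsar topic."""
    reader = spark.read.format("pulsar")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


def query_with_sql(spark, df, view_name="pulsar_events"):
    """Register the DataFrame as a temp view and query it with Spark SQL."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(f"SELECT * FROM {view_name}")
```

With a running Pulsar cluster and a `SparkSession` named `spark`, usage would look roughly like `read_pulsar_batch(spark, pulsar_source_options("pulsar://localhost:6650", "http://localhost:8080", "my-topic"))`. The Spark session is passed in rather than created here so the helpers stay independent of how the job is launched.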
He also wrote a blog post about it: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
We are looking forward to contributing this back to upstream Pulsar or Spark. Thoughts and feedback are welcome.
Any update on contributing https://github.com/streamnative/pulsar-spark back here?
@crazylab At this moment, I don't think we will merge this back into the Pulsar repo, because as part of PIP-66 we are actually trying to move Pulsar integrations out of the main Pulsar repo to reduce build time. We encourage people to push integrations upstream to other projects (e.g. Flink or Spark) or to maintain them in a separate repository.
Closing this issue for now. We will continue maintaining the pulsar-spark connector at https://github.com/streamnative/pulsar-spark and work on pushing it to upstream Spark.