So we can do Spark Streaming and Spark SQL stuff, that's awesome!!
@kerwin -- I think you've posted on the wrong repo.
InfluxDB is a great datastore for time series data; it would be great to have Spark support so that we could do more advanced data analysis (SQL, streaming, ML, interactive queries with Hadoop).
Here are some other NoSQL examples:
https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
OK, it wasn't clear from your ticket that this is a request. I've re-opened it.
+1
we are planning to work on it : )
Very good plan ;-)
+1
Hi, @CrazyJvm. Just wanted to ask how the Spark integration is progressing? I'm looking forward to it coming soon...
@kerwin Querying data from InfluxDB directly to feed Spark isn't practical. We still have to find another way to accomplish this.
@CrazyJvm how about the opposite direction? Is it doable now to push the results of a Spark job into InfluxDB?
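Pushing results out is mostly a matter of formatting each row as InfluxDB's line protocol and POSTing the batch to the `/write` endpoint. A minimal sketch of the formatting half in plain Scala; the `LineProtocol` object and its method names are my own illustration, not part of any existing connector:

```scala
// Hypothetical helper: format one row of Spark output as an InfluxDB
// line-protocol point, e.g. "cpu,host=a usage=0.5 1434055562000000000".
object LineProtocol {
  // Escape the characters that are special in tag keys/values.
  private def escape(s: String): String =
    s.replace(",", "\\,").replace(" ", "\\ ").replace("=", "\\=")

  def toLine(measurement: String,
             tags: Map[String, String],
             fields: Map[String, Double],
             timestampNs: Long): String = {
    val tagPart =
      if (tags.isEmpty) ""
      else tags.map { case (k, v) => s"${escape(k)}=${escape(v)}" }
                .mkString(",", ",", "")
    val fieldPart =
      fields.map { case (k, v) => s"${escape(k)}=$v" }.mkString(",")
    s"${escape(measurement)}$tagPart $fieldPart $timestampNs"
  }
}
```

Inside a Spark job you would typically call this from `foreachPartition`, join the lines with newlines, and POST them to `http://<host>:8086/write?db=<db>`, so each executor writes its own partition without shipping data back to the driver.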
+1
+1 I would love to join statistical performance data with other data to build a better view into how a distributed application is performing.
+1 amazing request
Feel free to look at my approach to this issue:
https://github.com/pygmalios/reactiveinflux
https://github.com/pygmalios/reactiveinflux-spark
I'm not entirely sure what this issue is, so I'm going to close it for now. If somebody is willing to write something up to explain what Spark is, how it relates to InfluxDB (with use cases), and what needs to be added to InfluxDB for this use case, we can take a look at it in a future release. Otherwise, I'm not really sure what to do with this issue.
Thanks.
Hello.
Apache Spark is a distributed analytics engine. It lets you run complex processing on data. The data being handled is stored internally in a structure called an RDD. An RDD can be generated from multiple sources, such as a file, a Kafka stream, a Cassandra table, a JDBC connection, an Elasticsearch query, etc.
A Spark <-> datasource connector (like the one for Cassandra) must be written because Spark is distributed, and some information must be taken into account for correct handling (see https://github.com/datastax/spark-cassandra-connector for an example of a very good Spark connector).
For example, Spark can join an RDD coming from a Cassandra query with a local CSV file to create a new dataset.
I'm highly interested in this feature because we plan to do complex correlation between logs and metrics. We plan to collect logs with Graylog (so ultimately in Elasticsearch storage) and metrics in InfluxDB.
So to be able to find, for example, patterns of failures by correlating traces and metrics, Spark must be able to query InfluxDB.
Hope this clarifies things a little.
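To make the correlation use case concrete, here is a sketch in plain Scala, with `Seq`s of (timestamp, value) pairs standing in for RDDs; in Spark the same shape would be `rdd1.join(rdd2)`. The `Correlate` object and the sample data are illustrative only:

```scala
object Correlate {
  // Inner-join two (timestamp -> value) sequences on timestamp,
  // mimicking what an RDD join does across sources.
  def join[A, B](left: Seq[(Long, A)], right: Seq[(Long, B)]): Seq[(Long, (A, B))] = {
    val rightByTs = right.toMap
    left.collect { case (ts, a) if rightByTs.contains(ts) => (ts, (a, rightByTs(ts))) }
  }
}
```

Usage, correlating error logs (e.g. from Elasticsearch) with metric samples (e.g. from InfluxDB) that share a timestamp bucket:

```scala
val logs    = Seq((100L, "OutOfMemoryError"), (102L, "GC pause"))
val metrics = Seq((100L, 0.95), (101L, 0.20))
Correlate.join(logs, metrics)  // Seq((100L, ("OutOfMemoryError", 0.95)))
```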
@RadoBuransky does your code work on the newer spark versions?
I have no idea to be honest.
Re-visit building a Spark Datasource connector for InfluxDB.
+1
+1
Is there any plan to write an InfluxDB connector for the new version of Spark? Thanks!
@timhallinflux - please could you update us on where we are with the spark influxdb connector. It would be a great library to have!
+1
Is there any update or a plan towards doing this?
Also, there's this connector that seems to be up to date: https://github.com/fsanaulla/chronicler-spark
I haven't used it yet, though. Maybe its author - @fsanaulla - can say a little bit more about its state?
It's up to date, with regular updates, and is published for Scala 2.11/2.12, including support for RDD, Dataset, Streaming and Structured Streaming.
It implements the write flow from Apache Spark to InfluxDB.
I hope to add a read flow in the near future.
Any feedback is welcome!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.