So we can do Spark Streaming and Spark SQL stuff, that's awesome!!
@kerwin -- I think you've posted on the wrong repo.
InfluxDB is a great datastore for time series data; it would be great to have Spark support so that we could do more advanced data analysis (SQL, streaming, ML, interactive queries with Hadoop).
Here are some other NoSQL examples:
https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
OK, it wasn't clear from your ticket that this is a request. I've re-opened it.
+1
we are planning to work on it : )
Very good plan ;-)
+1
Hi, @CrazyJvm. Just wanted to ask how the Spark integration is progressing? I'm looking forward to it coming soon...
@kerwin Querying data from InfluxDB directly to feed Spark isn't practical. We still have to find another way to accomplish this.
@CrazyJvm how about the opposite direction? Is it doable now to push the results of a Spark job into InfluxDB?
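Pushing results out is mostly a matter of formatting each row as InfluxDB's line protocol and POSTing the batch to the `/write` endpoint. A minimal sketch of the formatting half in plain Scala; the `LineProtocol` object and its method names are my own illustration, not part of any existing connector:

```scala
// Hypothetical helper: format one row of Spark output as an InfluxDB
// line-protocol point, e.g. "cpu,host=a usage=0.5 1434055562000000000".
object LineProtocol {
  // Escape the characters that are special in tag keys/values.
  private def escape(s: String): String =
    s.replace(",", "\\,").replace(" ", "\\ ").replace("=", "\\=")

  def toLine(measurement: String,
             tags: Map[String, String],
             fields: Map[String, Double],
             timestampNs: Long): String = {
    val tagPart =
      if (tags.isEmpty) ""
      else tags.map { case (k, v) => s"${escape(k)}=${escape(v)}" }
                .mkString(",", ",", "")
    val fieldPart =
      fields.map { case (k, v) => s"${escape(k)}=$v" }.mkString(",")
    s"${escape(measurement)}$tagPart $fieldPart $timestampNs"
  }
}
```

Inside a Spark job you would typically call this from `foreachPartition`, join the lines with newlines, and POST them to `http://<host>:8086/write?db=<db>`, so each executor writes its own partition without shipping data back to the driver.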
+1
+1 I would love to join statistical performance data with other data to build a better view into how a distributed application is performing.
+1 amazing request
Feel free to look at my approach to this issue:
https://github.com/pygmalios/reactiveinflux
https://github.com/pygmalios/reactiveinflux-spark
I'm not entirely sure what this issue is, so I'm going to close it for now. If somebody is willing to write something up to explain what Spark is, how it relates to InfluxDB (with use cases), and what needs to be added to InfluxDB for this use case, we can take a look at it in a future release. Otherwise, I'm not really sure what to do with this issue.
Thanks.
Hello.
Apache Spark is a distributed analytics engine. It lets you run complex processing on data. The data being handled is stored internally in a structure called an RDD. An RDD can be generated from multiple sources, such as a file, a Kafka stream, a Cassandra table, a JDBC connection, an Elasticsearch query, etc.
A Spark <-> datasource connector (like the one for Cassandra) must be written because Spark is distributed, and some information must be taken into account for correct handling (see https://github.com/datastax/spark-cassandra-connector for an example of a very good Spark connector).
For example, Spark can join an RDD coming from a Cassandra query with a local CSV file to create a new dataset.
I'm highly interested in this feature because we plan to do complex correlation between logs and metrics. We plan to collect logs with Graylog (so ultimately in Elasticsearch storage) and metrics in InfluxDB.
So to be able to find, for example, patterns of failures by correlating traces and metrics, Spark must be able to query InfluxDB.
Hope this clarifies things a little.
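To make the correlation use case concrete, here is a sketch in plain Scala, with `Seq`s of (timestamp, value) pairs standing in for RDDs; in Spark the same shape would be `rdd1.join(rdd2)`. The `Correlate` object and the sample data are illustrative only:

```scala
object Correlate {
  // Inner-join two (timestamp -> value) sequences on timestamp,
  // mimicking what an RDD join does across sources.
  def join[A, B](left: Seq[(Long, A)], right: Seq[(Long, B)]): Seq[(Long, (A, B))] = {
    val rightByTs = right.toMap
    left.collect { case (ts, a) if rightByTs.contains(ts) => (ts, (a, rightByTs(ts))) }
  }
}
```

Usage, correlating error logs (e.g. from Elasticsearch) with metric samples (e.g. from InfluxDB) that share a timestamp bucket:

```scala
val logs    = Seq((100L, "OutOfMemoryError"), (102L, "GC pause"))
val metrics = Seq((100L, 0.95), (101L, 0.20))
Correlate.join(logs, metrics)  // Seq((100L, ("OutOfMemoryError", 0.95)))
```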
@RadoBuransky does your code work on the newer spark versions?
I have no idea to be honest.
Re-visit building a Spark Datasource connector for InfluxDB.
+1
+1
Is there any plan to write an InfluxDB connector for the new version of Spark? Thanks!
@timhallinflux - please could you update us on where we are with the spark influxdb connector. It would be a great library to have!
+1
Is there any update or a plan towards doing this?
Also, there's this connector that seems to be up to date: https://github.com/fsanaulla/chronicler-spark
I haven't used it yet, though. Maybe its author - @fsanaulla - can say a little bit more about its state?
It's up to date, with regular updates, and is published for Scala 2.11/2.12, including support for RDD, Dataset, Streaming and Structured Streaming.
It implements the write flow from Apache Spark to InfluxDB.
I hope to add a read flow in the near future.
Any feedback is welcome!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.