Sparklyr: documentation: why use Spark?

Created on 20 Jun 2019 · 3 Comments · Source: sparklyr/sparklyr

It would be good for https://spark.rstudio.com/ to state explicitly why we want to use Spark. Presumably it's to parallelise operations or deal with bigger data.

But there are seemingly a zillion ways to do this. (Am I the only one who remembers snow? mclapply?) So, maybe Spark is better. But why? For what? When is it worth dealing with Java?

All 3 comments

Spark is a product of Apache, not RStudio. Maybe you can go through the following links (if you haven't):
https://spark.apache.org/docs/latest/index.html
https://spark.apache.org/docs/latest/cluster-overview.html

Much enterprise software is written in Java, so Apache Spark provides a Java API that lets Java programmers work with it easily. It also supports Scala, Python, and R.
Spark is used for fast, large-scale data analytics and processing, with the help of built-in libraries such as Spark SQL, MLlib (machine learning), GraphX (graph processing), and Spark Streaming for stream analytics, i.e. real-time processing of data.
It is completely open source and licensed under the Apache License 2.0.

There is extensive discussion on this topic within an online book about sparklyr. See therinspark.com for details.

Re: why this was not on spark.rstudio.com: I think spark.rstudio.com is mostly intended as a place to provide quick references for practitioners. We generally don't dive into deep architectural discussions on that website.
