It would be good for https://spark.rstudio.com/ to state explicitly why we would want to use Spark. Presumably it's to parallelise operations or to deal with larger-than-memory data.
But there are seemingly a zillion ways to do this in R. (Am I the only one who remembers snow? mclapply?) So maybe Spark is better. But why? For what? When is it worth dealing with Java?
Spark is a product of Apache, not RStudio. Maybe you can go through the following links (if you haven't):
https://spark.apache.org/docs/latest/index.html
https://spark.apache.org/docs/latest/cluster-overview.html
Much enterprise-level software is written in Java, so Apache Spark provides a Java API to let Java programmers work with it easily. It also supports R, Python, and Scala.
Spark is used for fast data analytics and processing, which it achieves with the help of various built-in libraries: GraphX for graph processing, Spark SQL, MLlib for machine learning, and Spark Streaming for stream analytics, i.e. real-time processing of data.
It is completely open-source and Apache 2 licensed.
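To make the R angle concrete, here is a minimal sparklyr sketch of the typical workflow (an assumption for illustration, not an official example: it uses the built-in mtcars data as a stand-in for a large dataset, and assumes a local Spark installation, which spark_install() can set up):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; with real big data you would instead
# read files directly with spark_read_csv() or spark_read_parquet()
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark,
# so the full dataset never has to fit in R's memory
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()   # collect() brings only the small result back into R

spark_disconnect(sc)
```

The point of the sketch is the division of labour: R issues familiar dplyr verbs, Spark does the distributed work, and only the aggregated result comes back to the R session.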
There is extensive discussion of this topic in an online book about sparklyr; see therinspark.com for details.
Re: why this was not on spark.rstudio.com: I think spark.rstudio.com is mostly intended as a place to provide quick references for practitioners. We generally don't dive into deep architectural discussions on that website.