data.table spark interface

Created on 26 Aug 2016 · 20Comments · Source: Rdatatable/data.table

data.table is awesome but most people don't have 100GB memory in order to handle really large data sets in memory.

Big progress has been made making the Apache Spark framework available through R in the last couple of years. Two such projects are Apache's sparkr and Rstudio's sparklyr. Both of these provide a dplyr style interface to spark's data processing engine.

As a heavy data.table user it would be amazing if there were to be a data.table interface for spark. That would make it incredibly easy for data scientists to migrate their projects from the smaller CSV style data sets to the huge data sets that can be processed by spark.

A classic data pipeline for me is

Bring the data into R by CSV
Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table
Build a model using one of R's many machine learning packages

I want to be able to migrate this to

Connect to data on hadoop cluster
Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table's spark interface
Build a model using one of spark's many machine learning algorithm.

feature request someday top request

Source

ysgit

👍27 ❤1

Most helpful comment

I really hope DT syntax is available someday. Personally, I really prefer the DT syntax and would like to use it consistently in Spark rather than cringe with dplyr.
Fingers crossed...

sskarkhanis on 11 Apr 2017

👍6 ❤1

All 20 comments

Thanks for the encouragement. Fully agree.
Yes there are lots of efforts in this space. Note that I (and now Jan Gorecki) work at H2O.
I gave a recent presentation here: https://www.youtube.com/watch?v=5X7h1rZGVs0
The slides are here : https://github.com/Rdatatable/data.table/wiki/Presentations
Just to check whether you had seen these before? Before discussing further.

mattdowle on 26 Aug 2016

👍7

Hi @mattdowle thanks for pointing those out. Very impressive results!
So would you say that H2O already enables that pipeline?

ysgit on 27 Aug 2016

A lot depends on details of your data and types of feature engineering. Definitely test H2O's ML as well as MLlib. We don't have data.table syntax for either Spark or H2O yet. So short answer is no because of that. But yes we'd like to add it.

mattdowle on 27 Aug 2016

👍2

Fully agree, this would be great. Hope that data.table will be made available in the near future.

LeonardAukea on 10 Apr 2017

I really hope DT syntax is available someday. Personally, I really prefer the DT syntax and would like to use it consistently in Spark rather than cringe with dplyr.
Fingers crossed...

sskarkhanis on 11 Apr 2017

👍6 ❤1

something new?

fbala on 25 Jun 2017

Adding my voice to a desire to be able to use data.table instead of dplyr on distributed data in Spark/H2O

aadler on 9 Nov 2017

👍3

Using data.table in Spark would be AMAZING!

petalatuy on 1 Dec 2017

I ran into an issue caused by SparkR's using rJava (and all its endless
flaws) today, would love to have been able to just use data.table. directly
instead of the magical file i/o gymnastics I found myself resorting to

On Dec 1, 2017 9:56 PM, "petalatuy" notifications@github.com wrote:

Using data.table in Spark would be AMAZING!

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/Rdatatable/data.table/issues/1828#issuecomment-348500844,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AHQQdXbs22uCw1Zxyg2zWqkUXGWjzuifks5s8AWqgaJpZM4JuUyv
.

MichaelChirico on 1 Dec 2017

My plus 1000

saptarshiguha on 8 Dec 2017

+1 any updates on the future plan of data.table package?

sunsiyu on 18 May 2018

I'm curious what ppl want out of this exactly.

Just to be able to use [ on an RDD like you would on a data.table (namely, i/j/by)?

certainly the full functionality is a ways away, but i imagine it wouldn't be too earth shaking to make an idiom for filtering, grouping, even joining by sending syntax within [] to the corresponding operations in sparkR.

in particular, this would just amount to (in essence) aliasing sparkR functions in a syntax friendlier for data.table regulars.

is this what people have in mind?

MichaelChirico on 18 May 2018

👍4 ❤1

No updates, there is a lot to dev around data.table itself, so external interfacing is not that high priority now. Instead of just spark integration it makes more sense to integrate to dplyr, something like dplyr.table (inverse of dtplyr). Then any dplyr backend will work on data.table syntax.

jangorecki on 18 May 2018

👍1

indeed, dplyr SQL interface is good enough.

harryprince on 12 Aug 2018

@jangorecki The problem with that is that will also be slow/full of bugs/ and very unstable as dplyr interface changes on almost daily basis and sometimes they change the whole idiom at once like they did with lazyeval > rlang > tidyeval and G-d knows else what as I long time lost track. Not to mention that hadley, who once stated (with a "tongue in cheeck") that data.table uses "cryptic shortcuts" now masks few of these shortcuts and suddenly don't consider them so cryptic anymore. In short, creating such an API would be a full time job IMO.

I think migrating a few main functionalities from data.table and keep adding them when there is time would be much safer/easier.

DavidArenburg on 12 Aug 2018

👍3

@DavidArenburg agree, thus I would suggest to wait at least till dplyr 1.0 before starting any serious dev of such dplyr.table interface.

jangorecki on 24 Apr 2019

👍1

Have you arrived at a conceivable roadmap for a spark integration project (reverse dtplyr or any other form) given that dplyr 1.0 has been released? It would be great to hear your thoughts now that some time has passed.

griipen on 27 Oct 2020

👍1

@jangorecki if the dplyr.table approach is key, or any backend for that matter, it seems like higher integration with data.table would be necessary. Take i which can be:

A character that is NSE to a data.table to do key matching
A logical that is length 1 or nrow(dt) with potential for notjoin NSE
A numeric vector (or single-column numeric matrix) with potential notjoin NSE
A list-like that produces a join or anti-join

While working on #4585, I also worked on functions to process the isub to their end points but did not implement in the PR because with all the variables needed to process the isub, it was not clean. However, to implement a backend, a function to process the isub would be useful so that NSE would be processed consistently. Otherwise, it would be very easy for dplyr.table i processing to become out of sync of data.table processing.

ColeMiller1 on 15 Nov 2020

👍1

That make sense but for that we have to make multiple new helpers to process internal logic of understanding input arguments, and then export them so such tool can easily mimic this logic. Describing our current API with use of helpers is not that trivial task. See related #852

jangorecki on 15 Nov 2020

@ jangorecki
I sincerely hope data.table will become a viable solution on itself and help to completely avoid the dplyr / tidyverse fluff when (and not only when!) interacting with large datasets and out-of-memory computing. So far I find data.table a marvel of clarity and efficiency, almost perfect and excellent integration with mlr3verse, to give an example.