Kedro: Integrate datatable into Kedro extras datasets

Created on 28 Oct 2020 · 7 Comments · Source: quantumblacklabs/kedro

Description

Add datatable to Kedro's extras datasets

Context

datatable is derived from R's data.table. As of 20/09/2020 (version 0.11.0), datatable is fully supported on Windows, Linux and macOS and can be installed through pip. Offering datatable in Kedro's extras datasets could bring several advantages when building pipelines:

It would make the transition for R-users wanting to use Kedro a lot easier.
datatable supports CSV, XLSX, and plain text as well as .tar, .gz, .zip, .gz2, and .tgz.
It has certain performance advantages over pandas (especially for datasets larger than 50 GB): benchmark
fread is really fast at reading large CSVs. It also automatically detects separators.
datatable's df.to_csv has an append parameter: if the file does not exist yet, the frame is written with its column names; if the file already exists, the rows are appended and the column names are not written again. pandas' to_csv offers "mode='a'" and "header=True" parameters, which on their own mean choosing between rewriting the column names with each append and not having any column names. The datatable behaviour is really practical for pipelines where you add new data from an IncrementalDataSet folder to a concatenated CSV.
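For comparison, the append behaviour described in the last point can be approximated in pandas by deciding the header flag at call time. This is only a minimal sketch: the helper name append_csv is hypothetical, and it assumes a local file path rather than a Kedro filesystem abstraction.

```python
import os
import tempfile

import pandas as pd


def append_csv(df: pd.DataFrame, path: str) -> None:
    """Append df to path, writing column names only when the file is new or empty.

    Hypothetical helper, not part of pandas or Kedro.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode="a", header=write_header, index=False)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        target = os.path.join(tmp_dir, "concatenated.csv")
        append_csv(pd.DataFrame({"a": [1], "b": [2]}), target)
        append_csv(pd.DataFrame({"a": [3], "b": [4]}), target)
        with open(target) as handle:
            print(handle.read())
```

With datatable the existence check is not needed, since passing append=True to to_csv is described above as handling both cases.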

Possible Implementation

Using the code from pandas.CSVDataSet, we could change "pd.read_csv(fs_file, **self._load_args)" to "dt.fread(fs_file, **self._load_args)". Saving the file should use similar code, albeit with different parameter names.
For ExcelDataSets, we could also use code from pandas.ExcelDataSet. Datatable also uses the xlrd engine for xlsx. Replace "pd.read_excel(fs_file, **self._load_args)" with "dt.fread(fs_file, **self._load_args)". Datatable does not output xlsx, so either we do not implement ExcelDataSets for datatable, or we convert the datatable Frame to a pd.DataFrame and write to Excel using pd.DataFrame.to_excel.
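A rough sketch of the CSV part of this idea is shown below. It is not the code from the thread: the class name DatatableCSVSketch is hypothetical, it deliberately does not inherit from Kedro's AbstractDataSet, and it assumes local file paths only. datatable is imported lazily inside load so the class can be defined even where the optional dependency is missing.

```python
from pathlib import Path
from typing import Any, Dict, Optional


class DatatableCSVSketch:
    """Hypothetical stand-in for a datatable-backed CSV dataset (local files only)."""

    def __init__(
        self,
        filepath: str,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._filepath = Path(filepath)
        # Keyword arguments forwarded to dt.fread and Frame.to_csv respectively.
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})

    def load(self):
        import datatable as dt  # optional dependency, imported on first use

        return dt.fread(str(self._filepath), **self._load_args)

    def save(self, frame) -> None:
        # frame is expected to be a datatable.Frame.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        frame.to_csv(str(self._filepath), **self._save_args)

    def exists(self) -> bool:
        return self._filepath.is_file()

    def describe(self) -> Dict[str, Any]:
        return {
            "filepath": str(self._filepath),
            "load_args": self._load_args,
            "save_args": self._save_args,
        }
```

Passing save_args={"append": True} would give the append-on-save behaviour discussed in the Context section. A real Kedro dataset would additionally need the versioning and fsspec handling that pandas.CSVDataSet provides.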

Medium Feature Request Sprint Activity

Source

lucasjamar

Most helpful comment

@DmitriiDeriabinQB thanks, and sorry for not getting back earlier. I have already developed a version of the CSV dataset and Excel dataset that I use in my current kedro project as an extras dataset. I haven't had more time to spend on trying to contribute it because of some issues. @mlisovyi, would you be interested in my current code for these two datasets, or would you like to start from scratch? This zip is the folder that I currently place in the folder src//extras/dataset. The code is 95% derived from the pandas equivalents. The current implementation of the CSV dataset works fine for me, but I still haven't used the Excel dataset.
datatable.zip

lucasjamar on 13 Nov 2020

👍2

All 7 comments

@lucasjamar Thanks for the suggestion. I've logged the request for it internally, however no timeline/priority can be provided at this time. If this is something of high importance for you and you have an implementation in mind, please feel free to go ahead and contribute this feature to Kedro 🙂

DmitriiDeriabinQB on 5 Nov 2020

I'm interested in trying to implement it.

mlisovyi on 12 Nov 2020

🎉1 👍1

lucasjamar on 13 Nov 2020

👍2

Hi @lucasjamar. Thanks for the code! What challenges did you encounter? Was there anything that would prevent you from pushing it "as is"?

mlisovyi on 13 Nov 2020

Hi @mlisovyi,

The CSV dataset appears to work fine and I think it could go straight to testing (let me know if you spot any issues). I just haven't been able to set up the kedro environment on my Windows machine. I haven't been able to make the Excel dataset work, however (both xlsx and xls). I think it's an issue with the engine, but I haven't found any Excel engine options in fread:

https://datatable.readthedocs.io/en/latest/api/dt/fread.html
https://github.com/h2oai/datatable/blob/140a3abdae94e77badf608936b7bdbd5b091d6e5/src/core/read/py_fread.cc#L239-L286

lucasjamar on 16 Nov 2020

@lucasjamar Thanks for the feedback. Yes, I had the same problem. In the end I decided to narrow it down to CSV only. What do you think?
In the meantime I've been wondering whether, in general, one wants the actual reader to use datatable. If the purpose is to benefit from faster reading, then it's unavoidable. If the purpose is to make use of the R-user-friendly API for tables, then maybe it makes more sense to have wrappers around the pandas dataset classes that convert a pd.DataFrame to a datatable.Frame after reading, and back before writing (if it is possible to do this efficiently).

mlisovyi on 17 Nov 2020

@mlisovyi. Yep, I think it's best to stick to just CSV as a first step. The main advantage of datatable, in my opinion, is that it is really fast at reading CSVs and it has the append parameter for writing. This is quite practical for data collection and concatenation tasks. Coming from R, I do not believe that someone familiar with data.table will have any more trouble with the pandas syntax than with the datatable syntax. Plus, datatable is still in its beta phase. In my opinion, it lacks a few key features to be a true equivalent to pandas. Most notably, it does not have a date-time type yet.

lucasjamar on 17 Nov 2020
