Incubator-superset: Allow users to import CSV as datasource

Created on 20 Apr 2016 · 25 Comments · Source: apache/incubator-superset

Hi there,
Is there any plan to add support for uploading CSV data as a data source?

Maybe using sqlite3:
http://stackoverflow.com/questions/2580497/database-on-the-fly-with-scripting-languages

Thanks

request help-wanted

Most helpful comment

any updates of this feature

All 25 comments

It should probably be done at the database level, maybe an upload icon in the database list view.

pandas has some utility functions that make that trivial: first load the CSV into a dataframe, then upload it to the db.
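A minimal sketch of that two-step flow (the file contents, table name, and in-memory database here are hypothetical stand-ins, not Superset code):

```python
import sqlite3
import pandas as pd
from io import StringIO

# Stand-in for an uploaded CSV file (hypothetical data)
csv_data = StringIO("city,population\nParis,2148000\nLyon,513000\n")

df = pd.read_csv(csv_data)          # 1. load the CSV into a dataframe
conn = sqlite3.connect(":memory:")  # 2. connect to the target database
df.to_sql("uploaded_csv", conn, if_exists="replace", index=False)  # 3. upload

# The table is now queryable like any other
rows = conn.execute("SELECT city FROM uploaded_csv").fetchall()
```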

+1, would love this feature.

This would be a very handy feature for a data mining system.

But I'd be curious how the uploaded CSV file / data would be saved.

Will the CSV file always be read and parsed during dashboard / graph generation?

Or does the user have to select an existing database to save the CSV data to?

This would require the database user to have insert, or even create table, permission on the data source, which is not necessary in the current design.

Any thoughts here?

I'd be willing to have a go at this. Something like:

  • Drag and drop a csv file and/or upload button on the page listing SQLA tables
  • Use pandas to parse file
  • Use pandas to write a single table sqlite database (might need an additional option in the config, USER_UPLOADED_DB_DIR or something)
  • Add metadata to caravel's db
  • Use table as any other
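A rough server-side sketch of the middle steps, assuming the proposed USER_UPLOADED_DB_DIR config value (the function name and paths are illustrative, not Caravel's actual API):

```python
import os
import sqlite3
import pandas as pd

# Hypothetical config value, as proposed above
USER_UPLOADED_DB_DIR = "/tmp/caravel_uploads"

def import_csv(csv_path, table_name):
    """Parse a CSV with pandas and write it to a single-table sqlite db."""
    os.makedirs(USER_UPLOADED_DB_DIR, exist_ok=True)
    db_path = os.path.join(USER_UPLOADED_DB_DIR, table_name + ".db")

    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table_name, conn, if_exists="replace", index=False)
    # Next step (not shown): register db_path and table_name in caravel's
    # metadata db so the table can be used like any other.
    return db_path
```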

Additional bonus is this could make replicating/debugging others' problems easier.

@mistercrunch any thoughts?

For all those wanting to use CSV in the interim, I had success using a csv2sqlite script, as detailed here: https://github.com/FOIA-data-hackathon/MuckRock-Caravel

@mistercrunch, any updates on this issue? There is a workaround via miserlou but I was hoping to make a contribution.

@andrewhn, did you make any progress with this? If so, could I see your code?

@andrewhn @SalehHindi Has anyone given this a go yet? We also think this would be a great feature, but would be keen to hear of any new approaches that did/didn't work.

For the record, I would suggest to anyone who wants to tackle this to use the following pandas methods, and expose as much as is possible/reasonable from their API in the upload form:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Basically you'd have a form with sensible defaults and options based on the pandas api.
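To illustrate, a handful of form fields could map one-to-one onto read_csv/to_sql keyword arguments (the field names and sample data below are made up for the sketch):

```python
import sqlite3
import pandas as pd
from io import StringIO

# Hypothetical form submission: each field mirrors a pandas option
form = {
    "sep": ",",            # read_csv(sep=...)
    "header": 0,           # read_csv(header=...)
    "skiprows": 0,         # read_csv(skiprows=...)
    "if_exists": "fail",   # to_sql(if_exists=...): fail / replace / append
    "index": False,        # to_sql(index=...)
}

csv_file = StringIO("id,name\n1,alpha\n2,beta\n")
df = pd.read_csv(csv_file, sep=form["sep"], header=form["header"],
                 skiprows=form["skiprows"])

conn = sqlite3.connect(":memory:")
df.to_sql("form_upload", conn, if_exists=form["if_exists"], index=form["index"])
count = conn.execute("SELECT COUNT(*) FROM form_upload").fetchone()[0]
```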

@mistercrunch, thanks for laying that out. I think I'll make an attempt.

@mistercrunch @SalehHindi

I've made a start on this on a csv-import branch on my fork. The basic functionality is in place but needs some testing and additional validation on the fields presented by the new form.

A button to import CSV has been added on the 'sources->database' page that brings up a new form exposing most of the pandas api. The CSV is added as a new table to an existing database. It can then be added like any other table on the 'sources->tables' page.

@axitkhurana I was swamped with finals week last week so I didn't get to it; go for it.

Nice one @Ryan4815, is there any update on your solution?

@axitkhurana @Simeon-ayo I believe that @SalehHindi is going to add some tests to the branch and get it prepped for a merge request.

Pandas is quite memory hungry. I can't load a sparse 1GB csv file on my 16GB system due to MemoryError.

Plot.ly offers a tutorial on how to convert a CSV to SQLite chunk by chunk to avoid eating all the memory. https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/.

It's probably useful for superset to use a similar conversion step so an arbitrarily sized csv can be converted.
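The chunked approach in that tutorial boils down to read_csv(chunksize=...) plus to_sql(if_exists='append'), so only one chunk is in memory at a time; a minimal sketch with toy data:

```python
import sqlite3
import pandas as pd
from io import StringIO

# Stand-in for an arbitrarily large CSV (10 data rows here)
big_csv = StringIO("x,y\n" + "\n".join("%d,%d" % (i, i * i) for i in range(10)))

conn = sqlite3.connect(":memory:")
# Read 3 rows at a time; each chunk is appended and then discarded
for chunk in pd.read_csv(big_csv, chunksize=3):
    chunk.to_sql("big_table", conn, if_exists="append", index=False)

total = conn.execute("SELECT COUNT(*) FROM big_table").fetchone()[0]
```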

If speed is an issue, since pandas.read_csv is single-threaded, an alternative is paratext https://github.com/wiseio/paratext. Its load_csv_to_pandas function uses all cores and is much faster than pandas.
It doesn't solve the whole memory issue, though: while it's quite efficient while reading and processing the CSV, the final conversion to a pandas dataframe uses as much memory as pandas alone.

Nice catch @mratsim and thanks for the link.
I'm currently tidying up my code and preparing to do a pull request for this issue. @mistercrunch, do you think it's ok if I go ahead and do the pull request for the current issue and include @mratsim's suggestion in another pull request?

Any updates on this feature

any updates of this feature


Update: I'm now working on this issue, continuing from @SalehHindi 's last commit to the csv-import branch. When I run the code, I don't see any "Add CSV Table to Database" button. Can you tell what I might be doing wrong?
You can look at the code I'm running on my fork (https://github.com/timifasubaa/incubator-superset) on the branch named import_csv.
Also, please post a snapshot of the new flow (e.g. the page with the new button, etc.).

hey @timifasubaa.... the method I use is to transform the data with pandas... all this is done on the server (back end), then I put the data in sqlite as a .db file... see below code that, if you replicate it, will easily import any csv. After you do this, go to superset data sources -> databases and input 'sqlite:///sites_4g.db' (in my example below I created a db with the name sites_4g.db)

import sqlite3
import pandas as pd

# Load the CSV into a dataframe
pdsites = pd.read_csv("site_data.csv")
pdsites.columns

def df2sqlite(dataframe, db_name="import.sqlite", tbl_name="import"):
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()

    # One "?" placeholder per column
    wildcards = ','.join(['?'] * len(dataframe.columns))
    data = [tuple(x) for x in dataframe.values]

    cur.execute("drop table if exists %s" % tbl_name)

    # Quote column names so headers with spaces or keywords survive
    col_str = '"' + '","'.join(dataframe.columns) + '"'
    cur.execute("create table %s (%s)" % (tbl_name, col_str))

    cur.executemany("insert into %s values(%s)" % (tbl_name, wildcards), data)

    conn.commit()
    conn.close()

df2sqlite(pdsites, db_name="sites_4g.db", tbl_name="sites_data_4g")

# Create your connection and read the table back to verify.
cnx = sqlite3.connect('sites_4g.db')

df = pd.read_sql_query("SELECT * FROM sites_data_4g", cnx)
df.head(5)

Then go to superset data sources -> databases and input 'sqlite:///sites_4g.db'


Hey @hillaryhitch, @timifasubaa, thanks for the comment. I just started a new job so this fell off my radar but I will push up my tests/updates/screenshots for this feature tonight after work so people can start using this.

Notice: this issue has been closed because it has been inactive for 230 days. Feel free to comment and request for this issue to be reopened.

Would love to work on this after March 5, if this feature is not available yet @mistercrunch

I would really like this feature please :)
