Incubator-superset: Allow users to import CSV as datasource

Created on 20 Apr 2016 · 25 Comments · Source: apache/incubator-superset

Hi there,
Is there any plan to add support for uploading CSV data as a data source?

Maybe using sqlite3:
http://stackoverflow.com/questions/2580497/database-on-the-fly-with-scripting-languages

Thanks

request help-wanted

Most helpful comment

any updates of this feature

All 25 comments

It should probably be done at the database level, maybe an upload icon in the database list view.

pandas has some utility functions that make that trivial: first load the CSV into a dataframe, then upload it to the db.
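A minimal sketch of that two-step flow (the file contents, table name, and in-memory database here are hypothetical stand-ins, not Superset code):

```python
import sqlite3
import pandas as pd
from io import StringIO

# Stand-in for an uploaded CSV file (hypothetical data)
csv_data = StringIO("city,population\nParis,2148000\nLyon,513000\n")

df = pd.read_csv(csv_data)          # 1. load the CSV into a dataframe
conn = sqlite3.connect(":memory:")  # 2. connect to the target database
df.to_sql("uploaded_csv", conn, if_exists="replace", index=False)  # 3. upload

# The table is now queryable like any other
rows = conn.execute("SELECT city FROM uploaded_csv").fetchall()
```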

+1, would love this feature.

This would be a very handy feature for a data mining system.

But I'd be curious how the uploaded CSV file / data would be saved.

Will the CSV file always be read and parsed during dashboard / graph generation?

Or does the user have to select an existing database to save the CSV data to?

This would require the database user to have insert, or even create table, permission on the data source, which is not necessary in the current design.

Any thoughts here?

I'd be willing to have a go at this. Something like:

  • Drag and drop a csv file and/or upload button on the page listing SQLA tables
  • Use pandas to parse file
  • Use pandas to write a single table sqlite database (might need an additional option in the config, USER_UPLOADED_DB_DIR or something)
  • Add metadata to caravel's db
  • Use table as any other
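A rough server-side sketch of the middle steps, assuming the proposed USER_UPLOADED_DB_DIR config value (the function name and paths are illustrative, not Caravel's actual API):

```python
import os
import sqlite3
import pandas as pd

# Hypothetical config value, as proposed above
USER_UPLOADED_DB_DIR = "/tmp/caravel_uploads"

def import_csv(csv_path, table_name):
    """Parse a CSV with pandas and write it to a single-table sqlite db."""
    os.makedirs(USER_UPLOADED_DB_DIR, exist_ok=True)
    db_path = os.path.join(USER_UPLOADED_DB_DIR, table_name + ".db")

    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table_name, conn, if_exists="replace", index=False)
    # Next step (not shown): register db_path and table_name in caravel's
    # metadata db so the table can be used like any other.
    return db_path
```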

Additional bonus is this could make replicating/debugging others' problems easier.

@mistercrunch any thoughts?

For all those wanting to use CSV in the interim, I had success using a csv2sqlite script, as detailed here: https://github.com/FOIA-data-hackathon/MuckRock-Caravel

@mistercrunch, any updates on this issue? There is a workaround via miserlou but I was hoping to make a contribution.

@andrewhn, did you make any progress with this? If so, could I see your code?

@andrewhn @SalehHindi Has anyone given this a go yet? We also think this would be a great feature, but would be keen to hear of any new approaches that did/didn't work.

For the record, I would suggest to anyone who wants to tackle this to use the following pandas methods, and expose as much as is possible/reasonable from their API in the upload form:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Basically you'd have a form with sensible defaults and options based on the pandas api.
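To illustrate, a handful of form fields could map one-to-one onto read_csv/to_sql keyword arguments (the field names and sample data below are made up for the sketch):

```python
import sqlite3
import pandas as pd
from io import StringIO

# Hypothetical form submission: each field mirrors a pandas option
form = {
    "sep": ",",            # read_csv(sep=...)
    "header": 0,           # read_csv(header=...)
    "skiprows": 0,         # read_csv(skiprows=...)
    "if_exists": "fail",   # to_sql(if_exists=...): fail / replace / append
    "index": False,        # to_sql(index=...)
}

csv_file = StringIO("id,name\n1,alpha\n2,beta\n")
df = pd.read_csv(csv_file, sep=form["sep"], header=form["header"],
                 skiprows=form["skiprows"])

conn = sqlite3.connect(":memory:")
df.to_sql("form_upload", conn, if_exists=form["if_exists"], index=form["index"])
count = conn.execute("SELECT COUNT(*) FROM form_upload").fetchone()[0]
```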

@mistercrunch, thanks for laying that out. I think I'll make an attempt.

@mistercrunch @SalehHindi

I've made a start on this on a csv-import branch on my fork. The basic functionality is in place but needs some testing and additional validation on the fields presented by the new form.

A button to import CSV has been added on the 'sources->database' page that brings up a new form exposing most of the pandas api. The CSV is added as a new table to an existing database. It can then be added like any other table on the 'sources->tables' page.

@axitkhurana I was swamped with finals week last week so I didn't get to it; go for it.

Nice one @Ryan4815, is there any update on your solution?

@axitkhurana @Simeon-ayo I believe that @SalehHindi is going to add some tests to the branch and get it prepped for a merge request.

Pandas is quite memory hungry. I can't load a sparse 1GB csv file on my 16GB system due to MemoryError.

Plot.ly offers a tutorial on how to convert a CSV to SQLite chunk by chunk to avoid eating all the memory. https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/.

It's probably useful for superset to use a similar conversion step so an arbitrarily sized csv can be converted.
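The chunked approach in that tutorial boils down to read_csv(chunksize=...) plus to_sql(if_exists='append'), so only one chunk is in memory at a time; a minimal sketch with toy data:

```python
import sqlite3
import pandas as pd
from io import StringIO

# Stand-in for an arbitrarily large CSV (10 data rows here)
big_csv = StringIO("x,y\n" + "\n".join("%d,%d" % (i, i * i) for i in range(10)))

conn = sqlite3.connect(":memory:")
# Read 3 rows at a time; each chunk is appended and then discarded
for chunk in pd.read_csv(big_csv, chunksize=3):
    chunk.to_sql("big_table", conn, if_exists="append", index=False)

total = conn.execute("SELECT COUNT(*) FROM big_table").fetchone()[0]
```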

If speed is an issue, since pandas.read_csv is single-threaded, an alternative is paratext https://github.com/wiseio/paratext. Its load_csv_to_pandas function uses all cores and is much faster than pandas.
It doesn't solve the whole memory issue, though: while it's quite efficient while reading and processing the CSV, the final conversion to a pandas dataframe uses as much memory as pandas alone.

Nice catch @mratsim and thanks for the link.
I'm currently tidying up my code and preparing to do a pull request for this issue. @mistercrunch, do you think it's ok if I go ahead and do the pull request for the current issue and include @mratsim's suggestion in another pull request?

Any updates on this feature

any updates of this feature


Update: I'm now working on this issue, continuing from @SalehHindi 's last commit to the csv-import branch. When I run the code, I don't see any "Add CSV Table to Database" button. Can you tell what I might be doing wrong?
You can look at the code I'm running on my fork (https://github.com/timifasubaa/incubator-superset) on the branch named import_csv.
Also, please post a snapshot of the new flow (e.g. the page with the new button, etc.).

hey @timifasubaa.... the method I use is to transform the data with pandas... all this is done on the server (back end), then I put the data in sqlite as a .db file... see below code that, if you replicate it, will easily import any csv. After you do this, go to superset data sources -> databases and input 'sqlite:///sites_4g.db' (in my example below I created a db with the name sites_4g.db)

import sqlite3
import pandas as pd

# Load the CSV into a dataframe
pdsites = pd.read_csv("site_data.csv")
pdsites.columns

def df2sqlite(dataframe, db_name="import.sqlite", tbl_name="import"):
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()

    # One "?" placeholder per column
    wildcards = ','.join(['?'] * len(dataframe.columns))
    data = [tuple(x) for x in dataframe.values]

    cur.execute("drop table if exists %s" % tbl_name)

    # Quote column names so headers with spaces or keywords survive
    col_str = '"' + '","'.join(dataframe.columns) + '"'
    cur.execute("create table %s (%s)" % (tbl_name, col_str))

    cur.executemany("insert into %s values(%s)" % (tbl_name, wildcards), data)

    conn.commit()
    conn.close()

df2sqlite(pdsites, db_name="sites_4g.db", tbl_name="sites_data_4g")

# Create your connection and read the table back to verify.
cnx = sqlite3.connect('sites_4g.db')

df = pd.read_sql_query("SELECT * FROM sites_data_4g", cnx)
df.head(5)

Then go to superset data sources -> databases and input 'sqlite:///sites_4g.db'


Hey @hillaryhitch, @timifasubaa, thanks for the comment. I just started a new job so this fell off my radar but I will push up my tests/updates/screenshots for this feature tonight after work so people can start using this.

Notice: this issue has been closed because it has been inactive for 230 days. Feel free to comment and request for this issue to be reopened.

Would love to work on this after March 5, if this feature is not available yet @mistercrunch

I would really like this feature please :)
