Pandas: pd.io.gspread.read_frame - from Google Spreadsheet to Pandas DataFrame

Created on 28 Sep 2013  路  28Comments  路  Source: pandas-dev/pandas

Google Spreadsheet is an online spreadsheet.
Google Document can be use to generate survey and results of survey will be stored as a Google Spreadsheet document.

Maybe Pandas should provide a pd.io.gspread.read_frame function that will read a given Google Spreadsheet document (using email, password, url or name of document and range) and return a DataFrame.

pd.io.gspread.read_frame(email, password, filename, sheet, cell_range)

gspread package could help
http://burnash.github.io/gspread/

I wrote a little bit of code for that... but it could probably be improve and add into Pandas.

email = '...@...'
password = '...'
cell_range = 'A1:R20'

gc = gspread.login(email, password)
wks = gc.open(filename).sheet1

cell_list = wks.range('')

# Build a NumPy array
(row, col) = (cell_list[-1].row, cell_list[-1].col)
data = np.empty((row-1,col), dtype=object)
data[:] = np.nan

k = 0
cols = []
for i in range(row):
    for j in range(col):
        val = cell_list[k].value
        if i==0:
            if val != None:
                if val not in cols:
                    cols.append(val)
                else: # add a number if colname ever exists
                    ii = 1
                    while True:
                        new_val = val + '_' + str(ii)
                        if new_val not in cols:
                            break
                        ii += 1
                    cols.append(new_val)
            else:
                cols.append('col_'+str(j))
                #cols.append(j)
        else:
            if val != None:
                data[i-1, j] = val
        k += 1
Enhancement IO Data IO Google

Most helpful comment

Interesting. I'm not totally clear on what you're trying to do above - why are you special casing row 0 (instead of just doing something with row 0 first and then changing the line to: for i in range(1, row): instead? It's also unclear why you're using k and j in a for loop, but maybe I'm missing something.

Based on a quick read of gspread, the easiest thing to do would be something like:

values = wks.get_all_values()
header, rows = values[0], values[1:]
df = DataFrame(rows, columns=header)

That said, the above is particularly inefficient because it first stores all of the cells in Cell objects, creates a defaultdict, stores everything, then uses a double list comprehension to get all the data.

1 big question: Why would you use gspread over the Google Data API? Are their particular advantages?

All 28 comments

Interesting. I'm not totally clear on what you're trying to do above - why are you special casing row 0 (instead of just doing something with row 0 first and then changing the line to: for i in range(1, row): instead? It's also unclear why you're using k and j in a for loop, but maybe I'm missing something.

Based on a quick read of gspread, the easiest thing to do would be something like:

values = wks.get_all_values()
header, rows = values[0], values[1:]
df = DataFrame(rows, columns=header)

That said, the above is particularly inefficient because it first stores all of the cells in Cell objects, creates a defaultdict, stores everything, then uses a double list comprehension to get all the data.

1 big question: Why would you use gspread over the Google Data API? Are their particular advantages?

Thanks for improving this script

I'm using other index to have unique column name.
Because type(df[colname]) is Pandas DataFrame (instead of Pandas Series)

I don't know Google Data API so I can't say.

Strong +1 on this! Couple of things:

  1. It would be nice to support OAuth authentication, if possible, so that users don't have to pass plain text passwords around. This also enables users to not have to log in every time they want to use this. I would not want to store my password in a script so that I could reuse this as part of some larger code base. If this is already supported, then maybe some documentation about it is in order. If I missed that documentation, my apologies.
  2. You can probably use pandas.io.parsers.TextReader to parse the text. You just have to get the values into a list of lists of strings.

@cpcloud - I'm guessing that oauth and separate authentication probably require using the Google Data API bindings (though gspread looks to have a decent amount of activity). The good part is that it should be relatively easy to test because we could set up a Google Apps account (and I wonder if it's possible to set up private Travis credentials). I was thinking the same thing about passing through the text parser.

@c0indev3l This type(df[colname]) should not return DataFrame. You're not using a Panel are you?

Also, minor point: should probably call this pandas.io.read_drive (as in Google Drive).

Related: #4093

@cpcloud so it's Google Drive because we'd only read spreadsheets?

not necessarily, but that's the name of the service

not stuck on that name...let's get a working prototype b4 we start bikeshedding about names :)

more wondering if it's the same as the older drive issue?

i think it is, close the other? this one is further along

Google Drive also provide Google Table Fusion http://www.google.com/drive/apps.html#fusiontables

hm okay...well maybe read_google.... still think we should just get something working...

Maybe
pd.io.gdrive.read_spreadsheet for reading spreadsheet,
pd.io.gdrive.read_table for reading a Table Fusion

but that's right that giving a name is probably not so important

What is important in my mind is also to wonder what kind of API Pandas will provide.

For example we can get a document using either its name or its url.

Does Pandas should "recognize" if what is given is a url ?
Or user should explicitly send parameter as url='....' or name='...'

we have some code that attempts to detect urls in pandas/io/common.py

What will happen if the name of my document "looks like" an url ?

Try to name your spreadsheet document http://www.google.com/

dealing with spreadsheet names should happen way after dealing with the url of the workbook

So you won't be able to access to a spreadsheet with a name which will looks like an url.

An other question is what will happen if a spreadsheet will have columns with same name. Should the name be automatically incremented ? Or does Pandas provides a function to do this ?

@c0indev3l instead of hashing these details out now, can I suggest a process for this?

  1. Assume we have the correct worksheet (i.e., either through gspread or the Google Data API), set up the processing for that specific worksheet.
  2. Add in the ability to pass all the text parser kwargs that are appropriate through the function.
  3. Add the ability to get the correct worksheet given a spreadsheet in gspread.

Then, after we have (1) - (3) done, we can do whatever we want with deciding how to get the appropriate worksheet, since the worksheet parsing part won't have to care about it.

@c0indev3l echoing @jtratner: Work out the parsing first (since that's the core feature being introduced here) then worry about the ancillary details (these are important too, but without correct text parsing they don't matter).

An other useful feature will be to automatically rename columns to have unique column name (like with pd.io.excel.read_excel). I asked about it on StackOverflow http://stackoverflow.com/questions/19071622/automatically-rename-columns-to-ensure-they-are-unique and I get this piece of code

df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']

def uniquify(df_columns):
    seen = set()

    for item in df_columns:
        fudge = 1
        newitem = item

        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)

        yield newitem
       seen.add(newitem)

list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']

But Pandas probably ever provide a way to "uniquify" columns name (I should have a look at read_excel code to understand that part of Pandas)

I am not sure of the scope of this issue?

@cpcloud

i think feature creep for us, but interesting for an external library/module that does ... this requires authentication handling and setting up keys and such for an app ... would be nice for a lib that provide the automatiion for whatever can be automated and then uses pandas to parse the json or html (can't remember how these are delivered in API)

Anything happened since this last post around panda / Google Spreadsheet connectivity?
I'm really curious about a OAuth2 supported Spreadsheet alternative to Pandas Excel interfaces.

Here is library that provides possibilities for uploading/downloading between spreadsheets and pandas DataFrame, for those who tries to find solutions in this issue.

@maybelinot thx for the reply, very appreciated, I try to give it a run soon!

Alternative solution - using gspread library:

gc = gspread.authorize(credentials)
wks = gc.open('Spread Sheet Title')
df = pd.DataFrame(wks.sheet1.get_all_records())

I think this is a great idea that could hugely help pandas users, and even attract new ones.

@kenhui521 What would the credentials include ?

Was this page helpful?
0 / 5 - 0 ratings