Pandas: Cannot convert numbers to strings when reading an Excel spreadsheet

Created on 9 Jan 2014 · 34 comments · Source: pandas-dev/pandas

I'm reading some excel spreadsheets (xlsx format) into pandas using read_excel, which generally works great. The problem I have is that when a column contains numbers, pandas converts these to float64 type, and I would like them to be treated as strings. After reading them in, I can convert the column to str:

 my_frame.my_col = my_frame.my_col.astype('str') 

This works as far as assigning the right type to the column, but when I view the values in this column, the strings are formatted in scientific notation, e.g. 8.027770e+14, which is not what I want. I'd like to be able to tell Pandas to read the columns as strings. My current solution involves dropping down to xlrd to read the spreadsheet.
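The display problem and one workaround can be sketched without Excel at all, using only pandas (the value here is invented for illustration):

```python
import pandas as pd

# A 15-digit code read from Excel arrives as float64.
s = pd.Series([802777000000000.0])

# The default float display (and Python 2's str()) can render this in
# scientific notation, e.g. 8.027770e+14. Formatting each value
# explicitly produces the plain digits instead:
fixed = s.map("{:.0f}".format)
print(fixed.iloc[0])  # '802777000000000'
```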

See stackoverflow question: http://stackoverflow.com/q/20970483/690890

Bug Dtypes IO CSV IO Excel


All 34 comments

have you tried passing the dtype argument? (it's documented in read_csv)

Thanks, but the dtype argument isn't available for read_excel

did you try it? it passes other keywords through.

Yes. I get 'ValueError: dtype is not supported with python parser'

ah... sorry... related: #5232 is currently an open bug

not sure if there is a workaround ATM

@jtratner ?

I'd like to be able to tell Pandas to read the columns as strings. My current solution involves dropping down to xlrd to read the spreadsheet.

How exactly are you doing that?

As far as I can see from the xlrd code, when it reads a number from an xlsx file it immediately converts it to a float before returning the cell object to the user.

So unless I am missing something I don't see how Pandas can be made to read the columns as strings. It will always get a float from xlrd for an xls or xlsx file if Excel has stored them as numbers.

The CSV readers behave differently and can be type-specified because they return all read data as strings, regardless of whether it looks like a number or not.

Perhaps @jreback or @jtratner can suggest a different way to convert a column from float64 to formatted strings.

But apart from a workaround I don't think this issue is fixable in the way that you would like.

Have you tried this in 0.13? If all your numbers are ints, pandas will read them in as ints (recent change in 0.13) and you can just do x.astype(str) to get what you want, albeit not necessarily in a particularly fast way. That said, if you're converting them to strings, they'll all be object dtype, so you're going to be using more memory and space. All numbers are stored as float in Excel, so you can't really get around this...
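A minimal sketch of the difference described above (values invented):

```python
import pandas as pd

ints = pd.Series([99, 100, 7453])        # read in as int64 under 0.13
floats = pd.Series([99.0, 100.0, 7453.0])

# Integer columns convert to clean strings...
print(ints.astype(str).tolist())    # ['99', '100', '7453']

# ...while float columns keep the trailing '.0'.
print(floats.astype(str).tolist())  # ['99.0', '100.0', '7453.0']
```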

closing as fixed in 0.13

why is this fixed? There is still no way of reading excel file columns as objects.

see @jmcnamara comment

unless there is an easy way to tell xlrd to read it as not a number (eg do not do any conversions)

Would it be possible then to keep every column as a float? Because this means datetime columns in excel are very hard to handle (cannot read as text, cannot convert as text)

I believe that xlrd converts datetime properly (iirc you have to specify what is a date column)

is there a problem with some conversion?

Well, in my case I don't know the file structure beforehand, and hence I cannot specify which columns are dates at the moment of import.

Maybe there is another way I can detect the date columns before using the read_excel function?

Maybe there is another way I can detect the date columns before using the read_excel function

Not easily.

Dates in Excel are stored as real numbers. The only thing that distinguishes them from other real numbers is the number format that is applied to them. These number formats aren't generally stored with the number in the Excel file and in general some level of heuristic is required to determine if the cell contains a date.

Xlrd does a lot of work around this so you don't have to.
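To illustrate: in Excel's default 1900 date system a date is just a day count, which can be decoded against an epoch of 1899-12-30. This is a minimal sketch of the idea behind what xlrd's xldate module does, ignoring the edge cases xlrd handles (including Excel's fictitious 1900-02-29):

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial):
    # The 1899-12-30 epoch compensates for Excel's leap-year bug
    # for all dates after 1900-02-28; fractions of a day are time.
    return datetime(1899, 12, 30) + timedelta(days=serial)

# Round trip against a known date: datetime -> serial -> datetime.
serial = (datetime(2014, 1, 9) - datetime(1899, 12, 30)).days
print(excel_serial_to_datetime(serial))        # 2014-01-09 00:00:00
print(excel_serial_to_datetime(serial + 0.5))  # 2014-01-09 12:00:00
```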

However, could we take a step back and ask what you are trying to do?

well, my company receives several files of client data (normally Excel files) on a constant basis. I am in charge of ensuring that data is formatted according to our main import logic.

I built a Flask app to help our account managers format the files. This app relies on pandas heavily, because pandas is amazing.

However, since these files are coming straight from clients, my app doesn't know what is on each file, so each account manager uses my app to decide if a column should be an int, or a date, etc.

The way I built the app, it turns every column into a dtype object column and then formats every column according to the requirements.

This works great with csv files (because of the dtype arg), but with excel files the date columns are giving me a hard time.

read_json does this type of inference,
either looking at likely columns by name or simply trying various conversion frequencies (eg may be stored as seconds/ms/ns since epoch); you can try to convert columns
just using pd.to_datetime with a try/except (and errors set to raise)
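That heuristic can be sketched as a per-column try/except (the sample frame and column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "when": ["2014-01-09", "2014-03-20"],  # looks like dates
    "name": ["alpha", "beta"],             # clearly not dates
})

for col in df.columns:
    try:
        # errors='raise' makes a failed parse throw instead of
        # silently producing NaT, so non-date columns are left alone.
        converted = pd.to_datetime(df[col], errors="raise")
    except (ValueError, TypeError):
        continue
    df[col] = converted

print(df.dtypes)
```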

I have the same problem with numbers and strings, trying to convert an xls/xlsx to csv.
In my case there are columns with codes like "07453", "000314"..

Reading from excel, pandas reads them as integers and cuts all the leading 0's, resulting in "7453", "314".. (which is wrong in my case).
At that point I think there is no workaround for my problem

@andrea-lascola: I have the same problem with numbers and strings, trying to convert an xls/xlsx to csv. In my case there are columns with codes like "07453", "000314"..

Pandas uses the xlrd module for reading Excel files and xlrd preserves Excel's types when there is a corresponding type in Python.

Therefore if 07453 is read as a number and not a string it is likely that it is stored in the Excel file as the number 7453 and then has a format applied to it like 00000.

You can check that yourself by selecting that cell in Excel and changing the format to General.

If the data is actually a string but is being read as a number then that would be an issue.

If not then there isn't any workaround when reading the data. Perhaps it would be possible to reformat the data prior to writing it to the CSV file. @jreback might have some suggestions there.

The data is stored as "Text" as shown in the screenshot in the attachment ("Testo" means "Text" in Italian :)). I'll continue the investigation, thanks for the support
screen shot 2014-03-20 at 11 28 09

I'm debugging the pandas "xl.parse" function and I found that the data array changes from:

[u'010170' u'010185' u'010204' u'010291' u'010605' u'010787', ....]

to :

[ 10170 10185 10204 10291 10605 10787 15049 15387 ....]

after the pandas parser call to "lib.maybe_convert_numeric", as shown in the screenshot:
screen shot 2014-03-20 at 12 20 50

it would be awesome if xl.parse exposed "try_num_bool" as an input param to avoid the conversion when not needed.

Just to say that the solution I adopted was to use xlrd to inspect the type and then convert if necessary. This worked for my use case (which was avoid large number-like strings being converted to scientific notation; I don't think it will help with the removal of leading zeroes):

import xlrd

wb = xlrd.open_workbook(file)
sheet = wb.sheet_by_name('Sheet1')
column = sheet.col(1)
# if values were extracted as numbers (ctype == 2), convert to int then str;
# otherwise just convert the value to str (also covering headers)
numbers = [str(int(c.value)) if c.ctype == 2 else str(c.value) for c in column]

Maybe it would be possible to use xlrd to create a csv tempfile that pandas would then import with the same arguments as read_csv?
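That bridge might look something like this (a sketch: `sheet` is assumed to be an xlrd Sheet object, and ctype 2 is xlrd's marker for numeric cells):

```python
import csv

def sheet_to_csv(sheet, csv_path):
    """Dump an xlrd sheet to CSV, stringifying every cell.

    Numeric cells (ctype == 2) holding whole numbers are written
    without the float '.0' suffix; everything else is written as-is.
    """
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        for r in range(sheet.nrows):
            row = []
            for cell in sheet.row(r):
                if cell.ctype == 2 and cell.value == int(cell.value):
                    row.append(str(int(cell.value)))
                else:
                    row.append(str(cell.value))
            writer.writerow(row)
```

read_csv on the resulting file would then give the usual per-column dtype control.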

the 'real' soln here is to support a dtype={} argument similar to read_csv, see issue #4932, which would allow you to specify that certain columns can be retained as object dtype (in other words they wouldn't be coerced).

PR anyone?
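For comparison, this is what the read_csv dtype behaviour being proposed for read_excel looks like (sample data invented):

```python
import io
import pandas as pd

data = "code,qty\n07453,1\n000314,2\n"

# Without dtype, 'code' is inferred as int64 and loses its zeros.
inferred = pd.read_csv(io.StringIO(data))
print(inferred["code"].tolist())  # [7453, 314]

# dtype pins the column to str, preserving the leading zeros.
kept = pd.read_csv(io.StringIO(data), dtype={"code": str})
print(kept["code"].tolist())      # ['07453', '000314']
```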

Exactly, that would be the solution.
But you mentioned that converting data types is not possible straight from xlrd, so we would need to add an intermediate processing step, is that correct?

not sure
but there is some type inference going on: by xlrd, then further in excel.py, and finally when the resultant DataFrame is actually constructed

I vote for the suggested solution, which is to support dtype for Excel files.

I can't believe it; I have the same problem right now and I have to use something else to read Excel files because of this issue.

I'm having an issue with this too. @jreback -- there is an issue with xlrd: https://github.com/python-excel/xlrd/pull/77.

It appears to clip millisecond information off of the column when reading a time. My fall back was going to be reading the column as a string and then manually using strptime or something. But since pandas elects to punt entirely to xlrd, it seems not possible.

I suppose my best option right now is to open the files manually and re-save them as csv, then load them with read_csv so I can use the dtype argument. Is that correct?

@spearsem I think that is right

@spearsem I fixed the millisecond issue in #5945.

@spearsem Try upgrade to the latest version of xlrd.

Here is a workaround to the problem of leading zeros. If you know that your column has a fixed width, you can first turn the column into a string, and then restore the missing leading zeros with the string method zfill, as done here: http://stackoverflow.com/questions/339007/nicest-way-to-pad-zeroes-to-string
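A minimal sketch of that workaround, assuming a known fixed width of 5:

```python
import pandas as pd

codes = pd.Series([7453, 314])  # leading zeros already lost on read

# Stringify, then left-pad each value back to the fixed width.
restored = codes.astype(str).str.zfill(5)
print(restored.tolist())  # ['07453', '00314']
```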

Overall, I agree that read_excel should have a dtype option.

where are we on getting read_excel to have a dtype option? I'm reading in an excessively large Excel file which has over 1000 columns and 33 thousand rows. The cell values consist of ints, floats, and strings, but it seems that read_excel is treating all the numerics (ints and floats) as floats. So a 99 is becoming 99.0. Having the ability to specify that everything be treated as a string would be ideal.

see the open issue, #5342; pull requests are welcome for that. It's actually pretty easy. Though you realize that you generally _don't_ need it, only to specifically make things object. Almost all other types (including dates) are supplied via Excel already.

Furthermore, treating ints/floats as strings is NOT good practice at all, and can lead to completely non-performant code.
