Pandas: No way to force reading numerics as strings in `read_html`

Created on 9 Jul 2015  ·  12 Comments  ·  Source: pandas-dev/pandas

When an HTML table shows 01 in a cell, read_html interprets it as a float and drops the leading 0.
Is there an option to read such cells as strings?

Dtypes Enhancement IO HTML

All 12 comments

Thanks. It would be nice to add a dtype argument like the one in read_csv. I have only looked briefly, but it looks achievable by passing dtype through to TextParser -> TextFileReader.
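For comparison, here is a minimal sketch of the read_csv behaviour this comment refers to. The two-row CSV, its column names and its values are made up for illustration. With default inference the "code" column becomes integers, while dtype={"code": str} keeps the raw text, leading zeros included.

```python
import io

import pandas as pd

# Hypothetical sample data; "code" holds numeric-looking text with leading zeros.
csv_text = "code,name\n01,alpha\n07,beta\n"

# Default type inference parses "01" and "07" as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as str keeps the text exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

print(inferred["code"].tolist())  # [1, 7]
print(as_text["code"].tolist())   # ['01', '07']
```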

PR is welcome:)

Just stumbled across this page with the same issue. @gte620v can you explain how to accomplish the raw html parsing given your PR? Thanks!

Should be something like this: https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715

Just use a converter to convert to str.

@stevenmanton ^

@gte620v thanks for the info. It sounds like you can easily convert back to string, but can't prevent the automatic parsing in the first place, for example to keep the leading zeros in an integer. Thanks again!

@stevenmanton No, it does not convert back to string; it prevents the value from being parsed as numeric in the first place. In any case, leading zeros are preserved if you use converters={'col': str}. See e.g. the example in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-html-content (you have to scroll down a bit):

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.

import pandas as pd

url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})

If you try that example, you will see that the leading zeros are preserved.
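The same effect can be checked offline. The sketch below uses a small inline table instead of the Wikipedia page; the HTML and its values are made up for illustration, and it assumes an HTML parser that read_html can use (such as lxml) is installed. It reads the table once with default inference and once with converters={'MNC': str}.

```python
import io

import pandas as pd

# Hypothetical stand-in for a real page: one table whose MNC cells have leading zeros.
html = """
<table>
  <tr><th>MNC</th><th>Operator</th></tr>
  <tr><td>01</td><td>Alpha</td></tr>
  <tr><td>02</td><td>Beta</td></tr>
</table>
"""

# Default behaviour: the MNC column is inferred as numeric, so "01" becomes 1.
inferred = pd.read_html(io.StringIO(html))[0]

# With a str converter the cell text is kept as-is and never parsed as a number.
kept = pd.read_html(io.StringIO(html), converters={"MNC": str})[0]

print(inferred["MNC"].tolist())  # [1, 2]
print(kept["MNC"].tolist())      # ['01', '02']
```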

As @jorisvandenbossche said, the converter does what you want. I made the PR to solve this exact problem.

Thanks for the clarification, guys. When I saw "converter", I assumed it was converting back to a string from the inferred type. I'll use this fix :-)

Should we make "dtype" an alias for "converters", to match the pd.read_csv argument?

Yes, I think we should add a dtype argument. I'm not sure it should be an alias; it might be possible to just pass dtype through to the underlying parser, now that the Python parser supports it: https://github.com/pandas-dev/pandas/pull/14295.
@adrivsh Want to do a PR for this?

I tried using your solution:

import pandas as pd
pd.read_html('https://www.gpw.pl/wskazniki', converters={'C/WK': str}, header=0)[1]

But it removes the "," from the column values.

@tuhinsharma121 That seems like a bug (the returned values are strings, but indeed should not remove the ","). Could you open a new issue for that?

Same problem here. Looks like it tries to parse the numbers before converting them to strings.
A workaround is to pass thousands="ª", decimal="ª" (or any other character not in the text).
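A minimal sketch of that workaround on an inline table follows. The HTML and its values are made up for illustration, and an HTML parser usable by read_html (such as lxml) is assumed to be installed. The comment passes the same character as decimal as well; this sketch sets only thousands, on the assumption that the default thousands="," is what strips the commas.

```python
import io

import pandas as pd

# Hypothetical table: "C/WK" uses a comma as the decimal mark, as on the Polish site.
html = """
<table>
  <tr><th>Name</th><th>C/WK</th></tr>
  <tr><td>AAA</td><td>1,23</td></tr>
  <tr><td>BBB</td><td>12,50</td></tr>
</table>
"""

# Any single character that never occurs in the table text will do.
UNUSED = "\u00aa"  # the "ª" character from the comment above

df = pd.read_html(
    io.StringIO(html),
    converters={"C/WK": str},  # keep the column as text
    thousands=UNUSED,          # replace the default "," so commas are left alone
)[0]

print(df["C/WK"].tolist())  # ['1,23', '12,50']
```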
