Pandas: No way to force reading numerics as strings in `read_html`

Created on 9 Jul 2015 · 12 Comments · Source: pandas-dev/pandas

When an HTML table cell shows 01, read_html interprets it as a number and drops the leading 0.
Is there an option to read such cells as strings?

Dtypes Enhancement IO HTML

adamist521

👍1

All 12 comments

Thanks. It would be nice to add a dtype option like read_csv has. I only took a quick look, but it looks achievable by passing dtype through to TextParser -> TextFileReader.

A PR is welcome :)

sinhrks on 11 Jul 2015

Just stumbled across this page with the same issue. @gte620v can you explain how to accomplish the raw html parsing given your PR? Thanks!

stevenmanton on 20 Mar 2017

Should be something like this: https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715

Just use a converter to convert to str.

@stevenmanton ^

gte620v on 20 Mar 2017

@gte620v thanks for the info. It sounds like you can easily convert back to string, but can't prevent the automatic parsing in the first place. For example, keeping the leading zeros in an integer. Thanks again!

stevenmanton on 25 Mar 2017

@stevenmanton No, it does not convert back to string; it prevents the values from being parsed as numeric in the first place. In any case, leading zeros are preserved if you use converters={'col': str}. See e.g. the example in the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#reading-html-content (you have to scroll down a bit to this passage):

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})

If you try that example, you will see that the leading zeros are preserved.

jorisvandenbossche on 25 Mar 2017

👍3
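For readers without network access, the same behaviour can be tried offline. This is only a minimal sketch: the table, its column names, and the variable names are made up for illustration, and it assumes an HTML parser backend for read_html (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table for illustration; the "code" column holds zero-padded text.
HTML = """
<table>
  <thead><tr><th>name</th><th>code</th></tr></thead>
  <tbody>
    <tr><td>alpha</td><td>01</td></tr>
    <tr><td>beta</td><td>007</td></tr>
  </tbody>
</table>
"""

# Default behaviour: the column is inferred as numeric, so the padding is lost.
default_df = pd.read_html(StringIO(HTML))[0]

# With a converter, each cell goes through str() and is not type-inferred.
text_df = pd.read_html(StringIO(HTML), converters={"code": str})[0]

print(default_df["code"].tolist())
print(text_df["code"].tolist())
```

The converter is keyed by column name, so only the listed column is kept as text; other columns are still inferred as usual.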

As @jorisvandenbossche said, the converter does what you want. I made the PR to solve this exact problem.

gte620v on 25 Mar 2017

Thanks for the clarification, guys. When I saw "converter", I assumed it was converting back to a string from the inferred type. I'll use this fix :-)

stevenmanton on 27 Mar 2017

Should we have "dtypes" be an alias for "converters", to match the pd.read_csv argument?

adrivsh on 16 Oct 2017

👍2

Yes, I think we should add a dtype argument (not sure it should be an alias; it might be possible to just pass dtype through to the underlying parser, now that the Python parser supports it: https://github.com/pandas-dev/pandas/pull/14295).
@adrivsh Want to do a PR for this?

jorisvandenbossche on 16 Oct 2017

@jorisvandenbossche I tried using your solution:

import pandas as pd
pd.read_html('https://www.gpw.pl/wskazniki', converters={'C/WK': str}, header=0)[1]

But it removes the "," from the column values.

tuhinsharma121 on 23 May 2020

👍2

@tuhinsharma121 That seems like a bug (the returned values are strings, but indeed should not remove the ","). Could you open a new issue for that?

jorisvandenbossche on 26 May 2020

Same problem here. Looks like it tries to parse the numbers before converting them to strings.
A workaround is to pass thousands="ª", decimal="ª" (or any other character not in text).

jbsilva on 5 Jun 2020

👍2 🎉1
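A rough sketch of that workaround on an inline table, for anyone who wants to try it without the live page. The table contents and names are hypothetical, the placeholder character is assumed never to occur in the data, and an HTML parser backend such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table for illustration; "ratio" uses a comma as decimal mark.
HTML_RATIOS = """
<table>
  <thead><tr><th>name</th><th>ratio</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>1,23</td></tr>
    <tr><td>B</td><td>0,87</td></tr>
  </tbody>
</table>
"""

# Assumption: this character never appears anywhere in the table text.
UNUSED_CHAR = "ª"

# read_html treats "," as a thousands separator by default. Pointing both
# separators at an unused character leaves the comma in the cell text alone.
kept = pd.read_html(
    StringIO(HTML_RATIOS),
    converters={"ratio": str},
    thousands=UNUSED_CHAR,
    decimal=UNUSED_CHAR,
)[0]

print(kept["ratio"].tolist())
```

The thousands and decimal settings apply to the whole table, so genuinely numeric columns that rely on those separators would need to be parsed separately afterwards.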
