Pandas: Python Pandas read_html fails when reading tables from Wikipedia

Created on 15 Jun 2018  ·  4 Comments  ·  Source: pandas-dev/pandas

I am trying to read the tables from a Wikipedia page using the following code:

import pandas as pd
pd.read_html('https://en.wikipedia.org/wiki/2013–14_Premier_League')

Doing that generates the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 14: ordinal not in range(128)

I have also tried passing the encoding explicitly:

pd.read_html('https://en.wikipedia.org/wiki/2013–14_Premier_League', encoding='utf-8')

but I still get the same error. The following works:

import requests
r = requests.get('https://en.wikipedia.org/wiki/2017–18_Premier_League')
c = r.content
dfs = pd.read_html(c)

What I want to know is how to get pd.read_html() to work directly on the URL, without requests. Is there something I don't understand about encoding, or is this a problem with Pandas?

I am running an Anaconda distribution of Pandas 0.21.1 and Python 3.5.4. Thanks for any help.

Bug IO HTML

All 4 comments

Hmm, interesting. Looks like this is still an issue on master, even when specifying the encoding:

>>> pd.read_html('https://en.wikipedia.org/wiki/2013–14_Premier_League', encoding='utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 14: ordinal not in range(128)

Investigation and PRs are always welcome

https://stackoverflow.com/questions/39229439/encoding-error-when-reading-url-with-urllib

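As seen in this similar question, urllib only accepts ASCII URLs, so non-ASCII characters such as the en dash (U+2013) have to be percent-encoded before the request is made. As a workaround, I used the Requests library (http://docs.python-requests.org/en/master/), which accepts the URL as written.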
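
One way to keep passing a URL straight to pd.read_html() is to percent-encode the non-ASCII characters first. The following is a minimal sketch using only the standard library. The helper name to_ascii_url is made up for illustration, and it assumes the host name itself is already ASCII.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL.

    Hypothetical helper, not part of pandas. Assumes the host name is
    already ASCII (no internationalized domain names).
    """
    parts = urlsplit(url)
    # Keep "/" and existing "%XX" escapes intact so nothing is double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


url = to_ascii_url("https://en.wikipedia.org/wiki/2013–14_Premier_League")
print(url)
```

The resulting string can then be passed to pd.read_html(url) as usual. Whether the request itself succeeds still depends on the network and on the server accepting urllib's default headers.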

FWIW the sample call works fine under Python 2.7.15 but not Python 3.6.5. Choice of engine doesn't matter, however.
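
A likely reason the engine makes no difference is that the error is raised while the URL is being fetched, before any HTML reaches a parser. Python 3's http.client appears to encode the request line as ASCII, which would also account for "position 14" in the message. The sketch below reproduces the same error without touching the network. The request line is built by hand and is only an assumption about what http.client sends.

```python
# Hand-built approximation of the HTTP request line for the failing URL.
request_line = "GET /wiki/2013–14_Premier_League HTTP/1.1"

error = None
try:
    # Encoding the request line as ASCII fails on the en dash (U+2013).
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

In this request line the en dash sits at index 14 (after "GET /wiki/2013"), which matches the position reported in the original traceback.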

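I used the following solution:

import io

import pandas as pd
import requests

# requests accepts the non-ASCII URL as written
url = "https://ru.wikipedia.org/wiki/Города_России_с_населением_более_500_тысяч_человек"
r = requests.get(url)  # Wikipedia pages do not need credentials
website = r.text  # response body decoded to str

# Wrap the HTML in StringIO so read_html gets a file-like object
tables = pd.read_html(io.StringIO(website), encoding="UTF-8")

City_pop = tables[4]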
