Pandas: read_html should retry imports

Created on 28 May 2018 · 9 Comments · Source: pandas-dev/pandas

Trying to use read_html, I got an import error due to a missing dependency. Without leaving the interactive session, I installed it and retried, but although I can now import lxml, pandas does not retry the import.

>>> pd.read_html('https://es.wikipedia.org/wiki/ISO_3166-2:AR')

ImportError                               Traceback (most recent call last)
<ipython-input-172-f6b538e61c22> in <module>()
----> 1 pd.read_html('https://es.wikipedia.org/wiki/ISO_3166-2:AR')

~/.virtualenvs/curso/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
    904                   thousands=thousands, attrs=attrs, encoding=encoding,
    905                   decimal=decimal, converters=converters, na_values=na_values,
--> 906                   keep_default_na=keep_default_na)

~/.virtualenvs/curso/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
    731     retained = None
    732     for flav in flavor:
--> 733         parser = _parser_dispatch(flav)
    734         p = parser(io, compiled_match, attrs, encoding)
    735 

~/.virtualenvs/curso/lib/python3.6/site-packages/pandas/io/html.py in _parser_dispatch(flavor)
    691     else:
    692         if not _HAS_LXML:
--> 693             raise ImportError("lxml not found, please install it")
    694     return _valid_parsers[flavor]
    695 

ImportError: lxml not found, please install it

mgaitan

All 9 comments

Does it work in a new session, or when using importlib.reload? This is extremely nuanced and not something that I think would generally be expected of any package in Python.

WillAyd on 28 May 2018

Importing again after installing often works within the same interactive session without importlib.reload, but in general that is indeed tricky business.

However, the reason it currently does not work in pandas is that we only try the import once and then keep a global constant _HAS_LXML == False, so we never try to import again.
We could change that to try the import on every call, but that might have some performance implications (although I think raising an import error is quite fast).

jorisvandenbossche on 28 May 2018

I'm OK with requiring people to restart their session.

TomAugspurger on 28 May 2018

Restarting a session may imply a lot of expensive recomputation. In exploratory data crunching, which is very typical when using pandas, we may not know a priori that we will need read_html or other handy IO functions.

So what's the advantage of having those globals instead of retrying the import? Moreover, we could still keep the globals so that we do not retry when the modules are already imported, and retry only when the requirements are missing. That way there is no performance penalty if the requirements are already satisfied, and only a very small one in my use case (i.e. missing requirements installed during the session).

mgaitan on 28 May 2018

retrying is out of scope for pandas
this is way too complicated

if you have a dependency that is needed then install it

jreback on 28 May 2018

@jreback if I send a PR for this could it be considered?

mgaitan on 28 May 2018

if it were really simple/reliable/testable sure

jreback on 29 May 2018

@jreback this is not "out of scope" for pandas (we are not talking about using complex machinery of importlib to reload things), as it is (I think) simply removing the global _IMPORTS:

https://github.com/pandas-dev/pandas/blob/b2eec25f4600ba17ef4b9d23cccbf0122da56279/pandas/io/html.py#L35-L37

so we retry the imports on each read_html call.

That said, just because it would be an easy change does not mean we necessarily want to do it. I am not fully sure of the performance impact of trying to import a non-existent package on every call.
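
One way to get a rough feel for that cost is to time a failing import directly. The sketch below uses a hypothetical package name that is assumed not to exist; the result depends on the machine and on the length of sys.path, so no figure is claimed here.

```python
import importlib
import timeit

# Hypothetical name, assumed not to be installed anywhere on sys.path.
MISSING = "hypothetical_missing_package_for_timing"


def failed_import():
    """Attempt the import and report whether it succeeded."""
    try:
        importlib.import_module(MISSING)
    except ImportError:
        return False
    return True


n = 1000
seconds = timeit.timeit(failed_import, number=n)
per_call_us = seconds / n * 1e6
print(f"failed import: about {per_call_us:.1f} microseconds per attempt")
```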

jorisvandenbossche on 29 May 2018

it’s out of scope because if an import fails we generally don’t retry anywhere else

the reload mechanisms are fragile
and import checking each time is expensive

jreback on 29 May 2018
