Pandas: read_html should retry imports

Created on 28 May 2018 · 9 Comments · Source: pandas-dev/pandas

Trying to use read_html, I got an import error due to a missing dependency. Without leaving the interactive session, I installed it and retried, but although I can import lxml after that, pandas does not retry the import.

>>> pd.read_html('https://es.wikipedia.org/wiki/ISO_3166-2:AR')

ImportError                               Traceback (most recent call last)
<ipython-input-172-f6b538e61c22> in <module>()
----> 1 pd.read_html('https://es.wikipedia.org/wiki/ISO_3166-2:AR')

~/.virtualenvs/curso/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
    904                   thousands=thousands, attrs=attrs, encoding=encoding,
    905                   decimal=decimal, converters=converters, na_values=na_values,
--> 906                   keep_default_na=keep_default_na)

~/.virtualenvs/curso/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
    731     retained = None
    732     for flav in flavor:
--> 733         parser = _parser_dispatch(flav)
    734         p = parser(io, compiled_match, attrs, encoding)
    735 

~/.virtualenvs/curso/lib/python3.6/site-packages/pandas/io/html.py in _parser_dispatch(flavor)
    691     else:
    692         if not _HAS_LXML:
--> 693             raise ImportError("lxml not found, please install it")
    694     return _valid_parsers[flavor]
    695 

ImportError: lxml not found, please install it

All 9 comments

Does it work in a new session or when using importlib.reload? This is extremely nuanced and not something that I think would generally be expected of any package in Python.

Importing again after installing often works within the same interactive session without importlib.reload, but in general that is indeed tricky business.

However, the reason it currently does not work in pandas is that we only try to import it once and then keep a global constant _HAS_LXML == False, so we will not try to import again.
We could change that to try the import on every call, but that might have some performance implications (although I think raising an import error is quite fast).

I'm OK with requiring people to restart their session.

Restarting a session may imply a lot of expensive recomputation. In exploratory data crunching, which is very typical when using pandas, we may not know a priori that we will need read_html or other handy io functions.

So, what's the advantage of having those globals instead of retrying the import? Moreover, we could still keep the globals so that we do not retry when the requirements are already imported, and retry only when they are missing. That way there is no performance penalty if the requirements are already satisfied, and just a very low one in my use case (i.e. missing requirements installed during the session).

retrying is out of scope for pandas
this is way too complicated

if you have a dependency that is needed then install it

@jreback if I send a PR for this could it be considered?

if it were really simple/reliable/testable sure

@jreback this is not "out of scope" for pandas (we are not talking about using complex machinery of importlib to reload things), as it is (I think) simply removing the global _IMPORTS:

https://github.com/pandas-dev/pandas/blob/b2eec25f4600ba17ef4b9d23cccbf0122da56279/pandas/io/html.py#L35-L37

so we retry the imports on each read_html call.

That said, just because it would be an easy change does not mean we necessarily want to do it. I am not fully sure of the performance impact of trying to import a non-existent package on every call.
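One way to get a rough feel for that cost is to time a failing import directly. The snippet below is a small sketch, not a benchmark of pandas itself; `try_import` and the module name in `MISSING` are made up for illustration, and the result will depend on the machine and on the length of sys.path.

```python
import importlib
import timeit


def try_import(name):
    """Return True if the module imports, False if it raises ImportError."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Hypothetical module name, assumed not to be installed anywhere.
MISSING = "surely_not_an_installed_module_xyz"

# Repeat the failing import many times and report the average cost.
repeats = 1000
seconds = timeit.timeit(lambda: try_import(MISSING), number=repeats)
print(f"{seconds / repeats * 1e6:.1f} microseconds per failed import attempt")
```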

it's out of scope because if an import fails we don't retry anywhere else generally

the reload mechanisms are fragile
and import checking each time is expensive
