Pandas: Feature request: read_excel to support hyperlinks split into label and hyperlink

Created on 14 Jun 2016 · 11Comments · Source: pandas-dev/pandas

Situation:
.xlsx file having a column with label and hyperlinks (see screenshot collage)
In Excel, label: Python (that has a hyperlink to i.e.: www.python.org)

Code Sample, a copy-pastable example if possible

data = pd.read_excel('Book1.xlsx')
column has just "Python" as content. Hyperlink information is lost.
(see screenshot)

excel-hyperlink

Expected Output, Proposal:

Add an argument for example named hyperlinkparser to read_excel()
that offers options: "label", "hyperlink", "both"
label ... just the label (this is how it is handled now)
hyperlink ... just the hyperlink, drop the label
split ... created two separate columns for label and hyperlink each

with example it could look like:
| MyLinks | MyLinks_hyperlink
0 | Python | www.python.org

btw. openpyxl 2.4.+ has this feature now included (was broken in previous versions). Maybe this helps for implementation.

Thank you!

API Design Enhancement IO Excel

Source

DocViper666

👍3

Most helpful comment

Is this being worked on still? Would be very useful if someone could provide a closure. Thanks :)

srihari15 on 29 May 2020

👍4

All 11 comments

How difficult is to preprocess the file with xlrd or openpyxl, and pass the modified version (with the additional columns) to read_excel.? I haven't used either of those libraries extensively. If it's not too much code we could include it as a cookbook example.

TomAugspurger on 14 Jun 2016

Thanks for the answer.

What I have found as a workaround
(and please excuse clumsy coding style - I'm a python beginner):

# requires openpyxl version 2.4.0+

from openpyxl import load_workbook
wb = load_workbook(filename='input.xlsx')
ws = wb.get_active_sheet() # ws is now an IterableWorksheet

# iterate thru all cells and if hyperlink found attempt modification of cell
for row in ws.rows:
    for cell in row:
        try:
            if len(cell.hyperlink.target)  > 0:
                cell.value = "".join([cell.value,"|",cell.hyperlink.target])
                # Join cell.value and hyperlink target into string (optionally just assign the hyperlink.target to the cell.value
        except:
            pass

# save workbook to temp .xlsx (I could not manage to read from buffer...) .
wb.save("temp.xlsx")

# read with pandas 
data = pd.read_excel("temp.xlsx")

# take DataSeries and rsplit by "|" and expand to 2 columns
hyper = (data.MyLinks.str.rsplit("|", expand=True))

#set labels
hyper.columns=["Label","Hyperlink"]

# join them back to dataframe on index
data = data.join(hyper, how="left")

# done

output

Maybe this is helpful for somebody.
An integrated pandas function still would be nice.

Br,
Joe

DocViper666 on 15 Jun 2016

👍4

One possible enhancement is to allow user funcs (passed via converters kwds) to accept excel Cell instances.

sinhrks on 15 Jun 2016

@sinhrks:
Thanks for the feedback. Would this allow to target the cell.value and also the cell.hyperlink.target ? Is this supported in the xlrd module that is used for reading the file?

@all: I have updated my workaround in case someone is looking for ideas.

@TomAugspurger : Feel free to used it in the cookbook if fitting

DocViper666 on 15 Jun 2016

@DocViper666 The enhancement should allow something like:

pd.read_excel("xxx.xlsx", converters={"target_col_name": lambda x: str(x.value) + "|"+ str(x.hyperlink.target), raw=True)

As long as I know, xlrd also has a class corresponding to cell.

http://www.lexicon.net/sjmachin/xlrd.html#xlrd.Cell-class

sinhrks on 15 Jun 2016

Is anyone working on this? Are pandas maintainers open to pull requests? I would like to work on this, with a little bit of guidance.

dustinmichels on 17 Jul 2019

Not aware of anyone working this. If I'm being honest I'm not entirely sure this is within the scope of pandas as it seems like a very niche use, but you are welcome to submit a PR if you find an easy way to implement

WillAyd on 17 Jul 2019

Okay, thanks for your reply!

I came across a StackOverflow question about using pd.read_excel to read a document with hyperlinks, and the top answer said 'Pandas can't do it, but you can use openpyxl instead.'

It seems in some cases, Pandas already acts as a wrapper around openpyxl. For instance, when I try to use pandas.DataFrame.to_excel to save a DataFrame as an Excel file, I get a warning that I should install openpyxl first.

I thought perhaps an option to read an Excel file, while maintaining links, could work the same way?

dustinmichels on 17 Jul 2019

The openpyxl reader is here:

https://github.com/pandas-dev/pandas/blob/d48306e77c0a708e0ee33a2aa1da7e267df52ef6/pandas/io/excel/_openpyxl.py#L460

get_sheet_data would be the method of most interest

WillAyd on 19 Jul 2019

Thanks!! I'll see what I can do.

dustinmichels on 19 Jul 2019

Is this being worked on still? Would be very useful if someone could provide a closure. Thanks :)

srihari15 on 29 May 2020

👍4

Was this page helpful?