Situation:
.xlsx file having a column with label and hyperlinks (see screenshot collage)
In Excel, label: Python (that has a hyperlink to i.e.: www.python.org)
data = pd.read_excel('Book1.xlsx')
column has just "Python" as content. Hyperlink information is lost.
(see screenshot)
Add an argument for example named hyperlinkparser to read_excel()
that offers options: "label", "hyperlink", "both"
label ... just the label (this is how it is handled now)
hyperlink ... just the hyperlink, drop the label
split ... created two separate columns for label and hyperlink each
with example it could look like:
| MyLinks | MyLinks_hyperlink
0 | Python | www.python.org
btw. openpyxl 2.4.+ has this feature now included (was broken in previous versions). Maybe this helps for implementation.
Thank you!
How difficult is to preprocess the file with xlrd or openpyxl, and pass the modified version (with the additional columns) to read_excel
.? I haven't used either of those libraries extensively. If it's not too much code we could include it as a cookbook example.
Thanks for the answer.
What I have found as a workaround
(and please excuse clumsy coding style - I'm a python beginner):
# requires openpyxl version 2.4.0+
from openpyxl import load_workbook
wb = load_workbook(filename='input.xlsx')
ws = wb.get_active_sheet() # ws is now an IterableWorksheet
# iterate thru all cells and if hyperlink found attempt modification of cell
for row in ws.rows:
for cell in row:
try:
if len(cell.hyperlink.target) > 0:
cell.value = "".join([cell.value,"|",cell.hyperlink.target])
# Join cell.value and hyperlink target into string (optionally just assign the hyperlink.target to the cell.value
except:
pass
# save workbook to temp .xlsx (I could not manage to read from buffer...) .
wb.save("temp.xlsx")
# read with pandas
data = pd.read_excel("temp.xlsx")
# take DataSeries and rsplit by "|" and expand to 2 columns
hyper = (data.MyLinks.str.rsplit("|", expand=True))
#set labels
hyper.columns=["Label","Hyperlink"]
# join them back to dataframe on index
data = data.join(hyper, how="left")
# done
Maybe this is helpful for somebody.
An integrated pandas function still would be nice.
Br,
Joe
One possible enhancement is to allow user funcs (passed via converters
kwds) to accept excel Cell
instances.
@sinhrks:
Thanks for the feedback. Would this allow to target the cell.value and also the cell.hyperlink.target ? Is this supported in the xlrd module that is used for reading the file?
@all: I have updated my workaround in case someone is looking for ideas.
@TomAugspurger : Feel free to used it in the cookbook if fitting
@DocViper666 The enhancement should allow something like:
pd.read_excel("xxx.xlsx", converters={"target_col_name": lambda x: str(x.value) + "|"+ str(x.hyperlink.target), raw=True)
As long as I know, xlrd also has a class corresponding to cell.
Is anyone working on this? Are pandas maintainers open to pull requests? I would like to work on this, with a little bit of guidance.
Not aware of anyone working this. If I'm being honest I'm not entirely sure this is within the scope of pandas as it seems like a very niche use, but you are welcome to submit a PR if you find an easy way to implement
Okay, thanks for your reply!
I came across a StackOverflow question about using pd.read_excel
to read a document with hyperlinks, and the top answer said 'Pandas can't do it, but you can use openpyxl instead.'
It seems in some cases, Pandas already acts as a wrapper around openpyxl. For instance, when I try to use pandas.DataFrame.to_excel
to save a DataFrame as an Excel file, I get a warning that I should install openpyxl first.
I thought perhaps an option to read an Excel file, while maintaining links, could work the same way?
The openpyxl reader is here:
get_sheet_data
would be the method of most interest
Thanks!! I'll see what I can do.
Is this being worked on still? Would be very useful if someone could provide a closure. Thanks :)
Most helpful comment
Is this being worked on still? Would be very useful if someone could provide a closure. Thanks :)