Here's an example with just 65,000 elements (it's _much_ worse with 250,000) so I can compare xls and xlsx:
```python
from pandas import *

df = DataFrame({'col1': [0.0] * 65000,
                'col2': 1,
                'col3': 2})

%timeit df.to_csv('sample.csv')
10 loops, best of 3: 109 ms per loop

%timeit df.to_excel('sample.xls')
1 loops, best of 3: 8.35 s per loop

%timeit df.to_excel('sample.xlsx')
1 loops, best of 3: 1min 31s per loop
```
I agree this is a frustrating issue (and if you get a big enough file, you can actually run out of memory entirely). I've had the thought in the back of my head that we should try to use PyExcelerate, which should be faster and use much less memory.
These use `xlwt` for writing `xls`; an option would be to use `openpyxl` (it's used by default for `.xlsx`), passing `optimize_write=True` in the `Workbook` constructor. I don't know why this is not used (maybe more functionality is exposed by `xlwt`?).
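For reference, here is a minimal sketch of what writing through openpyxl's streaming ("write-only") mode looks like; current openpyxl spells the flag `write_only=True` rather than `optimize_write=True`, and the DataFrame and file name below are only illustrative:

```python
import pandas as pd
from openpyxl import Workbook

df = pd.DataFrame({'col1': [0.0] * 65000, 'col2': 1, 'col3': 2})

# write_only=True streams rows to disk instead of building the whole
# sheet in memory (older openpyxl exposed a similar optimized-write
# flag on the Workbook constructor).
wb = Workbook(write_only=True)
ws = wb.create_sheet()
ws.append(list(df.columns))              # header row
for row in df.itertuples(index=False):
    ws.append(list(row))                 # one row at a time
wb.save('sample_streamed.xlsx')
```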
@jtratner close this as a result of #4542 ?
Yes, xlsxwriter is likely faster (and may become even faster in the
future). PyExcelerate may also be added. (cheers to @jmcnamara for this)
closed by #4542
Update: the default writer for `df.to_excel(...)` is still super slow for large data. Finally gave up and used `df.to_csv(...)`. Worked for my purposes. Thanks!
Using PyExcelerate helps a lot when it comes to dumping lots of data.
With a DataFrame (120000, 120) of real mixed data (not ones and zeros :) ) it took 4 minutes to write an .xlsx.
Another test I did was a (189121, 27) DF that took only 2min 33s (.xlsx). I also tested pandas to_excel() and it took 5min 23s. PyExcelerate was more than 2 times faster.
I also noticed that it consumes much less memory during the process.
Though PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs and so on). Personally, I faced an error with an empty value in a datetime column and had to manually fix that.
Anyway, if one needs a bulk Excel dump, PyExcelerate is the way to go. Cheers to its authors!
For trivial cases I use something like this and it works fine:
```python
from pyexcelerate import Workbook

def df_to_excel(df, path, sheet_name='Sheet 1'):
    data = [df.columns.tolist(), ] + df.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
```
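Combining that helper with the NaN/NaT preparation mentioned a few comments up, a rough sketch (my assumption of what the cleanup could look like, not the exact fix used above) would be:

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_clean(df, path, sheet_name='Sheet 1'):
    # Cast to object so missing values (NaN/NaT) can be replaced by None,
    # which pyexcelerate writes as an empty cell.
    cleaned = df.astype(object).where(df.notna(), None)
    data = [df.columns.tolist()] + cleaned.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
```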
@sancau if you would like to add this to the supported engines that would be fine (most of the work would be making sure dtypes are correct and round-trippable).
Is there any update on this?
`xls` does not support more than 255 columns, while saving `xlsx` in pandas is really slow.
I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory.
Is there any plan to support pyexcelerate natively, for example?
> Is there any update on this?
Unfortunately, no AFAIK :cry:
> Is there any plan to support pyexcelerate natively, for example?
Judging from the comment from @jreback , we would certainly be happy with supporting it, but the PR needs to be done right for it to be incorporated. If you would like to jumpstart that effort, go for it!
I wouldn't even know where to start :(
> Though PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs and so on). Personally, I faced an error with an empty value in a datetime column and had to manually fix that.
@sancau @jreback : this makes me skeptical about performance because data preparation is very important for us to ensure round-trippability. I think this might be why we've had issues implementing this.
I'll re-open for now, just so that people know we have this on our radar.
> I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory. Is there any plan to support pyexcelerate natively, for example?
For us to avoid the in-memory issue, you would need to be able to write with pyexcelerate in chunks. As our Excel code is largely in Python, it would otherwise be difficult to avoid that problem.
@raffam : do you know if such functionality is possible?
@gfyoung I don't know. What do you mean exactly by "writing in chunks"? The way to write an xlsx file with pyexcelerate seems to be like this:
```python
from pyexcelerate import Workbook

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # data is a 2D array
wb = Workbook()
wb.new_sheet("sheet name", data=data)
wb.save("output.xlsx")
```
This is more or less what I had done (I have switched to CSV since then). So I too converted the pandas DataFrame to a multi-dimensional array and passed that to pyexcelerate. I don't have deep enough knowledge of pandas' internals to assess whether it would be possible to avoid that step.
@raffam : What I mean by that is that instead of writing the entire `DataFrame` (and having to make an entire copy of it in memory as you mentioned above), you would take a portion of the `DataFrame` on each iteration and write it to your Excel file.
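To make that concrete, here is a rough sketch of copying the `DataFrame` into a pyexcelerate workbook one slice at a time instead of converting it to a single big list up front (names and chunk size are illustrative, and the Workbook itself still accumulates all data until `save()`):

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_chunked(df, path, sheet_name='Sheet1', chunk_size=10000):
    wb = Workbook()
    ws = wb.new_sheet(sheet_name)
    n_rows, n_cols = df.shape
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Only this slice is materialised as a Python list at any one time.
        chunk = df.iloc[start:stop].values.tolist()
        # pyexcelerate ranges are 1-indexed and inclusive.
        ws.range((start + 1, 1), (stop, n_cols)).value = chunk
    wb.save(path)
```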
It is probably worth pointing out that according to the benchmark in the pyexcelerate docs it is only 2x faster than XlsxWriter. Also openpyxl with lxml is probably as fast as xlsxwriter now.
So although pyexcelerate may be faster, it isn't going to get the end user anywhere close to CSV speed due to the verbose nature of the xlsx format and the fact that it needs to be zipped. Even the C version of XlsxWriter (libxlsxwriter) is only about 10x faster than the Python version.
So it is questionable if any of this is worth the effort.
If the Pandas xlsx writer was restructured to write the data row by row then I might be able to optimise XlsxWriter up to the pyexcelerate speed. But again I'm not sure if it is worth the effort.
Final note, I've seen a 10x increase in speed running long XlsxWriter programs under pypy.
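For comparison, XlsxWriter already has a row-by-row, low-memory mode of its own (`constant_memory`); a minimal sketch outside of pandas (file name and data are illustrative, and this is not the code path pandas uses) looks like:

```python
import pandas as pd
import xlsxwriter

df = pd.DataFrame({'a': range(100000), 'b': 1.5, 'c': 'hi'})

# constant_memory flushes each row to a temp file as soon as it is written,
# so memory use stays roughly constant; rows must be written in order.
wb = xlsxwriter.Workbook('stream.xlsx', {'constant_memory': True})
ws = wb.add_worksheet()
ws.write_row(0, 0, list(df.columns))
for i, row in enumerate(df.itertuples(index=False), start=1):
    ws.write_row(i, 0, row)
wb.close()
```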
I have made a quick test case.
I built two functions that save a pandas DataFrame to Excel using pyexcelerate. The first function iterates over rows, the second iterates over columns.
The test is run against 2 matrices: one with lots of rows, and the other one with lots of columns.
It's highly likely that my code can be optimized further.
Anyway, the results of the second implementation are consistent with what is declared in the pyexcelerate docs: approximately 2 times faster in every case.
The code was run under the Intel distribution of Python 3.5.2. Similar results are obtained under "normal" CPython 3.6.2.
```
FRAME WITH LOTS OF ROWS
to_excel took 2.97 secs
pyexcelerate1 took 1.94 secs
34.68% faster than base time
pyexcelerate2 took 1.73 secs
41.75% faster than base time

FRAME WITH LOTS OF COLUMNS
to_excel took 2.13 secs
pyexcelerate1 took 1.15 secs
46.01% faster than base time
pyexcelerate2 took 1.06 secs
50.23% faster than base time
```
Here is the code:
```python
from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("to_excel took " + str(delta) + " secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num, 1), (row_num, col_num)).value = [[*row_tup]]
        row_num += 1
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("pyexcelerate1 took " + str(delta) + " secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    row_num = frame.shape[0]
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1, col_num), (row_num, col_num)).value = [[x] for x in col_series]
        col_num += 1
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time - start_time, 2)
    print("pyexcelerate2 took " + str(delta) + " secs")
    return delta

def perf(base, eps):
    perc = ((base - eps) / base) * 100
    perc = round(perc, 2)
    print(str(perc) + "% faster than base time")

print("FRAME WITH LOTS OF ROWS")
size = 50000
dic = {'a': [50] * size, 'b': [1] * size, 'c': ['hi'] * size}
frame_rows = pd.DataFrame(dic)
del dic

base = toPandasExcel("rows_pandas.xlsx", frame_rows)
e1 = toExcelerate1("rows_pyexcelerate1.xlsx", frame_rows)
perf(base, e1)
e2 = toExcelerate2("rows_pyexcelerate2.xlsx", frame_rows)
perf(base, e2)

print("FRAME WITH LOTS OF COLUMNS")
dic = dict()
for x in range(10000):
    dic.update({"A" + str(x): ['c'] * 5})
frame_cols = pd.DataFrame(dic)
del dic

base = toPandasExcel("cols_pandas.xlsx", frame_cols)
e1 = toExcelerate1("cols_pyexcelerate1.xlsx", frame_cols)
perf(base, e1)
e2 = toExcelerate2("cols_pyexcelerate2.xlsx", frame_cols)
perf(base, e2)
```
EDIT: it was not a fair comparison since to_excel wrote indexes and headers. Now it is a fair comparison.
EDIT2: performance improvements.
@gfyoung would one of the 2 implementations be OK?
I don't think it is possible to actually "write" the Excel file in chunks since, as has already been said, it needs to be compressed.
It may be possible to "build" the precursor data structure in chunks, so you effectively have 2 copies of the dataframe in memory (one as a pandas DataFrame, one as a pyexcelerate Workbook) instead of three (pandas, pyexcelerate, and the temporary 2-dimensional array you need if you build the pyexcelerate Workbook in one step).
@raffam : Hmm...your timing results are encouraging but not convincing enough. Can you try with even larger sizes (think millions 😉)?
The limits are 1,048,576 rows by 16,384 columns
@raffam : Right, so you can test with WAY MORE than 50,000 rows is my point.
I understand, but unfortunately my PC stalls with very large matrices :(
Here is the code I have written in case someone wants to test it:
```python
from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math
import numpy as np

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("to_excel took " + str(delta) + " secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num, 1), (row_num, col_num)).value = [[*row_tup]]
        row_num += 1
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("pyexcelerate1 took " + str(delta) + " secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    row_num = frame.shape[0]
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1, col_num), (row_num, col_num)).value = list(map(lambda a: [a], col_series))
        col_num += 1
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time - start_time, 2)
    print("pyexcelerate2 took " + str(delta) + " secs")
    return delta

# Write the whole frame in a single range assignment
def toExcelerate3(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    row_num = frame.shape[0]
    col_num = frame.shape[1]
    ws.range((1, 1), (row_num, col_num)).value = frame.values.tolist()
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time - start_time, 2)
    print("pyexcelerate3 took " + str(delta) + " secs")
    return delta

def perf(base, eps):
    perc = (base - eps) / base
    print("{0:.2%} over base time".format(perc))

print("BUILDING DATAFRAME")
frame = pd.DataFrame()
rows = int(math.pow(10, 5))
cols = 5000
for x in range(cols):
    frame[str(x)] = pd.Series(np.random.randn(rows))

print("to_excel()")
base = toPandasExcel("pandas.xlsx", frame)

print("pyexcelerate1")
e1 = toExcelerate1("pyexcelerate1.xlsx", frame)
perf(base, e1)

print("pyexcelerate2")
e2 = toExcelerate2("pyexcelerate2.xlsx", frame)
perf(base, e2)

print("pyexcelerate3")
e3 = toExcelerate3("pyexcelerate3.xlsx", frame)
perf(base, e3)
```
@raffam : Thanks for this! This will be very helpful.
Used the toExcelerate3 function and the default xlsx writer to write a 333243×34 dataframe to file. Results:
to_excel (xlsx) default: 1360 seconds
toExcelerate3: 773 seconds
For whoever is interested, I created a simple helper function to write DataFrames to Excel (including headers and index) using pyexcelerate: https://gist.github.com/mapa17/bc04be36e447cab0746a0ec8903cc49f
I thought about adding an Excel writer engine to pandas/pandas/io/excel.py but I am a bit worried looking through the other already implemented engines. They support all kinds of fancy cell formatting.
Do you think it would be sufficient to provide a minimalistic Excel writer engine using pyexcelerate, writing only unformatted Excel files?
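The gist linked above is along these lines; a rough sketch (not the gist's exact code) of writing a DataFrame including its header and index through pyexcelerate:

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_with_index(df, path, sheet_name='Sheet1'):
    # First row: a corner cell for the index name, then the column labels.
    header = [df.index.name or ''] + [str(c) for c in df.columns]
    # Body rows: the index value followed by that row's values.
    body = [[idx] + list(row) for idx, row in zip(df.index, df.values.tolist())]
    wb = Workbook()
    wb.new_sheet(sheet_name, data=[header] + body)
    wb.save(path)
```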
Hmm...from a maintenance perspective, I think I would want to maintain fewer engines rather than more because of the compatibility issues down the road (e.g. maintaining consistency between the Python and C parsers for CSV has been quite difficult).
I could potentially see this as a "fast track," but it sounds like a bit of a corner case. Thus, I'm -0.5 on this overall because I'm leaning towards this being more maintenance burden than benefit for the broader audience.
cc @jreback
This is indeed too slow; I'm not surprised to find others thinking the same. Five minutes to produce a 50 MB Excel file is too much.
Based on the conversation above, I think we'll close this for now as it's unclear if there's really anything to be gained here. If anyone disagrees, feel free to reopen.