Here's an example with just 65,000 elements (it's _much_ worse with 250,000) so I can compare xls and xlsx:
```python
from pandas import *

df = DataFrame({'col1': [0.0] * 65000,
                'col2': 1,
                'col3': 2})

%timeit df.to_csv('sample.csv')
10 loops, best of 3: 109 ms per loop

%timeit df.to_excel('sample.xls')
1 loops, best of 3: 8.35 s per loop

%timeit df.to_excel('sample.xlsx')
1 loops, best of 3: 1min 31s per loop
```
I agree this is a frustrating issue (and if you get a big enough file, you can actually run out of memory entirely). I've had the thought in the back of my head that we should try to use PyExcelerate, which should be faster and use much less memory.
These use `xlwt` for writing `xls`; an option would be to use `openpyxl` (it's used by default for `.xlsx`), passing `optimize_write=True` in the `Workbook` constructor. I don't know why this is not used (maybe more functionality is exposed by `xlwt`?).
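For reference, here is a minimal sketch of what writing through openpyxl's streaming ("write-only") mode looks like; current openpyxl spells the flag `write_only=True` rather than `optimize_write=True`, and the DataFrame and file name below are only illustrative:

```python
import pandas as pd
from openpyxl import Workbook

df = pd.DataFrame({'col1': [0.0] * 65000, 'col2': 1, 'col3': 2})

# write_only=True streams rows to disk instead of building the whole
# sheet in memory (older openpyxl exposed a similar optimized-write
# flag on the Workbook constructor).
wb = Workbook(write_only=True)
ws = wb.create_sheet()
ws.append(list(df.columns))              # header row
for row in df.itertuples(index=False):
    ws.append(list(row))                 # one row at a time
wb.save('sample_streamed.xlsx')
```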
@jtratner close this as a result of #4542 ?
Yes, xlsxwriter is likely faster (and may become even faster in the
future). PyExcelerate may also be added. (cheers to @jmcnamara for this)
closed by #4542
Update: the default writer for `df.to_excel(...)` is still super slow for large data. Finally gave up and used `df.to_csv(...)`. Worked for my purposes. Thanks!
Using PyExcelerate helps a lot when it comes to dumping lots of data.
With a DataFrame (120000, 120) of real mixed data (not ones and zeros :) ) it took 4 minutes to write an .xlsx.
Another test I did was a (189121, 27) DF that took only 2min 33s (.xlsx). I also tested pandas to_excel() and it took 5min 23s. PyExcelerate was more than 2 times faster.
I also noticed that it consumes much less memory during the process.
Though PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs and so on). Personally, I faced an error with an empty value in a datetime column and had to manually fix that.
Anyway, if one needs a bulk Excel dump, PyExcelerate is the way to go. Cheers to its authors!
For trivial cases I use something like this and it works fine:
```python
from pyexcelerate import Workbook

def df_to_excel(df, path, sheet_name='Sheet 1'):
    data = [df.columns.tolist(), ] + df.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
```
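Combining that helper with the NaN/NaT preparation mentioned a few comments up, a rough sketch (my assumption of what the cleanup could look like, not the exact fix used above) would be:

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_clean(df, path, sheet_name='Sheet 1'):
    # Cast to object so missing values (NaN/NaT) can be replaced by None,
    # which pyexcelerate writes as an empty cell.
    cleaned = df.astype(object).where(df.notna(), None)
    data = [df.columns.tolist()] + cleaned.values.tolist()
    wb = Workbook()
    wb.new_sheet(sheet_name, data=data)
    wb.save(path)
```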
@sancau if you would like to add this to the supported engines that would be fine (most of the work would be making sure dtypes are correct and round-trippable).
Is there any update on this?
`xls` does not support more than 255 columns, while saving `xlsx` in pandas is really slow.
I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory.
Is there any plan to support pyexcelerate natively, for example?
> Is there any update on this?
Unfortunately, no AFAIK :cry:
> Is there any plan to support pyexcelerate natively, for example?
Judging from the comment from @jreback , we would certainly be happy with supporting it, but the PR needs to be done right for it to be incorporated. If you would like to jumpstart that effort, go for it!
I wouldn't even know where to start :(
> Though PyExcelerate might require some manual data preparation in some cases (NaNs, NaTs and so on). Personally, I faced an error with an empty value in a datetime column and had to manually fix that.
@sancau @jreback : this makes me skeptical about performance because data preparation is very important for us to ensure round-trippability. I think this might be why we've had issues implementing this.
I'll re-open for now, just so that people know we have this on our radar.
> I tried using pyexcelerate, but that requires copying the entire pandas DataFrame to a multi-dimensional list in memory. Is there any plan to support pyexcelerate natively, for example?
For us to avoid the in-memory issue, you would need to be able to write with pyexcelerate in chunks. As our Excel code is largely in Python, it would otherwise be difficult to avoid that problem.
@raffam : do you know if such functionality is possible?
@gfyoung I don't know. What do you mean exactly by "writing in chunks"? The way to write an xlsx file with pyexcelerate seems to be like this:
```python
from pyexcelerate import Workbook

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # data is a 2D array
wb = Workbook()
wb.new_sheet("sheet name", data=data)
wb.save("output.xlsx")
```
This is more or less what I had done (I have switched to CSV since then). So I too converted the pandas DataFrame to a multi-dimensional array and passed that to pyexcelerate. I don't have deep enough knowledge of pandas' internals to assess whether it would be possible to avoid that step.
@raffam : What I mean by that is that instead of writing the entire `DataFrame` (and having to make an entire copy of it in memory as you mentioned above), you would take a portion of the `DataFrame` on each iteration and write it to your Excel file.
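To make that concrete, here is a rough sketch of copying the `DataFrame` into a pyexcelerate workbook one slice at a time instead of converting it to a single big list up front (names and chunk size are illustrative, and the Workbook itself still accumulates all data until `save()`):

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_chunked(df, path, sheet_name='Sheet1', chunk_size=10000):
    wb = Workbook()
    ws = wb.new_sheet(sheet_name)
    n_rows, n_cols = df.shape
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Only this slice is materialised as a Python list at any one time.
        chunk = df.iloc[start:stop].values.tolist()
        # pyexcelerate ranges are 1-indexed and inclusive.
        ws.range((start + 1, 1), (stop, n_cols)).value = chunk
    wb.save(path)
```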
It is probably worth pointing out that according to the benchmark in the pyexcelerate docs it is only 2x faster than XlsxWriter. Also openpyxl with lxml is probably as fast as xlsxwriter now.
So although pyexcelerate may be faster, it isn't going to get the end user anywhere close to CSV speed due to the verbose nature of the xlsx format and the fact that it needs to be zipped. Even the C version of XlsxWriter (libxlsxwriter) is only about 10x faster than the Python version.
So it is questionable if any of this is worth the effort.
If the Pandas xlsx writer was restructured to write the data row by row then I might be able to optimise XlsxWriter up to the pyexcelerate speed. But again I'm not sure if it is worth the effort.
Final note, I've seen a 10x increase in speed running long XlsxWriter programs under pypy.
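For comparison, XlsxWriter already has a row-by-row, low-memory mode of its own (`constant_memory`); a minimal sketch outside of pandas (file name and data are illustrative, and this is not the code path pandas uses) looks like:

```python
import pandas as pd
import xlsxwriter

df = pd.DataFrame({'a': range(100000), 'b': 1.5, 'c': 'hi'})

# constant_memory flushes each row to a temp file as soon as it is written,
# so memory use stays roughly constant; rows must be written in order.
wb = xlsxwriter.Workbook('stream.xlsx', {'constant_memory': True})
ws = wb.add_worksheet()
ws.write_row(0, 0, list(df.columns))
for i, row in enumerate(df.itertuples(index=False), start=1):
    ws.write_row(i, 0, row)
wb.close()
```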
I have made a quick test case.
I built two functions that save a pandas DataFrame to Excel using pyexcelerate. The first function iterates over rows, the second iterates over columns.
The test is run against 2 matrices: one with lots of rows, and the other one with lots of columns.
It's highly likely that my code can be optimized further.
Anyway, the results of the second implementation are consistent with what is declared in the pyexcelerate docs: approximately 2 times faster in every case.
The code was run under the Intel distribution of Python 3.5.2. Similar results are obtained under "normal" CPython 3.6.2.
```
FRAME WITH LOTS OF ROWS
to_excel took 2.97 secs
pyexcelerate1 took 1.94 secs
34.68% faster than base time
pyexcelerate2 took 1.73 secs
41.75% faster than base time

FRAME WITH LOTS OF COLUMNS
to_excel took 2.13 secs
pyexcelerate1 took 1.15 secs
46.01% faster than base time
pyexcelerate2 took 1.06 secs
50.23% faster than base time
```
Here is the code:
```python
from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("to_excel took " + str(delta) + " secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num, 1), (row_num, col_num)).value = [[*row_tup]]
        row_num += 1
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("pyexcelerate1 took " + str(delta) + " secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    row_num = frame.shape[0]
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1, col_num), (row_num, col_num)).value = [[x] for x in col_series]
        col_num += 1
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time - start_time, 2)
    print("pyexcelerate2 took " + str(delta) + " secs")
    return delta

def perf(base, eps):
    perc = ((base - eps) / base) * 100
    perc = round(perc, 2)
    print(str(perc) + "% faster than base time")

print("FRAME WITH LOTS OF ROWS")
size = 50000
dic = {'a': [50] * size, 'b': [1] * size, 'c': ['hi'] * size}
frame_rows = pd.DataFrame(dic)
del dic

base = toPandasExcel("rows_pandas.xlsx", frame_rows)
e1 = toExcelerate1("rows_pyexcelerate1.xlsx", frame_rows)
perf(base, e1)
e2 = toExcelerate2("rows_pyexcelerate2.xlsx", frame_rows)
perf(base, e2)

print("FRAME WITH LOTS OF COLUMNS")
dic = dict()
for x in range(10000):
    dic.update({"A" + str(x): ['c'] * 5})
frame_cols = pd.DataFrame(dic)
del dic

base = toPandasExcel("cols_pandas.xlsx", frame_cols)
e1 = toExcelerate1("cols_pyexcelerate1.xlsx", frame_cols)
perf(base, e1)
e2 = toExcelerate2("cols_pyexcelerate2.xlsx", frame_cols)
perf(base, e2)
```
EDIT: it was not a fair comparison since to_excel wrote indexes and headers. Now it is a fair comparison.
EDIT2: performance improvements.
@gfyoung would one of the 2 implementations be OK?
I don't think it is possible to actually "write" the Excel file in chunks since, as has already been said, it needs to be compressed.
It may be possible to "build" the precursor data structure in chunks, so you effectively have 2 copies of the dataframe in memory (one as a pandas DataFrame, one as a pyexcelerate Workbook) instead of three (pandas, pyexcelerate, and the temporary 2-dimensional array you need if you build the pyexcelerate Workbook in one step).
@raffam : Hmm...your timing results are encouraging but not convincing enough. Can you try with even larger sizes (think millions 😉)?
The limits are 1,048,576 rows by 16,384 columns
@raffam : Right, so you can test with WAY MORE than 50,000 rows is my point.
I understand, but unfortunately my PC stalls with very large matrices :(
Here is the code I have written in case someone wants to test it:
```python
from pyexcelerate import Workbook, Range
import pandas as pd
import timeit
import math
import numpy as np

def toPandasExcel(file, frame):
    start = timeit.default_timer()
    frame.to_excel(file, index=False, header=False)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("to_excel took " + str(delta) + " secs")
    return delta

# Iterate over rows
def toExcelerate1(file, frame):
    start = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    col_num = frame.shape[1]
    row_num = 1
    for row_tup in frame.itertuples(name=None, index=False):
        ws.range((row_num, 1), (row_num, col_num)).value = [[*row_tup]]
        row_num += 1
    wb.save(file)
    end = timeit.default_timer()
    delta = round(end - start, 2)
    print("pyexcelerate1 took " + str(delta) + " secs")
    return delta

# Iterate over columns
def toExcelerate2(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    row_num = frame.shape[0]
    col_num = 1
    for col_name, col_series in frame.iteritems():
        ws.range((1, col_num), (row_num, col_num)).value = list(map(lambda a: [a], col_series))
        col_num += 1
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time - start_time, 2)
    print("pyexcelerate2 took " + str(delta) + " secs")
    return delta

# Write the whole frame in a single range assignment
def toExcelerate3(file, frame):
    start_time = timeit.default_timer()
    wb = Workbook()
    ws = wb.new_sheet("test")
    row_num = frame.shape[0]
    col_num = frame.shape[1]
    ws.range((1, 1), (row_num, col_num)).value = frame.values.tolist()
    wb.save(file)
    end_time = timeit.default_timer()
    delta = round(end_time - start_time, 2)
    print("pyexcelerate3 took " + str(delta) + " secs")
    return delta

def perf(base, eps):
    perc = (base - eps) / base
    print("{0:.2%} over base time".format(perc))

print("BUILDING DATAFRAME")
frame = pd.DataFrame()
rows = int(math.pow(10, 5))
cols = 5000
for x in range(cols):
    frame[str(x)] = pd.Series(np.random.randn(rows))

print("to_excel()")
base = toPandasExcel("pandas.xlsx", frame)

print("pyexcelerate1")
e1 = toExcelerate1("pyexcelerate1.xlsx", frame)
perf(base, e1)

print("pyexcelerate2")
e2 = toExcelerate2("pyexcelerate2.xlsx", frame)
perf(base, e2)

print("pyexcelerate3")
e3 = toExcelerate3("pyexcelerate3.xlsx", frame)
perf(base, e3)
```
@raffam : Thanks for this! This will be very helpful.
Used the toExcelerate3 function and the default xlsx writer to write a 333243×34 dataframe to file. Results:
to_excel (xlsx) default: 1360 seconds
toExcelerate3: 773 seconds
For whoever is interested, I created a simple helper function to write DataFrames to Excel (including headers and index) using pyexcelerate: https://gist.github.com/mapa17/bc04be36e447cab0746a0ec8903cc49f
I thought about adding an Excel writer engine to pandas/pandas/io/excel.py but I am a bit worried looking through the other already implemented engines. They support all kinds of fancy cell formatting.
Do you think it would be sufficient to provide a minimalistic Excel writer engine using pyexcelerate, writing only unformatted Excel files?
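The gist linked above is along these lines; a rough sketch (not the gist's exact code) of writing a DataFrame including its header and index through pyexcelerate:

```python
import pandas as pd
from pyexcelerate import Workbook

def df_to_excel_with_index(df, path, sheet_name='Sheet1'):
    # First row: a corner cell for the index name, then the column labels.
    header = [df.index.name or ''] + [str(c) for c in df.columns]
    # Body rows: the index value followed by that row's values.
    body = [[idx] + list(row) for idx, row in zip(df.index, df.values.tolist())]
    wb = Workbook()
    wb.new_sheet(sheet_name, data=[header] + body)
    wb.save(path)
```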
Hmm...from a maintenance perspective, I think I would want to maintain fewer engines rather than more because of the compatibility issues down the road (e.g. maintaining consistency between the Python and C parsers for CSV has been quite difficult).
I could potentially see this as a "fast track," but it sounds like a bit of a corner case. Thus, I'm -0.5 on this overall because I'm leaning towards this being more maintenance burden than benefit for the broader audience.
cc @jreback
This is indeed too slow; I'm not surprised to find others thinking the same. Five minutes to produce a 50 MB Excel file is too much.
Based on the conversation above, I think we'll close this for now as it's unclear if there's really anything to be gained here. If anyone disagrees, feel free to reopen.