Pandas: improving the speed of to_csv

Created on 13 Apr 2016 · 16 comments · Source: pandas-dev/pandas

Hello,

I don't know if this is possible, but it would be great to find a way to speed up the to_csv method in pandas.

In my admittedly large dataframe, with 20 million observations and 50 variables, it literally takes hours to export the data to a csv file.

Reading the csv in pandas is much faster though. I wonder what the bottleneck is here and what can be done to improve the transfer speed.

Csv files are ubiquitous, and a great way to share data (without being too nerdy with hdf5 and other subtleties). What do you think?

IO CSV Performance


All 16 comments

duplicate of #3186

Well, using 1/10 of the rows, about 800MB in memory:

In [5]: df = DataFrame(np.random.randn(2000000,50))

In [6]: df.memory_usage().sum()
Out[6]: 800000072

In [7]: df.memory_usage().sum()/1000000
Out[7]: 800

In [8]: %time df.to_csv('test.csv')
CPU times: user 2min 53s, sys: 3.71 s, total: 2min 56s
Wall time: 2min 57s

That is about 8.5MB/sec of raw throughput, way below IO speeds, so there is obviously quite some room to improve.
you might be interested in this blog here

Of course there IS really no reason at all to use CSV unless you are forced.

In [11]: %time df.to_hdf('test.hdf','df')
CPU times: user 10.5 ms, sys: 588 ms, total: 599 ms
Wall time: 2.22 s

This is also part of the documentation: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
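For reference, the benchmark above can be sketched in a self-contained way like this (sizes are scaled down from the 2,000,000-row original so it runs quickly; writing to an in-memory buffer isolates serialization cost from disk speed):

```python
import io
import time

import numpy as np
import pandas as pd

# Scaled-down version of the benchmark above:
# 20,000 rows x 50 float columns (the original used 2,000,000 rows)
df = pd.DataFrame(np.random.randn(20000, 50))

buf = io.StringIO()
start = time.perf_counter()
df.to_csv(buf)
elapsed = time.perf_counter() - start

nbytes = len(buf.getvalue())
print(f"wrote {nbytes / 1e6:.1f} MB in {elapsed:.3f}s "
      f"({nbytes / 1e6 / elapsed:.1f} MB/s)")
```

Since the buffer lives entirely in memory, any throughput well below disk speed here points at the CSV serialization itself, which is the claim being made above.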

hi @jreback thanks for your help!

unfortunately

  • yes i am forced
  • i have mixed types in my columns and somehow to_hdf fails

what do you mean by mixed types?

you should NEVER have mixed types (e.g. python objects). that is a big big no-no.

show a df.info(); you can pass format='table', which is a somewhat slower but much more flexible format for writing HDF5

I will run the to_hdf line tomorrow and tell you my error message. it may be due to the fact that i have string columns and/or some miscoded observations...

yes thanks i will run the df info. keep in touch. Thanks jeff

good morning jeff (@jreback)

Please find my df.info

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10762587 entries, 0 to 12864511
Columns: 275 entries, bagent_name to index_month
dtypes: bool(1), datetime64[ns](16), float64(30), int32(1), int64(172), object(53), timedelta64[ns](2)
memory usage: 22.0+ GB

Do you see anything wrong? I have many string columns here

and this is what I get when I try

df.to_hdf(path + 'main.h5','stored',format='table',mode='w')

TypeError: Cannot serialize the column [zip_dw] because
its data contents are [mixed-integer-float] object dtype

So you probably have something like this. It's actually pretty hard to construct this. If you are not doing this explicitly then pls show how it's constructed, as this should always be converted to a concrete dtype.
You NEVER want object dtypes, except if they are strings. Yes you can use them, but they are completely non-performant and generally cannot be serialized. These are by definition python objects.

In [7]: df = DataFrame({'A' : [1,2,3], 'B' : Series([1,2.0,3],dtype=object)})

In [8]: df
Out[8]: 
   A  B
0  1  1
1  2  2
2  3  3

In [9]: df.dtypes
Out[9]: 
A     int64
B    object
dtype: object

In [10]: df.to_hdf('test.h5','df',format='fixed')
pandas/core/generic.py:1096: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer-float,key->block1_values] [items->['B']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)

In [11]: df.to_hdf('test.h5','df',format='table')
TypeError: Cannot serialize the column [B] because
its data contents are [mixed-integer-float] object dtype
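To see why object dtypes are non-performant even before serialization comes into play, here is a small illustrative sketch (the timings are made up by running it, not taken from the thread):

```python
import time

import numpy as np
import pandas as pd

n = 1_000_000
s_float = pd.Series(np.random.randn(n))  # native float64, contiguous C doubles
s_obj = s_float.astype(object)           # same values, boxed as Python float objects

t0 = time.perf_counter(); s_float.sum(); t_float = time.perf_counter() - t0
t0 = time.perf_counter(); s_obj.sum(); t_obj = time.perf_counter() - t0

print(f"float64 sum: {t_float:.4f}s, object sum: {t_obj:.4f}s")
```

The object version has to call into the Python interpreter for every element, while the float64 version is a single vectorized pass in C, so the gap is typically an order of magnitude or more.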

Hi Jeff, Yes I think you are right. zip_dw is a variable that contains a zipcode.
In the data, it looks like 95966 but the dtype is object.

The key questions are:

  • is there something in Pandas to force a (hdf5 compliant) conversion for these object columns?
  • i actually thought hdf5 could handle strings and fast-searching them. Am I wrong?

Thanks again @jreback

of course it can handle strings, but it makes them fixed width. But that's not your issue. you have actual Python objects (an actual integer and a float object), that are NOT in numpy (e.g. that's why the column is object).

YOU are in charge of your dtypes. So you either need to stringify them, or you can leave them as object and use only the fixed format (it will work, but it pickles the data).
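As a concrete sketch of the stringify route (the column name zip_dw is taken from the error above, but the values are made up):

```python
import pandas as pd

# A mixed-integer-float object column, like the zip_dw column in the
# TypeError above (values are made up)
df = pd.DataFrame({'zip_dw': pd.Series([95966, 2115.0, 94105], dtype=object)})

# Stringify: the dtype stays object, but every value is now a string,
# which the HDF5 'table' format can store (as fixed-width strings)
df['zip_dw'] = df['zip_dw'].astype(str)
print(df['zip_dw'].tolist())  # → ['95966', '2115.0', '94105']
```

Note the float 2115.0 becomes the string '2115.0', so if these are really zip codes you may want to convert to integers first.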

@randomgambit you have to understand and be cognizant of your data and types. Pandas provides lots of tools and mostly does not let you shoot yourself in the foot, but it cannot do 'everything' even though it will try very hard to infer things.

thanks @jreback and sorry if I bother you with such basic questions. I come from a language (Stata) that is much less flexible and where everything is either a float or a string.

To recap, the problem here is that I have a column that contains some numbers (floats and integers) and some strings. This is why pandas treats them as objects. You are saying that this mixed-type column causes performance issues and cannot be stored well in hdf5. To fix this, I should probably use either to_numeric or astype('str') on this column, in order to have a single dtype for every value in that column.

Is this correct?

Following our chat, I have bought a book about hdf5 and python. that will help me understand this storage system better. @jreback , if you can just tell me if my reasoning above is correct that would help. Thanks and keep up the great work with Pandas!

so you would typically do something like this. pandas supports many dtypes; you want to type as much as possible. All of these types are supported when serialized to HDF5 (table format), though for example strings become fixed width and NaNs are replaced with a string.
In [31]: s = Series(['a',1,2.0])

In [32]: s
Out[32]:
0 a
1 1
2 2
dtype: object

In [33]: pd.to_numeric(s, errors='coerce')
Out[33]:
0 NaN
1 1.0
2 2.0
dtype: float64
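
Applied to a whole frame, the same coercion can be sketched like this (the column names here are made up for illustration, not taken from the thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],                               # already a clean int64 column
    'b': pd.Series(['x', 1, 2.0], dtype=object),  # mixed object column
})

# Coerce every object column to numeric; values that cannot be parsed
# (here the string 'x') become NaN instead of raising
for col in df.columns[df.dtypes == object]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes)  # 'b' is now float64
```

Whether coercing to NaN or stringifying is right depends on the column: for genuinely numeric data with a few miscoded entries, errors='coerce' is usually what you want.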


So how about an answer to the initial question rather than getting off track.

I need CSV format

Can pandas improve speed of writing that file format?
