Hello,
I don't know if this is possible, but it would be great to find a way to speed up the to_csv method in Pandas.
With my admittedly large dataframe of 20 million observations and 50 variables, it takes literally hours to export the data to a CSV file. Reading the CSV back into Pandas is much faster, though. I wonder what the bottleneck is here and what can be done to improve the write speed.
CSV files are ubiquitous and a great way to share data (without getting too nerdy with HDF5 and other subtleties). What do you think?
duplicate of #3186
Well, using 1/10 of the rows (about 800MB in memory):
In [5]: df = DataFrame(np.random.randn(2000000,50))
In [6]: df.memory_usage().sum()
Out[6]: 800000072
In [7]: df.memory_usage().sum()/1000000
Out[7]: 800
In [8]: %time df.to_csv('test.csv')
CPU times: user 2min 53s, sys: 3.71 s, total: 2min 56s
Wall time: 2min 57s
That's about 8.5MB/sec in raw throughput, way below I/O speeds, so there is obviously quite some room to improve.
you might be interested in this blog here.
Of course, there IS really no reason at all to use CSV unless you are forced to.
In [11]: %time df.to_hdf('test.hdf','df')
CPU times: user 10.5 ms, sys: 588 ms, total: 599 ms
Wall time: 2.22 s
This is also covered in the documentation: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
hi @jreback, thanks for your help!
Unfortunately I'm not sure I follow: what do you mean by mixed types?
you should NEVER have mixed types (e.g. python objects); that is a big big no-no.
Show a df.info(); you can pass format='table', which is a somewhat slower but much more flexible format for writing HDF5.
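For illustration, here is a minimal sketch of the two HDF5 formats (the file and key names are made up, and writing HDF5 requires the optional PyTables package):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 3), columns=list('ABC'))

# 'fixed' format: fastest to write, but the stored frame cannot be
# appended to or queried on disk
df.to_hdf('demo.h5', key='fixed_df', format='fixed')

# 'table' format: somewhat slower to write, but appendable and queryable;
# data_columns=True makes every column usable in a where= query
df.to_hdf('demo.h5', key='table_df', format='table', data_columns=True)

# the table format supports selecting subsets without loading the whole file
subset = pd.read_hdf('demo.h5', 'table_df', where='A > 0')
```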
I will run the to_hdf line tomorrow and tell you my error message. It may be due to the fact that I have string columns and/or some miscoded observations...
Yes, thanks, I will run df.info(). Keep in touch. Thanks Jeff
good morning Jeff (@jreback). Please find my df.info() below:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10762587 entries, 0 to 12864511
Columns: 275 entries, bagent_name to index_month
dtypes: bool(1), datetime64[ns](16), float64(30), int32(1), int64(172), object(53), timedelta64[ns](2)
memory usage: 22.0+ GB
Do you see anything wrong? I have many string columns here
and this is what I get when I try
df.to_hdf(path + 'main.h5','stored',format='table',mode='w')
TypeError: Cannot serialize the column [zip_dw] because
its data contents are [mixed-integer-float] object dtype
So you probably have something like this. It's actually pretty hard to construct this. If you are not doing this explicitly, then please show how it's constructed, as this should always be converted to a concrete dtype.
You NEVER want object dtypes, except if they are strings. Yes, you can use them, but they are completely non-performant and generally cannot be serialized. These are by definition python objects.
In [7]: df = DataFrame({'A' : [1,2,3], 'B' : Series([1,2.0,3],dtype=object)})
In [8]: df
Out[8]:
A B
0 1 1
1 2 2
2 3 3
In [9]: df.dtypes
Out[9]:
A int64
B object
dtype: object
In [10]: df.to_hdf('test.h5','df',format='fixed')
pandas/core/generic.py:1096: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer-float,key->block1_values] [items->['B']]
return pytables.to_hdf(path_or_buf, key, self, **kwargs)
In [11]: df.to_hdf('test.h5','df',format='table')
TypeError: Cannot serialize the column [B] because
its data contents are [mixed-integer-float] object dtype
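To track down which columns of a wide frame like yours hold such objects, here is a hedged diagnostic sketch (infer_dtype is the pandas helper that reports what an object column actually contains; the loop itself is just an illustration, assuming df is your frame):

```python
import pandas as pd

# list the object-dtype columns and what Python types they actually hold
for col in df.select_dtypes(include='object').columns:
    print(col, pd.api.types.infer_dtype(df[col]))
```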
Hi Jeff, yes I think you are right. zip_dw is a variable that contains a zipcode.
In the data, it looks like 95966, but the dtype is object.
The key questions are:
Thanks again @jreback
Of course it can handle strings, but it makes them fixed width. That's not your issue, though: you have actual Python objects (an actual integer and a float object) that are NOT in numpy (that's why the column is object).
YOU are in charge of your dtypes. So you either need to stringify them, or you can leave them as object and use only the fixed format (it will work, but it pickles the data).
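A minimal sketch of those two options for the zip_dw column discussed above (the column, file, and key names are taken from this thread; which option is right depends on the data, and note that numeric zip codes lose their leading zeros):

```python
import pandas as pd

# option 1: stringify, so every value in the column is a str
df['zip_dw'] = df['zip_dw'].astype(str)

# option 2: coerce to numeric, turning unparseable values into NaN
# df['zip_dw'] = pd.to_numeric(df['zip_dw'], errors='coerce')

# with a concrete dtype in every column, the table format serializes fine
df.to_hdf('main.h5', key='stored', format='table', mode='w')
```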
@randomgambit you have to understand and be cognizant of your data and types. Pandas provides lots of tools and mostly does not let you shoot yourself in the foot, but it cannot do everything, even though it will try very hard to infer things.
thanks @jreback, and sorry if I'm bothering you with such basic questions. I come from a language (Stata) that is much less flexible and where everything is either a float or a string.
To recap, the problem here is that I have a column that contains some numbers (floats and integers) and some strings. This is why pandas treats it as object. You are saying that this mixed-type column causes performance issues and cannot be properly stored in HDF5. To fix this, I should use either to_numeric or astype('str') on that column, so that every value in it has a single dtype.
Is this correct?
Following our chat, I have bought a book about HDF5 and Python; that will help me understand this storage system better. @jreback, if you can just tell me whether my reasoning above is correct, that would help. Thanks, and keep up the great work with Pandas!
so you would typically do something like this. pandas supports many dtypes; you want to type as much as possible. All of these types are supported when serialized to HDF5 (table format), though for example strings become fixed width and NaNs are replaced with a string.
In [31]: s = Series(['a',1,2.0])
In [32]: s
Out[32]:
0 a
1 1
2 2
dtype: object
In [33]: pd.to_numeric(s, errors='coerce')
Out[33]:
0 NaN
1 1.0
2 2.0
dtype: float64
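And the string counterpart, for completeness (a sketch: the column stays object dtype, but every value becomes a homogeneous str, which the table format can serialize as fixed-width strings):

```python
import pandas as pd

s = pd.Series(['a', 1, 2.0])
print(s.astype(str))
# 0      a
# 1      1
# 2    2.0
# dtype: object
```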
So how about an answer to the initial question, rather than getting off track?
I need the CSV format.
Can pandas improve the speed of writing that file format?
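The thread itself does not offer a faster pandas writer, but one workaround (my suggestion, not something proposed in this issue) is to hand the serialization to pyarrow, whose CSV writer is often considerably faster for large numeric frames:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.csv as pacsv

df = pd.DataFrame(np.random.randn(2_000_000, 50))

# the slow path discussed in this issue
df.to_csv('test_pandas.csv')

# convert once to an Arrow table and stream it out as CSV;
# requires the optional pyarrow dependency
pacsv.write_csv(pa.Table.from_pandas(df), 'test_arrow.csv')
```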