Pandas: Concat slow when frames indexed separately

Created on 1 Mar 2018 · 10 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

frames_difindex = []
frames_sameindex = []
for i in range(1000):
    # Create a single-row frame where half the elements are null
    a = np.random.rand(600) - 0.5
    a[a < 0] = np.nan

    # All rows here have the same column index
    f = pd.DataFrame({str(i): a}).T
    frames_sameindex.append(f)

    # Rows here will have separate column indices
    f = f.dropna(axis=1)
    frames_difindex.append(f)

%%timeit
_ = pd.concat(frames_sameindex)
# Output:
# 10 loops, best of 3: 50.6 ms per loop

%%timeit
_ = pd.concat(frames_difindex)
# Output:
# 1 loop, best of 3: 24.7 s per loop

Problem description

I've been trying to concat about 50k wide frames (1 or 2 rows by up to 600 cols). I noticed that when the columns are fairly sparse, it takes about 17 hours to complete. That's a huge difference from the nearly immediate result when the indices are all the same.

I boiled it down to the above example where you can see the 500x difference for just a set of 1000 dataframes.

I've been trying to find a better way of doing this, but haven't had any luck. So I'm reporting it here.
Even https://tomaugspurger.github.io/modern-4-performance.html suggests that concat() should be the way to go.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.96+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: 0.3.0
pandas_datareader: None

Labels: Performance, Reshaping

All 10 comments

Hi, thanks for submitting this issue!

Yeah, concat is usually the way to go, but usually under the assumption that all the DataFrame columns are the same, which is the case in your frames_sameindex but not in frames_difindex.

I was going to look at the concat source to determine why it's slow for frames_difindex, but it's extensive and I don't have the time right now. I'm guessing there are a lot of column additions being done for each of the 1,000 frames.

Out of curiosity, why are you looking to concat DataFrames that are seemingly so different?

Thank you for taking a look!

My use case was that I had a timeseries of many data points leading up to some event, for each of these 46k records. The timeseries was represented in a wide format, so the columns were N=1,2,3,4,5,... Some frames only had a handful of events (cols 1-20, say), while others had the whole window (cols 1-600).
These timeseries were all generated in a parallel pipeline, which is why they were individual dataframes.

I came up with a good workaround, which was just to hard-code the max window size I want and reindex the columns upstream to that: df.reindex(RangeIndex(0,600), axis=1) (sketched below). That way, when it comes time to concat, the column indices are all the same, and it doesn't take too long.
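A minimal sketch of that upstream-reindex workaround, using the frames from the example at the top of the issue (the variable name target_cols is mine, and the hard-coded 600 only matches this example):

import pandas as pd

# Give every frame the same column index up front, so concat only sees
# identical columns and takes the fast path.
target_cols = pd.RangeIndex(0, 600)
frames_fixed = [df.reindex(target_cols, axis=1) for df in frames_difindex]
result = pd.concat(frames_fixed)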

Admittedly this is an odd use-case, though I feel like I've had other performance issues with concat'ing dissimilar frames before. I guess the code path is vastly different when the indices aren't equal. There might be some optimization there. Perhaps a pass to reindex everything to the union of all input indices first, then do the concat. For larger sets that may speed things up.

Yeah, looks like there should be some way to speed situations like this up, but I'm not super familiar with this part of the codebase.

Some quickly put together timings using a generic approach that forces reindexing:

In [2]: from pandas.core.indexes.api import _union_indexes
   ...:
   ...: def concat_force_reindex(dfs):
   ...:     all_cols = _union_indexes([df.columns for df in dfs])
   ...:     dfs = (df.reindex(columns=all_cols) for df in dfs)
   ...:     return pd.concat(dfs)

In [3]: a = pd.concat(frames_difindex)

In [4]: b = concat_force_reindex(frames_difindex)

In [5]: a.equals(b)
Out[5]: True

In [6]: %timeit concat_force_reindex(frames_difindex)
667 ms ± 3.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit pd.concat(frames_difindex)
26.8 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit pd.concat(frames_sameindex)
50.3 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So forcing a reindex is ~13x slower than concatenating with common columns, but that's far better than the baseline ~530x difference.

I wonder if you could save your frames in "long form", concat them, and then perform a pivot?
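For reference, a rough sketch of what that long-form route could look like with the example frames above; the stack/pivot combination and the column names ("row", "col", "val") are my own reading of the suggestion, not code from the thread:

import pandas as pd

long_parts = []
for df in frames_difindex:
    # Each frame is a single row; stack() turns it into a long Series keyed by
    # (row label, column label), implicitly dropping the NaNs.
    s = df.stack()
    long_parts.append(s.rename_axis(["row", "col"]).reset_index(name="val"))

# Concatenating the small long-form pieces avoids aligning mismatched columns.
long = pd.concat(long_parts, ignore_index=True)

# Pivot back to wide form: one row per original frame, one column per position.
wide = long.pivot(index="row", columns="col", values="val")

Note that the pivot sorts the row labels, so the row order may differ from a straight pd.concat of the wide frames.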

The reindexing method certainly is a huge improvement. If that were standard behavior for concat, I don't think I would be nearly so bothered. Though I doubt you'd want to apply it in every case, and I'm not sure how to determine when you'd want to use it and when not.

I did try the long-form concat you suggested as well. With my 50k-frame dataset, it was just never completing. I tried it with a smaller contrived 10k-row dataset and got it to finish in a reasonable time:

sameindex_t = [df.T for df in frames_sameindex]
difindex_t = [df.T for df in frames_difindex]
%%timeit
sc = pd.concat(sameindex_t, axis=1)
# Output
# 1 loop, best of 3: 385 ms per loop
%%timeit
dc = pd.concat(difindex_t, axis=1)
# Output
# 1 loop, best of 3: 1.88 s per loop

Also it seems there may be some memory issues with pd.concat on frames with different indices. With 10k rows, my pd.concat(frames_difindex) call is causing the test Colab I'm using to fail, I suspect because it hit its memory limit. I haven't looked much into it, though. TBH I never found a good way to measure memory usage.
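For what it's worth, one rough way to get a peak-memory figure for a single call is the standard-library tracemalloc module; this is a generic sketch, not something used in the thread, and it only counts allocations that Python's allocator sees, so it's an approximation:

import tracemalloc
import pandas as pd

tracemalloc.start()
_ = pd.concat(frames_difindex)                    # the call under test
current, peak = tracemalloc.get_traced_memory()   # bytes, as seen by tracemalloc
tracemalloc.stop()
print("peak traced memory: %.1f MB" % (peak / 1e6))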

Performance improvements are always welcome! I think you could certainly check whether the indices are equivalent before reindexing, though I'm not sure if that's the way we would want to go.

cc @jreback @jorisvandenbossche

Yeah, I wasn't suggesting that we should always use the reindexing method, or even that we should use it at all. It was more a proof of concept that improving the performance is possible, and it provides a benchmark comparison for whatever solution is implemented.

concat should basically do such a reindex under the hood, so I think it should certainly be possible to improve the performance to more or less that level of speed.

Is there any chance of incorporating @jschendel's code into pandas, as @jorisvandenbossche suggested?

That would need somebody diving into the code to see how this performance bottleneck can be improved (it would probably not be a simple matter of adding the code snippet to concat; it would need to be integrated into the existing code base).
But somebody looking into this, and PRs, are certainly very welcome!
