Pandas: ENH: cross join in merge() and join()

Created on 31 Oct 2013 · 42 comments · Source: pandas-dev/pandas

related here: https://github.com/pydata/pandas/issues/7203

If I haven't missed something, there's no simple way to compute a Cartesian product in pandas. For example, see http://stackoverflow.com/questions/19684765/pandas-extending-the-dataframe-by-adding-another-levels/19684948#19684948.
To do a cross join, one has to add a key column to each DataFrame, like this:

>>> dfA = pd.DataFrame(['a','b','c'],columns=['A'])
>>> dfB = pd.DataFrame(range(3),columns=['B'])
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> pd.merge(dfB, dfA, on='key').loc[:, ['A','B']]
   A  B
0  a  0
1  b  0
2  c  0
3  a  1
4  b  1
5  c  1
6  a  2
7  b  2
8  c  2

How about adding the possibility of passing how='cross' to the merge() and join() methods?

Enhancement Indexing Reshaping


All 42 comments

Why did you close this?

Oops my bad it's totally by accident, hit the button somehow :(

related #3835 , there is a method: from pandas.tools.util import cartesian_product

how is this different from an outer join? (which does this and just drops the na groups)

@jreback could you provide an example of how to do an outer join if there's no column on which to join? Maybe it's possible to do it with apply somehow?

A bit hacky

In [28]: import pandas.tools.util as tools

In [29]: DataFrame(index=MultiIndex.from_arrays(
                  tools.cartesian_product([dfA['A'].tolist(),dfB['B'].tolist()]),
                    names=['A','B'])).reset_index()
Out[29]: 
   A  B
0  a  0
1  a  1
2  a  2
3  b  0
4  b  1
5  b  2
6  c  0
7  c  1
8  c  2

@jreback yes, good one! But what if there's more than one column in either dataframe, or in both?

not sure....can you create input and output frames?

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> dfA
   A  B
0  a  2
1  b  1
>>> dfB
   C  D
0  0  4
1  1  5
>>> dfA['key'] = 1
>>> dfB['key'] = 1
>>> df = pd.merge(dfA, dfB, on='key')
>>> del df['key']
>>> df
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

here's an SQL Fiddle example of a cross join

FWIW, this is something I've often wanted too, and I think is fundamental enough an operation to deserve a better way.

I regularly use the same method as @RomanPekar too (Cartesian products are handy for basket analysis, as they create a normalized version of a sparse matrix). The main thing, I guess, is that you could improve performance inside pandas: since you know you don't have to look at an index to reference back and forth, you can instead multiply the DataFrame out for every possible combination. In regular Python I would expect something like this:

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

list_result = []
for sub_list_1 in list_1:
    for sub_list_2 in list_2:
        list_result.append(sub_list_1 + sub_list_2)
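For what it's worth, the nested loops above can also be written with the standard library's itertools.product, which pairs every element of list_1 with every element of list_2 in the same order:

```python
from itertools import product

list_1 = [['a', 1], ['b', 2], ['c', 3]]
list_2 = [['x', 11], ['y', 12], ['z', 13]]

# product() yields every (sub_list_1, sub_list_2) pairing,
# matching the nested loops above
list_result = [a + b for a, b in product(list_1, list_2)]
```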

When it comes to notation, I myself would like a how='cartesian' where you do not specify left_on and right_on...

Is there any reason not to implement this? The only thing I'm not sure of is what we expect from how='cross' when there _are_ common columns; do we want (1) a complete merge, (2) _x and _y suffixes, or (3) to raise an exception? I.e. which of

>>> df0 = pd.DataFrame({"A": [1,2], "B": [3,4]})
>>> df1 = pd.DataFrame({"C": [5,6], "B": [3,8]})
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, how='outer'); del m["key"]; print(m)
    A  B   C
0   1  3 NaN
1   2  4 NaN
2 NaN  3   5
3 NaN  8   6
>>> df0["key"] = 1; df1["key"] = 2; m = df0.merge(df1, on="key", how='outer'); del m["key"]; print(m)
    A  B_x  B_y   C
0   1    3  NaN NaN
1   2    4  NaN NaN
2 NaN  NaN    3   5
3 NaN  NaN    8   6

do we want?

+1, I just hit this issue too while trying to represent factors of Bayesian networks as dataframes. When factors have no common variables, their product is a Cartesian product, i.e., a cross join. My workaround was to add a dummy column and join on that, then remove it afterwards. I see other people do that as well.

In this application it doesn't matter what happens when there are common columns since I need to treat one of them specially anyway, but I guess adding suffixes would be a sane default; that would be like joining on an empty list of columns.
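To make that use case concrete, a factor product over disjoint variables can be written with the dummy-key workaround; the factor tables fA and fB below are made up for illustration:

```python
import pandas as pd

# hypothetical factors P(A) and P(B) over disjoint variables
fA = pd.DataFrame({'A': [0, 1], 'p': [0.3, 0.7]})
fB = pd.DataFrame({'B': [0, 1], 'p': [0.4, 0.6]})

# cross join via a dummy key, then multiply the probability columns
prod = fA.assign(_k=1).merge(fB.assign(_k=1), on='_k', suffixes=('_a', '_b'))
prod['p'] = prod['p_a'] * prod['p_b']
prod = prod.drop(columns=['_k', 'p_a', 'p_b'])
```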

Because concat doesn't give you the Cartesian product:

>>> dfA = pd.DataFrame({'A':list('ab'), 'B':range(2,0,-1)})
>>> dfB = pd.DataFrame({'C':range(2), 'D':range(4, 6)})
>>> pd.concat([dfA, dfB])
     A   B   C   D
0    a   2 NaN NaN
1    b   1 NaN NaN
0  NaN NaN   0   4
1  NaN NaN   1   5
>>> dfA["key"] = 1; dfB["key"] = 1;
>>> m = dfA.merge(dfB,how='outer'); del m["key"];
>>> m
   A  B  C  D
0  a  2  0  4
1  a  2  1  5
2  b  1  0  4
3  b  1  1  5

It's worth noting that attempting a CROSS JOIN with an ON predicate raises a syntax error in postgres. Pandas should disallow the on argument.

pcloud=# select * from orders;
 number | order_date | cust_id | salesperson_id | amount
--------+------------+---------+----------------+--------
     10 | 1996-08-02 |       4 |              2 |    540
     20 | 1999-01-30 |       4 |              8 |   1800
     30 | 1995-07-14 |       9 |              1 |    460
     40 | 1998-01-29 |       7 |              2 |   2400
     50 | 1998-02-03 |       6 |              7 |    600
     60 | 1998-03-02 |       6 |              7 |    720
     70 | 1998-05-06 |       9 |              7 |    150
     80 | 1999-07-23 |       4 |             12 |    200
     90 | 2000-07-23 |       9 |             12 |   2300
(9 rows)

pcloud=# select * from customer;
 id |   name   |   city   | industry_type
----+----------+----------+---------------
  4 | Samsonic | pleasant | J
  6 | Panasung | oaktown  | J
  7 | Samony   | jackson  | B
  9 | Orange   | Jackson  | B
(4 rows)

pcloud=# select * from orders cross join customer on orders.cust_id = customer.id;
ERROR:  syntax error at or near "on"
LINE 1: select * from orders cross join customer on orders.cust_id =...
                                                 ^
pcloud=# select * from orders left join customer on orders.cust_id = customer.id;
 number | order_date | cust_id | salesperson_id | amount | id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+----+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |  4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |  4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |  9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |  7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |  6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |  6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |  9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |  4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |  9 | Orange   | Jackson  | B
(9 rows)

We need to differentiate same-name columns, because the Cartesian product can generate different values in each:

pcloud=# select * from orders, customer2;

 number | order_date | cust_id | salesperson_id | amount | cust_id |   name   |   city   | industry_type
--------+------------+---------+----------------+--------+---------+----------+----------+---------------
     10 | 1996-08-02 |       4 |              2 |    540 |       4 | Samsonic | pleasant | J
     10 | 1996-08-02 |       4 |              2 |    540 |       6 | Panasung | oaktown  | J
     10 | 1996-08-02 |       4 |              2 |    540 |       7 | Samony   | jackson  | B
     10 | 1996-08-02 |       4 |              2 |    540 |       9 | Orange   | Jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       4 | Samsonic | pleasant | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       6 | Panasung | oaktown  | J
     20 | 1999-01-30 |       4 |              8 |   1800 |       7 | Samony   | jackson  | B
     20 | 1999-01-30 |       4 |              8 |   1800 |       9 | Orange   | Jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       4 | Samsonic | pleasant | J
     30 | 1995-07-14 |       9 |              1 |    460 |       6 | Panasung | oaktown  | J
     30 | 1995-07-14 |       9 |              1 |    460 |       7 | Samony   | jackson  | B
     30 | 1995-07-14 |       9 |              1 |    460 |       9 | Orange   | Jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       4 | Samsonic | pleasant | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       6 | Panasung | oaktown  | J
     40 | 1998-01-29 |       7 |              2 |   2400 |       7 | Samony   | jackson  | B
     40 | 1998-01-29 |       7 |              2 |   2400 |       9 | Orange   | Jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       4 | Samsonic | pleasant | J
     50 | 1998-02-03 |       6 |              7 |    600 |       6 | Panasung | oaktown  | J
     50 | 1998-02-03 |       6 |              7 |    600 |       7 | Samony   | jackson  | B
     50 | 1998-02-03 |       6 |              7 |    600 |       9 | Orange   | Jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       4 | Samsonic | pleasant | J
     60 | 1998-03-02 |       6 |              7 |    720 |       6 | Panasung | oaktown  | J
     60 | 1998-03-02 |       6 |              7 |    720 |       7 | Samony   | jackson  | B
     60 | 1998-03-02 |       6 |              7 |    720 |       9 | Orange   | Jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       4 | Samsonic | pleasant | J
     70 | 1998-05-06 |       9 |              7 |    150 |       6 | Panasung | oaktown  | J
     70 | 1998-05-06 |       9 |              7 |    150 |       7 | Samony   | jackson  | B
     70 | 1998-05-06 |       9 |              7 |    150 |       9 | Orange   | Jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       4 | Samsonic | pleasant | J
     80 | 1999-07-23 |       4 |             12 |    200 |       6 | Panasung | oaktown  | J
     80 | 1999-07-23 |       4 |             12 |    200 |       7 | Samony   | jackson  | B
     80 | 1999-07-23 |       4 |             12 |    200 |       9 | Orange   | Jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       4 | Samsonic | pleasant | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       6 | Panasung | oaktown  | J
     90 | 2000-07-23 |       9 |             12 |   2300 |       7 | Samony   | jackson  | B
     90 | 2000-07-23 |       9 |             12 |   2300 |       9 | Orange   | Jackson  | B
(36 rows)

Okay, @cpcloud, so if I understand the semantics we need to match so as not to violate least surprise for our database-minded friends: shared column names should behave as if they had an implicit suffix to distinguish them?

@dsm054 Yep.

I've got this working for merge, so that df0.merge(df1, how='cross') works exactly as it would if you introduced the temp column, but I'm not entirely sure what to do with join. Thoughts?

Any update on this? Would love to see this feature/enhancement in an upcoming release

For API purposes, would it be useful to just allow an arbitrary join condition? That way, cross-join is just the special case of no condition (None, which is interpreted as lambda x, y: True).
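A rough sketch of what such an arbitrary join condition could look like; the conditional_join helper and its row-wise predicate are hypothetical, not an existing pandas API:

```python
import pandas as pd

def conditional_join(left, right, predicate=None):
    # hypothetical helper: cross join via a dummy key, then keep only the
    # merged rows satisfying the predicate; no predicate == plain cross join
    out = (left.assign(_k=1)
               .merge(right.assign(_k=1), on='_k')
               .drop(columns='_k'))
    if predicate is not None:
        out = out[out.apply(predicate, axis=1)].reset_index(drop=True)
    return out
```

Special-casing the no-condition cross join would of course be faster than materializing and filtering the full product like this.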

@pkch on='cross' is a nice API here (it _could_ be implemented as a general conditional, but is more performant as a special case). Separately the on= could be a condition (eg. a callable)

I just came across this thread after looking for an answer to the same question. I think the following API would be great, noting that when the how argument is set to 'cross', the on argument must be None.

For pd.merge:
result = pd.merge(df1, df2, how='cross')

For pd.DataFrame.merge:
result = df1.merge(df2, how='cross')

For pd.DataFrame.join:
result = df1.join(df2, how='cross')

My experience is the same as noted above, in that the best way to accomplish this in pandas currently is:

df1['key'] = 1
df2['key'] = 1
result = pd.merge(df1, df2, on='key')

That works, but in my opinion isn't readable enough to use in polite company. Thoughts on how to move forward?

I was just thinking about this the other day when I was looking at old branches. It seems reasonable to try to get this into 0.21.

I'd like to get some input on the possible behaviour of a cross join. I have a candidate implementation, but while testing different cases I came across some decisions I want people to have a chance to think about. Partly this is because, after lots of time spent on SO, I think we chose the wrong default for as_index in groupby, and it's too late to change now; I'd prefer that not to happen again, so if someone else can spot a problem, that'd be great.

Overview

Based on the earlier discussion, I'm proposing a new option how='cross', which would give the Cartesian product of the two dataframes. In terms of row values, this would be equivalent to adding a dummy column to each frame, performing an inner merge on that column, and then deleting it from the result, which is the standard workaround at the moment.

As an example:

In [17]: df0 = pd.DataFrame({"A": [1,2]})

In [18]: df1 = pd.DataFrame({"B": [3,4,5]})

In [19]: df0.merge(df1, how='cross')
Out[19]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [20]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[20]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5
Parameters

`pd.merge` has lots of options, though, and we need to decide how they're being handled. Here are my thoughts so far.

  • `on`, `left_on`, `right_on`: if they're specified, raise a ValueError. `how='cross'` implicitly specifies its own join target for each frame, and so it doesn't make sense to allow these.
  • `sort`: keep but ignore. The join keys are effectively all equal, so sorting has no effect, and leaving it in is harmless.
  • `suffixes`: keep. In the case of matching column names in each frame, it'll behave just like the reference merge, and we get
In [25]: df0.merge(df1, how='cross')
Out[25]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50

In [26]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[26]: 
   A  C_x  B  C_y
0  1   10  3   30
1  1   10  4   40
2  1   10  5   50
3  2   20  3   30
4  2   20  4   40
5  2   20  5   50
  • `copy`: keep, although whether or not it does anything will depend on the current merge implementation (and the promise "not unnecessarily" is pretty weak anyhow).
  • `indicator`: keep. The output won't be very interesting, as it will always be "both", but it's harmless.
  • `validate`: behave exactly as if we're doing an inner merge on a dummy column (so that two one-row frames pass a `'one_to_one'` check, but longer frames would raise a MergeError).

Index

For me, the interesting question is what to do with `left_index` and `right_index`. I can think of at least three defensible options:

1) Raise if either `left_index` or `right_index` is not False. The docstring gives "Use the index from the left DataFrame as the join key(s)" as the meaning of `left_index`, but `how='cross'` already specifies a custom join key, so it doesn't make sense.

Advantage: very easy to explain and understand. Also doesn't prevent us from changing our minds later, as it's much easier to give code which used to fail a new meaning than to change the meaning of code which used to give a result. If you want the index information to be preserved, you'd still have the option of calling `.reset_index()` on the arguments.

Disadvantage: pandas code is already too cluttered with `.reset_index()` calls, IMHO.

2) Behave exactly as merging on a dummy column would. In other words:
In [72]: df0 = pd.DataFrame({"A": [1,2]}, index=[10,20])

In [73]: df1 = pd.DataFrame({"B": [3,4,5]}, index=[20,30,40])

In [74]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner').drop('m', axis=1)
Out[74]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [75]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True).drop('m', axis=1)
Out[75]: 
    A  B
20  1  3
30  1  4
40  1  5
20  2  3
30  2  4
40  2  5

In [76]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1)
Out[76]: 
    A  B
10  1  3
10  1  4
10  1  5
20  2  3
20  2  4
20  2  5

In [77]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', left_index=True, right_index=True).drop('m', axis=1)
Out[77]: 
    A  B
20  2  3

That is, in the case of left_index=right_index=False, give the natural index (which is an Int64Index, not a RangeIndex, but that's minor). In the case of left_index=True, repeat the right index values. In the case of right_index=True, repeat the left index values. In the case of both, keep only the shared indices.

Advantage: the behaviour, while not obvious, can be explained by reference to the workaround everyone uses to get the cross join now.

Disadvantage: it's really not obvious at first glance that this is the behaviour you get if you specify on as well as left/right index, and setting them both breaks the expectation that you'd have len(df0) * len(df1) rows in the output of a cross join.

3a) Do the same as in #2, but instead of keeping only the common indices when left_index and right_index are True, set the index to the multiindex product of the two indices:

In [109]: m0 = pd.MultiIndex.from_product([df0.index, df1.index])

In [110]: df0.assign(m=1).merge(df1.assign(m=1), on='m', how='inner', right_index=True).drop('m', axis=1).set_index(m0)
Out[110]: 
       A  B
10 20  1  3
   30  1  4
   40  1  5
20 20  2  3
   30  2  4
   40  2  5

Advantage: now the cross join produces the expected number of rows, and there's a way to preserve the index information if you want it, while the default (with left_index=right_index=False) would continue to ignore it.

Disadvantage: now it really feels like we're reinterpreting left_index and right_index in the how='cross' case.

3b) Do the same as in (3a), but change the interpretation of the index flags in the case of how='cross' to mean _keep_ that index information. In other words, left_index=True would give you an index in the result which consists of df0's index elements repeated appropriately, right_index=True would give you df1's index elements repeated appropriately, and both true would give you the multiindex.

Advantage: all of those of 3a, with a much easier way to explain what gets used in the resulting index.

Disadvantage: inverts the behaviour from anyone used to the merge behaviour in this (admittedly obscure) case.

Conclusion

The fact we have indices to worry about complicates things a bit. Option #1 (raise) is certainly the simplest. I think option #2, although I like the consistency with the workaround, is ruled out by the fact it doesn't always return the expected number of elements.

I don't think I've ever used on with left_index=True or right_index=True. To be honest, it feels a little weird that it works in the first place.. but for me since I never use it, any incompatibility with current behaviour isn't a problem. If I were the only user of pandas I'd go for 3b and get on with my day, but I want to preserve least surprise for anyone who might be using it.

Thoughts?

Wow, great writeup @dsm054. I agree with ruling out option 2, since it could yield the wrong row count. And 3b stands out to me as the least surprising behavior. Namely,

  • I get the correct number of rows.
  • I get an ordered integer index when I use the default values of left_index and right_index. (Same behavior as other types of merge.)
  • I have the ability to specify which original indices I want to keep, if any.

You mentioned that 3b inverts the usual interpretation of left_index and right_index. Can you explain why you think so? My understanding is that merge gives an integer index by default, and only preserves index info from the original dataframes if left_index or right_index is specified. So 3b seems in line with that behavior, unless I'm misunderstanding something.

As for option 1, can you elaborate on your comment about reset_index(), and how it could be used to restore the indices from the dataframes being merged? I don't see how that is possible, but again that might be my misunderstanding.

On 1, I meant that you can reset_index before you call merge:

In [12]: pd.merge(df0, df1, how='cross')
Out[12]: 
   A  B
0  1  3
1  1  4
2  1  5
3  2  3
4  2  4
5  2  5

In [13]: pd.merge(df0.reset_index(), df1.reset_index(), how='cross')
Out[13]: 
   index_x  A  index_y  B
0       10  1       20  3
1       10  1       30  4
2       10  1       40  5
3       20  2       20  3
4       20  2       30  4
5       20  2       40  5

On 3b's "inversion", what I mean is that if you think of the cross join as an inner merge on a dummy column, then when you do left_index=True, you get the right index elements in the result, and vice versa:

In [21]: df0
Out[21]: 
    A
10  1
20  2

In [22]: df1
Out[22]: 
    B
20  3
30  4
40  5

In [23]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', left_index=True)
Out[23]: 
    A  m  B
20  1  1  3
30  1  1  4
40  1  1  5
20  2  1  3
30  2  1  4
40  2  1  5

In [24]: pd.merge(df0.assign(m=1), df1.assign(m=1), how='inner', on='m', right_index=True)
Out[24]: 
    A  m  B
10  1  1  3
10  1  1  4
10  1  1  5
20  2  1  3
20  2  1  4
20  2  1  5

which is the opposite relation of X_index=True to the resulting index from the one you might expect.

BUT given #16228, which would forbid the use of left_index/right_index being set when left_on or right_on is set, maybe this isn't a problem, because either we follow option #1 and bypass the whole issue, or there's not going to be any other behaviour to interfere with anymore because you won't be setting left_on and left_index together.

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

If we're going down the path of disallowing those combinations of args, obviously option 1 is the way to go.

yes #16228 would disallow combo operations here (and it should account for any of index setting w/ on setting)

@ngould

Thanks, got it! Seems like for consistency, #16228 should also aim to forbid on with any *_index args. Right now it only mentions right_on with right_index in particular. Would you agree, @dsm054?

the purpose is left_on with left_index and right_on with right_index

I came up with an idea for how to do it:

def crossjoindf(df1, df2):
    # key column name built from both frames' column names, so it won't collide
    keycol = str(set(df1.columns) | set(df2.columns))
    df1[keycol] = 0
    df2[keycol] = 0
    return pd.merge(df1, df2, on=keycol).drop([keycol], axis=1)

It might be too easy to build a particular `merge(df1, df2, how='cross')` function for it.
Is that the reason why there is no such function in the pandas documentation?
(T_^)

#19614

Okay, I want to bring this back. If, to avoid ambiguity and weird corner cases of interpretation, we disallow on, left_on, right_on, left_index, and right_index, then we can pretty easily have

In [35]: df.merge(df2, how='cross')                                                                                         
Out[35]: 
   A  B  C   D
0  1  2  7  77
1  1  2  8  88
2  2  3  7  77
3  2  3  8  88
4  3  4  7  77
5  3  4  8  88

In [36]: df.merge(df2.rename(columns={"C": "B"}), how='cross')                                                              
Out[36]: 
   A  B_x  B_y   D
0  1    2    7  77
1  1    2    8  88
2  2    3    7  77
3  2    3    8  88
4  3    4    7  77
5  3    4    8  88

@jreback: any objection if I bring this back with a PR?

Hi, I've recently written a canonical post on computing the cartesian product of two DataFrames on SO.
Included in that post is a fast and simple method of computing the cartesian product using np.broadcast_arrays. Here it is:

import numpy as np
import pandas as pd

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

This is just a very basic example and does not handle column names and suffixes. It is by no means the fastest possible method, but I feel it could be a good starting point in terms of simplicity and performance. I have added performance benchmarks in the answer, if you would like to check it out.

@jreback, @dsm054: What are your thoughts?

@dsm054 we have to have this; I think we should address #16228 first, then the choice of what to do is easier. I would simply do the 'easy' impl of adding a column (you would probably have to generate a temporary hash column name to avoid collisions), which is fine. In a later PR we could explore whether this is performant (might be ok).
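The "temporary hash column" idea could be sketched like this; the cross_join name and the uuid-based key are just one possible choice:

```python
import uuid

import pandas as pd

def cross_join(df1, df2):
    # a random key name, so it can't collide with real columns
    key = '_cross_' + uuid.uuid4().hex
    merged = pd.merge(df1.assign(**{key: 1}), df2.assign(**{key: 1}), on=key)
    return merged.drop(columns=key)
```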

@Coldsp33d I am not sure that solution is going to work on mixed dtypes, nor will it easily handle things like column suffixing.

@jreback Fair point!

Perhaps we can substitute the calls to column_stack and .values.__getitem__ with pd.concat and loc, but my tests show a significant reduction in speed (almost 10x) for larger frames. If there is some efficient way of determining whether the columns contain mixed dtypes (e.g., pd.api.types.infer_dtype()), we could decide whether to use the optimized version or fall back to a more primitive approach that preserves the types. I will need to do some more tests to see if it is worth it, or whether something more substantial is required.
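One possible dtype-preserving fallback along those lines, using positional .iloc indexing instead of .values so each column keeps its own dtype (a sketch, not benchmarked):

```python
import numpy as np
import pandas as pd

def cartesian_product_iloc(left, right):
    # build the repeated row positions by broadcasting, then index with
    # .iloc so mixed dtypes survive (no conversion to a common ndarray)
    ia, ib = np.broadcast_arrays(*np.ogrid[:len(left), :len(right)])
    l = left.iloc[ia.ravel()].reset_index(drop=True)
    r = right.iloc[ib.ravel()].reset_index(drop=True)
    return pd.concat([l, r], axis=1)
```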

Was there ever any conclusion here? how='cross' would be super helpful.

a pull request for a naive implementation would be a good start

It looks like it has not been implemented yet...

Hi, Is this topic abandoned?

Really expect this.....

It seems this has not been implemented yet. I'm using the latest version, 1.0.3.

Will this feature get implemented or it's still in discussion ? Thank You

@ishmeetk things are implemented when someone from the community submits a pull request.

this is not hard and would welcome it

the core team can provide review
