Pandas: two dataframes outer join on null values

Created on 16 Jun 2014 · 5Comments · Source: pandas-dev/pandas

In pandas, unlike SQL, the rows seemed to be joining on null values. Is this a bug?
related SO: http://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values/23940686#23940686

Code snippet

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(
    [[1, None],
    [2, 'y']],
    columns = ['A', 'B']
)
print df1

df2 = pd.DataFrame(
    [['y', 'Y'],
    [None, 'None1'],
    [None, 'None2']],
    columns = ['B', 'C']
)
print df2

print df1.merge(df2, on='B', how='outer')

Output

   A     B
0  1  None
1  2     y

      B      C
0     y      Y
1  None  None1
2  None  None2

   A     B      C
0  1  None  None1
1  1  None  None2
2  2     y      Y

You can see row 0 in df1 unexpectedly joins to both rows in df2.

I would expect the correct answer to be

   A       B      C
0  1      None  None
1  2       y      Y
2 None None None1
3 None None None2

API Design Docs Missing-data Reshaping

Source

socheon

Most helpful comment

Thanks. I think the outer join should follow the standard SQL otherwise it could be confusing.
There are some examples here

http://databases.about.com/library/weekly/aa051203a.htm
en.wikipedia.org/wiki/Join_(SQL)

In your example, you dropped the rows with null values before doing the outer join. But in the two examples above, the rows are not dropped.

In the first example, student Alan has course NULL. After joining to the instructor table, the Alan row is not dropped. The final result just says Alan has course NULL and instructor NULL.

In the second example, similarly we see that John with a Department ID NULL is not dropped in a full outer join.

A solution was suggested in the SO example above for left join
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

but for a full outer join we need to do something like

pd.concat([
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id'),
foo[pd.notnull(foo.id)].merge(bar, how='right', on='id')
])

But I think is unnatural and inconsistent with the standard SQL.

socheon on 16 Jun 2014

👍4

All 5 comments

see the linked question as well.

jreback on 16 Jun 2014

Using None as a missing value is not very standard in pandas, use np.nan

In [18]: df1
Out[18]: 
   A     B
0  1  None
1  2     y

In [19]: df1.fillna(np.nan)
Out[19]: 
   A    B
0  1  NaN
1  2    y

In [20]: df2
Out[20]: 
      B      C
0     y      Y
1  None  None1
2  None  None2

In [21]: df2.fillna(np.nan)
Out[21]: 
     B      C
0    y      Y
1  NaN  None1
2  NaN  None2

You can easily drop the missing values before merging.

In [22]: df1.fillna(np.nan).dropna().merge(df2.fillna(np.nan).dropna(),on='B',how='outer')
Out[22]: 
   A  B  C
0  2  y  Y

This is the default, as normally this would be up to the user to drop before merging.
(because its not always obvious _what_ to drop, e.g. could easily drop based on the 'on' column, but, I suspect letting the user have control is better).

could do this a bit better I suppose.

jreback on 16 Jun 2014

Thanks. I think the outer join should follow the standard SQL otherwise it could be confusing.
There are some examples here

http://databases.about.com/library/weekly/aa051203a.htm
en.wikipedia.org/wiki/Join_(SQL)

In your example, you dropped the rows with null values before doing the outer join. But in the two examples above, the rows are not dropped.

In the first example, student Alan has course NULL. After joining to the instructor table, the Alan row is not dropped. The final result just says Alan has course NULL and instructor NULL.

In the second example, similarly we see that John with a Department ID NULL is not dropped in a full outer join.

A solution was suggested in the SO example above for left join
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')

but for a full outer join we need to do something like

pd.concat([
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id'),
foo[pd.notnull(foo.id)].merge(bar, how='right', on='id')
])

But I think is unnatural and inconsistent with the standard SQL.

socheon on 16 Jun 2014

👍4

my example is very similar to what your soln shows.

I think this is not well defined ATM in pandas. Welcome a pull-request to address this. (I don't know if changing this to SQL behavior will break anything; if it does then that would need to be address).

jreback on 16 Jun 2014

@socheon Bit late after your comment but I notice this is still open and it helped me recently. I didn't quite get what I was looking for using your snippet but a bit of adjustment returned what I wanted and is more similar to the results expected doing a full outer join in SQL:

def merge_fullouter(df1, df2, on_col):
    """
    Does a SQLesque full outer join on two data frames without matching null values
    Input: df1, df2, Pandas data frames, data frames to full outer join
           on_col, string, the column on which to join the data frames by
    """
    return pd.concat([
        df1[pd.notnull(df1[on_col])].merge(df2[pd.notnull(df2[on_col])], how="outer", on=on_col),
        df1[pd.isnull(df1[on_col])],
        df2[pd.isnull(df2[on_col])]
])