In pandas, unlike SQL, the rows seemed to be joining on null values. Is this a bug?
related SO: http://stackoverflow.com/questions/23940181/pandas-merging-with-missing-values/23940686#23940686
Code snippet
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
[[1, None],
[2, 'y']],
columns = ['A', 'B']
)
print df1
df2 = pd.DataFrame(
[['y', 'Y'],
[None, 'None1'],
[None, 'None2']],
columns = ['B', 'C']
)
print df2
print df1.merge(df2, on='B', how='outer')
Output
A B
0 1 None
1 2 y
B C
0 y Y
1 None None1
2 None None2
A B C
0 1 None None1
1 1 None None2
2 2 y Y
You can see row 0 in df1 unexpectedly joins to both rows in df2.
I would expect the correct answer to be
A B C
0 1 None None
1 2 y Y
2 None None None1
3 None None None2
see the linked question as well.
Using None as a missing value is not very standard in pandas, use np.nan
In [18]: df1
Out[18]:
A B
0 1 None
1 2 y
In [19]: df1.fillna(np.nan)
Out[19]:
A B
0 1 NaN
1 2 y
In [20]: df2
Out[20]:
B C
0 y Y
1 None None1
2 None None2
In [21]: df2.fillna(np.nan)
Out[21]:
B C
0 y Y
1 NaN None1
2 NaN None2
You can easily drop the missing values before merging.
In [22]: df1.fillna(np.nan).dropna().merge(df2.fillna(np.nan).dropna(),on='B',how='outer')
Out[22]:
A B C
0 2 y Y
This is the default, as normally this would be up to the user to drop before merging.
(because its not always obvious _what_ to drop, e.g. could easily drop based on the 'on' column, but, I suspect letting the user have control is better).
could do this a bit better I suppose.
Thanks. I think the outer join should follow the standard SQL otherwise it could be confusing.
There are some examples here
In your example, you dropped the rows with null values before doing the outer join. But in the two examples above, the rows are not dropped.
In the first example, student Alan has course NULL. After joining to the instructor table, the Alan row is not dropped. The final result just says Alan has course NULL and instructor NULL.
In the second example, similarly we see that John with a Department ID NULL is not dropped in a full outer join.
A solution was suggested in the SO example above for left join
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')
but for a full outer join we need to do something like
pd.concat([
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id'),
foo[pd.notnull(foo.id)].merge(bar, how='right', on='id')
])
But I think is unnatural and inconsistent with the standard SQL.
my example is very similar to what your soln shows.
I think this is not well defined ATM in pandas. Welcome a pull-request to address this. (I don't know if changing this to SQL behavior will break anything; if it does then that would need to be address).
@socheon Bit late after your comment but I notice this is still open and it helped me recently. I didn't quite get what I was looking for using your snippet but a bit of adjustment returned what I wanted and is more similar to the results expected doing a full outer join in SQL:
def merge_fullouter(df1, df2, on_col):
"""
Does a SQLesque full outer join on two data frames without matching null values
Input: df1, df2, Pandas data frames, data frames to full outer join
on_col, string, the column on which to join the data frames by
"""
return pd.concat([
df1[pd.notnull(df1[on_col])].merge(df2[pd.notnull(df2[on_col])], how="outer", on=on_col),
df1[pd.isnull(df1[on_col])],
df2[pd.isnull(df2[on_col])]
])
Most helpful comment
Thanks. I think the outer join should follow the standard SQL otherwise it could be confusing.
There are some examples here
In your example, you dropped the rows with null values before doing the outer join. But in the two examples above, the rows are not dropped.
In the first example, student Alan has course NULL. After joining to the instructor table, the Alan row is not dropped. The final result just says Alan has course NULL and instructor NULL.
In the second example, similarly we see that John with a Department ID NULL is not dropped in a full outer join.
A solution was suggested in the SO example above for left join
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')but for a full outer join we need to do something like
But I think is unnatural and inconsistent with the standard SQL.