```python
pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]}).merge(
    pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]}),
    how='left', left_on='b', right_on='d')
```
df1:
  | a | b
-- | -- | --
0 | 1 | 4.0
1 | 2 | 5.0
2 | 3 | NaN
df2:
  | c | d
-- | -- | --
0 | 6 | 4.0
1 | 7 | NaN
2 | 8 | NaN
3 | 9 | 5.0
Current output:
  | a | b | c | d
-- | -- | -- | -- | --
0 | 1 | 4.0 | 6 | 4.0
1 | 2 | 5.0 | 9 | 5.0
2 | 3 | NaN | 7 | NaN
3 | 3 | NaN | 8 | NaN
Expected Output
  | a | b | c | d
-- | -- | -- | -- | --
0 | 1 | 4.0 | 6 | 4.0
1 | 2 | 5.0 | 9 | 5.0
What's happening is that the NaN in df1.b is matching the NaNs in df2.d.
I don't see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be a documented option on DataFrame.merge that controls it.
What do you think?
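For reference, the behavior above can be reproduced with a self-contained script, using df1 and df2 exactly as shown in the tables (this reflects the reported behavior on 0.23.4, and recent pandas releases still match NaN keys the same way):

```python
import numpy as np
import pandas as pd

# df1 and df2 from the example above
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]})
df2 = pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]})

result = df1.merge(df2, how='left', left_on='b', right_on='d')
print(result)

# The NaN in df1.b matches both NaN rows in df2.d, so the left
# join yields 4 rows instead of the expected 3.
```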
pd.show_versions():
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I was able to solve my problem with
```python
df1.merge(df2.dropna(subset=['d']), how='left', left_on='b', right_on='d')
```
Nevertheless, I still feel like this shouldn't be the default behavior for DataFrame.merge.
What do you think?
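For completeness, a runnable version of that workaround (df1 and df2 as defined earlier in the thread) gives the expected 3-row result:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]})
df2 = pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]})

# Dropping NaN keys from the right side before merging prevents the
# spurious NaN-to-NaN matches; the NaN row in df1 simply gets no match.
result = df1.merge(df2.dropna(subset=['d']), how='left',
                   left_on='b', right_on='d')
print(result)
```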
Hmm yea I don't think the NA values should be producing a match here - @TomAugspurger any thoughts?
Agreed, I also would not expect NAs to match here.
It might be good to explore a bit if we have always been doing that, and if we do this consistently within pandas (in which case we should certainly do some kind of deprecation if we want to change this)
I ran into this today with a dataset. In my case, I wanted a merge with an outer join, but I saw the same NaN-matching behavior.
My workaround was merging data frames like this (adapted to match the example above):
```python
data = pd.merge(df1[df1['b'].notnull()],
                df2[df2['d'].notnull()],
                how='outer', left_on='b', right_on='d')
data = pd.concat([data, df1[df1['b'].isnull()],
                  df2[df2['d'].isnull()]],
                 ignore_index=True, sort=False)
```
I didn't know that merge would match NaNs, and I expected to get the output from merge with how='outer' that I got with this code. Fortunately, it was easy to spot (tons of repeated data), but it took a bit to understand and chase down.
Anyway, +1 on not matching NaNs, and maybe this snippet will be helpful if someone happens by who needs to do the same thing I was doing.
I kind of feel like pd.merge should behave like a SQL join: None and np.nan would be interpreted as null, and any equality comparison involving a null value would itself evaluate to null (and hence produce no match).
I agree that this behavior is unexpected. Since np.nan == np.nan evaluates to False, these rows shouldn't be included in the merged DataFrame.
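A minimal sketch of the inconsistency being described here: the scalar comparison says NaN never equals NaN, yet merge treats NaN keys as matching (behavior as reported in this thread, still observable in recent pandas releases):

```python
import numpy as np
import pandas as pd

# Scalar comparison: NaN is not equal to itself (IEEE 754 semantics).
print(np.nan == np.nan)   # False

# Yet merge treats NaN join keys as equal to each other.
left = pd.DataFrame({'k': [np.nan], 'x': [1]})
right = pd.DataFrame({'k': [np.nan], 'y': [2]})
merged = left.merge(right, on='k', how='inner')
print(len(merged))  # the single NaN key on each side produces a match
```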