```python
pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]}).merge(
    pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]}),
    how='left', left_on='b', right_on='d')
```
df1:
  | a | b
-- | -- | --
0 | 1 | 4.0
1 | 2 | 5.0
2 | 3 | NaN
df2:
  | c | d
-- | -- | --
0 | 6 | 4.0
1 | 7 | NaN
2 | 8 | NaN
3 | 9 | 5.0
Current output:
  | a | b | c | d
-- | -- | -- | -- | --
0 | 1 | 4.0 | 6 | 4.0
1 | 2 | 5.0 | 9 | 5.0
2 | 3 | NaN | 7 | NaN
3 | 3 | NaN | 8 | NaN
Expected Output
  | a | b | c | d
-- | -- | -- | -- | --
0 | 1 | 4.0 | 6 | 4.0
1 | 2 | 5.0 | 9 | 5.0
What's happening is that the NaN in df1.b is matching the NaNs in df2.d.
I don't see a situation in which this would be desirable behavior, but if such a situation exists, surely the opposite is also conceivable, and so there should be a documented option on DataFrame.merge that controls it.
What do you think?
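For reference, the behavior above can be reproduced with a self-contained script, using df1 and df2 exactly as shown in the tables (this reflects the reported behavior on 0.23.4, and recent pandas releases still match NaN keys the same way):

```python
import numpy as np
import pandas as pd

# df1 and df2 from the example above
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]})
df2 = pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]})

result = df1.merge(df2, how='left', left_on='b', right_on='d')
print(result)

# The NaN in df1.b matches both NaN rows in df2.d, so the left
# join yields 4 rows instead of the expected 3.
```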
pd.show_versions():
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I was able to solve my problem with
```python
df1.merge(df2.dropna(subset=['d']), how='left', left_on='b', right_on='d')
```
Nevertheless, I still feel like this shouldn't be the default behavior for DataFrame.merge.
What do you think?
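For completeness, a runnable version of that workaround (df1 and df2 as defined earlier in the thread) gives the expected 3-row result:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]})
df2 = pd.DataFrame({'c': [6, 7, 8, 9], 'd': [4, np.nan, np.nan, 5]})

# Dropping NaN keys from the right side before merging prevents the
# spurious NaN-to-NaN matches; the NaN row in df1 simply gets no match.
result = df1.merge(df2.dropna(subset=['d']), how='left',
                   left_on='b', right_on='d')
print(result)
```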
Hmm yea I don't think the NA values should be producing a match here - @TomAugspurger any thoughts?
Agreed, I also would not expect NAs to match here.
It might be good to explore a bit if we have always been doing that, and if we do this consistently within pandas (in which case we should certainly do some kind of deprecation if we want to change this)
I ran into this today with a dataset. In my case, I wanted a merge with an outer join, but I saw the same NaN-matching behavior.
My workaround was merging data frames like this (adapted to match the example above):
```python
data = pd.merge(df1[df1['b'].notnull()],
                df2[df2['d'].notnull()],
                how='outer', left_on='b', right_on='d')
data = pd.concat([data, df1[df1['b'].isnull()],
                  df2[df2['d'].isnull()]],
                 ignore_index=True, sort=False)
```
I didn't know that merge would match NaNs, and I expected to get the output from merge with how='outer' that I got with this code. Fortunately, it was easy to spot (tons of repeated data), but it took a bit to understand and chase down.
Anyway, +1 on not matching NaNs, and maybe this snippet will be helpful if someone happens by who needs to do the same thing I was doing.
I kind of feel like pd.merge should behave like a SQL join: None and np.nan would be interpreted as null, and any equality comparison involving a null value would itself evaluate to null (and hence produce no match).
I agree that this behavior is unexpected. Since np.nan == np.nan evaluates to False, these rows shouldn't be included in the merged DataFrame.
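A minimal sketch of the inconsistency being described here: the scalar comparison says NaN never equals NaN, yet merge treats NaN keys as matching (behavior as reported in this thread, still observable in recent pandas releases):

```python
import numpy as np
import pandas as pd

# Scalar comparison: NaN is not equal to itself (IEEE 754 semantics).
print(np.nan == np.nan)   # False

# Yet merge treats NaN join keys as equal to each other.
left = pd.DataFrame({'k': [np.nan], 'x': [1]})
right = pd.DataFrame({'k': [np.nan], 'y': [2]})
merged = left.merge(right, on='k', how='inner')
print(len(merged))  # the single NaN key on each side produces a match
```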