Dataframes.jl: Explicitly handling missingness in join columns

Created on 24 Oct 2020 · 31Comments · Source: JuliaData/DataFrames.jl

Julia has a great philosophy of taking missingness seriously. For example, unlike in Pandas or Postgres, sum([1,2,missing]) gives missing. However, this philosophy hasn't yet been applied to all of the functions in the JuliaData ecosystem. I'll give an example to illustrate.

My goal is to find the _relationship between age and salary_. To find out, I will combine observations from two datasets.

One with age,

DataFrame([(name="John", employer=missing, age=40)])

and one with salary.

DataFrame([(name="John", employer="Julia Computing", salary=9999)])

In a complete-data environment, if these observations correspond to the same person, I want one row in the joined dataframe; on the other hand, if they correspond to different people, I want zero rows in the join.

Given the missingness in the "employer" column, we don't know if that's the same person or not. So when I join them on ("name", "employer"), we _cannot know the right answer_. Yet

innerjoin(
    DataFrame([(name="John", employer=missing, age=40)]),
    DataFrame([(name="John", employer="Julia Computing", salary=9999)]), 
    on=[:name, :employer]
)

makes a decision implicitly, returning an empty dataframe. If the observations correspond to the same person, this result -- failing to match the two observations -- is a _false negative_.

To avoid drawing mistaken conclusions from analysis, I would like to extend the practice of enforcing explicit handling of missing values. So I would like to get an error message in this case by default, and to actively tell innerjoin() how it should handle missing values.

The passmissing() and skipmissing() patterns used elsewhere in JuliaData are a great reassurance that Julia is looking out for missing data problems. When applied to joins, I would like to consider:

In an outer join, should rows with missingness all be dropped from the output or all be kept in?
In an inner join, should they be considered matches or nonmatches?
What if there is missingness on both sides or one side only?

I'm not sure if something like passmissing(innerjoin)(a,b) would work, or if it should be more like innerjoin(a,b, missingrule=:drop) or something else. But I do want to start the conversation about it.

https://github.com/JuliaData/DataFrames.jl/issues/2243 has some discussion of missingness and joins, mainly focused on the a.fillna(b) use case.

breaking feature

Source

adkabo

Most helpful comment

Given how simple it is to support the current behavior using a keyword argument, it would be quite extreme to completely stop supporting it.

nalimilan on 25 Oct 2020

👍2

All 31 comments

Thank you for commenting on it. The API is that by default joins are performed using isequal test (technically matching hash values). This way we can make joins fast. But the consequence is that missing is treated just as a separate value. Note that this is the same that we do in groupby (although technically you could say that if we see missing we do not know to which group it should fall into).

However, I understand this request and we can add new functionality in which you could opt-in to handle missings in a special way (indeed what you write makes sense in general). I add it as a feature to add in the future (probably post 1.0 release, as we would implement it in a non-breaking way). In the mean-time in this PR a discussion about the preferred API for this can be continued.

bkamins on 24 Oct 2020

The API is that by default joins are performed using isequal test (technically matching hash values). This way we can make joins fast.

I like that joins are fast. I want to add a dispatch for joining columns of type Array{Union{T,Missing}} that raises an error unless missingness is explicitly handled. This wouldn't add any runtime overhead for joining complete data columns, nor for joining nullable columns when the missingness behavior is specified.

you could say that if we see missing we do not know to which group it should fall into

That's a great point that merits more thinking. If missing is a separate group identifier, that is helpful relative to silently dropping the values, but it has the side effect of excluding observations from the groups where they really belong -- which could be nonuniform.

I add it as a feature to add in the future (probably post 1.0 release, as we would implement it in a non-breaking way)

I'm glad the stability of a 1.0 release is taken seriously. That said, the primary goal of this issue is to intentionally break unsafe code by raising an error -- which would ideally happen before 1.0. If more leniency is desired after that release, it's easy to switch to a more permissive model without breaking code.

adkabo on 24 Oct 2020

I want to add a dispatch for joining columns of type Array{Union{T,Missing}}

The problem is that such columns might not contain missing (we know only eltypenot the contents). With missing we normally do not throw errors based on type only on value. (having said that any(ismissing, col) is relatively cheap so we could do this)

the primary goal of this issue is to intentionally break unsafe code by raising an error -- which would ideally happen before 1.0.

I understand this. In general we have a tension between two completely opposing points of view:

some users (like you) want DataFrames.jl to be strict.
some users want exactly the opposite (and make it do the operation that is most commonly right by default)

E.g. for groupby we have skipmissing kwarg, but it is false by default.

In general - while I perfectly understand your reasoning for the above reasons this is not a simple decision (if we wanted to go this way we should consult the community first and get the consensus). The additional point is that the current behavior does not lead to irreversible data corruption - you just get more results than you would get with a strict approach that are easy to be filtered out later. Also - as commented - the most consistent approach with groupby would be to have skipmissing kwarg but set to false by default (and if we went this way this change is non breaking so it becomes just a feature request that can be implemented later).

In general this request is kind of similar to asking Dict to throw an error when someone wants to put a missing value in it as a key. As I have said - I perfectly understand the logic behind wanting this (the true value behind missing might or might not be the same as the already existing key, and if you get a second missing you do not know if it is the same or a different value) but probably people would find this approach too strict (though most consistent logically).

Looking at it from another angle the question is: how often the behavior we currently have would lead to significant problems? (leaving aside the logical purity of your proposal which I appreciate and agree with)

@nalimilan - can you please comment on what you think here.

I am tagging it for a decision before 1.0 release.

bkamins on 24 Oct 2020

The additional point is that the current behavior does not lead to irreversible data corruption - you just get more results than you would get with a strict approach that are easy to be filtered out later.

This is true in the case of groupby: a missing group appears. In the case of innerjoin(), my original example shows that it does corrupt the data by falsely failing to match rows and dropping them from the output without any indication.

In general we have a tension between two completely opposing points of view:

some users (like you) want DataFrames.jl to be strict.

some users want exactly the opposite (and make it do the operation that is most commonly right by default)

Perhaps there's a way for users to specify how strict they want to be.

using DataFrames.Strict

using DataFrames.Permissive

and using DataFrames could pick one of these or neither. Or using the "context" pattern,

ctx = DataFrames.strict()
c = ctx.innerjoin(a,b)

adkabo on 24 Oct 2020

By the way @bkamins , I want to thank you for considering this suggestion. Missingness handling is a big decision, and regardless of how DataFrames.jl ultimately decides to proceed, I'm appreciative that you're taking the time to think seriously about the varied needs of the JuliaData community.

adkabo on 24 Oct 2020

my original example shows that it does corrupt the data by falsely failing to match rows and dropping them from the output without any indication.

indeed it drops it, but there is no way to "properly fix it" without filling the missings (in your approach you would throw an error in this case but then the user gets nothing - actually this is something I do in https://github.com/JuliaData/DataFrames.jl/pull/2494 by default, but there it is less debatable).

Perhaps there's a way for users to specify how strict they want to be.

We avoid keeping global state in DataFrames.jl. Probably if we add it then it will be via kwarg. The decision to make is what we want to be the default (as if we keep the current default this PR is is not urgent, but if we decide that the default is what you propose we need to implement it now).

bkamins on 24 Oct 2020

Thanks for spotting this. Indeed the presence of missing values in the join keys seems to be more dangerous than e.g. in groupby, as it can silently drop missing values. AFAICT this problem is most problematic for inner joins: in other types of joins a row with missing will still be present in the result, so missing is propagated somehow (like with groupby).

I suspect that in practice people don't perform joins on columns with missing values very frequently, and that when they do so it's often due to a mistake. We could deprecate this in 0.22 and see whether we get complaints.

A keyword argument would have to be added to disable the deprecation warning. But there are several possible behaviors to choose: throw an error (default); treat missing as a separate value equal to itself (current behavior); treat missing as a separate value not equal to itself (equivalent to skipping missing values for inner joins); skip rows with missing. Maybe we only need to support the first two behaviors, in which case a single Boolean argument would be enough (not sure how to call it: allowmissing?). Skipping rows with missing values can be done manually using dropmissing; and treating missing as not equal to itself could be allowed later via an additional argument if it's really needed.

nalimilan on 25 Oct 2020

If we went for this I would do allowmissing with meaning:

if true: treat missing as a separate value equal to itself (current behavior)
if false (default): throw an error

I will ask for opinions on slack

bkamins on 25 Oct 2020

Do any languages support joining where missings are equal? iirc Stata treats all missings as different or throws an error.

I would be most comfortable with always throwing an error and not allowing an option to do otherwise.

pdeffebach on 25 Oct 2020

FWIW dplyr supports this and does the same as we currently do by default:

na_matches: Should ‘NA’ and ‘NaN’ values match one another?

          The default, ‘"na"’, treats two ‘NA’ or ‘NaN’ values as
          equal, like ‘%in%’, ‘match()’, ‘merge()’.

          Use ‘"never"’ to always treat two ‘NA’ or ‘NaN’ values as
          different, like joins for database sources, similarly to
          ‘merge(incomparables = FALSE)’.

(Looks like they anticipated having to support more than two options but so far didn't add them. :-)

We should also consider NaN, which could also be a source of bugs.

nalimilan on 25 Oct 2020

I just don't see how the example in the OP can do anything other than what it currently does. You are labelling it to be a false negative only because you constructed the example. Suppose instead that I hand you those two datasets, now how certain are you that those observations correspond to each other? Suppose you observe another John who worked at a different place (or the same place!). Then what? Real data is messy like this.

I think the typical way (at least, what I've encountered and what I usually do) to merge datasets when you have missing keys is a multiple step merge. You first merge based on what you know gives the best match, then from the remaining unmatched subsets you might match on different -- less good -- criteria. But you might also not do this second step. It requires assumptions and outside knowledge. This isn't something the package should do for you.

If anything like this is implemented I would want it to be off by default (put me in the strict camp I guess).

tbeason on 25 Oct 2020

I would be unfavor of doing any(ismissing, col) || error(...) right now for the 1.0 release.

Since turning errors into non-errors is not breaking.
And current behaviour is weird.
We can error for now and workout exactly how we want the future API to look later.

The deprecation is fiddly.
I think we would need to introduce a new singleton sentinel for it and replace missing a with that sentinel.

oxinabox on 25 Oct 2020

unfavor

I understand you wanted to say "in favor". Right?

This is what pandas does (essentially what we do now):

>>> df
    id  x
0  0.0  0
1  NaN  1
2  2.0  2
3  3.0  3
4  4.0  4
5  5.0  5
6  6.0  6
7  7.0  7
8  8.0  8
9  9.0  9
>>> df2
    id  y
0  0.0  0
1  1.0  1
2  NaN  2
3  3.0  3
4  4.0  4
5  5.0  5
6  6.0  6
7  7.0  7
8  8.0  8
9  9.0  9
>>> pd.merge(df, df2, on='id', how='inner')
    id  x  y
0  0.0  0  0
1  NaN  1  2
2  3.0  3  3
3  4.0  4  4
4  5.0  5  5
5  6.0  6  6
6  7.0  7  7
7  8.0  8  8
8  9.0  9  9

In summary we essentially have three options then:

throw an error (and I understand that the community preference is to be breaking and choose it as the default)
treat missings as equal (this is what we already have so we can keep it)
treat missings as distinct (this would have to be implemented and requires special handling - we could add it later)

So we would a kwarg that would most likely take Symbols as values: :error (the default), :equal, :distinct. What could be the name for it? missingmatching?

bkamins on 25 Oct 2020

As mentioned above, I don't think we should allow other options besides erroring at all. We can always add it later.

pdeffebach on 25 Oct 2020

I don't think we should allow other options besides erroring at all. We can always add it later.

But why? If we make error the default, then I think it is better to allow users to get the current behavior by using a kwarg (so that they can easily update their codes). What downsides of this approach do you see?

bkamins on 25 Oct 2020

Only that exposing a larger API means more to maintain.

I also find it hard to motivate real-world examples where matching missings would be a good idea. At the very least this is a vote against allowing option 2. I would imagine that the vast majority of uses-cases are where users would like to see an error.

pdeffebach on 25 Oct 2020

Given how simple it is to support the current behavior using a keyword argument, it would be quite extreme to completely stop supporting it.

nalimilan on 25 Oct 2020

👍2

R and pandas support this behavior (so even if it is not super consistent if someone wants to port the code from e.g. pandas to DataFrames.jl 1-2-1 it could be useful).

bkamins on 25 Oct 2020

Okay, then I think we should support the keyword arguments. :error, :equal, and :distinct seem good. matchmissing seems like a good name to me.

pdeffebach on 25 Oct 2020

@adkabo - in summary we will follow your recommendation then most probably (let us wait a few days for the feedback). Thank you for raising the issue.

bkamins on 25 Oct 2020

I guess maybe I misunderstood exactly what was on the table here? I do support the proposal if it is just essentially just about the matching-ness of missing in that either they are ignored completely, throw errors, or sort of match together.

Can someone clarify what would happen, though, in OPs example? Either of the non-erroring options still returns empty, yes?

tbeason on 25 Oct 2020

Do any languages implement the "distinct" solution? Otherwise it's probably not worth supporting that. (I couldn't find what Stata does.)

Can someone clarify what would happen, though, in OPs example? Either of the non-erroring options still returns empty, yes?

Yes.

nalimilan on 25 Oct 2020

👍1

:distinct sounds useful for with cross joins.
It also lines up with the mental model of _it could be anything_

oxinabox on 25 Oct 2020

It also lines up with the mental model of it could be anything

This is what I meant at the top of this discussion. We will not implement it now, but as it is the most logically consistent approach so I would not rule it out in the design (just like R did).

bkamins on 25 Oct 2020

Cross joins are a very special kind of join (I almost wouldn't consider it as a join). They don't need to compare values so I don't think :distinct applies there (each row is distinct). Actually crossjoin doesn't take the same arguments currently and doesn't use the same implementation already.

nalimilan on 25 Oct 2020

Another data point: in SQL inner joins, NULL is treated as distinct from all other values, which is considered a trap (see Wikipedia). However NULLs are preserved by left/right/outer/cross joins.

nalimilan on 25 Oct 2020

👍1

i think in typical relational database, columns that you use as primary keys or foreign keys should be unique, distinct, and not empty. i think we are complicating the processing by virtue that the assumption does not hold. by making the API flexible, we are allowing the mistake in data encoding to be addressed by the dataframe. some basic assumptions in relational data should be met and if they are not met, the API should not be changed because it will just make the design messy.

ppalmes on 25 Oct 2020

👍1

the data should be fixed and not the algorithm for joining tables. perhaps the join operation will spit an error if the index column has missing values because the table violates the requirement of joining indexed keys or skip those with missing values because the algorithm should not guess. keys with missing values lacks the information to relate information from one table to another.

ppalmes on 25 Oct 2020

I support throwing an error but then allow an option to allow matching.

xiaodaigh on 26 Oct 2020

I favor following SQL behavior.

Nosferican on 26 Oct 2020

👍1

See https://github.com/JuliaData/DataFrames.jl/pull/2504 for a proposed implementation (finalizing is pending approval of https://github.com/JuliaData/DataFrames.jl/pull/2503).

bkamins on 29 Oct 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Rename `tail` to avoid clash with `Base.tail`

cossio · 5Comments

⍰ character in header output

cormullion · 6Comments

split a df into parts

yakir12 · 6Comments

Problems in groupreduce_init

bkamins · 8Comments

Provide indexing feature to allow for fast sort, join, and group-by operations

xiaodaigh · 7Comments