Julia: Coalesce for mixed nothing/missing: bug and design issue

Created on 28 Apr 2018  Â·  23Comments  Â·  Source: JuliaLang/julia

First let me report a bug:

julia> coalesce.([missing, nothing, Some(nothing), Some(missing)], "replacement")
ERROR: MethodError: convert(::Type{Union{Missing, Nothing, Some{Union{Missing, Nothing}}}}, ::Some{Nothing}) is ambiguous. Candidates:
  convert(::Type{Union{Missing, T}}, x) where T in Base at missing.jl:39
  convert(::Type{Union{Nothing, T}}, x) where T in Base at some.jl:21
Possible fix, define
  convert(::Type{Union{Missing, Nothing}}, ::Any)

But I have tracked it because I want to get back to #26661 and I have second thoughts about the fact that coalesce mixes handling missing and Some/nothing. It would help me if I understood why (maybe e.g. @nalimilan could explain or give a reference to an earlier discussion - I could not find it unfortunately).

The problem I have is that I feel that current coalesce is not very useful for handling missing values because coalesce(Some(missing), "replacement") returns missing, so e.g. this pattern to get rid of missings in a vector does not work (it worked in old Missings.jl package): coalesce.([Some(missing), "a", "b", missing], "replacement").

My first reaction is to have separate function for missing and separate for Some/nothing but maybe there is some good reason to mix it.

design missing data

Most helpful comment

To give a concrete example of the first-argument-as-sentinel problem, suppose both f() and g() return nothing to indicate no value but can otherwise sometimes return missing (say because they're functions that operate on data frames or whatever). Then you can have code like this:

coalesce(f(), g(), 0)

If f() returns missing as a _value_ and g() returns nothing to indicate "no value", then you would get nothing when the intended behavior would be to missing. You could argue that the call should be coalesce(nothing, f(), g(), 0), but it's far too easy for people not to do that and if that's a known trap, it would be better to just always require the caller to explicitly indicate which sentinel they want. Note that the "nothing is more missing than missing" approach would not have this problem—missing would be returned as intended.

All 23 comments

Good catch about the convert bug. It's a pretty unlikely situation but we should handle it correctly. I'm not sure how to fix that though. These ::Type{Union{...}} methods are terrible to manage. FWIW, defining the following function makes Julia crash when creating the array:

 Base.convert(::Type{Union{Missing, Nothing, T}}, x::Any) where {T} = convert(T, x)

Regarding the behavior of coalesce, the main reason it handles both missing, nothing and Some is that it's useful for all three cases, and it's not clear how we would name separate functions if we distinguished these behaviors. In practice I'm not sure it's a problem. What's the issue with coalesce(Some(missing), "replacement") returning missing? Some(missing) is definitely not a missing value, it's precisely a missing value protected by the Some wrapper to be able to distinguish it from standard missing values. Anyway I doubt actual use cases would mix missing and Some like that.

The problem I have is that I would not like to have Some(missing) unwrapped when I use coalesce in "remove missing" mode. In particular, if I understand it correctly Some is designed to work with Nothing (and this is how it is documented). Unwrapping Some(missing) is a side effect only of the fact that now coalesce works on both missing and nothing.

I agree that the situation that we mix missing with nothing/Some will be rare. In the worst case we can write a clear documentation warning not to mix them, but I feel it is not clean. I would prefer two functions. One handling missing (it could have another name e.g. fuse) only and the other handling nothing/Some (this could keep coalesce name).

Alternatively coalesce could get a keyword argument mode with three traits:

  • missing (only works with missing);
  • nothing (only works with nothing and unwraps Some);
  • some third option (default - this would be current behavior).
    (or some other similar solution)

There's nothing special about Some(missing), it's just like Some(1). AFAICT the question is rather whether coalesce should replace missing given that it also replaces nothing and unwraps Some. We could obviously allow tweaking the behavior with a keyword argument as you suggest, but that would be really verbose, when the point of coalesce is to be short and convenient.

coalesce is the standard SQL (and dplyr) term for this operation, so I'd rather not use another one like fuse. unwrap would be a good term, but it only applies to Union{Some{T}, Nothing}, not to Union{T, Nothing}.

I guess one possibility would be to introduce ?? as the null-coalescing operator, and reserve coalesce to missing. That would be consistent with C# and Swift (for ??), and with SQL and dplyr (for coalesce).

I've filed https://github.com/JuliaLang/julia/issues/26929 for the crash mentioned above.

AFAICT the question is rather whether coalesce should replace missing given that it also replaces nothing and unwraps Some.

This is exactly my point. I am flexible to whatever choice is made - given it is properly documented 😄 (I agree that Some(missing) is totally normal and should be unwrapped if unwrapping is wanted).

If we wanted to use R as a reference you have that dplyr uses coalesce for NA BUT NOT FOR NULL (which is an equivalent of nothing in Julia). And in SQL NULL is missing in Julia. So based on these terms I would argue that current coalesce will actually confuse people coming from R/SQL (as it will not ensure invariants they expect - namely that e.g. coalesce(x, 1) is guaranteed not to be missing no matter what x is).

I am pushing this so much now as for me this function is an important part of data science targeted ecosystem in base Julia so I would prefer to have a well thought of API here in Julia 1.0.

To show a situation that bothers me this is a simplified situation I had today:

julia> x = ["a", "b", missing]
3-element Array{Union{Missing, String},1}:
 "a"
 "b"
 missing

julia> y = [ismissing(v) ? missing : findfirst("a", v) for v in x]
3-element Array{Any,1}:
 1:1
 nothing
 missing

and now you cannot use coalesce meaningfully on such a vector.

My point is that for interactive use current coalesce is probably not a problem (as user will know what and why is being replaced) but for production code (e.g. writing a general purpose package) it is very easy to forget some use-case of coalesce and end up with a hard to catch bug in the code. Consider eg. some NLP processing package that would do processing similar to what I have given on the top.

I think coalesce should handle exactly one value specially; a different value requires a different function.

@JeffBezanson So how about adding ?? as the nothing-coalescing operator? See https://github.com/JuliaLang/julia/issues/26303.

@vtjnash has proposed this patch:

diff --git a/base/some.jl b/base/some.jl
index f75a493d91..6b9e1c84b6 100644
--- a/base/some.jl
+++ b/base/some.jl
@@ -64,8 +64,8 @@ function coalesce end

 coalesce(x::Any) = x
 coalesce(x::Some) = x.value
-coalesce(x::Nothing) = nothing
-coalesce(x::Missing) = missing
+coalesce(x::Nothing) = error("no default value")
+coalesce(x::Missing) = error("no default value")

Basically, the idea is that coalesce can never return nothing or missing unless it was wrapped in Some. That effectively forces you to supply a "default value", e.g. coalesce(x, y, Some(missing)).

Another proposal in addition:
Have coalesce decide which is the sentinel value based on the first argument, e.g.

coalesce(nothing, nothing, 1) = 1
coalesce(nothing, missing, 1) = missing
coalesce(missing, missing, 1) = 1
coalesce(missing, nothing, 1) = nothing

That way if you have, just pure mixtures of either nothing/missing or a value, that'll work just
fine. And you if you can have both, you can use the first argument as the sentinel:

coalesce(nothing, f(x), g(x), 1) # Will use `nothing` as the sentinel value
coalesce(missing, f(x), g(x), 1) # Will use `missing` as the sentinel value

Additionally @omus had a proposal to have a version of coalesce that takes a predicate for which values to consider missing, but that seems like a separate generic function.

Throwing an error when the result would be missing or nothing could make sense, but AFAICT it wouldn't fix the difficulty that both missing and nothing values are replaced when they are mixed. @Keno's proposal to specify the sentinel value as the first argument would be more to the point. To be clear: we would retain the current behavior, but allow a more restrictive one when passing missing or nothing as the first value?

I have two additional comments/questions:

  1. If we want to retain current functionality that coalesce unwraps Some(x) in missing-mode then I would document it in Some docstring (as now the documentation is unequivocal that Some is intended to be used with Nothing).
  2. @Keno's proposal highlights my earlier point - in general I agree with it, but the question is:

    • how does it differ from defining two functions (I am not trying for the best names - but to be clear what I mean) coalesce_missing and coalesce_nothing (i.e. - putting the sentinel into function name)

    • if we would use this approach what would be the use case of coalesce without sentinel (would anyone actually want this functionality? - of course if we disregard that this saves typing)

The idea of my proposal is that the sentinel value is optional. That way if you have a bunch of functions f,g,h that return either a value (other than missing) or nothing, coalesce(f(), g(), h()) would just work.

@Keno: yes - I understand this. But my question is: does there exist a realistic use case, when you use coalesce where you may get nothing or missing in the process and you want to skip over BOTH (effectively not knowing which one did happen).

This came up on the triage call, and I think the consensus was to create a separate function what takes an explicit predicate for this case instead. (Or just use first(filter(x->x === nothing || x === missing, args)))).

With https://github.com/JuliaLang/julia/pull/27157, coalesce throws an error when all arguments are nothing/missing. This strict behavior can always be made more permissive in 1.x since it wouldn't be breaking.

I've thought a bit more about how to handle the case where missing and nothing would be mixed, and it's not really possible to choose the sentinel value based on the first argument, unless you always require specifying it: if you don't know whether the first value is a "normal" value or whether it's supposed to indicate that the behavior should be strict, you can't change the behavior based on it. But coalesce(Missing, x, y...) or coalesce(ismissing, x, y...) would be fine.

I've thought a bit more about how to handle the case where missing and nothing would be mixed, and it's not really possible to choose the sentinel value based on the first argument, unless you always require specifying it: if you don't know whether the first value is a "normal" value or whether it's supposed to indicate that the behavior should be strict, you can't change the behavior based on it.

Yes, that is part of the design. The idea would be that the most common case is that function would return either a sentinel or a value guaranteed not to be either sentinel (e.g. Union{Missing, Int}). It would be odd for a function to return either sentinel or a value. If a function does that, I agree the behavior is confusing, but the though was to make things as simple as possible for cases where the user knows things to be unambiguous (i.e. not requiring a first argument for the most common cases).

OK, got it. Basically we just need to add methods to handle cases where nothing and missing follow one another. I've added a commit to https://github.com/JuliaLang/julia/pull/27157, please tell me what you think. The main issue is that I find it hard to describe in simple terms.

During triage, @JeffBezanson and I violently objected to coalesce's unwrapping behavior. There was some appeal to the approach of letting the first argument determine the sentinel value, but also concern that one could accidentally get missing as a value and have it be mistakenly taken as a sentinel (which would only be a problem if followed by a nothing). @JeffBezanson proposed the notion of making nothing "more missing" than missing—i.e. this sort of thing:

coalesce(nothing, 1) === 1
coalesce(missing, 1) === 1
coalesce(nothing, missing) === missing
coalesce(missing, nothing) === missing

We didn't come to a conclusion but there was significant support for separating the unwrapping behavior of coalesce out into an unwrap function and doing this sort of thing instead of auto-unwrapping:

identity_or_nothing(x) = rand(Bool) ? Some(x) : nothing

unwrap(coalesce(identity_or_nothing(x), Some(0)))

The other approach that was discussed was to go back to having two separate generic function. One that unwraps Some, skips nothing and errors if nothing is found and one that skips missing.

To give a concrete example of the first-argument-as-sentinel problem, suppose both f() and g() return nothing to indicate no value but can otherwise sometimes return missing (say because they're functions that operate on data frames or whatever). Then you can have code like this:

coalesce(f(), g(), 0)

If f() returns missing as a _value_ and g() returns nothing to indicate "no value", then you would get nothing when the intended behavior would be to missing. You could argue that the call should be coalesce(nothing, f(), g(), 0), but it's far too easy for people not to do that and if that's a known trap, it would be better to just always require the caller to explicitly indicate which sentinel they want. Note that the "nothing is more missing than missing" approach would not have this problem—missing would be returned as intended.

I think the key is to first be clear about what domain the function operates on. Currently there are two issues there:

  1. coalesce considers nothing and missing equivalently as non-values, which doesn't seem universally right.
  2. It unwraps Some, but it's not clear whether Some should be used alongside missing.

I think we should define two domains, and have a function for each:

  1. Union{Missing, T} and coalesce: Returns the first non-missing value, propagates missing if all values are missing. Propagation makes sense since that's how missing usually behaves.
  2. Union{Nothing, Some} and unwrap: Returns the first non-nothing value and unwraps it if it's a Some. Throws an error if every argument is nothing. To be really strict we could require non-nothing arguments to be Some, but that's probably not necessary. unwrap also might not be the best name for this but I don't have any other ideas yet.

If that proposal isn't appealing, I could also tolerate the following:

  1. Have coalesce prefer to return missing over nothing (regardless of argument order), and remove the Some-unwrapping behavior.
  2. Add unwrap(x::Some) for unwrapping.

How about calling that unwrap function something? As in it's guaranteed to return something. It also makes it unsurprising that it has to do with both nothing and Some.

I think we should define two domains, and have a function for each

This is exactly what I think is best from the very start.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

arshpreetsingh picture arshpreetsingh  Â·  3Comments

StefanKarpinski picture StefanKarpinski  Â·  3Comments

Keno picture Keno  Â·  3Comments

yurivish picture yurivish  Â·  3Comments

sbromberger picture sbromberger  Â·  3Comments