First let me report a bug:
julia> coalesce.([missing, nothing, Some(nothing), Some(missing)], "replacement")
ERROR: MethodError: convert(::Type{Union{Missing, Nothing, Some{Union{Missing, Nothing}}}}, ::Some{Nothing}) is ambiguous. Candidates:
convert(::Type{Union{Missing, T}}, x) where T in Base at missing.jl:39
convert(::Type{Union{Nothing, T}}, x) where T in Base at some.jl:21
Possible fix, define
convert(::Type{Union{Missing, Nothing}}, ::Any)
But I have tracked it because I want to get back to #26661 and I have second thoughts about the fact that coalesce
mixes handling missing
and Some
/nothing
. It would help me if I understood why (maybe e.g. @nalimilan could explain or give a reference to an earlier discussion - I could not find it unfortunately).
The problem I have is that I feel that current coalesce
is not very useful for handling missing values because coalesce(Some(missing), "replacement")
returns missing
, so e.g. this pattern to get rid of missings in a vector does not work (it worked in old Missings.jl package): coalesce.([Some(missing), "a", "b", missing], "replacement")
.
My first reaction is to have separate function for missing
and separate for Some/nothing
but maybe there is some good reason to mix it.
Good catch about the convert
bug. It's a pretty unlikely situation but we should handle it correctly. I'm not sure how to fix that though. These ::Type{Union{...}}
methods are terrible to manage. FWIW, defining the following function makes Julia crash when creating the array:
Base.convert(::Type{Union{Missing, Nothing, T}}, x::Any) where {T} = convert(T, x)
Regarding the behavior of coalesce
, the main reason it handles both missing
, nothing
and Some
is that it's useful for all three cases, and it's not clear how we would name separate functions if we distinguished these behaviors. In practice I'm not sure it's a problem. What's the issue with coalesce(Some(missing), "replacement")
returning missing
? Some(missing)
is definitely not a missing value, it's precisely a missing value protected by the Some
wrapper to be able to distinguish it from standard missing values. Anyway I doubt actual use cases would mix missing
and Some
like that.
The problem I have is that I would not like to have Some(missing)
unwrapped when I use coalesce in "remove missing" mode. In particular, if I understand it correctly Some
is designed to work with Nothing
(and this is how it is documented). Unwrapping Some(missing)
is a side effect only of the fact that now coalesce
works on both missing
and nothing
.
I agree that the situation that we mix missing
with nothing/Some
will be rare. In the worst case we can write a clear documentation warning not to mix them, but I feel it is not clean. I would prefer two functions. One handling missing
(it could have another name e.g. fuse
) only and the other handling nothing
/Some
(this could keep coalesce
name).
Alternatively coalesce
could get a keyword argument mode
with three traits:
missing
(only works with missing
);nothing
(only works with nothing
and unwraps Some
);There's nothing special about Some(missing)
, it's just like Some(1)
. AFAICT the question is rather whether coalesce
should replace missing
given that it also replaces nothing
and unwraps Some
. We could obviously allow tweaking the behavior with a keyword argument as you suggest, but that would be really verbose, when the point of coalesce
is to be short and convenient.
coalesce
is the standard SQL (and dplyr) term for this operation, so I'd rather not use another one like fuse
. unwrap
would be a good term, but it only applies to Union{Some{T}, Nothing}
, not to Union{T, Nothing}
.
I guess one possibility would be to introduce ??
as the null-coalescing operator, and reserve coalesce
to missing
. That would be consistent with C# and Swift (for ??
), and with SQL and dplyr (for coalesce
).
I've filed https://github.com/JuliaLang/julia/issues/26929 for the crash mentioned above.
AFAICT the question is rather whether
coalesce
should replacemissing
given that it also replacesnothing
and unwrapsSome
.
This is exactly my point. I am flexible to whatever choice is made - given it is properly documented 😄 (I agree that Some(missing)
is totally normal and should be unwrapped if unwrapping is wanted).
If we wanted to use R as a reference you have that dplyr
uses coalesce
for NA
BUT NOT FOR NULL
(which is an equivalent of nothing
in Julia). And in SQL NULL
is missing
in Julia. So based on these terms I would argue that current coalesce
will actually confuse people coming from R/SQL (as it will not ensure invariants they expect - namely that e.g. coalesce(x, 1)
is guaranteed not to be missing
no matter what x
is).
I am pushing this so much now as for me this function is an important part of data science targeted ecosystem in base Julia so I would prefer to have a well thought of API here in Julia 1.0.
To show a situation that bothers me this is a simplified situation I had today:
julia> x = ["a", "b", missing]
3-element Array{Union{Missing, String},1}:
"a"
"b"
missing
julia> y = [ismissing(v) ? missing : findfirst("a", v) for v in x]
3-element Array{Any,1}:
1:1
nothing
missing
and now you cannot use coalesce
meaningfully on such a vector.
My point is that for interactive use current coalesce
is probably not a problem (as user will know what and why is being replaced) but for production code (e.g. writing a general purpose package) it is very easy to forget some use-case of coalesce
and end up with a hard to catch bug in the code. Consider eg. some NLP processing package that would do processing similar to what I have given on the top.
I think coalesce should handle exactly one value specially; a different value requires a different function.
@JeffBezanson So how about adding ??
as the nothing
-coalescing operator? See https://github.com/JuliaLang/julia/issues/26303.
@vtjnash has proposed this patch:
diff --git a/base/some.jl b/base/some.jl
index f75a493d91..6b9e1c84b6 100644
--- a/base/some.jl
+++ b/base/some.jl
@@ -64,8 +64,8 @@ function coalesce end
coalesce(x::Any) = x
coalesce(x::Some) = x.value
-coalesce(x::Nothing) = nothing
-coalesce(x::Missing) = missing
+coalesce(x::Nothing) = error("no default value")
+coalesce(x::Missing) = error("no default value")
Basically, the idea is that coalesce
can never return nothing
or missing
unless it was wrapped in Some
. That effectively forces you to supply a "default value", e.g. coalesce(x, y, Some(missing))
.
Another proposal in addition:
Have coalesce decide which is the sentinel value based on the first argument, e.g.
coalesce(nothing, nothing, 1) = 1
coalesce(nothing, missing, 1) = missing
coalesce(missing, missing, 1) = 1
coalesce(missing, nothing, 1) = nothing
That way if you have, just pure mixtures of either nothing/missing or a value, that'll work just
fine. And you if you can have both, you can use the first argument as the sentinel:
coalesce(nothing, f(x), g(x), 1) # Will use `nothing` as the sentinel value
coalesce(missing, f(x), g(x), 1) # Will use `missing` as the sentinel value
Additionally @omus had a proposal to have a version of coalesce
that takes a predicate for which values to consider missing, but that seems like a separate generic function.
Throwing an error when the result would be missing
or nothing
could make sense, but AFAICT it wouldn't fix the difficulty that both missing
and nothing
values are replaced when they are mixed. @Keno's proposal to specify the sentinel value as the first argument would be more to the point. To be clear: we would retain the current behavior, but allow a more restrictive one when passing missing
or nothing
as the first value?
I have two additional comments/questions:
coalesce
unwraps Some(x)
in missing
-mode then I would document it in Some
docstring (as now the documentation is unequivocal that Some
is intended to be used with Nothing
).coalesce_missing
and coalesce_nothing
(i.e. - putting the sentinel into function name)coalesce
without sentinel (would anyone actually want this functionality? - of course if we disregard that this saves typing)The idea of my proposal is that the sentinel value is optional. That way if you have a bunch of functions f
,g
,h
that return either a value (other than missing
) or nothing
, coalesce(f(), g(), h())
would just work.
@Keno: yes - I understand this. But my question is: does there exist a realistic use case, when you use coalesce
where you may get nothing
or missing
in the process and you want to skip over BOTH (effectively not knowing which one did happen).
This came up on the triage call, and I think the consensus was to create a separate function what takes an explicit predicate for this case instead. (Or just use first(filter(x->x === nothing || x === missing, args)))
).
With https://github.com/JuliaLang/julia/pull/27157, coalesce
throws an error when all arguments are nothing
/missing
. This strict behavior can always be made more permissive in 1.x since it wouldn't be breaking.
I've thought a bit more about how to handle the case where missing
and nothing
would be mixed, and it's not really possible to choose the sentinel value based on the first argument, unless you always require specifying it: if you don't know whether the first value is a "normal" value or whether it's supposed to indicate that the behavior should be strict, you can't change the behavior based on it. But coalesce(Missing, x, y...)
or coalesce(ismissing, x, y...)
would be fine.
I've thought a bit more about how to handle the case where missing and nothing would be mixed, and it's not really possible to choose the sentinel value based on the first argument, unless you always require specifying it: if you don't know whether the first value is a "normal" value or whether it's supposed to indicate that the behavior should be strict, you can't change the behavior based on it.
Yes, that is part of the design. The idea would be that the most common case is that function would return either a sentinel or a value guaranteed not to be either sentinel (e.g. Union{Missing, Int}). It would be odd for a function to return either sentinel or a value. If a function does that, I agree the behavior is confusing, but the though was to make things as simple as possible for cases where the user knows things to be unambiguous (i.e. not requiring a first argument for the most common cases).
OK, got it. Basically we just need to add methods to handle cases where nothing
and missing
follow one another. I've added a commit to https://github.com/JuliaLang/julia/pull/27157, please tell me what you think. The main issue is that I find it hard to describe in simple terms.
During triage, @JeffBezanson and I violently objected to coalesce
's unwrapping behavior. There was some appeal to the approach of letting the first argument determine the sentinel value, but also concern that one could accidentally get missing
as a value and have it be mistakenly taken as a sentinel (which would only be a problem if followed by a nothing
). @JeffBezanson proposed the notion of making nothing
"more missing" than missing
—i.e. this sort of thing:
coalesce(nothing, 1) === 1
coalesce(missing, 1) === 1
coalesce(nothing, missing) === missing
coalesce(missing, nothing) === missing
We didn't come to a conclusion but there was significant support for separating the unwrapping behavior of coalesce
out into an unwrap
function and doing this sort of thing instead of auto-unwrapping:
identity_or_nothing(x) = rand(Bool) ? Some(x) : nothing
unwrap(coalesce(identity_or_nothing(x), Some(0)))
The other approach that was discussed was to go back to having two separate generic function. One that unwraps Some
, skips nothing
and errors if nothing is found and one that skips missing
.
To give a concrete example of the first-argument-as-sentinel problem, suppose both f()
and g()
return nothing
to indicate no value but can otherwise sometimes return missing
(say because they're functions that operate on data frames or whatever). Then you can have code like this:
coalesce(f(), g(), 0)
If f()
returns missing
as a _value_ and g()
returns nothing
to indicate "no value", then you would get nothing
when the intended behavior would be to missing
. You could argue that the call should be coalesce(nothing, f(), g(), 0)
, but it's far too easy for people not to do that and if that's a known trap, it would be better to just always require the caller to explicitly indicate which sentinel they want. Note that the "nothing
is more missing than missing
" approach would not have this problem—missing
would be returned as intended.
I think the key is to first be clear about what domain the function operates on. Currently there are two issues there:
coalesce
considers nothing
and missing
equivalently as non-values, which doesn't seem universally right.Some
, but it's not clear whether Some
should be used alongside missing
.I think we should define two domains, and have a function for each:
Union{Missing, T}
and coalesce
: Returns the first non-missing
value, propagates missing
if all values are missing
. Propagation makes sense since that's how missing
usually behaves.Union{Nothing, Some}
and unwrap
: Returns the first non-nothing
value and unwraps it if it's a Some
. Throws an error if every argument is nothing
. To be really strict we could require non-nothing arguments to be Some
, but that's probably not necessary. unwrap
also might not be the best name for this but I don't have any other ideas yet.If that proposal isn't appealing, I could also tolerate the following:
coalesce
prefer to return missing
over nothing
(regardless of argument order), and remove the Some-unwrapping behavior.unwrap(x::Some)
for unwrapping.How about calling that unwrap
function something
? As in it's guaranteed to return something. It also makes it unsurprising that it has to do with both nothing
and Some
.
I think we should define two domains, and have a function for each
This is exactly what I think is best from the very start.
Most helpful comment
To give a concrete example of the first-argument-as-sentinel problem, suppose both
f()
andg()
returnnothing
to indicate no value but can otherwise sometimes returnmissing
(say because they're functions that operate on data frames or whatever). Then you can have code like this:If
f()
returnsmissing
as a _value_ andg()
returnsnothing
to indicate "no value", then you would getnothing
when the intended behavior would be tomissing
. You could argue that the call should becoalesce(nothing, f(), g(), 0)
, but it's far too easy for people not to do that and if that's a known trap, it would be better to just always require the caller to explicitly indicate which sentinel they want. Note that the "nothing
is more missing thanmissing
" approach would not have this problem—missing
would be returned as intended.