Julia: Use ? to lift missing values

Created on 12 Jul 2020 · 35 Comments · Source: JuliaLang/julia

As discussed in https://discourse.julialang.org/t/status-of-question-mark-syntax-for-missing-values/27189 it would be really nice to have this syntax at some future release where one could type T? to indicate Union{T,Missing}. Possibly, also write mean?(xs) to automatically handle inputs with missing values.
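For concreteness, here is a minimal sketch of what the proposed T? would abbreviate in today's Julia. The alias name Opt is made up for illustration and is not part of Base. The issue does not specify what mean?(xs) would do, so the sketch only shows the current behaviour with and without skipmissing.

```julia
using Statistics

# `Opt` is a hypothetical alias standing in for the proposed `T?` syntax.
const Opt{T} = Union{T, Missing}

xs = Opt{Float64}[1.0, missing, 3.0]   # a Vector{Union{Missing, Float64}}

mean(xs)                # missing, because arithmetic on missing propagates
mean(skipmissing(xs))   # 2.0, the mean of the non-missing entries
```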

All 35 comments

I (or even we as people working with data 😄) might be opinionated, but after a longer time of working with both nothing and missing I find Union{T, Missing} to be used every day, while Union{T, Nothing} is much less frequent.

Also an especially annoying thing with Union{T, Missing} is that you have to write e.g. Vector{Union{T, Missing}} all the time, while I assume that Nothing most often comes just as Union{T, Nothing} without being an indicator of element in the collection.

In summary - I support the proposal by @juliohm - if it is possible to extend the parser in this way, of course (I have not analyzed the consequences in detail, but it seems safe, since in ? : we require a space before ?).

Also then I would raise again the idea of:

Another use of ? syntax (if we decided to go for missing with it) is that then we could print ? (possibly using a different color in REPL) instead of missing when we show tabular data (similarly to the change in the way we now show Bool in some cases).

What "lift missing values" is needs to be specified.

but after a longer time of working with both nothing and missing I find Union{T, Missing} to be used every day, while Union{T, Nothing} is much less frequent.

It's interesting because I use Union{T, Nothing} all the time, in all kinds of different orthogonal packages. Just as an example: https://github.com/KristofferC/TOMLX.jl/blob/master/src/parser.jl.

We could have syntax for both Union{T, Missing} and Union{T, Nothing}.

Maybe T? could be shorthand for Union{T, Missing}.

And T! could be shorthand for Union{T, Nothing}.

There are several RFCs for adding API based on Union{Some{T}, Nothing} (https://github.com/JuliaLang/julia/pull/34821, https://github.com/JuliaLang/julia/pull/33758). For example, it'd be nice for dict?[key] to return Union{Some{T}, Nothing}, while it makes less sense to return Union{T,Missing} because dict may contain missings.

And T! could be shorthand for Union{T, Nothing}.

T! is already a valid name so we can't use it.

We could come up with another syntax. Maybe T? for missing and T?? for nothing.

Should we make them the same, and simply change the definition to const Missing = Nothing?
[_(me) quickly ducks and runs away_]

I think it could be worthwhile to have a JuliaCon BoF topic to hammer this discussion out, since we can't rely on hallway conversations this year! And syntax would be really nice for this.

Isn't the math different?

julia> 1 + missing
missing

julia> 1 + nothing
ERROR: MethodError: no method matching +(::Int64, ::Nothing)
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:538
  +(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:86
  +(::Union{Int16, Int32, Int64, Int8}, ::BigInt) at gmp.jl:531
  ...
Stacktrace:
 [1] top-level scope at REPL[2]:1

Should we make them the same

Actually, I just made exactly the same comment on Zulip:

By the way, I think the concrete difference between missing and nothing in Julia ATM is that missing has more defined behaviors like isless(1, missing) and missing | true (other than "manual lifting" like missing + 1). Is there anything else?

I get that isless(1, missing) is super useful. But I wonder if people heavily rely on the 3VL? If not, maybe we can define isless for nothing and then merge the two kinds of missingness?

--- https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/Crazy.20thought.3A.20is.20it.20possible.20to.20merge.20nothing.20and.20missing.3F
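To make the 3VL and ordering behaviours mentioned above concrete, here is a short sketch using only Base functions. It is meant as an illustration of the current behaviour, not as a complete list of the differences between missing and nothing.

```julia
# Three-valued logic: the result is known whenever the missing operand cannot change it.
a = missing | true     # true
b = missing & false    # false
c = missing | false    # missing
d = (1 == missing)     # missing

# Ordering: isless treats missing as greater than any other value.
e = isless(1, missing)   # true
f = isless(missing, 1)   # false

# nothing has no ordering defined against numbers, so this call throws.
threw_method_error = try
    isless(1, nothing)
    false
catch err
    err isa MethodError
end
```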


Isn't the math different?

I guess you could write 1 ?+ nothing? Or, if it's tedious, we can have @? like @.. As I commented, I think a more serious problem is the 3VL and ordering.

Quoting yet another comment of mine:

By the way, another consideration of ? is what should happen with multi-argument functions. Do we want f?(a, b; c = c) to return missing/nothing if at least one of a, b, c is missing/nothing? Or, do we want to have argument-specific syntax like f(?a, b; c = ?c) (which just passes missing/nothing to f for b)?

--- https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/What.20are.20your.20thoughts.20on.20.3F.20for.20missing.3F/near/203663783

Isn't missing a much more specialized and narrow concept than nothing? The former is basically just for data scientists, while the latter is useful for anyone who writes code for any purpose.

I've never used missing even once in my life, while nothing is the default return value for any function. Isn't Union{T, Nothing} the obvious choice?

I think there is no obvious choice because people come from different backgrounds. I personally feel that ? would be a nice addition for the more "computing-oriented" users, but I acknowledge that many people use Julia for other kinds of applications that do not involve number crunching.

For a vector of objects xs it is already very convenient to write foo.(xs) and get the evaluation of foo on each object. Similarly, it would be very convenient to write foo.?(xs) and get the evaluation on each object that can possibly be missing. The result would be a vector as before but with missing entries. This can clean up a lot of computing-oriented code.

I'm processing https://www.hl7.org/ data represented as JSON. In the JSON binding of FHIR, when data is not provided for a field, the key is simply omitted; null is not used. When parsed with JSON.jl we end up with Dict structures, and nothing never appears. Since the Clinical Quality Language (CQL) semantics treat these omitted values with three-valued logic, when we process FHIR data, we use get(dict, field, missing).

Hence, the datatypes and functions we've defined for working with FHIR data use quite a bit of Union{T, Missing} since that's how the data needs to be processed. At the top of our files are OptString = Union{String, Missing}, for example. In this case, being able to write T? as a type, or function?(args..) and have them lifted to work with missing values seems useful. Having Union{T, Missing} everywhere is tedious, and hence, we use OptString and cousins; that said, OptString may be unclear to a reader, while, on the other hand, String? may be much more obvious.

At a more abstract level, I picture nothing as the bottom type, in that it signals that there is no value; while, missing is the top type, existence of the data is possible but undetermined. Trying to operate on nothing does seem like an error, while, having missing propagate seems reasonable. In this sense, I think Stefan's suggestion makes sense.

Yeah, I think these parts of @StefanKarpinski's comment are key: (emphasis mine)

However, I’ve come back around to wanting it to mean Union{T, Missing} for one very key reason: we need a way to express lifting of functions to the Union{T, Missing} domain. What I mean by that is that instead of every couple of days getting a new feature request for missing support for the atanh function or whatever, I’d like to just make atanh? lift the atanh function so that it returns missing when any of its arguments are missing. We don’t need this feature for nothing since the whole point of nothing is that it doesn’t lift, whereas the point of missing is to lift.

How do f? and T? mesh? Quite well, actually, if you view the T constructor as a function that returns values of type T—then T? is a function that returns values of type Union{T, Missing} which is exactly what function lifting is supposed to do. Of course, you still want this to be represented as the union type, Union{T, Missing}, so you just need to make sure that the type, when called, does the lifted version of T.

Reposting my comments from https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/Crazy.20thought.3A.20is.20it.20possible.20to.20merge.20nothing.20and.20missing.3F/near/203667599

I'd replace "lift" with "automatically lift"; then f? for Union{Some{T},Nothing} suddenly makes sense because it's manual (but with such a terse syntax so that we don't need the automatic one).

Reading this and Stefan's other comments elsewhere, I think his point of view is based on frustration with the proliferation of dispatches on missing in Base. This is yet another way of saying that "automatic lifting" based on dispatch does not scale. This is not news. See the contextual dispatch systems used in autodiff etc.

What I call "automatic lifting" is

julia> 1 + missing
missing

It is automatic from users' perspective. From the maintainers' perspective, it's manual, tedious, and unscalable.

There are several RFCs for adding API based on Union{Some{T}, Nothing} (#34821, #33758). For example, it'd be nice for dict?[key] to return Union{Some{T}, Nothing}, while it makes less sense to return Union{T,Missing} because dict may contain missings.

A dict may contain nothings too. I don't see any reason why someone would even store a key in a dictionary if the value is missing; indeed, with dozens of possible keys, FHIR requires you to omit the keys if the data is missing. The common access pattern is get(dict, key, missing) in my area of work. Being able to write dict?[key] as syntax sugar to get(dict, key, missing) would be quite nice. For example, if you're computing a measurement, you could write base_score + dict?[language_subscore] + .... and missing propagates.
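The access pattern described in this comment can be sketched with today's syntax as follows. The dictionary contents and variable names are made up for illustration; dict?[key] itself is only proposed syntax and does not parse.

```julia
# Hypothetical score table; the "language" key is absent rather than stored as missing.
subscores = Dict("reading" => 3, "writing" => 4)
base_score = 10

# What dict?[key] is proposed to abbreviate in this comment:
language = get(subscores, "language", missing)

total = base_score + subscores["reading"] + language
ismissing(total)   # true, the absent key propagates through +
```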

I guess this really comes down to making it nice for systems people (nothing) vs the data people (missing).

For a vector of objects xs it is already very convenient to write foo.(xs) and get the evaluation of foo on each object. Similarly, it would be very convenient to write foo.?(xs) and get the evaluation on each object that can possibly be missing. The result would be a vector as before but with missing entries. This can clean up a lot of computing-oriented code.

Wouldn't this be foo?.(xs) -- that is, you're broadcasting the lifted function; you'd want a missing returned for each missing input?

A dict may contain nothings too.

There is no ambiguity because you'd get Some(nothing).

It's interesting because I use Union{T, Nothing} all the time

@KristofferC - I agree and I also use it often in package code. My comment was about user code (especially in data science related domains). Of course this is opinionated, as I have commented. Though, I think that in package code writing Unions is not problematic, where you want to save typing most is interactive sessions. And this is a typical use case for data science work.

Now regarding fun? - it is an orthogonal thing, I think. E.g. we currently have passmissing in Missings.jl, and there was a lot of discussion about how missing should be handled there, as there is no single best option if you consider multiple args and multiple kwargs. Therefore I would be reluctant to have fun? in Base (although if we find that one expected pattern is predominant, we could adopt it).

There is no ambiguity because you'd get Some(nothing).

Sure, but if the value of a dictionary key is nothing, why should it return Some(nothing). This seems just as magical as not being able to distinguish between a key having a value of missing and a key for a particular value being absent.

Anyway, I can see how one might want dict?[key] to return nothing if the key is absent, and it's probably the more conservative approach, since you can't do much with nothing. So, if the type of the dictionary is Dict{Any, T}, then the type of the ?[] operator would be Union{Nothing, T}, yes? I think that get(dict, key, missing) works quite fine in my particular context; I don't need a fancy new syntax for that. The only odd thing would be that if T? meant Union{Missing, T}, then one might expect dict?[key] to return T?. So, yea, I can see why these two tickets are intertwined.

Sure, but if the value of a dictionary key is nothing, why should it return Some(nothing).

Because that is how you distinguish a value of nothing from no value?
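A small sketch of that distinction, written with a helper whose name (maybeget) is made up here and is not a Base function. It wraps found values in Some, so a stored nothing and an absent key give different results.

```julia
# `maybeget` is a hypothetical helper, not a Base function.
maybeget(d::AbstractDict, key) = haskey(d, key) ? Some(d[key]) : nothing

d = Dict(:a => nothing, :b => 1)

maybeget(d, :a)              # Some(nothing): the key exists and its value is nothing
maybeget(d, :c)              # nothing: the key is absent
something(maybeget(d, :b))   # 1, unwrapping the Some
```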

I just want to complain about the low quality of this issue (not the discussion, the issue itself). There is no actual proposal, and this seems to be doing no more than forking the discourse thread onto github. A link to discourse is not an acceptable issue description. If I want to read discourse, I'll read discourse.

This seems like a very niche thing to have its own special case in the parser. Wouldn't it make way more sense to just add a method to | or ∪ such that one of them evaluates to a Union? This way, we don't privilege Missing or Nothing, we write clearer code, and we don't add yet another new parser rule.

You could have

Base.:(|)(::Type{T}, ::Type{U}) where {T, U} = Union{T, U}

Vector{Int | Nothing} # Array{Union{Int, Nothing}, 1}

struct Foo 
    x::(Int | Missing) # field x must be either Int or Missing 
end

or if you prefer

Base.union(::Type{T}, ::Type{U}) where {T, U} = Union{T, U} # alias ∪

Vector{Int ∪ Missing} # Array{Union{Int, Missing}, 1}

struct Foo 
    x::(Int ∪ Nothing) # field x must be either Int or Nothing 
end

Even better is that you can have this today without type piracy or needing anyone else on the planet to agree with you:

# new session
julia> |(a, b) = Base.:(|)(a, b)
| (generic function with 1 method)

julia> |(::Type{T}, ::Type{U}) where {T, U} = Union{T, U}
| (generic function with 2 methods)

julia> Int | Nothing
Union{Nothing, Int64}

julia> @eval Base Int | Nothing
ERROR: MethodError: no method matching |(::Type{Int64}, ::Type{Nothing})
Closest candidates are:
  |(::Any, ::Any, ::Any, ::Any...) at operators.jl:538

@JeffBezanson, I hope this isn't taken the wrong way, and I acknowledge that there may be some context I'm missing, but I feel your response here was not constructive and rather brusque.

I was going to say that I feel a better response would be to request that @juliohm consult the CONTRIBUTING.md on how to phrase a feature request that is up to the standards of this repository, but it appears that there isn't really any mention there of language-level feature requests, only bug reporting and notes not to make library-level feature requests, so I opened an issue on that here: https://github.com/JuliaLang/julia/issues/36667.

In the meantime, I feel that requesting that the issue be edited to contain an actual description of what Julio is proposing would be more constructive than simply complaining that the issue is not up to your standards.

Wouldn't it make way more sense to just add a method to | or ∪ such that one of them evaluates to a Union?

I see that the OP text (rather than the issue title) emphasizes the union type shorthand. But I fear this is a rather too trivial usage for very valuable ASCII-based syntax. Rather, I think a better focus here should be the lifting of the function calls like f?(g?(x, h?(y)) ?+ z). This change would be as fundamental as the dot call syntax we use everywhere.

I think that even if this weren't a very niche thing (unlike broadcast), I'd still be against it just because of the huge amount of ambiguity it'd add to the already confusing ternary operator syntax. If anything, we should be thinking of how we can have less special syntactic forms in the parser, not more.

I also think the use-case you show above is just better handled by a macro anyways:

julia> using MacroTools, Missings

julia> @eval macro $(Symbol("?"))(expr)
           (esc ∘ MacroTools.postwalk)(expr) do ex
               if isexpr(ex, :call)
                   ex.args[1] = :($(passmissing)($(ex.args[1])))
               end
               ex
           end
       end
@? (macro with 1 method)

julia> f(x::Int) = x + 1
f (generic function with 1 method)

julia> @? sin(f(missing) * 2)
missing

I think it could be worthwhile to have a JuliaCon BoF topic to hammer this discussion out, since we can't rely on hallway conversations this year! And syntax would be really nice for this.

It seems (from this issue discussion, as well as discussions on Discourse and Zulip) that this is a sufficiently complicated topic to justify a BoF at JuliaCon as @vtjnash suggested

There are reasons why it's nice to have native dot calls and not just @.. It is very useful to be explicit and concise about what you lift. A lift-everything macro is indeed possible to implement, although it is not trivial (I'm actually writing Maybe.jl, which does that). Furthermore, it may be useful to lift on arguments like f(?a, b, ?c, d) rather than the entire function f?(a, Some(b), c, Some(d)). It would be more syntax-heavy to do this in a macro. Another reason in favor of the native syntax is that macros do not compose well.

Rather, I think a better focus here should be the lifting of the function calls like f?(g?(x, h?(y)) ?+ z). This change would be as fundamental as the dot call syntax we use everywhere.

I also feel that this is the key point. In my view, the decision is whether we want f? to be syntax to grant composability of functions that can return either one or zero things, or to be syntax to handle missing data.

In particular, I'm not sure I agree with this comment:

We don’t need this feature for nothing since the whole point of nothing is that it doesn’t lift, whereas the point of missing is to lift.

Because to me the point of nothing is that it does not lift _implicitly_, but lifting over it explicitly makes sense (for example, the Option type in Rust has a map method for this).

My two questions are:

  1. Does there exist a consistent definition of f? that would remove the need of all the special casing of missing in Base?
  2. If 1. has a positive answer, what is the advantage of separating missing and nothing, given that propagation would be done via ? rather than by whitelisting a set of functions?

We should review if this special ? syntax should apply to Missing or Nothing (presumably not both). Generally it has not been a burden for me to write OptString = Union{String, Missing} and use OptString everywhere. There is only minor ergonomic improvement writing String? for use in 3VL (aka Missing data). I'm not sure a generic lifting semantics for functions with potentially-missing arguments is even possible, moreover, I'm not sure that we want users of functions to have to know when to use ? and when to not use it... stuff like this can get confusing. Perhaps it's better if functions that handle Missing arguments do so explicitly, keeping all of the special casing of missing in Base. Moreover, I don't mind get(mydict, "mykey", missing), it's rather clear what's going on and if mydict["mykey"] == missing then I'm alright with the ambiguity (was it actually a missing value or was the key missing), in most cases, this ambiguity is a feature. Hence, for casual use, where 3VL logic is used, I'm not sure if an explicit lifting semantics or syntax sugar will be a net positive; things already work relatively well.

On the other hand, Nothing is the more explicit cousin and it's not easy to grok (e.g. Some(nothing)) -- perhaps it deserves to have a special syntax and explicit lifting semantics? Even those who support data scientists end up putting on the system programmer hat, where 3VL is not what we wish. In these cases, a nicer syntax might encourage proper use of Nothing (a value does not exist) over Missing (the value is unknown) as the default when "nothingness" is needed. This seems to be the direction that @tkf is suggesting. My question about this is nesting: on the Nth level, will I have Some(Some(...Some(nothing)...)) ?

Regardless, Jeff has a great point about venue; I suppose we should be using Discourse? (since Slack history disappears) Could there be a "Feature Request" section in Discourse, perhaps under "Usage" (most feature requests are answered with how to use Julia more effectively...) and shuffle discussion there till enough consensus emerges that we have a feature set suitable for review?

  1. Does there exist a consistent definition of f? that would remove the need of all the special casing of missing in Base?

@piever I think the answer is "no."

For concrete definition of "lifting", this function may be a nice toy example:

return1(_) = 1

This function ignores its argument and always returns 1. The question is what return1?(x) should do.

In the Union{Some{T},Nothing} world, it's pretty straightforward:

return1?(Some(x)) === Some(1)
return1?(nothing) === nothing


Note: equivalent code in Haskell's maybe monad

Prelude> Just 1 >>= return . const 1
Just 1
Prelude> Nothing >>= return . const 1
Nothing

It is also straightforward to define f? for Union{Some{T},Nothing} generally:


An implementation of f?

f? = (args...; kwargs...) -> maybecall(f, args...; kwargs...)

function maybecall(f, args...; kwargs...)
    any(isnothing, args) && return nothing
    any(isnothing, Tuple(kwargs.data)) && return nothing
    y = f(_somethings(args)...; _somethings(kwargs.data)...)
    return y === nothing ? y : Some(something(y))
end

_somethings(xs::Tuple) = map(something, xs)
_somethings(xs::NamedTuple{names}) where {names} =
    NamedTuple{names}(_somethings(Tuple(xs)))
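As a usage sketch, the definitions above are repeated here so the example runs on its own, with one assumption: values(kwargs) is used in place of kwargs.data to obtain the keyword arguments as a NamedTuple. The f? syntax itself is only a proposal, so the calls go through maybecall directly.

```julia
function maybecall(f, args...; kwargs...)
    kw = values(kwargs)                          # keyword arguments as a NamedTuple
    any(isnothing, args) && return nothing       # any absent positional argument
    any(isnothing, Tuple(kw)) && return nothing  # any absent keyword argument
    y = f(_somethings(args)...; _somethings(kw)...)
    return y === nothing ? y : Some(something(y))
end

# Unwrap every Some in a tuple or named tuple of arguments.
_somethings(xs::Tuple) = map(something, xs)
_somethings(xs::NamedTuple{names}) where {names} =
    NamedTuple{names}(_somethings(Tuple(xs)))

return1(_) = 1

r1 = maybecall(return1, Some(42))    # Some(1)
r2 = maybecall(return1, nothing)     # nothing
r3 = maybecall(+, Some(1), Some(2))  # Some(3)
r4 = maybecall(round, Some(3.14159); digits = Some(2))   # Some(3.14)
r5 = maybecall(Some, Some(nothing))  # Some(nothing), no extra nesting
```

The last call illustrates why the result is re-wrapped with Some(something(y)): a function that already returns a Some is not wrapped a second time.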

In the Union{T,Missing} world, what should return1?(x) do? I'm guessing

return1?(x) === 1
return1?(missing) === 1

The main difference to Union{Some{T},Nothing} is return1?(missing) === 1 (@juliohm @bkamins @clarkevans @piever please point it out if this definition doesn't make sense).

So, I think using f? for Union{T,Missing} has a large conceptual issue. This is because it's hard to define "lifting" mechanistically (like return1?(missing, y)). What f? should do depends on the semantics of the code.

This is why I think f? should be used for Union{Some{T},Nothing} and not Union{T,Missing}.

I'm not sure that we want users of functions to have to know when to use ? and when to not use it

@clarkevans I think it's rather dangerous that users need to know which functions _implicitly_ special-case missing in a "non-generic" way, even when they don't see any special syntax. Because of this, it is invalid to use

if x == y
    ...
end

in generic code, because if requires a Bool and x == y could evaluate to missing.
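A short sketch of that failure mode, using only Base. The helper name describe is made up for this example.

```julia
x, y = 1, missing

eq = (x == y)           # missing, not a Bool
same = isequal(x, y)    # false, isequal always returns a Bool

# A generic helper that branches on ==.
describe(a, b) = a == b ? "equal" : "different"

ok = describe(1, 1)     # "equal"

# With a missing argument the condition is not a Bool, so the branch itself throws.
threw_type_error = try
    describe(x, y)
    false
catch err
    err isa TypeError
end
```

Generic code that must branch can use isequal, which never returns missing, at the cost of treating missing as an ordinary value.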

My question about this is nesting, on the Nth level, will I have Some(Some(...Some(nothing)...)) ?

I'm not sure what you mean. The direct answer is yes. This is an important feature. However, I suspect that this could be coming from a misunderstanding of how lifting works for Union{Some{T},Nothing}. Note that Some?(Some?(Some?(Some(nothing)))) is Some(nothing).

FYI here is my POC for @? and other APIs: https://github.com/tkf/Maybe.jl

My two questions are:

1. Does there exist a consistent definition of `f?` that would remove the need of all the special casing of `missing` in Base?

2. If 1. has a positive answer, what is the advantage of separating `missing` and `nothing`, given that propagation would be done via `?` rather than by whitelisting a set of functions?

@piever One difficulty is that the three-valued logic implemented by logical operators differs from standard lifting semantics. So lifting cannot replace all operations defined on missing. Though we could define these as special cases: that shouldn't be too confusing as the list of logical operators is very short.

In the Union{T,Missing} world, what should return1?(x) do? I'm guessing

return1?(x) === 1
return1?(missing) === 1

The main difference to Union{Some{T},Nothing} is return1?(missing) === 1 (@juliohm @bkamins @clarkevans @piever please point it out if this definition doesn't make sense).

So, I think using f? for Union{T,Missing} has a large conceptual issue. This is because it's hard to define "lifting" mechanistically (like return1?(missing, y)). What f? should do depends on the semantics of the code.

@tkf I don't get why you would want return1?(missing) === 1. The point of lifting is to return missing when one of the inputs is missing -- just like for nothing. See what passmissing(f) does in Missings.jl.
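The lifting semantics described here can be sketched by hand without depending on Missings.jl. The name liftmissing is made up, and the sketch only inspects positional arguments; it is an illustration of the idea, not the passmissing implementation.

```julia
# `liftmissing` is a hypothetical helper: return missing if any positional
# argument is missing, otherwise call f unchanged.
function liftmissing(f)
    return (args...; kwargs...) -> any(ismissing, args) ? missing : f(args...; kwargs...)
end

return1(_) = 1
f(x::Int) = x + 1            # has no method for Missing

a = liftmissing(return1)(missing)   # missing, even though return1 ignores its argument
b = liftmissing(return1)(5)         # 1
c = liftmissing(f)(missing)         # missing, f is never called
d = liftmissing(f)(2)               # 3
```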

It seems that this is a popular topic. There is a blog post on String? in Dart. The idea that the compiler could know about potentially missing or nothing values, and help the developer avoid mistakes is quite attractive.

@nalimilan I know what passmissing does. I was just pointing out that you can't get something like the 3VL from the maybe monad.
