Julia: The fate of Nullable

Created on 4 Jul 2017 · 43 Comments · Source: JuliaLang/julia

With the prospect of representing missing data as Union{T, Null} in Julia 0.7, we should decide what will happen to Nullable. I think the consensus is that, being container-like, Nullable is appropriate to represent "the software engineer's null", as opposed to "the data analyst's null", a.k.a. missing values. In other words, Nullable offers three properties which Union{T, Null} does not:

  1. It does not automatically propagate missingness.
  2. It requires explicit handling of the possibility that a value may be null, even when it isn't (contrary to Union{T, Null}, where code may work when a value is of type T but not when it is of type Null, which might not have been properly anticipated/tested).
  3. It allows distinguishing Nullable{Nullable{T}} from Nullable{T} (contrary to Union{Union{T, Null}, Null} == Union{T, Null}), which is useful when you need to distinguish between "no value" and "null value". Such situations arise e.g. when doing a dictionary lookup (tryget, cf. https://github.com/JuliaLang/julia/issues/13055 and https://github.com/JuliaLang/julia/pull/18211); when parsing a string via tryparse to a value which could either be of type T, null, or invalid; or when wrapping a value which could either be of type T or null in a Nullable before returning it from a function.

The first two features are the ones which turned out to be annoying when working with missing data, but which can provide additional safety for general programming. A detailed discussion of the advantages and drawbacks of these approaches can be found in the Nullable Julep.
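
To make point 3 concrete, here is a minimal sketch. It assumes a stand-in `Null` singleton defined locally (the name follows this proposal; it is not a type in released Julia) and uses `Some`, the wrapper available in Julia 0.7 and later, in place of `Nullable`.

```julia
# Stand-in for the proposed `Null` type; defined here only for illustration.
struct Null end
const null = Null()

# Unions flatten, so nesting them carries no extra information.
flat = Union{Union{Int, Null}, Null} == Union{Int, Null}

# A wrapper does not flatten: a wrapped null differs from a bare null.
wrapped = Some(null)   # "there is a value, and it is null"
bare = null            # "there is no value"
distinct = wrapped !== bare
```

Unwrapping stays explicit: `something(wrapped)` gives back the inner `null`.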

Given that, several paths can be taken for Nullable in Julia 0.7:

  1. Make Nullable{T} a (deprecated) type alias for Union{T, Null}. This would have the advantage that Julia would have a single concept of null/missing values, but without the advantages of the three points above. Checks that code is correctly prepared to handle null values could still be done by a linter.
  2. Make Nullable{T} a (deprecated) type alias for Union{Some{T}, Null}, with Some{T} a wrapper around a value of type T which would behave essentially like Nullable{T} currently. Applying a function on the value would require using f.(x), broadcast or pattern matching, so that missingness would never propagate without explicitly asking for it. The advantages would be those of the three points above, at the cost of two different representations of missingness (or almost two, since the Null type would be used in both cases).
  3. Deprecate Nullable in Base and move it as-is to a package. A possible variant would be to rename it to Option in order to prevent confusion with Null and be more consistent with other languages like Scala, Rust or Swift (IIRC, Nullable was originally called Option and lived in the Options.jl package). The main advantage of this approach is to have a single representation of null values in Base, in particular to avoid setting the design of Nullable in stone in 1.0. The main issue is that no code would be able to use Nullable in Base, which implies in particular changing tryparse to return Union{T, Null}, and that no correct tryget method could be implemented for dicts (https://github.com/JuliaLang/julia/issues/13055). OTOH this could help increase consistency with e.g. match, which returns Union{T, Void}, and uses of Nullable are not so widespread in Base.
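
For orientation, and looking ahead to what released Julia 1.x does (an outcome not yet known when the options above were written): `tryparse` and `match` both signal failure with a bare `nothing`, where `Nothing` is the later name of `Void`. A short check:

```julia
# Julia 1.x behaviour: failure is signalled by `nothing`, success by the bare value.
parsed = tryparse(Int, "42")                 # an Int
failed = tryparse(Int, "forty-two")          # nothing
nomatch = match(r"\d+", "no digits here")    # nothing
```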

EDIT: added mention of point 3. in the first series of bullets.

decision missing data

All 43 comments

In my opinion, we should do the following:

julia> Nullable{T} where T
WARNING: Nullable{T} is deprecated. Use T? for missing data or Maybe{T} for
a container similar to Nullable{T} in Julia versions 0.6 and lower.
Stacktrace:
 ...
while loading ...
Union{T, Null} where T

The current uses of Nullable in user and package code are likely to represent missing data, in which case the Union is what they'll actually want.

There's something to be said for having both representations in Base, as they serve fundamentally different purposes. The container-based approach is ideal for things like Keno's suggested revision to the iteration protocol, where a Union with a propagating null meaning "iteration is done" really isn't what you want, as the iterable itself may contain null values. It's also useful for cases such as tryparse(Int?, "null"): the Union approach would make that case unclear as to whether the parsing was successful.
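
As a sketch of why wrapping matters for iteration: in the protocol that Julia 1.x eventually shipped, `iterate` returns `nothing` when done and otherwise an `(item, state)` tuple. The tuple plays the role of the container, so elements that are themselves `nothing` stay unambiguous.

```julia
xs = [nothing, nothing]

first_step = iterate(xs)       # a tuple whose item happens to be `nothing`
second_step = iterate(xs, 2)   # same, for the second element
done = iterate(xs, 3)          # a bare `nothing`: iteration is finished
```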

Regarding tryparse(Int?, "null"): why would it be valid code?
I understand that only tryparse(Int, "null") would be valid, and it would return null (as for any other invalid string) as in this case parsing would fail.

Say you're parsing a text file, and you know that it contains data that is either an integer or null. Thus you want to parse something as Union{Int, Null}, i.e. Int?, since null is not a valid Int. So now if you split on the delimiter and do tryparse(Int?, x) on a value, you'll want to be able to distinguish between having parsed a null value and having gotten something you didn't expect. For example, we'd have tryparse(Int?, "george") == tryparse(Int?, "null") == null, which would be wrong.
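
A minimal sketch of the distinction, under stated assumptions: `tryparse_field` is a hypothetical helper (not part of Base), the data null is spelled `missing` and the failure marker `nothing` as in Julia 1.x, and the literal "null" is taken as the file's missing-value marker.

```julia
# Hypothetical helper: parse a field that may hold an integer or the literal "null".
# Failure is a bare `nothing`; any successful parse is wrapped in `Some`.
function tryparse_field(s::AbstractString)
    s == "null" && return Some(missing)       # successfully parsed a missing value
    v = tryparse(Int, s)                      # `nothing` on failure in Julia 1.x
    return v === nothing ? nothing : Some(v)
end

as_int = tryparse_field("42")
as_null = tryparse_field("null")
as_error = tryparse_field("george")
```

With the wrapper, "null" and "george" no longer collapse to the same result: the former is a `Some` holding `missing`, the latter is `nothing`.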

The use case you give is typical, but I am not sure this should be handled by tryparse without any additional arguments. In a text file, a missing value could be represented differently (and most probably it would not be "null" but typically either "NA" or "").
However, I agree that I would like tryparse(Int, "any invalid integer string including null") to return Maybe{Int}() and not null, to distinguish "valid missing data" from "parse error". And when we get a parse error, we can check whether the string represents missing data.

Of course tryparse (and parse) could be extended to handle something like tryparse(Int?, "some string to parse", "string representing NA").

My feeling is that in general the discussion about leaving or removing Nullable (probably renamed) should include exceptions. This can be seen for instance in the comparison between parse and tryparse. Ideally the design of Base should be consistent about when exceptions are thrown, when Nullable{T} is produced, and when Union{T, Null} is returned.

Am I correct in reading that the optimizations for unions (and Arrays of unions) will only be possible for isbits types? In which case Nullable would still be useful for more complex types.

No, Nullable{T} is already inefficient for non-isbits (where Union{T, Null} is already the more optimal approach), so making Union{T, Null} efficient for isbits is expected to now cover all cases optimally using one representation.

Why does Nullable need to be formally deprecated? Can't the data libraries just update to use unions and people happily using Nullable as a Maybe type can continue to do so?

The problem with keeping the Nullable name is that it's confusingly similar to Null, and we'd need to deprecate it in order to rename it.

For the use cases where Nullable has been used as a result-or-flag that something went wrong (roughly as a deferred exception), I'm increasingly finding that it isn't only useful to know that something went wrong, but also a bit of detail about what specifically went wrong. So for the optional meaning we may get more mileage out of a result-or-error-code type than the current result-or-null type (assuming API's were migrated to start using it uniformly).

Such a result-or-error-code type should probably be a sum type, so should we reconsider adding a facility for general sum types to Base and make the result-or-error-code type a special case of it?

(Sum types are not Unions, ref https://discourse.julialang.org/t/sum-types-in-julia/2795/13. Assume tryparse(T, ...) returned a ResultOrError{T} where T == Union{T, Error} where T. Now if you wanted to parse logged errors with a tryparse(Error, ...), you couldn't distinguish successfully parsed errors from parse errors. What one wants here would be an Either{T, Error} where T which holds a tag saying whether its contents are from the left or right. So Either{Error, Error} would hold just that extra bit of information needed in the above example.)
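
The tagging idea can be sketched as follows. All names here (`Left`, `Right`, `Either`, `ParseFailure`) are hypothetical and defined only for this illustration; none of them exist in Base.

```julia
# Hypothetical tagged sum type: the wrapper struct is the tag.
struct Left{T}
    value::T
end
struct Right{T}
    value::T
end
const Either{L, R} = Union{Left{L}, Right{R}}

# Hypothetical error description used on both sides.
struct ParseFailure
    msg::String
end

# An untagged union collapses when both sides are the same type...
collapsed = Union{ParseFailure, ParseFailure} === ParseFailure

# ...whereas the tagged version keeps the two cases apart.
parsed_error = Left(ParseFailure("logged error, parsed fine"))
parse_error = Right(ParseFailure("could not parse the log line"))
```

Both values are instances of `Either{ParseFailure, ParseFailure}`, yet `parsed_error isa Left` and `parse_error isa Right` tell them apart.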

@tkelman This is exactly why I am asking about Maybe vs. throwing an exception (or some other type like Either in Haskell).

See e.g. https://www.schoolofhaskell.com/school/starting-with-haskell/basics-of-haskell/10_Error_Handling

There are two things to consider in my opinion:

  1. Performance of try-catch vs. Nullable; I have never done such a comparison.
  2. Nullable forces the user to think about an error (we know that the function returns a Nullable and we have to handle it); an exception can be silently ignored.

IMO exception objects should stay out of control flow. If something goes wrong and emits an error, the error should be thrown. If you really need to deal with an exception object, catch it. The container-based null (value is or is not present) says to the user, "This _worked_ fine; the code executed normally. Some valid logic in the code has told me that there should not be a value here."

What you're describing, @tkelman, sounds like result types. IMO they have their place, but generally speaking, if you have Result{T, Exception} as the result of a function call, you lose the stack trace information associated with the error. Only knowing the type of the error object is not really all that useful, since you as a caller can't know what in the function caused it, or even if it was intentional.

Of course, careful judgement is needed for deciding throwing an exception versus using a result-or-error-code, but tryparse is a prime example where you might want a reason for a failed parse, but the whole point of the function is to not throw an exception in that case.

Of course, I completely forgot to mention the main interest of Nullable and Union{Some{T}, Null}, which is that they allow distinguishing a wrapped null from a null. This feature is required in particular for tryget on dictionaries to distinguish "no value associated with key" from "null value associated with key" (https://github.com/JuliaLang/julia/issues/13055 and https://github.com/JuliaLang/julia/pull/18211).
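
A minimal sketch of such a lookup, assuming a hypothetical `tryget` (Base does not define a function by this name) and using `Some`/`nothing` from Julia 0.7 and later as the wrapper and the "no value" marker:

```julia
# Hypothetical `tryget`: wrap a found value in `Some`, return `nothing` for a missing key.
function tryget(d::AbstractDict, key)
    return haskey(d, key) ? Some(d[key]) : nothing
end

d = Dict("a" => 1, "b" => nothing)

found = tryget(d, "a")          # Some(1)
found_null = tryget(d, "b")     # Some(nothing): the key exists, its value is nothing
absent = tryget(d, "c")         # nothing: no such key
```

With a bare `Union{T, Nothing}` return type, the last two cases would be indistinguishable.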

Result types would also be a nice extension of the Union{Some{T}, Null} approach, just replacing Null with one or more Exception types. That's been discussed at https://github.com/JuliaLang/julia/issues/14972, with networking functions as a use case.

Forgot to mention I've updated the description to cover these points.

My preference would be to just leave Nullable alone, or at most change its internal representation to

struct Nullable{T}
    value::Union{T,Null}
end

I think the data-missing-values story should just stick with the names that are already used in DataFrames/DataArrays today, i.e. NA and NAtype (so Union{T, NAtype} for the Union story), or come up with some new names that are not null related.

That strategy would avoid a lot of deprecation work and breaking of old code: essentially all the code depending on the current Nullable would continue to work, and it would probably also mean a lot less breaking of code that uses the current DataFrame story.

I don't think there are particularly super strong arguments pro/con various naming schemes per se, for example I can see lots of arguments pro/con Nullable as is in base right now, and lots of pro/con re using null vs NA for the missing data story. I think at the end of the day it is a wash and comes down to personal preference. In such a situation I feel other criteria, like not breaking a lot of things and a general principle that the needs of one part of the package scene (data) shouldn't trigger renames in base, seem more important to me.

A separate point: if we do want to use different concepts for the software engineering and data science case of missing values (which I strongly support) and want to use a Union for both, we probably shouldn't use the same type in the union to represent missing values. In particular, option 2) above seems problematic: I assume we would add definitions like +(a::Null, b::Null) for the data science case, but now + would also work for the software engineering missingness case, whereas the whole point of having two distinct stories was to prevent that.

assume we would add definitions like +(a::Null, b::Null) for the data science case, but now + would also work for the software engineering missingness case, whereas the whole point of having two distinct stories was to prevent that.

+ would not work in general since e.g. Some(1) + 1 would fail. To avoid any confusion, we could use Void/nothing instead of Null/null, but then should isnull(nothing) be true? I think I'd prefer having only one concept of null value.

Arguments about naming are often quite close to bikeshedding, but in this case there are a few reasons to change the names:

  • Consistency with other languages: only R uses NA, while null is used by all recent languages which have that concept (though there is also nil e.g. in Go) and by SQL (NULL).
  • Consistency with other types: in Julia, objects use lowercase names (in particular nothing), and types use camel case (in particular Void). NA and NAtype really feel out of place given these conventions.

Also a reason to replace Nullable with Union{Some{T}, Null} is that it can be extended to represent result types via e.g. Union{Some{T}, Exception}. The current Nullable is not as flexible.

+ would not work in general since e.g. Some(1) + 1 would fail.

Some(1) + null would throw an error, but null + null would not, which in my mind would really not be the story we would want for the software dev missingness story, assuming there is a decision to separate those two types of missingness.
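
The concern can be illustrated with the two singletons that Julia 1.x ended up with, `missing` for the data case and `nothing` for the software case (this split is the later outcome, not something settled at this point in the thread). `throws_method_error` is a hypothetical helper written for this sketch.

```julia
# Hypothetical helper: true when calling `f` raises a MethodError.
function throws_method_error(f)
    try
        f()
        return false
    catch err
        return err isa MethodError
    end
end

wrapped_fails = throws_method_error(() -> Some(1) + 1)        # no `+` on the wrapper
nothing_fails = throws_method_error(() -> nothing + nothing)  # no `+` on the software null
propagated = missing + missing        # the data analyst's null propagates silently
unwrapped = something(Some(1)) + 1    # explicit unwrapping is required
```

Had a single `Null` type served both roles, the third line's behaviour would also apply to the software-engineering case, which is the objection raised above.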

To avoid any confusion, we could use Void/nothing instead, but then should isnull(nothing) be true? I think I'd prefer having only one concept of null value.

I would keep the complete terminology separate for these two cases if the decision is to have two separate concepts of missingness. For example use isna and isnull for the two cases, or some other pair. Using the same function names/terminology for both use cases while at the same time trying to make them distinct seems very confusing to me and bound to generate problems down the road.

only R uses NA

Pandas also uses NA. In my mind R and Python are the two most important ecosystems for data science right now, so being in line with those two would be my first priority.

Also a reason to replace Nullable with Union{Some{T}, Null} is that it can be extended to represent result types via e.g. Union{Some{T}, Exception}. The current Nullable is not as flexible.

Isn't that the kind of result type that @ararslan described above? I agree with him, that seems a different thing than what Nullable is, so I would just use a different type for that.

NA and NAtype really feel out of place given these conventions.

That could also be changed to say na and NaType or something like that. I would still prefer to keep the backwards compat for DataFrame instead, but with that choice one would at least only break code that uses DataFrames, and not code that uses Nullable.

Pandas also uses NA. In my mind R and Python are the two most important ecosystems for data science right now, so being in line with those two would be my first priority.

Pandas calls missing values NA in their docs, but that's actually NaN even in doc examples, so that's really not a reference for consistency. na would be the worst of both worlds: not-backward compatible, not used by any other language.

Pandas calls missing values NA in their docs, but that's actually NaN even in doc examples, so that's really not a reference for consistency.

They use NA in the docs, and then have a bunch of APIs that use that terminology (e.g. fillna and dropna). My understanding of their plans for pandas2 is to get rid of the NaNs and move everything over to just NA. Having said that, they also have isnull, so yes, not super consistent currently.

My preference is

  • Nullable{T} --> Union{T, Null}
  • Optionals.Optional{T} --> Union{Optionals.Some{T}, Optionals.None}

I think that Null and Optionals.None will be sufficiently different (e.g. for broadcasting) that they should be distinct singleton types.

Nullable has, up to now, been used for both the data scientist's na and the software engineer's null. But it does a relatively poor job as the former, and a good job as the latter. So anything we change it to should not break it in the places where it's doing a good job. In other words, we should not assume people want magic that they haven't had in the past; if they want it, we should make them use it explicitly. (software engineer's null works through lack of magic, data engineer's na works through magic)

So, how about the following:

Nullable{T} --> Union{Some{T},Void} (software engineers' null; possibly deprecated to Maybe{T})

null = nothing and deprecate nothing... shorter is better

T? --> Union{T,na}

typeof(na)==NAType

I don't think any changes should affect nothing; that's a "system" value used for things like empty function returns and signifying a Void return type from ccalls. We wouldn't want people to get confused into using it for something data science or software related (which would also lead to things like regex match using Nullable instead).

T? --> Union{T, na}

That doesn't really make sense because na is a value, not a type.

I'd be fine w/

Nullable{T} => Union{T, Void} (since the Void should never really be a part of the usage of Nullable anyway); this will be the go-to for "software"-related usages of nullables, since almost nothing is defined on nothing, meaning any accidental usage of nothing will stop pretty quickly.

T? => Union{T, Null}: this will be the go-to for data scientists, with const null = Null() acting as a "sentinel value" for any type.

I don't think we should give another meaning to Void.

Whoops. I meant T? --> Union{T,NAType}, not T? --> Union{T,na}. Of course, the question of whether to call it na or null (and its type, NAType or Null) is just bikeshedding (except for the possible confusion over the fact that a Nullable can't have the value null), so I endorse my _tocayo_ Quinnj's suggestion above in all substantive regards.

:-1: to Union{T, Void}. The software engineer's null needs to distinguish a some value of nothing from a null value. This is not just a theoretical concern; it's very important for things like collections methods which return null if things are missing and somes otherwise, because collections can themselves contain nothing.

Lint, in particular, heavily relies on Nullable{Any} where the element can be nothing.

Sure, that's fine @TotalVerb. @ararslan, it's not really another meaning to Void, it shouldn't even be that visible/used by end users. It's just a "not Some" value that would indicate a missing value.

"Not Some" is the other meaning I mean. Void should be the thing that for loops and other such constructs return—that's it.

OK, if the CS null is not going to be Union{T,Void}, then I think that it should be

struct Nullable{T}
    value::Union{T,Null}
end

as proposed by DavidAnthoff.

That would leave 4 different kinds of nullish values: na::NAType, null::Null, nothing::Void, and Nullable{T}(null). Seems like too many to me. We could reduce that to 3 if we had

struct Nullable{T}
    value::Union{T,Void}
end

... then there would be just na::NAType, nothing::Void, and Nullable{T}(nothing). The one downside of that simplification would be that Nullable{Void} would have just one possible value, though Nullable{Nullable{Void}} would still have two.
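
The counting argument above can be checked with a small sketch. `MyNullable` is a hypothetical name chosen to avoid clashing with any real type, and `Nothing` is the later name of `Void`.

```julia
# Hypothetical container mirroring the struct proposed above.
struct MyNullable{T}
    value::Union{T, Nothing}
end

# With T = Nothing the field type collapses to Nothing: only one possible value.
field_type = fieldtype(MyNullable{Nothing}, :value)

# One level up there are two distinguishable values again.
empty_outer = MyNullable{MyNullable{Nothing}}(nothing)
empty_inner = MyNullable{MyNullable{Nothing}}(MyNullable{Nothing}(nothing))
```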

What you're calling na is what we've been calling null: the data analyst's null. So there aren't four, there are only two, null and Nullable{T}(). nothing is irrelevant to missing data.

I'll make a PR retaining my favorite solution to illustrate what this implies.

Can someone clarify if a shorthand like T? or Nullable{T} made it into this or a related PR?

It seems two would be needed, one for Union{T, Nothing} and another for Union{Some{T}, Nothing}.

Without a shorthand this union approach is less intuitive, harder to read, and only mildly less verbose on net than the old Nullable. I'm not a fan of Nullable, and I'm on board with the general spirit of this change, but I'm hoping a developer will rarely if ever see Union{Some{T}, Nothing}. Nullable parameters are too common for such awkward verbosity. I think nullable types in C# are a good example of how to handle this elegantly.

No, T? isn't supported yet, in part because we need to deprecate that syntax first in 0.7 (since it could mean other things in 0.6). I guess Union{Some{T}, Nothing} would then become Some{T}?. Anyway in most cases we recommend Union{T, Nothing}.

in part because we need to deprecate that syntax first in 0.7

It's actually been deprecated quite a while; I did it during JuliaCon last year. :slightly_smiling_face:

It's also unclear still whether T? will mean Union{T, Nothing} or Union{T, Missing}.

I think elsewhere we concluded that the right meaning for the ? operator will be Nothing-related, based on expected usages.

All this stuff made it into v1. Great!
Rather good work.

One minor issue remains: the verbosity we have to deal with when we type and retype Union{Some{T}, Nothing} all over the place.

So we have

  • Some{T}
  • something(x, y...)

And I propose to alleviate the verbosity while typing by adding

  • const Sometime{T} = Union{Some{T}, Nothing}
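
A minimal sketch of how such an alias would read in practice. `Sometime` is the name proposed above and is not defined in Base; it is written here without a trailing `where T`, which the parametric alias syntax does not need. `first_even` is a hypothetical function for illustration.

```julia
# Alias from the proposal; not part of Base.
const Sometime{T} = Union{Some{T}, Nothing}

# Hypothetical function using the alias as a return type annotation.
function first_even(xs)::Sometime{Int}
    for x in xs
        iseven(x) && return Some(x)
    end
    return nothing
end

found = something(first_even([1, 3, 4]), 0)   # unwraps the Some
fallback = something(first_even([1, 3]), 0)   # falls back to the default
```

The alias changes no semantics: `Sometime{Int}` is the same type as `Union{Some{Int}, Nothing}`.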

PS The issue is closed; I don't know if this is the best place to discuss that...
PPS Here is the source for v1.2 https://github.com/JuliaLang/julia/blob/v1.2.0/base/some.jl

I had proposed Option{T} originally, but since it's not strictly required for 1.0 it wasn't added.

My proposal was made specifically so as not to change any semantics (consensus was hardly reached), only to reduce retyping. The whole is homogeneous, easy to remember, and already in use today.

I must confess I have not worked out whether Option{T} had similar goals, or whether its rejection was motivated on the same basis.

Anyway, there are some choices today.
