With the perspective of representing missing data as `Union{T, Null}` in Julia 0.7, we should decide what will happen to `Nullable`. I think the consensus is that, being container-like, `Nullable` is appropriate to represent "the software engineer's null", as opposed to "the data analyst's null", a.k.a. missing values. In other words, `Nullable` offers three properties which `Union{T, Null}` does not:
- (… `Union{T, Null}`, where code may work when a value is of type `T` but not when it is of type `Null`, which might not have been properly anticipated/tested).
- Distinguishing `Nullable{Nullable{T}}` from `Nullable{T}` (contrary to `Union{Union{T, Null}, Null} == Union{T, Null}`), which is useful when you need to distinguish between "no value" and "null value". Such situations arise e.g. when doing a dictionary lookup (`tryget`, cf. https://github.com/JuliaLang/julia/issues/13055 and https://github.com/JuliaLang/julia/pull/18211); when parsing a string via `tryparse` to a value which could either be of type `T`, null, or invalid; or when wrapping a value which could either be of type `T` or null in a `Nullable` before returning it from a function.
The first two features are the ones which turned out to be annoying when working with missing data, but which can provide additional safety for general programming. A detailed discussion of the advantages and drawbacks of these approaches can be found in the `Nullable` Julep.
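The nesting point can be illustrated with a minimal sketch. It assumes the names that appear later in this thread and in Julia 1.x, with `Nothing` standing in for `Null` and `Some` as the wrapper; it is an illustration, not the design under discussion.

```julia
# Assumption: Julia 1.x, with `Nothing` standing in for the thread's `Null`.

# Unions flatten, so a second "null" layer adds no information.
flat = Union{Union{Int, Nothing}, Nothing}
@assert flat == Union{Int, Nothing}

# A wrapper nests: Some(nothing) is "a value that is null",
# while a bare nothing is "no value at all".
wrapped_null = Some(nothing)
no_value = nothing
@assert wrapped_null !== no_value
@assert something(wrapped_null) === nothing

# Wrappers can nest further, and each layer stays visible in the type.
@assert typeof(Some(Some(1))) == Some{Some{Int}}
```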
Given that, several paths can be taken for `Nullable` in Julia 0.7:
1. Make `Nullable{T}` a (deprecated) type alias for `Union{T, Null}`. This would have the advantage that Julia would have a single concept of null/missing values, but without the advantages of the three points above. Checks that code is correctly prepared to handle null values could still be done by a linter.
2. Make `Nullable{T}` a (deprecated) type alias for `Union{Some{T}, Null}`, with `Some{T}` a wrapper around a value of type `T` which would behave essentially like `Nullable{T}` currently. Applying a function on the value would require using `f.(x)`, `broadcast` or pattern matching, so that missingness would never propagate without explicitly asking for it. The advantages would be those of the three points above, at the cost of two different representations of missingness (or almost two, since the `Null` type would be used in both cases).
3. Deprecate `Nullable` in Base and move it as-is to a package. A possible variant would be to rename it to `Option` in order to prevent confusion with `Null` and be more consistent with other languages like Scala, Rust or Swift (IIRC, `Nullable` was originally called `Option` and lived in the Options.jl package). The main advantage of this approach is to have a single representation of null values in Base, in particular to avoid setting the design of `Nullable` in stone in 1.0. The main issue is that no code would be able to use `Nullable` in Base, which implies in particular changing `tryparse` to return `Union{T, Null}`, and that no correct `tryget` method could be implemented for dicts (https://github.com/JuliaLang/julia/issues/13055). OTOH this could help increase consistency with e.g. `match`, which returns `Union{T, Void}`, and uses of `Nullable` are not so widespread in Base.
EDIT: added mention of point 3. in the first series of bullets.
In my opinion, we should do the following:
```julia
julia> Nullable{T} where T
WARNING: Nullable{T} is deprecated. Use T? for missing data or Maybe{T} for
a container similar to Nullable{T} in Julia versions 0.6 and lower.
Stacktrace:
...
while loading ...
Union{T, Null} where T
```
The current uses of `Nullable` in user and package code are likely to represent missing data, in which case the `Union` is what they'll actually want.
There's something to be said for having both representations in Base, as they serve fundamentally different purposes. The container-based approach is ideal for things like Keno's suggested revision to the iteration protocol, where a `Union` with a propagating null meaning "iteration is done" really isn't what you want, as the iterable itself may contain null values. It's also useful for cases such as `tryparse(Int?, "null")`: the `Union` approach would make that case unclear as to whether the parsing was successful.
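As a small illustration of the iteration point, the sketch below assumes the `iterate` protocol of Julia 1.x, where a bare `nothing` signals the end of iteration and every element is returned wrapped in a tuple together with the next state. The tuple plays the role of the container, so a collection that itself holds `nothing` is not mistaken for an exhausted one.

```julia
# Assumption: the Julia 1.x `iterate` protocol.
xs = [nothing, nothing]      # a collection whose elements are themselves `nothing`

step = iterate(xs)
@assert step !== nothing         # iteration is not finished...
@assert first(step) === nothing  # ...even though the element is `nothing`

# An empty collection returns the bare sentinel.
@assert iterate(Int[]) === nothing
```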
Regarding `tryparse(Int?, "null")`: why would that be valid code?
I understand that only `tryparse(Int, "null")` would be valid, and it would return `null` (as for any other invalid string), as in this case parsing would fail.
Say you're parsing a text file, and you know that it contains data that is either an integer or `null`. Thus you want to parse something as `Union{Int, Null}`, i.e. `Int?`, since `null` is not a valid `Int`. So now if you split on the delimiter and do `tryparse(Int?, x)` on a value, you'll want to be able to distinguish between having parsed a null value and having gotten something you didn't expect. For example, we'd have `tryparse(Int?, "george") == tryparse(Int?, "null") == null`, which would be wrong.
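One way to keep the two outcomes apart is to wrap successful results. The sketch below is only an illustration: `tryparse_field` and its `na` keyword are hypothetical names, and it assumes Julia 1.x, where `tryparse(Int, s)` returns `nothing` on failure and `missing` is the data analyst's null.

```julia
# Hypothetical helper, not part of Base.
# Returns Some(value) when the field was understood (value may be `missing`)
# and a bare `nothing` when it was not.
function tryparse_field(s::AbstractString; na::AbstractString = "null")
    s == na && return Some(missing)      # a valid missing value
    v = tryparse(Int, s)                 # `nothing` if `s` is not an integer
    return v === nothing ? nothing : Some(v)
end

@assert tryparse_field("42") === Some(42)
@assert ismissing(something(tryparse_field("null")))   # parsed, and missing
@assert tryparse_field("george") === nothing           # not parsed at all
```

The `na` keyword lets the caller pick the marker used by the file, e.g. `tryparse_field(x; na = "NA")`.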
The use case you give is typical, but I am not sure if this should be handled by `tryparse` without any additional arguments. In a text file, a missing value could be represented differently (and most probably it would not be `"null"` but typically either `"NA"` or `""`).
However, I agree that I would like `tryparse(Int, "any invalid integer string including null")` to return `Maybe{Int}()` and not `null`, to distinguish "valid missing data" from "parse error". And when we get a parse error, we can check whether the string represents missing data.
Of course `tryparse` (and `parse`) could be extended to handle something like `tryparse(Int?, "some string to parse", "string representing NA")`.
My feeling is that in general the discussion about leaving or removing `Nullable` (probably renamed) should include exceptions. This can be seen for instance in the comparison between `parse` and `tryparse`. Ideally the design of Base should be consistent about when exceptions are thrown, when `Nullable{T}` is produced and when `Union{T, Null}` is returned.
Am I correct in reading that the optimizations for unions (and Arrays of unions) will only be possible for isbits types? In which case Nullable would still be useful for more complex types.
No, `Nullable{T}` is already inefficient for non-isbits (where `Union{T, Null}` is already the more optimal approach), so making `Union{T, Null}` efficient for isbits is expected to now cover all cases optimally using one representation.
Why does `Nullable` need to be formally deprecated? Can't the data libraries just update to use unions, and people happily using `Nullable` as a Maybe type can continue to do so?
The problem with keeping the `Nullable` name is that it's confusingly similar to `Null`, and we'd need to deprecate it in order to rename it.
For the use cases where `Nullable` has been used as a result-or-flag that something went wrong (roughly as a deferred exception), I'm increasingly finding that it isn't only useful to know that something went wrong, but also a bit of detail about what specifically went wrong. So for the optional meaning we may get more mileage out of a result-or-error-code type than the current result-or-null type (assuming APIs were migrated to start using it uniformly).
Such a result-or-error-code type should probably be a sum type, so should we reconsider adding a facility for general sum types to Base and make the result-or-error-code type a special case of it?
(Sum types are not `Union`s, ref https://discourse.julialang.org/t/sum-types-in-julia/2795/13. Assume `tryparse(T, ...)` returned a `ResultOrError{T} where T == Union{T, Error} where T`. Now if you wanted to parse logged errors with a `tryparse(Error, ...)`, you couldn't distinguish successfully parsed errors from parse errors. What one wants here would be an `Either{T, Error} where T` which holds a tag saying whether its contents are from the left or right. So `Either{Error, Error}` would hold just that extra bit of information needed in the above example.)
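A minimal sketch of such a tagged type follows. All names here (`Left`, `Right`, `Either`, `LogError`, `tryparse_logerror`) are hypothetical and chosen for illustration; nothing like this exists in Base. The convention that `Right` means success is borrowed from Haskell.

```julia
# Hypothetical error record that we want to parse out of a log file.
struct LogError
    msg::String
end

# Tagged wrappers: the wrapper type records which side the payload came from.
struct Left{T}
    value::T
end
struct Right{T}
    value::T
end
const Either{L, R} = Union{Left{L}, Right{R}}

# Right(...) is a successfully parsed error record,
# Left(...) is an error describing why parsing failed.
function tryparse_logerror(s::AbstractString)::Either{LogError, LogError}
    prefix = "ERROR: "
    startswith(s, prefix) || return Left(LogError("not an error record: $s"))
    return Right(LogError(s[length(prefix)+1:end]))
end

ok = tryparse_logerror("ERROR: disk full")
bad = tryparse_logerror("hello")
@assert ok isa Right && ok.value.msg == "disk full"
@assert bad isa Left
```

Both results carry a `LogError`, yet the tag keeps them apart, which a plain `Union{LogError, LogError}` could not do.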
@tkelman This is exactly why I am asking about `Maybe` vs. throwing an exception (or some other type like `Either` in Haskell).
See e.g. https://www.schoolofhaskell.com/school/starting-with-haskell/basics-of-haskell/10_Error_Handling
There are two things to consider in my opinion:
- … `Nullable`, as I have never done such a comparison?
- `Nullable` forces the user to think about an error (we know that the function returns `Nullable` and we have to handle it); an exception can be silently ignored.
IMO exception objects should stay out of control flow. If something goes wrong and emits an error, the error should be thrown. If you really need to deal with an exception object, `catch` it. The container-based null (value is or is not present) says to the user, "This _worked_ fine; the code executed normally. Some valid logic in the code has told me that there should not be a value here."
What you're describing, @tkelman, sounds like result types. IMO they have their place, but generally speaking, if you have `Result{T, Exception}` as the result of a function call, you lose the stack trace information associated with the error. Only knowing the type of the error object is not really all that useful, since you as a caller can't know what in the function caused it, or even if it was intentional.
Of course, careful judgement is needed for deciding between throwing an exception and using a result-or-error-code, but `tryparse` is a prime example where you might want a reason for a failed parse, but the whole point of the function is to not throw an exception in that case.
Of course, I completely forgot to mention the main interest of `Nullable` and `Union{Some{T}, Null}`, which is that they allow distinguishing a wrapped null from a null. This feature is required in particular for `tryget` on dictionaries to distinguish "no value associated with key" from "null value associated with key" (https://github.com/JuliaLang/julia/issues/13055 and https://github.com/JuliaLang/julia/pull/18211).
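The dictionary case can be sketched as follows. `tryget` is the name used in the linked issues, but the definition below is only a hypothetical illustration written against Julia 1.x, with `Nothing` in the role of `Null`.

```julia
# Hypothetical `tryget`, not part of Base: Some(value) if the key is present,
# a bare `nothing` if it is absent.
function tryget(d::AbstractDict, key)
    return haskey(d, key) ? Some(d[key]) : nothing
end

d = Dict("a" => 1, "b" => nothing)

# A plain `get` with a `nothing` default cannot tell these two cases apart...
@assert get(d, "b", nothing) === get(d, "c", nothing)

# ...while the wrapped result can.
@assert tryget(d, "b") === Some(nothing)   # null value associated with key
@assert tryget(d, "c") === nothing         # no value associated with key
@assert something(tryget(d, "a")) == 1
```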
Result types would also be a nice extension of the `Union{Some{T}, Null}` approach, just replacing `Null` with one or more `Exception` types. That's been discussed at https://github.com/JuliaLang/julia/issues/14972, with networking functions as a use case.
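As a rough sketch of what such a result type could look like, the function below returns either a wrapped value or an exception object instead of throwing. `safe_sqrt` is a hypothetical name, and the code assumes Julia 1.x; it is not taken from the linked issue.

```julia
# Hypothetical example: return the error as a value instead of throwing it.
function safe_sqrt(x::Float64)::Union{Some{Float64}, DomainError}
    x < 0 && return DomainError(x, "square root of a negative number")
    return Some(sqrt(x))
end

r = safe_sqrt(4.0)
e = safe_sqrt(-1.0)
@assert r isa Some && something(r) == 2.0
@assert e isa DomainError     # the caller can inspect the reason, or `throw(e)`
```

As noted above, an exception returned this way carries no stack trace, which is one of the trade-offs of the approach.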
Forgot to mention I've updated the description to cover these points.
My preference would be to just leave `Nullable` alone, or at most change its internal representation to
```julia
struct Nullable{T}
    value::Union{T,Null}
end
```
I think the data-missing-values story should just stick with the names that are already used in DataFrames/DataArrays today, i.e. `NA` and `NAtype` (so `Union{T, NAtype}` for the `Union` story), or come up with some new names that are not `null` related.
That strategy would avoid a lot of deprecation work and breaking of old code: essentially all the code depending on the current `Nullable` would continue to work, and it would probably also mean a lot less breaking of code that uses the current `DataFrame` story.
I don't think there are particularly strong arguments pro/con various naming schemes per se. For example, I can see lots of arguments pro/con `Nullable` as is in Base right now, and lots of pro/con re using `null` vs `NA` for the missing data story. I think at the end of the day it is a wash and comes down to personal preference. In such a situation other criteria, like not breaking a lot of things and a general principle that the needs of one part of the package scene (data) shouldn't trigger renames in Base, seem more important to me.
A separate point: if we do want to use different concepts for the software engineering and data science case of missing values (which I strongly support) and want to use a `Union` for both, we probably shouldn't use the same type in the union to represent missing values. In particular, option 2) above seems problematic: I assume we would add definitions like `+(a::Null, b::Null)` for the data science case, but now `+` would also work for the software engineering missingness case, whereas the whole point of having two distinct stories was to prevent that.
> assume we would add definitions like `+(a::Null, b::Null)` for the data science case, but now `+` would also work for the software engineering missingness case, whereas the whole point of having two distinct stories was to prevent that.
`+` would not work in general since e.g. `Some(1) + 1` would fail. To avoid any confusion, we could use `Void`/`nothing` instead of `Null`/`null`, but then should `isnull(nothing)` be true? I think I'd prefer having only one concept of null value.
Arguments about naming are often quite close to bikeshedding, but in this case there are a few reasons to change the names:
- Only R uses `NA`, while `null` is used by all recent languages which have that concept (though there is also `nil` e.g. in Go) and by SQL (`NULL`).
- (… `nothing`), and types use camel case (in particular `Void`). `NA` and `NAtype` really feel out of place given these conventions.
Also a reason to replace `Nullable` with `Union{Some{T}, Null}` is that it can be extended to represent result types via e.g. `Union{Some{T}, Exception}`. The current `Nullable` is not as flexible.
> `+` would not work in general since e.g. `Some(1) + 1` would fail.
`Some(1) + null` would throw an error, but `null + null` would not, which in my mind would really not be the story we would want for the software dev missingness story, assuming there is a decision to separate those two types of missingness.
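For comparison, the snippet below shows how the two behaviours look when each concept has its own singleton. It assumes Julia 1.x, where the propagating value is `missing` and the non-propagating one is `nothing` (the names mentioned at the end of this thread), and it is meant as an illustration of the separation argued for here rather than a record of what this discussion decided.

```julia
# Helper: does calling `f` raise a MethodError?
function throws_method_error(f)
    try
        f()
        return false
    catch err
        return err isa MethodError
    end
end

# Data analyst's null: arithmetic propagates silently.
@assert ismissing(missing + 1)
@assert ismissing(missing + missing)

# Software engineer's null: nothing is defined, so mistakes surface at once.
@assert throws_method_error(() -> nothing + nothing)
@assert throws_method_error(() -> Some(1) + 1)
```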
> To avoid any confusion, we could use `Void`/`nothing` instead, but then should `isnull(nothing)` be true? I think I'd prefer having only one concept of null value.
I would keep the complete terminology separate for these two cases if the decision is to have two separate concepts of missingness. For example use `isna` and `isnull` for the two cases, or some other pair. Using the same function names/terminology for both use cases while at the same time trying to make them distinct seems very confusing to me and bound to generate problems down the road.
> only R uses `NA`
Pandas also uses `NA`. In my mind R and Python are the two most important ecosystems for data science right now, so being in line with those two would be my first priority.
> Also a reason to replace `Nullable` with `Union{Some{T}, Null}` is that it can be extended to represent result types via e.g. `Union{Some{T}, Exception}`. The current `Nullable` is not as flexible.
Isn't that the kind of result type that @ararslan described above? I agree with him; that seems a different thing than what `Nullable` is, so I would just use a different type for that.
> `NA` and `NAtype` really feel out of place given these conventions.
That could also be changed to, say, `na` and `NaType` or something like that. I would still prefer to keep the backwards compat for `DataFrame` instead, but with that choice one would at least only break code that uses DataFrames, and not code that uses `Nullable`.
> Pandas also uses `NA`. In my mind R and Python are the two most important ecosystems for data science right now, so being in line with those two would be my first priority.
Pandas calls missing values `NA` in their docs, but that's actually `NaN` even in doc examples, so that's really not a reference for consistency. `na` would be the worst of both worlds: not backward compatible, not used by any other language.
> Pandas calls missing values `NA` in their docs, but that's actually `NaN` even in doc examples, so that's really not a reference for consistency.
They use `NA` in the docs, and then have a bunch of APIs that use that terminology (e.g. `fillna` and `dropna`). My understanding of their plans for pandas2 is to get rid of the `NaN`s and move everything over to just `NA`. Having said that, they also have `isnull`, so yes, not super consistent currently.
My preference is
I think that `Null` and `Optionals.None` will be sufficiently different (e.g. for broadcasting) that they should be distinct singleton types.
`Nullable` has, up to now, been used for both the data scientist's na and the software engineer's null. But it does a relatively poor job as the former, and a good job as the latter. So anything we change it to should not break it in the places where it's doing a good job. In other words, we should not assume people want magic that they haven't had in the past; if they want it, we should make them use it explicitly. (The software engineer's null works through lack of magic; the data engineer's na works through magic.)
So, how about the following:
- `Nullable{T}` --> `Union{Some{T},Void}` (software engineers' null; possibly deprecated to `Maybe{T}`)
- `null = nothing` and deprecate `nothing`... shorter is better
- `T?` --> `Union{T,na}`
- `typeof(na)==NAType`
I don't think any changes should affect `nothing`; that's a "system" value used for things like empty function returns and signifying a `Void` return type from `ccall`s. We wouldn't want people to get confused into using it for something data science or software related (which would also lead to things like regex `match` using `Nullable` instead).
> `T?` --> `Union{T, na}`
That doesn't really make sense because `na` is a value, not a type.
I'd be fine w/
- `Nullable{T}` => `Union{T, Void}` (since the `Void` should never really be a part of the usage of `Nullable` anyway); this will be the go-to for "software"-related usages of nullables, since almost nothing is defined on `nothing`, meaning any accidental usage of `nothing` will stop pretty quickly
- `T?` => `Union{T, Null}`: this will be the go-to for data scientists, with `const null = Null()` acting as a "sentinel value" for any type.
I don't think we should give another meaning to `Void`.
Whoops. I meant `T? --> Union{T,NAType}`, not `T? --> Union{T,na}`. Of course, the question of whether to call it `na` or `null` (and its type, `NAType` or `Null`) is just bikeshedding (except for the possible confusion over the fact that a `Nullable` can't have the value `null`), so I endorse my _tocayo_ Quinnj's suggestion above in all substantive regards.
:-1: to `Union{T, Void}`. The software engineer's null needs to distinguish a some value of `nothing` from a null value. This is not just a theoretical concern; it's very important for things like collections methods which return null if things are missing and somes otherwise, because collections can themselves contain `nothing`.
Lint, in particular, heavily relies on `Nullable{Any}` where the element can be `nothing`.
Sure, that's fine @TotalVerb. @ararslan, it's not really another meaning to `Void`; it shouldn't even be that visible/used by end users. It's just a "not Some" value that would indicate a missing value.
"Not `Some`" is the other meaning I mean. `Void` should be the thing that `for` loops and other such constructs return, and that's it.
OK, if the CS null is not going to be `Union{T,Void}`, then I think that it should be
```julia
struct Nullable{T}
    value::Union{T,Null}
end
```
as proposed by DavidAnthoff.
That would leave 4 different kinds of nullish values: `na::NAType`, `null::Null`, `nothing::Void`, and `Nullable{T}(null)`. Seems like too many to me. We could reduce that to 3 if we had
```julia
struct Nullable{T}
    value::Union{T,Void}
end
```
... then there would be just `na::NAType`, `nothing::Void`, and `Nullable{T}(nothing)`. The one downside of that simplification would be that `Nullable{Void}` would have just one possible value, though `Nullable{Nullable{Void}}` would still have two.
What you're calling `na` is what we've been calling `null`: the data analyst's null. So there aren't four, there are only two, `null` and `Nullable{T}()`. `nothing` is irrelevant to missing data.
I'll make a PR retaining my favorite solution to illustrate what this implies.
Can someone clarify if a shorthand like `T?` or `Nullable{T}` made it into this or a related PR?
It seems two would be needed, one for `Union{T, Nothing}` and another for `Union{Some{T}, Nothing}`.
Without a shorthand this union approach is less intuitive, harder to read, and only mildly less verbose on net than the old `Nullable`. I'm not a fan of `Nullable`, and I'm on board with the general spirit of this change, but I'm hoping a developer will rarely if ever see `Union{Some{T}, Nothing}`. Nullable parameters are too common for such awkward verbosity. I think nullable types in C# are a good example of how to handle this elegantly.
No, `T?` isn't supported yet, in part because we need to deprecate that syntax first in 0.7 (since it could mean other things in 0.6). I guess `Union{Some{T}, Nothing}` would then become `Some{T}?`. Anyway in most cases we recommend `Union{T, Nothing}`.
> in part because we need to deprecate that syntax first in 0.7
It's actually been deprecated quite a while; I did it during JuliaCon last year. :slightly_smiling_face:
It's also unclear still whether `T?` will mean `Union{T, Nothing}` or `Union{T, Missing}`.
I think elsewhere we concluded that the right meaning for the `?`-operators will be `Nothing`-related, based on expected usages.
All this stuff went into v1. Great! Rather good work.
One minor issue remains: the verbosity we have to deal with when we type and retype `Union{Some{T}, Nothing}` all over the place.
So we have
- `Some{T}`
- `something(x, y...)`
And I propose to alleviate the verbosity by adding
`const Sometime{T} = Union{Some{T}, Nothing} where T`
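A short sketch of how such an alias could be used. `Sometime` is the name proposed in this comment and is not defined in Base; `first_even` is a hypothetical function for illustration. The alias is written here without the trailing `where T`, since the parametric `const` form already introduces `T`.

```julia
# Proposed alias (not part of Base).
const Sometime{T} = Union{Some{T}, Nothing}

# Hypothetical function using the alias as its return type.
function first_even(xs::Vector{Int})::Sometime{Int}
    for x in xs
        iseven(x) && return Some(x)
    end
    return nothing
end

@assert Sometime{Int} == Union{Some{Int}, Nothing}
@assert something(first_even([1, 4, 5])) == 4
@assert something(first_even([1, 3]), 0) == 0   # fall back to a default
```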
PS The issue is closed, so I don't know if this is the best place to discuss that...
PPS Here is the source for v1.2: https://github.com/JuliaLang/julia/blob/v1.2.0/base/some.jl
I had proposed `Option{T}` originally, but since it's not strictly required for 1.0 it wasn't added.
My proposal was made especially so as not to change any semantics (consensus was reached with difficulty), only to reduce retyping. The whole is homogeneous, easily memorable, and already in use today.
I must confess I could not tell whether `Option{T}` had similar goals, or whether its rejection was motivated on the same basis.
Anyway, there are some choices today.