Julia: Improve Array{Union{T, Missing}} constructors

Created on 6 Dec 2017  路  8Comments  路  Source: JuliaLang/julia

We need a way to construct Array{Union{T, Missing}} objects filled with missing easily. Here are three proposals which can be seen either as complementary or as alternatives.

1) The Missings package provides the missings(::Type, dims...) function for that, which is really convenient since it avoids the need to write Union and Missing. I could make sense to add it to Base.

2) More generally, if would be nice if Array{Union{T, Missing}}(uninitialized, dims...) implemented a simpler behavior. Currently, it returns an array filled with missing for isbits element types, but an array full of #undef for other types:

julia> Array{Union{Int, Missing}}(uninitialized, 2)
2-element Array{Union{Missing, Int64},1}:
 missing
 missing

julia> Array{Union{String, Missing}}(uninitialized, 2)
2-element Array{Union{Missing, String},1}:
 #undef
 #undef

There are technical reasons for the first case of course. But it would sound reasonable to apply the same rule for all types, if only to make it easier to explain and remember. For example, in the section of the manual about missing values, it would be annoying to have to present an example for isbits types, and an example for other types.

A possible rule would be that uninitialized arrays are filled with the first singleton type, "first" being defined not by the user-specified order but by the internal sorting of types. In terms of performance, I'm not sure whether it would have an impact. Maybe the current behavior allows using zeroed memory pages provided by calloc, which can be more efficient than filling them with a custom value?

  1. As an alternative to Array{Union{T, Missing}}(uninitialized, dims...), it seems appealing to be able to write Array{Union{String, Missing}}(missing, dims...). If that was supported, it would be less of an issue that the rules determining the contents of the array created using Array{Union{T, Missing}}(uninitialized, dims...) are complex: it would only be used when you really want an uninitialized array (as the syntax indeed indicates).

PS: before somebody proposes it, using fill is not really an option since fill(missing, 2) gives an Array{Missing}, and the only alternative is very long: fill!(Array{Union{T, Missing}}(uninitialized, dims...), missing).

arrays missing data

Most helpful comment

Modulo a few details, option three seems mostly in keeping with the direction in #24595? :)

All 8 comments

I'm surprised to see that Array{Union{Int, Missing}}(uninitialized, 2) is filled with missing, is that documented or guaranteed behavior in general (e.g. also with Union{Int,Void})? What about Array{Union{Int, Missing, Void}}(uninitialized, 2) ? It seems missing has the priority over nothing, is that reliable? I thought the whole point of unitialized is to not make guarantees about the content of the container (just what happens to be the most efficient), so it seems strange to have a consistent filling in some case.

AFAIK that behavior is needed because arrays of isbits Unions cannot be left completely uninitialized: at least you need to choose a type for each entry, i.e. you need to set the type tag to a valid type. So you may as well choose a type reliably, by setting the type tag to zero everywhere. I think the order corresponds to what is shown when printing a type, but we may not want to guarantee it.

The problem is always the same: unless you return random contents, people will eventually depend on the implemented behavior even if its not part of the public API. But returning random contents is not always the most efficient solution.

Ok thanks for the clarification! I'm fine then with having consistent filling for isbits Unions with uninitialzied, but it seems much clearer to have another official way to consistenly fill with missing, even for non-isbits, so I support your options 1) or 3) over 2) by far.

Modulo a few details, option three seems mostly in keeping with the direction in #24595? :)

Agree that option 3 seems like the way to go.

I like option 3 because it shares the API of what I feel to be a more general solution here (https://github.com/JuliaLang/julia/issues/24595#issuecomment-348146353), which works a bit like:

out = Array{T,N}(input, dims)

# is equivalent to

out = Array{T,N}(unintialized, dims)
out .= input

Replace input with missing, and viol脿.

Fixed by #25054

Was this page helpful?
0 / 5 - 0 ratings

Related issues

omus picture omus  路  3Comments

sbromberger picture sbromberger  路  3Comments

dpsanders picture dpsanders  路  3Comments

wilburtownsend picture wilburtownsend  路  3Comments

felixrehren picture felixrehren  路  3Comments