There is broad consensus that `missing` handling could be improved. Many discussions focus on making propagation of `missing`s easier, and those discussions are worth having, but I also want to focus on how `skipmissing` handling could be improved. Here are my suggestions. There is a lot of overlap with #30596 here, but this discussion should focus more on building ideas and a roadmap rather than a specific implementation.
1. Make `skipmissing` work for multiple iterators, returning a `Tuple` of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.
2. Make functions such as `cor` accept any iterator, not just vectors, so we don't need to collect `skipmissing`s.
3. Overload `Zip` so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike `skipmissings` above in (1), this returns an iterator of tuples. Both are useful.
4. Provide a way to collect a `skipmissing` back to a vector with missings in the same locations. It would be nice for `skipmissing` to have some kind of persistence so that you don't lose the location of `missing`s when you collect. This allows you to, say, de-mean (or apply a more complicated function to) elements of a vector with respect to non-missing entries.
5. Support something like `egen x = y - mean(z) if !missing(v)`, which applies a filter on everything at the start of the function.
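The idea of skipping missing values jointly across several vectors can be sketched as follows. This is only an illustration under stated assumptions: `skipmissings_sketch` is a hypothetical name, not the Missings.jl API, and it returns filtered vectors rather than lazy iterators to keep the example short.

```julia
# Hypothetical helper (not the Missings.jl API): keep only the positions at
# which every input vector has a non-missing value.
# `eachindex(xs...)` throws if the vectors do not share the same indices.
function skipmissings_sketch(xs::AbstractVector...)
    keep = [i for i in eachindex(xs...) if !any(x -> ismissing(x[i]), xs)]
    return map(x -> x[keep], xs)   # a Tuple with one filtered vector per input
end

a = [1, missing, 3, 4]
b = [10, 20, missing, 40]

a2, b2 = skipmissings_sketch(a, b)
# a2 holds 1 and 4; b2 holds 10 and 40
```

Because both outputs are filtered with the same index set, they stay aligned, which is the property that matters for plotting or for functions like `cor`.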
These are concrete changes that can be made without relying on propagation of `missing`s. They would lead to a workflow where one is able to take a vector, filter it to remove missings in whatever way you like, do things with the vector (the hard part), and then keep the `missing`s in the correct locations.
With regards to point 4: `map` on `skipmissing` should return an object of the original type, but with `missing`s where they are supposed to be, i.e.
x = [1, 2, missing, 4]
y = map(skipmissing(x)) do xx
    xx - 1
end
# proposed result: [0, 1, missing, 3]
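The "persistence" idea, where a function runs on the non-missing values and the results are put back at their original positions, could look roughly like this. `apply_keepmissing` is a hypothetical helper written for illustration; it assumes a 1-based vector and a function `f` that returns one output per non-missing input.

```julia
using Statistics

# Hypothetical helper: apply `f` to the non-missing values of `x`, then write
# the results back at their original positions, leaving `missing` in place.
# Assumes `f` returns a vector with one entry per non-missing input value.
function apply_keepmissing(f, x::AbstractVector)
    idx = findall(!ismissing, x)              # positions of non-missing entries
    vals = f(collect(skipmissing(x)))         # work on a plain vector
    out = Vector{Union{Missing, eltype(vals)}}(missing, length(x))
    out[idx] .= vals                          # scatter results back
    return out
end

z = [1, 2, missing, 6]
demeaned = apply_keepmissing(v -> v .- mean(v), z)
# [-2.0, -1.0, missing, 3.0]
```

The de-meaning here uses the mean of the non-missing entries only (3.0), which is the use case described in point 4.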
No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:
https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1
The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions
But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that `missing` will not necessarily propagate, and if you want to propagate missings, you should do things like what @KristofferC suggested:
https://github.com/JuliaLang/julia/pull/26631#issuecomment-377349060
@propagatemissing f
where that expands to
g = x -> x === missing ? x : f(x)
So if the `missing` debate is as settled as @StefanKarpinski is saying, we should fix the docs to signal in the same way.
For documentation I would just use the higher-order function
propagate_missing(f) = x -> x === missing ? x : f(x)
instead of a macro.
Would it be worth adding a type parameter to force specialization on the type of `f`?
E.g.
propagate_missing(f::F) where {F} = x -> x === missing ? x : f(x)
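As a quick illustration of how such a wrapper behaves, here is a small sketch using the definition above. `safe_uppercase` is a name chosen only for this example; `uppercase` is used because it has no method for `missing` on its own.

```julia
# Wrap `f` so that a `missing` argument is returned unchanged.
# The `where {F}` parameter asks the compiler to specialize on the type of `f`.
propagate_missing(f::F) where {F} = x -> x === missing ? x : f(x)

# `uppercase(missing)` would throw a MethodError; the wrapped version does not.
safe_uppercase = propagate_missing(uppercase)   # hypothetical name

safe_uppercase("abc")                           # "ABC"
safe_uppercase(missing)                         # missing
upper = map(safe_uppercase, ["a", missing, "c"])
# ["A", missing, "C"]
```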
These proposals do not sound very different from the current `passmissing` function already defined in Missings.jl.
Alright, so should the docs just say functions don't propagate missing and point to using passmissing?
Yes, but there are still a lot of improvements to `skipmissing`-type workflows that should be considered as well.
1. Make skipmissing work for multiple iterators, returning a `Tuple` of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.
3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike `skipmissings` above in (1), this returns an iterator of tuples. Both are useful.
Just FYI, I think we can and should just make `(t for t in zip(xs, ys, zs, ...) if !any(ismissing, t))` and `(... if !all(ismissing, t))`
fast to support these idioms. It's already kind of true as of #33526 (e.g., `sum(x for x in xs if x !== missing)` is even faster than `sum(skipmissing(xs))`, though this is partially because the former doesn't use pairwise summation). #33526 doesn't work with `reduce` yet, so it doesn't cover the whole story. But it is straightforward to support `reduce`, at least when there is no `dims`. (I want to have a go at it at some point, but there is another `reduce`-related improvement, #31020, waiting for a review and I don't want to create a patch that would introduce a large conflict.)
I think just improving vanilla iterator transformations (`filter` etc.) is better, as it would not only make `missing`s faster but also make small `Union`s of user-defined types faster.
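For reference, the generator idioms discussed in this comment can be written with plain Base Julia today. The sketch below assumes the goal is to keep only tuples with no missing element, so the predicate is negated; the variable names are illustrative, and nothing here says anything about performance.

```julia
xs = [1, 2, missing, 4]
ys = [10, missing, 30, 40]

# Keep only the tuples in which no element is missing.
complete_pairs = collect(t for t in zip(xs, ys) if !any(ismissing, t))
# [(1, 10), (4, 40)]

# Drop only the tuples in which every element is missing.
us = [1, missing, 3]
vs = [missing, missing, 30]
partial_pairs = collect(t for t in zip(us, vs) if !all(ismissing, t))
# the (missing, missing) tuple is dropped; (1, missing) and (3, 30) remain

# Two equivalent ways to sum the non-missing values of a single vector.
s1 = sum(x for x in xs if x !== missing)   # 7
s2 = sum(skipmissing(xs))                  # 7
```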
The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions
This should probably have been "when passed to standard mathematical operators and functions". See https://github.com/JuliaLang/julia/pull/35264. (Do note that there's a paragraph not visible in the diff which explicitly says that most functions do not propagate.)