There is broad consensus that missing handling could be improved. Many discussions focus on making propagation of missings easier, and those discussions are worth having, but I also want to focus on how skipmissing handling could be improved. Here are my suggestions. There is a lot of overlap with #30596 here but this discussion should focus more on building ideas and a roadmap rather than a specific implementation.
Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting. cor, accept any iterator and not vectors so we don't need to collect skipmissings. skipmissings above in (1), this returns an iterator of tuples. both are useful. skipmissing back to a vector with missings in the same locations. It would be nice for skipmissing to have some kind of persistence so that you don't lose the location of missings when you collect. This allows you to, say, de-mean (or a more complicated function) elements of a vector with respect to non-missing entries. egen x = y - mean(z) if !missing(v)
And it will apply a filter on everything at the start of the function.
These are concrete changes that can be made without using relying on propagation of missings. They would lead to a workflow where one is able to take a vector, filter it to remove missings in whatever way you like, do things with the vector (the hard part), and then keep the missings in the correct locations.
With regards to point 4. map on skipmissing should return an object of the original type, but with missings where they are supposed to be. i.e.
x = [1, 2, missing, 4]
y = map(x) do xx
xx - 1
end
# [0, 1, missing, 3]
No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:
https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1
The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions
But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propogate, and if you want to propogate missings, you should do things like what @KristofferC suggested:
https://github.com/JuliaLang/julia/pull/26631#issuecomment-377349060
@propagatemissing f
where that expands to
g = x -> x === missing ? x : f(x)
So if the missing debate is as settled as @StefanKarpinski is saying, we should fix the docs to signal in the same way.
For documentation I would just use the higher-order function
propagate_missing(f) = x -> x === missing ? x : f(x)
instead of a macro.
For documentation I would just use the higher-order function
propagate_missing(f) = x -> x === missing ? x : f(x)instead of a macro.
Would it be worth adding a type parameter to force specialization on the type of f?
For documentation I would just use the higher-order function
propagate_missing(f) = x -> x === missing ? x : f(x)instead of a macro.
Would it be worth adding a type parameter to force specialization on the type of
f?
E.g.
propagate_missing(f::F) where {F} = x -> x === missing ? x : f(x)
These proposals do not sound very different from the current passmissing function already defined in Missings.jl.
Alright, so should the docs just say functions don't propagate missing and point to using passmissing?
Alright, so should the docs just say functions don't propagate missing and point to using passmissing?
Yes, but there are still a lot of improvements to skipmissing-type workflows that should be considered as well.
1. Make skipmissing work for multiple iterators, returning a
Tupleof iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.
3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike
skipmissingsabove in (1), this returns an iterator of tuples. both are useful.
Just FYI, I think we can and should just make (t for t in zip(xs, ys, zs, ...) if any(ismissing, t)) and (... if all(ismissing, t)) fast to support these idioms. It's already kind of true as of #33526 (e.g., sum(x for x in xs if x !== missing) is even faster than sum(skipmissing(xs)) though this is partially because the former doesn't use pairwise summation). #33526 doesn't work with reduce yet so it doesn't cover the whole story. But it is straightforward to support reduce at least when there is no dims. (I want to have a go at it at some point but there is another reduce related improvement #31020 waiting for a review and I don't want to create a patch that would introduce a large conflict.)
I think just improving vanilla iterator transformations (filter etc.) is better as it would not only make missings faster but also make small Union of user-defined types faster.
No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:
https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1
The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions
But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that
missingwill not necessarily propogate, and if you want to propogate missings, you should do things like what @KristofferC suggested:
This should probably have been "when passed to standard mathematical operators and functions". See https://github.com/JuliaLang/julia/pull/35264. (Do note that there's a paragraph not visible in the diff which explicitly says that most functions do not propagate.)
Most helpful comment
Just FYI, I think we can and should just make
(t for t in zip(xs, ys, zs, ...) if any(ismissing, t))and(... if all(ismissing, t))fast to support these idioms. It's already kind of true as of #33526 (e.g.,sum(x for x in xs if x !== missing)is even faster thansum(skipmissing(xs))though this is partially because the former doesn't use pairwise summation). #33526 doesn't work withreduceyet so it doesn't cover the whole story. But it is straightforward to supportreduceat least when there is nodims. (I want to have a go at it at some point but there is anotherreducerelated improvement #31020 waiting for a review and I don't want to create a patch that would introduce a large conflict.)I think just improving vanilla iterator transformations (
filteretc.) is better as it would not only makemissings faster but also make smallUnionof user-defined types faster.