Julia: Non-propogation, `skipmissing`-related improvements to Missing handling.

Created on 9 Mar 2020  路  10Comments  路  Source: JuliaLang/julia

There is broad consensus that missing handling could be improved. Many discussions focus on making propagation of missings easier, and those discussions are worth having, but I also want to focus on how skipmissing handling could be improved. Here are my suggestions. There is a lot of overlap with #30596 here but this discussion should focus more on building ideas and a roadmap rather than a specific implementation.

  1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.
  2. Make way more functions in Base like cor, accept any iterator and not vectors so we don't need to collect skipmissings.
  3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings above in (1), this returns an iterator of tuples. both are useful.
  4. Change broadcasting so that we can go from a skipmissing back to a vector with missings in the same locations. It would be nice for skipmissing to have some kind of persistence so that you don't lose the location of missings when you collect. This allows you to, say, de-mean (or a more complicated function) elements of a vector with respect to non-missing entries.
  5. Use dispatch for DataFrames (or Tables or NamedTuples etc.) to simulate Stata's if syntax, where the new dataframe is a view into the non-missing elements uses 4. (above) to fill in missings where needed. R doesn't have this feature so I don't think its obvious to everyone that this is an option. Stata is really great with this, you can do
egen x = y - mean(z) if !missing(v)

And it will apply a filter on everything at the start of the function.

These are concrete changes that can be made without using relying on propagation of missings. They would lead to a workflow where one is able to take a vector, filter it to remove missings in whatever way you like, do things with the vector (the hard part), and then keep the missings in the correct locations.

missing data

Most helpful comment

1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.

3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings above in (1), this returns an iterator of tuples. both are useful.

Just FYI, I think we can and should just make (t for t in zip(xs, ys, zs, ...) if any(ismissing, t)) and (... if all(ismissing, t)) fast to support these idioms. It's already kind of true as of #33526 (e.g., sum(x for x in xs if x !== missing) is even faster than sum(skipmissing(xs)) though this is partially because the former doesn't use pairwise summation). #33526 doesn't work with reduce yet so it doesn't cover the whole story. But it is straightforward to support reduce at least when there is no dims. (I want to have a go at it at some point but there is another reduce related improvement #31020 waiting for a review and I don't want to create a patch that would introduce a large conflict.)

I think just improving vanilla iterator transformations (filter etc.) is better as it would not only make missings faster but also make small Union of user-defined types faster.

All 10 comments

With regards to point 4. map on skipmissing should return an object of the original type, but with missings where they are supposed to be. i.e.

x = [1, 2, missing, 4]
y = map(x) do xx
    xx - 1
end 
# [0, 1, missing, 3]

No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:

https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1

The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions

But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propogate, and if you want to propogate missings, you should do things like what @KristofferC suggested:

https://github.com/JuliaLang/julia/pull/26631#issuecomment-377349060

@propagatemissing f
where that expands to
g = x -> x === missing ? x : f(x)

So if the missing debate is as settled as @StefanKarpinski is saying, we should fix the docs to signal in the same way.

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.

Would it be worth adding a type parameter to force specialization on the type of f?

For documentation I would just use the higher-order function

propagate_missing(f) = x -> x === missing ? x : f(x)

instead of a macro.

Would it be worth adding a type parameter to force specialization on the type of f?

E.g.

propagate_missing(f::F) where {F} = x -> x === missing ? x : f(x)

These proposals do not sound very different from the current passmissing function already defined in Missings.jl.

Alright, so should the docs just say functions don't propagate missing and point to using passmissing?

Alright, so should the docs just say functions don't propagate missing and point to using passmissing?

Yes, but there are still a lot of improvements to skipmissing-type workflows that should be considered as well.

1. Make skipmissing work for multiple iterators, returning a Tuple of iterators. I am working on a PR in Missings.jl to make this work. This will make it easier to work with vectors with mismatched values. This is especially useful for plotting.

3. Overload Zip so that we can zip together two vectors with missing elements and iterate over non-missing pairs. Unlike skipmissings above in (1), this returns an iterator of tuples. both are useful.

Just FYI, I think we can and should just make (t for t in zip(xs, ys, zs, ...) if any(ismissing, t)) and (... if all(ismissing, t)) fast to support these idioms. It's already kind of true as of #33526 (e.g., sum(x for x in xs if x !== missing) is even faster than sum(skipmissing(xs)) though this is partially because the former doesn't use pairwise summation). #33526 doesn't work with reduce yet so it doesn't cover the whole story. But it is straightforward to support reduce at least when there is no dims. (I want to have a go at it at some point but there is another reduce related improvement #31020 waiting for a review and I don't want to create a patch that would introduce a large conflict.)

I think just improving vanilla iterator transformations (filter etc.) is better as it would not only make missings faster but also make small Union of user-defined types faster.

No matter what the solution is, I think that a short term fix to the documentation is in order, because right now it's:

https://docs.julialang.org/en/v1/manual/missing/#Propagation-of-Missing-Values-1

The behavior of missing values follows one basic rule: missing values propagate automatically when passed to standard operators and functions, in particular mathematical functions

But there are a lot of standard functions where it's not propagated, and that's purposefully done because of a decision that's blocking PRs. That's fine, but we shouldn't tell users that they will propagate if there is no intent on adding such propagation. Instead, this section should probably be replaced with one that says that missing will not necessarily propogate, and if you want to propogate missings, you should do things like what @KristofferC suggested:

This should probably have been "when passed to standard mathematical operators and functions". See https://github.com/JuliaLang/julia/pull/35264. (Do note that there's a paragraph not visible in the diff which explicitly says that most functions do not propagate.)

Was this page helpful?
0 / 5 - 0 ratings