Turing.jl: Unifying assume and observe and handling missing data and user-input variable names

Created on 17 Nov 2019 · 21 Comments · Source: TuringLang/Turing.jl

In this issue, I will describe what I think may be a feasible re-design of the Turing internals with the following goals:

  1. Simplifying the compiler design.
  2. Better handling of missing data, including type-stable autodiff, a more intuitive syntax, and support for partially missing data. Currently, missing data initialization and the type-stable syntax don't work well together, and partially missing data is not supported, e.g. when x isa Vector{Union{T, Missing}}, so that parts of x are data and parts are parameters.
  3. Enable user-input random variable names. Currently, variable names are collected partly when parsing the model if the LHS of ~ is not in the arguments and partly at compile-time when some input data variables are missing, but the user cannot pass the names as arguments to the model.
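
To make the partially missing case in goal 2 concrete, here is a minimal sketch in plain Julia. It assumes nothing about Turing's internals: split_missing is a hypothetical helper, not an existing Turing function, and it only shows which entries of such a vector would be treated as data and which as parameters.

```julia
# Hypothetical helper (not part of Turing): split a partially missing vector
# into the indices that hold data and the indices that would be sampled.
function split_missing(x::AbstractVector)
    observed = findall(!ismissing, x)    # entries treated as data
    unobserved = findall(ismissing, x)   # entries treated as parameters
    return observed, unobserved
end

x = Union{Float64, Missing}[1.5, missing, 0.3, missing]
observed, unobserved = split_missing(x)
println("data at ", observed, ", parameters at ", unobserved)
```

The element type Union{Float64, Missing} is what lets a single vector carry both kinds of entries.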

One of the biggest pains in the model macro currently is the handling of missing inputs which forces us to:

  1. Check if a ~ statement will call observe or assume using an if statement that compiles away,
  2. Keep track of default initializations of data variables in case they were not input, and
  3. Encode the data variable symbols in the Model type and define the Julia variable inside the model function using:
local x
# Check if `x` was given by the user when constructing `model`.
if isdefined(model.data, :x)
    if model.data.x isa Type && (model.data.x <: AbstractFloat || model.data.x <: AbstractArray)
        x = Turing.Core.get_matching_type(sampler, vi, model.data.x)
    else
        x = model.data.x
    end
else
    x = model.defaults.x
end

In this issue, I propose a re-design of this part of Turing to be as follows:

  • We keep track of the missing data symbols only when constructing the model, not the other data symbols. The symbols of the missing variables can get encoded in the type of some Val instance called missing_vars, e.g. missing_vars = Val((:x, :y)).
  • Missing inputs are not initialized at the top, but instead using the following syntax inside the model body. This gives the user access to the type stable syntax when initializing missing parameters.
if ismissing(x)
    x = ...
end
  • All lhs ~ rhs statements lower to assume_or_observe(spl, rhs, @preprocess(missing_vars, lhs), vi).
  • @preprocess(missing_vars, lhs) will construct and return a VarName if: 1) the symbol of lhs is not in the input arguments, 2) the symbol is there but is in missing_vars too, or 3) the symbol is there and not in missing_vars but ismissing(lhs) is true. Otherwise, lhs is just returned as is. Note that in the last case, lhs could still be a VarName constructed and passed by the user as input argument. This enables the custom variable names that Chris wanted in DiffEqBayes.
  • If the third argument to assume_or_observe is a VarName, assume is called, otherwise observe is called.
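
The rules in the bullets above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: VarName here is a stand-in struct rather than Turing's type, preprocess is written as an ordinary function over a NamedTuple of model arguments instead of a macro, and assume_or_observe drops the spl and vi arguments and just reports which branch it took.

```julia
# Stand-in type for illustration only; Turing's real VarName is richer.
struct VarName
    sym::Symbol
end

# Left-hand side is a VarName: the variable is a parameter (assume).
assume_or_observe(rhs, vn::VarName) = (:assume, vn.sym)
# Left-hand side is any other value: the variable is data (observe).
assume_or_observe(rhs, value) = (:observe, value)

# Function version of the proposed @preprocess rules (hypothetical signature).
function preprocess(::Val{mv}, args::NamedTuple, sym::Symbol) where {mv}
    # 1) not a model argument, 2) recorded as missing at construction,
    # 3) an argument whose current value is missing
    if !haskey(args, sym) || sym in mv || ismissing(args[sym])
        return VarName(sym)
    end
    return args[sym]  # returned as is; may itself be a user-supplied VarName
end

missing_vars = Val((:y,))        # y was missing when the model was built
args = (x = 1.0, y = zeros(2))   # y has since been initialized in the body

obs_x = assume_or_observe(:Normal, preprocess(missing_vars, args, :x))
par_y = assume_or_observe(:Normal, preprocess(missing_vars, args, :y))
par_z = assume_or_observe(:Normal, preprocess(missing_vars, args, :z))
println((obs_x, par_y, par_z))
```

Because the choice between assume and observe is made by dispatch on the type of the left-hand side, no explicit if statement is needed at the call site.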

A similar thing can also be done to the 5-argument observe and assume for handling broadcasting as described in https://github.com/TuringLang/Turing.jl/issues/476#issuecomment-511111506.

If there are no objections here, I will experiment with this in #965.

All 21 comments

Sounds like a good idea to me.

Sounds good to me. Just a small question: why is @preprocess a macro here instead of a function?

> Why is @preprocess a macro here instead of a function?

Because constructing VarName needs us to first dissect the expression x[i] which could be on the LHS of ~. This is only possible with a macro. When constructing VarName here, I will just call the macro @varname from https://github.com/TuringLang/Turing.jl/pull/965/files#diff-8dd4f136cba6c9e12b40d181c7ddf5d8R15 and interpolate lhs after it.
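
As a rough illustration of why a macro is needed, the sketch below uses a hypothetical macro, describe_lhs, which is not Turing's @varname. It receives the expression x[i] itself and can pull out the symbol and the index values, whereas a function would only ever see the value of x[i].

```julia
# Hypothetical macro, loosely modelled on the idea behind @varname.
macro describe_lhs(ex)
    if ex isa Symbol
        # Plain variable: return its symbol and an empty index tuple.
        return :(($(QuoteNode(ex)), ()))
    elseif Meta.isexpr(ex, :ref)
        sym = ex.args[1]
        sym isa Symbol || error("nested indexing is not handled in this sketch")
        # Escape the index expressions so they are evaluated in the caller's scope.
        inds = map(esc, ex.args[2:end])
        return :(($(QuoteNode(sym)), ($(inds...),)))
    else
        error("unsupported left-hand side: $ex")
    end
end

i = 3
indexed = @describe_lhs(x[i])   # x need not be defined; only i is evaluated
plain = @describe_lhs(y)
println(indexed, " ", plain)
```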

With this change, the "default value" of an input argument will mean the intuitive thing, which is a default data value, rather than what it means now, which is a default initialization used when the variable is treated as a parameter.

With this proposal implemented, it may also be possible to support https://github.com/TuringLang/Turing.jl/issues/792 which is kind of the opposite of missing data support, i.e. making random variables observed. I think we can provide an additional constructor with all the random variables as kwargs and a default value missing.

> Because constructing VarName needs us to first dissect the expression x[i] which could be on the LHS of ~. This is only possible with a macro. When constructing VarName here, I will just call the macro @varname from https://github.com/TuringLang/Turing.jl/pull/965/files#diff-8dd4f136cba6c9e12b40d181c7ddf5d8R15 and interpolate lhs after it.

I see thanks.

> With this change, the "default value" of an input argument will mean the intuitive thing, which is a default data value, rather than what it means now, which is a default initialization used when the variable is treated as a parameter.

Can you write me a minimal example for each case (before and after)?

> With this proposal implemented, it may also be possible to support #792 which is kind of the opposite of missing data support, i.e. making random variables observed. I think we can provide an additional constructor with all the random variables as kwargs and a default value missing.

This sounds great.


So actually, maybe it would be helpful to write minimal examples (the @model definition) for all the cases we'd like to support. This would help with both testing and documentation. We also need to keep support for stochastic control flow in mind. Can you start by listing the cases mentioned in the issue description, and then we can complete the list together?

The current default initialization syntax is:

@model f(x, y = Vector{Float64}(undef, 4))
    ...
end

Let x be a scalar and y a vector. When y is missing, y is initialized to Vector{Float64}(undef, 4). Otherwise, this "default" value is ignored. This is not intuitive.

The proposed syntax is:

@model f(x, y)
    if ismissing(y)
        y = Vector{Float64}(undef, 4)
    end
    ...
end

This is only needed if y will be looped over as such:

for i in 1:length(y)
    y[i] ~ ...
end

If y ~ MvNormal(...) for example, then there is no need to initialize y. This is left for the user to handle correctly.

So if the user needs the loop version, they need to do the ismissing check manually; otherwise it's handled by the compiler. Is my understanding correct?

#792 is currently not supported, and its syntax will be just:

@model f(x)
    a ~ Normal()
    b ~ Gamma()
    x ~ Normal(a, b)
end

Calling f(x = 1, a = 1) can then be supported, meaning that the model is conditioned on a. So all we need here is an additional method with additional keyword arguments that default to missing. By default, @preprocess will do the right thing and construct the VarName, passing it to assume_or_observe. But if the user decides to pass a value for a that is not missing, then @preprocess will also do the intended thing and treat a as observed by not constructing a VarName and just returning lhs as is.
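
A minimal sketch of that idea in plain Julia follows. The function f below is only a stand-in for the additional constructor (here x is also taken as a keyword for simplicity), and the returned NamedTuple is a hypothetical way of reporting which variables would be observed and which sampled.

```julia
# Plain-Julia stand-in for the extra constructor; all names are hypothetical.
# Every random variable is a keyword argument that defaults to missing.
function f(; x, a = missing, b = missing)
    vars = (a = a, b = b, x = x)
    observed = [k for k in keys(vars) if !ismissing(vars[k])]  # conditioned on
    sampled = [k for k in keys(vars) if ismissing(vars[k])]    # still random
    return (observed = observed, sampled = sampled)
end

println(f(x = 1))          # a and b are sampled
println(f(x = 1, a = 1))   # a is now conditioned on, only b is sampled
```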

I think this could all work together nicely. But I will need to try it all out first and see what I come across.

> So if the user needs the loop version, they need to do the ismissing check manually; otherwise it's handled by the compiler. Is my understanding correct?

Correct.

With the new default value syntax:

@model f(x, y = zeros(4))
    ...
end
f(2)

will also do the intended thing, treating y as a data variable with the value zeros(4).

One has to do f(2, missing) to treat y as missing, or alternatively not pass a default value zeros(4) but do it inside the model using:

@model f(x, y)
    if ismissing(y)
        y = zeros(4)
    end
    ...
end

Does this make sense?

When defining the model like below

@model f(x, y)
    if ismissing(y)
        y = zeros(4)
    end
    ...
end

The intuitive behaviour is that f(x) should throw an error.

Maybe we should ask user to do

@model f(x, y=missing)
    if ismissing(y)
        y = zeros(4)
    end
    ...
end

i.e. explicitly saying that y might be missing in this case.

Hmm that might make defining the model more clunky but I see your point. I don't mind having it that way.

Yes, it's more clunky, but we really want to avoid any silent sampling when a user simply forgets to pass one of the data arguments by mistake. I guess making things explicit here is better.
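
The difference between the two definitions can be illustrated with ordinary Julia functions standing in for the models (strict and explicit are hypothetical names, and the returned symbols only indicate what the model would do with y):

```julia
# No default for y: forgetting it is an error rather than silent sampling.
strict(x, y) = ismissing(y) ? :sample_y : :observe_y
# Explicit default: the author has declared that y may be missing.
explicit(x, y = missing) = ismissing(y) ? :sample_y : :observe_y

forgot_y_errors = try
    strict(1.0)   # y was forgotten
    false
catch err
    err isa MethodError
end

println(forgot_y_errors)
println(explicit(1.0))
println(explicit(1.0, zeros(4)))
```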

Proof of concept already in #965.

> Enable user-input random variable names. Currently, variable names are collected partly when parsing the model if the LHS of ~ is not in the arguments and partly at compile-time when some input data variables are missing, but the user cannot pass the names as arguments to the model.

For this point, there might be a simple but also quite flexible solution by adding support for a suffix part of var name, e.g. provided by a special type NamedDistribution:

# simple case
x ~ Named(Normal(), "123") # varname = "x123"

# complex case involving loops
for i = 1:5
    x ~ Named(Normal(), "$i") # varname = "x1", ..., "x5"
end

# complex case involving stochastic control flows
z ~ Discrete([1/3, 1/3, 1/3])

if z == 1
    x ~ Named(Normal(), "1")
elseif z == 2
    x ~ Named(Normal(), "1") # here we use the same variable name as z == 1
else
    x ~ Named(Normal(), "2") # here we use a different variable name
end

# complex case involving nested models
@model foo()
    global counter = 0 # less ideal to use global, maybe there is a better solution
    x ~ Named(Normal(), "$counter")
end

@model bar()
    for i = 1:5
        y ~ Named(foo(), "$i")
    end
end

This design means that by default, we can assume all LHS variable names are global, that is, they are unique if defined in the same scope within a model (we can still generate a unique id for each LHS of ~ and use it as part of the var name). The user can then take advantage of the NamedDistribution to manually manage variable names.
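
A minimal sketch of how such a suffix could be combined with the LHS symbol is given below. Named and varname here are illustrative stand-ins rather than existing Turing types or functions, and the symbol :Normal is only a placeholder for a real distribution.

```julia
# Illustrative wrapper: a distribution plus a user-chosen name suffix.
struct Named{D}
    dist::D
    suffix::String
end

# With a Named right-hand side, the suffix is appended to the LHS symbol.
varname(lhs::Symbol, rhs::Named) = Symbol(lhs, rhs.suffix)
# Otherwise the LHS symbol is used unchanged.
varname(lhs::Symbol, rhs) = lhs

# Loop case: each iteration gets its own name.
loop_names = [varname(:x, Named(:Normal, "$i")) for i in 1:3]

# Stochastic control flow: two branches may deliberately share a name.
branch_1 = varname(:x, Named(:Normal, "1"))
branch_2 = varname(:x, Named(:Normal, "1"))

println(loop_names, " ", branch_1 == branch_2)
```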

I like that proposal. I will play with it in #965, but instead of just using an index as the second argument to Named, we can have a VarName. I think Chris didn't want to use x as the variable name just because it was on the LHS of ~.

Done!

Closed via #965.
