I was skimming through the AD implementations and am a bit confused by the ForwardDiff implementation, so could use some help understanding it.
If I was to implement a function to compute the gradient of the log joint of a model using ForwardDiff in the current version of Turing, the first thing I would think to do would be along the lines of the following:
"""
gradient_forward(
θ::AbstractVector{<:Real},
vi::VarInfo,
spl::Sampler,
model::Function,
)
Computes the gradient of the log joint of `θ` for the model specified by
`(vi, spl, model)`.
"""
function gradient_forward(
θ::AbstractVector{<:Real},
vi::VarInfo,
spl::Union{Nothing, Sampler},
model::Function,
)
# Record old parameters.
θ_old = vi[spl]
# Define function to compute log joint.
function f(θ)
vi[spl] = θ
return runmodel(model, vi, spl).logp
end
# Set chunk size and do ForwardMode.
config = GradientConfig(f, θ, Chunk{min(CHUNKSIZE, length(θ))}())
∂l∂θ = gradient!(similar(θ), f, θ, config)
# Replace old parameters to ensure this function doesn't mutate `vi`.
vi[spl] = θ_old
return ∂l∂θ
end
A minimal working example might be something like this:
using Turing, ForwardDiff, Distributions
using Turing: VarInfo, Sampler, runmodel, CHUNKSIZE
# Define a model.
@model ad_test() = begin
s ~ InverseGamma(2,3)
m ~ Normal(0,sqrt(s))
1.5 ~ Normal(m, sqrt(s))
2.0 ~ Normal(m, sqrt(s))
return s, m
end
model = ad_test()
vi = model(Turing.VarInfo(), nothing)
spl = nothing
# Get parameters + compute gradient.
θ = map(Float64, vi[spl])
∂l∂θ = gradient_forward(θ, vi, spl, model)
The gradient_forward function here is similar to the gradient_r function as currently implemented. So my first question is, other than the caching stuff at the top of the function, what is all of the rest of the code for? It looks to me like it's just emulating what ForwardDiff's GradientConfig would already do for us.
Hi @willtebbutt, the code you're referring to is a bit obsolete now. I think back in time we weren't super decided on how to handle dynamic dimentionality using ForwardDiff, therefore Kai implemented a version of gradient using the low-level APIs of ForwardDiff. This allows HMC to handle dynamic dimensionality, e.g. dimensionality of ForwardDiff.Dual could be manipulated during AD.
Later, we made a design decision that during an HMC step, the dimensionality of parameters assigned to HMC shouldn't change. However, the dimensionality of parameters assigned to HMC can still change for different HMC steps. For example, if a new parameter is created during an HMC iteration and is assigned to HMC, then before the current HMC iteration finishes, the newly created parameter is not updated. Therefore, the newly created parameter value will stay the same as its initialization (e.g. a draw from its prior). This design simplifies the HMC implementation and makes the code you are referring to obsolete.
Actually, I have also started cleaning up this bit of code over the weekend. But your code looks nicer than mine. So please feel free to create a PR!
Ps. the following line of code pre-allocates memory required by AD. This is slow and ideally, we should share this config across multiple gradient calls. However, we need to monitor dimensionality of theta. If it changes, we need to create a new configuration.
config = GradientConfig(f, θ, Chunk{min(CHUNKSIZE, length(θ))}())
My code is here: https://github.com/TuringLang/Turing.jl/tree/hg/fad-clenaup
Excellent, I'll make a PR this evening.
FWIW, some preliminary benchmarking I've done makes it look like the thing that takes the time isn't actually constructing GradientConfig, it's constructing the Chunk that is passed into it.
FWIW, some preliminary benchmarking I've done makes it look like the thing that takes the time isn't actually constructing GradientConfig, it's constructing the Chunk that is passed into it.
Yes you are right. It's the construction of partials in Chunk which is slow.
config = GradientConfig(f, θ, Chunk{min(CHUNKSIZE, length(θ))}())
Another reason why this is slow is the type instability, for which there are 2 reasons, the first is described in #489. The second is that even if the type of chunksize is known, that's not enough, its value must also be known.
I think this is a great use case for function barriers and value types. So one can add ::Type{Val{chunksize}}=Val{CHUNKSIZE} to the arguments to make chunksize known at compile time. I did this on my machine and resolved some of the type instability there. But I will wait for #499 to go in, before I touch this part of the code. I am assuming the number of different chunksizes to be used is not large. If it is, then a redesign is necessary to make this performant.
So one can add ::Type{Val{chunksize}}=Val{CHUNKSIZE} to the arguments to make chunksize known at compile time.
While I agree that this will remove a type instability, the issue of having to dynamically dispatch to the correct version of gradient_forward will remain, along with the corresponding overhead. I'm not convinced that it's even possible in principle to get rid of this while CHUNKSIZE is mutable. Do you have any thoughts on this?
I wonder if a constant CHUNKSIZE might prove to be faster then. If I understand correctly, the ratio of speeds between different CHUNKSIZEs is an asymptotic constant, so sacrificing type stability to tweak it may not be the "optimal" approach after all. Is there a benchmark somewhere that justifies changing CHUNKSIZE often?
Closed by #499