Pyro: Value sampled is not in distribution's support for baseball example

Created on 17 Oct 2018 · 17Comments · Source: pyro-ppl/pyro

The partially_pooled model in baseball.py seems to sporadically fail with the exception that the value argument is not within the support for the Pareto distribution. cc. @fehiepsi

Taking a look at this, but I am temporarily removing this from test_examples.py to unblock the build.

Traceback (most recent call last):
  File "/home/travis/build/uber/pyro/pyro/poutine/trace_struct.py", line 128, in log_prob_sum
    log_p = site["log_prob_sum"]
KeyError: 'log_prob_sum'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/travis/build/uber/pyro/examples/baseball.py", line 309, in <module>
    main(args)
  File "/home/travis/build/uber/pyro/examples/baseball.py", line 272, in main
    .run(partially_pooled, at_bats, hits)
  File "/home/travis/build/uber/pyro/pyro/infer/abstract_infer.py", line 84, in run
    for tr, logit in self._traces(*args, **kwargs):
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/mcmc.py", line 51, in _traces
    for trace in self._gen_samples(self.num_samples, trace):
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/mcmc.py", line 35, in _gen_samples
    trace = self.kernel.sample(trace)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/nuts.py", line 276, in sample
    direction, tree_depth, energy_current)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/nuts.py", line 159, in _build_tree
    direction, tree_depth-1, energy_current)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/nuts.py", line 178, in _build_tree
    direction, tree_depth-1, energy_current)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/nuts.py", line 178, in _build_tree
    direction, tree_depth-1, energy_current)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/nuts.py", line 155, in _build_tree
    return self._build_basetree(z, r, z_grads, log_slice, direction, energy_current)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/nuts.py", line 130, in _build_basetree
    z, r, self._potential_energy, self._inverse_mass_matrix, step_size, z_grads=z_grads)
  File "/home/travis/build/uber/pyro/pyro/ops/integrator.py", line 68, in single_step_velocity_verlet
    z_grads, potential_energy = _potential_grad(potential_fn, z_next)
  File "/home/travis/build/uber/pyro/pyro/ops/integrator.py", line 79, in _potential_grad
    potential_energy = potential_fn(z)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/hmc.py", line 170, in _potential_energy
    potential_energy = -self._compute_trace_log_prob(trace)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/hmc.py", line 154, in _compute_trace_log_prob
    return self._trace_prob_evaluator.log_prob(model_trace)
  File "/home/travis/build/uber/pyro/pyro/infer/mcmc/util.py", line 132, in log_prob
    return model_trace.log_prob_sum()
  File "/home/travis/build/uber/pyro/pyro/poutine/trace_struct.py", line 130, in log_prob_sum
    log_p = site["fn"].log_prob(site["value"], *site["args"], **site["kwargs"])
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/torch/distributions/transformed_distribution.py", line 86, in log_prob
    log_prob += _sum_rightmost(self.base_dist.log_prob(y),
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/torch/distributions/exponential.py", line 51, in log_prob
    self._validate_sample(value)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/torch/distributions/distribution.py", line 221, in _validate_sample
    raise ValueError('The value argument must be within the support')

bug

Source

neerajprad

All 17 comments

@neerajprad Which args did you use? I don't get any problem with partially_pooled so far but have faced some problems with partially_pooled_with_logit. These problems seem not related to recent pull requests though.

With seed=0, num-samples=200, warmup-steps=200, I observed that partially_pooled_with_logit model throws error at setup (specifically validate_trace). When scale is initialized with high value, alpha will take high positive or small negative values, which in turns make the probs parameter of Binomial very near 1 or 0. So log_prob will be -inf in many cases, which invalidates the init trace. Of course, probs ~ 1 or 0 is unrealistic and we should initialize scale with a more reasonable value. Maybe we reset the initial trace when it is invalid?

fehiepsi on 17 Oct 2018

Maybe we reset the initial trace when it is invalid?

Yeah, we should probably resample the initial trace (a few times) in that case, except that we need to be careful because the model itself might be ill-specified in which case throwing an error would make sense.

I am seeing this problem using pytorch==0.4 on python 3.6, using the default args. I will confirm shortly if this is still an issue with PyTorch master.

neerajprad on 17 Oct 2018

@neerajprad I did use pytorch=0.4 too (after facing errors of distributions in python 1.0rc). I didn't catch
what is the reason for your failure yet but I found a bug in NUTS when using mass_matrix: we should modify _is_turning method too. It is no longer a dot product.

fehiepsi on 17 Oct 2018

we should modify _is_turning method too

Oh good catch! We would need to multiply by the inverse of the mass matrix, I guess.

I am trying to debug why we are getting non deterministic output when we run HMC / NUTS with a fixed seed. I think that needs to be fixed first, even before we go on to debug why the model fails on certain initializations.

neerajprad on 17 Oct 2018

👍1

The reason for the non-determinism is torch.dot in self._kinetic_energy, which can give results that are slightly off on different runs. See snippet below - while expected and torch.dot might differ, it should differ consistently, which is not the case below. This is fixed on master, but for the time being I will revert to using the element wise product + sum operation to remove any non-determinism from our tests.

In [56]: for i in range(10):
    ...:     m = torch.load('m.pt')
    ...:     r = torch.load('r.pt')
    ...:     print("torch.dot", m.dot(r**2).item())
    ...:     print("expected", (m * (r**2)).sum().item())
    ...:
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.51201057434082
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.51201057434082
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082
torch.dot 24.512012481689453
expected 24.51201057434082

neerajprad on 18 Oct 2018

Is this the cause of our docs stage sporadically failing on CI (example)?

fritzo on 19 Oct 2018

Is this the cause of our docs stage sporadically failing on CI (example)?

The non determinism in HMC should be fixed now. This sounds like a different issue. cc. @fehiepsi ^^. Maybe we should just fix the rng seed in that doctest. I'll do that shortly.

neerajprad on 19 Oct 2018

I think it is rng seed problem too.

fehiepsi on 19 Oct 2018

I am keeping this issue open because while the non-determinism is fixed, the original failure of the model on certain initializations persists - i.e. getting invalid values that don't lie in the support of the Pareto distribution.

neerajprad on 19 Oct 2018

@neerajprad did you face this problem at the warmup phase? I tried different seed and num_samples, and got a different error when sample r_dist: "The parameter scale has invalid values". step_size is so small at that time (1e-34???)

fehiepsi on 19 Oct 2018

Let me try to isolate the seed that triggers this on my system - it is easy to see this with parallel chains.

"The parameter scale has invalid values". step_size is so small at that time

Any ideas why the step size is going so low - maybe we should have some reasonable limit on the lowest allowable step size, or at least throw a warning if it goes too low during the adaptation phase?

neerajprad on 19 Oct 2018

Ah, the reason is at inf for initial energy. :D This should be solved when we revise setup method to create new trace when initial trace is invalid.

I checked Stan, and they tried to generate valid initial trace with max 100 times.

fehiepsi on 19 Oct 2018

I checked Stan, and they tried to generate valid initial trace with max 100 times.

Good to know! Yeah, we should definitely fix this. Another thing that I was thinking was that in many of these cases (not the inf ones though), it might actually be faster to have a lower acceptance probability with a larger step size, rather than making the step size smaller and smaller to get to the 0.8 target acceptance probability. At the very least, we should offer this knob to the users.

neerajprad on 19 Oct 2018

Yes, I think we somehow should let users change hidden parameters. About small step_size, from what I learned from Stan reference and pymc3 notebook: smaller step_size the better. They even encourage increasing acceptance probability to a value ~1. When adapting, I think that with small step_size, we will get better mass_matrix. When mass_matrix is ready, step_size will automatically increase I guess (at least during the end_buffer).

To me, mass_matrix plays the role of setting different step_size for each parameter (with small variance params, we don't want to move far from a position; with large variance params, we don't want to stay around one position). So if we are able to learn it during warmup phase, the sampling will be good.

fehiepsi on 20 Oct 2018

👍1

When adapting, I think that with small step_size, we will get better mass_matrix. When mass_matrix is ready, step_size will automatically increase I guess (at least during the end_buffer).

Yeah, that's a good observation! I think I was only seeing these issues without mass matrix adaptation, but you are right that the mass matrix will likely help alleviate many of these issues by adapting the step size per parameter, so we will probably not have to run many steps before getting a u turn.

neerajprad on 20 Oct 2018

👍1

@neerajprad It seems that we can close this?

fehiepsi on 3 Dec 2018

Closing this, as I have not seen this in a while. Will reopen if this continues to be an issue.

neerajprad on 3 Dec 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Binarized MNIST in VAE tutorial

tristandeleu · 3Comments

Test failure of enum-parallel gradients after PyTorch #5776

neerajprad · 5Comments

Pyro minor release 0.3.1

neerajprad · 4Comments

which CDF to use for Bayesian Optimization?

tobyclh · 3Comments

Autoguide inference is incorrect for VonMises distribution

fritzo · 4Comments