Pymc3: Possible Performance Issue Using Sparse Matrices in Bipartite Noisy-Or Model

Created on 11 Sep 2017 · 9Comments · Source: pymc-devs/pymc3

Hello,

I'm trying to implement a certain model using pymc3, and have been running into what I believe are some severe performance issues.

My system:
2015 Macbook Pro, 3.1GHz i7, 16GB RAM
Python 3.5
pymc3 version 3.1
theano version 0.9.0

Some background:
The model I'm trying to implement is a bipartite, binary noisy-OR network. I have visible binary variables (call them 'effects'), each of which is dependent upon one or more hidden binary variables (call them 'causes'). There's a many-to-many relationship, where each effect has multiple causes, and each cause influences multiple effects. The CPD for a given visible variable is a noisy-OR, where an effect is "on" if at least one cause is "on". Where the "noisy" part comes in is that each cause can fail to activate a given effect with a certain probability (i.e. cause is "on" but effect stays "off"), and each effect can also be "on" with a certain probability, even when all causes are "off" (so-called "leak" node).

Perhaps the best succinct explanation for this type of model is this page on the QMR-DT network (the canonical application of this kind of model, where the presence of diseases (hidden layer) is inferred based on the presence of various symptoms (visible layer)).

As for some reproducible code, heres code to simulate some data:

import numpy as np
import pystan
import random
import scipy as sp
from scipy import optimize
from scipy import stats
import pandas as pd
import pymc3 as pm
import theano.tensor as T
import theano.printing
import matplotlib.pyplot as plt
%matplotlib inline
# Simulate some data. n hidden nodes, 
# all of which have probability 0.05 of being
# "on", except for node h02, which has probability
# 0.90. We'd like our inference to suggest that the
# parameter theta_h[2] has a likely value much higher
# than that of the others.

np.random.seed(42)

m,n = 10, 4
H = {"h"+str(i).zfill(2): 0.05 for i in range(1,n+1)}
H["h02"] = 0.90
H_keys = sorted(list(H.keys()))
V = {"v"+str(i): random.sample(H.keys(),random.randint(1,n)) for i in range(1,m+1)}

n_samples_per_v = 1 # in case we want more visible datapoints.

data = []
for v,pa_t in V.items():
    pa_t = pa_t
    for _ in range(n_samples_per_v):
        if any(np.random.binomial(1,H[h]) == 1 for h in pa_t):
            v_value = 1
        else:
            v_value = 0
        data.append((v_value,pa_t))
m,n = len(data), len(H)

# Making the sparse matrix. This could
# definitely be made more efficient, but
# its a one-time operation, so doesn't really
# matter at this point.
h_lookup = dict(zip(sorted(H_keys),range(len(H))))
observed, parents = zip(*data)
A_dense = np.zeros((m,n),dtype=np.int)
for i, parents_i in enumerate(parents):
    for parent in parents_i:
        A_dense[i,h_lookup[parent]] = 1
A = sp.sparse.csc_matrix(A_dense)

The data variable l made ends up looking like this:

[(1, ['h03', 'h04', 'h01', 'h02']),
 (1, ['h03', 'h02']),
 (1, ['h02', 'h03', 'h01']),
 (1, ['h02', 'h04', 'h03', 'h01']),
 (1, ['h03', 'h02']),
 (0, ['h04', 'h03']),
 (0, ['h04']),
 (1, ['h04', 'h03']),
 (1, ['h01', 'h02', 'h04']),
 (1, ['h03', 'h01', 'h02', 'h04'])]

and the resulting observed data array and dense mask matrix A_dense, look like

(1, 1, 1, 1, 1, 0, 0, 1, 1, 1)

and

array([[1, 1, 1, 1],
       [0, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 1],
       [0, 1, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 1, 1, 1]])

, respectively

Here's my attempt at a pymc3 model:

with pm.Model() as model:
    # Hidden nodes: Nodes are bernoulli distributed,
    # with Uniform priors on theta_h.
    theta_h = pm.Uniform("theta_h",lower=0,upper=1,shape=n,testval = 0.5)
    h = pm.Bernoulli("h",p=theta_h, shape=n, testval = 1)

    # Noisy-OR and leak priors:
    # theta_i is the probability that the link between hidden node h[i]
    # and a visible node dependent on it "fails".
    # We're stipulating here that we expect the relationship between
    # hidden and visible nodes to be not too noisy, by specifying
    # with a left-skewed Beta prior that we expect the probability of
    # link failure to be low. Same with the leak probability.
    theta_leak = pm.Beta("theta_leak",alpha=2,beta=20,shape=m,testval=0.05)
    theta_i = pm.Beta("theta_i",alpha=2,beta=20,shape=n, testval = 0.05)

    # Log transformed
    wi = pm.math.log(theta_i).dimshuffle((0,'x')) # see below for why we dimshuffled
    w0 = pm.math.log(theta_leak)

    # Constructing the likelihood: 
    # P(v[i] = 1| h) = 1 - \exp{w0[i] + \sum_{h[j] \in parents(v[i])}{h[j]*wi[j]}}
    # Equivalent to exponential formualtion at the bottom of
    # http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume10/jaakkola99a-html/node2.html
    # We can express this using the following matrix equation: A*H*wi, where
    # - A is our sparse (m*n) mask of child-to-parent mappings.
    # - H is an (n*n) diagonal matrix with the vector of RV's h down the main diagonal
    # - wi is our log(theta_i), transformed from an (n,) array into an (n*1) column vector
    # NOTE: I would get cryptic errors about dropping non-broadcastable dimensions
    # if I used anything but structured_dot when constructing AHwi. Not sure what the deal
    # is there...
    H = theano.sparse.square_diagonal(h)
    AH = theano.sparse.basic.cast(theano.sparse.true_dot(A,H),"float64")
    AHwi = theano.sparse.structured_dot(AH,wi).ravel()
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi) # voila
    #theano.printing.Print()(p_v_given_h) # for debugging

    # Observed data consists of m binary observations, 
    # representing the states of the m visible nodes (1 = 'on').
    v = pm.Bernoulli("v",p=p_v_given_h,shape=m,observed=observed)

    trace = pm.sample()
    pm.forestplot(trace,varnames=["theta_h"])

My general difficulty is in implementing the many-to-many relationship between hidden nodes and visible nodes in an efficient manner in pymc3/theano.
As you can see, I've tried to express the likelihood in terms of matrix operations, where the parent-child relationships are encoded in the binary sparse-matrix A, which I then multiply by a diagonal matrix containing the hidden variable RVs, and multiply that by the column vector of the noise parameters. The model works just fine when the size of the data is small (both m and n below, say, 1000). However, my actual data I want to run this model on consists of around 2500 hidden nodes, and nearly 750,000 visible nodes (observations). When I try that, I get extremely slow speeds (on the order of 400s/iteration), despite the size of my sparse matrix being only on the order of around 20MB. I would have thought that pymc3 (and theano underneath) would be equipped to handle sparse structures of that size with relative ease.

I'm not sure if this is an issue with pymc3's handling of sparse data structures, or the way I've implemented my model (or both).

If anybody has any insight into why this runs so slowly, and how I might go about modifying my implementation, it would be much appreciated. Thanks in advance!

Source

PriceHardman

👍1

All 9 comments

Have you try to run it with ADVI?

junpenglao on 11 Sep 2017

Just tried it. Told me ADVI doesn't work with discrete RV's. I read a bit of this part of the documentation mentioning that one can sidestep this by trying to marginalize out the discrete variables. I don't know how to do this, so I'll likely have to read some more before ADVI becomes a possible path forward. Thanks for the suggestion!

PriceHardman on 11 Sep 2017

Oh I see you have hidden nodes are bernoulli distributed, yes, in that case, marginalize out the discrete variable is the way to go. Just on top of my head: is it possible to skip the h and just do H = theano.sparse.square_diagonal(theta_h)?

junpenglao on 11 Sep 2017

For example, something like:

with pm.Model() as model:
    theta_h = pm.Uniform("theta_h",lower=0,upper=1,shape=n)
    theta_leak = pm.Beta("theta_leak",alpha=2,beta=20,shape=m,testval=0.05)
    theta_i = pm.Beta("theta_i",alpha=2,beta=20,shape=n, testval = 0.05)

    # Log transformed
    wi = pm.math.log(theta_i)
    w0 = pm.math.log(theta_leak)
    AHwi = tt.dot(tt.dot(A_dense, tt.diag(theta_h)), wi)
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi) # voila
    v = pm.Bernoulli("v",p=p_v_given_h,shape=m,observed=observed)
    trace = pm.sample()

junpenglao on 11 Sep 2017

👍1

Interesting idea. It speeds things up a bit, but also changes the results.
Made these changes:

    H = theano.sparse.square_diagonal(theta_h)
    AH = theano.sparse.structured_dot(A,H)
    AHwi = theano.sparse.structured_dot(AH,wi).ravel()
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi)

The forestplot of the posteriors of theta_h are quite a bit different than before.

PriceHardman on 11 Sep 2017

BTW, this topic is more suitable in https://discourse.pymc.io/. I am closing it here for now. Could you please move to there?

junpenglao on 11 Sep 2017

Yep will do. Forgot about the discourse page. Thanks!

PriceHardman on 11 Sep 2017

👍1

FYI https://discourse.pymc.io/t/possible-performance-issue-using-sparse-matrices-in-bipartite-noisy-or-model/319

junpenglao on 11 Sep 2017

👍1

Thanks!

PriceHardman on 11 Sep 2017