Exact Sampling with Integer Linear Programs and Random Perturbations


Abstract

We consider the problem of sampling from a discrete probability distribution specified by a graphical model. Exact samples can, in principle, be obtained by computing the mode of the original model perturbed with exponentially many i.i.d. random variables. We propose a novel algorithm that views this as a combinatorial optimization problem and searches for the extreme state using a standard integer linear programming (ILP) solver, appropriately extended to account for the random perturbation. Our technique, GumbelMIP, leverages linear programming (LP) relaxations to evaluate the quality of samples and prune large portions of the search space, and can thus scale to large tree-width models beyond the reach of current exact inference methods. Further, when the optimization problem is not solved to optimality, our method yields a novel approximate sampling technique. We empirically demonstrate that our approach parallelizes well, our exact sampler scales better than alternative approaches, and our approximate sampler yields better-quality samples than a Gibbs sampler and a low-dimensional perturbation method.

Introduction

Probabilistic models are a key component of modern statistical machine learning (ML) and artificial intelligence (AI) systems (Murphy 2012) . Inference in such models is a computational challenge that has been intensively researched in physics, statistics, and computer science (Andrieu et al. 2003) . The problem is known to be intractable in the worst case, and approximate techniques are often used in practice. One of the most influential notions has been the idea of Monte Carlo approximation, where the computation of an expectation with respect to a complex model is approximated using a sample average. The problem of drawing samples from a general probabilistic model is therefore extremely important, and it is used as a building block inside countless learning, inference, and even numerical analysis algorithms. By far the most popular approach is the Markov Chain Monte Carlo (MCMC) method. The idea is to set up a Markov Chain that, in the limit, will converge to the target distribution, and then draw samples by simulating the chain for a sufficiently long time. Unfortunately, for many models of interest, such chains take exponential time to converge.

We introduce a novel approach for sampling exactly from arbitrary discrete probability distributions. Our approach is based on the Gumbel-max idea (Gumbel and Lieblein 1954), a property of Gumbel random variables with numerous applications in statistics and econometrics, which has recently become popular in the ML community as well (Maddison, Tarlow, and Minka 2014; Gane, Hazan, and Jaakkola 2014; Hazan and Jaakkola 2012; Papandreou and Yuille 2011; Tarlow, Adams, and Zemel 2012; Kappes et al. 2015; Hazan, Maji, and Jaakkola 2013). This property means that one can obtain samples from a probabilistic model by finding the mode of the distribution obtained by perturbing the log-likelihood of each individual state with i.i.d. Gumbel noise. Modern combinatorial optimization tools such as integer linear programming (ILP) and weighted constraint satisfaction solvers can often handle problems with thousands of variables (Wolsey and Nemhauser 2014) and are well-suited for computing modes of discrete probability distributions. Translating a sampling problem into an optimization problem is therefore a big step forward.

Unfortunately, the Gumbel-max reduction cannot be directly used, as it requires too much "randomness" to generate an exponential number of Gumbel random variables, and too much space to store them (i.e., the resulting optimization problem could not even be encoded compactly). To overcome this issue, we propose an approach inspired by the recent A* sampling algorithm for continuous probability distributions (Maddison, Tarlow, and Minka 2014) . While we share with A* sampling the idea of instantiating the randomness as needed in a "lazy" manner, our focus on discrete distributions presents new challenges and opportunities. We show how to make this approach practical using powerful linear programming (LP) relaxations to prune "irrelevant" portions of the state space which can be safely ignored without affecting exactness. We integrate this idea into the commercial ILP solver CPLEX, transforming it into a general-purpose sampler by introducing a Gumbel-based randomization into its branch-and-cut search. Our approach thus directly leverages many recent advances in combinatorial optimization, including parallel search (Boehning, Butler, and Gillett 1988; Johnson, Nemhauser, and Savelsbergh 2000) . This opens a novel angle for parallelizing sampling algorithms in a principled fashion.

Our experiments demonstrate that the ILP-based sampler scales to large-treewidth graphical models well beyond the reach of current exact inference methods, including rejection sampling and an adaptation of A* sampling to our discrete setting. Furthermore, even though the approach is general-purpose and oblivious to any special structure in the model, it empirically scales very well on tractable distributions such as tree-structured graphical models, for which the LP relaxations employed by CPLEX are known to be tight (Sontag et al. 2008). Our approach also naturally defines a new efficient approximate sampling method, where we simply stop the optimization after a given amount of time and output the best solution found so far. Interestingly, information from the LP relaxations (the optimality gap) can be used to evaluate the quality of the approximate sample. This is in stark contrast with traditional MCMC techniques, where there is generally no information on the quality of the results if the chain is stopped early, as the chain might have missed important areas of the state space. We demonstrate that our approximate sampler outperforms traditional methods such as a Gibbs sampler as well as a recent low-dimensional perturbation approximation (Hazan, Maji, and Jaakkola 2013).

Preliminaries

Given a finite set Σ and w : Σ → R⁺, we seek a randomized algorithm that selects each σ ∈ Σ with a probability proportional to w(σ), i.e., p(σ) = w(σ)/Z, where Z = ∑_{σ∈Σ} w(σ) is the partition function. For ease of exposition, we consider Σ = {0, 1}^n, but the results extend naturally to general categorical variables. The Gumbel-max method uses the Gumbel distribution to turn this sampling problem into an optimization problem (Maddison, Tarlow, and Minka 2014). The Gumbel distribution with location k, denoted Gumbel(k), can be defined by its CDF Pr[Gumbel(k) ≤ g] = exp(−exp(−(g − k))). Gumbel(k) has mean k + C_E, where C_E ≈ 0.5772 is the Euler-Mascheroni constant, and variance π²/6. Perhaps more intuitively, we can view Gumbel(k) as approximating max_{1≤n≤exp(k)} Y_n, where the Y_n are i.i.d. standard exponential random variables; in fact, as k approaches infinity, max_{1≤n≤exp(k)} Y_n − k converges pointwise to Gumbel(0) (Leadbetter, Lindgren, and Rootzén 1983). For b ∈ R, the Gumbel distribution "truncated" at b, i.e., forced to be at most b, is defined by its CDF

Pr[TruncGumbel(k, b) ≤ g] = exp(− exp(−(min(g, b) − k)))/ exp(− exp(−(b − k)))
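Both distributions are straightforward to sample by inverting these CDFs; this is how the perturbations are drawn in lines 2 and 19 of Algorithm 1 below. The following Python snippet is an illustrative sketch of ours (not part of the original implementation):

    import math
    import random

    def sample_gumbel(k: float) -> float:
        """Draw from Gumbel(k) by inverting exp(-exp(-(g - k)))."""
        u = random.random()
        return k - math.log(-math.log(u))

    def sample_trunc_gumbel(k: float, b: float) -> float:
        """Draw from TruncGumbel(k, b), i.e., Gumbel(k) forced to be at most b."""
        u = random.random()
        # Inverse of the truncated CDF above; the result never exceeds b.
        return k - math.log(math.exp(-(b - k)) - math.log(u))

    if __name__ == "__main__":
        g = sample_gumbel(2.0)
        t = sample_trunc_gumbel(2.0, b=g)
        print(g, t)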

The relevant properties of the Gumbel distribution are as follows (Maddison, Tarlow, and Minka 2014). Let {γ_σ}_{σ∈Σ} be i.i.d. Gumbel(0) random variables, one per configuration. Then, for any location parameters k_σ,

max_{σ∈Σ} (k_σ + γ_σ) ∼ Gumbel(log ∑_{σ∈Σ} exp(k_σ))    (1)

In particular, taking k_σ = log w(σ),

max_{σ∈Σ} (log w(σ) + γ_σ) ∼ Gumbel(log Z)    (2)

argmax_{σ∈Σ} (log w(σ) + γ_σ) ∼ p(σ) = w(σ)/Z    (3)

These properties provide a way of computing the partition function (eq. 2) and sampling from the distribution (eq. 3).
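As a quick sanity check of properties (2) and (3), the short sketch below (ours, purely illustrative) brute-forces a three-state model: the argmax of the Gumbel-perturbed log-weights reproduces p(σ), and the average of the max concentrates around log Z + C_E.

    import math
    import random
    from collections import Counter

    def gumbel0() -> float:
        return -math.log(-math.log(random.random()))

    w = {"a": 1.0, "b": 2.0, "c": 5.0}       # a tiny unnormalized model
    Z = sum(w.values())

    counts, max_vals = Counter(), []
    trials = 100_000
    for _ in range(trials):
        perturbed = {s: math.log(ws) + gumbel0() for s, ws in w.items()}
        counts[max(perturbed, key=perturbed.get)] += 1   # eq. (3): argmax ~ p
        max_vals.append(max(perturbed.values()))         # eq. (2): max ~ Gumbel(log Z)

    for s in w:
        print(s, counts[s] / trials, w[s] / Z)           # empirical vs. exact p(s)
    print(sum(max_vals) / trials, math.log(Z) + 0.5772)  # mean of max vs. log Z + C_E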

MIP Gumbel Sampling

Our algorithm is closely related to A* sampling, a recently introduced algorithm that exploits the Gumbel-max method to draw (exact) samples from continuous density functions (Maddison, Tarlow, and Minka 2014). Like A* sampling, we also generate a sequence of i.i.d. Gumbel distributed random variables in a "lazy" and top-down manner. Specifically, the (exponentially) large number of Gumbel variates needed to fully specify the optimization problems in (2) and (3) is only instantiated as needed by a branch-and-bound search procedure. The crucial observation is that in order to decide whether a node v can be pruned, it is sufficient to know the maximum value of the Gumbel random variables at the leaves of the subtree rooted at v, i.e., max{γ_σ | v is an ancestor of σ}. Thanks to a property of Gumbel distributions (eq. 1), this value can be easily sampled without actually instantiating all the other leaves. Our approach is, however, tailored to discrete domains, which present several opportunities and advantages. We can employ techniques from combinatorial optimization such as tractable relaxations to obtain bounds on the objective, which can be used in a branch-and-bound scheme to prune large portions of the search space. These approaches are usually much more effective than their continuous counterparts (Schlesinger 2009). In particular, we use LP relaxations, which are known to be very effective at exploiting the sparse structure of real-world graphical models. Further, we can leverage decades of engineering in ILP solvers.

ILP Formulation

We consider an ILP formulation for the MAP inference problem max_σ w(σ). For simplicity of exposition, we focus on binary factors (i.e., pairwise interactions between variables), where

w(σ) = ∏_{i∈V} ψ_i(σ_i) ∏_{(i,j)∈E} ψ_ij(σ_i, σ_j)

for some edge set E. Rewriting in terms of log-potentials θ_i = log ψ_i and θ_ij = log ψ_ij, the MAP inference problem can be stated as

max_{σ∈Σ} ∑_{i∈V} θ_i(σ_i) + ∑_{(i,j)∈E} θ_ij(σ_i, σ_j).

This, in turn, may be written as an ILP Q using binary variables {μ_i ∈ {0, 1} | i ∈ V} and {μ_ij(c_i, c_j) ∈ {0, 1} | (i, j) ∈ E, c_i, c_j ∈ {0, 1}} (Wainwright and Jordan 2008):

max_{μ_i, μ_ij}  ∑_{i∈V} [θ_i(1) μ_i + θ_i(0) (1 − μ_i)] + ∑_{(i,j)∈E} ∑_{c_i,c_j} θ_ij(c_i, c_j) μ_ij(c_i, c_j)    (4)

subject to the following constraints for all i ∈ V, (i, j) ∈ E:

∑_{c_j∈{0,1}} μ_ij(0, c_j) = 1 − μ_i;    ∑_{c_j∈{0,1}} μ_ij(1, c_j) = μ_i;
∑_{c_i∈{0,1}} μ_ij(c_i, 0) = 1 − μ_j;    ∑_{c_i∈{0,1}} μ_ij(c_i, 1) = μ_j.

The optimal objective value of Q can be upper bounded by solving its LP relaxation LPrelax(Q), obtained by allowing the binary variables to take values in the continuous range [0, 1] while keeping the linear constraints above.
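For concreteness, the following sketch builds the ILP (4) for a small chain-structured model using the open-source PuLP modeling library (the paper's implementation uses CPLEX; the model, potentials, and variable names here are illustrative):

    import random
    from pulp import LpProblem, LpVariable, LpMaximize, lpSum, value

    random.seed(0)
    V = range(4)                               # chain 0 - 1 - 2 - 3
    E = [(i, i + 1) for i in range(3)]
    th_i = {i: {c: random.uniform(-1, 1) for c in (0, 1)} for i in V}
    th_ij = {e: {(a, b): random.uniform(-1, 1) for a in (0, 1) for b in (0, 1)} for e in E}

    prob = LpProblem("pairwise_map", LpMaximize)
    mu = {i: LpVariable(f"mu_{i}", cat="Binary") for i in V}
    mu_e = {(e, a, b): LpVariable(f"mu_{e[0]}_{e[1]}_{a}{b}", cat="Binary")
            for e in E for a in (0, 1) for b in (0, 1)}

    # Objective (4): unary terms plus pairwise terms.
    prob += (lpSum(th_i[i][1] * mu[i] + th_i[i][0] * (1 - mu[i]) for i in V)
             + lpSum(th_ij[e][(a, b)] * mu_e[(e, a, b)]
                     for e in E for a in (0, 1) for b in (0, 1)))

    # Local consistency constraints tying edge indicators to node indicators.
    for e in E:
        i, j = e
        prob += lpSum(mu_e[(e, 0, b)] for b in (0, 1)) == 1 - mu[i]
        prob += lpSum(mu_e[(e, 1, b)] for b in (0, 1)) == mu[i]
        prob += lpSum(mu_e[(e, a, 0)] for a in (0, 1)) == 1 - mu[j]
        prob += lpSum(mu_e[(e, a, 1)] for a in (0, 1)) == mu[j]

    prob.solve()
    print("MAP assignment:", {i: int(value(mu[i])) for i in V})

Replacing cat="Binary" with continuous variables bounded in [0, 1] yields LPrelax(Q).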

Gumbel Sampling Using ILP

Algorithm 1  GumbelMIP(Q: ILP with n variables)
 1:  X_Q ← variables of Q    ▷ |X_Q| = n
 2:  u ∼ Uniform[0, 1];  γ_Q ← n log 2 − log(−log u)    ▷ sample γ_Q ∼ Gumbel(n log 2)
 3:  σ ∼ Uniform({0, 1}^n)    ▷ uniform random configuration associated with γ_Q
 4:  σ* ← σ;  v* ← v(σ) + γ_Q    ▷ initialize current best configuration
 5:  L ← {(Q, X_Q, γ_Q, σ)}    ▷ set of currently open sub-problems
 6:  while L is non-empty do
 7:      Select and remove (P, X_P, γ_P, σ) from L according to a sub-problem selection heuristic
 8:      Solve the LP relaxation of sub-problem P
 9:      if the LP relaxation is infeasible then Continue end if
10:      Otherwise denote the LP solution by τ with objective value v
11:      if X_P is empty then    ▷ τ must equal σ and thus be integer valued
12:          if v + γ_P > v* then σ* ← σ;  v* ← v + γ_P end if
13:          Continue
14:      end if
15:      if v + γ_P ≤ v* then Continue end if    ▷ prune: P cannot improve upon the current best
16:      Select a free variable x′ ∈ X_P according to a variable selection heuristic
17:      Let σ_{x′} denote the value σ assigns to x′
18:      u ∼ Uniform[0, 1]    ▷ sample γ̃ ∼ TruncGumbel((|X_P| − 1) log 2, γ_P)
19:      γ̃ ← (|X_P| − 1) log 2 − log(exp(−γ_P + (|X_P| − 1) log 2) − log u)
20:      Let σ̃ be σ with σ̃_{x′} = 1 − σ_{x′} and the values of X_P \ {x′} resampled uniformly from {0, 1}^{|X_P|−1}
21:      if v(σ̃) + γ̃ > v* then σ* ← σ̃;  v* ← v(σ̃) + γ̃ end if
22:      Add (P[x′ = σ_{x′}], X_P \ {x′}, γ_P, σ) and (P[x′ = 1 − σ_{x′}], X_P \ {x′}, γ̃, σ̃) to L
23:  end while
24:  return σ*

Algorithm 1 describes our sampling method, called GumbelMIP, which takes as input an ILP Q. It keeps track of the current best configuration σ* and its Gumbel-perturbed objective value v*. It also maintains a set L of currently open sub-problems, each defined by an ILP P obtained by freezing the values of a subset of the variables of Q. The sub-problem P is associated with its set of free variables X_P, the maximum γ_P of the (implicit) Gumbel perturbations of all configurations consistent with P (i.e., those in the feasible set of P), and a special configuration σ consistent with P whose Gumbel perturbation γ_σ is fixed to be γ_P. Initially, L contains Q, along with the max of 2^n independent Gumbel samples and a uniformly randomly chosen configuration σ. As noted earlier, the max of 2^n independent Gumbel samples is itself Gumbel distributed, and is thus easy to sample (Line 2). σ becomes the current best configuration σ* and its log-weight (the ILP objective value for σ) perturbed by γ_Q becomes v*.

The branch-and-bound search for the ILP proceeds as follows. Using a sub-problem selection heuristic (e.g., the default in CPLEX), we choose (P, X_P, γ_P, σ) from L. If the LP relaxation of P is infeasible, or if the LP relaxation objective perturbed by γ_P is not an improvement over the current best perturbed objective v*, we simply discard P. Otherwise, if P has no free variables, we must have an integer feasible solution in σ tied to the Gumbel perturbation γ_P. We use this to update our running bound v*, if it is an improvement. Then P is discarded. If P does have free variables, we choose one, say x′, using a heuristic (e.g., the default in CPLEX, overridden by a forced variable selection step in case all variables already have an integer value in the LP relaxation, which would normally obviate the need for an ILP solver to branch any further; such a feasible integer solution, however, might not be tied to any Gumbel perturbation).

The branching step now creates two new sub-problems, by fixing x′ to 0 and 1, respectively. The sub-problem with x′ = σ_{x′} inherits the perturbation γ_P along with the configuration σ tied to it. The sub-problem with x′ = 1 − σ_{x′} gets a freshly sampled truncated Gumbel perturbation γ̃ that is distributed as the max of 2^{|X_P|−1} independent Gumbels, each truncated to be no larger than γ_P (Line 19). In this sub-problem, γ̃ is tied to a uniformly randomly chosen configuration σ̃ that is consistent with σ on all frozen variables of P. If σ̃ improves upon the current best σ*, the latter is updated. After creating the two sub-problems, P is discarded.

The branch-and-bound search continues as above until L becomes empty, at which point it outputs σ * as the sample and v * as its Gumbel perturbed objective value.
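To make the control flow concrete, here is a compact Python rendering of Algorithm 1, provided purely for illustration: it is not the paper's CPLEX implementation, it replaces the LP relaxation with a user-supplied upper bound on the log-weight of all completions of a partial assignment, and it uses trivial LIFO and first-free-variable heuristics. With a sound bound it returns an exact sample.

    import math
    import random

    def gumbel(k):                      # Gumbel(k) via inverse CDF (cf. Line 2)
        return k - math.log(-math.log(random.random()))

    def trunc_gumbel(k, b):             # Gumbel(k) truncated at b (cf. Line 19)
        return k - math.log(math.exp(-(b - k)) - math.log(random.random()))

    def gumbel_sample(n, log_w, upper_bound):
        """One exact sample from p(sigma) proportional to exp(log_w(sigma)) on {0,1}^n.
        upper_bound(partial) must upper-bound log_w over all completions of the
        partial assignment; it plays the role of the LP relaxation."""
        def completion(partial):        # fill the free variables uniformly at random
            return tuple(partial.get(i, random.randint(0, 1)) for i in range(n))

        gamma_q = gumbel(n * math.log(2))            # max of 2^n i.i.d. Gumbel(0)'s
        sigma = completion({})
        best_sigma, best_v = sigma, log_w(sigma) + gamma_q
        open_list = [({}, gamma_q, sigma)]           # (frozen vars, gamma, tied config)

        while open_list:
            partial, gamma, sigma = open_list.pop()  # sub-problem selection: LIFO
            free = [i for i in range(n) if i not in partial]
            if not free:                             # leaf: sigma is fully determined
                if log_w(sigma) + gamma > best_v:
                    best_sigma, best_v = sigma, log_w(sigma) + gamma
                continue
            if upper_bound(partial) + gamma <= best_v:
                continue                             # prune: cannot beat current best
            x = free[0]                              # variable selection: first free
            # Child keeping sigma's value of x inherits (gamma, sigma); the sibling
            # gets a fresh truncated-Gumbel max over its 2^(|free|-1) leaves.
            g_new = trunc_gumbel((len(free) - 1) * math.log(2), gamma)
            flipped = dict(partial); flipped[x] = 1 - sigma[x]
            sigma_new = completion(flipped)
            if log_w(sigma_new) + g_new > best_v:
                best_sigma, best_v = sigma_new, log_w(sigma_new) + g_new
            kept = dict(partial); kept[x] = sigma[x]
            open_list.append((kept, gamma, sigma))
            open_list.append((flipped, g_new, sigma_new))
        return best_sigma

    if __name__ == "__main__":
        random.seed(1)
        n = 6
        coeff = [random.uniform(-1, 1) for _ in range(n)]

        def log_w(s):                                # independent-variable toy model
            return sum(c * x for c, x in zip(coeff, s))

        def upper_bound(partial):                    # naive factor-wise bound
            return sum(c * partial[i] if i in partial else max(c, 0.0)
                       for i, c in enumerate(coeff))

        print(gumbel_sample(n, log_w, upper_bound))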

Relationship With A* Sampling

Our method takes advantage of abstractions and algorithms tailored to discrete spaces, resulting in a few technical differences from the A* sampling approach for continuous spaces.

Use of LP relaxations: LP relaxations have been shown to be extremely effective for MAP/MPE inference and combinatorial optimization (Wolsey and Nemhauser 2014; Sontag et al. 2008) . Our work demonstrates that LP relaxations can also speed up exact sampling.

Inheritance of configurations: The branching employed by A* sampling would partition a search space into 3 parts: a singleton set (the point at which the current maximum perturbation is realized) and two other disjoint subspaces. The standard decomposition of discrete spaces into subtrees (corresponding to branching on a variable) is, however, much more natural and amenable to compact representations, as well as to the use of LP relaxations. At each branching point, rather than sampling two fresh truncated Gumbel perturbations for the children, we instead have one child inherit the perturbation of the parent (Line 22). This preserves correctness and allows for traditional branching based on subtrees.

Heuristics for branching: Branching heuristics play a major role in combinatorial search. A* sampling always selects the subproblem with the highest upper bound (best first). While this is a natural heuristic, other more advanced heuristics are used in modern combinatorial optimization. Building on the commercial solver CPLEX, GumbelMIP can leverage decades of research on branching heuristics.

Parallelized searches: Similarly, significant progress has been made in the past 20 years on parallelizing combinatorial search. Our approach can take advantage of these techniques, dramatically reducing the runtime (see Figure 1a) .

Figure 1: Average runtime of exact sampling (over 100 samples) for hard and easy graphical models.

Analysis

The soundness of search space pruning based on LP relaxations guarantees that upon termination, the σ* output by GumbelMIP will be the optimal solution of the perturbed optimization problem. This, together with the Gumbel-max properties discussed earlier and prior observations about lazy instantiation of randomness (Maddison, Tarlow, and Minka 2014), guarantees the correctness of GumbelMIP.

The runtime is worst-case exponential, because Algorithm 1 might have to explore the entire (exponentially large) search space. This is consistent with the hardness of sampling, which is believed to be intractable in the worst case (Jerrum and Sinclair 1997; Koller and Friedman 2009). The empirical evaluation of the runtime is discussed in the experimental section. Intuitively, certain "easy" classes of probability distributions can be handled efficiently. For example, consider the case w(σ) = 1 for all σ, i.e., a uniform probability distribution over Σ. Then it can be shown that the first configuration σ, sampled in line 3 of Algorithm 1, will be the optimal solution for the perturbed optimization problem. Assuming the LP relaxation provides a tight upper bound, Algorithm 1 will be able to prune all nodes and finish the search at the root node.

What happens if the search is aborted before the ILP is solved to optimality? We show that the approach is robust in the sense that its performance degrades gracefully with the optimality gap (i.e., how far we are from the optimum).

To gain some intuition, suppose we abort the search when there are still open sub-problems in the set L that might potentially improve upon the current best solution σ*. Let S = ⋃_{(Q, X_Q, γ_Q, σ)∈L} Feasible(Q) \ {σ}, where Feasible(Q) denotes the set of feasible (integer) solutions of the ILP Q. Let S̄ = {0, 1}^n \ S denote its complement, i.e., the space that has actually been explored so far. By design, GumbelMIP maintains the invariant σ* = argmax_{σ∈S̄} {log w(σ) + γ_σ}, i.e., σ* is the best solution found so far. For any set S̄, we know from Gumbel-max properties that Pr_γ[σ′ = argmax_{σ∈S̄} {log w(σ) + γ_σ}] = w(σ′)/∑_{σ∈S̄} w(σ) = p(σ′ | σ ∈ S̄). As expected, the larger the set S̄ is, the better the samples are. When S̄ = Σ, i.e., we have solved the problem to optimality, we are sampling from the desired target density p(σ).

Another way to measure optimization progress is to consider the rank of the current best solution σ*. Suppose σ* has the k-th largest perturbed objective value. Intuitively, the smaller k is, the closer we are to the optimal solution and the better the samples we can expect. This intuition is formalized by the following result,¹ which generalizes the Gumbel-max properties and relates the rank of the top k solutions (according to the Gumbel-perturbed objective) to sampling without replacement from the original probability distribution p(σ):

Theorem 1. Let γ ∈ R^Σ be a vector of |Σ| samples drawn i.i.d. from Gumbel(0). Let w : Σ → R⁺ and w′(σ) = log w(σ) + γ_σ. Then,

Pr[w′(σ^(1)) ≥ ⋯ ≥ w′(σ^(k)) ≥ w′(σ^(ℓ)) ∀ℓ > k] = (w(σ^(1))/Z) · (w(σ^(2))/(Z − w(σ^(1)))) ⋯ (w(σ^(k))/(Z − ∑_{j=1}^{k−1} w(σ^(j))))

Interestingly, the information provided by the open nodes in the set L and their LP relaxations allows us to (probabilistically) estimate the rank of the current best solution σ*.

Proposition 1. Suppose GumbelMIP stops early, outputs σ*, and has a set L of open sub-problems. Let k be the rank of σ* when all configurations are ordered by decreasing w′(σ) = log w(σ) + γ_σ. Then,

E_γ[k] ≤ 1 + ∑_{(Q, X, γ_Q, σ)∈L : LPrelax(Q)+γ_Q ≥ w′(σ*)} [1 + (2^{|X|} − 1) · (1 − F(w′(σ*) − LPrelax(Q), γ_Q))]

where F(g, b) is the CDF of TruncGumbel(0, b) evaluated at g.

LP relaxations of the open nodes in L (and, implicitly, the optimality gap) not only provide us with a way to evaluate the quality of the approximate samples produced by early stopping (through Theorem 1 and Proposition 1) but can also be used to provide any-time upper and lower stochastic bounds on the value of the partition function Z:

Theorem 2. Suppose GumbelMIP stops early. Let v* = w′(σ*) and U = max_{(Q, ·, γ_Q, ·)∈L} (LPrelax(Q) + γ_Q). Let C_E ≈ 0.5772 be the Euler-Mascheroni constant. Then,

E_γ[v*] ≤ log Z + C_E ≤ E_γ[U].

Further, for any ε > 0 and δ > 0, suppose we repeat the algorithm T ≥ (1/δ − 1)π²/(6ε²) times using independently generated Gumbel perturbations {γ^(t)}_t. Let {v*_t}_t and {U_t}_t be the corresponding samples of v* and U, resp., obtained by stopping the algorithm early. Then,

Pr[log Z + C_E ≥ (1/T) ∑_{t=1}^T v*_t − ε] ≥ 1 − δ,    Pr[log Z + C_E ≤ (1/T) ∑_{t=1}^T U_t + ε] ≥ 1 − δ.

While sampling-based lower bounds on Z can typically be obtained using importance sampling and its variants (Koller and Friedman 2009; Gogate and Dechter 2007), upper bounds are much more difficult to obtain, requiring either an exponential number of samples or strong conditions on the skew of w(σ) (cf. Theorem 12.1 of Koller and Friedman (2009), Chakraborty et al. (2014)). In contrast, Theorem 2 requires a polynomial number of samples and does not make any assumption on w(σ). The bounds, of course, may not be tight if GumbelMIP is stopped too early, but they are guaranteed to be correct.
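As a concrete illustration of the sample-size requirement in Theorem 2, setting ε = 0.1 and δ = 0.05 gives T ≥ (1/0.05 − 1) · π²/(6 · 0.1²) = 19 · π²/0.06 ≈ 3,126 independent runs, each of which may be stopped early.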

Experimental Evaluation

We implemented GumbelMIP on top of the commercial ILP solver CPLEX, using "callbacks" to implement the Gumbel randomization. This approach allows us to directly take advantage of CPLEX's optimized search heuristics (the sub-problem and variable selection steps of Algorithm 1) and parallel solving capabilities, where multiple threads are used to explore the search tree and analyze open sub-problems in L in parallel. We experimented with up to 48 threads and observed significantly improved runtimes with more threads.

Exact Sampling

GumbelMIP, like all known exact sampling algorithms, needs exponential time in the worst case. In practice, however, models typically do not exhibit worst case behavior and modern ILP solvers employ a host of heuristics to speed up search. We evaluate the practical scalability of our approach.

We consider synthetic Ising models with n binary variables x_i ∈ {−1, 1} and potentials ψ_ij(x_i, x_j) = exp(w_ij x_i x_j + f_i x_i + f_j x_j). In Figure 1a we report the runtime of GumbelMIP for randomly generated attractive clique-structured Ising models as a function of n. These models have treewidth n, making exact inference and sampling impractical for large n. In practice, exact state-of-the-art techniques run out of memory for n > 35. In contrast, our approach provides exact samples for models with up to n = 60. While the runtime appears to grow exponentially, the rate is milder and the memory usage limited. Further, we can exploit parallelism to drastically speed up the search, even for a single sample (note the plot is in log scale). While recently introduced hashing-based techniques can scale to models of similar size (Ermon et al. 2013), their results are approximate, while we provide exact samples. To the best of our knowledge, no existing technique can provide exact samples for models of this size.

To evaluate the effectiveness of the LP relaxations we use, we compare with an A*-like sampling algorithm that uses a simpler bounding method (Line 8 of the pseudocode). Specifically, we compute an upper bound on (4) as follows. For each term in the sum (i.e., for each potential), we find the largest value that is consistent with the current partial assignment to the ILP variables, and sum up these upper bounds. This relaxation is much weaker than an LP relaxation as it does not enforce consistency of variable values across factors, but can be evaluated more efficiently. For ease of implementation, we still employ perturbation inheritance. Figure 1a shows that GumbelMIP vastly outperforms this algorithm, which cannot produce any sample when n > 35 (with a 4 hour timeout). The use of powerful LP relaxations is thus crucial for the technique on high dimensional problems.

Next, we compare the runtime of GumbelMIP with a variant of an adaptive rejection sampling scheme that can leverage combinatorial optimization and is therefore closest to GumbelMIP. Rejection sampling (Andrieu et al. 2003) relies on a proposal distribution that is tractable (easy to sample from) and upper bounds the desired target density. The closer these distributions are, the more efficient the sampling is. Obtaining a good proposal distribution typically requires domain-specific knowledge, but general proposal distributions can be constructed using ideas from optimization, as in the OS* algorithm (Dymetman, Bouchard, and Carter 2012). Specifically, we use a piecewise constant proposal distribution, obtained by partitioning the state space into J subtrees and computing an upper bound for w(σ) in each of these J subtrees. We use an LP relaxation for this upper bound, a more sophisticated bounding technique than the ones in the original OS*. While any partition can in principle be used, we obtain it following the branching heuristics employed by CPLEX when used to optimize w. We considered computing the partition and the corresponding upper bounds as a preprocessing step and therefore did not include the time required to perform these steps in the runtime reported for rejection sampling. We see in Figure 1a (black line) that rejection sampling is faster than GumbelMIP for smaller instances, but the runtime grows very quickly and it cannot solve instances beyond n = 35.

Finally, to test our hypothesis that "simpler" models might be easier to solve, we consider tree-structured graphical models. These are tractable and support exact inference (and sampling). GumbelMIP is general-purpose and a priori oblivious to this special structure, but might be able to leverage it anyway because, e.g., the corresponding LP relaxations are known to be tight (Sontag et al. 2008). Indeed, we see in Figure 1b that scalability is drastically improved with respect to Figure 1a. Similar results are also obtained on other tractable classes, e.g., fully disconnected models.

Approximate Sampling and Any-time Bounds on Z

GumbelMIP can be used as an approximate method when the (perturbed) optimization problem is not solved to optimality, with the formal guarantees provided earlier.

To evaluate the empirical effectiveness of this approach, we consider grid-structured Ising models with low treewidth, for which we can compute ground-truth marginal probabilities, and evaluate the quality of the samples obtained if the search is aborted before an optimal solution is found. In Figures 2a (attractive case) and 2b (mixed interactions) we compare with the ground truth (using a scatter plot) the marginals obtained by running GumbelMIP and a Gibbs sampler each for 5 and 10 minutes, respectively (using 200 independent samples). We see that in both cases GumbelMIP provides more accurate marginals than the Gibbs sampler. The powerful heuristics employed by our optimizer are able to quickly find a near-optimal solution, even though the solver cannot prove its optimality. GumbelMIP therefore makes better use of the limited computational resources available. We also compare with the PerturbAndMap approximate sampler (Hazan, Maji, and Jaakkola 2013), based on the use of low-dimensional Gumbel perturbations and exact optimization. Although PerturbAndMap is faster, the samples it generates can be quite inaccurate, emphasizing the need for the exact, high-dimensional Gumbel perturbations used by GumbelMIP. Figure 2c reports the any-time upper and lower bounds on the partition function using Theorem 2. We see that CPLEX finds an optimal solution very quickly (steep blue curve), spending most of the runtime proving its optimality.

Figure 2: Correlation vs true marginals after 5 and 10 minutes; any-time bounds on logZ.

Blocked Gibbs With Large Blocks

GumbelMIP can be used to sample from arbitrary probability distributions. In particular, it can be used to sample from posterior distributions over relatively large subsets of variables (beyond what could be done by brute force). It can therefore be used as a black box inside a blocked Gibbs sampler. Rather than solving a single (perturbed) optimization problem over all the variables, one can solve a sequence of smaller sub-problems over subsets of the variables (the block), while keeping the other variables fixed. This resembles a block coordinate ascent approach, except the randomness is resampled after each coordinate ascent step. Since GumbelMIP produces exact samples, the asymptotic properties of the blocked Gibbs sampler are preserved. The key advantage is that by using large block sizes one can drastically reduce mixing times. In the extreme case where the block includes all the variables, the chain mixes in one step.
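A minimal sketch of this outer loop is shown below (ours, for illustration only); the exact block sampler is implemented here by brute-force enumeration of the 2^k block configurations, which is where GumbelMIP would be plugged in for larger blocks:

    import itertools
    import math
    import random

    def blocked_gibbs(n, log_w, k, iters, rng=random):
        """Resample a random block of k variables exactly from its conditional
        (brute force here; GumbelMIP would play this role for large blocks)."""
        state = [rng.randint(0, 1) for _ in range(n)]
        for _ in range(iters):
            block = rng.sample(range(n), k)
            settings, weights = [], []
            for vals in itertools.product((0, 1), repeat=k):
                for var, val in zip(block, vals):
                    state[var] = val
                settings.append(vals)
                weights.append(math.exp(log_w(state)))   # p(block | rest) is proportional to w(state)
            chosen = rng.choices(settings, weights=weights, k=1)[0]
            for var, val in zip(block, chosen):
                state[var] = val
        return state

    if __name__ == "__main__":
        random.seed(0)
        n = 12
        J = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]

        def log_w(s):                                    # toy fully connected Ising model
            return sum(J[i][j] * (2 * s[i] - 1) * (2 * s[j] - 1)
                       for i in range(n) for j in range(i + 1, n))

        print(blocked_gibbs(n, log_w, k=4, iters=200))

With k = 1 this reduces to an ordinary single-site Gibbs sampler with a random update order; with k = n the block covers all variables and each iteration is itself an exact sample.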

We demonstrate the effectiveness of this approach on a Restricted Boltzmann Machine (RBM) with v = 196 visible units and h = 100 hidden units, trained on MNIST (handwritten digits) with contrastive divergence (Carreira-Perpinan and Hinton 2005). We use a simpler and more effective ILP encoding for RBMs, replacing the 4 binary variables per pairwise factor with just one continuous variable tied to the corresponding pairwise potential. In every iteration of our blocked Gibbs sampler, we randomly select a block of k variables and sample from their posterior, given all the other variables. The case k = 1 corresponds to a traditional Gibbs sampler with random ordering of variable updates. In Figure 3 we plot the state of the chain every 100 iterations, for various block sizes. It is clear that using large block sizes makes the chain converge significantly faster, although the cost per iteration increases. Understanding the tradeoffs involved is largely empirical and a subject for future research.

Conclusions

We introduced GumbelMIP, a novel exact sampler for discrete probability distributions. By translating a sampling problem to a (perturbed) optimization problem, we can leverage a host of techniques from combinatorial optimization and operations research, including LP relaxations, branching heuristics, and parallel multi-threaded search. The resulting algorithm is very efficient and the first one to provide exact samples for high-treewidth models. Even when the resulting problems cannot be solved to optimality, approximate solutions provide high quality samples, as explained by our novel analysis of suboptimal solutions and tools to rigorously characterize the quality of the samples in terms of LP relaxations. Our technique can be used inside a blocked Gibbs sampler, allowing a tradeoff between more iterations and exact samples from larger blocks.

Due to limited space, all proofs are deferred to the companion technical report (Kim, Sabharwal, and Ermon 2015).

Figure 3: Samples from a blocked Gibbs sampler after 100 iterations, for various block sizes.