Abstract
Rademacher complexity is often used to characterize the learnability of a hypothesis class and is known to be related to the class size. We leverage this observation and introduce a new technique for estimating the size of an arbitrary weighted set, defined as the sum of weights of all elements in the set. Our technique provides upper and lower bounds on a novel generalization of Rademacher complexity to the weighted setting in terms of the weighted set size. This generalizes Massart's Lemma, a known upper bound on the Rademacher complexity in terms of the unweighted set size. We show that the weighted Rademacher complexity can be estimated by solving a randomly perturbed optimization problem, allowing us to derive high-probability bounds on the size of any weighted set. We apply our method to the problems of calculating the partition function of an Ising model and computing propositional model counts (#SAT). Our experiments demonstrate that we can produce tighter bounds than competing methods in both the weighted and unweighted settings.
Introduction
A wide variety of problems can be reduced to computing the sum of (many) non-negative numbers. These include calculating the partition function of a graphical model, propositional model counting (#SAT), and calculating the permanent of a non-negative matrix. Equivalently, each can be viewed as computing the discrete integral of a non-negative weight function. Exact summation, however, is generally intractable due to the curse of dimensionality (Bellman 1961).
As alternatives to exact computation, variational methods (Jordan et al. 1998; Wainwright, Jordan, and others 2008) and sampling (Jerrum and Sinclair 1996; Madras 2002) are popular approaches for approximate summation. However, they generally do not guarantee the quality of the estimate.
An emerging line of work estimates and formally bounds propositional model counts or, more generally, discrete integrals (Ermon et al. 2013a; Chakraborty, Meel, and Vardi 2013; Ermon et al. 2014; Zhao et al. 2016) . These approaches reduce the problem of integration to solving a small number of optimization problems involving the same weight function but subject to additional random constraints introduced by a random hash function. This results in approximating the #P-hard problem of exact summation (Valiant 1979) using the solutions of NP-hard optimization problems.
Optimization can be performed efficiently for certain classes of weight functions, such as those involved in the computation of the permanent of a non-negative matrix. If instead of summing (permanent computation) we maximize the same weight function, we obtain a maximum weight matching problem, which is in fact solvable in polynomial time (Kuhn 1955) . However, adding hash-based constraints makes the maximum matching optimization problem intractable, which limits the application of randomized hashing approaches (Ermon et al. 2013c ). On the other hand, there do exist fully polynomial-time randomized approximation schemes (FPRAS) for non-negative permanent computation (Jerrum, Sinclair, and Vigoda 2004; Bezáková et al. 2006) . This gives hope that approximation schemes may exist for other counting problems even when optimization with hash-based constraints is intractable.
We present a new method for approximating and bounding the size of a general weighted set (i.e., the sum of the weights of its elements) using geometric arguments based on the set's shape. Our approach, rather than relying on hash-based techniques, establishes a novel connection with Rademacher complexity (Shalev-Shwartz and Ben-David 2014) . This generalizes geometric approaches developed for the unweighted case to the weighted setting, such as the work of Barvinok (1997) who uses similar reasoning but without connecting it with Rademacher complexity. In particular, we first generalize Rademacher complexity to weighted sets. While Rademacher complexity is defined as the maximum of the sum of Rademacher variables over a set, weighted Rademacher complexity also accounts for the weight of each element in the set. Just like Rademacher complexity is related to the size of the set, we show that weighted Rademacher complexity is related to the total weight of the set. Further, it can be estimated by solving multiple instances of a maximum weight optimization problem, subject to random Rademacher perturbations. Notably, the resulting optimization problem turns out to be computationally much simpler than that required by the aforementioned randomized hashing schemes. In particular, if the weight function is log-supermodular, the corresponding weighted Rademacher complexity can be estimated efficiently, as our perturbation does not change the original optimization problem's complexity (Orlin 2009; Bach and others 2013).
Our approach most closely resembles a recent line of work involving the Gumbel distribution (Hazan and Jaakkola 2012; Hazan, Maji, and Jaakkola 2013; Hazan et al. 2016; Balog et al. 2017; Mussmann and Ermon 2016; Mussmann, Levy, and Ermon 2017) . There, the Gumbel-max idea is used to bound the partition function by performing MAP inference on a model where the unnormalized probability of each state is perturbed by random noise variables sampled from a Gumbel distribution. While very powerful, exact application of the Gumbel method is impractical, as it requires exponentially many independent random perturbations. One instead uses local approximations of the technique.
Empirically, on spin glass models we show that our technique yields tighter upper bounds and similar lower bounds compared with the Gumbel method, given similar computational resources. On a suite of #SAT model counting instances our approach generally produces comparable or tighter upper and lower bounds given limited computation.
Background
Rademacher complexity is an important tool used in learning theory to bound the generalization error of a hypothesis class (Shalev-Shwartz and Ben-David 2014).

Definition 1. The Rademacher complexity of a set A ⊆ R^n is defined as

R(A) = E_c \left[ \sup_{a \in A} \langle c, a \rangle \right],   (1)
where E_c denotes expectation over c, and c is sampled uniformly from {−1, 1}^n. As the name suggests, it is a measure of the complexity of the set A (which, in learning theory, is usually a hypothesis class). It measures how "expressive" A is by evaluating how well we can "fit" to a random noise vector c by choosing the closest vector (or hypothesis) from A. Intuitively, Rademacher complexity is related to |A|, the number of vectors in A, another crude notion of the complexity of A. However, it also depends on how the vectors in A are arranged in the ambient space R^n. A central focus of this paper will be establishing quantitative relationships between R(A) and |A|.
A key property of Rademacher complexity that makes it extremely useful in learning theory is that it can be estimated from a small number of random noise samples c under mild conditions (Shalev-Shwartz and Ben-David 2014). The result follows from McDiarmid's inequality:

Proposition 1 (McDiarmid, 1989). Let X_1, ..., X_m ∈ X be independent random variables. Let f : X^m → R be a function that satisfies the bounded differences condition that ∀i ∈ {1, ..., m} and ∀x_1, ..., x_m, x_i' ∈ X:

|f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x_i', \dots, x_m)| \le d_i.

Then for all ε > 0,

\Pr\left[ f(X_1, \dots, X_m) - E[f(X_1, \dots, X_m)] \ge \epsilon \right] \le \exp\left( \frac{-2\epsilon^2}{\sum_j d_j^2} \right).
McDiarmid's inequality says we can bound, with high probability, how far a function f of random variables may deviate from its expected value, given that the function does not change much when the value of a single random variable is changed. Because the function in Eq. (1) satisfies this property (Shalev-Shwartz and Ben-David 2014), we can use Eq. (1) to bound R(A) with high probability by computing the supremum for only a small number of noise samples c.
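As an illustration of this concentration, the following Python sketch (ours, not from the paper) estimates R(A) for a small explicit set by brute force, and reports the √(6n/k) error bar that, as shown later via McDiarmid's inequality, holds with probability greater than .95:

```python
import math
import numpy as np

def rademacher_complexity_mc(A, k, seed=0):
    """Monte Carlo estimate of R(A) = E_c[ sup_{a in A} <c, a> ]
    (Eq. (1)), with c sampled uniformly over {-1, 1}^n."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)        # shape (|A|, n)
    n = A.shape[1]
    C = rng.choice([-1.0, 1.0], size=(k, n))
    # For each noise vector c, "fit" it with the best-aligned a in A.
    return (C @ A.T).max(axis=1).mean()

# Toy check on a random 32-element subset of the cube {-1, 1}^8.
n, k = 8, 2000
rng = np.random.default_rng(1)
A = rng.choice([-1.0, 1.0], size=(32, n))
est = rademacher_complexity_mc(A, k)
slack = math.sqrt(6 * n / k)              # .95 error bar via McDiarmid
print(f"R(A) ~= {est:.2f} +/- {slack:.2f}")
```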
Problem Setup
In this section we formally define our problem and introduce the optimization oracle central to our solution. Let w : {−1, 1}^n → [0, ∞) be a non-negative weight function. We consider the problem of computing the sum

Z(w) = \sum_{x \in \{-1,1\}^n} w(x).
Many problems, including computing the partition function of an undirected graphical model, where w(x) is the unnormalized probability of state x (see Koller and Friedman (2009) ), propositional model counting (#SAT), and computing the permanent of a non-negative matrix can be reduced to calculating this sum. The problem is challenging because explicit calculation requires summing over 2 n states, which is computationally intractable in cases of interest.
Due to the general intractability of exactly calculating Z(w), we focus on an efficient approach for estimating Z(w) which additionally provides upper and lower bounds that hold with high probability. Our method depends on the following assumption:

Assumption 1. We assume the existence of an optimization oracle that can output the value

\delta(c, w) = \max_{x \in \{-1,1\}^n} \{ \langle c, x \rangle + \log w(x) \}   (2)

for any vector c ∈ {−1, 1}^n and weight function w : {−1, 1}^n → [0, ∞).
Note that throughout the paper we denote log_2 simply as log, log_e as ln, and assume log 0 = −∞. Assumption 1 is reasonable, as there are many classes of models where such an oracle exists. For instance, polynomial time algorithms exist for finding the maximum weight matching in a weighted bipartite graph (Hopcroft and Karp 1971; Jonker and Volgenant 1987). Graph cut algorithms can be applied to efficiently maximize a class of energy functions (Kolmogorov and Zabin 2004). More generally, MAP inference can be performed efficiently for any log-supermodular weight function (Orlin 2009; Chakrabarty, Jain, and Kothari 2014; Fujishige 1980). Our perturbation preserves the submodularity of − log w(x), since ⟨c, x⟩ can be viewed as n independent single variable perturbations, so we have an efficient optimization oracle whenever the original weight function is log-supermodular. Further, notice that this is a much weaker assumption than the optimization oracle required by randomized hashing methods (Chakraborty, Meel, and Vardi 2013; Ermon et al. 2014; Zhao et al. 2016).
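For concreteness, here is a brute-force stand-in for the oracle of Assumption 1, usable only for small n; in practice the maximization would be carried out by a problem-specific solver such as graph cuts or a matching algorithm. The function name and interface are our own illustration:

```python
import itertools
import math

def delta_oracle_bruteforce(c, log_w):
    """Brute-force oracle for Eq. (2):
    delta(c, w) = max over x in {-1,1}^n of <c, x> + log w(x).
    `log_w(x)` should return log2 w(x), with -inf when w(x) = 0."""
    n = len(c)
    best = -math.inf
    for x in itertools.product([-1, 1], repeat=n):
        dot = sum(ci * xi for ci, xi in zip(c, x))
        best = max(best, dot + log_w(x))
    return best
```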
If an approximate optimization oracle exists that can find a value within some known bound of the maximum, we can modify our bounds to use the approximate oracle. This may improve the efficiency of our algorithm or extend its use to additional problem classes. For the class of log-supermodular distributions, approximate MAP inference is equivalent to performing approximate submodular minimization (Jegelka, Lin, and Bilmes 2011) .
We note that even when an efficient optimization oracle exists, the problem of exactly calculating Z(w) is generally still hard. For example, polynomial time algorithms exist for finding the maximum weight perfect matching in a weighted bipartite graph. However, computing the permanent of a bipartite graph's adjacency matrix, which equals the sum of weights over all perfect matchings, i.e., Z(w), is still #P-complete (Jerrum, Sinclair, and Vigoda 2004). A fully polynomial randomized approximation scheme (FPRAS) exists (Jerrum, Sinclair, and Vigoda 2004; Bezáková et al. 2006), based on Markov chain Monte Carlo sampling over all perfect matchings. However, the polynomial time complexity of this algorithm has a large degree, limiting its practical use.
Weighted Rademacher Bounds on Z(w)
Our approach for estimating the sum Z(w) = \sum_x w(x) is based on the idea that the Rademacher complexity of a set is related to the set's size. In particular, Rademacher complexity is monotonic in the sense that R(A) ≤ R(B) whenever A ⊆ B. Note that the converse form of monotonicity does not hold: |A| ≤ |B| does not imply R(A) ≤ R(B); that is, R(A) is monotonic in the contents of A but not necessarily in its size. We estimate the sum of arbitrary non-negative elements by generalizing the Rademacher complexity in Definition 2 below.

Definition 2. We define the weighted Rademacher complexity of a weight function w : {−1, 1}^n → [0, ∞) as

R(w) = E_c \left[ \max_{x \in \{-1,1\}^n} \{ \langle c, x \rangle + \log w(x) \} \right],   (3)

for c sampled uniformly from {−1, 1}^n.
In the notation of Eq. (2), the weighted Rademacher complexity is simply R(w) = E_c[δ(c, w)]. For a set A ⊆ {−1, 1}^n, let I_A : {−1, 1}^n → {0, 1} denote the indicator weight function for A, defined by I_A(x) = 1 ⟺ x ∈ A. Then R(I_A) = R(A); that is, the weighted Rademacher complexity is identical to the standard Rademacher complexity for indicator weight functions. For a general weight function, the weighted Rademacher complexity extends the standard Rademacher complexity by giving each element (hypothesis) its own weight.
Algorithmic Strategy
The key idea of this paper is to use the weighted Rademacher complexity R(w) to provide probabilistic estimates of Z(w), the total weight of w.
This is a reasonable strategy because as we have seen before, for an indicator weight function I A : {−1, 1} n → {0, 1}, R(I A ) reduces to the standard Rademacher complexity R(A), and Z(I A ) = |A| is simply the cardinality of the set. Therefore we can use known quantitative relationships between R(A) and |A| from learning theory to estimate |A| = Z(I A ) in terms of R(A) = R(I A ). Although not formulated in the framework of Rademacher complexity, this is the strategy used by Barvinok (1997) .
Here, we generalize these results to general weight functions w and show that it is, in fact, possible to use R(w) to obtain estimates of Z(w). This observation can be turned into an algorithm by observing that R(w) is the expectation of a random variable concentrated around its mean. Therefore, as we will show in Proposition 2, a small number of samples suffices to reliably estimate R(w) (and hence, Z(w)) with high probability. Whenever w is 'sufficiently nice' and we have access to an optimization oracle, the estimation algorithm is efficient.
Algorithm 1 Rademacher Estimate of log Z(w)

Inputs: A positive integer k and a weight function w : {−1, 1}^n → [0, ∞).
Output: A number δ̂_k(w) which approximates log Z(w) = log \sum_{x \in \{-1,1\}^n} w(x).

1. Sample k vectors c^1, c^2, ..., c^k independently and uniformly from {−1, 1}^n.
2. Apply the optimization oracle of Assumption 1 to each vector c^i and compute the mean

\hat{\delta}_k(w) = \frac{1}{k} \sum_{i=1}^{k} \max_{x \in \{-1,1\}^n} \{ \langle c^i, x \rangle + \log w(x) \}.

3. Output δ̂_k(w) as an estimator of R(w) and thus of log Z(w).
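A direct Python transcription of Algorithm 1 (a sketch; `oracle` is any implementation of Assumption 1, e.g. the brute-force one above):

```python
import random

def rademacher_estimate(n, oracle, k, seed=0):
    """Algorithm 1: estimate R(w), and hence log Z(w), by averaging
    k perturbed MAP values delta(c, w)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        c = [rng.choice((-1, 1)) for _ in range(n)]
        total += oracle(c)          # max_x { <c, x> + log w(x) }
    return total / k
```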
Bounding Weighted Rademacher Complexity
The weighted Rademacher complexity is an expectation over optimization problems. Each optimization problem is defined by sampling a vector (equivalently, a direction, since all such vectors have length √n) uniformly from {−1, 1}^n and finding the vector x that is most aligned with it (largest dot product) after adding log w(x). Our first objective is to derive bounds on the weighted Rademacher complexity in terms of the sum Z(w).
We begin with the observation that it is impossible to derive bounds on the Rademacher complexity in terms of set size that are tight for sets of all shapes. To gain intuition, note that in high dimensional spaces the dot product of any particular vector and another chosen uniformly at random from {−1, 1}^n is close to 0 with high probability. The distribution of weight vectors throughout the space may take any geometric form. One extreme configuration is that all vectors with large weights are packed tightly together, forming a Hamming ball. At the other extreme, all vectors with large weights could be distributed uniformly through the space. As Figure 1 illustrates, a large set of tightly packed vectors and a small set of well-distributed vectors will both have similar Rademacher complexity. Thus, bounds on Rademacher complexity that are based on the underlying set's size fundamentally cannot always be tight for all distributions. Nevertheless, the lower and upper bounds we derive next are tight enough to be useful in practice.

Figure 1: Illustration mapping a set of vectors in high dimensional space {−1, 1}^n to the unit circle. Red regions correspond to regions of space that have a large dot product with some vector in the set. Left: when the size of a set is small, very few regions have a large dot product with any vector in the set, so the Rademacher complexity will be small. Right: when a large set of vectors is tightly packed in a small region of space, the Rademacher complexity will remain relatively small. In both the left and right figures we have similar (small) Rademacher complexities, yet different set sizes. This illustrates why tight bounds on the set size based on Rademacher complexity are difficult to achieve.
Lower bound. To lower bound the weighted Rademacher complexity we adapt the technique of Barvinok (1997) for lower bounding the standard Rademacher complexity. The high level idea is that the space {−1, 1}^n can be mapped to the leaves of a binary tree. By following a path from the root to a leaf, we divide the space in half n times, until we arrive at a leaf which corresponds to a single element (with some fixed weight). By judiciously choosing which half of the space (branch of the tree) to recurse into at each step, we derive the bound in Lemma 1, whose proof is given in the appendix.

Lemma 1. For any β ∈ (0, 1/2), the weighted Rademacher complexity of a weight function w : {−1, 1}^n → [0, ∞) is lower bounded by

R(w) \ge \log w^*(\beta) + \frac{n \log(1-\beta) + \log Z(w) - \log w^*(\beta)}{\log \frac{1-\beta}{\beta}},

with

w^*(\beta) = \begin{cases} w_{\max} = \max_x w(x), & \text{if } \beta \ge 1/3, \\ w_{\min} = \min_x \{ w(x) : w(x) > 0 \}, & \text{if } \beta \le 1/3. \end{cases}
Upper bound. In the unweighted setting, a standard upper bound on the Rademacher complexity is used in learning theory to show that the Rademacher complexity of a small hypothesis class is also small, often to prove PAC-learnability. Massart's Lemma (see Shalev-Shwartz and Ben-David (2014), Lemma 26.8) formally upper bounds the Rademacher complexity in terms of the size of the set. This result is intuitive since, as we have noted, the dot product of any one vector x ∈ {−1, 1}^n with most other vectors c ∈ {−1, 1}^n is small. Therefore, if the set is small the Rademacher complexity must also be small. Adapting the proof technique of Massart's Lemma to the weighted setting, we arrive at the following bound:
Lemma 2. For any λ > 0, γ > 0, and weight functions w, w_γ : {−1, 1}^n → [0, ∞) with w_γ(x) = w(x)^γ, the weighted Rademacher complexity of w_γ is upper bounded by

R(w_\gamma) \le \frac{1}{\lambda} \log Z(w) + \frac{\lambda\gamma - 1}{\lambda} \log w^*(\lambda, \gamma) + \frac{\lambda n}{2},   (4)

with

w^*(\lambda, \gamma) = \begin{cases} w_{\max} = \max_x w(x), & \text{if } \lambda\gamma \ge 1, \\ w_{\min} = \min_x \{ w(x) : w(x) > 0 \}, & \text{if } \lambda\gamma \le 1. \end{cases}
Note that for an indicator weight function we recover the bound from Massart's Lemma by setting λ = \sqrt{2 \log Z(w) / n} and γ = 1.

Corollary 2.1. For sufficiently large γ and λ = \sqrt{\frac{2}{n} \log \frac{Z(w)}{w_{\max}}}, we recover the bound w_max ≤ Z(w) from Lemma 2.

Lemma 2 holds for any λ > 0 and γ > 0. In general we set γ = 1 and optimize over λ to make the bound as tight as possible, comparing the result with the trivial bound given by Corollary 2.1. More sophisticated optimization strategies over λ and γ could result in a tighter bound. Please see the appendix for further details and proofs.
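For example, with γ = 1 the right-hand side of Eq. (4) can be minimized over λ in closed form, assuming the minimizer is consistent with the case defining w* (a derivation we include here for concreteness). Writing a = \log Z(w) - \log w^*, the bound becomes

R(w) \le \frac{a}{\lambda} + \log w^* + \frac{\lambda n}{2},

which is minimized at \lambda^* = \sqrt{2a/n}, giving

R(w) \le \log w^* + \sqrt{2n \left( \log Z(w) - \log w^* \right)}.

For an indicator weight function (w^* = 1, Z(w) = |A|) this is exactly Massart's bound \sqrt{2n \log |A|}.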
Bounding the Weighted Sum Z(w)
With our bounds on the weighted Rademacher complexity from the previous section, we now present our method for efficiently bounding the sum Z(w). Proposition 2 states that we can estimate the weighted Rademacher complexity using the optimization oracle of Assumption 1.

Proposition 2. For c ∈ {−1, 1}^n sampled uniformly at random, the bound

R(w) - \sqrt{6n} \le \delta(c, w) \le R(w) + \sqrt{6n}   (5)

holds with probability greater than .95.

Proof. Apply Proposition 1 to the function f_w(c) = δ(c, w), noting that flipping a single coordinate of c changes δ(c, w) by at most d_i = 2. This gives P[|δ(c, w) − R(w)| > √(6n)] ≤ e^{−3} ≤ .05, which finishes the proof.
To bound Z(w) we use our optimization oracle to solve a perturbed optimization problem, giving an estimate of the weighted Rademacher complexity R(w). Next we invert the bounds on R(w) (Lemmas 1 and 2) to obtain bounds on Z(w). We optimize the parameters λ and β (from Lemmas 2 and 1, respectively) to make the bounds as tight as possible. By applying our optimization oracle repeatedly, we can reduce the slack introduced when estimating R(w) (by Proposition 2) and arrive at our bounds on the sum Z(w), stated in the following theorem.

Theorem 1. With probability at least 0.95, the sum Z(w) = \sum_{x \in \{-1,1\}^n} w(x) of any weight function w : {−1, 1}^n → [0, ∞) is bounded by the outputs ψ_LB and ψ_UB of Algorithms 2 and 3 as

\psi_{LB} < \log Z(w) < \psi_{UB}.
Algorithm 2 Rademacher Lower Bound for log Z(w)

Inputs: The estimator δ̂_k(w) output by Algorithm 1, the k used to compute δ̂_k(w), and optionally w_min and w_max.
Output: A number ψ_LB which lower bounds log Z(w).

1. If log w_min was provided as input, calculate

\lambda = \frac{\hat{\delta}_k(w) - \sqrt{6n/k} - \log w_{\min}}{n}.

2. If log w_min was provided as input and λ ≤ 1,

\psi_{LB} = \frac{\left( \hat{\delta}_k(w) - \sqrt{6n/k} - \log w_{\min} \right)^2}{2n} + \log w_{\min}.

3. Otherwise,

\psi_{LB} = \hat{\delta}_k(w) - \sqrt{6n/k} - \frac{n}{2}.

4. Output the lower bound max{ψ_LB, log w_max}.
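A Python transcription of Algorithm 2 (a sketch; all logarithms are base 2, matching the paper's convention, and the optional inputs are passed as log-weights):

```python
import math

def rademacher_lower_bound(delta_k, n, k, log_w_min=None, log_w_max=None):
    """Algorithm 2: high-probability lower bound on log2 Z(w)
    from the Algorithm 1 estimate delta_k."""
    slack = math.sqrt(6 * n / k)
    if log_w_min is not None:
        lam = (delta_k - slack - log_w_min) / n
        if lam <= 1:                          # step 2
            psi_lb = ((delta_k - slack - log_w_min) ** 2 / (2 * n)
                      + log_w_min)
        else:                                 # step 3
            psi_lb = delta_k - slack - n / 2
    else:                                     # step 3
        psi_lb = delta_k - slack - n / 2
    if log_w_max is not None:                 # step 4
        psi_lb = max(psi_lb, log_w_max)
    return psi_lb
```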
Experiments
The closest line of work to this paper showed that the partition function can be bounded by solving an optimization problem perturbed by Gumbel random variables (Hazan and Jaakkola 2012; Hazan, Maji, and Jaakkola 2013; Hazan et al. 2016; Kim, Sabharwal, and Ermon 2016; Balog et al. 2017) . This approach is based on the fact that
\ln Z(w) = E_\gamma \left[ \max_{x \in \{-1,1\}^n} \{ \ln w(x) + \gamma(x) \} \right],

where all 2^n random variables γ(x) are sampled from the Gumbel distribution with scale 1 and shifted by the Euler-Mascheroni constant to have mean 0. Perturbing all 2^n states with IID Gumbel random variables is intractable, leading the authors to bound ln Z(w) by perturbing states with a combination of low dimensional Gumbel perturbations. Specifically, the following upper bound (Hazan et al. 2016) and lower bound (Balog et al. 2017, p. 6) hold in expectation, where the γ_i(x_i) for i = 1, ..., n are sampled from the Gumbel distribution with scale 1 and shifted by the Euler-Mascheroni constant to have mean 0:
\ln Z(w) \le \Theta_{UB} = E_\gamma \left[ \max_{x \in \{-1,1\}^n} \left\{ \ln w(x) + \sum_{i=1}^n \gamma_i(x_i) \right\} \right]

\ln Z(w) \ge \Theta_{LB} = E_\gamma \left[ \max_{x \in \{-1,1\}^n} \left\{ \ln w(x) + \sum_{i=1}^n \frac{1}{n} \gamma_i(x_i) \right\} \right]
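For reference, here is a brute-force Monte Carlo sketch of the Gumbel expectation upper bound Θ_UB (our illustration, feasible only for tiny n; note these bounds use natural log, and one Gumbel variable is drawn per (coordinate, spin value) pair):

```python
import itertools
import math
import numpy as np

EULER = 0.5772156649015329  # Euler-Mascheroni constant

def gumbel_upper_bound_mc(ln_w, n, k, seed=0):
    """Monte Carlo estimate of Theta_UB: perturb each coordinate with
    an independent mean-zero Gumbel and average the perturbed maxima.
    Brute-force maximization over {-1,1}^n, for illustration only."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(k):
        # gamma_i(x_i): one Gumbel per (coordinate, spin value) pair
        g = rng.gumbel(loc=-EULER, scale=1.0, size=(n, 2))
        best = -math.inf
        for x in itertools.product([-1, 1], repeat=n):
            pert = sum(g[i, (xi + 1) // 2] for i, xi in enumerate(x))
            best = max(best, ln_w(x) + pert)
        total += best
    return total / k
```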
To obtain bounds that hold with high probability using Gumbel perturbations we calculate the slack term (Hazan et al. 2016, p. 32)

g = \min\left\{ 2\sqrt{n}\left(1 + \sqrt{\frac{1}{2k}\ln\frac{2}{\alpha}}\right)^2, \; \sqrt{n}\,\max\left\{ \frac{4}{k}\ln\frac{2}{\alpha}, \sqrt{\frac{32}{k}\ln\frac{2}{\alpha}} \right\} \right\},

giving upper and lower bounds θ_UB = Θ_UB + g and θ_LB = Θ_LB − g/n that hold with probability 1 − α, where k samples are used to estimate the expectation bounds.
Algorithm 3 Rademacher Upper Bound for log Z(w)

Inputs: The estimator δ̂_k(w), the k used to compute δ̂_k(w), and optionally w_min and w_max.
Output: A number ψ_UB which upper bounds log Z(w).

1. If w_min was provided as input, calculate

\beta_{\min} = \frac{\hat{\delta}_k(w) + \sqrt{6n/k} - \log w_{\min}}{n}.

2. If w_max was provided as input, calculate

\beta_{\max} = \frac{\hat{\delta}_k(w) + \sqrt{6n/k} - \log w_{\max}}{n}.

3. Set the value

\beta_{opt} = \begin{cases} \beta_{\min}, & \text{if } 0 < \beta_{\min} < 1/3, \\ \beta_{\max}, & \text{if } 1/3 < \beta_{\max} < 1/2, \\ 1/2, & \text{if } 1/2 < \beta_{\max}, \\ 1/3, & \text{otherwise}. \end{cases}

4. Output the upper bound ψ_UB:

(a) If β_opt = 1/3, ψ_UB = δ̂_k(w) + √(6n/k) + n log(3/2).
(b) If β_opt = 1/2, ψ_UB = n + log w_max.
(c) Otherwise,

\psi_{UB} = n\beta_{opt} \log\frac{1-\beta_{opt}}{\beta_{opt}} - n \log(1-\beta_{opt}) + \log w^*,

where

w^* = \begin{cases} w_{\min}, & \text{if } \beta_{opt} < 1/3, \\ w_{\max}, & \text{if } \beta_{opt} > 1/3. \end{cases}
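And a matching Python transcription of Algorithm 3 (a sketch, logs base 2, optional inputs passed as log-weights):

```python
import math

def rademacher_upper_bound(delta_k, n, k, log_w_min=None, log_w_max=None):
    """Algorithm 3: high-probability upper bound on log2 Z(w)."""
    slack = math.sqrt(6 * n / k)
    beta_min = ((delta_k + slack - log_w_min) / n
                if log_w_min is not None else None)
    beta_max = ((delta_k + slack - log_w_max) / n
                if log_w_max is not None else None)
    if beta_min is not None and 0 < beta_min < 1/3:
        beta_opt, log_w_star = beta_min, log_w_min
    elif beta_max is not None and 1/3 < beta_max < 1/2:
        beta_opt, log_w_star = beta_max, log_w_max
    elif beta_max is not None and beta_max > 1/2:
        return n + log_w_max                       # step 4(b)
    else:
        beta_opt, log_w_star = 1/3, None
    if beta_opt == 1/3:                            # step 4(a)
        return delta_k + slack + n * math.log2(3/2)
    return (n * beta_opt * math.log2((1 - beta_opt) / beta_opt)
            - n * math.log2(1 - beta_opt) + log_w_star)   # step 4(c)
```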
We note that the Gumbel expectation upper bound takes nearly the same form as the weighted Rademacher complexity, with two differences: the perturbation is sampled from a Gumbel distribution rather than being the dot product with a vector of Rademacher random variables, and, without scaling, the two bounds are naturally written in different log bases.
We experimentally compare our bounds with those obtained by Gumbel perturbations on two models. First we bound the partition function of the spin glass model from (Hazan et al. 2016). For this problem the weight function is given by the unnormalized probability distribution of the spin glass model. Second we bound the propositional model counts (#SAT) for a variety of SAT problems. This problem falls into the unweighted category where every weight is either 0 or 1: every satisfying assignment has weight 1, and we bound the total number of satisfying assignments.

Figure 2: Bounds for a 7x7 spin glass model with k = 5 (for both methods), that hold with probability .95. Our bounds and estimator are scaled to match the Gumbel log base e bounds.
Spin Glass Model
Following (Hazan et al. 2016), we bound the partition function of a spin glass model with variables x_i ∈ {−1, 1} for i = 1, 2, ..., n, where each variable represents a spin. Each spin has a local field parameter θ_i which corresponds to its local potential function θ_i(x_i) = θ_i x_i. We performed experiments on grid shaped models where each spin variable has 4 neighbors, unless it lies on the boundary of the grid. Neighboring spins interact with coupling parameters θ_{i,j}(x_i, x_j) = θ_{i,j} x_i x_j.
The potential function of the spin glass model is

\theta(x_1, x_2, \dots, x_n) = \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{i,j} x_i x_j,

with corresponding weight function

w(x) = \exp\left( \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{i,j} x_i x_j \right).
We compare our bounds on a 7x7 spin glass model. We sampled the local field parameters θ i uniformly at random from [−1, 1] and the coupling parameters uniformly at random from [0, c) with c varying. Non-negative coupling parameters make it possible to perform MAP inference efficiently using the graph-cuts algorithm (Kolmogorov and Zabin 2004; Greig, Porteous, and Seheult 1989) . We used the python maxflow module wrapping the implementation from Boykov and Kolmogorov (2004) . Figure 2 shows bounds that hold with probability .95, where all bounds are computed with k = 5. For this value of k, our approach produces tighter upper bounds than using Gumbel perturbations. The crossover to a tighter Gumbel perturbation upper bound occurs around k ≈ 15. Lower bounds are equivalent, although we note it is trivial to recover this bound by simply calculating the largest weight over all states.
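To make the setup concrete, the following end-to-end sketch builds a tiny 3x3 version of this model (small enough to enumerate exactly, unlike the 7x7 grid above) and compares exact log_2 Z against our estimate and bounds. It reuses `delta_oracle_bruteforce`, `rademacher_estimate`, `rademacher_lower_bound`, and `rademacher_upper_bound` from the sketches above, with brute force standing in for the graph-cuts oracle:

```python
import itertools
import math
import random

random.seed(0)
rows = cols = 3
n = rows * cols
theta = [random.uniform(-1, 1) for _ in range(n)]      # local fields
edges = ([(r * cols + c, r * cols + c + 1)
          for r in range(rows) for c in range(cols - 1)] +
         [(r * cols + c, (r + 1) * cols + c)
          for r in range(rows - 1) for c in range(cols)])
coupling = {e: random.uniform(0, 1) for e in edges}    # c = 1

def log2_w(x):
    """log2 of w(x) = exp(sum_i theta_i x_i + sum_(i,j) theta_ij x_i x_j)."""
    t = sum(th * xi for th, xi in zip(theta, x))
    t += sum(th * x[i] * x[j] for (i, j), th in coupling.items())
    return t / math.log(2)

# Ground truth by explicit enumeration over 2^9 states.
exact = math.log2(sum(2.0 ** log2_w(x)
                      for x in itertools.product([-1, 1], repeat=n)))
k = 5
est = rademacher_estimate(n, lambda c: delta_oracle_bruteforce(c, log2_w), k)
lw_max = max(log2_w(x) for x in itertools.product([-1, 1], repeat=n))
print(f"exact log2 Z = {exact:.2f}, estimate = {est:.2f}")
print(f"bounds: [{rademacher_lower_bound(est, n, k, log_w_max=lw_max):.2f}, "
      f"{rademacher_upper_bound(est, n, k, log_w_max=lw_max):.2f}]")
```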
Propositional Model Counting
Next we evaluate our method on the problem of propositional model counting. Given a Boolean formula F, the task is to count the assignments x to the underlying Boolean variables for which F evaluates to true. Our weight function is thus given by w(x) = 1 if F(x) evaluates to true, and w(x) = 0 otherwise.
We performed MAP inference on the perturbed problem using the weighted partial MaxSAT solver MaxHS (Davies 2013). Ground truth was obtained for a variety of models¹ using three exact propositional model counters (Thurley 2006; Sang et al. 2004; Oztok and Darwiche 2015)². Table 1 shows bounds that hold with probability .95 with k = 1. While the Gumbel lower bounds are always trivial, we produce nontrivial lower bounds for several model instances. Our upper bounds are generally comparable to or tighter than the Gumbel upper bounds.
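For the perturbed MAP problem itself, one way to realize the reduction to weighted partial MaxSAT (our sketch; the paper does not spell out its exact encoding) is to make the clauses of F hard and add one unit soft clause per variable preferring agreement with c_i. Since ⟨c, x⟩ = n − 2·(#disagreements) and log w(x) = 0 on models of F, the oracle value is δ(c, w) = n − 2·(optimal soft cost):

```python
def perturbed_maxsat_wcnf(clauses, n, c):
    """Emit a DIMACS WCNF instance whose optimum encodes
    max_{x satisfying F} <c, x>. `clauses`: list of integer literal
    lists (the CNF of F); `c`: list of +/-1 perturbations."""
    top = n + 1                              # exceeds total soft weight
    lines = [f"p wcnf {n} {len(clauses) + n} {top}"]
    for cl in clauses:                       # hard clauses: F itself
        lines.append(f"{top} " + " ".join(map(str, cl)) + " 0")
    for i, ci in enumerate(c, start=1):      # soft: x_i agrees with c_i
        lines.append(f"1 {i if ci == 1 else -i} 0")
    return "\n".join(lines)
```

The resulting file can be handed to any weighted partial MaxSAT solver such as MaxHS; the oracle value is then recovered as n minus twice the reported soft cost.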
Analysis
Our bounds are much looser than those computed by randomized hashing schemes (Chakraborty, Meel, and Vardi 2013; Ermon et al. 2013d; 2013b; Zhao et al. 2016) , but also require much less computation (Ermon et al. 2013c; Achim, Sabharwal, and Ermon 2016) . While our approach provides polynomial runtime guarantees for MAP inference in the spin glass model after random perturbations have been applied, randomized hashing approaches do not. For propositional model counting, we found that our method is computationally cheaper by over 2 orders of magnitude than results reported in Zhao et al. (2016) . Additionally, we tried reducing the runtime and accuracy of randomized hashing schemes by running code from Zhao et al. (2016) with f values of 0, .01, .02, .03, .04, and .05. We set the maximum time limit to 1 hour (while our method required .01 to 6 seconds of computation for reported results). Throughout experiments on models reported in Table 1 our approach still generally required orders of magnitude less computation and also found tighter bounds in some instances.
Empirically, our lower bounds were comparable to or tighter than those obtained by Gumbel perturbations on both models. The weighted Rademacher complexity is generally at least as good an estimator of log Z as the Gumbel upper bound; however, it is only an estimator and not an upper bound. Our upper bound using the weighted Rademacher complexity, which holds in expectation, is empirically weaker than the corresponding Gumbel expectation upper bound. However, the slack term needed to transform our expectation bound into a high probability bound is tighter than the corresponding Gumbel slack term. Since both slack terms approach 0 in the limit of infinite computation (k = ∞, where k is the number of samples used to estimate the expectation bound), this can result in a trade-off where we produce a tighter upper bound up to some value of k, after which the Gumbel bound becomes tighter.

Table 1: Empirical comparison of our estimate (δ̂_1(w)) of and bounds (ψ) on propositional model counts against bounds based on Gumbel perturbations (θ). The mean over 100 runs is shown with the standard deviation in parentheses. Bounds hold with probability .95 and k = 1 for both methods. Tighter bounds are in bold. Meta column descriptions, left to right: model name and information, natural logarithm of ground truth model counts and our estimator, upper bounds, and lower bounds.
Conclusion
We introduced the weighted Rademacher complexity, a novel generalization of Rademacher complexity. We showed that this quantity can be used as an estimator of the size of a weighted set, and gave bounds on the weighted Rademacher complexity in terms of the weighted set size. This allowed us to bound the sum of any non-negative weight function, such as the partition function, in terms of the weighted Rademacher complexity. We showed how the weighted Rademacher complexity can be efficiently approximated whenever an efficient optimization oracle exists, as is the case for a variety of practical problems including calculating the partition function of certain graphical models and the permanent of non-negative matrices. Experimental evaluation demonstrated that our approach provides tighter bounds than competing methods under certain conditions. In future work, our estimator R(w) and bounds on Z(w) may be generalized to other forms of randomness. Rather than sampling c uniformly from {−1, 1}^n, we could conceivably sample each element c_i from some other distribution, such as the uniform distribution over [−1, 1], a Gaussian, or a Gumbel distribution. Our bounds should readily adapt to continuous uniform or Gaussian distributions, although derivations may be more complex in general. As another line of future work, the weighted Rademacher complexity may be useful beyond approximate inference, in learning theory itself.
¹ The models used in our experiments can be downloaded from http://reasoning.cs.ucla.edu/c2d/results.html
² Precomputed model counts were downloaded from https://sites.google.com/site/marcthurley/sharpsat/benchmarks/collected-model-counts