Insights into Parallelism with Intensive Knowledge Sharing


Abstract

Novel search space splitting techniques have recently been successfully exploited to parallelize Constraint Programming and Mixed Integer Programming solvers. We first show how universal hashing can be used to extend one such interesting approach to a generalized setting that goes beyond discrepancy-based search, while still retaining strong theoretical guarantees. We then explain that such static or explicit splitting approaches are not as effective in the context of parallel combinatorial search with intensive knowledge acquisition and sharing, such as parallel SAT, where implicit splitting through clause sharing appears to dominate. Furthermore, we show that in a parallel setting there exists a surprising tradeoff between the well-known communication cost for knowledge sharing across multiple compute nodes and the so far neglected cost incurred by the computational load per node. We provide experimental evidence that one can successfully exploit this tradeoff and achieve reasonable speedups in parallel SAT solving beyond 16 cores.

1 Introduction

There have recently been several successful proposals for parallelizing combinatorial search and optimization, especially in the context of Constraint Programming (CP) and Mixed Integer Programming (MIP), such as by Régin et al. [25] , Moisan et al. [22, 23] , and Fischetti et al. [13] . A desirable strength of these and prior approaches such as the guiding path heuristic [31] is that they achieve parallelization without any communication between the compute cores. In one way or another, they split the underlying search space upfront or statically amongst the k available compute cores, which obviates the need for communication. Unlike search schemes based on global load-balancing or work-stealing [10, 21, 26, 27] , these communication-less approaches compute a static assignment of subproblems (or of subtrees induced by a static assignment of search tree leaves) to each compute core.

We begin by discussing these static splitting, communication-less approaches and proposing a novel generalized static search space splitting scheme that, unlike some of these recent proposals, is not limited to a particular search strategy (e.g., discrepancy-based search) or a class of problem instances. This general scheme uses randomly generated XOR or parity constraints and relies on their desirable universal hashing properties in order to achieve a dynamic balanced split of the search space amongst the compute cores. Prior works by Bordeaux et al. [8] and Plaza et al. [24] have alluded to XOR-based splitting, but only from an empirical perspective and without a formal analysis, especially with respect to the balanced effect of pruning any large-enough subtree of the whole space.

The formal correctness argument for our XOR-based splitting scheme highlights certain assumptions crucial to the success of some of the recent parallelization proposals. We explain why these assumptions, unfortunately, often fail in the context of search algorithms that dynamically learn from failures, such as CDCL (conflict-driven clause learning) solvers for propositional satisfiability (SAT) and Lazy Clause Generation CP solvers. In combinatorial search algorithms that support knowledge acquisition from failures or conflicts, intensively sharing that learned knowledge can provide an implicit way of splitting the search space explored by each core. As long as the solvers running at each core are sufficiently different, one of them will encounter a certain failed state before the others and, by informing them of the reason for its failure, will indirectly prevent them from exploring this failed state as well as any search sub-space that fails for the same reason.

When it comes to knowledge sharing, it is common wisdom that communication across a network is costly. When designing distributed constraint solvers that run on multiple machines and share information, it is deemed desirable to pack as many solvers as possible onto the compute cores of each individual machine, so that inter-machine network latency does not hurt performance. This common wisdom stems from the undisputed fact that communicating across processes (or threads) on a single machine is much faster than communicating across different machines. While true, this reasoning ignores the memory bandwidth aspect, which can in principle have a negative impact on solvers. For example, it is folklore knowledge that running multiple copies of the MIP solver CPLEX [18] on the same machine notably degrades performance. The main reason is that more threads running memory-intensive applications lead to more cache misses and involuntary context switches, both of which negatively impact performance. This happens even if one uses fewer threads than the number of available compute cores, because multiple cores tend to share at least the L3 cache. While the cache performance of SAT solvers on a single compute node has been analyzed earlier [1, 32], its study in the context of multiple compute nodes and the resulting trade-offs remain unexplored.

As a motivating example, we consider the performance of the state-of-the-art parallel SAT solver Plingeling [6] across k = 1, 2, 4, . . . , 32 cores of a single 32-core machine. Plingeling, like most current parallel SAT solvers, implements implicit search space partitioning. The results (geometric average runtime on our dataset, discussed later) are shown in Figure 1. We observe a sharp decline in performance when going from 16 to 32 cores. In this work, we ask: Is this decline in performance caused mainly by duplication of work by cores given the lack of a static search space splitting mechanism, or by over-utilization of the memory bus? We find that the reduced performance can be attributed to the latter and, more surprisingly, to such a large extent that one can, in fact, even successfully trade off the intra-node memory bandwidth bottleneck against the presumably high inter-node network communication cost.

Fig. 1. Performance of Plingeling across k = 1, 2, 4, . . . , 32 cores on a single node compared to GlucoseX10 using various cores-per-node configurations up to 64 cores

Can this surprising trade-off be successfully exploited in practice? To assess this, we first implement as a baseline two static splitting strategies that are promising in the context of SAT, one of them based on universal hashing through XOR constraints with strong theoretical guarantees as mentioned earlier. While a careful implementation of these strategies allows full knowledge sharing amongst the compute cores as well as the utilization of the highly effective dynamic search heuristics embedded in SAT solvers, we find that static splitting strategies have limited success in the context of SAT. However, exploiting our findings about the communication vs. node utilization trade-off, we show that a simple distributed variant of the Glucose 3.0 solver [3], created using the SatX10 framework [7] and performing implicit search space splitting by sharing the shortest 5% of the learned clauses, can continue to scale (i.e., have increasing speedups) on up to 64 cores when the copies of the solver are split carefully across multiple compute nodes. As shown in Figure 1 on our dataset, this simple distributed solver is more than competitive with Plingeling, the winner of the parallel track in the Application instance category of the 2013 SAT Competition [4]. Even though slower by as much as 1.5x when using 1 core, it clearly outperforms Plingeling as one moves to 32 or more cores, demonstrating that our insights can indeed be successfully exploited in practice.¹

2 Generalizing Static Search Space Splitting

We begin this section with a brief recapitulation of recently proposed successful parallelization strategies for CP and MIP search and optimization using what we will refer to as static search space splitting. With k compute cores, the search space is split upfront into k disjoint subspaces and then each core proceeds to search in the subspace it is responsible for. The novelty here is to achieve such a splitting in a way that requires no communication between the compute cores whatsoever, which means communication cost never becomes a bottleneck as the number of compute cores is increased.

An interesting recent example of explicit search space splitting is the so-called embarrassingly parallel search (henceforth referred to as EPS) [25]. Suppose the problem instance has n variables, all of which are binary. The idea is to simply split the entire search space of size 2^n into 2^ñ disjoint subspaces, by fixing the value of some ñ variables, ñ < n, in all possible ways. The resulting 2^ñ subproblems are then divided up equally amongst the k compute cores. As long as 2^ñ ≫ k, by the law of large numbers, one expects the distribution of overall computation load across the compute cores to be roughly uniform. A related scheme [13] recently proposed in the context of MIP makes each core branch on the top ñ variables and then choose a 1/k fraction of the resulting child nodes in a round-robin fashion.
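To make the mechanics concrete, here is a minimal Python sketch of such a split, with hypothetical helper names (the actual EPS implementation [25] differs in many details; the round-robin deal mirrors the MIP variant [13]):

```python
from itertools import product

def eps_split(fixed_vars, k):
    """Enumerate all 2^len(fixed_vars) ways of fixing the chosen variables
    and deal the resulting subproblems to k cores in round-robin fashion."""
    queues = [[] for _ in range(k)]
    for idx, values in enumerate(product([0, 1], repeat=len(fixed_vars))):
        # Each subproblem is the original instance plus this partial assignment.
        queues[idx % k].append(dict(zip(fixed_vars, values)))
    return queues

# Example: fixing 10 variables yields 1024 subproblems; with k = 8 cores,
# each core receives 128 of them (a roughly uniform load as long as the
# number of subproblems greatly exceeds k).
queues = eps_split([f"x{i}" for i in range(10)], k=8)
assert [len(q) for q in queues] == [128] * 8
```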

Another explicit parallelization technique, which in fact inspired some of our work, is the recent proposal by Moisan et al. [22], who study parallelization of a particular class of search heuristics in the context of CP, namely limited discrepancy-based search (LDS) [16]. We will refer to this technique as PLDS. It was shown that it is possible to assign the 2^n leaves of the entire search tree to the k available cores such that the leaves are visited by the k cores jointly in roughly the same order as a sequential LDS, and the total number of search nodes visited by each core c (in the subtree induced by the leaves assigned to c) is no more than (2^n/k) log k. Further, and more importantly, each of the cores is guaranteed to benefit roughly equally from any dynamic pruning of a subtree T′ of the entire search space by constraint propagators. All this is achieved by PLDS notably without any communication between the processors. This idea can also be extended to depth-bounded discrepancy search (DDS) [23].

In the context of SAT, earlier work by Bordeaux et al. [8] suggested a completely distributed strategy where each core fixes a small number ñ of variables (e.g., at random) without coordinating with other cores. Each core is allowed to independently select which ñ variables it wants to fix and the values it wants to assign to them. To ensure completeness as well as to address the high likelihood of load imbalance in this context, the authors employed an interesting strategy of allowing the solver at each core to backtrack over the top ñ variables, but only after it had proved its restricted sub-formula to be unsatisfiable. The authors also suggested more traditional search space splitting strategies that added new constraints to split the search space into disjoint subspaces, but, perhaps in part because of no knowledge sharing, they found the distributed variable fixing strategy to be the most effective. More recently, Heule et al. [17] and van der Tak et al. [30] have proposed the use of more complex inference techniques than unit propagation to split the search space in the first phase of search and then solve the resulting sub-problems in parallel without knowledge sharing.

2.1 Generalized Splitting Using Universal Hashing

The theoretical balancing guarantees provided by the PLDS approach can in fact be extended to a more general setting for dynamic search heuristics that go well beyond LDS and DDS, including the conflict analysis driven heuristics employed by SAT solvers (e.g., VSIDS) as well as impact based search (IBS) in CP. We discuss here a novel way to achieve this in a search-independent and problem-independent manner, using parity or XOR constraints.

XOR constraints of length ℓ over binary variables x_i are constraints of the form ∑_{i=1}^{ℓ} x_i = p (mod 2), where p ∈ {0, 1} is referred to as the parity of the constraint. When generated at random by choosing the set of variables as well as the parity p uniformly amongst all choices, XOR constraints (of large enough length) act as a family of uniform hash functions, resulting in desirable search space splitting properties that have been exploited in theoretical computer science [5, 29] as well as in the design of practically efficient approaches for approximating the number of solutions of a combinatorial problem and for probabilistic inference [9, 11, 14]. In the interest of space, we refer the reader to any of these other works for formal properties of XOR constraints. In our context, they have precisely the key properties exploited by the PLDS approach. We discuss below how this observation can be exploited.
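For reference, the hashing property used here can be stated as follows (a standard formulation, not quoted verbatim from the works cited above): let h map each full assignment σ to the core index in {1, …, k} determined by the pattern of parities σ induces on log k uniformly random XOR constraints. Then, for any two distinct assignments σ ≠ σ′ and any (not necessarily distinct) cores i, i′,

$$ \Pr[h(\sigma) = i] \;=\; \frac{1}{k}, \qquad \Pr\big[h(\sigma) = i \,\wedge\, h(\sigma') = i'\big] \;=\; \frac{1}{k^2}, $$

i.e., assignments are hashed into the k cores uniformly and pairwise independently.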

Consider a sequential search algorithm S. Given a problem instance I on n binary variables, let σ be the ordered sequence of the subset of the 2^n leaves (at depth n) of the underlying search tree T (of size 2^n) that S visits when operating on I. To create a parallel version S^k of S that utilizes k compute cores, where for simplicity of exposition we assume k is a power of 2, we generate at random log k sets X_j, 1 ≤ j ≤ log k, of ℓ variables each and restrict the search space explored by core i, 1 ≤ i ≤ k, to the sub-space determined by XOR constraints C_ij defined as ∑_{x∈X_j} x = b_ij (mod 2), where b_ij is the j-th bit of the log k bit binary representation b_i of the integer i. Let C_i = ∧_j C_ij. The i-th solver S_i of S^k running on core i operates on the restricted problem instance I ∧ C_i. S_i follows the original leaf sequence σ, but simply skips the leaves that do not satisfy its XOR constraints C_i. For efficiency, if there is a subtree T′ of T none of whose leaves satisfy C_i, S_i must identify this fact and not waste time exploring T′ at all. This, in our case, is easily achieved: at least one constraint C_ij* for some j* must be violated by the partial truth assignment that defines the root node of T′, which means that the XOR propagator for C_ij* would make S_i fail immediately as soon as it reaches T′.
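The construction can be sketched as follows (a minimal Python illustration with hypothetical names and zero-based core indices; not the implementation evaluated in this paper):

```python
import random
from itertools import product

def generate_xor_sets(variables, num_xors, length, seed=0):
    """Draw num_xors random variable subsets X_j, each of the given length."""
    rng = random.Random(seed)
    return [rng.sample(variables, length) for _ in range(num_xors)]

def core_parities(core_index, num_xors):
    """Bits b_ij of the binary representation of the (zero-based) core index."""
    return [(core_index >> j) & 1 for j in range(num_xors)]

def leaf_belongs_to_core(leaf, xor_sets, parities):
    """A leaf (dict: variable -> 0/1) satisfies C_i iff every C_ij holds."""
    return all(sum(leaf[v] for v in X) % 2 == b
               for X, b in zip(xor_sets, parities))

# Example: k = 4 cores, hence log k = 2 XORs, here of length 3 over 6 variables.
variables = list(range(6))
xor_sets = generate_xor_sets(variables, num_xors=2, length=3)
# The constraints C_i partition the space: every leaf belongs to exactly one core.
for bits in product([0, 1], repeat=len(variables)):
    leaf = dict(zip(variables, bits))
    owners = [i for i in range(4)
              if leaf_belongs_to_core(leaf, xor_sets, core_parities(i, 2))]
    assert len(owners) == 1
```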

This restriction scheme ensures that every pair S_i, S_i′ of solvers operates in disjoint subspaces of T, and that the k cores together cover all of T. Formally:

Proposition 1. For constraints C_i as defined above and for i ≠ i′, C_i ∧ C_i′ = ⊥ and ∨_i C_i = ⊤. Further, (I ∧ C_i) ∧ (I ∧ C_i′) = ⊥ and ∨_i (I ∧ C_i) = I.

Proof. Let j* be a bit in which the log k bit binary representations of i and i′ differ, i.e., b_ij* ≠ b_i′j*. Then

$$ (C_i \wedge C_{i'}) = \Big(\textstyle\bigwedge_j C_{ij}\Big) \wedge \Big(\textstyle\bigwedge_j C_{i'j}\Big) \Rightarrow (C_{ij^*} \wedge C_{i'j^*}) \Rightarrow \Big(\textstyle\sum_{x \in X_{j^*}} x = b_{ij^*} \bmod 2\Big) \wedge \Big(\textstyle\sum_{x \in X_{j^*}} x = b_{i'j^*} \bmod 2\Big) \Rightarrow \big(b_{ij^*} = b_{i'j^*}\big), $$

which, by the choice of j*, is never the case. Hence, C_i ∧ C_i′ = ⊥. On the other hand, since the b_i range over all log k bit binary strings,

$$ \bigvee_i C_i = \bigvee_i \bigwedge_j C_{ij} = \bigwedge_j \bigvee_i C_{ij} = \bigwedge_j \bigvee_i \Big(\textstyle\sum_{x \in X_j} x = b_{ij} \bmod 2\Big) = \top. $$

The remaining claims now follow from these results. First, (I ∧ C_i) ∧ (I ∧ C_i′) = I ∧ (C_i ∧ C_i′) = ⊥. Next, ∨_i (I ∧ C_i) = I ∧ ∨_i C_i = I ∧ ⊤ = I.

We thus have a static partition of the search space amongst the k solvers. Moreover, the partition is balanced in the sense that each solver S_i gets precisely a 1/k fraction of the overall 2^n size search space T (irrespective of I). Most interestingly, the uniform hashing properties of XORs guarantee that, with large enough ℓ, with high probability, every large-enough subspace of T has roughly equal representation in each of the k compute cores, which act as k "buckets" for the underlying hash function. This is formalized in the following theorem, which notably is independent of the properties of the search algorithm S (e.g., using LDS or not) or of the problem instance I.

Theorem 1. Let S^k be the parallel constraint solver for k cores as described above, operating on a problem instance I over n binary variables forming the 2^n size search space T. For ℓ = n/2, ε ∈ (0, 1), δ > 0, and k ≤ 2^{n/(2+δ)}:

1. the entire subtree T_i of T induced by the leaves assigned to S_i contains no more than (2^{n+1}/k) log k internal and leaf nodes combined; and

2. for any arbitrarily chosen subtree T′ of T with L ≥ k^{2+δ}/ε leaves, with probability at least 1 − ε over the choice of the random XOR constraints, the following holds: for any core i, the number of leaves of T′ that are assigned to S_i lies within μ(1 ± k^{−δ/2}), where μ = L/k is the expected value.

Proof. To argue that the first claim holds, we observe that T_i has exactly 2^n/k leaves, which implies that the number of internal nodes of T_i with two children must be exactly 2^n/k − 1. Thus, the total number of nodes in T_i is higher precisely when it has more internal nodes with only one child. As can be seen from a tree rotation argument, the number of internal nodes with only one child is maximized when the leaves of T_i, all at depth n of T, are uniformly spread apart at distance k from each other. In this case, T_i contains all internal nodes of T up to depth n − log k, for a total of 2^{n−log k+1} − 1 nodes, each of which is extended to depth n by a unique path of length log k containing nodes with one child. It follows that the number of nodes in T_i is upper bounded by (2^{n−log k+1} − 1) log k < (2^{n+1}/k) log k, as desired.

In order to prove the second claim, we capitalize on the known fact that log k random XORs of length n/2 act as a universal family of hash functions on the 2^n leaves of T, placing the leaves pairwise independently into k different "buckets", which correspond to our k cores. Let L_i be a random variable (with randomness over the choice of XORs) capturing the number of leaves of T′ assigned to core i. Pairwise independence of the assignment of leaves to cores implies that the variance Var(L_i) of L_i is no more than its expected value E(L_i) = L/k = μ. Applying first the Chebyshev inequality and then the union bound,

$$ \Pr\left[\,|L_i - \mu| \ge \mu k^{-\delta/2}\right] \;\le\; \frac{\mathrm{Var}(L_i)}{\mu^2 k^{-\delta}} \;\le\; \frac{\mu}{\mu^2 k^{-\delta}} \;\le\; \frac{1}{k} \cdot \frac{k^{2+\delta}}{L} \;\le\; \frac{\epsilon}{k}, $$

$$ \Pr\left[\,\exists i.\ |L_i - \mu| \ge \mu k^{-\delta/2}\right] \;\le\; \sum_{i=1}^{k} \Pr\left[\,|L_i - \mu| \ge \mu k^{-\delta/2}\right] \;\le\; \epsilon. $$

Taking the complement of this probability finishes the proof.

We note that although the theorem is stated for ℓ = n/2, Ermon et al. [12] have recently shown that certain desirable hashing properties still hold with XORs of length much smaller than n/2, which are often much easier to propagate. As an illustration of the result, suppose ε = 1/n, δ = 1, and T′ is any subtree of the search space with L ≥ nk³ leaves. Then the theorem states that with probability at least 1 − 1/n over the choice of random XORs, the number of leaves of T′ assigned to S_i will be within (L/k)(1 ± 1/√k), i.e., very close to the ideal balancing value of L/k. As a consequence, each core will benefit roughly equally if a constraint propagator prunes T′.

2.2 Implementing XOR-Based Splitting with Knowledge Sharing

While the above reasoning shows that adding randomly generated XOR constraints can, in principle, qualitatively provide the guarantees of PLDS in a much more generic setting, it is not obvious how best to implement this strategy. One of the parallelization suggestions by Bordeaux et al. [8] was in fact to add random XOR constraints by converting them into a CNF formulation. An XOR constraint with ℓ variables, however, requires adding 2^{ℓ−1} clauses, which quickly becomes impractical as ℓ grows.² Thus, for practical reasons, we limit our evaluation to small values of ℓ. Bordeaux et al. did not find this to be effective, but their tests were performed without communication, while we now test the approach in the presence of knowledge sharing, in the context of SAT. This, however, immediately raises an implementation challenge, which we discuss next.

Since each S_i operates on the original instance conjoined with new constraints that differ from core to core, clauses learned by one core may not be valid for other cores. In principle, one can label each learned clause C as sharable or not based on the information used to derive C. However, due to subtleties in the implementation of modern SAT solvers, it is insufficient to simply check whether the conflict analysis that led to the derivation of C involved one of the clauses encoding an XOR constraint. Other operations in the solver, such as the propagation of unit literals learnt based on XOR constraints and clause base reductions, must also be appropriately altered to take into account the effect of the XOR constraints, which differ from core to core.

An interesting alternative to explicitly adding new constraints X to the formula F is to alter the branching heuristic such that the solver automatically searches only in the assignment subspace that satisfies the constraints X. The idea is to pre-compute all solutions to X over the set of variables appearing in X and add them to a solution pool S. Note that each σ ∈ S is a partial assignment for F. Now one can iterate through these partial assignments σ_1, σ_2, . . . , σ_{|S|} ∈ S, moving from σ_i to σ_{i+1} as soon as the solver refutes the subtree under σ_i. Since we do not add the XORs explicitly as constraints and instead just branch in a way that is consistent with them, the original formula must logically entail any learnt clause. Furthermore, since we enumerate the solution pool S exhaustively, the approach is both sound and complete.

A notable advantage of this approach is that all clauses learnt by any core are valid for all other cores as well, and can thus be freely shared.³ This obviates the need to implement mechanisms to decide which clauses are safe to share and which aren't. On the other hand, for the approach to be practical, the solution pool S must have a succinct representation that the solvers can exploit. For example, if X is the conjunction of log k XOR constraints of length ℓ each on disjoint sets of variables, then |S| = (2^{ℓ−1})^{log k} = k^{ℓ−1}, which can quickly become huge as k grows. To avoid this blow-up, we instead use log k solution sub-pools of size 2^{ℓ−1}, one for each XOR constraint. The branching heuristic first fixes all variables from one sub-pool before moving on to the next sub-pool.
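A minimal sketch of the sub-pool construction (hypothetical Python; it assumes, as stated above, XORs over disjoint variable sets):

```python
from itertools import product

def xor_subpool(xor_vars, parity):
    """All 2^(len-1) assignments of xor_vars whose sum matches the parity.
    Each entry is a partial assignment (dict: variable -> 0/1) for F."""
    return [dict(zip(xor_vars, bits))
            for bits in product([0, 1], repeat=len(xor_vars))
            if sum(bits) % 2 == parity]

# One sub-pool per XOR (here log k = 2 XORs of length 3), instead of the
# exponentially larger pool of all their combinations. The branching
# heuristic fixes all variables from subpools[0][a] first, then from
# subpools[1][b]; when the solver refutes the subtree under the current
# combination, it advances b, and once subpools[1] is exhausted, it
# advances a and resets b.
subpools = [xor_subpool(["x1", "x2", "x3"], parity=1),
            xor_subpool(["x4", "x5", "x6"], parity=0)]
assert all(len(p) == 4 for p in subpools)  # 2^(3-1) assignments each
```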

Our implementation includes a range of variations and extensions. For instance, we experimented with the setting where one core remains completely unaltered while all other cores employ XOR based branching. This strategy was motivated by the fact that altering branching decisions can have a tremendous impact on the search, and adding one unchanged solver to the pool of solvers would retain some of the original search pattern and the solver's flexibility. Furthermore, when branching according to the XOR variables at the top of the search tree we take propagation results directly into account so that an implication that falsifies the current XOR assignment causes us to directly move on to the next assignment. Since we have a sequence of XORs to branch on, this often allows us to skip entire sets of assignments.

2.3 Limits of Static Splitting

To our surprise, none of the approaches discussed above was truly effective in parallelizing SAT solvers. We tried several variations and parameter settings and will describe some representative results in Section 3, Table 1 . As we discuss next, the reason might lie in the "rigid" static search space splitting interfering with the highly dynamic conflict-directed search performed by SAT solvers.

Table 1. Performance of various search space splitting and knowledge sharing approaches for SAT, using k = 32 cores

In general, the recent static splitting proposals discussed at the beginning of this section, which inspired our XOR-based splitting mechanism, unfortunately have significant limitations in the context of search strategies that apply aggressive search space pruning through powerful propagators or that use information learned from failures to guide future search.

For example, the EPS approach and its variation for MIP implicitly assume that once the huge pool of 2^ñ subproblems has been created, one must simply solve each subproblem independently, and that knowledge learned from solving one subproblem cannot significantly help inference on other subproblems. This is clearly not the case for CDCL solvers, which are heavily guided by clauses learned from conflicts and are in fact able to quickly prune new subproblems based on experience from previously encountered subproblems. The same applies to CP solvers that perform conflict analysis and lazy clause generation [28].

The PLDS and XOR-splitting approaches, on the other hand, do not address the rather common case where the number of nodes s visited by a sequential search algorithm is significantly less than 2^n. In fact, s being vastly smaller than 2^n on real-world instances of interest is precisely what allows us to tackle large NP-complete problems in a reasonable amount of time. For example, consider an infeasible instance where fixing ñ ≪ (n − log k) well-selected variables lets constraint propagators already deduce infeasibility. With k = 1024, this condition holds whenever ñ is much smaller than n − 10, which most SAT and CP solvers will guarantee fairly easily. While it does hold that each S_i will not process more than (2^{n+1}/k) log k nodes and subtree pruning will positively impact each S_i roughly equally, this is not a very useful guarantee, as even a well-guided sequential search would take only 2^ñ ≪ 2^n/k steps to begin with! In general, uniform splitting of the naïve search space of size 2^n across k cores does not say much about speedups being close to k unless the underlying search algorithm works in a rather brute-force manner.

Further, PLDS has so far been demonstrated to be effective in a setting where the leaves of the search tree are much more costly to process than internal nodes. This, however, is usually not the case in most SAT, CP, and MIP applications, where node processing time often decreases as one goes deeper in the search tree.

Finally, forcefully fixing variables at the top of the search tree, as is the case in our XOR implementation and the random prefix method [8], seems to often interfere with the dynamic branching choices that tend to make search more effective. This is especially true for SAT solvers, which rely heavily on recent conflicts for branching decisions and whose variable activities often change very rapidly. This behavior is also reflected in the improved performance that we observed when using shorter XORs and random prefixes.

3 Implicit Splitting Through Intensive Knowledge Sharing

An alternative to static splitting is implicit search space splitting, where the k compute cores start exploring the entire search space independently but dynamically communicate to each other, ideally in a succinct fashion, which subspaces they have already explored. We argue that in combinatorial search algorithms that learn from failures, such as CDCL SAT solvers and CP solvers employing Lazy Clause Generation (LCG), implicit search space splitting achieved by sharing succinct reasons for failures can be much more effective than explicit search space splitting schemes such as those discussed above. When such solvers encounter a "conflict" or a failed state, they analyze the reason for the failure in terms of constraint propagations and learn a clause (or a small set of clauses) that would prevent the solver from wasting time in other states that fail for a similar reason. By sharing such learned clauses with other cores, one can implicitly achieve search space splitting, as long as there is enough variation among the solvers executing on different cores that one of them encounters a particular failed state before the others do. While simply having different random seeds at each core can make the search sufficiently different, in our experiments we also vary some of the solver's parameters, such as activity decay rate and restart frequency, across the various cores.

When viewing knowledge sharing as implicit search space splitting, one must revisit the heuristics commonly used to decide how much to share and when. Currently employed heuristics, again guided by the common wisdom of communication latency being the main hurdle to avoid, tend to share perhaps too little information [19]. As the representative results in Table 1 (to be discussed shortly) demonstrate, it can be better to pay the price of additional communication for implicit search space splitting than to use explicit splitting.

The results throughout the paper are for experiments on a cluster of 32-core compute nodes. Each node is a 4x8 3.8 GHz Power7 machine (CHRP IBM 9125-F2C) with 4 MB of cache per CPU and 128 GB of RAM. The nodes are connected via a network that supports the PAMI message passing interface [20]. The evaluation is on SAT Race 2010 instances (all industrial/application category), restricted to the 73 that 1-core Glucose 3.0 could not solve within 10 seconds. The timeout (wall-clock) used was 5,000 seconds; our results remain qualitatively unchanged for smaller timeouts as well.

In Table 1, we compare (a) the implicit splitting approach with each core sharing the shortest 2%, 5%, and 8% of its learned clauses;⁴,⁵ (b) the XOR-based explicit splitting approach with XORs of lengths 2 and 3; and (c) the "random prefix" approach of Bordeaux et al. [8] fixing 5 or 10 variables and enhanced with sharing the shortest 5% of the learned clauses. For the XOR-based splitting approach, longer XORs achieved better load balancing, as expected, but worsened per-core performance, as XOR constraints do not unit propagate very well. Shorter XORs led to significantly increased load imbalance, but we could counter this and somewhat improve the overall performance by (i) restarting an early finishing core on the full problem instance (no XORs) or (ii) using one additional core in parallel on this full instance, in both cases continuing to share information. The table reports numbers for the latter, which, as we see, was still insufficient to outperform implicit search space splitting.

Implicit sharing was nearly always the best and sharing 5% yielded good, stable performance for various values of k. This is the setup we use henceforth.
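Concretely, the "shortest x%" sharing can be realized with the dynamic length cutoff described in footnote 4. The following is a minimal sketch (hypothetical names and adjustment schedule; not the exact GlucoseX10 code):

```python
class ShareFilter:
    """Share the shortest x% of learned clauses by maintaining a dynamic
    length cutoff, adjusted periodically toward the sharing target."""

    def __init__(self, target_fraction=0.05, initial_cutoff=8,
                 adjust_every=1000):
        self.target = target_fraction
        self.cutoff = initial_cutoff
        self.adjust_every = adjust_every
        self.seen = 0
        self.shared = 0

    def should_share(self, clause):
        """Called once per learned clause; returns True if it is broadcast."""
        self.seen += 1
        share = len(clause) <= self.cutoff
        if share:
            self.shared += 1
        if self.seen % self.adjust_every == 0:
            # Sharing too much -> tighten the cutoff; too little -> relax it.
            rate = self.shared / self.seen
            if rate > self.target and self.cutoff > 1:
                self.cutoff -= 1
            elif rate < self.target:
                self.cutoff += 1
        return share
```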

4 The Communication-Utilization Tradeoff

Suppose we have two machines M1 and M2 with 32 compute cores each. Let S be a solver that we can run in parallel on one or both of these machines, sharing information learned from failures across the copies of S. We are interested in minimizing the time the slowest of the parallel copies of S takes. When running k ≤ 32 copies of S on a problem instance I, is it better to run all k copies on M1, or k/2 copies each on M1 and M2?

The answer, it turns out, is not that straightforward. It depends on k (in relation to the number of cores, 32, on each machine), the amount c of communication between each pair of copies of S, and the intensity of memory accesses (and the resulting cache hits and misses) performed by S. As one might expect, when k ≪ 32 and c is small, either option results in about the same performance. As one might also expect, when k ≪ 32 but c is substantial, network latency does play a significant role and it is better to run k copies on M1 rather than communicate across machines. On the other hand, when k ∼ 32 but the amount c of communication is very small, the latency of inter-machine communication does not play a significant role and it is actually better to run k/2 copies on two machines, because constraint solvers often require high memory bandwidth and interfere (at the system level) more with each other the more copies of them are run in parallel on a single machine (we quantify this interference later in the paper). Remarkably, we find that this trend continues to hold for k ∼ 32 even when the amount c of communication is high. Specifically, even if each copy of solver S shares with each of the other k − 1 copies as many as 5% of the clauses that it learns, it is better to run k/2 copies each on M1 and M2 and pay the price of inter-machine network communication than to have k copies compete for memory bandwidth on M1. E.g., it is often better to run 16 copies on 2 machines, and sometimes even better to run 8 copies on 4 machines, than to run 32 copies on a single machine with 32 cores. The best allocation of cores across nodes is clearly dependent on the machine type. However, our empirical observations suggest that the configuration process does not have to be very fine-grained.

We note that with k = 32 copies, sharing 5% of the learned clauses results in a significant amount of communication: each copy of S, when generating m learned clauses per second on its own, is expected to receive m × (k − 1) × 5/100 = 1.55m clauses per second from the other copies of S. In other words, each copy listens more to others than it spends time doing its own deductions.

4.1 Time Profile of SAT Solvers

When working in a parallel setting where solvers run possibly on different machines and intensively share information, it is important to understand where these solvers spend most of their time and what role the communication cost plays. For CDCL SAT solvers, the total time T can be divided up as follows:

$$ T \;=\; T_{\mathrm{conf}} + T_{\mathrm{prop}} + T_{\mathrm{comm}} + T_{\mathrm{misc}} \;=\; \frac{N_{\mathrm{conf}}}{R_{\mathrm{conf}}} + \frac{N_{\mathrm{prop}}}{R_{\mathrm{prop}}} + \frac{N_{\mathrm{comm}}}{R_{\mathrm{comm}}} + T_{\mathrm{misc}} $$

Here T_conf represents the total time spent performing conflict analysis, N_conf represents the number of times conflict analysis is performed, and R_conf = N_conf/T_conf is the associated rate, i.e., the number of conflict analyses performed per unit time. The other symbols similarly correspond to the time, number, and rate of unit propagations and of communication between cores. Since our main interest is in studying parallelization, we compute and report all times in terms of wall-clock time.
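As a simple illustration of this decomposition, the following helper (hypothetical; it assumes the solver exposes per-component timers and counters) computes the relative split and the rates from measured quantities:

```python
def time_profile(T_total, T_conf, T_prop, T_comm, N_conf, N_prop, N_comm):
    """Split total wall-clock time according to
    T = T_conf + T_prop + T_comm + T_misc and report the rates
    R = N / T (events per second) for each measured component."""
    T_misc = T_total - (T_conf + T_prop + T_comm)

    def rate(n, t):
        return n / t if t > 0 else float("nan")

    return {
        "pct_conf": 100.0 * T_conf / T_total,
        "pct_prop": 100.0 * T_prop / T_total,
        "pct_comm": 100.0 * T_comm / T_total,
        "pct_misc": 100.0 * T_misc / T_total,
        "R_conf": rate(N_conf, T_conf),  # conflict analyses per second
        "R_prop": rate(N_prop, T_prop),  # unit propagations per second
        "R_comm": rate(N_comm, T_comm),  # shared clauses processed per second
    }
```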

There have been studies suggesting that SAT solvers spend a substantial amount of time performing unit propagation, and this observation has been used to help understand the behavior and limitations of parallel SAT solvers [15]. To make the numbers concrete in our setting and with our solver, we computed the values of T_conf and T_prop for Glucose 3.0 on our dataset. Figure 2 shows the result in relative terms as a percentage of T. To make relative numbers meaningful, the dataset was restricted to instances needing at least 30 seconds to solve. Clearly, unit propagation does dominate the time Glucose spends solving most instances: 71.01% on average⁶ and never less than 40% on our dataset. Conflict analysis is the next most expensive operation; it can vary from 5% to 45%, with the geometric average being 12.45%. We will return to these observations later.

While it is folklore knowledge that running k independent processes on a single compute node with m cores can slow down each process as k approaches m, the large extent of the slowdown is rather surprising for SAT solvers. To quantify this effect, we ran k independent copies of the 1-core solver on a compute node with 32 cores. The plot in Figure 3, which shows the results on individual instances where Glucose on 1 core took between 600 and 1,800 seconds, demonstrates that the rate R_conf of conflict analysis is significantly reduced as k increases, by as much as 45% when increasing k from 8 independent copies to 32 copies. We obtained similar results for R_prop (omitted due to space limitation).

Fig. 2. Relative split of the total time spent by Glucose into T_conf, T_prop, and T_misc
Fig. 3. Impact on R_conf when running k = 4, 8, 16, 32 independent copies of Glucose. Shown, for a selected set of instances, is R_conf as a percentage of the baseline R_conf obtained by running only a single copy of Glucose.

4.2 Communication Cost vs. Node Utilization

Now suppose we run k cooperating copies of a solver in a parallel setting with information sharing among the copies. How does the slowdown caused by running many copies on a single compute node weigh against the cost of communicating across multiple nodes? Figure 4 depicts this trade-off: on the horizontal axis we have the number N of compute nodes used, on the vertical axis is the performance (measured as the geometric mean of the runtimes across instances), and each curve corresponds to a different total number k of cores used in parallel. Each compute node thus runs k/N solvers in parallel.

Fig. 4. Trade-off: communication cost vs. node utilization

While for small k (e.g., k = 8) performance, as expected, drops when increasing N due to the communication overhead, for larger values of k, increasing N surprisingly leads to significantly better performance. For example, the data shows that when wanting to run k = 32 solvers in parallel, it is substantially better to run only k/N = 16 or even 8 solvers per compute node than to fully utilize the node by running k = 32 solvers on it. Based on these findings, we used the following configurations of N nodes with k/N cores per node (for a total of N × k/N = k cores) to produce data for the GlucoseX10 curve in Figure 1: 1 × 1 = 1, 2 × 4 = 8, 2 × 8 = 16, 4 × 8 = 32, and 8 × 8 = 64.⁷ The figure shows that GlucoseX10, while worse than Plingeling for k = 1, continues to scale reasonably well even up to k = 64 and clearly outperforms Plingeling for k > 16.

To understand this behavior, let us consider the time profile of SAT solvers discussed earlier. By increasing N and keeping everything else unchanged, we clearly decrease the communication rate R_comm. Assuming T_misc is not affected significantly, and neither are the total numbers N_conf and N_prop of conflicts and unit propagations, respectively, the only way the overall time T can decrease is for one or both of the rates R_conf and R_prop to significantly increase. This, indeed, is the case. Across all problem instances, N_conf is not systematically altered one way or another by changing N from 1 to 4 with k = 32 cores in total. However, as expected, R_conf increases quite consistently across all instances, and the geometric mean of R_conf is roughly 20% larger when using 4 nodes compared to 1 node. A similar trend holds also for N_prop and R_prop.

These results highlight the rather surprising tradeoff between the utilization of each compute node and the communication cost across multiple nodes. They also show that this tradeoff can be fruitfully exploited.

5 Concluding Remarks

Limited intra-node memory bandwidth has a substantial impact on the performance of today's combinatorial search methods when several such solvers, or their parallel versions, operate on a single machine. One may naïvely consider this impact smaller than the usually high latency of communication across a network. Our results, however, demonstrate that one can significantly gain in performance by distributing a parallel solver across multiple machines even when the solver employs extensive knowledge acquisition and sharing. For example, a SAT solver learning around 1,000 clauses per second and sharing 5% of what it learns with the other 31 solvers in turn receives 1,000 × 31 × 0.05, or over 1,500, clauses per second. Even then, as our results show, distributing 32 cores across 4 or 8 compute nodes pays off. A testament to the practical importance of this insight is that one is able to significantly outperform the state of the art in parallel SAT solving on 32 or more cores.

This is, of course, not a complete solution to effective parallelization of SAT solvers or CP solvers with lazy clause generation. By using only a subset of the available cores on a machine and letting others idle, we are essentially wasting resources. However, our results suggest that, rather than allocating idle cores to other solvers running in parallel, one should consider other uses of the idle cores. In particular, it may be worthwhile revisiting operations such as unit propagation and conflict analysis, which often take up nearly 90% of the solver's time and must be performed in any case. Our results motivate parallel propagation and conflict analysis schemes, as was also suggested earlier by Hamadi and Wintersteiger [15]. In this context, it is important to note that while unit propagation is P-complete [15], the only known theoretical consequence of this completeness observation is that a q-step unit propagation sequence cannot be parallelized to log^i q parallel steps for any constant i. However, this does not preclude reductions by large constant factors or even asymptotic reductions to, say, √q parallel steps. This may be a more promising use of idle cores than running additional copies of the solver, as it is likely to have a more coherent memory footprint across cores and thus be more amenable to better cache performance.

As the total number of compute cores grows, the communication cost must eventually become dominant. It, therefore, remains important to consider more sophisticated knowledge sharing schemes [2] or more parallelizable proofs [19] , both of which are promising research directions orthogonal to our findings. Any improvements along these lines will affect our approach positively as well and help it scale to more compute cores.

Footnotes

1. We emphasize that this comparison is not meant to argue that GlucoseX10 is superior to Plingeling, but rather to illustrate that there exist unusual yet successful ways of making use of available compute resources.

2. One could alternatively use O(ℓ) new variables to encode the XOR constraint, but this is known to slow down the search [14].

3. Clauses learnt at core i could be filtered based on the alignment of XORs between cores i and j so that only the ones that have a chance of ever being triggered at core j are shared with it.

4. Sharing of the shortest x% learned clauses is implemented by maintaining a dynamic cutoff length L such that all learned clauses of length up to L are shared. L is adjusted periodically to achieve the x% sharing target. In principle, x itself could be adjusted based on the total number k of cores or properties of the problem instance.

5. We also experimented with more sophisticated sharing schemes such as those based on the LBD level of the learned clause [3], but our main findings remained unaffected. To avoid unintended consequences of complex sharing mechanisms, we report results on the simplest setting which still achieved state-of-the-art performance on 64 cores.

6. All results are reported as a geometric mean, which is often more robust to high outliers than the arithmetic mean. Results remain qualitatively unchanged either way.

7. While the specific numbers reported here are based on evaluation on our compute hardware, our qualitative findings are likely to be applicable to other compute systems as well. Simple experimentation can be used to identify the most effective split of k cores of a parallel solver across multiple compute nodes.