Abstract
The success of neural networks has driven a shift in focus from feature engineering to architecture engineering. However, successful networks today are constructed using a small and manually defined set of building blocks. Even in methods of neural architecture search (NAS) the network connectivity patterns are largely constrained. In this work we propose a method for discovering neural wirings. We relax the typical notion of layers and instead enable channels to form connections independent of each other. This allows for a much larger space of possible networks. The wiring of our network is not fixed during training -- as we learn the network parameters we also learn the structure itself. Our experiments demonstrate that our learned connectivity outperforms hand engineered and randomly wired networks. By learning the connectivity of MobileNetV1we boost the ImageNet accuracy by 10% at ~41M FLOPs. Moreover, we show that our method generalizes to recurrent and continuous time networks. Our work may also be regarded as unifying core aspects of the neural architecture search problem with sparse neural network learning. As NAS becomes more fine grained, finding a good architecture is akin to finding a sparse subnetwork of the complete graph. Accordingly, DNW provides an effective mechanism for discovering sparse subnetworks of predefined architectures in a single training run. Though we only ever use a small percentage of the weights during the forward pass, we still play the so-called initialization lottery with a combinatorial number of subnetworks. Code and pretrained models are available at this https URL while additional visualizations may be found at this https URL.
1 Introduction
Deep neural networks have shifted the prevailing paradigm from feature engineering to feature learning. The architecture of deep neural networks, however, must still be hand designed in a process known as architecture engineering. A myriad of recent efforts attempt to automate the process of architecture design by searching among a set of smaller well-known building blocks [24, 27, 30, 16, 1, 17]. While methods of search range from reinforcement learning to gradient based approaches [27, 17], the space of possible connectivity patterns is still largely constrained. NAS methods explore wirings between large blocks. We believe that more efficient solutions may arise from searching the space of wirings at a more fine grained level, i.e. at the level of single channels.
In this work, we consider an unconstrained set of possible wirings by allowing channels to form connections independent of each other. This enables us to discover a wide variety of operations (e.g. depthwise separable convs [10], channel shuffle and split [29], and more). Formally, we treat the network as a large neural graph where each node processes a single channel.
One key challenge lies in searching the space of all possible wirings - the number of possible sub-graphs is combinatorial in nature. When considering thousands of nodes, traditional search methods are either prohibitively expensive or offer only approximate solutions. In this paper we introduce a simple and efficient algorithm for discovering neural wirings (DNW). Our method searches the space of all possible wirings with a simple modification of the backwards pass.
Most similar to our approach is work on randomly wired neural networks [28], which aims to explore the space of novel neural network wirings. Intriguingly, they show that constructing neural networks with random graph algorithms often outperforms a manually engineered architecture. However, these wirings are fixed during training.
Our method for discovering neural wirings is as follows: First, we consider the sole constraint that the total number of edges in the neural graph is fixed to be k. Initially we randomly assign a weight to each edge. We then choose the k edges with the largest magnitude weights and refer to the remaining edges as hallucinated. As we train, we modify the weights of all edges according to a specified update rule. Accordingly, a hallucinated edge may strengthen to the point that it replaces a real edge. We tailor the update rule so that when swapping does occur, it is beneficial.
We consider the application of DNW for static and dynamic neural graphs. In the static regime each node has a single output and the graphical structure is acyclic. In the case of a dynamic neural graph we allow the state of a node to vary with time. Dynamic neural graphs may contain cycles and express popular sequential models such as LSTMs [9]. As dynamic neural graphs are strictly more expressive than static neural graphs, they can also express feed-forward networks (as in Figure 1).
Our work may also be regarded as an effective mechanism for training a sparse network. The Lottery Ticket Hypothesis [6, 7] has demonstrated the existence of sparse sub-networks which may be trained in isolation. However, their method for identifying so-called winning-tickets is quite expensive and requires multiple passes of training. Though we do not train our networks in isolation (we consider a set of hallucinated edges), DNW may be used to train a sparse network in a single pass.
We demonstrate the efficacy of DNW on small and large scale datasets, and for feed-forward, recurrent, and continuous time networks. Notably, we augment MobileNetV1 [10] with DNW to achieve a 10% improvement on ImageNet [4] over the hand engineered MobileNetV1 at ∼41M FLOPs 1 .
2 Discovering Neural Wirings
In this section we describe our method for jointly discovering the structure and learning the parameters of a neural network. We first consider the algorithm in a familiar setting, a feed-forward neural network, which we abstract as a static neural graph. We then present a more expressive dynamic neural graph which extends to discrete and continuous time and generalizes feed-forward, recurrent, and continuous time neural networks.
2.1 Static Neural Graph
A static neural graph is a directed acyclic graph G = (V, E) consisting of nodes V and edges E ⊆ V × V. The state of a node v ∈ V is given by the random variable Z v . At each node v we apply a function f θv and with each edge (u, v) we associate a weight w uv . In the case of a multi-layer perceptron, f is simply a parameter-free non-linear activation like ReLU [14] .
For any set A ⊆ V we let Z A denote (Z v ) v∈A and so Z V is a vector containing the state of all nodes in the network.
V contains a subset of input nodes V 0 with no parents and output nodes V E with no children. The input data X ∼ p x flows into the network through V 0 as Z V0 = g φ (X ) for a function g which may have parameters φ. Similarly, the output of the network Ŷ is given by h ψ (Z V E ). For brevity, we let I v denote the "input" to node v, where I v may be expressed
I_v = Σ_{(u,v) ∈ E} w_{uv} Z_u   (1)
Z_v = f_{θ_v}(I_v)   (2)
In this work we consider the case where the input and output of each node is a two-dimensional matrix, commonly referred to as a channel. Each node performs a non-linear activation followed by normalization and convolution (which may be strided to reduce the spatial resolution). As in [28] , we no longer conform to the traditional notion of "layers" in a deep network.
The combination of a separate 3 × 3 convolution for each channel (depthwise convolution) followed by a 1 × 1 convolution (pointwise convolution) is often referred to as a depthwise separable convolution, and is essential in efficient network design [10, 19]. With a static neural graph this process may be interpreted equivalently as a 3 × 3 convolution at each node followed by information flow on a complete bipartite graph.
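To make this equivalence concrete, the short check below (a PyTorch sketch; the shapes and names are ours, chosen for illustration) verifies that a 1 × 1 pointwise convolution produces the same output as an explicit weighted sum of channels over a complete bipartite graph.

import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)   # a batch of 2 inputs with 4 channels at 8 x 8 resolution
w = torch.randn(6, 4)         # edge weights w_uv of a complete bipartite graph: 4 -> 6 channels

# Pointwise-convolution view: the same weights reshaped to (out_channels, in_channels, 1, 1).
out_conv = F.conv2d(x, w.view(6, 4, 1, 1))

# Graph view: each output node v receives sum_u w_uv * Z_u.
out_graph = torch.einsum('vu,buhw->bvhw', w, x)

assert torch.allclose(out_conv, out_graph, atol=1e-5)

Removing edges from this bipartite graph is what later lets DNW consider sparser alternatives to the dense pointwise convolution.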
2.2 Discovering A K-Edge Neural Graph
We now outline our method for discovering the edges of a static neural graph subject to the constraint that the total number of edges must not exceed k.
We consider a set of real edges E and a set of hallucinated edges E_hal = V × V \ E. The real edge set comprises the k edges which have the largest magnitude weights. As we allow the magnitude of the weights in both sets to change throughout training, the edges in E_hal may replace those in E.
Consider a hallucinated edge (u, v) ∈ E_hal. If the gradient is pushing I_v in a direction which aligns with Z_u, then our update rule strengthens the magnitude of the weight w_uv. If this alignment happens consistently then w_uv will eventually be strong enough to enter the real edge set E. As the total number of edges is conserved, when (u, v) enters the edge set E another edge is removed and placed in E_hal. This procedure is detailed by Algorithm 1, where V is the node set, V_0, V_E are the input and output node sets, g_φ, h_ψ and {f_{θ_v}}_{v∈V} are the input, output, and node functions, p_xy is the data distribution, k is the number of edges in the graph, and L is the loss.
In practice we may also include a momentum and weight decay 2 term in the weight update rule (line 10 in Algorithm 1). In fact, the weight update rule looks nearly identical to that in traditional SGD & Backprop but for one key difference: we allow the gradient to flow to edges which did not exist during the forwards pass. Importantly, we do not allow the gradient to flow through these edges and so the rest of the parameters update as in traditional SGD & Backprop. This gradient flow is illustrated in Figure 3 .
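In an autograd framework this behavior can be sketched with a small straight-through-style function (illustrative code and naming of our own, not the released implementation): the forward pass masks all but the k largest-magnitude edge weights, while the backward pass returns the dense gradient so that hallucinated edges are updated as well.

import torch

class ChooseTopK(torch.autograd.Function):
    """Forward: keep only the k largest-magnitude edge weights (the real edges E).
    Backward: return the dense gradient, so hallucinated edges are also updated."""

    @staticmethod
    def forward(ctx, weights, k):
        mask = torch.zeros_like(weights)
        topk = weights.abs().flatten().topk(k).indices
        mask.view(-1)[topk] = 1.0
        return weights * mask

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient reaches every edge weight, but it never flows *through* a
        # hallucinated edge: the forward output was computed with real edges only.
        return grad_output, None

# Toy usage: one slice of the graph with 4 parent and 3 child nodes, k = 5 real edges.
w = torch.randn(3, 4, requires_grad=True)   # w[v, u] is the weight of edge (u, v)
z = torch.randn(10, 4)                      # states Z_u for a mini-batch of size 10
i_v = z @ ChooseTopK.apply(w, 5).t()        # I_v uses only the k real edges
loss = i_v.pow(2).mean()
loss.backward()                             # w.grad is dense

An ordinary SGD step on this dense gradient matches line 10 of Algorithm 1: every edge weight receives w_uv ← w_uv + ⟨Z_u, −α ∂L/∂I_v⟩, even though only the k real edges influenced the forward computation.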
Under certain conditions we formally show that swapping an edge from E hal to E decreases the loss L.
We first consider the simple case where the hallucinated edge (i, k) replaces (j, k) ∈ E. In Section C we discuss the proof to a more general case.
Algorithm 1 DNW-Train(V, V_0, V_E, g_φ, h_ψ, {f_{θ_v}}_{v∈V}, p_xy, k, L)
1: for each pair of nodes (u, v) such that u < v do    ▷ Initialize
2:     Initialize w_{uv} by independently sampling from a uniform distribution.
3: for each training iteration do
4:     Sample a mini-batch of data and labels (X, Y) = {(X_i, Y_i)} using p_xy    ▷ Sample data
5:     E ← {(u, v) ∈ V × V : |w_{uv}| ≥ τ}, where τ is chosen so that |E| = k    ▷ Choose edges
6:     Z_v ← f_{θ_v}(Σ_{(u,v)∈E} w_{uv} Z_u) for v ∈ V \ V_0, and Z_v ← g_φ^{(v)}(X) for v ∈ V_0    ▷ Forward pass
7:     Ŷ ← h_ψ({Z_v}_{v∈V_E})    ▷ Compute output
8:     Update φ, {θ_v}_{v∈V}, ψ via SGD & Backprop [21] using loss L(Ŷ, Y)
9:     for each pair of nodes (u, v) such that u < v do    ▷ Update edge weights
10:        w_{uv} ← w_{uv} + ⟨Z_u, −α ∂L/∂I_v⟩, where I_v = Σ_{(u,v)∈E} w_{uv} Z_u

Figure 3: Illustration of the forward and backward pass through the neural graph.

We let w̃ denote the weight w after the weight update rule w̃_{uv} = w_{uv} + ⟨Z_u, −α ∂L/∂I_v⟩. We assume that α is small enough so that sign(w̃) = sign(w).
Claim: Assume L is Lipschitz continuous. There exists a learning rate α* > 0 such that for α ∈ (0, α*) the process of swapping (i, k) for (j, k) will decrease the loss when the state of the nodes is fixed and |w_{ik}| < |w_{jk}| but |w̃_{ik}| > |w̃_{jk}|.
Proof. Let A be the value of I_k after the update rule if (j, k) is replaced with (i, k). Let B be the value of I_k after the update rule if we do not allow swapping. A and B are then given by
A = w̃_{ik} Z_i + Σ_{(u,k) ∈ E \ {(j,k)}} w̃_{uk} Z_u,     B = Σ_{(u,k) ∈ E} w̃_{uk} Z_u.   (3)
Additionally, let g = −α ∂L/∂I_k be the direction in which the loss most steeply descends with respect to I_k. By Lemma 1 (Section D of the Appendix) it suffices to show that moving I_k towards A is more aligned with g than moving I_k towards B. Formally, we wish to show that
⟨A − I_k, g⟩ ≥ ⟨B − I_k, g⟩   (4)
which, since ⟨Z_u, g⟩ = w̃_{uk} − w_{uk} by the update rule, simplifies to
w̃_{ik} (w̃_{ik} − w_{ik}) ≥ w̃_{jk} (w̃_{jk} − w_{jk}).   (6)
In the case where w̃_{ik} and (w̃_{ik} − w_{ik}) have the same sign but w̃_{jk} and (w̃_{jk} − w_{jk}) have different signs, the inequality immediately holds. This corresponds to the case where w_{ik} increases in magnitude but w_{jk} decreases in magnitude. The opposite scenario (w_{ik} decreases in magnitude but w_{jk} increases) is impossible since |w_{ik}| < |w_{jk}| but |w̃_{ik}| > |w̃_{jk}|.
We now consider the scenario where both sides of the inequality (equation 6) are positive. Simplifying further we obtain
w̃_{jk} w_{jk} − w̃_{ik} w_{ik} ≥ w̃_{jk}^2 − w̃_{ik}^2   (7)
and are now able to identify a range for α such that the inequality above is satisfied. By assumption the right hand side is less than 0, and sign(w̃) = sign(w) so w̃w = |w̃||w|. Accordingly, it suffices to show that
|w̃_{jk}| |w_{jk}| − |w̃_{ik}| |w_{ik}| ≥ 0.   (8)
If we let ε = |w_{jk}| − |w_{ik}| and
α* = sup{α : |w̃_{ik}| − |w̃_{jk}| ≤ ε},   (9)
the inequality (equation 7) is satisfied. Here we are implicitly using our assumption that the gradient is bounded, and so we may "tune" α to control the magnitude of |w̃_{ik}| − |w̃_{jk}|. In the case where α = inf{α : |w̃_{ik}| > |w̃_{jk}|} the right hand side of equation 7 becomes 0 while the left hand side is > 0.
2.3 Dynamic Neural Graph
We now consider a more general setting where the state of each node Z v (t) may vary through time.
We refer to this model as a dynamic neural graph.
The initial conditions of a dynamic neural graph are given by
Z_v(0) = g_φ^{(v)}(X) for v ∈ V_0,   and   Z_v(0) = 0 for v ∈ V \ V_0,   (10)
where V 0 is a designated set of input nodes, which may now have parents.
Discrete Time Dynamics: For a discrete time neural graph we consider times ℓ ∈ {0, 1, ..., L}.
The dynamics are then given by
Z_v(ℓ + 1) = f_{θ_v}( Σ_{(u,v) ∈ E} w_{uv} Z_u(ℓ), ℓ )   (11)
and the network output is Ŷ = h_ψ(Z_{V_E}(L)). We may express equation 11 more succinctly as
Z_V(ℓ + 1) = f_θ( A_G Z_V(ℓ), ℓ )   (12)
where Z_V(ℓ) = (Z_v(ℓ))_{v∈V}, f_θ(z, ℓ) = (f_{θ_v}(z_v, ℓ))_{v∈V}, and A_G is the weighted adjacency matrix for graph G. Equation 12 suggests the following interpretation: at each time step we send information through the edges using A_G, then apply a function at each node.
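As a toy illustration of this interpretation (scalar node states and a per-node scaled tanh standing in for f_θ are our simplifications), one discrete time step is a multiplication by the weighted adjacency matrix followed by a node-wise function.

import torch

def dynamic_step(z, adj, theta):
    """One step of Z_V(l+1) = f_theta(A_G Z_V(l), l), with the time argument omitted.
    adj[v, u] holds w_uv, so (adj @ z)[v] = sum_u w_uv * z_u."""
    message = adj @ z                    # send information through the edges using A_G
    return torch.tanh(theta * message)   # then apply a function at each node

n = 6
adj = torch.randn(n, n) * (torch.rand(n, n) < 0.3).float()   # a sparse wiring
theta = torch.randn(n)                  # one illustrative parameter per node
z = torch.zeros(n)
z[0] = 1.0                              # initial condition: a single input node

for step in range(5):                   # L = 5 time steps
    z = dynamic_step(z, adj, theta)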
Continuous Time Dynamics: As in [2] , we consider the case where t may take on a continuous range of values. We then arrive at dynamics given by
d Z_V(t) / dt = f_θ( A_G Z_V(t), t )   (13)
Interestingly, if V 0 is a strict subset of V we uncover an Augmented Neural ODE [5] .
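The continuous-time dynamics can be sketched the same way; below, a fixed-step Euler integrator stands in for the adaptive solver used in our experiments, and f_θ is again a toy per-node nonlinearity.

import torch

def integrate(z0, adj, theta, t1=1.0, steps=100):
    """Euler integration of dZ_V/dt = f_theta(A_G Z_V(t), t) from t = 0 to t1."""
    z, dt = z0.clone(), t1 / steps
    for i in range(steps):
        t = i * dt
        z = z + dt * torch.tanh(theta * (adj @ z) + t)   # toy f_theta with explicit time dependence
    return z

# e.g. z_final = integrate(z, adj, theta), with z, adj, theta as in the discrete sketch above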
The discrete time case is unifying in the sense that it may also express any static neural graph. In Figure 1 we illustrate that an MLP may also be expressed by a discrete time neural graph. Additionally, the discrete time dynamics are able to capture sequential models such as LSTMs [9], as long as we allow input to flow into V_0 at any time.
In continuous time it is not immediately obvious how to incorporate strided convolutions. One approach is to keep the same spatial resolution throughout and pad with zeros after applying strided convolutions. This design is illustrated by Figure 2 .
We may also apply Algorithm 1 to learn the structure of dynamic neural graphs. One may use backpropagation through time [26] and the adjoint-sensitivity method [2] for optimization in the discrete and continuous time settings respectively. In Section 3.1, we demonstrate empirically that our method performs better than a random graph, though we do not formally justify the application of our algorithm in this setting.
2.4 Implementation Details For Large Scale Experiments
For large scale experiments we do not consider the dynamic case as optimization is too expensive. Accordingly, we now present our method for constructing a large and efficient static neural graph.

With this model we may jointly learn the structure of the graph along with the parameters on ImageNet [4] . As illustrated by Table 4 our model closely follows the structure of MobileNetV1 [10] , and so we refer to it as MobileNetV1-DNW. We consider a separate neural graph for each spatial resolution -the output of graph G i is the input of graph G i+1 . For width multiplier [10] d and spatial resolution s × s we constrain MobileNetV1-DNW to have the same number of edges for resolution s × s as the corresponding MobileNetV1 ×d. We use a slightly smaller width multiplier to obtain a model with similar FLOPs as we do not explicitly reduce the number of depthwise convolutions in MobileNetV1-DNW. However, we do find that neurons often die (have no output) and we may then skip the depthwise convolution during inference. Note that if we interpret a pointwise convolution with c 1 input channels and c 2 output channels as a complete bipartite graph then the number of edges is simply c 1 * c 2 .
We also constrain the longest path in graph G to be equivalent to the number of layers of the corresponding MobileNetV1. We do so by partitioning the nodes V into blocks
B = {B_0, ..., B_{L−1}}, where B_0 contains the input nodes V_0, B_{L−1} contains the output nodes V_E, and we only allow edges between nodes in B_i and B_j if i < j. The longest path in a graph with L blocks is then L − 1. Splitting the graph into blocks also improves efficiency as we may operate on one block at a time. The structure of MobileNetV1 may be recovered by considering a complete bipartite graph between adjacent blocks.
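This constraint amounts to masking the node-to-node weight matrix so that edges only run from earlier blocks to later ones. A minimal sketch (the block sizes are illustrative, not those of Table 4):

import torch

def block_mask(block_sizes):
    """mask[v, u] = 1 iff node u lies in an earlier block than node v,
    so edges only run from B_i to B_j with i < j."""
    block_id = torch.repeat_interleave(
        torch.arange(len(block_sizes)), torch.tensor(block_sizes))
    return (block_id.unsqueeze(1) > block_id.unsqueeze(0)).float()

mask = block_mask([8, 16, 16, 8])       # B_0, ..., B_3: the longest possible path has 3 edges
weights = torch.randn(48, 48) * mask    # candidate edge weights w_uv, indexed as [v, u]

Edge selection with DNW is then restricted to the nonzero entries of this mask.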
The operation f θv at each non-output node is a batch-norm [12] (2 parameters), ReLU [14] , 3 × 3 convolution (9 parameters) triplet. There are no operations at the output nodes. When the spatial resolution decreases in MobileNetV1 we change the convolutional stride of the input nodes to 2.
In models denoted MobileNetV1-DNW-Small (×d) we also limit the last fully connected (FC) layer to have the same number of edges as the FC layer in MobileNetV1 (×d). In the normal setting of MobileNetV1-DNW we do not modify the last FC layer.
3 Experiments
In this section we demonstrate the effectiveness of DNW for image classification in small and large scale settings. We begin by comparing our method with a random wiring on a small scale dataset and model. This allows us to experiment in static, discrete time, and continuous settings. Next we explore the use of DNW at scale with experiments on ImageNet [4] . Finally, we compare DNW with other methods of discovering network structures.
Throughout this section we let RG denote our primary baseline -a randomly wired graph. To construct a randomly wired graph with k-edges we assign a uniform random weight to each edge then pick the k edges with the largest magnitude weights. As shown in [28] , random graphs often outperform manually designed networks.
3.1 Small Scale Experiments For Static And Dynamic Neural Graphs
We begin by training tiny classifiers for the CIFAR-10 dataset [13] . Our initial aim is not to achieve state of the art performance but instead to explore DNW in the static, discrete, and continuous time settings. As illustrated by Table 1 , our method outperforms a random graph by a large margin. The image is first downsampled 3 then each channel is given as input to a node in a neural graph. The static graph uses 5 blocks and the discrete time graph uses 5 time steps. For the continuous case we backprop through the operation of an adaptive ODE solver 4 . The models have 41k parameters. At each node we perform Instance Normalization [25] , ReLU, and a 3 × 3 single channel convolution.
3.2 ImageNet Classification
For large scale experiments on ImageNet [4] we are limited to exploring DNW in the static case (recurrent and continuous time networks are more expensive to optimize due to lack of parallelization). Although our network follows the simple structure of MobileNetV1 [10] we are able to achieve higher accuracy than modern networks which are more advanced and optimized. Notably, MobileNetV2 [22] extends MobileNetV1 by adding residual connections and linear bottlenecks and ShuffleNet [29, 19] introduces channel splits and channel shuffles. The results of the large scale experiments may be found in Table 3 .
As standard, we have divided the results of Table 3 to consider models which have similar FLOPs. In the more sparse case (∼ 41M FLOPs) we are able to use DNW to boost the performance of MobileNetV1 by 10%. Though random graphs perform extremely well we still observe a 7% boost in performance. In each experiment we train for 250 epochs using Cosine Annealing as the learning rate scheduler with initial learning rate 0.1, as in [28] . Models using random graphs have considerably more FLOPs as nearly all depthwise convolutions must be performed. DNW allows neurons to die and we may therefore skip many operations.
3.3 Related Methods
We compare DNW with various methods for discovering neural wirings. In Table 2 we use the structure of MobileNetV1-DNW but try other methods which find k-edge sub-networks. The experiments in Table 2 are conducted using CIFAR-10 [13] . We train for 160 epochs using Cosine Annealing as the learning rate scheduler with initial learning rate α = 0.1 unless otherwise noted.
The Lottery Ticket Hypothesis: The authors of [6, 7] offer an intriguing hypothesis: sparse sub-networks may be trained in isolation. However, their method for finding so-called winning tickets is quite expensive as it requires training the full graph from scratch. We compare with one-shot pruning from [7]. One-shot pruning is closer to DNW in training FLOPs than iterative pruning [6], though both methods require more training FLOPs than DNW. After training the full network G_full
(i.e. no edges are pruned), the optimal sub-network G_k with k edges is chosen by taking the weights with the highest magnitude. In the row denoted Lottery Ticket we retrain G_k using the initialization of G_full. We found it better to initialize G_k with the weights of G_full after training, denoted by FT for fine-tune (we try different initial learning rates α). Though these experiments perform comparably with DNW, their training is much more expensive as the full graph must initially be trained.

Table 3: ImageNet Experiments (see Section 2.4 for more details). Models with * use the implementations of [19]. Models with multiple asterisks use different image resolutions so that the FLOPs are comparable (see Table 8 in [19] for more details).

Model                        Params   FLOPs   Accuracy
MobileNetV1 (×0.5)           1.3M     149M    63.7%
MobileNetV2 (×0.6)*          -        141M    66.6%
MobileNetV2 (×0.75)***       -        145M    67.9%
DenseNet (×1)*               -        142M    54.8%
Xception (×1)*               -        145M    65.9%
ShuffleNetV1 (×1, g = 3)     -        140M    67.4%
ShuffleNetV2 (×1)            2.3M     146M    69.4%
MobileNetV1-RG (×0.49)       1.8M     170M    64.1%
MobileNetV1-DNW (×0.49)      1.8M     154M    70.4%
Exploring Randomly Wired Networks for Image Recognition: The authors of [28] explore "a more diverse set of connectivity patterns through the lens of randomly wired neural networks." They achieve impressive performance on ImageNet [4] using random graph algorithms to generate the structure of a neural network. Their network connectivity, however, is fixed during training. Throughout this section we have a random graph (denoted RG) as our primary baseline -as in [28] we have seen that random graphs outperform hand-designed networks.
No Update Rule: In this ablation on DNW we do not apply the update rule to the hallucinated edges. An edge may only leave the hallucinated edge set if the magnitude of a real edge is sufficiently decreased. This experiment demonstrates the importance of the update rule.
L1 + Anneal:
We experiment with a simple pruning technique -start with a fully connected graph and remove edges by magnitude throughout training until there are only k remaining. We found that accuracy was much better if we added an L1 regularization term.
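A sketch of this baseline (the linear schedule and the penalty coefficient are our own illustrative choices): the edge budget is annealed from the full graph down to k, while an L1 penalty shrinks unneeded weights before they are pruned.

import torch

def edge_budget(epoch, total_epochs, full_edges, k):
    """Linearly anneal the number of surviving edges from full_edges down to k."""
    frac = min(epoch / total_epochs, 1.0)
    return int(round(full_edges - frac * (full_edges - k)))

def prune_smallest(weights, budget):
    """Zero all but the `budget` largest-magnitude edge weights, in place."""
    keep = weights.abs().flatten().topk(budget).indices
    mask = torch.zeros_like(weights).view(-1)
    mask[keep] = 1.0
    with torch.no_grad():
        weights.mul_(mask.view_as(weights))

# Inside the training loop (sketch):
#   loss = task_loss + 1e-5 * weights.abs().sum()      # the L1 regularization term
#   loss.backward(); optimizer.step()
#   prune_smallest(weights, edge_budget(epoch, total_epochs, weights.numel(), k))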
Targeted Dropout: The authors of [8] present a simple and effective method for training a network which is robust to subsequent pruning. Their method outperforms variational dropout [20] and L 0 pruning [18] . We compare with Weight Dropout/Pruning from [8] , which we denote as TD. Their method is as follows: choose the bottom γ fraction of weights by magnitude 5 and apply dropout with probability ρ. We use the same architecture as MobileNetV1-DNW (×0.225) and fix γ at each stage so that the network post pruning has the same number of edges per stage as MobileNetV1-DNW (×0.225). As illustrated by Table 2 we experiment with a couple variants of constant ρ and also try ρ = γ.
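For reference, the weight variant we compare against can be sketched as follows (a simplified, unscaled version with names of our choosing): the bottom γ fraction of weights by magnitude are targeted, and each targeted weight is zeroed independently with probability ρ during training.

import torch

def targeted_weight_dropout(weights, gamma, rho, training=True):
    """Drop each of the bottom-gamma fraction of weights (by magnitude) with probability rho.
    Ranking over all weights in the graph gives the unconstrained variant."""
    if not training:
        return weights
    num_targeted = int(gamma * weights.numel())
    if num_targeted == 0:
        return weights
    threshold = weights.abs().flatten().kthvalue(num_targeted).values
    targeted = weights.abs() <= threshold
    drop = targeted & (torch.rand_like(weights) < rho)
    return weights * (~drop).float()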
Neural Architecture Search: As illustrated by Table 3, our network (with a very simple MobileNetV1-like structure) is able to achieve comparable accuracy to an expensive method which performs neural architecture search using reinforcement learning [24].
4 Scope & Limitations
Efficiency: Our training process is still more expensive than training a sparse sub-network in isolation -on the backwards pass we must consider the complete graph. For this reason our large scale experiments are still limited to the small FLOP regimes. In the future we hope to explore more stochastic methods where we only update a "mini-batch" of weight values.
Locality: Our algorithm is quite simple and local. We anticipate that methods of discovering neural wirings which take global structure into account may perform better. We look forward to exploring these methods in future work.
5 Conclusion
We present a novel method for discovering neural wirings. With a simple algorithm we demonstrate a significant boost in accuracy over randomly wired networks. Just as in [28] , our networks are free from the typical constraints of NAS. This work suggests exciting directions for more complex and efficient methods of discovering neural wirings.
A Architecture

B Targeted Dropout Variants

In the setting we consider, unconstrained targeted weight dropout outperforms the targeted weight dropout presented in [8] (which we refer to as regular). Accordingly, the results we present in Table 2 correspond to unconstrained targeted weight dropout. In regular targeted weight dropout, dropout is applied to the bottom γ fraction of incoming weights to each neuron. In unconstrained targeted weight dropout, we apply dropout to the bottom γ fraction of edges at the spatial resolution. Accordingly, neurons may die and have no incoming or outgoing edges. We compare unconstrained and regular targeted dropout in Table 5.

Table 5: Comparing variants of targeted weight dropout using the architecture described in Table 4 and tested on CIFAR-10, shown as mean and std over 5 runs.

Model          Accuracy (Unconstrained)   Accuracy (Regular)
TD ρ = 0.9     89.0 ± 0.2%                87.9 ± 0.5%
TD ρ = 0.95    89.2 ± 0.4%                87.9 ± 0.2%
TD ρ = 0.99    88.6 ± 0.2%                87.7 ± 0.3%
TD ρ = γ       88.8 ± 0.2%                87.9 ± 0.2%
C A More General Case
We now consider the case where the hallucinated edge (i, ℓ) replaces (j, k) ∈ E.
As before, we use w̃ to denote the weight w after the weight update rule w̃_{uv} = w_{uv} + ⟨Z_u, −α ∂L/∂I_v⟩. We assume that α is small enough so that sign(w̃) = sign(w).
Claim: Assume L is Lipschitz continuous. There exists a learning rate α* > 0 such that for α ∈ (0, α*) the process of swapping (i, ℓ) for (j, k) will decrease the loss when the state of the nodes is fixed, there is no path from i to j, and |w_{iℓ}| < |w_{jk}| but |w̃_{iℓ}| > |w̃_{jk}|.
Proof. Let A_k and A_ℓ be the values of I_k and I_ℓ after the update rule if (j, k) is replaced with (i, ℓ). Let B_k and B_ℓ be the values of I_k and I_ℓ after the update rule if we do not allow swapping. A_k, A_ℓ, B_k and B_ℓ are then given by

A_k = Σ_{(u,k) ∈ E \ {(j,k)}} w̃_{uk} Z_u,      B_k = Σ_{(u,k) ∈ E} w̃_{uk} Z_u,   (14)
A_ℓ = w̃_{iℓ} Z_i + Σ_{(u,ℓ) ∈ E} w̃_{uℓ} Z_u,   B_ℓ = Σ_{(u,ℓ) ∈ E} w̃_{uℓ} Z_u.   (15)
Additionally, let g_k = −α ∂L/∂I_k and g_ℓ = −α ∂L/∂I_ℓ be the directions in which the loss most steeply descends with respect to I_k and I_ℓ. By Lemma 1 (Section D of the Appendix) it suffices to show that
⟨A_k − I_k, g_k⟩ + ⟨A_ℓ − I_ℓ, g_ℓ⟩ ≥ ⟨B_k − I_k, g_k⟩ + ⟨B_ℓ − I_ℓ, g_ℓ⟩
which simplifies to
w̃_{iℓ} (w̃_{iℓ} − w_{iℓ}) ≥ w̃_{jk} (w̃_{jk} − w_{jk}).   (18)
We are now in the equivalent setting as equation 6 and may complete the proof as before.
In practice there may be a path from i to j, and the state of the nodes will never be fixed due to the stochasticity of mini-batches and updates to the rest of the parameters in the network. However, as the graph grows large, the state of one node will have little effect on the state of another, even if there is a path between them. The proofs are given in an idealized case, and the empirical results demonstrate that the method works in practice.
D Lemma 1
Here we show that for sufficiently small α
⟨A − I_v, g⟩ ≥ ⟨B − I_v, g⟩, where g = −α ∂L/∂I_v,   (19)
implies that
L(A) ≤ L(B).   (20)
Note that for brevity we have written the loss as a function of I v . By taking a Taylor expansion we find that
L(A) = L(I_v) + ⟨∂L/∂I_v, A − I_v⟩ + O(α²),    L(B) = L(I_v) + ⟨∂L/∂I_v, B − I_v⟩ + O(α²),   (22)
and so for sufficiently small α
L(A) − L(B) = ⟨∂L/∂I_v, (A − I_v) − (B − I_v)⟩ + O(α²) = −(1/α)(⟨A − I_v, g⟩ − ⟨B − I_v, g⟩) + O(α²) ≤ 0,   (24)
which completes the lemma.
1 We follow [29, 19] and define FLOPs as the number of multiply-adds.
2 Weight decay [15] may in fact be very helpful for eliminating dead ends.
3 We use two 3 × 3 strided convolutions. The first is standard while the second is depthwise-separable.
4 We use a 5th order Runge-Kutta method [23] as implemented by [2] (from t = 0 to 1 with tolerance 0.001).
5 In Weight Dropout/Pruning from [8] they choose the bottom γ fraction of incoming weights to each neuron. We instead consider the bottom γ fraction of all weights in the graph. As illustrated by the experiments in Section B of the Appendix, we observe that this performs better in the considered setting.