
Policy Shaping and Generalized Update Equations for Semantic Parsing from Denotations


Abstract

Semantic parsing from denotations faces two key challenges in model training: (1) given only the denotations (e.g., answers), search for good candidate semantic parses, and (2) choose the best model update algorithm. We propose effective and general solutions to both. Using policy shaping, we bias the search procedure towards semantic parses that are more compatible with the text, which provides better supervision signals for training. In addition, we propose an update equation that generalizes three different families of learning algorithms, which enables fast model exploration. When evaluated on a recently proposed sequential question answering dataset, our framework leads to a new state-of-the-art model that outperforms previous work by 5.0% absolute on exact match accuracy.

1 Introduction

Semantic parsing from denotations (SpFD) is the problem of mapping text to executable formal representations (or programs) in a situated environment and executing them to generate denotations (or answers), in the absence of access to correct representations. Several problems have been handled within this framework, including question answering (Berant et al., 2013; Iyyer et al., 2017) and instructions for robots (Artzi and Zettlemoyer, 2013; Misra et al., 2015).

Consider the example in Figure 1. Given the question and a table environment, a semantic parser maps the question to an executable program, in this case a SQL query, and then executes the query on the environment to generate the answer England. In the SpFD setting, the training data does not contain the correct programs. Thus, existing learning approaches for SpFD perform two steps for every training example: a search step that explores the space of programs and finds suitable candidates, and an update step that uses these programs to update the model. Figure 2 shows the two-step training procedure for the above example. In this paper, we address two key challenges in model training for SpFD by proposing a novel learning framework that improves both the search and update steps. The first challenge, the existence of spurious programs, lies in the search step. More specifically, while the success of the search step relies on its ability to find programs that are semantically correct, we can only verify whether a program generates the correct answer, given that no gold programs are provided. The search step is complicated by spurious programs, which happen to evaluate to the correct answer but do not accurately represent the meaning of the natural language question. For example, for the environment in Figure 1, the program Select Nation Where Name = Karen Andrew is spurious. Selecting spurious programs as positive examples can greatly affect the performance of semantic parsers, as these programs generally do not generalize to unseen questions and environments.

Figure 1: An example of semantic parsing from denotations. Given the table environment, map the question to an executable program that evaluates to the answer.
Figure 2: An example of semantic parsing from denotation. Given the question and the table environment, there are several programs which are spurious.

The second challenge, choosing a learning algorithm, lies in the update step. Because of the unique indirect supervision setting of SpFD, the quality of the learned semantic parser is dictated by the choice of how to update the model parameters, often determined empirically. As a result, several families of learning methods, including maximum marginal likelihood, reinforcement learning and margin based methods have been used. How to effectively explore different model choices could be crucial in practice.

Our contributions in this work are twofold. To address the first challenge, we propose a policy shaping (Griffith et al., 2013) method that incorporates simple, lightweight domain knowledge, such as a small set of lexical pairs of tokens in the question and program, in the form of a critique policy (§3). This helps bias the search towards the correct program, an important step to improve supervision signals, which benefits learning regardless of the choice of algorithm. To address the second challenge, we prove that the parameter update steps in several algorithms are similar and can be viewed as special cases of a generalized update equation (§4). The equation contains two variable terms that govern the update behavior. Changing these two terms effectively defines an infinite class of learning algorithms where different values lead to significantly different results. We study this effect and propose a novel learning framework that improves over existing methods.

We evaluate our methods using the sequential question answering (SQA) dataset (Iyyer et al., 2017), and show that our proposed improvements to the search and update steps consistently enhance existing approaches. The proposed algorithm achieves a new state of the art, outperforming existing parsers by 5.0%.

2 Background

We give a formal problem definition of the semantic parsing task, followed by the general learning framework for solving it.

2.1 The Semantic Parsing Task

The problem discussed in this paper can be formally defined as follows. Let X be the set of all possible questions, Y the set of programs (e.g., SQL-like queries), T the set of tables (i.e., the structured data in this work) and Z the set of answers. We further assume access to an executor Φ : Y × T → Z that, given a program y ∈ Y and a table t ∈ T, generates an answer Φ(y, t) ∈ Z. We assume that the executor and all tables are deterministic and that the executor can be called as many times as needed. To facilitate discussion in the following sections, we define an environment function e_t : Y → Z by applying the executor to the program: e_t(y) = Φ(y, t).
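To make these definitions concrete, here is a minimal Python sketch of the executor Φ and the environment function e_t; the types `Program`, `Table`, `Answer` and the toy `execute` function are our own illustrative assumptions, not part of the paper.

```python
from typing import Callable, List

# Hypothetical stand-ins for the paper's abstract sets Y (programs),
# T (tables) and Z (answers); real parsers use richer structures.
Program = str          # e.g., a SQL-like query string
Table = List[dict]     # e.g., a list of rows, each a column -> value dict
Answer = frozenset     # a set of cells, as in the SQA dataset


def execute(program: Program, table: Table) -> Answer:
    """Deterministic executor Phi(y, t); here a toy 'Select <col>' only."""
    column = program.split()[-1]
    return frozenset(row[column] for row in table)


def make_environment(table: Table) -> Callable[[Program], Answer]:
    """The environment function e_t(y) = Phi(y, t) with the table t fixed."""
    return lambda program: execute(program, table)


table = [{"Name": "Karen Andrew", "Nation": "England"},
         {"Name": "A. N. Other", "Nation": "Wales"}]
e_t = make_environment(table)
print(e_t("Select Nation"))  # frozenset({'England', 'Wales'})
```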

Given a question x and an environment e t , our aim is to generate a program y * ∈ Y and then execute it to produce the answer e t (y * ). Assume that for any y ∈ Y, the score of y being a correct program for x is score θ (y, x, t), parameterized by θ. The inference task is thus:

y* = argmax_{y ∈ Y} score_θ(y, x, t)    (1)

As the size of Y is exponential in the length of the program, a generic search procedure is typically employed for Eq. (1), as efficient dynamic programming algorithms typically do not exist. These search procedures generally maintain a beam of program states sorted according to some scoring function, where each program state represents an incomplete program. The search then generates a new program state from an existing state by performing an action. Each action adds a set of tokens (e.g., Nation) and keywords (e.g., Select) to a program state. For example, in order to generate the program in Figure 1, the DynSP parser (Iyyer et al., 2017) will take the first action as adding the SQL expression Select Nation. Notice that score_θ can be used in either probabilistic or non-probabilistic models. For probabilistic models, we assume that it is a Boltzmann policy, meaning that p_θ(y | x, t) ∝ exp{score_θ(y, x, t)}.
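The following sketch illustrates this style of beam search over program states and the Boltzmann policy; the callbacks `score_fn`, `actions_fn`, and `is_complete` are placeholders for the parser-specific parts (this is a generic sketch, not the DynSP implementation).

```python
import math
from typing import Callable, List, Tuple

State = list  # a program state is the list of actions taken so far

def beam_search(score_fn: Callable[[State], float],
                actions_fn: Callable[[State], list],
                is_complete: Callable[[State], bool],
                beam_size: int = 5, max_steps: int = 10) -> List[Tuple[float, State]]:
    """Expand a beam of partial programs one action at a time."""
    beam = [(0.0, [])]        # (score, program state), starting from the empty program
    finished = []
    for _ in range(max_steps):
        candidates = []
        for _, state in beam:
            for action in actions_fn(state):
                new_state = state + [action]
                candidates.append((score_fn(new_state), new_state))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            (finished if is_complete(cand[1]) else beam).append(cand)
    return sorted(finished, key=lambda c: c[0], reverse=True)

def boltzmann(scores: List[float]) -> List[float]:
    """p_theta(y | x, t) proportional to exp(score_theta(y, x, t))."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```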

2.2 Learning

Learning a semantic parser is equivalent to learning the parameters θ in the scoring function, which is a structured learning problem due to the large, structured output space Y. Structured learning algorithms generally consist of two major components: search and update. When the gold programs are available during training, the search procedure finds a set of high-scoring incorrect programs. These programs are used by the update step to derive a loss for updating parameters. For example, these programs are used for approximating the partition function in the maximum-likelihood objective (Liang et al., 2011) and for finding the set of programs causing margin violations in margin-based methods (Daumé III and Marcu, 2005). Depending on the exact algorithm being used, these two components are not necessarily separated into isolated steps. For instance, parameters can be updated in the middle of search (e.g., Huang et al., 2012). For learning semantic parsers from denotations, where we assume only answers are available in a training set {(x_i, t_i, z_i)}_{i=1}^N of N examples, the basic construction of the learning algorithms remains the same. However, the problem that search needs to handle in SpFD is more challenging. In addition to finding a set of high-scoring incorrect programs, the search procedure also needs to guess the correct program(s) evaluating to the gold answer z_i. This problem is further complicated by the presence of spurious programs, which generate the correct answer but are semantically incompatible with the question. For example, although all programs in Figure 2 evaluate to the same answer, only one of them is correct. The issue of spurious programs also affects the design of the model update. For instance, maximum marginal likelihood methods treat all the programs that evaluate to the gold answer equally, while maximum margin reward networks use the model score to break ties and pick one of the programs as the correct reference.

3 Addressing Spurious Programs: Policy Shaping

Given a training example (x, t, z), the aim of the search step is to find a set K(x, t, z) of programs consisting of correct programs that evaluate to z and high-scoring incorrect programs. The search step should avoid picking up spurious programs for learning since such programs typically do not generalize. For example, in Figure 2, the spurious program Select Nation Where Index is Min will evaluate to an incorrect answer if the indices of the first two rows are swapped.¹ This problem is challenging since, among the programs that evaluate to the correct answer, most are spurious. The search step can be viewed as following an exploration policy b_θ(y|x, t, z) to explore the set of programs Y. This exploration is often performed by beam search, and at each step we either sample from b_θ or take the top-scoring programs. The set K(x, t, z) is then used by the update step for parameter update. Most search strategies use an exploration policy that is based on the score function, for example b_θ(y|x, t, z) ∝ exp{score_θ(y, x, t)}. However, this approach can suffer from a divergence phenomenon whereby the score of spurious programs picked up by the search in the first epoch increases, making it more likely for the search to pick them up in the future. Such divergence issues are common with latent-variable learning and often require careful initialization to overcome (Rose, 1998). Unfortunately, such initialization schemes are not applicable to the deep neural networks that form the model of most successful semantic parsers today (Jia and Liang, 2016; Misra and Artzi, 2016; Iyyer et al., 2017). Prior work, such as ε-greedy exploration (Guu et al., 2017), has reduced the severity of this problem by introducing random noise into the search procedure to avoid saturating the search on high-scoring spurious programs. However, random noise need not bias the search towards the correct program(s). In this paper, we introduce a simple policy-shaping method to guide the search. This approach allows incorporating prior knowledge in the exploration policy and can bias the search away from spurious programs.

Algorithm 1 Learning a semantic parser from denotation using generalized updates.

Input: Training set {(x_i, t_i, z_i)}_{i=1}^N (see Section 2), learning rate µ and stopping epoch T (see Section 4).
Definitions: score_θ(y, x, t) is a semantic parsing model parameterized by θ. p_s(y | x, t) is the policy used for exploration, and search(θ, x, t, z, p_s) generates candidate programs for updating parameters (see Section 3). ∆ is the generalized update (see Section 4).
Output: Model parameters θ.

1: » Iterate over the training data.
2: for t = 1 to T, i = 1 to N do
3:     » Find candidate programs using the shaped policy.
4:     K = search(θ, x_i, t_i, z_i, p_s)
5:     » Compute the generalized gradient update.
6:     θ = θ + µ∆(K)
7: return θ

Policy Shaping Policy shaping is a method to introduce prior knowledge into a policy (Griffith et al., 2013) . Formally, let the current behavior policy be b θ (y|x, t, z) and a predefined critique policy, the prior knowledge, be p c (y|x, t). Policy shaping defines a new shaped behavior policy p b (y|x, t) given by:

p_b(y|x, t) = b_θ(y|x, t, z) p_c(y|x, t) / Σ_{y' ∈ Y} b_θ(y'|x, t, z) p_c(y'|x, t)    (2)

Using the shaped policy for exploration biases the search towards the critique policy's preference. We next describe a simple critique policy that we use in this paper.

Lexical Policy Shaping

We qualitatively observed that correct programs often contain tokens that are also present in the question. For example, the correct program in Figure 2 contains the token Points, which is also present in the question. We therefore define a simple surface-form similarity feature match(x, y) that computes the fraction of non-keyword tokens in the program y that are also present in the question x.

However, surface-form similarity is often not enough. For example, both the first and the fourth program in Figure 2 contain the token Points, but only the fourth program is correct. Therefore, we also use a simple co-occurrence feature that triggers on frequently co-occurring pairs of tokens in the program and the question. For example, the token most is highly likely to co-occur with a correct program containing the keyword Max. This happens for the example in Figure 2. Similarly, the token not may co-occur with the keyword NotEqual. We assume access to a lexicon Λ = {(w_j, ω_j)}_{j=1}^k containing k lexical pairs of tokens and keywords. Each lexical pair (w, ω) maps the token w in a text to a keyword ω in a program. For a given program y and question x, we define the co-occurrence score as co_occur(y, x) = Σ_{(w,ω) ∈ Λ} 1{w ∈ x ∧ ω ∈ y}. We define the critique score critique(y, x) as the sum of the match and co_occur scores. The critique policy is given by:

p_c(y|x, t) ∝ exp{η · critique(y, x)}    (3)

where η is a single scalar hyper-parameter denoting the confidence in the critique policy.
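A minimal sketch of the critique scores and the shaped policy of Eqs. (2) and (3), assuming programs and questions are represented as token lists; the keyword set and the three-pair lexicon below are toy stand-ins for the 40 lexical pairs used in the paper.

```python
import math

KEYWORDS = {"Select", "Where", "Max", "Min", "NotEqual", "=", ">", "<"}
LEXICON = [("most", "Max"), ("least", "Min"), ("not", "NotEqual")]  # toy lexicon

def match(question_tokens, program_tokens):
    """Fraction of non-keyword program tokens that also appear in the question."""
    content = [tok for tok in program_tokens if tok not in KEYWORDS]
    if not content:
        return 0.0
    return sum(tok in question_tokens for tok in content) / len(content)

def co_occur(question_tokens, program_tokens):
    """Count of lexicon pairs (w, omega) with w in the question and omega in the program."""
    return sum(w in question_tokens and kw in program_tokens for w, kw in LEXICON)

def critique(question_tokens, program_tokens):
    return match(question_tokens, program_tokens) + co_occur(question_tokens, program_tokens)

def shaped_policy(programs, behavior_probs, question_tokens, eta=5.0):
    """Eq. (2): reweight b_theta by the critique policy p_c of Eq. (3) and renormalize."""
    weights = [b * math.exp(eta * critique(question_tokens, prog))
               for prog, b in zip(programs, behavior_probs)]
    z = sum(weights)
    return [w / z for w in weights]
```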

4 Addressing Update Strategy Selection: Generalized Update Equation

Given the set of programs generated by the search step, one can use many objectives to update the parameters. For example, previous work has utilized maximum marginal likelihood (Krishnamurthy et al., 2017; Guu et al., 2017), reinforcement learning (Zhong et al., 2017; Guu et al., 2017) and margin-based methods (Iyyer et al., 2017). Choosing a suitable algorithm among these options can be difficult.

In this section, we propose a principled and general update equation such that previous update algorithms can be viewed as special cases of this equation. Having a general update is important for the following reasons. First, it allows us to understand existing algorithms better by examining their basic properties. Second, the generalized update equation makes it easy to implement and experiment with various different algorithms. Moreover, it provides a framework that enables the development of new variations or extensions of existing learning methods.

In the following, we describe how the commonly used algorithms are in fact very similar: their update rules can all be viewed as special cases of the proposed generalized update equation. Algorithm 1 shows the meta-learning framework. For every training example, we first find a set of candidates using an exploration policy (line 4). We then use the candidate programs to update the parameters (line 6).

4.1 Commonly Used Learning Algorithms

We briefly describe three algorithms: maximum marginalized likelihood, policy gradient and maximum margin reward.

Maximum Marginalized Likelihood The maximum marginalized likelihood method maximizes the log-likelihood of the training data by marginalizing over the set of programs.

J_MML = log p(z_i|x_i, t_i) = log Σ_{y ∈ Y} p(z_i|y, t_i) p(y|x_i, t_i)    (4)

Because an answer is deterministically computed given a program and a table, we define p(z | y, t) as 1 or 0 depending on whether y evaluates to z given t. Let Gen(z, t) ⊆ Y be the set of compatible programs that evaluate to z given the table t. The objective can then be expressed as:

J_MML = log Σ_{y ∈ Gen(z_i, t_i)} p(y|x_i, t_i)    (5)

In practice, the summation over Gen(.) is approximated by only using the compatible programs in the set K generated by the search step.
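A sketch of this approximation: the MML objective of Eq. (5) computed over the candidate set K, where `log_prob` is assumed to return log p_θ(y|x, t) from the model and `consistent` tests whether a program evaluates to the gold answer.

```python
import math

def mml_objective(candidates, log_prob, consistent):
    """log sum_{y in K, e_t(y) = z} p_theta(y | x, t): Eq. (5) restricted to K."""
    log_probs = [log_prob(y) for y in candidates if consistent(y)]
    if not log_probs:
        return float("-inf")  # search found no program that evaluates to the gold answer
    m = max(log_probs)        # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs))
```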

Policy Gradient Methods Most reinforcement learning approaches for semantic parsing assume access to a reward function R : Y × X × Z → R, giving a scalar reward R(y, z) for a given program y and the correct answer z.² We can further assume, without loss of generality, that the reward is always in [0, 1]. Reinforcement learning approaches maximize the expected reward J_RL:

J_RL = E_{p(y|x_i, t_i)}[R(y, z_i)] = Σ_{y ∈ Y} p(y|x_i, t_i) R(y, z_i)    (6)

J_RL is hard to approximate using numerical integration since the reward for all programs may not be known a priori. Policy gradient methods solve this by approximating the derivative using a sample from the policy. When the search space is large, the policy may fail to sample a correct program, which can greatly slow down learning. Therefore, off-policy methods are sometimes introduced to bias the sampling towards programs that yield high reward. In those methods, an additional exploration policy u(y|x_i, t_i, z_i) is used to improve sampling. Importance weights are used to make the gradient unbiased (see Appendix for the derivation).
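A sketch of the one-sample off-policy estimate described above; `p_theta`, `u`, `reward`, and `grad_log_prob` are placeholder callables, and the gradient is represented as a NumPy vector.

```python
import random
import numpy as np

def off_policy_gradient_sample(candidates, p_theta, u, reward, grad_log_prob):
    """Estimate the policy gradient with a sample drawn from the exploration policy u.

    The importance weight p_theta(y)/u(y) keeps the estimate unbiased even though
    y_hat is drawn from u rather than from the model policy p_theta.
    """
    weights = [u(y) for y in candidates]
    y_hat = random.choices(candidates, weights=weights, k=1)[0]
    importance = p_theta(y_hat) / u(y_hat)
    return importance * reward(y_hat) * np.asarray(grad_log_prob(y_hat))
```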

Maximum Margin Reward For every training example (x_i, t_i, z_i), the maximum margin reward method finds the highest-scoring program y_i that evaluates to z_i as the reference program, from the set K of programs generated by the search. With a margin function δ : Y × Y × Z → R and reference program y_i, the set of programs V that violate the margin constraint can thus be defined as:

V = {y' ∈ K : score_θ(y', x_i, t_i) + δ(y_i, y', z_i) > score_θ(y_i, x_i, t_i)}    (7)

where δ(y, y', z) = R(y, z) − R(y', z). Similarly, the program that most violates the constraint can be written as:

ȳ = argmax_{y' ∈ K} { score_θ(y', x_i, t_i) + δ(y_i, y', z_i) }    (8)

The most-violation margin objective (negative margin loss) is thus defined as:

J_MMR = −max{0, score_θ(ȳ, x_i, t_i) − score_θ(y_i, x_i, t_i) + δ(y_i, ȳ, z_i)}

Unlike the previous two learning algorithms, margin methods only update the score of the reference program and the program that violates the margin.
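A sketch of the most-violation margin loss (the negative of J_MMR), under the assumption that the reference program is the highest-scoring candidate with full reward; `score` and `reward` are placeholder callables.

```python
def mmr_loss(candidates, gold_answer, score, reward):
    """Hinge loss on the most violating program relative to the reference program."""
    # Reference program: highest-scoring candidate that evaluates to the gold answer
    # (full reward is used here as a proxy for "evaluates to z_i").
    correct = [y for y in candidates if reward(y, gold_answer) == 1.0]
    if not correct:
        return 0.0  # no reference found by search; skip this example
    y_ref = max(correct, key=score)

    def violation(y):
        delta = reward(y_ref, gold_answer) - reward(y, gold_answer)  # margin delta
        return score(y) - score(y_ref) + delta

    y_bar = max(candidates, key=violation)      # most violating program, Eq. (8)
    return max(0.0, violation(y_bar))
```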

4.2 Generalized Update Equation

Although the algorithms described in §4.1 seem very different on the surface, the gradients of their loss functions can in fact be described in the same generalized form, given in Eq. (9).³ In addition to the gradient of the model scoring function, this equation has two variable terms, w(·) and q(·). We call the first term w(y, x, t, z) the intensity, which is a positive scalar value, and the second term q(y|x, t) the competing distribution, which is a probability distribution over programs. Varying them makes the equation equivalent to the update rule of the algorithms we discussed, as shown in Table 1. We also consider the meritocratic update policy, which uses a hyperparameter β to sharpen or smooth the intensity of maximum marginal likelihood (Guu et al., 2017). Intuitively, w(y, x, t, z) defines the positive part of the update equation, namely how aggressively the update favors program y. Likewise, q(y|x, t) defines the negative part of the learning algorithm, namely how aggressively the update penalizes the members of the program set. The generalized update equation provides a tool for better understanding individual algorithms, and helps shed some light on when a particular method may perform better.

Generalized Update Equation:

∆(K) = Σ_{y ∈ K} w(y, x, t, z) [ ∇_θ score_θ(y, x, t) − Σ_{y' ∈ Y} q(y'|x, t) ∇_θ score_θ(y', x, t) ]    (9)

Table 1: Parameter updates for various learning algorithms are special cases of Eq. (9), with different choices of intensity w and competing distribution q. We do not show the dependence upon the table t for brevity. For off-policy policy gradient, u is the exploration policy. For margin methods, y* is the reference program (see §4.1), V is the set of programs that violate the margin constraint (cf. Eq. (7)) and ȳ is the most violating program (cf. Eq. (8)). For REINFORCE, ŷ is sampled from K using p(·), whereas for off-policy policy gradient, ŷ is sampled using u(·).

Learning Algorithm | Intensity w(y, x, t, z) | Competing Distribution q(y|x, t)
Maximum Marginal Likelihood | p(z|y)p(y|x) / Σ_{y'} p(z|y')p(y'|x) | p(y|x)
Meritocratic(β) | (p(z|y)p(y|x))^β / Σ_{y'} (p(z|y')p(y'|x))^β | p(y|x)
REINFORCE | 1{y = ŷ} R(y, z) | p(y|x)
Off-Policy Policy Gradient | 1{y = ŷ} R(y, z) p(y|x) / u(y|x, z) | p(y|x)
Maximum Margin Reward (MMR) | 1{y = y*} | 1{y = ȳ}
Maximum Margin Avg. Violation Reward (MAVER) | 1{y = y*} | (1/|V|) 1{y ∈ V}
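The sketch below implements Δ(K) of Eq. (9) with the intensity and competing distribution passed in as callables, so the rows of Table 1 become different configurations of the same function; the competing distribution is normalized over the candidate set K, the practical approximation used when Y is too large to enumerate.

```python
import numpy as np

def generalized_update(candidates, intensity, competing, grad_score):
    """Compute Delta(K) = sum_y w(y) * (grad score(y) - sum_y' q(y') grad score(y')).

    intensity(y)  : scalar w(y, x, t, z)
    competing(y)  : unnormalized q(y | x, t); normalized here over `candidates`
    grad_score(y) : gradient of score_theta(y, x, t) as a NumPy vector
    """
    q = np.array([competing(y) for y in candidates], dtype=float)
    q = q / q.sum()
    grads = np.stack([grad_score(y) for y in candidates])
    expected_grad = q @ grads                      # sum_y' q(y') grad score(y')
    delta = np.zeros_like(expected_grad)
    for y, g in zip(candidates, grads):
        delta += intensity(y) * (g - expected_grad)
    return delta

# Example configuration (MMR-style, with hypothetical y_ref / y_bar identified by search):
#   w = lambda y: 1.0 if y == y_ref else 0.0
#   q = lambda y: 1.0 if y == y_bar else 0.0
#   theta += mu * generalized_update(K, w, q, grad_score)
```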

Intensity versus Search Quality In SpFD, the effectiveness of a learning algorithm is closely related to the quality of the search results, given that the gold program is not available. Intuitively, if the search quality is good, the update algorithm can be aggressive in updating the model parameters. When the search quality is poor, the algorithm should be conservative.

The intensity w(•) is closely related to the aggressiveness of the algorithm. For example, the maximum marginal likelihood is less aggressive given that it produces a non-zero intensity over all programs in the program set K that evaluate to the correct answer. The intensity for a particular correct program y is proportional to its probability p(y|x, t). Further, meritocratic update becomes more aggressive as β becomes larger.

In contrast, REINFORCE and maximum margin reward both have a non-zero intensity only on a single program in K. This value is 1.0 for maximum margin reward, while for reinforcement learning, this value is the reward. Maximum margin reward therefore updates most aggressively in favor of its selection while maximum marginal likelihood tends to hedge its bet. Therefore, the maximum margin methods should benefit the most when the search quality improves.

Stability The general equation also allows us to investigate the stability of a model update algorithm. In general, the variance of the update direction can be high, and hence less stable, if the model update algorithm has a peaky competing distribution, or if it puts all of its intensity on a single program. For example, REINFORCE only samples one program and puts non-zero intensity only on that program, so it can be unstable depending on the sampling results.

The competing distribution affects the stability of the algorithm. For example, maximum margin reward penalizes only the most violating program and is benign to other incorrect programs. Therefore, the MMR algorithm could be unstable during training.

New Model Update Algorithm The general equation provides a framework that enables the development of new variations or extensions of existing learning methods. For example, in order to improve the stability of the MMR algorithm, we propose a simple variant of maximum margin reward, which penalizes all violating programs instead of only the most violating one. We call this approach maximum margin average violation reward (MAVER), which is included in Table 1 as well. Given that MAVER effectively considers more negative examples during each update, we expect that it is more stable compared to the MMR algorithm.

5 Experiments

We describe the setup in §5.1 and results in §5.2.

5.1 Setup

Dataset We use the sequential question answering (SQA) dataset (Iyyer et al., 2017) for our experiments. SQA contains 6,066 sequences and each sequence contains up to 3 questions, with 17,553 questions in total. The data is partitioned into training (83%) and test (17%) splits. We use 4/5 of the original train split as our training set and the remaining 1/5 as the dev set. We evaluate using exact match on the answer. The previous state-of-the-art result on the SQA dataset is 44.7% accuracy, obtained with maximum margin reward learning.

Semantic Parser Our semantic parser is based on DynSP (Iyyer et al., 2017) , which contains a set of SQL actions, such as adding a clause (e.g., Select Column) or adding an operator (e.g., Max). Each action has an associated neural network module that generates the score for the action based on the instruction, the table and the list of past actions. The score of the entire program is given by the sum of scores of all actions.

We modified DynSP to improve its representational capacity. We refer to the new parser as DynSP++. Most notably, we included new features and introduced two additional parser actions. See Appendix 8.2 for more details. While these improvements help us achieve state-of-the-art results, the majority of the gain comes from the learning contributions described in this paper.

Hyperparameters For each experiment, we train the model for 30 epochs. We find the optimal stopping epoch by evaluating the model on the dev set. We then train on the train+dev set until the stopping epoch and evaluate the model on the held-out test set. Model parameters are trained using stochastic gradient descent with a learning rate of 0.1. We set the hyperparameter η for policy shaping to 5. All hyperparameters were tuned on the dev set. We use 40 lexical pairs for defining the co_occur score. We used common English superlatives (e.g., highest, most) and comparators (e.g., more, larger) and did not fit the lexical pairs to the dataset.

Given the model parameters θ, we use the base exploration policy defined in Iyyer et al. (2017). This exploration policy is given by b_θ(y | x, t, z) ∝ exp(λ · R(y, z) + score_θ(y, x, t)). R(y, z) is the reward function of the (possibly incomplete) program y, given the answer z. We use a reward function R(y, z) given by the Jaccard similarity of the gold answer z and the answer generated by the program y. The value of λ is set to infinity, which is essentially equivalent to sorting the programs based on the reward and using the current model score for tie breaking. Further, we prune all syntactically invalid programs. For more details, we refer the reader to Iyyer et al. (2017).
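A sketch of the Jaccard-similarity reward used here, assuming answers are represented as sets of table cells.

```python
def jaccard_reward(predicted, gold):
    """R(y, z): Jaccard similarity between the program's answer and the gold answer."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)
```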

Table 2: Experimental results on different model update algorithms, with and without policy shaping.

5.2 Results

Table 2 contains the dev and test results when using our algorithm on the SQA dataset. We observe that margin-based methods perform better than maximum likelihood methods and policy gradient in our experiments. Policy shaping generally improves performance across different algorithms. Our best test results outperform the previous state of the art by 5.0%.

Policy Gradient vs Off-Policy Gradient REINFORCE, a simple policy gradient method, achieved extremely poor performance. This is likely due to the problem of exploration and having to sample from a large space of programs. This is further corroborated by the much superior performance of off-policy policy gradient methods. Thus, the sampling policy is an important factor to consider for policy gradient methods.

The Effect Of Policy Shaping

We observe that the improvement due to policy shaping is 6.0% on the SQA dataset for MAVER and only 1.3% for maximum marginal likelihood. We also observe that as β increases, the improvement due to policy shaping for the meritocratic update increases. This supports our hypothesis that the aggressive updates of margin-based methods are beneficial when the search method is more accurate, compared to maximum marginal likelihood, which hedges its bets across all programs that evaluate to the right answer.

Stability of MMR In Section 4, the general update equation helped us point out that MMR could be unstable due to its peaky competing distribution, and MAVER was proposed to increase the stability of the algorithm. To measure stability, we cal- In Table 3, we show that by mixing MMR's intensity and MML's competing distribution, we can create an algorithm that outperforms MMR on the development set.

Table 3: The dev set results on the new variations of the update algorithms.

Policy Shaping helps against Spurious Programs To better understand whether policy shaping helps bias the search away from spurious programs, we analyze 100 training examples. We look at the highest-scoring program in the beam at the end of training using MAVER. Without policy shaping, we found that 53 programs were spurious; with policy shaping, this number came down to 23. We list a few examples of spurious program errors corrected by policy shaping in Table 4.

Table 4: Training examples and the highest ranked program in the beam search, scored according to the shaped policy, after training with MAVER. Using policy shaping, we can recover from failures due to spurious programs in the search step for these examples.

Question | without policy shaping | with policy shaping
"of these teams, which had more than 21 losses?" | SELECT Club WHERE Losses = ROW 15 | SELECT Club WHERE Losses > 21
"of the remaining, which earned the most bronze medals?" | SELECT Nation WHERE Rank = ROW 1 | FollowUp WHERE Bronze is Max
"of those competitors from germany, which was not paul sievert?" | SELECT Name WHERE Time (hand) = ROW 3 | FollowUp WHERE Name != ROW 5

Policy Shaping vs Model Shaping The critique policy contains useful information that can bias the search away from spurious programs. Therefore, one can also consider making the critique policy part of the model. We call this model shaping. We define our model to be the shaped policy and train and test using this new model. Using MAVER updates, we found that the dev accuracy dropped to 37.1%. We conjecture that the strong prior in the critique policy can hinder generalization in model shaping.

6 Related Work

Semantic Parsing from Denotation Mapping natural language text to a formal meaning representation was first studied by Montague (1970). Early work on learning semantic parsers relies on labeled formal representations as the supervision signal (Zettlemoyer and Collins, 2005, 2007; Zelle and Mooney, 1993). However, because getting access to gold formal representations generally requires expensive annotation by an expert, distant supervision approaches, where semantic parsers are learned from denotations only, have become the main learning paradigm (e.g., Clarke et al., 2010; Liang et al., 2011; Artzi and Zettlemoyer, 2013; Berant et al., 2013; Iyyer et al., 2017; Krishnamurthy et al., 2017). Guu et al. (2017) studied the problem of spurious programs, considered adding noise to diversify the search procedure, and introduced meritocratic updates.

Reinforcement Learning Algorithms Reinforcement learning algorithms have been applied to various NLP problems including dialogue (Li et al., 2016), text-based games (Narasimhan et al., 2015), information extraction (Narasimhan et al., 2016), coreference resolution (Clark and Manning, 2016), semantic parsing (Guu et al., 2017) and instruction following (Misra et al., 2017). Guu et al. (2017) show that policy gradient methods underperform maximum marginal likelihood approaches. Our result on the SQA dataset supports their observation. However, we show that using off-policy sampling, policy gradient methods can provide superior performance to maximum marginal likelihood methods.

Margin-based Learning Margin-based methods have been considered in the context of SVM learning. In the NLP literature, margin-based learning has been applied to parsing (Taskar et al., 2004; McDonald et al., 2005), text classification (Taskar et al., 2003), machine translation (Watanabe et al., 2007) and semantic parsing (Iyyer et al., 2017). Kummerfeld et al. (2015) found that max-margin-based methods generally outperform likelihood maximization on a range of tasks. Previous work has studied connections between margin-based methods and likelihood maximization in the supervised learning setting. We show them as special cases of our unified update equation for the distant supervision setting. Similar to this work, Lee et al. (2016) also found that, in the context of supervised learning, margin-based algorithms that update all violated examples perform better than those that only update the most violated example.

Latent Variable Modeling Learning semantic parsers from denotations can be viewed as a latent variable modeling problem, where the program is the latent variable. Probabilistic latent variable models have been studied using the EM algorithm and its variants (Dempster et al., 1977). The graphical model literature has studied latent variable learning in margin-based methods (Yu and Joachims, 2009) and probabilistic models (Quattoni et al., 2007). Samdani et al. (2012) studied various variants of the EM algorithm and showed that all of them are special cases of a unified framework. Our generalized update framework is similar in spirit.

7 Conclusion

In this paper, we propose a general update equation for semantic parsing from denotations and a policy shaping method for addressing the challenge of spurious programs. In the future, we plan to apply the proposed learning framework to more semantic parsing tasks and consider new methods for policy shaping.

¹ This transformation preserves the answer of the question.

² This is essentially a contextual bandit setting. Guu et al. (2017) also used this setting. A general reinforcement learning setting requires taking a sequence of actions and receiving a reward for each action. For example, a program can be viewed as a sequence of parsing actions, where each action can get a reward. We do not consider the general setting here.

³ See Appendix for the detailed derivation.

We also add recall features, which measure how many tokens in the question that are also present in the table are covered by a given program. To compute this feature, we first compute the set E1 of all tokens in the question that are also present in the table. We then find the set E2 of non-keyword tokens that are present in the program. The recall score is then given by w · |E1 − E2| / |E1|, where w is a learned parameter.
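A sketch of this recall feature, assuming token sets are precomputed and the keyword vocabulary is known; the learned weight w is passed in as a plain float here.

```python
def recall_feature(question_tokens, table_tokens, program_tokens, keywords, w=1.0):
    """w * |E1 - E2| / |E1|: question-and-table tokens not covered by the program."""
    e1 = {tok for tok in question_tokens if tok in table_tokens}
    if not e1:
        return 0.0
    e2 = {tok for tok in program_tokens if tok not in keywords}
    return w * len(e1 - e2) / len(e1)
```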