Paraphrasing vs Coreferring: Two Sides of the Same Coin


Abstract

We study the potential synergy between two different NLP tasks, both confronting predicate lexical variability: identifying predicate paraphrases, and event coreference resolution. First, we used annotations from an event coreference dataset as distant supervision to re-score heuristically-extracted predicate paraphrases. The new scoring improved average precision by more than 18 points over the ranking produced by the original scoring method. Then, we used the same re-ranking features as additional inputs to a state-of-the-art event coreference resolution model, which yielded modest but consistent improvements to the model's performance. The results suggest a promising direction to leverage data and models for each of the tasks to the benefit of the other.

1 Introduction

Recognizing that lexically-divergent predicates discuss the same event is a challenging task in NLP (Barhom et al., 2019). Lexical resources such as WordNet (Miller, 1995) capture synonyms (say, tell), hypernyms (whisper, talk), and antonyms, which can be used with caution when the arguments are reversed ([a_0] beat [a_1], [a_1] lose to [a_0]). But WordNet's coverage is insufficient, in particular missing context-specific paraphrases (hide, launder in the context of money). Distributional methods, on the other hand, enjoy broader coverage, but their precision is limited because distributionally similar terms are often mutually exclusive (born, die) or temporally or causally related (sentenced, convicted).

Two prominent lines of work pertaining to identifying predicates whose meaning or referents can be matched are cross-document (CD) event coreference resolution and recognizing predicate paraphrases. The former identifies and clusters event mentions across multiple documents that refer to the same event within their respective contexts, while the latter collects pairs of event expressions that may refer to the same events in certain contexts. Table 1 illustrates this difference with examples of predicate paraphrases that are not always co-referring.

Table 1: Examples from ECB+ that illustrate the context-sensitive nature of event coreference. The event mentions are co-referable in certain contexts but are not always co-referring in practice.
(1) Tara Reid has checked into (✓) Promises Treatment Center. / Actress Tara Reid entered (✓) well-known Malibu rehab center. / Lindsay Lohan checked into (✗) rehab in Malibu, California.
(2) Director Chris Weitz is expected to direct (✓) New Moon. / Chris Weitz will take on (✓) the sequel to "Twilight". / Gary Ross is still in negotiations to direct (✗) the sequel.

Event coreference resolution systems are typically supervised, using the ECB+ dataset (Cybulska and Vossen, 2014) which contains collections of news articles (documents) on different topics. Approaches for extracting predicate paraphrases are typically unsupervised, based on the similarity between the distribution of arguments (Lin and Pantel, 2001; Berant, 2012) , general paraphrasing approaches such as backtranslation (Barzilay and McKeown, 2001; Ganitkevitch et al., 2013; Mallinson et al., 2017) , or leveraging redundant news reports of the same event (Shinyama et al., 2002; Shinyama and Sekine, 2006; Barzilay and Lee, 2003; Zhang and Weld, 2013; Xu et al., 2014; Shwartz et al., 2017) .

In this paper, we study the potential synergy between predicate paraphrases and event coreference resolution. We show that the data and models for one task can benefit the other. In one direction, we use event coreference annotations from the ECB+ dataset as distant supervision to learn an improved scoring of predicate paraphrases in the unsupervised Chirps resource (Shwartz et al., 2017) . The distantly supervised scorer significantly improves upon ranking by the original Chirps scores, adding 18 points to average precision over a test sample.

In the other direction, we incorporate unlabeled data and features used for the re-scored Chirps method into a state-of-the-art (SOTA) event coreference system (Barhom et al., 2019). We first verify that Chirps has substantial coverage of coreferring mention pairs in ECB+. Consequently, incorporating the Chirps source data and features reduces the coreference merging errors in ECB+ by 15% and yields a modest but consistent improvement across the various coreference metrics.

2 Background

2.1 Event Coreference Resolution

Event coreference resolution aims to identify and cluster event mentions that, within their respective contexts, refer to the same event. The task has two variants: one in which coreferring mentions are within the same document (within-document), and another in which coreferring mentions can be in different documents (cross-document, CD), which is the focus of this paper.

The standard dataset for CD event coreference is ECB+ (Cybulska and Vossen, 2014), which extends its predecessors EECB (Lee et al., 2012) and ECB (Bejan and Harabagiu, 2010). ECB+ contains a set of topics, each containing documents describing the same global event. Both event and entity coreference are annotated in ECB+, for within- and cross-document coreference resolution.

Models for CD event coreference utilize a range of features, from lexical overlap among mention pairs and semantic knowledge from WordNet (Bejan and Harabagiu, 2010, 2014; Yang et al., 2015) to distributional (Choubey and Huang, 2017) and contextual representations (Kenyon-Dean et al., 2018; Barhom et al., 2019). The current SOTA model from Barhom et al. (2019) iteratively and intermittently learns to cluster events and entities. The mention representation m_i consists of several components corresponding to the span representation and the surrounding context. The interdependence between the two tasks is encoded into the mention representation, such that an event mention representation contains a component reflecting the current entity clustering, and vice versa. The model trains a pairwise mention scoring function which predicts the probability that two mentions refer to the same event.

2.2 Predicate Paraphrase Identification

There are three main approaches for identifying and collecting predicate paraphrases. The first approach considers a pair of predicate templates as paraphrases if the distributions of their argument instantiations are similar. For instance, in "[a_0] quit from [a_1]", [a_0] contains people names while [a_1] contains job titles. A semantically-similar template like "[a_0] resign from [a_1]" is expected to have similar argument distributions (Lin and Pantel, 2001; Berant, 2012). The second approach, backtranslation, is applied in a general paraphrasing setup and not specifically to predicates. The idea is that if two English phrases translate to the same term in a foreign language, across multiple foreign languages, it indicates that they are paraphrases. This approach was first suggested by Barzilay and McKeown (2001), later adapted to the large PPDB resource (Ganitkevitch et al., 2013), and also works well with neural machine translation (Mallinson et al., 2017).

Finally, the third approach, on which we focus in this paper, leverages multiple news documents discussing the same event. The underlying assumption is that such redundant texts may refer to the same entities or events using lexically-divergent mentions, and such co-referring mentions are considered paraphrases. Most work used this approach to extract sentential paraphrases. When long documents are used, the first step in this approach is to align each pair of documents by sentences. This is done by finding sentences with shared named entities (Shinyama et al., 2002) or lexical overlap (Barzilay and Lee, 2003; Shinyama and Sekine, 2006), or by aligning pairs of predicates or arguments (Zhang and Weld, 2013; Recasens et al., 2013).

More recent work uses news headlines from Twitter. Such texts are concise due to Twitter's character limit of 280 characters. Xu et al. (2014) extracted sentential paraphrases by finding pairs of tweets with a shared anchor word that discuss the same "trending topic". Lan et al. (2017) extracted pairs of tweets that link to the same URL.

Finally, Chirps (Shwartz et al., 2017), which we use in this paper, focuses on predicate paraphrases and has collected more than 5 million distinct pairs over the last 3 years. During its collection, Chirps retrieves tweets linking to news websites and aims to match tweets that refer to the same event. It extracts predicate-argument tuples from the tweets and considers as paraphrases pairs of predicates that appeared in tweets posted on the same day and whose arguments are heuristically matched. Each paraphrase pair is assigned a score:

s = n · (1 + d/N)

This score is proportional to the number of supporting pairs in which the two templates were paired (n), as well as to the number of days on which such pairings were found (d), where N is the total number of days over which the resource has been collected. The Chirps resource provides the scored predicate paraphrases as well as the supporting pairs for each paraphrase.
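For concreteness, here is a minimal sketch of this scoring scheme in Python; the function name and example values are ours, and only the formula above is taken from the resource description:

```python
def chirps_score(n: int, d: int, N: int) -> float:
    """Original Chirps score for a predicate paraphrase pair.

    n: number of supporting tweet pairs for the paraphrase pair.
    d: number of distinct days on which supporting pairs were found.
    N: total number of days over which the resource was collected.
    """
    return n * (1 + d / N)

# Example: 12 supporting pairs spread over 9 days of a 1,000-day collection period:
# chirps_score(12, 9, 1000) -> 12 * (1 + 0.009) = 12.108
```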

Human evaluation showed that this scoring is effective, and that the percentage of correct paraphrases is higher for highly scored paraphrases. At the same time, due to the heuristic collection and scoring of predicate paraphrases in Chirps, entries in the resource may suffer from two types of errors: (1) type 1 error, i.e., the heuristic recognized pairs of non-paraphrase predicates as paraphrases. This happens when the same arguments participate in multiple events, as in the following paraphrases: "[Police]_0 arrest [man]_1" and "[Police]_0 shoot [man]_1"; and (2) type 2 error, i.e., the scoring function assigned a low score to a rare but correct paraphrase pair, as in "[a_0] outgun [a_1]" and "[a_0] outperform [a_1]", with a single supporting pair.

3 Re-scoring Predicate Paraphrases

To improve the scoring of Chirps paraphrase pairs, we train a new scorer using distant supervision. We first describe the features we extract to represent a paraphrase pair (Section 3.1). We then describe the distant supervision that we derived semi-automatically from the ECB+ training set (Section 3.2). Finally, we provide the implementation details (Section 3.3).

3.1 Features

Each paraphrase pair consists of two predicate templates p_1 and p_2, accompanied by the n supporting pairs associated with this paraphrase pair: support-pairs(p_1, p_2) = {(t_1^1, t_2^1), ..., (t_1^n, t_2^n)}. Each tweet included in Chirps links to a news article, whose content we retrieve. We extract the following features for a predicate paraphrase pair (p_1, p_2) (see the appendix for the full list of features in f_{p_1,p_2} ∈ R^17):

Named Entity Coverage: While the original method did not utilize the content of the linked news article, we find it useful for retrieving more information about the event. Specifically, it might help mitigate errors in Chirps' argument matching mechanism, which relies on argument alignment considering only the text of the two tweets. We found that the original mechanism worked particularly well for named entities, while being more error-prone for common nouns, which might require additional context.

Given (t_1^i, t_2^i) ∈ support-pairs(p_1, p_2), we use spaCy (Honnibal and Montani, 2017) to extract two sets of named entities, NE_1 and NE_2, respectively, where NE_j contains the named entities mentioned in the tweet t_j^i and in the first paragraph of its corresponding news article. We define a Named Entity Coverage score, NEC, as the maximum ratio of named entity coverage of one article by the other:

NEC(NE_1, NE_2) = max(|NE_1 ∩ NE_2| / |NE_1|, |NE_1 ∩ NE_2| / |NE_2|)

We manually annotated a small balanced training set of 121 tweet pairs and used it to tune a score threshold T = 0.26, such that pairs of tweets whose NEC is at least T are considered coreferring. Finally, we include the following features: the number of coreferring tweet pairs (whose NEC score exceeds T ) and the average NEC score of these pairs.
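A minimal sketch of how the NEC score and these two derived features could be computed is shown below; the get_named_entities helper (which would run spaCy NER over a tweet and the first paragraph of its linked article) is hypothetical, and only the formula and the threshold T = 0.26 come from the text above:

```python
def nec(ne1: set, ne2: set) -> float:
    """Named Entity Coverage: maximal ratio of coverage of one NE set by the other."""
    if not ne1 or not ne2:
        return 0.0
    inter = len(ne1 & ne2)
    return max(inter / len(ne1), inter / len(ne2))


def nec_features(support_pairs, get_named_entities, threshold=0.26):
    """Returns (number of tweet pairs whose NEC >= threshold, their average NEC)."""
    scores = [nec(get_named_entities(t1), get_named_entities(t2))
              for t1, t2 in support_pairs]
    above = [s for s in scores if s >= threshold]
    return len(above), (sum(above) / len(above) if above else 0.0)
```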

Cross-Document Coreference Resolution:

We apply the state-of-the-art cross-document coreference model from Barhom et al. (2019) to data constructed such that each tweet constitutes a document and each pair of tweets t_1^j and t_2^j in support-pairs(p_1, p_2) forms a topic. As input to the model, in each tweet we mark the corresponding predicate span as an event mention and the two argument spans as entity mentions. The model outputs whether the two event mentions corefer (yielding a single event coreference cluster for the two mentions) or not (yielding two singleton clusters). Similarly, it clusters the four arguments into entity coreference clusters.

Differently from Chirps, this model makes its event clustering decision based on the predicate, arguments, and context, as opposed to the arguments alone. Thus, we expect it not to cluster predicates whose arguments match lexically if their contexts or predicates do not match (first example in Table 2). In addition, the model's mention representations might help identify lexically-divergent yet semantically-similar arguments (second example in Table 2).

Table 2: Examples of coreference errors made by Chirps and corrected by Barhom et al. (2019): 1) false positive: wrong man / two men alignment (disregarding location modifiers). 2) (hypothetical) false negative: lexically-divergent yet semantically-similar arguments.

For a given pair of tweets, we extract the following binary features with respect to the predicate mentions: perfect match when the predicates are assigned to the same cluster, and no match when each predicate forms a singleton cluster. For argument mentions, we extract the following features: perfect match if the two a_0 arguments belong to one cluster and the two a_1 arguments belong to another cluster; reversed match if at least one of the a_0 arguments is clustered as coreferring with the a_1 argument in the other tweet; and no match otherwise.
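A sketch of how these binary features might be derived from the predicted clusters is given below; the cluster_of accessor, which maps a mention to its predicted cluster id, is a hypothetical stand-in for the actual model output format:

```python
def event_match_features(cluster_of, pred1, pred2):
    """Perfect match: both predicates in one cluster; no match: two singleton clusters."""
    same = cluster_of(pred1) == cluster_of(pred2)
    return {"event_perfect": same, "event_no_match": not same}


def entity_match_features(cluster_of, a0_1, a1_1, a0_2, a1_2):
    """Perfect / reversed / no match over the two a_0 and two a_1 argument mentions."""
    perfect = (cluster_of(a0_1) == cluster_of(a0_2)
               and cluster_of(a1_1) == cluster_of(a1_2))
    reversed_match = (cluster_of(a0_1) == cluster_of(a1_2)
                      or cluster_of(a1_1) == cluster_of(a0_2))
    return {"entity_perfect": perfect,
            "entity_reversed": not perfect and reversed_match,
            "entity_no_match": not perfect and not reversed_match}
```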

Connected Components: The original Chirps score of a predicate paraphrase pair is proportional to two parameters: (1) the number of supporting pairs; (2) the ratio of the number of days on which supporting pairs were published relative to the entire collection period. The latter lowers the score of wrong paraphrase pairs which were mistakenly aligned on relatively few days (e.g., due to misleading argument alignments in particular events). The number of days on which the predicates were aligned is taken as a proxy for the number of global events in which the predicates co-refer. Here, we aim to get a more accurate split of the tweets into global events by constructing a graph with tweets as nodes and looking for connected components. To that end, we define a bipartite graph G_{p_1,p_2} = (V, E), where V = tweets(p_1, p_2) contains all the tweets in which p_1 or p_2 appeared, and E = support-pairs(p_1, p_2).

We compute C, the set of connected components in G_{p_1,p_2}, and define the following feature: #connected(p_1, p_2) = |{c ∈ C : |c| > 2}|. A larger number of connected components indicates that the two predicates were aligned across a larger number of global events.
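A possible implementation of the bipartite tweet graph and the connected-component features using networkx is sketched below (representing tweets as hashable ids is our assumption):

```python
import networkx as nx


def connected_component_features(tweets, support_pairs):
    """Build G_{p1,p2} and return (#components with more than 2 tweets, average component size)."""
    g = nx.Graph()
    g.add_nodes_from(tweets)         # all tweets in which p_1 or p_2 appeared
    g.add_edges_from(support_pairs)  # one edge per supporting tweet pair
    components = list(nx.connected_components(g))
    num_large = sum(1 for c in components if len(c) > 2)
    avg_size = sum(len(c) for c in components) / len(components) if components else 0.0
    return num_large, avg_size
```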

Clique: We similarly build a global tweet graph over all the predicate pairs, G_all = (V', E'), where V' = ∪_{(p_1,p_2)} tweets(p_1, p_2) and E' = ∪_{(p_1,p_2)} support-pairs(p_1, p_2). We compute Q, the set of cliques in G_all. We assume that a pair of tweets is more likely to be coreferring if it is part of a bigger clique, whereas tweets that were extracted by mistake would not share many neighbors. We extract the following clique coverage feature for a candidate paraphrase pair:

CLC(p_1, p_2) = |{(t_1^j, t_2^j) ∈ support-pairs(p_1, p_2) : ∃q ∈ Q such that t_1^j ∈ q ∧ t_2^j ∈ q}|
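A corresponding sketch of the clique-coverage feature, using networkx's enumeration of maximal cliques as an approximation of Q (restricting to cliques larger than a single edge is our assumption; otherwise every supporting pair would count trivially):

```python
import networkx as nx


def clique_coverage(all_support_pairs, pair_support_pairs):
    """CLC(p_1, p_2): number of supporting tweet pairs contained in some maximal clique
    of the global tweet graph built from all predicate pairs."""
    g_all = nx.Graph()
    g_all.add_edges_from(all_support_pairs)
    cliques = [set(q) for q in nx.find_cliques(g_all) if len(q) > 2]
    return sum(1 for t1, t2 in pair_support_pairs
               if any(t1 in q and t2 in q for q in cliques))
```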

3.2 Distantly Supervised Labels

In order to learn to score the paraphrases, we need gold standard labels, i.e., labels indicating whether a pair of predicate templates collected by Chirps are indeed paraphrases. Instead of collecting manual annotations, we chose a low-budget distant supervision approach. To that end, we leverage the similarity between the predicate paraphrase extraction and the event coreference resolution tasks, and use the annotations from the ECB+ dataset.

As positive examples, we consider all pairs of predicates p_1, p_2 from Chirps that appear in the same event cluster in ECB+, e.g., from {talk, say, tell, accord to, statement, confirm} we extract (talk, say), (talk, tell), ..., (statement, confirm).

Obtaining negative examples is a bit trickier. We consider negative examples to be pairs of predicates p_1, p_2 from Chirps which are under the same topic, but in different event clusters in ECB+, e.g., given the clusters {specify, reveal, say} and {get}, we extract (specify, get), (reveal, get), and (say, get).
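A minimal sketch of this pair-generation step (the data structures and the in_chirps lookup are assumptions; the resulting pairs are further filtered and validated as described below):

```python
from itertools import combinations


def distant_labels(topic_clusters, in_chirps):
    """topic_clusters: event clusters (sets of predicate templates) within one ECB+ topic.
    in_chirps(p1, p2): True if the pair has an entry in Chirps."""
    positives, negatives = [], []
    for cluster in topic_clusters:
        positives += [(p1, p2) for p1, p2 in combinations(sorted(cluster), 2)
                      if in_chirps(p1, p2)]
    for c1, c2 in combinations(topic_clusters, 2):
        negatives += [(p1, p2) for p1 in c1 for p2 in c2 if in_chirps(p1, p2)]
    return positives, negatives
```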

Note that the ECB+ annotations are context-dependent. Thus, a pair of predicates that is in principle coreferable may be annotated as non-coreferring in a given context. To reduce the rate of such false-negative examples, we validated all the negative examples and a sample of the positive examples using Amazon Mechanical Turk. Following Shwartz et al. (2017), we annotated the templates with 3 argument instantiations from their original tweets; thus, we only included in the final data predicate pairs with at least 3 supporting pairs. We required that workers have a 99% approval rate on at least 1,000 prior tasks and pass a qualification test.

Each example was annotated by 3 workers. We aggregated the per-instantiation annotations using majority vote and considered a pair as positive if at least one instantiation was judged as positive. The data statistics are given in Table 3. The validation phase balanced the positive-negative proportion of instances in the data from approximately 1:7 to approximately 4:5.

Table 3: Statistics of the paraphrase scorer dataset. The difference in size before and after the annotation is due to omitting examples with less than 3 supporting pairs.

3.3 Model

We trained a random forest classifier (Breiman, 2001) implemented in the scikit-learn framework (Pedregosa et al., 2011). To tune the hyperparameters, we ran a 3-fold cross-validation randomized search, yielding the following values: 157 estimators, max depth of 8, minimum samples leaf of 1, and minimum samples split of 10.¹
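A sketch of this training and tuning setup with scikit-learn is shown below; the feature matrix and labels come from Sections 3.1 and 3.2, while the exact search space shown here is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def train_scorer(X_train, y_train):
    """X_train: (num_pairs, 17) feature matrix; y_train: distantly supervised labels."""
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={
            "n_estimators": list(range(50, 300)),
            "max_depth": list(range(2, 16)),
            "min_samples_leaf": [1, 2, 4],
            "min_samples_split": [2, 5, 10],
        },
        n_iter=50,
        cv=3,            # 3-fold cross-validation, as described above
        scoring="f1",
        random_state=0,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_  # the paper reports 157 trees, depth 8, leaf 1, split 10
```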

3.4 Evaluation

We used the model for two purposes: (1) classification: determining whether a pair of predicate templates are paraphrases; and (2) ranking: ordering the pairs by the predicted positive-class score.

Classifier Results. In Table 4 we report the precision, recall, F1, and accuracy scores on the distantly supervised test set from Section 3.2. We compared our classifier with two baselines: one based on the original Chirps scores, and another based on cosine similarity between the two mentions using GloVe embeddings (Pennington et al., 2014).² For both baselines, we used the train set to learn a threshold T such that predicate pairs whose score exceeds T are considered paraphrases. Our classifier substantially improved upon the baselines in both accuracy and F1, by decreasing the false-positive error rate.
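The two usage modes, assuming a scikit-learn classifier as sketched above, could look as follows:

```python
from sklearn.metrics import average_precision_score


def classify_and_rank(scorer, X_test, y_test):
    scores = scorer.predict_proba(X_test)[:, 1]   # positive-class probability per pair
    labels = scorer.predict(X_test)               # (1) binary paraphrase decisions
    ap = average_precision_score(y_test, scores)  # (2) quality of the induced ranking
    return labels, scores, ap
```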

Table 4: Test set results of the classifier and the scorer.

Ranking Results. Table 5 exemplifies the ranking of top predicate pairs by our scorer and by the original Chirps scorer. We report the average precision on the entire test set ("AP (all)") and on a random subset of 500 predicate pairs from the annotated data with at least 6 supporting pairs each ("AP (500)"), on which our scorer outperforms the Chirps scorer by 8 points. The results are statistically significant using bootstrap and permutation tests with p < 0.001 (Dror et al., 2018).

Table 5: Average precision on 500 random pairs and on the entire set, along with top 10 ranked test set pairs by Chirps and our method. Pairs labeled as positive are highlighted in purple.
Table 6: Event mentions coreference-resolution results on ECB+ test set.

4 Leveraging A Paraphrasing Resource To Improve Coreference

In Section 3 we showed that CD event coreference annotations can be used to improve predicate paraphrase ranking. In this section, we show that this co-dependence can be used in both directions, and that leveraging the improved predicate paraphrase resource can benefit CD coreference. Another way to look at this is as an extrinsic evaluation of the improved Chirps. As a preliminary analysis, we computed Chirps' coverage of pairs of co-referring events in ECB+, and found approximately 30% coverage overall and above 50% coverage when considering verbal mentions only, as detailed in Table 7.³

Table 7: Chirps coverage of co-referring mention pairs in ECB+.

4.1 Integration Method

Barhom et al. (2019) trained a pairwise mention scoring function, S(m_i, m_j), which predicts the probability that two mentions m_i, m_j refer to the same event. The input to S(m_i, m_j) is

v_{i,j} = [v(m_i); v(m_j); v(m_i) • v(m_j); f(i, j)],

where • denotes element-wise multiplication and f(i, j) consists of various binary features. We extended the model by changing the input to the pairwise event mention scoring function S to

v'_{i,j} = [v_{i,j}; c_{i,j}],

where c_{i,j} denotes the Chirps component. We compute c_{i,j} in the following way:

c_{i,j} = NN(f_{m_i,m_j}) if (m_i, m_j) ∈ Chirps, and c_{i,j} = NN(0) otherwise,

where f_{m_i,m_j} ∈ R^17 is the feature vector representing a pair of predicates (m_i, m_j) for which there is an entry in Chirps, and NN is an MLP with a single hidden layer of size 50, as illustrated in Figure 1. The rest of the model remains the same, including the model architecture, training, and inference.

Figure 1: New mention pair representation.
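A sketch of the Chirps component in PyTorch is given below, under the assumptions that pairs absent from Chirps are backed off to an all-zero feature vector (as in the formula above) and that the MLP's output dimension is a free hyperparameter; the integration into the Barhom et al. (2019) scorer itself is not shown:

```python
import torch
import torch.nn as nn


class ChirpsComponent(nn.Module):
    """Maps the 17-dim Chirps feature vector of a mention pair to the component c_{i,j}."""

    def __init__(self, in_dim=17, hidden_dim=50, out_dim=50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, features):
        # features: (batch, 17); all-zero rows stand in for pairs with no Chirps entry
        return self.mlp(features)


# The extended pairwise input concatenates the original representation with c_{i,j}:
# chirps = ChirpsComponent(); v_prime = torch.cat([v_ij, chirps(f_ij)], dim=-1)
```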

4.2 Evaluation

We evaluate the event coreference performance on ECB+ using the official CoNLL scorer (Pradhan et al., 2014). The reported metrics are MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAF-e (Luo, 2005), and CoNLL F1 (the average of the MUC, B³, and CEAF-e scores).

³ Non-verbal mentions in ECB+ include nominalizations (investigation), names (Oscars), acronyms (DUI), and more.

The results in Table 6 show that the Chirps-enhanced model provides a small improvement over the baseline in the F1 scores, most of all in the link-based MUC. The performance difference (CoNLL F1) of 0.5 points is statistically significant according to bootstrap and permutation tests with p < 0.001.

4.3 Errors Recovered By Chirps

We analyze the cases in which incorporating knowledge from Chirps helped the model overcome the two types of error:

1. False Positive: the original model clustered a non-coreferring mention pair together, while our model did not. We found 314/1,322 such pairs (23.75%), exemplified in the top part of Table 8.

2. False Negative: coreferring mention pairs that were assigned different clusters by the original model and the same cluster by ours. We found 299/2,823 such pairs (10.59%), exemplified in the bottom part of Table 8.

Table 8: Examples of false positive and false negative errors on ECB+ recovered by Chirps, together with the cosine similarity scores between the predicates, using GloVe (Pennington et al., 2014).

Although the gap between our model and the original model by Barhom et al. (2019) is statistically significant, it is rather small. We can attribute it partly to the coverage of Chirps over ECB+ (around 30%) which entails that most event mentions have the same representation as in the original model. We also note that ECB+ suffers from annotation errors, as observed by Barhom et al. (2019) and others.

5 Conclusion And Future Work

We studied the synergy between the tasks of identifying predicate paraphrases and event coreference resolution, both pertaining to consolidating lexically-divergent mentions, and showed that they can benefit each other. Using event coreference annotations as distant supervision, we learned to re-rank predicate paraphrases that were initially ranked heuristically, and managed to increase their average precision substantially. In the other direction, we incorporated knowledge from our re-ranked predicate paraphrase resource into a model for event coreference resolution, yielding a small improvement upon previous state-of-the-art results. We hope that our study will encourage future research to make progress on both tasks jointly.


A Features List

The full feature list of a given predicate paraphrase pair (p_1, p_2) (Section 3.1) includes features from several categories:

A.1

6. Days of available supporting pairs: the total number of days d on which the supporting pairs above occurred in the available tweets.

A.2 Named Entity Features

As described in Section 3.1:

7. Above Threshold: the number of pairs with an NEC score of at least T.

8. Average Above Threshold: average of NEC scores for pairs with a score of at least T .

9. Perfectly Clustered with NE Coverage: the number of pairs with NEC score of at least T and perfect clustering for event coreference resolution (Section A.4).

A.3 Graph Features

As described in Section 3.1:

10. Number of connected components: #connected(p_1, p_2) = |{c ∈ C : |c| > 2}|.

11. Average connected component: the average size of the connected components in G_{p_1,p_2}.

12. In Clique: the number of pairs in support-pairs(p_1, p_2) that are in a clique.

A.4 Cross-Document Coreference Features

As described in Section 3.1:

13. Event Perfect: number of event pairs with perfect match.

14. Event No Match: number of event pairs with no match.

15. Entity Perfect: number of entity pairs with perfect match.

16. Entity Reverse: number of entity pairs with reverse match.

17. Entity No Match: number of entity pairs with no match.

¹ We chose a random forest over a neural model because of the small size of the training set, and because it yielded the best performance on the validation set.

² Multi-word predicates were represented by the average of their word vectors.