
DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion


Abstract

Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically generating fusion examples from raw text and present DiscoFuse, a large-scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text and decomposing the text into two independent sentences. We apply our approach to two document collections, Wikipedia and Sports articles, yielding 60 million fusion examples annotated with the discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic and human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.

1 Introduction

Sentence fusion is the task of combining several independent sentences into a single coherent text (Barzilay and McKeown, 2005). Sentence fusion is important in many NLP applications, including retrieval-based dialogue (Song et al., 2018; Yan and Zhao, 2018), text summarization (Barzilay and McKeown, 2005; Bing et al., 2015) and question answering (Marsi and Krahmer, 2005). Such systems retrieve multiple sentences from different sources, documents or paragraphs, and use them to construct a coherent text. Sentence fusion is challenging because it requires understanding the discourse semantics between the input sentences. Consider the example in Figure 1: a coherent fusion of the sentences requires understanding that the second sentence contrasts the first one, in order to insert the discourse connective "However". In addition, the gender and syntactic role of the entity "Zeitler" need to be inferred in order to insert the pronoun "he".

Figure 1: Example for two independent sentences, and their fusion. The modifications applied are pronominalization (blue) and connective insertion (red).

Prior work on sentence fusion (Barzilay and McKeown, 2005; Turner and Charniak, 2005; Filippova, 2010; Elsner and Santhanam, 2011; Thadani and McKeown, 2013; Bing et al., 2015; Chali et al., 2017) utilized very small amounts of labeled data, which are insufficient to train modern neural models. In this work, we propose a method for automatically generating sentence fusion examples at scale from raw text corpora.

To this end, we go over sentences and contiguous pairs of sentences in a corpus, and apply a set of manually-constructed rules, which identify the occurrence of prevalent fusion operations. The rules specify how to modify the sentences such that they are "unfused" into two independent sentences. E.g., in Figure 1 one rule will delete the discourse connective "However", and another will replace the pronoun "he" with the named entity "Zeitler".

In the generated examples, the original fused text becomes the target, and the unfused sentences (generated by rules) are the input. Importantly, sentence fusion models trained on our data cannot simply learn to invert rule application, because information is lost and can be recovered only by understanding the text semantics. As mentioned, learning to insert "However" in Figure 1 requires inferring that the sentences contrast. We cover a wide range of fusion phenomena such as inserting discourse connectives in various positions of the sentences, anaphora and cataphora identification, and sentence merging through coordination, relative clauses and apposition.

We applied our method on two large document collections, Wikipedia and sports articles from the Web, resulting in two datasets of 16 million and 44 million examples respectively. We call the combined dataset DISCOFUSE. We extensively analyze the quality of our dataset with crowdsourcing, and find that workers understand the text after splitting in 85% of the cases, and the other 15% are due to either the original text being unclear or errors in rule application.

We trained a state-of-the-art sequence-to-sequence model (Vaswani et al., 2017) and analyzed the fusion phenomena in which the model struggles. We found that the model succeeds in fusing sentences through structural constructions such as apposition or relative clauses, but performs badly when the fusion involves inserting a particular discourse connective or selecting pronominals.

Last, we performed transfer learning by training on DISCOFUSE and then fine-tuning on a smaller dataset from a different distribution. To this end, we utilize WEBSPLIT, a recent dataset for sentence splitting (Narayan et al., 2017; Aharoni and Goldberg, 2018) , viewing WEBSPLIT as a sentence fusion task. We found that pre-training on DISCOFUSE substantially improves the performance of a fusion model in this setup.

To conclude, our contributions are:
1. DISCOFUSE: a dataset of 60 million sentence fusion examples from two different corpora.
2. A method for automatically generating sentence fusion examples from raw text.
3. Automatic and human evaluation of the Transformer model on the fusion task.
4. A transfer learning setting in which model performance improves when pre-trained with DISCOFUSE.

The DISCOFUSE dataset is publicly available at: https://discofuse.page.link/data.

2 Background

Existing fusion datasets are small, which is perhaps why only a few works have explored the application of supervised models to sentence fusion (Elsner and Santhanam, 2011; Thadani and McKeown, 2013). McKeown et al. (2010) introduced a human-generated corpus of 3,000 examples. Elsner and Santhanam (2011) extracted around 300 fusion examples from pre- and post-editing of news articles. Thadani and McKeown (2013) constructed 1,858 examples from summarization tasks. Such datasets are too small to train modern data-hungry neural models.

Related to sentence fusion is its "inverse" task of sentence splitting. Collados (2013) automatically constructed a Spanish simplification dataset by splitting single sentences into several simpler ones. Recently, two larger datasets for text splitting were released (Botha et al., 2018; Narayan et al., 2017; Aharoni and Goldberg, 2018). However, using these datasets for the "mirror" task of sentence fusion is problematic. First, sentence splitting often involves removing content from the original sentence for simplification, and this content is impossible to recover in the fusion direction. Second, these datasets do not focus on discourse, and thus prominent discourse phenomena may be missed. Last, our new dataset is more than an order of magnitude larger than the above sentence splitting datasets.

Another related line of recent work focused on predicting discourse connectives between sentences and automatically generating examples from raw text (Liu et al., 2016; Malmi et al., 2018). We substantially expand over those works by handling more diverse linguistic phenomena, such as connectives in single sentences, anaphora and cataphora constructions, relative clauses, coordination and more, all represented in a single dataset. Moreover, our dataset is 20x larger than those of prior work, allowing us to examine long-tail scenarios in depth.

3 The Discofuse Dataset

We next describe our process for building DISCOFUSE, which contains 60 million sentence fusion examples from two different document collections: Wikipedia and Web articles about sports.

3.1 Example Generation

DISCOFUSE contains union-fusion examples, i.e. fusing sentences without loss of content (Marsi and Krahmer, 2005) . To automatically extract examples, we manually crafted a list of text splitting rules. Our rule-set covers 9 fusion phenomena, including handling discourse connectives, coordination and relative clauses, and entity resolution for anaphora and cataphora constructions. For entity resolution, both anaphoric pronouns ("she", "they", "his") and anaphoric nominals ("the team", "the man") are considered, based on the output of a coreference system. The covered phenomena are summarized in Table 1 and a detailed description is given in Appendix A.1.

Table 1: Generated fusion examples for different phenomena. The input text is marked in uppercase blue, and the generated sentence pair is marked in lowercase red. We show in boldface parts that allow us to detect the target phenomenon.

Given a text t consisting of one or two consecutive sentences, each of our rules addresses a specific discourse phenomenon and has two parts: (a) conditions for matching the phenomenon in t, and (b) operations over a dependency tree annotated with coreference resolution. Applying the operations generates a fusion example (x = (s_1, s_2), t), in which (s_1, s_2) are two independent sentences originating from t, but stripped of the discourse phenomenon that tied them in t. Figure 2 gives an example of a rule for the apposition structure. The rule is applied to the sentence "The Jacksonville Jazz Piano Competition, a 30 year tradition, takes place at the Florida Theatre". First, the input is matched to the rule's condition. In this case, the condition is a single clause surrounded by two commas, which has a determiner as its first token and includes an apposition with an incoming edge from a preceding token to the clause. Once matched, an example is generated. For this rule, the first sentence is created by removing the apposition clause, and the second sentence by removing the part after the clause and inserting the appropriate "be" verb ("is"). Generation examples for all 9 rule types are provided in Table 1.

Figure 2: Example generation rule for apposition. Given an input text and its dependency tree, we check for a match with the apposition pattern. We then use the dependency tree to split the sentence and create a new example.
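
To make the rule concrete, the following is a minimal sketch of an apposition splitting rule. It assumes spaCy's English model as a stand-in for the Google Cloud Natural Language annotations used in the paper; the function name and the exact matching conditions are illustrative, not from the released pipeline, and the output depends on the parser.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def split_apposition(text):
    """Return (s1, s2) if `text` matches a simple apposition pattern, else None."""
    doc = nlp(text)
    for tok in doc:
        # An apposition whose head precedes it (incoming edge from an earlier token).
        if tok.dep_ != "appos" or tok.head.i >= tok.i:
            continue
        clause = sorted((t for t in tok.subtree if not t.is_punct), key=lambda t: t.i)
        start, end = clause[0].i, clause[-1].i
        # The clause must be delimited by commas and start with a determiner.
        if start == 0 or end + 1 >= len(doc):
            continue
        if doc[start - 1].text != "," or doc[end + 1].text != "," or clause[0].pos_ != "DET":
            continue
        # s1: the original sentence with the apposition clause (and its commas) removed.
        s1 = (doc[: start - 1].text + " " + doc[end + 2 :].text).strip()
        # s2: the head noun phrase + "is" + the apposition clause (tense handling omitted).
        head_np = doc[tok.head.left_edge.i : tok.head.i + 1].text
        s2 = head_np + " is " + doc[start : end + 1].text + " ."
        return s1, s2
    return None

print(split_apposition(
    "The Jacksonville Jazz Piano Competition, a 30 year tradition, "
    "takes place at the Florida Theatre."))
```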

As explained in Section 1, solving sentence fusion involves more than just reverse-engineering the generation rules. The model needs to decide whether to insert a discourse connective with the right semantics, whether to merge the input sentences, and what syntactic construction (relative clause, coordination, apposition) is most appropriate in the given context.

Last, often several discourse phenomena occur in a single text t. Thus, we allow combining anaphora rules with one of the following rule types: discourse connective, inner connective and sentence coordination, which cover frequent combinations in our texts.

3.2 Building The Discofuse Dataset

To create DISCOFUSE we retrieved the latest Wikipedia release and crawled the Web for several million sports articles. Documents were annotated with dependency trees and coreference resolution using Google Cloud Natural Language. 1 We considered each sentence and pair of consecutive sentences in each document as candidates, applying the example generation process described in Section 3.1. Additionally, we added as examples sentence pairs from the original corpus that did not match any rule, that is (s 1 , s 2 ) = t, so that a trained model would also learn when not to change the input. We filtered out examples with sentence length ≤ 6 tokens, and examples with non-ASCII characters.
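
The corpus pass just described can be sketched as follows. The sketch assumes a hypothetical list of splitting functions (such as split_apposition above), each returning (s1, s2) or None; whether the length filter applies to the generated sentences or the original text is an assumption here, and the released pipeline may differ.

```python
def generate_examples(sentences, rules):
    """Yield ((s1, s2), target) fusion examples from one document's sentence list."""
    def keep(*texts):
        # Filtering as described above: drop short sentences and non-ASCII text.
        return all(len(t.split()) > 6 and t.isascii() for t in texts)

    # Single sentences and pairs of consecutive sentences are both candidates.
    candidates = [(s,) for s in sentences] + list(zip(sentences, sentences[1:]))

    for cand in candidates:
        target = " ".join(cand)
        split = None
        for rule in rules:              # e.g. rules = [split_apposition, ...]
            split = rule(target)
            if split is not None:
                break
        if split is not None and keep(*split):
            yield split, target         # a rule stripped the discourse phenomenon
        elif len(cand) == 2 and keep(*cand):
            yield cand, target          # unmatched pair kept as-is: learn to "do nothing"
```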

This process resulted in 44,177,443 Sports examples and 16,642,323 Wikipedia examples. We randomly split these examples into 98% train, 1% dev, and 1% test sets, making sure that each document contributes examples to only one of the split sets.

Like prior work (Malmi et al., 2018), we observed a skewed distribution of discourse phenomena in the data. Specifically, examples with anaphora or the connectives "and" and "but" constitute 99.7% of Sports examples and 59% of Wikipedia examples. Such a skewed distribution is likely to bias models and fails to elucidate their ability to capture a wide range of linguistic phenomena. Therefore, we constructed a version of DISCOFUSE by down-sampling examples containing "and", "but" or anaphora. The down-sampled dataset contains 12,080,513 Sports examples and 4,581,352 Wikipedia examples.
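
The down-sampling step could be sketched as below; the per-example fields and the keep probability are illustrative placeholders and not values reported in the paper.

```python
import random

def downsample(examples, keep_prob=0.1, seed=0):
    """Keep all rare-phenomenon examples; keep frequent ones with probability keep_prob."""
    rng = random.Random(seed)
    for ex in examples:
        # `connective` and `has_anaphora` are hypothetical per-example annotations.
        frequent = ex["connective"] in {"and", "but"} or ex["has_anaphora"]
        if not frequent or rng.random() < keep_prob:
            yield ex
```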

The resulting distributions of discourse types and most common connectives in the two parts of DISCOFUSE are provided in Appendix A.2. We will release both the original and the downsampled versions of DISCOFUSE.

Rater selection    Sports (%)    Wikipedia (%)
Yes                83.4          86.0
No majority        10.9           8.9
No                  5.7           5.1

Table 2: Rater evaluation of the understandability of the text after splitting. For each example, the majority of 5 raters was taken as the final rater selection.

4 Discofuse Quality Evaluation

To assess the quality of the generated fusion examples in DISCOFUSE, we randomly selected 500 examples from each of the development sets of the Wikipedia and the Sports parts. We then conducted a crowdsourcing experiment in which each example was rated by 5 proficient English speakers, limiting each rater to at most 6 items. Each rater was presented with the two independent sentences in the example and was asked to indicate whether the text is understandable. If the rater answered "yes", she was then asked to characterize the relation between the sentences and how she would fuse them. We next detail the results.

4.1 Example Text Clarity

Raters were asked whether they could understand the text after the example was split. Table 2 summarizes this evaluation. Most examples were marked as understandable by the raters ("yes"): 86% of Wikipedia examples and 83.4% of Sports examples. The rest either had no majority of rater votes or were marked as not understandable.

To shed light on the possible reasons for obscurity, we analyzed 70 random examples that were not marked as understandable by the majority of raters. In 29 examples (41%), the original text was unclear, and in 17 examples (24%), a broader context was needed. In the remaining 24 examples (34%), our rules generated sentences with grammatical or semantic errors. Examples of these cases are given in Table 3.

Table 3: Examples for three possible reasons for not understanding the text. In each example, (A) is the original text and (a) and (b) are the two sentences generated by our rules.

Additionally, we analyzed 100 random examples for grammatical errors, and found that our rules did not introduce any errors in 79 examples. For 15 examples, the errors neither modified the meaning of the text nor caused semantic errors. The detected grammatical errors include extra commas, missing determiners and bad pronoun replacements, and are demonstrated in Table 4.

Table 4: Examples of grammatical errors introduced by our rules. The red boldface text was incorrectly inserted and the blue italic text was incorrectly removed.

4.2 Fusion Evaluation

Next, we evaluated agreement on the fusion task for the 847 examples marked as understandable in Section 4.1. Because there are many ways in which sentences can be fused, one cannot expect raters to produce the original text t verbatim. Instead, we analyzed three central decisions and estimated whether people agree on those: (a) whether to merge the two sentences into a single one or keep them separate; (b) whether there are entities in the text that should be replaced with nominal or pronominal anaphors or cataphors; and (c) which discourse connective to add (if any).

For the last question, we presented raters with one connective from each of the four coarse-grained senses for discourse connectives defined by the PDTB (Prasad et al., 2008): comparison, expansion, contingency and temporal, as well as a no-connective option. If the original text in the example included a connective, we provided it as one of the options.

We observed a strong bias among raters towards refraining from performing any changes. E.g., while only 38% of the examples did not contain a connective in t, the raters chose not to add a connective in 69.2% of the cases. Similarly, in only 29.1% of the examples were the two sentences not merged into a single one, while the raters chose not to merge in 53.1% of the examples. Similar behavior was also observed by Malmi et al. (2018) and Rohde et al. (2016).

We further looked at the agreement between the rater majority and the 'gold' fusion decision. This analysis is shown in Table 5. Agreement on merging the input sentences into one is almost random (52%), since usually both options are valid. Consensus on whether to add an anaphor is higher, but not very high (63%), especially when the anaphor in t is a nominal rather than a pronoun. Finally, there is higher agreement on selecting the connective category (57%), for which the random baseline is 20%.

Table 5: Average agreement for each fusion decision between the gold annotation and rater majority on examples marked as understandable by the raters. The right column considers only examples in which both the ‘gold’ and rater majority agreed that a connective should be added.

As mentioned, raters tend to keep the sentences unchanged. But in cases where raters agree to add a connective, agreement figures increase substantially. Specifically, when it is clear that a connective is needed, there is also high agreement for picking the right one (76%), for deciding whether to add an anaphor (70%), and for deciding whether to merge the sentences or not (70%).

5 Supervised Neural Fusion Models

Using DISCOFUSE, we trained a Transformer seq2seq model (Vaswani et al., 2017) that reads the input sentence pair and generates a fused text. We report model performance on the test-set using automatic metrics as well as human evaluation. We also provide detailed analysis of the different phenomena captured by this model.

5.1 Experimental Settings

We tokenized all texts using byte-pair-encoding (Sennrich et al., 2015) and compared the following three Transformer models:

• DFSPORT: trained on the sports portion of DISCOFUSE after down-sampling.
• DFWIKI: trained on the Wikipedia portion of DISCOFUSE after down-sampling.
• DFS+W: trained on a 50%-50% mixture of the sports and Wikipedia portions of DISCOFUSE after down-sampling.

All models share the same network architecture, based on the best practices discussed by Popel and Bojar (2018) . We tuned parameters to select the best learning and dropout rates for each model with respect to the Exact Match objective (described in Section 5.2). Network architecture and hyper-parameters are in Appendix A.3. As a baseline, we also tested a model called COPY, which simply concatenates the two input sentences.

5.2 Automatic Evaluation Results

We evaluated model performance using two automatic metrics. The first is Exact Match (Exact), which measures how often the model generates exactly the same text as the gold fusion. The second is SARI (Xu et al., 2016), which computes the sets of added, removed, and kept n-grams in the model output, comparing the output both with the gold text and with the input text. It then computes the F1 scores for these three sets and averages them. We compute SARI on up to 4-grams, as in Xu et al. (2016). We refrained from using metrics like BLEU because in fusion there is a large overlap between the input sentences and their fused version, and such metrics do not capture well fine-grained differences of only a single word.

We note that our definition of SARI 2 differs slightly from the one given by Xu et al. (2016) in two aspects: (i) we define 0/0 = 1 when computing precision and recall, since otherwise SARI could be less than 1 even if the output matches the gold text exactly; (ii) instead of considering only the precision of deleted n-grams, we use F1 for all three sets, since otherwise SARI gives high scores to models that merely copy everything in the input, without even trying to infer what to delete.

We next turn to cross-domain evaluation. When a model trained on one domain is applied to the other domain, performance drops. This shows that the distribution of discourse phenomena differs between the domains, indicating that transfer learning is not trivial even with these large datasets. This is especially evident when applying DFWIKI to Sports, where Exact falls from 42% to 32% on the full dataset and from 50% to 40% on the down-sampled one. Interestingly, when learning on the mixed training set, performance on both domains is close to in-domain performance, showing that the model has the capacity to handle both domains.

Table 6: Exact and SARI scores of DFSPORT, DFWIKI, DFS+W and COPY, on the test sets of DISCOFUSE before (Full) and after down-sampling (Sampled).
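
For concreteness, the following is a minimal, single-reference sketch of the SARI variant described above: F1 over kept, added, and deleted n-grams (n = 1..4), with 0/0 treated as 1. It is illustrative only and not the released sari_hook implementation linked in the footnotes.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(sys_grams, ref_grams):
    tp = sum((sys_grams & ref_grams).values())
    p = tp / sum(sys_grams.values()) if sys_grams else 1.0   # 0/0 is treated as 1
    r = tp / sum(ref_grams.values()) if ref_grams else 1.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def sari(source, prediction, reference, max_n=4):
    src, out, ref = source.split(), prediction.split(), reference.split()
    scores = []
    for n in range(1, max_n + 1):
        s, o, r = ngrams(src, n), ngrams(out, n), ngrams(ref, n)
        keep = f1(s & o, s & r)      # n-grams kept from the input
        add = f1(o - s, r - s)       # n-grams newly added
        delete = f1(s - o, s - r)    # n-grams deleted from the input
        scores.append((keep + add + delete) / 3)
    return sum(scores) / len(scores)

# An exact match scores 1.0 under this variant.
print(sari(
    "Zeitler is a trained CPA . Zeitler has spent his entire career in politics .",
    "Zeitler is a trained CPA . However , he has spent his entire career in politics .",
    "Zeitler is a trained CPA . However , he has spent his entire career in politics ."))
```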

Finally, we take advantage of the provided annotation of the different discourse phenomena within each example in DISCOFUSE. We conducted a detailed analysis of in-domain model performance by discourse type, presented in Table 7. Results show that structural discourse types, such as apposition and relative clause, are easier to learn, with both high exact match and SARI scores. While the differences in SARI scores between phenomena are not large, exact match varies more. Anaphora and verb phrase coordination are more challenging, but still require only matching of the same noun (the named entity or the subject). On the other hand, discourse types that involve connective prediction, such as sentence coordination and discourse connective, require semantic understanding, and performance on them is significantly lower. In addition, when two discourse types are required for fusion, performance drops dramatically.

Table 7: In-domain evaluation with breakdown by discourse phenomena. Performance of DFSPORT and DFWIKI on the sports and Wikipedia development sets.
Table 8: Human detection (Detection) percentage for DFSPORT and DFWIKI on 1000 samples from each of the Sports and Wikipedia development sets. We report Detection for cases when model output differed from the gold, and cases when they were identical.

5.3 Human Evaluation Results

As our second experiment, we employed crowdsourcing to test how distinguishable the fusion model outputs are from the gold fused texts. Concretely, we presented raters with an independent sentence pair from DISCOFUSE and two fused versions: the gold version and one generated by a model. Raters were asked to detect the gold version. For each example, we took the majority of 5 raters as the final choice. This experiment mitigates the difficulties of automatic text generation evaluation, where many outputs are valid for a single input. We sampled 1000 random examples from each development set of the two domains and applied the in-domain model to both. The raters were presented only with examples where the model output was different from the gold fusion; we assumed 50% detection accuracy otherwise.

Table 8 depicts the results. In cases where the model output differed from the gold, raters were able to identify the human version in 65% of Sports examples and 61% of Wikipedia examples. Over the entire set, humans were able to identify the human version in 57% (Sports) and 55% (Wikipedia) of the cases. This shows that our Transformer model, applied over a dataset of millions of examples, is able to learn good fusions in general. Nevertheless, the models are still far from perfect: human accuracy is clearly better than random, and this improvement is statistically significant at a level of p < 10^-5 for Sports and p < 10^-3 for Wikipedia.
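
As a small sanity check of how the overall detection rate combines the two cases (with 50% assumed when the output equals the gold), the following calculation is illustrative; the fraction of differing outputs used below is inferred from the reported numbers rather than stated in the paper.

```python
def overall_detection(differ_frac, acc_on_differ):
    """Overall rater detection rate, assuming 50% accuracy when output == gold."""
    return differ_frac * acc_on_differ + (1 - differ_frac) * 0.5

# With roughly 47% of Sports outputs differing from the gold (inferred, not reported)
# and 65% rater accuracy on those, the overall rate is about 57%, as in Table 8.
print(round(overall_detection(0.47, 0.65), 2))
```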

5.4 Alignment-Based Analysis

We next present an analysis of the types of errors our models produce. To this end, we sampled 40K examples of DFSPORT and DFWIKI outputs on the Sports and Wikipedia development sets. We then automatically aligned predicted sequences to gold sequences and looked at the differences between aligned words. The trained models successfully learned to copy most of the input text, and thus errors due to alignment problems are rare.

We start by considering the semantic relation between the input sentences. Table 9 displays model accuracy in predicting the most common connectives in DISCOFUSE, as well as the top connectives predicted in this slot. We observe that when the model predicts a wrong connective, that connective is often reasonable, e.g., predicting "but" instead of "and" or "however". A second source of error is not adding a connective at all. It is also clear that some connectives, like "however", "although" and "for example", are harder to learn.

Table 9: Alignment-based connective prediction accuracy for the most common connectives. When a model did not add a connective, the token 〈other〉 is used.
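
The alignment step can be sketched as follows, assuming a simple difflib-based word alignment rather than the exact procedure used for the analysis; substitution counts of this kind underlie confusion statistics such as Table 9 and Figure 3.

```python
import difflib
from collections import Counter

def substitutions(gold, predicted):
    """Count word-level substitutions between a gold fusion and a model output."""
    g, p = gold.split(), predicted.split()
    subs = Counter()
    matcher = difflib.SequenceMatcher(a=g, b=p, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace" and (i2 - i1) == (j2 - j1) == 1:
            subs[(g[i1], p[j1])] += 1        # gold word -> predicted word
        elif op == "delete":
            for w in g[i1:i2]:
                subs[(w, "<other>")] += 1    # gold word the model dropped entirely
    return subs

print(substitutions(
    "Zeitler is a trained CPA . However he has spent his career in politics .",
    "Zeitler is a trained CPA . But he has spent his career in politics ."))
```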

We also analyzed the models' ability to correctly infer pronoun anaphors, including gender, possession and plurality. Figure 3 shows the pronoun confusion matrix for DFWIKI, 3 where rows refer to gold pronouns and columns to the generated pronoun at the same position. The clear diagonal shows that in most cases the model successfully outputs the correct pronoun. However, the 〈other〉 column indicates that occasionally the model tends not to replace the entity in the input with a pronoun anaphor. In addition, the model seems to struggle with possession and plural third person ("it", "its", "they", "their", "theirs").

Figure 3: DFWIKI outputs versus the gold pronouns. Rows refer to gold pronouns and columns refer to aligned model outputs at the gold pronoun position. Values in each row are normalized to 1. Column 〈other〉 refers to model outputs that are not pronouns.

3 Results for DFSPORT are very similar.

6 Transfer Learning Experiment

With the DISCOFUSE approach we can collect a large number of examples automatically. Still, these examples only reflect the manual rules used to identify discourse phenomena. We therefore wanted to see whether DISCOFUSE covers enough phenomena for a model trained on it to be helpful on fusion datasets generated by different approaches.

6.1 Experimental Settings

In this experiment, we looked at the recently released WEBSPLIT dataset 1.0 (Narayan et al., 2017). It consists of examples (t, {s_1, ..., s_n}), where t is a sentence that verbalizes the same set of RDF triples as the simpler sentences s_1, ..., s_n. We note that WEBSPLIT was originally developed for sentence splitting, from t to s_1, ..., s_n, but here we view its examples as instances of the reverse fusion task: from s_1, ..., s_n to t. We only considered examples where the split consists of exactly two simpler sentences (n = 2). This leaves us with 135K training, 8K validation, and 8K test samples.

We tokenized the data using byte-pair-encoding (Sennrich et al., 2015) and compared three models: (i) the COPY baseline that concatenates the two input sentences, (ii) a model trained on WEBSPLIT alone, and (iii) a model pre-trained on DFWIKI and fine-tuned on WEBSPLIT.

For the last two models, we use the CopyNet architecture (Gu et al., 2016), which is similar to state-of-the-art models for the splitting task on WEBSPLIT (Narayan et al., 2017; Botha et al., 2018). While the Transformer outperformed this model in our main experiments, here it overfit the small training set of WEBSPLIT. The training details are provided in Appendix A.3.

6.2 Results

Table 10: Fusion results on WEBSPLIT, measured by SARI and the F1 scores that compose it.

Table 10 shows the results of the experiment. As in Section 5, we measured model performance using SARI. Pre-training with DFWIKI improves the SARI score by 9% compared to using WEBSPLIT alone. In particular, the F1 of the 'kept' and 'added' n-grams is significantly higher, by 23% and 33% respectively. The 'added' n-grams also reflect correctly chosen discourse connectives, for which the large-scale examples in DISCOFUSE were likely helpful. We note that even with pre-training, the SARI 'add' score is only 10.4. This is probably due to the large amount of paraphrasing in WEBSPLIT, which makes it problematic for fusion evaluation (see also Section 2). For example:

Sentence 1: Bolt , a comic character AKA Larry Bolatinsky , was created by Paris Cullins and Ernie Colon .

Sentence 2: Paris Cullins is a United States national .

Gold: Larry Bolatinsky is the alternative name for the comic book character Bolt , which was created by Ernie Colon and the American Paris Cullins .

Correctly inferring the added terms (shown in red) requires paraphrasing knowledge that is outside the scope of DISCOFUSE.

7 Conclusions

We presented DISCOFUSE, a large-scale dataset for sentence fusion that was generated by applying a rule-based method. It contains millions of examples from two domains, annotated with multiple discourse phenomena.

We used DISCOFUSE to build supervised neural models for sentence fusion and conducted fine-grained analyses of the results. Currently, our models fuse only two sentences together. We would like to expand them to more input sentences in future work.

We also demonstrated DISCOFUSE's usefulness in a transfer learning setup on a different fusion test-set, hoping it would facilitate research on text fusion in data-scarce domains.

Table 11: Notation and definitions for the generation rules in Table 14. Z, R, S_Z, S_R are token lists and 1 ≤ i, j ≤ |Z| are indices. The full lists of connectives and POS tags are provided in Table 12.

A.1 Generation Rules

In this section we provide technical details of the generation rules used to create DISCOFUSE. For the sake of clarity, we provide a simplified version of the rules, which does not include edge cases and minor implementation details. The discourse connectives considered in the rules were selected from the Penn Discourse Treebank (PDTB) (Prasad et al., 2008) and are listed in Table 12.

Table 12: Connectives and POS tags used in our detection rules. A preceding comma is allowed for conjunctions in C_c. For the connectives "although" and "since" in C_f, we do not allow a following comma.

Given an input text, it is encoded with 3 lists: Z is the token list, Z_t is a list of POS tags, and Z_l is a list of dependency labels (see Table 11). In addition, all entities mentioned in the text are extracted and stored, such that for two token lists Z, R, the set M(Z, R) holds all the mention pairs of the same entity in the two lists. Each rule is designed for a specific discourse phenomenon and contains two parts. First, a set of conditions is applied to the input lists to detect whether the phenomenon occurs in the text. If a discourse pattern has been identified, a short sequence of simple operations is applied to the input, yielding a new sentence pair. Table 13 summarizes the operations in use, which allow insertion and deletion of tokens and splitting of the input text. Table 14 provides the technicalities of each rule, i.e. the detection conditions of the discourse structure and the sequence of operations for generating a new sentence pair from it. A detailed example of a two-rule execution process is given in Table 15.
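
As a rough illustration of this machinery, the sketch below pairs a matching condition over the annotated lists with a splitting function, abstracting away the concrete DELETE/PREPEND/REPLACE/SPLIT/TRIM operations of Table 13; the class and field names are illustrative and not taken from the released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

@dataclass
class AnnotatedText:
    tokens: List[str]                    # Z
    pos: List[str]                       # Z_t: POS tag of each token
    dep: List[str]                       # Z_l: label of each token's incoming edge
    mentions: Set[Tuple[int, int]]       # token spans taking part in coreference chains

@dataclass
class Rule:
    name: str
    matches: Callable[[AnnotatedText], bool]           # detection conditions
    split: Callable[[AnnotatedText], Tuple[str, str]]  # applies the list operations

def apply_rules(text: AnnotatedText, rules: List[Rule]) -> Optional[Tuple[str, Tuple[str, str]]]:
    """Return (rule name, (s1, s2)) for the first matching rule, or None."""
    for rule in rules:
        if rule.matches(text):
            return rule.name, rule.split(text)
    return None
```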

Table 13: Operations upon token lists, which are used for generation of sentence pairs (Table 14). The arguments X,Y, Z are token lists and the arguments i, n are integers.
Table 14: Generation rules for sentence pairs. The rules apply for token lists Z,A,B, where Z represents a single sentence and A,B either represent two consecutive sentences or two consecutive sentence parts. ∗For the rules of relative clause and apposition, r is the index of the leftmost child in the dependency sub-tree of a(i).
Table 15: Detailed two-rule execution example. We show in red parts of the input that are used for detection or modified during execution. The input token list Z is of a single sentence. First, the rule for inner connective is applied, splitting Z into two sentences A,B, without the connective “because”. Then, applying the anaphora rule, the pronoun “his” in B is replaced with the entity it refers to in A, to obtain a new sentence pair.

As mentioned, the rules are simplified for clarity. However, we note two special cases where morphological modifications are required to produce text without grammatical errors. First, in some cases of forward connective and cataphora, a verb tense change is required when splitting the input sentence. For instance, in the cataphora example in Table 1, we change the verb "stating" to the past tense "stated". Likewise, occasionally a "be" verb needs to be inserted when splitting a single sentence, as demonstrated in Figure 2. In our rules, we choose which "be" verb to insert based on the tense and perspective of the rest of the sentence.
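
A minimal sketch of the "be"-verb choice, keyed (as an assumption) on the main verb's POS tag and the subject's number; the actual rule may consider more context.

```python
def choose_be(verb_pos_tag, subject_is_plural):
    """Pick 'is'/'are'/'was'/'were' from the main verb's POS tag and subject number."""
    past = verb_pos_tag in {"VBD", "VBN"}
    if past:
        return "were" if subject_is_plural else "was"
    return "are" if subject_is_plural else "is"

print(choose_be("VBZ", False))  # -> "is", as inserted in the Figure 2 example
```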

Figure 4: Discourse type distribution of the sports and Wikipedia portions of DISCOFUSE after downsampling.
Table 16: Most common connectives in DISCOFUSE after down-sampling. Percentages are with respect to the entire dataset, including examples without a connective.

A.2 Discofuse Data Distribution

Figure 4 and Table 16 show the distributions of discourse types and most common connectives in the two parts of DISCOFUSE. Analyzing the dataset reveals significant differences in discourse phenomena between the two types of documents (Figure 4). E.g., coordination is very common in Wikipedia while anaphora is dominant in Sports. Likewise, the distribution of discourse connectives is quite different (Table 16).

A.3 Neural Models Parameters

The models DFSPORT, DFWIKI and DFS+W share the same Transformer network architecture, originally proposed by Vaswani et al. (2017). During training, we split the samples into buckets by their text length. The network architecture and hyperparameters are listed in Table 17.

Table 11 (notation and definitions):

Token list Z: a list of tokens {z^(1), ..., z^(|Z|)}.
Z_t: the list of POS tags of Z, where Z_t^(i) is the tag of z^(i) for every i = 1, ..., |Z|.
Z_l: the list of dependency labels of Z, where Z_l^(i) is the label of the incoming edge of z^(i) for every i = 1, ..., |Z|.
M(Z, R): the set of mention pairs in Z and R, i.e. {(S_Z, S_R) | S_R ≺ R and S_Z ≺ Z are mentions of the same entity}.
S ≺ Z: S is a span in Z, such that ∃i ∈ {1, ..., |Z| − |S| + 1} : ∀j = 0, ..., |S| − 1 : s_{1+j} = z_{i+j}.
S is a prefix of Z: ∀i = 1, ..., |S| : s_i = z_i.
i U j: there is an edge from the i-th token to the j-th token in the dependency tree of Z.
C_B: a set of backward connectives.
C_s: a set of intra-sentence connectives, which are either forward connectives or conjunctions.
C_f: a set of forward connectives.
C_c: a set of coordinating conjunctions.
P_r: a set of relative pronouns.
V: a set of POS tags for verbal phrases.

Table 12 (connectives and POS tags used in the detection rules):

C_B: "accordingly", "additionally", "afterward", "alternatively", "although ,", "and", "as a result ,", "because of that", "because of this", "besides ,", "but", "by comparison ,", "by contrast ,", "by doing this ,", "by then", "consequently", "conversely", "else,", "finally ,", "for example", "for instance", "further ,", "furthermore", "hence ,", "however", "in contrast ,", "in fact ,", "in other words", "in particular ,", "in short ,", "in sum ,", "in the end ,", "in turn ,", "indeed ,", "instead ,", "lest", "likewise ,", "meantime ,", "in the meantime ,", "meanwhile ,", "moreover", "nevertheless", "next ,", "nonetheless", "on the contrary ,", "on the other hand", "or ,", "otherwise ,", "overall ,", "plus ,", "rather ,", "regardless ,", "similarly ,", "simultaneously", "specifically ,", "still ,", "then ,", "thereafter ,", "thereby ,", "therefore", "though ,", "thus ,", "ultimately ,", "whereas", "yet ,", "now ,", "second ,", "third ,", "basically ,", "this ,", "eventually ,", "obviously ,", "again ,", "fortunately ,", "luckily ,", "meaning ,", "interestingly ,", "anyway ,", "clearly ,"
C_s: "because", ", because", "hence", ", while", "whereas", ", although", "although", "and although", "unless", "now that", ", now that", "so that", ", so that", "meaning", ", meaning"
C_f: although, since, in addition to, aside from
C_c: and, but, or, nor, yet, so, for
P_r: who, which, whose, whom
V: VB, VBD, VBG, VBN, VBP, VBZ

Table 13 (operations on token lists):

DELETE(X, i, n): delete a sequence of n tokens from X, starting from index i.
PREPEND(X, Y): attach the list Y at the beginning of X.
REPLACE(X, Y, Z): replace every occurrence of Y in X with Z, in a non-overlapping manner.
SPLIT(X, i): split X into two token lists V = {x_1, ..., x_{i−1}}, W = {x_i, ..., x_{|X|}}.
TRIM(X): delete all tokens in X after the first punctuation token, e.g. period, comma, etc.

For the transfer learning experiment (Section 6), we compared a model trained on WEBSPLIT alone and a model pretrained on DFWIKI and finetuned on WEBSPLIT. The first model was trained for 200,000 steps on WEBSPLIT, whereas the second model was pretrained for 1 million steps on DFWIKI and then finetuned for 100,000 steps on WEBSPLIT. Again, the samples were split into buckets by their text length, with batch sizes between 25 and 125 for each bucket. The final test scores were computed with the parameters that maximize the validation SARI score during training. The network architecture and hyperparameters were shared between the models and not optimized during training. They are listed in Table 18.

Table 18: Parameters and hyperparameters of the CopyNet models used for transfer learning.

https://cloud.google.com/natural-language/

Our SARI implementation is available at: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py

Table 17: Parameters and hyperparameters of the models DFSPORT, DFWIKI, DFS+W.