Cross-Document Language Modeling

Abstract

We introduce a new pretraining approach for language models that are geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that our CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection, while using a significantly reduced number of training parameters relative to prior works.

1 Introduction

The majority of NLP research addresses a single text, typically at the sentence or document level. This has been the case both for infrastructure language analysis tasks, such as syntactic, semantic and discourse analysis, and for applied tasks, such as question answering (and its reading comprehension variant (Xu et al., 2019)), information extraction, and sentiment analysis, where the system output is typically extracted from a single document. Yet, there are important applications which are concerned with aggregated information spread across multiple texts, e.g., multi-document summarization (Fabbri et al., 2019a), cross-document coreference resolution (Cybulska and Vossen, 2014a), and multi-hop question answering (Yang et al., 2018). While providing state-of-the-art results for cross-document tasks, current pretraining methods, developed for a single text, are not geared to fully address the needs of cross-document tasks. As an alternative, we propose the Cross-Document Language Model (CD-LM), a new language model (LM) that is trained in a cross-document manner. We show that this significantly outperforms previous approaches, resulting in state-of-the-art results for cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection.

Tasks that consider multiple documents typically require mapping or linking between pieces of information across documents. Such input documents usually contain overlapping information, e.g., Doc 1 and Doc 2 in Fig. 1. Desirably, LMs should be able to align overlapping elements across such related documents. For example, one would expect a competent model to correctly align the events around "name" and "nominates" in Doc 1 and Doc 2, effectively recognizing their relation even though they appear in separate documents. Yet, existing LM pretraining methods do not expose the model to such cross-document information. Here, we propose a scheme that integrates cross-document knowledge already in pretraining, thus allowing the LM to learn to encode relevant cross-document relationships implicitly.

Figure 1: Various document examples from the ECB+ dataset. Doc 1: "... President Obama will name Dr. Regina Benjamin as U.S. Surgeon General in a Rose Garden announcement late this morning. ..." Doc 2: "... Obama nominates new surgeon general: MacArthur "genius grant" fellow Regina Benjamin. ..." In Doc 1 and Doc 2, underlined words represent coreferring events and the same color represents a coreference cluster: the entity clusters are ("Dr. Regina Benjamin", "MacArthur "genius grant" fellow Regina Benjamin") and ("President Obama", "Obama"), and the single event cluster is ("name", "nominates"). These examples are adapted from Cattan et al. (2020).

To allow our CD-LM to address large contexts across multiple documents, we leverage the recent appealing architecture of the Longformer model (Beltagy et al., 2020), designed to address long inputs. Specifically, we leverage its global attention mechanism, originally utilized only during task-specific fine-tuning, and extend its use already to pretraining, enabling the model to consider cross-document, as well as long-range within-document, information. While using this mechanism, we introduce a cross-document masking approach. This approach considers as input multiple documents containing related, partly overlapping, information. The model is then challenged to unmask the masked tokens while attending to information both in the same document and in the related documents. This way, the model is encouraged to "peek" at other documents and map cross-document information, in order to yield better unmasking. Our pretraining procedure yields a generic cross-document language model, which may be leveraged for various cross-document downstream tasks that need to map information across related texts. As mentioned above, our experiments assess the utility of our CD-LM for a range of cross-document tasks, resulting in significant improvements and suggesting its appeal for future work in the cross-document setting.

2 Background

Transformer-based language models (LMs) (Devlin et al., 2019; Yang et al., 2019) have led to significant performance gains in various natural language understanding tasks, mainly within-document tasks. They use multiple self-attention layers to learn to produce high-quality token representations, and were shown to incorporate contextual knowledge by assigning each token a representation that is an attentive function of the entire input context. Such models are trained using the Masked Language Modeling (MLM) objective (known as the pretraining phase): given a piece of text, the model uses the context words surrounding a masked token to predict it, thereby maximizing the likelihood of the input words.

These models have significantly advanced the state of the art in various NLP tasks, mostly using post-pretraining finetuning approaches, e.g., question answering (Yang et al., 2019), coreference resolution (Joshi et al., 2019), and the sentence-level tasks of the GLUE benchmark (Wang et al., 2018). Importantly, pretrained LMs eliminate the need for many heavily-engineered, hand-crafted task-specific architectures for downstream tasks. Additionally, Clark et al. (2019) show that the attention heads of BERT (Devlin et al., 2019) encode a substantial amount of linguistic knowledge, such as the ability to represent within-document coreference relations. This enables better performance on downstream tasks with limited amounts of labeled training data. Despite such models' success in within-document tasks, memory and time constraints limit their input size, so they can only support a rather small context. Thus, they cannot be readily applied to cross-document tasks, where the input size is large.

Recently, several models were suggested to handle these issues and bypass the length constraint, employing techniques for dealing with the computational and memory obstacles (Tay et al., 2020). Examples of such architectures include the Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020), and Linformer (Wang et al., 2020), which were introduced to extend the range of context that can be used, in both the pretraining and fine-tuning stages. Specifically, for the Longformer model, which we utilize in this work, a localized sliding-window-based attention, termed local attention, was proposed to reduce computation and extend previous LMs to support longer sequences. This enables the processing of long contexts by removing the restriction on input length. In addition, the authors introduced the global attention mode, which allows the LM to build representations based on the full input sequence for prediction, and is used during fine-tuning only. Both the local attention and the global attention modes rely on the standard self-attention score (Vaswani et al., 2017), given by:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where the learned linear projection matrices $Q$, $K$, $V$ are partitioned into two distinct sets, $\Xi_l = \{Q_l, K_l, V_l\}$ and $\Xi_g = \{Q_g, K_g, V_g\}$, for the local and the global attention modes, respectively. During pretraining, the Longformer assigns the local attention mode to all tokens to optimize the MLM objective. Before task-specific finetuning, the attention mode is predetermined for each input token, assigning global attention to a few targeted tokens (e.g., special tokens) to avoid computational inefficiency. We hypothesize that the global attention mechanism is useful for learning meaningful representations for modeling cross-document relationships. We propose augmenting the pretraining phase to exploit the global attention mode, rather than using it only for finetuning, as further described in the next section.
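To make the two attention modes concrete, the following minimal PyTorch sketch computes scaled dot-product attention with the two separate projection sets, $\Xi_l$ and $\Xi_g$, and routes each token through one set according to a per-token flag. It is an illustration only: the "local" branch is shown as unrestricted self-attention for brevity, whereas the actual Longformer confines it to a sliding window for linear complexity, and the class and variable names are our own.

```python
import math
import torch
import torch.nn as nn

class TwoModeSelfAttention(nn.Module):
    """Illustrative single-head attention with separate local/global projection sets."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Local projections (Xi_l) and global projections (Xi_g).
        self.q_l, self.k_l, self.v_l = (nn.Linear(hidden_size, hidden_size) for _ in range(3))
        self.q_g, self.k_g, self.v_g = (nn.Linear(hidden_size, hidden_size) for _ in range(3))
        self.d_k = hidden_size

    def forward(self, x: torch.Tensor, is_global: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, hidden); is_global: (seq_len,) boolean flag per token.
        def attend(q_proj, k_proj, v_proj):
            q, k, v = q_proj(x), k_proj(x), v_proj(x)
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
            return torch.softmax(scores, dim=-1) @ v

        local_out = attend(self.q_l, self.k_l, self.v_l)
        global_out = attend(self.q_g, self.k_g, self.v_g)
        # Tokens flagged as global use the global projection set; all others use the local set.
        return torch.where(is_global.unsqueeze(-1), global_out, local_out)
```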

3 Cross-Document Language Model

Documents that describe the same event, e.g., different news articles that discuss the same story, usually contain overlapping information. Accordingly, many cross-document tasks may benefit from an LM infrastructure that encodes information about alignment and mapping across texts. For example, for the cross-document coreference resolution task, consider the underlined predicate examples in Fig. 1. One would expect a model to correctly align the events around "name" and "nominates", effectively recognizing their coreference relation even though they appear in separate documents.

Our approach to cross-document language modeling is based on training a Transformer-based LM on sets (clusters) of documents, all describing the same event. Such document clusters are readily available in a variety of existing datasets for cross-document benchmarks, such as summarization (e.g., MultiNews (Fabbri et al., 2019b)), cross-document coreference resolution (e.g., ECB+ (Cybulska and Vossen, 2014b)), and cross-document alignment benchmarks (Zhou et al., 2020). Training the LM over a set of related documents provides the potential to learn cross-text mapping and alignment capabilities, as part of the contextualization process. Indeed, we show that this cross-document pretraining strategy directs the model to utilize information across documents for predicting masked tokens, and helps in multiple cross-document downstream tasks.

To support contextualizing information across multiple documents, we need to use efficient Transformer models that scale linearly with input length.

Thus, we base our model on the Longformer (Beltagy et al., 2020). As described in Sec. 2, this is an efficient Transformer model for long sequences that uses a combination of local attention (self-attention restricted to a local sliding window) and global attention (a small set of pre-specified input locations with direct global attention access).

Cross-Document Masking In pretraining, we concatenate the document set using new special document separator tokens, <doc-s> and <\doc-s>, which mark document boundaries. We apply a masking procedure similar to BERT's: for each training example, we randomly choose a sample of tokens (15%) to be masked; for each of these, our proposed model tries to predict it while considering the full document set, by assigning it global attention. This allows the Longformer's global attention parameters to contextualize information both across documents and over long within-document dependencies. The non-masked tokens use local attention, as usual.
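As a rough illustration of this masking scheme, the sketch below builds a single pretraining example from a cluster of already-tokenized documents: documents are concatenated with the separator tokens, 15% of the non-special tokens are masked, and exactly those positions are flagged for global attention. The label convention, the mask token string, and the helper name are assumptions for illustration, not the actual preprocessing code.

```python
import random

MASK, DOC_START, DOC_END = "<mask>", "<doc-s>", "<\\doc-s>"  # token strings assumed for illustration

def build_cd_mlm_example(documents, mask_prob=0.15, max_len=4096):
    """Concatenate a cluster of related documents, mask ~15% of the tokens,
    and mark the masked positions for global attention (all others stay local)."""
    tokens = []
    for doc in documents:  # each doc is a list of token strings
        tokens += [DOC_START] + doc + [DOC_END]
    tokens = tokens[:max_len]

    labels = [-100] * len(tokens)          # -100 = ignored by the MLM loss (common convention)
    global_attention = [0] * len(tokens)   # 1 = global attention, 0 = local attention

    for i, tok in enumerate(tokens):
        if tok in (DOC_START, DOC_END):
            continue                       # never mask the document separators
        if random.random() < mask_prob:
            labels[i] = tok                # the model must recover the original token
            tokens[i] = MASK
            global_attention[i] = 1        # masked tokens attend globally to the whole sequence

    return tokens, labels, global_attention
```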

An illustration of the full cross-document masking procedure is depicted in Fig. 2, where the masked token associated with "nominates" (colored in orange) globally attends to the whole sequence, and the non-masked token corresponding to the word "new" (colored in blue) attends only to its local context. With regard to the example in Fig. 1, this masking approach implicitly compels the model to learn to correctly predict the word "nominates" by looking at the other document, ideally at the phrase "name", thus enforcing the alignment between the events.

Figure 2: CD-LM pretraining: The input consists of concatenated documents, separated by the new special document separator tokens. The token colored in yellow receives global attention, and tokens colored in blue receive local attention. The goal is to predict the masked token given its output representation $x_i$, based on the global context, i.e., the entire set of documents in the sample.

The loss function induced by the above masking method requires an MLM objective that accounts for the entire sequence, namely, the concatenated documents. We mimic the bidirectional LM conditioning from BERT (Devlin et al., 2019), but instead of using the same model weights for all tokens, we assign the global attention weights $\Xi_g$ to the masked tokens, so the model can predict the target token in a multi-document context. The unmasked tokens use the local attention weights $\Xi_l$. We dub this method cross-document masked language modeling (CD-MLM). The resulting model includes the following new components: newly pretrained special document separator tokens, and pretrained sets of both global and local attention weights, which together form the cross-document language model (CD-LM). The document separator tokens can be useful in downstream tasks for marking document boundaries, while the global attention weights provide better encoding of cross-document self-attentive information.

Finetuning During finetuning on downstream cross-document tasks, we utilize our model by concatenating the tokens of the relevant input documents using the document separator tokens, along with the classification token (referred to as CLS) at the beginning of the input sequence. Moreover, for token-level tasks such as coreference resolution, we assign global attention to several explicit spans of text, as described in Section 5.2. Using global attention on at least one token ensures that the data distribution during finetuning is similar to the distribution during pretraining, which avoids a pretraining-finetuning discrepancy. Note that this method is much simpler than existing task-specific cross-document models.

4 CD-LM Implementation

In this section, we provide the experimental details used for pretraining our CD-LM model, and describe the ablated baselines we compare against.

Corpus data We use the Multi-News dataset (Fabbri et al., 2019a) as the source of related documents for pretraining. This large-scale dataset contains 44,972 training document-summary clusters, originally intended for multi-document summarization. The number of related source documents (that describe the same event) per summary varies from 2 to 10, as detailed in Appendix A.1. We discard the summaries and consider each cluster of at least 3 related documents for our cross-document pretraining scheme. We compiled the training corpus by concatenating related documents that were sampled randomly from each cluster, until reaching the input sequence length limit of 4,096 tokens per sample. The average input contains 2.5k tokens, and the 90th percentile of input lengths is 3.8k tokens.
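For concreteness, the following sketch mirrors the corpus construction described above: for each cluster with at least 3 source documents, it concatenates randomly ordered related documents until the 4,096-token budget is reached. The cluster format, the tokenizer interface, and the decision to keep only samples with at least two documents are assumptions made for illustration.

```python
import random

def build_pretraining_samples(clusters, tokenizer, max_tokens=4096, min_docs=3):
    """Sketch of building pretraining samples from Multi-News-style clusters
    (each cluster is a list of raw document strings)."""
    samples = []
    for cluster in clusters:
        if len(cluster) < min_docs:
            continue                                   # keep clusters of at least 3 documents
        docs = random.sample(cluster, len(cluster))    # random order, no replacement
        sample, length = [], 0
        for doc in docs:
            doc_len = len(tokenizer.tokenize(doc))
            if length + doc_len > max_tokens:
                break                                  # stop once the 4,096-token limit is reached
            sample.append(doc)
            length += doc_len
        if len(sample) >= 2:                           # keep only genuinely cross-document samples
            samples.append(sample)
    return samples
```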

Training and hyperparameters We pretrain the model according to our CD-MLM strategy described in Section 3. To that end, we employ the Longformer-base model (Beltagy et al., 2020) and continue its pretraining for an additional 25k steps. We use the same hyperparameters and follow the exact setting as in Beltagy et al. (2020): input sequences of length 4,096, an effective batch size of 64 (using gradient accumulation and a batch size of 8), a maximum learning rate of 3e-5, and a linear warmup of 500 steps, followed by a power-3 polynomial decay. To speed up training and reduce memory consumption, we used the mixed-precision (16-bit) training mode. The rest of the hyperparameters are the same as for RoBERTa (Liu et al., 2019).
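A small sketch of the learning-rate schedule described above (linear warmup over 500 steps to a peak of 3e-5, then a power-3 polynomial decay); the decay horizon of 25k steps and the zero end learning rate are assumptions for illustration, not quoted from the paper.

```python
def lr_at_step(step: int, max_lr: float = 3e-5, warmup: int = 500,
               total_steps: int = 25_000, power: float = 3.0) -> float:
    """Linear warmup followed by a polynomial (power-3) decay of the learning rate."""
    if step < warmup:
        return max_lr * step / warmup                       # linear warmup phase
    progress = (step - warmup) / max(1, total_steps - warmup)
    return max_lr * (1.0 - min(progress, 1.0)) ** power     # polynomial decay to zero
```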

Baseline Language Models In addition to our proposed CD-LM model and the state-of-the-art models detailed in the next sections, we considered the following LM variations in our evaluations, as ablations of our model:

‱ The plain LONGFORMER-base model, without further pretraining.

‱ The RAND CD-LM model, based on the Longformer-base model, with the additional CD-MLM pretraining but using random, unrelated documents from various clusters. The amount of data and the pretraining hyperparameters are the same as those of CD-LM. This baseline allows assessing whether pretraining over related documents is beneficial.

When finetuning each of the above models, we restrict each input segment (document/abstract/passage) to a maximal length of 2,047 tokens, so that the entire input, including the CLS token, contains no more than 4,096 tokens.

5 Evaluations And Results

This section presents the intrinsic and extrinsic experiments conducted to evaluate our CD-LM. For the intrinsic evaluation we measure the perplexity of the models, while for the extrinsic evaluations we consider cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection.

5.1 Cross-Document Perplexity

First, we conduct a cross-document perplexity experiment, in a task-independent manner, to assess the contribution of the pretraining process. We used the MultiNews validation and test sets, each containing 5,622 document-summary clusters, to construct the evaluation corpora. We then followed the same protocol as in the pretraining phase: a random 15% of the input tokens are masked and assigned global attention, and the challenge is to predict each masked token given all documents in the input sequence. The perplexity is then measured by exponentiating the loss.
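The perplexity computation itself is straightforward; the sketch below assumes a HuggingFace-style masked-LM model whose forward pass returns the cross-entropy loss averaged over the masked positions, and evaluation batches prepared with the same masking and global-attention assignment as in pretraining. Averaging per-batch losses before exponentiating is a common approximation of corpus-level perplexity.

```python
import math
import torch

@torch.no_grad()
def cross_document_perplexity(model, eval_batches) -> float:
    """Perplexity as the exponentiation of the average masked-LM loss."""
    total_loss, n_batches = 0.0, 0
    for batch in eval_batches:
        out = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    global_attention_mask=batch["global_attention_mask"],  # 1 on masked positions
                    labels=batch["labels"])                                # -100 on unmasked positions
        total_loss += out.loss.item()
        n_batches += 1
    return math.exp(total_loss / n_batches)
```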

The results are depicted in Table 1. The CD-LM model outperforms the baselines. In particular, the advantage over RAND CD-LM, which was pretrained over an equivalent amount of (unrelated) cross-document data, confirms that cross-document masking, in pretraining over related documents, indeed helps cross-document masked token prediction across such documents. The CD-LM is encouraged to look at the full sequence when predicting a masked token; therefore, it exploits related information in other documents as well, and not just the local context. The RAND CD-LM is inferior since, in its pretraining phase, it was not exposed to such overlapping useful information. The plain LONGFORMER model, which is reported just as a reference point, is expected to have difficulty predicting cross-document tokens: in addition to the reason above, the document separators we use are not part of its embedding set and are randomly initialized for this task. Moreover, recall that the CD-LM and RAND CD-LM models have two pretrained sets of linear projection weights, one for local attention and one for global attention. The plain LONGFORMER model uses the same weights for the two modes, and is therefore expected to fail at long-range mask prediction.

Table 1: Cross-document perplexity evaluation on the validation and test sets of MultiNews. Lower is better.

5.2 Cross-Document Coreference Resolution

Cross-document (CD) coreference resolution deals with identifying and clustering together textual mentions across multiple documents that refer to the same concept (see examples in Doc 1 and Doc 2 in Fig. 1 ). The considered mentions can be both entity mentions, usually noun phrases, e.g., "Obama" and "President Obama", and event mentions, which are mostly verbs or nominalizations that appear in the text, e.g., "name" and "nominates".

Benchmark. To assess our CD-LM on CD coreference resolution, we evaluate it over the ECB+ corpus (Cybulska and Vossen, 2014a), the most commonly used dataset for the task. ECB+ consists of within- and cross-document coreference annotations for entities and events. The ECB+ dataset statistics are described in Appendix A.2. Following previous work, for comparison, we conduct our experiments on gold event and entity mentions. For evaluating coreference clustering performance we follow the standard coreference resolution evaluation metrics: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), their average CoNLL F1, and the more recent LEA metric (Moosavi and Strube, 2016).

Algorithm. Recent approaches for CD coreference resolution train a pairwise scorer to learn the likelihood that two mentions are coreferring. At inference time, agglomerative clustering based on the pairwise scores is applied to form the coreference clusters. Next, we detail our proposed modifications to the pairwise scorer. The current state-of-the-art models (Zeng et al., 2020; Yu et al., 2020) train the pairwise scorer using only the local contexts (the containing sentences) of the candidate mentions. They concatenate the two input sentences and feed them into a Transformer-based LM; part of the resulting token representations are then aggregated into a single feature vector, which is passed to an additional MLP-based scorer to produce the coreference probability estimate. To accommodate our proposed CD-LM model, we modify this modeling, as illustrated in Fig. 3. We include the entire documents containing the two candidate mentions, instead of just their containing sentences. We concatenate the relevant documents using the special document separator tokens, then encode them using our CD-LM, along with the CLS token at the beginning of this sequence, as suggested in Section 3. For within-document coreference candidate examples, we use just the single containing document with the document separators. Inspired by Yu et al. (2020), we use candidate mention marking: we wrap the mentions with special tokens <m> and <\m> in order to direct the model to specifically attend to the candidate representations. Additionally, we assign global attention to the CLS token, <m>, <\m>, and the mention tokens themselves, according to the finetuning strategy proposed in Section 3. Our final pairwise-mention representation is formed as in Zeng et al. (2020) and Yu et al. (2020): we concatenate the cross-document contextualized representation vectors for the t-th sample:

Figure 3: CD coreference resolution pairwise mention representation, using the CD-LM. $m_t^i$, $m_t^j$ and $s_t$ are the cross-document contextualized representation vectors of mentions i and j and of the CLS token, respectively. $m_t^i \circ m_t^j$ is the element-wise product of $m_t^i$ and $m_t^j$. $m_t(i, j)$ is the final pairwise-mention representation. The tokens colored in yellow receive global attention, and tokens colored in blue receive local attention.

$$m_t(i, j) = \left[\, s_t,\; m_t^i,\; m_t^j,\; m_t^i \circ m_t^j \,\right],$$

where $[\cdot]$ denotes concatenation, $\circ$ denotes the element-wise product, $s_t$ is the cross-document contextualized representation vector of the CLS token, and each of $m_t^i$ and $m_t^j$ is the sum of the token representations of the corresponding candidate mention (i and j). We then train the pairwise scorer according to the suggested finetuning scheme. At test time, similar to most recent works, we apply agglomerative clustering to merge the most similar cluster pairs. The hyperparameters and further details are elaborated in Appendix B.1.
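A minimal sketch of this pairwise scorer, combining the feature vector above with the MLP described in Appendix B.1 (one hidden layer of size 1024 followed by Tanh). The class name, the interface, and the sigmoid output are illustrative assumptions; the hidden size defaults to the Longformer-base dimension.

```python
import torch
import torch.nn as nn

class PairwiseMentionScorer(nn.Module):
    """Scores a candidate mention pair from CD-LM contextualized token states."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_size, 1024),  # [s_t, m_i, m_j, m_i * m_j]
            nn.Tanh(),
            nn.Linear(1024, 1),
        )

    def forward(self, token_states, cls_index, span_i, span_j):
        # token_states: (seq_len, hidden) contextualized outputs of the encoder.
        s_t = token_states[cls_index]
        m_i = token_states[span_i[0]:span_i[1]].sum(dim=0)   # sum over mention i's tokens
        m_j = token_states[span_j[0]:span_j[1]].sum(dim=0)   # sum over mention j's tokens
        features = torch.cat([s_t, m_i, m_j, m_i * m_j], dim=-1)
        return torch.sigmoid(self.mlp(features))             # coreference probability
```

At inference, these pairwise probabilities would feed an agglomerative clustering step, as described above.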

Baselines. We consider recent, state-of-the-art baselines that reported results over the ECB+ benchmark. The following baselines were used for both event and entity coreference resolution:

Same Head-Lemma is a simple baseline that merges mentions sharing the same syntactic head lemma into the same coreference cluster. Barhom et al. (2019) is a model trained jointly to solve both event and entity coreference as a single task. Cattan et al. (2020) is a model trained in an end-to-end manner (jointly learning mention detection and coreference), employing the RoBERTa-large model to encode each document separately and training a pairwise scorer on top of these representations.

The following baselines were used only for event coreference resolution. They all integrate external linguistic information as additional features to the model: Meged et al. (2020) is an extension of Barhom et al. (2019), leveraging additional side information acquired from a paraphrase resource (Shwartz et al., 2017). Zeng et al. (2020) is an end-to-end model that encodes the concatenation of the two sentences containing the two mentions with the BERT-large model. Similarly to our proposed algorithm, they feed an MLP-based pairwise scorer with the CLS contextualized token representation and an attentive function of the contextualized representation vectors of the candidate mentions. Yu et al. (2020) is an end-to-end model similar to Zeng et al. (2020), but uses RoBERTa-large instead and does not consider the CLS contextualized token representation for the pairwise classification; it is a non-attentive version of Zeng et al.'s mechanism for paraphrase detection.

Results. The results on event and entity CD coreference resolution are depicted in Tables 2 and 3. All results are statistically significant using bootstrap and permutation tests with p < 0.001 (Dror et al., 2018). Our CD-LM outperforms the sentence-based models (Zeng et al., 2020; Yu et al., 2020) on event coreference (+1.2 CoNLL F1) and largely surpasses state-of-the-art results on entity coreference (+9.8 CoNLL F1), even though these models utilize external linguistic argument information and include many more parameters (large models vs. our base model). Finally, the RAND CD-LM is inferior to the plain LONGFORMER model, despite the fact that it already has pretrained document separator embeddings. This emphasizes the need to pretrain on related documents rather than random ones, which enables the better alignment and paraphrasing capabilities required for coreference detection.

5.3 Paper Citation Recommendation & Plagiarism Detection

We evaluate our CD-LM over the citation recommendation and plagiarism detection benchmarks of Zhou et al. (2020), a recently released suite of benchmarks for cross-document tasks. These tasks share the same objective, categorizing whether a particular relationship holds between two input documents, and therefore correspond to binary classification problems. Citation recommendation deals with detecting whether one reference document should cite the other, while plagiarism detection infers whether one document plagiarizes the other. To compare with recent state-of-the-art models, we utilized the setup and data selection from Zhou et al. (2020), which provides three datasets for citation recommendation and one for plagiarism detection.

Benchmarks. For citation recommendation, we used the ACL Anthology Network Corpus (AAN; Radev et al., 2013), the Semantic Scholar Open Corpus (OC; Bhagavatula et al., 2018) , and the Semantic Scholar Open Research Corpus (S2ORC; Lo et al., 2020) . For plagiarism detection, we used the Plagiarism Detection Challenge (PAN; Potthast et al., 2013) .

AAN is composed of computational linguistics papers published in the ACL Anthology from 2001 to 2014, OC is composed of computer science and neuroscience papers, S2ORC is composed of open-access papers across broad domains of science, and PAN is composed of web documents that contain several kinds of plagiarism phenomena. For further dataset preprocessing details and statistics, see Appendix A.3.

Algorithm. For our models we added the CLS token at the beginning of the input sequence and concatenated the pair of texts together, according to the finetuning setup discussed in Section 3. The hyperparameters are further detailed in Appendix B.2.
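A sketch of this input construction under the length constraints of Section 4 (each segment truncated to at most 2,047 tokens). The separator strings, the use of "<s>" as the CLS token, and the tokenizer interface are assumptions made for illustration.

```python
def build_pair_input(tokenizer, doc_a: str, doc_b: str,
                     max_segment_len: int = 2047, max_len: int = 4096):
    """Build a CLS-prefixed, separator-wrapped input for a document pair."""
    seg_a = tokenizer.tokenize(doc_a)[:max_segment_len]
    seg_b = tokenizer.tokenize(doc_b)[:max_segment_len]
    tokens = (["<s>"]                                    # CLS-style classification token
              + ["<doc-s>"] + seg_a + ["<\\doc-s>"]
              + ["<doc-s>"] + seg_b + ["<\\doc-s>"])[:max_len]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # Global attention on the CLS token only; all other tokens use local attention.
    global_attention_mask = [1] + [0] * (len(tokens) - 1)
    return input_ids, global_attention_mask
```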

Baselines. We consider the reported results of the following recent baselines: SMASH (Jiang et al., 2019) is an attentive hierarchical RNN model, used for long-document tasks.

SMITH (Yang et al., 2020) is a BERT-based hierarchical model, similar to the previously suggested hierarchical attention networks (HANs; Yang et al., 2016).

BERT-HAN+CDA (Zhou et al., 2020) is a cross-document attention (CDA) mechanism built on top of BERT-based Hierarchical Attention Networks (HANs). For more details, see Section 6. We report the results of their finetuned model over the datasets (Zhou et al., 2020, Section 5.2).

Note that both SMASH and SMITH reported results only over the AAN benchmark. In addition, they used a slightly different version of the AAN dataset that includes the full documents, unlike the dataset used by BERT-HAN+CDA, which we utilize as well, that considers only the documents' abstracts.

Table 2: Results on event cross-document coreference on ECB+ test set.
Table 3: Results on entity cross-document coreference on ECB+ test set.

Results. The results on citation recommendation over the AAN dataset are depicted in Table 4. We observe that even though several baselines reported results using the full documents, our model outperforms them while using the partial version of the dataset, as in Zhou et al. (2020). Moreover, unlike our model, CDA is task-specific, since it trains new cross-document weights for each task, yet it is still inferior to our model. The results on the rest of the benchmarks are reported in Table 5; as can be seen, our CD-LM consistently outperforms the prior baseline as well as the LONGFORMER and RAND CD-LM models.

Table 4: Accuracy and F1 scores of various baselines on the AAN test set. * indicates using a different version of the dataset.
Table 5: Accuracy and F1 scores of various models on OC, S2ORC and PAN test sets.

6 Related Work

Recently, several works proposed equipping LMs with cross-document processing capabilities, mostly by harnessing sequence-to-sequence architectures. Lewis et al. (2020) suggested pretraining an LM by reconstructing a document given, and conditioned on, related documents. They showed that this technique forces the model to learn how to paraphrase the reconstructed document, leading to significant performance gains on multi-lingual document summarization and retrieval. This work considers a basic retrieval model that does not consider cross-document interactions at all. Zhang et al. (2020) proposed an end-to-end architecture for improving abstractive summarization: unlike standard LMs, in their pretraining, several sentences (and not just tokens) are removed from documents, and the model's task is to recover them. A similar approach was also suggested for single-document summarization (Zhang et al., 2019). The advantage of such self-supervision approaches is that they were shown to produce high-quality summaries without any human annotation, often the bottleneck in purely supervised summarization systems. While these approaches advanced the state of the art in sequence-to-sequence tasks, the encoders they employ support encoding only a single document at a time. In our work, we allow inputs comprised of multiple documents in each sample, to support cross-document contextualization. Nevertheless, the main drawback of such sequence-to-sequence architectures is that they require a massive amount of data and training time in order to obtain a plausibly trained model, whereas we use a relatively small corpus.

The closest work to our proposed model is the recent Cross-Document Attention model (CDA) (Zhou et al., 2020). They introduced a cross-document component that enables document-to-document and sentence-to-document alignments. This model is set on top of existing hierarchical document encoding models (Sun et al., 2018; Liu and Lapata, 2019; Guo et al., 2019), which do not consider information across documents by themselves. CDA influences the document and sentence representations by those of other documents, without considering word-to-word information across documents (which might require an additional quadratic number of parameters). This makes such modeling unsuitable for token-level alignment tasks, such as cross-document coreference resolution. Moreover, unlike our proposed model, which employs generic cross-document pretraining, the CDA mechanism requires learning the cross-document parameters from scratch for each downstream task. Further, they support cross-document attention between two documents only, while our method does not restrict the number of input documents, as long as they fit the input length of the Longformer.

7 Conclusion

We presented a novel pretraining strategy and technique for cross-document language modeling, providing better encoding for cross-document downstream tasks. Our primary contributions include cross-document masking over clusters of related documents, driving the model to encode cross-document relationships. This was achieved by extending the use of the global attention mechanism of the Longformer model (Beltagy et al., 2020) to pretraining, attending to long-range information across and within documents. Our experiments show that leveraging our cross-document language model yields new state-of-the-art results over several cross-document benchmarks, including the fundamental task of cross-document entity and event coreference, while employing substantially smaller models. We suggest the attractiveness of our CD-LM for neural encoding in cross-document tasks, and propose future research to extend this framework to support sequence-to-sequence cross-document tasks, such as multi-document abstractive summarization.

A Dataset Statistics And Details

In this section, we provide more details about the pretraining corpus and the benchmark datasets we used in our experiments.

A.1 MultiNews Corpus

In Table 6 we list the number of related source articles per cluster. This follows the original dataset construction. Note that the dataset and the statistics are taken from Fabbri et al. (2019a).

Table 6: MultiNews training set statistics.

A.2 ECB+ Dataset

In Table 7 we list the statistics of the training, development, and test splits with regard to topics, documents, mentions, and coreference clusters. We follow the data split used by previous works (Cybulska and Vossen, 2015; Kenyon-Dean et al., 2018; Barhom et al., 2019): training topics: 1, 3, 4, 6-11, 13-17, 19-20, 22, 24-33; validation topics: 2, 5, 12, 18, 21, 23, 34, 35; the remaining topics are used for testing.

Table 7: ECB+ dataset statistics. The slash numbers for Mentions and Clusters represent event/entity statistics.

A.3 Paper Citation Recommendation & Plagiarism Detection Datasets

In Table 8 we list the statistics of the training, development, and test splits for each benchmark, and in Table 9 we list the document and example counts for each benchmark. The statistics are taken from Zhou et al. (2020).

Table 8: Document-to-Document benchmark statistics: details regarding the training, validation, and test splits.

Dataset   Train     Validation   Test
AAN       106,592   13,324       13,324
OC        240,000   30,000       30,000
S2ORC     152,000   19,000       19,000
PAN       17,968    2,908        2,906

Table 9: Document-to-Document benchmark statistics: the reported numbers are the counts of document pairs and of unique documents.

The preprocessing of the datasets performed by Zhou et al. (2020) includes the following steps:

For AAN, only pairs of documents that include abstracts are considered, and only their abstracts are used. For OC, only one citation per paper is considered, and the dataset was downsampled significantly. For S2ORC, pairs formed from citing sections and the corresponding abstract of the cited paper are included, and the dataset was downsampled significantly. For PAN, pairs of relevant segments were extracted from the text. For all datasets, negative pairs are sampled randomly. Finally, standard preprocessing was performed, filtering out characters that are not digits, letters, punctuation, or white space.

B Hyperparameters Setting And Training Details

In this section, we elaborate on the hyperparameter choices for our experiments.

B.1 Cross-Document Coreference Resolution

We adopt the same protocol as suggested in Cattan et al. (2020): our training set is composed of positive instances consisting of all pairs of mentions that belong to the same coreference cluster, while the negative examples are randomly sampled. We fine-tune our models for 10 epochs, with an effective batch size of 128. The feature vector is passed through an MLP pairwise scorer that is composed of one hidden layer of size 1024, followed by a Tanh activation.
B.2 Paper Citation Recommendation & Plagiarism Detection

We fine-tune our models and BERT-HAN+CDA for 8 epochs, using a batch size of 32, and use the same hyperparameter setting as in Zhou et al. (2020, Section 5.2). We used the mixed-precision training mode to reduce time and memory consumption.

For more details, see the masking procedure of BERT (Devlin et al., 2019).

HuggingFace implementation, https://github.com/huggingface/transformers. The pretraining took 8 days, using eight 48GB RTX8000 GPUs.

They utilized semantic role labeling to add features related to the arguments of each event mention.

Following the most recent work of Zhou et al. (2020), we evaluate our model on their version of the dataset. We also quote the results of the SMASH and SMITH methods, even though they used a somewhat different version of this dataset; hence their results are not fully comparable to the results of our model and those of BERT-HAN+CDA.

We used the implementation from https://github.com/ariecattan/cross_encoder.

We used the script https://github.com/XuhuiZhou/CDA/blob/master/BERT-HAN/run_ex_sent.sh.

References

  • Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In The first interna- tional conference on language resources and evalua- tion workshop on linguistics coreference, volume 1, pages 563-566. Citeseer.
  • Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, and Ido Dagan. 2019. Re- visiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Com- putational Linguistics, pages 4179-4189, Florence, Italy. Association for Computational Linguistics.
  • Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Re- ichart. 2018. The hitchhiker's guide to testing statis- tical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 1383-1392, Melbourne, Aus- tralia. Association for Computational Linguistics.
  • Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019a. Multi-news: A large-scale multi-document summarization dataset and abstrac- tive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, pages 1074-1084, Florence, Italy. Association for Computational Linguistics.
  • Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019b. Multi-news: A large-scale multi-document summarization dataset and abstrac- tive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, pages 1074-1084, Florence, Italy. Association for Computational Linguistics.
  • Mandy Guo, Yinfei Yang, Keith Stevens, Daniel Cer, Heming Ge, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Hierarchical document encoder for parallel corpus mining. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Re- search Papers), pages 64-72, Florence, Italy. Asso- ciation for Computational Linguistics.
  • Jyun-Yu Jiang, Mingyang Zhang, Cheng Li, Michael Bendersky, Nadav Golbandi, and Marc Najork. 2019. Semantic text matching for long-form docu- ments. In The World Wide Web Conference (WWW).
  • Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for coreference reso- lution: Baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Nat- ural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5803-5808, Hong Kong, China. Association for Computational Linguistics.
  • Kian Kenyon-Dean, Jackie Chi Kit Cheung, and Doina Precup. 2018. Resolving event coreference with supervised representation learning and clustering- oriented regularization. In Proceedings of the Seventh Joint Conference on Lexical and Com- putational Semantics, pages 1-10, New Orleans, Louisiana. Association for Computational Linguis- tics.
  • Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Ar- men Aghajanyan, Sida Wang, and Luke Zettlemoyer. 2020. Pre-training via paraphrasing. Advances in Neural Information Processing Systems, 33.
  • Yang Liu and Mirella Lapata. 2019. Hierarchical trans- formers for multi-document summarization. In Pro- ceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 5070- 5081, Florence, Italy. Association for Computa- tional Linguistics.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  • Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kin- ney, and Daniel Weld. 2020. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Compu- tational Linguistics, pages 4969-4983, Online. As- sociation for Computational Linguistics.
  • Xiaoqiang Luo. 2005. On coreference resolution per- formance metrics. In Proceedings of the confer- ence on Human Language Technology and Empiri- cal Methods in Natural Language Processing, pages 25-32. Association for Computational Linguistics.
  • Yehudit Meged, Avi Caciularu, Vered Shwartz, and Ido Dagan. 2020. Paraphrasing vs coreferring: Two sides of the same coin. In Findings of the Associ- ation for Computational Linguistics: EMNLP 2020, pages 4897-4907, Online. Association for Computa- tional Linguistics.
  • Nafise Sadat Moosavi and Michael Strube. 2016. Which coreference evaluation metric do you trust? a proposal for a link-based entity aware metric. In Proceedings of the 54th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 632-642, Berlin, Germany. As- sociation for Computational Linguistics.
  • Martin Potthast, Matthias Hagen, Tim Gollub, Martin Tippmann, Johannes Kiesel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. 2013. Overview of the 5th international competition on plagiarism de- tection. In Conference on Multilingual and Multi- modal Information Access Evaluation (CLEF).
  • Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. The acl an- thology network corpus. Language Resources and Evaluation, 47(4):919-944.
  • Vered Shwartz, Gabriel Stanovsky, and Ido Dagan. 2017. Acquiring predicate paraphrases from news tweets. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 155-160, Vancouver, Canada. Associa- tion for Computational Linguistics.
  • Qingying Sun, Zhongqing Wang, Qiaoming Zhu, and Guodong Zhou. 2018. Stance detection with hierar- chical attention network. In Proceedings of the 27th International Conference on Computational Linguis- tics, pages 2399-2409, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long range arena: A benchmark for efficient trans- formers. arXiv preprint arXiv:2011.04006.
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ɓukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information pro- cessing systems (NIPS).
  • Chandra Bhagavatula, Sergey Feldman, Russell Power, and Waleed Ammar. 2018. Content-based citation recommendation. In Proceedings of the 2018 Con- ference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers), pages 238-251, New Orleans, Louisiana. Association for Computational Linguistics.
  • Marc Vilain, John Burger, John Aberdeen, Dennis Con- nolly, and Lynette Hirschman. 1995. A model- theoretic coreference scoring scheme. In Proceed- ings of the 6th conference on Message understand- ing, pages 45-52. Association for Computational Linguistics.
  • Alex Wang, Amanpreet Singh, Julian Michael, Fe- lix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis plat- form for natural language understanding. In Pro- ceedings of the 2018 EMNLP Workshop Black- boxNLP: Analyzing and Interpreting Neural Net- works for NLP, pages 353-355, Brussels, Belgium. Association for Computational Linguistics.
  • Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self- attention with linear complexity. arXiv preprint arXiv:2006.04768.
  • Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2324-2335, Minneapolis, Minnesota. Association for Computational Linguis- tics.
  • Liu Yang, Mingyang Zhang, Cheng Li, Michael Ben- dersky, and Marc Najork. 2020. Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching. In Pro- ceedings of the ACM International Conference on Information & Knowledge Management (CIKM).
  • Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- bonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural in- formation processing systems (NeurIPS).
  • Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answer- ing. In Proceedings of the 2018 Conference on Em- pirical Methods in Natural Language Processing, pages 2369-2380, Brussels, Belgium. Association for Computational Linguistics.
  • Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, pages 1480-1489, San Diego, California. Associa- tion for Computational Linguistics.
  • Xiaodong Yu, Wenpeng Yin, and Dan Roth. 2020. Paired representation learning for event and entity coreference. arXiv preprint arXiv:2010.12808.
  • Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago On- tanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33.
  • Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar Joshi, and Ido Dagan. 2020. Streamlining cross- document coreference resolution: Evaluation and modeling. ArXiv, abs/2009.11032.
  • Yutao Zeng, Xiaolong Jin, Saiping Guan, Jiafeng Guo, and Xueqi Cheng. 2020. Event coreference reso- lution with their paraphrases and argument-aware embeddings. In Proceedings of the 28th Inter- national Conference on Computational Linguistics, pages 3084-3094, Barcelona, Spain (Online). Inter- national Committee on Computational Linguistics.
  • Jingqing Zhang, Y. Zhao, Mohammad Saleh, and Pe- ter J. Liu. 2020. PEGASUS: Pre-training with ex- tracted gap-sentences for abstractive summarization. In International Conference on Machine Learning (ICML).
  • Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HI- BERT: Document level pre-training of hierarchical bidirectional transformers for document summariza- tion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059-5069, Florence, Italy. Association for Computational Linguistics.
  ‱ Xuhui Zhou, Nikolaos Pappas, and Noah A. Smith. 2020. Multilevel text alignment with cross-document attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5012-5025, Online. Association for Computational Linguistics.
  • Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? an analysis of BERT's attention. In Pro- ceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276-286, Florence, Italy. Association for Computational Linguistics.
  • Agata Cybulska and Piek Vossen. 2014a. Using a sledgehammer to crack a nut? lexical diversity and event coreference resolution. In LREC, pages 4545- 4552.
  • Agata Cybulska and Piek Vossen. 2014b. Using a sledgehammer to crack a nut? lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4545- 4552, Reykjavik, Iceland. European Language Re- sources Association (ELRA).
  • Agata Cybulska and Piek Vossen. 2015. "bag of events" approach to event coreference resolution. supervised classification of event templates. Int. J. Comput. Lin- guistics Appl.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Associ- ation for Computational Linguistics.
    Return to section: 2 Background