The multidocument summarization problem deals with extracting the main information and ideas from a set of related documents. A solution to this problem is an extraction strategy that finds a small subset of sentences covering the most important information in the whole document set. Although a large number of machine-learning-based methods have shown great promise, the lack of high-quality training data poses an inherent obstacle for them. Furthermore, because of the proliferation of low-quality documents on the Internet, existing summarization strategies that rely merely on statistical features perform poorly. In this article, we propose a new two-phase multidocument summarization strategy using content attention-based subtopic detection. First, inspired by the distance dynamics-based community detection mechanism, we extract subtopics from the document set by examining both their content attention and their underlying semantic relations. Instead of complicated neural attention mechanisms, we propose a simple iteration-based content attention method to complete the subtopic detection task. Second, we formulate summarization across the different subtopics as a combinatorial optimization problem of minimizing sentence distance and maximizing topic diversity. We prove the submodularity of this optimization problem, which allows us to propose a new multidocument summarization algorithm based on a greedy mechanism. Finally, we experimentally validate our new algorithms on the BBC news summary and wikiHow data. The results show that our new algorithms outperform the state-of-the-art methods.
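A minimal sketch of greedy selection for a submodular summarization objective of the kind described above (coverage of the document set plus a reward for touching many subtopics). The objective form and the names `coverage_weight` / `diversity_weight` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def greedy_summarize(sim, topics, budget, coverage_weight=1.0, diversity_weight=0.5):
    """sim: (n, n) pairwise sentence-similarity matrix.
    topics: list of integer subtopic labels, one per sentence.
    budget: maximum number of sentences in the summary."""
    n = sim.shape[0]
    selected = []

    def objective(subset):
        if not subset:
            return 0.0
        # Coverage: how well the chosen subset represents every sentence.
        cover = sim[:, subset].max(axis=1).sum()
        # Diversity: reward summaries that span many distinct subtopics.
        diversity = len({topics[i] for i in subset})
        return coverage_weight * cover + diversity_weight * diversity

    while len(selected) < budget:
        current = objective(selected)
        best_gain, best_i = 0.0, None
        for i in range(n):
            if i in selected:
                continue
            gain = objective(selected + [i]) - current
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:  # no candidate gives a positive marginal gain
            break
        selected.append(best_i)
    return selected
```

Because the objective is monotone submodular, this greedy procedure carries the usual (1 - 1/e) approximation guarantee relative to the optimal subset of the same size.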
Nowadays, it is necessary that users have access to information in a concise form without losing any critical information. Document summarization is an automatic process of generating a short form of a document. In itemset-based document summarization, the weights of all terms are considered the same. In this paper, a new approach is proposed for multidocument summarization based on weighted patterns and term association measures. In the present study, the weights of the terms are not equal across the context and are computed based on weighted frequent itemset mining. Indeed, the proposed method enriches frequent itemset mining by weighting the terms in the corpus. In addition, the relationships among the terms in the corpus are taken into account using term association measures. Statistical features such as sentence length and sentence position have also been modified and combined to generate a summary using a greedy method. Based on the results obtained with the ROUGE toolkit on the DUC 2002 and DUC 2004 datasets, the proposed approach significantly outperforms the state-of-the-art approaches.
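A hypothetical sketch of sentence scoring with unequal term weights plus length and position features, in the spirit of the weighted-term approach above. The inverse-sentence-frequency weighting and the exact feature combination are assumptions; the paper's actual weighted frequent-itemset miner and association measures are more involved.

```python
from collections import Counter
import math

def score_sentences(docs):
    """docs: list of documents, each a list of tokenized sentences (lists of terms)."""
    sentences = [(d_idx, s_idx, s) for d_idx, doc in enumerate(docs)
                 for s_idx, s in enumerate(doc)]
    # Term weight: smoothed inverse sentence frequency (illustrative choice).
    df = Counter(t for _, _, s in sentences for t in set(s))
    n = len(sentences)
    weight = {t: math.log(1 + n / df[t]) for t in df}

    scores = []
    for d_idx, s_idx, s in sentences:
        if not s:
            continue
        term_score = sum(weight[t] for t in set(s))
        position_bonus = 1.0 / (1 + s_idx)          # earlier sentences score higher
        length_penalty = 1.0 / math.sqrt(len(s))    # normalize by sentence length
        scores.append((term_score * length_penalty + position_bonus, d_idx, s_idx))
    return sorted(scores, reverse=True)
```

A greedy summarizer would then walk this ranked list and add sentences until the length budget is reached, optionally skipping sentences too similar to ones already selected.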
Multidocument aspect-based summarization (AspSumm) aims to generate focused summaries based on target aspects from a cluster of relevant documents. Generating such summaries can better satisfy readers' specific points of interest, as readers may have different concerns about the same articles. However, previous methods usually generate aspect-based summaries from the given aspects without using the relationships among aspects to assist in summarization. In this work, we propose a two-stage general framework for multidocument AspSumm. The model first discovers the latent relationships among aspects and then uses relevant sentences selected by aspect discovery to generate abstractive summaries. We exploit latent dependencies among aspects using a tag mask training (TMT) strategy, which increases the interpretability of the model. In addition to improvements in summarization over strong aspect-based baselines, experimental results show that our proposed model can accurately discover multidomain aspects on the WikiAsp dataset.
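An illustrative two-stage outline: (1) select sentences relevant to a target aspect, (2) hand them to any abstractive summarizer. The keyword-overlap relevance score and the `abstractive_summarize` stub are assumptions chosen for brevity; the paper's aspect discovery relies on learned latent aspect relationships via TMT, which is not reproduced here.

```python
def select_aspect_sentences(sentences, aspect_keywords, top_k=10):
    """sentences: list of token lists; aspect_keywords: set of terms describing the aspect."""
    scored = []
    for sent in sentences:
        overlap = len(aspect_keywords & set(sent))  # crude relevance proxy
        if overlap:
            scored.append((overlap, sent))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [s for _, s in scored[:top_k]]

def abstractive_summarize(selected_sentences):
    # Placeholder: plug in any sequence-to-sequence summarizer here.
    return " ".join(" ".join(s) for s in selected_sentences)
```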
Ontology-based information extraction and summarization of news content retrieves the news based on the user query. The user query can be of any context about the news content, so that users need not be aw...
Contextual text feature extraction and classification play a vital role in the multi-document summarization process. Natural language processing (NLP) is one of the essential text mining tools used to preprocess and analyze large document sets. Most conventional single-document feature extraction measures are independent of the contextual relationships among the different contextual feature sets used in the document categorization process. Moreover, conventional word embedding models such as TF-IDF, ITF-IDF and GloVe are difficult to integrate into the multi-domain feature extraction and classification process due to a high misclassification rate and large candidate sets. To address these concerns, an advanced multi-document summarization framework was developed and tested on a number of large training datasets. In this work, a hybrid multi-domain GloVe word embedding model together with multi-document clustering and classification models was implemented to improve the multi-document summarization process for multi-domain document sets. Experimental results show that the proposed multi-document summarization approach has improved efficiency in terms of accuracy, precision, recall, F-score and run time (ms) compared with the existing models.
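A minimal sketch of embedding-based document clustering of the general kind mentioned above, assuming a pre-loaded `embeddings` dict mapping words to vectors (e.g., GloVe). The mean-pooling step and the plain KMeans choice are illustrative simplifications, not the hybrid multi-domain model the abstract describes.

```python
import numpy as np
from sklearn.cluster import KMeans

def doc_vector(tokens, embeddings, dim=100):
    # Average the embeddings of all in-vocabulary tokens.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cluster_documents(tokenized_docs, embeddings, n_clusters=5, dim=100):
    X = np.vstack([doc_vector(d, embeddings, dim) for d in tokenized_docs])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
```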
This paper focuses on automatic summarization of multiple engineering papers. A summarization approach based on documents' macro- and microstructure is proposed. The macrostructure consists of a list of ranked topics from the engineering papers. Topics are discovered by extracting frequently appearing word sequences and grouping them into equivalence classes. Hence, the macrostructure symbolically represents the topical links across different papers. Meanwhile, the microstructure is defined as the rhetorical structure within a single paper. Identification of the microstructure is approached as a classification problem: each sentence in a paper is automatically labeled with one of the predefined rhetorical categories. Unlike existing summarization methods that first separate documents into nonoverlapping clusters and then summarize each cluster individually, our approach summarizes multiple documents according to the characteristics suggested at the macro- and microstructure levels. The experimental study showed that our proposed approach outperformed peer systems in terms of Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores and readers' responsiveness. In an independent manual categorization task using the summaries generated by our approach and the peer systems, our approach also performed better in terms of precision and recall. [DOI: 10.1115/1.3563048]
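A hedged sketch of the macrostructure step described above: extract frequently appearing word sequences (here, bigrams and trigrams) across papers and rank them as candidate topics. The frequency threshold, n-gram lengths, and the omission of the equivalence-class grouping are simplifying assumptions.

```python
from collections import Counter

def ranked_topics(tokenized_docs, min_count=3, ngram_sizes=(2, 3)):
    """tokenized_docs: list of documents, each a flat list of tokens."""
    counts = Counter()
    for doc in tokenized_docs:
        for n in ngram_sizes:
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1
    # Keep only sequences that recur across the collection and rank by frequency.
    topics = [(seq, c) for seq, c in counts.items() if c >= min_count]
    return sorted(topics, key=lambda x: x[1], reverse=True)
```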
ISBN (print): 9783319572611
Query-focused multi-document summarization is the process of automatic, query-biased text compression of a document set. Lately, graph-based and ranking-based methods have intensively attracted researchers in the extractive document summarization domain. Uniform sentence connectedness or non-uniform document-sentence connectedness, such as sentence similarity weighted by document importance, have been the main features used by work to date. In contrast, in this paper we present a novel five-layered heterogeneous graph model. It emphasizes not only sentence- and document-level relations but also the influence of lower-level relations (e.g., similarity between parts of sentences) and higher-level relations (i.e., query-to-sentence similarity). Based on this model, we developed an iterative sentence ranking algorithm built on the well-known PageRank algorithm. Moreover, for text similarity calculations we used universal paraphrase embeddings, which outperform various strong baselines on many text similarity tasks and domains. Experiments are conducted on the DUC 2005 data sets, and the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluation results demonstrate the advantages of the proposed approach.
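A simplified single-layer sketch of PageRank-style sentence ranking over a similarity graph, shown only to illustrate the iterative ranking idea; the paper's actual model is a five-layered heterogeneous graph with paraphrase embeddings, which this sketch does not reproduce.

```python
import numpy as np

def pagerank_scores(sim, damping=0.85, iters=100, tol=1e-6):
    """sim: (n, n) nonnegative sentence-similarity matrix."""
    n = sim.shape[0]
    # Row-normalize to obtain a transition matrix over sentences.
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    M = sim / row_sums
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r_new = (1 - damping) / n + damping * (M.T @ r)
        if np.abs(r_new - r).sum() < tol:  # converged
            break
        r = r_new
    return r
```

The top-ranked sentences (subject to redundancy and length constraints) would then form the extractive summary.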
ISBN (print): 9781450350228
We present a novel unsupervised query-focused multi-document summarization approach. To this end, we generate a summary by extracting a subset of sentences using the Cross-Entropy (CE) method. The proposed approach is generic and requires no domain knowledge. Using an evaluation over the DUC 2005-2007 datasets against several other state-of-the-art baseline methods, we demonstrate that our approach is both effective and efficient.
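A hedged sketch of the Cross-Entropy method for selecting a sentence subset: sample candidate subsets from per-sentence Bernoulli inclusion probabilities, keep the top ("elite") samples under a quality function, and move the probabilities toward them. The `quality` scorer is user-supplied (and should penalize subsets exceeding the length budget); the smoothing constant and elite fraction are illustrative defaults, not the paper's tuned values.

```python
import numpy as np

def ce_select(n_sentences, quality, budget, iters=50, samples=200,
              elite_frac=0.1, smooth=0.7, seed=0):
    """quality: callable taking an array of selected sentence indices, returning a score."""
    rng = np.random.default_rng(seed)
    p = np.full(n_sentences, budget / n_sentences)  # initial inclusion probabilities
    for _ in range(iters):
        pop = rng.random((samples, n_sentences)) < p            # sampled subsets (boolean)
        scores = np.array([quality(np.flatnonzero(s)) for s in pop])
        elite = pop[np.argsort(scores)[-int(samples * elite_frac):]]
        p = smooth * elite.mean(axis=0) + (1 - smooth) * p      # smoothed CE update
    return np.flatnonzero(p > 0.5)                              # final summary indices
```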
Document understanding techniques such as document clustering and multidocument summarization have been receiving much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to grasp the meaning of the documents because there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most current summarization methods are based solely on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model that simultaneously clusters and summarizes documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method yields (1) a better document clustering method with more meaningful interpretation and (2) an effective document summarization method guided by document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
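For contrast, here is a sketch of the "straightforward solution" the abstract mentions: cluster documents first, then summarize each cluster by picking the sentences closest to its centroid. The TF-IDF representation and KMeans choice are assumptions; the joint language model the paper actually proposes is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_then_summarize(docs, sentences_per_doc, n_clusters=3, per_cluster=3):
    """docs: list of raw document strings; sentences_per_doc: list of sentence-string lists."""
    vec = TfidfVectorizer(stop_words="english")
    D = vec.fit_transform(docs)                                   # document-term matrix
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(D)
    summaries = {}
    for c in range(n_clusters):
        sents = [s for doc_sents, label in zip(sentences_per_doc, labels)
                 if label == c for s in doc_sents]
        if not sents:
            continue
        S = vec.transform(sents)                                  # sentence-term matrix
        centroid = np.asarray(S.mean(axis=0))
        sims = cosine_similarity(S, centroid).ravel()
        top = np.argsort(sims)[::-1][:per_cluster]
        summaries[c] = [sents[i] for i in top]
    return summaries
```

The abstract's point is that this pipeline gives the summarizer no feedback channel back into clustering, which the proposed joint model addresses.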
This paper suggests an approach for creating a summary for a set of documents by revealing the topics and extracting informative sentences. The topics are determined through clustering of sentences, and the informative sentences are extracted using a ranking algorithm. The summarization result is shown to depend on the clustering method, the ranking algorithm, and the similarity measure. Experiments on the open benchmark datasets DUC2001 and DUC2002 showed that the suggested clustering methods and ranking algorithm achieve better results than the well-known k-means method and the PageRank and HITS ranking algorithms.
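A tiny sketch of the final extraction step described above: given sentence cluster labels (the topics) and per-sentence rank scores, take the highest-ranked sentence from each cluster. The clustering and ranking themselves (for example, the similarity-graph PageRank sketch shown earlier in this listing) are assumed to be computed separately.

```python
def top_sentence_per_cluster(sentences, labels, scores):
    """sentences, labels, scores: parallel lists over all sentences."""
    best = {}
    for sent, label, score in zip(sentences, labels, scores):
        if label not in best or score > best[label][0]:
            best[label] = (score, sent)
    return [sent for _, sent in best.values()]
```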