Author:
Biemann, Chris
475 Brannan St Ste. 330 San Francisco CA 94107 United States
This paper examines the influence of features based on clusters of co-occurrences for supervised Word Sense Disambiguation and Lexical Substitution. Co-occurrence cluster features are derived from clustering the local ...
This paper introduces multi-level association graphs (MLAGs), a new graph-based framework for information retrieval (IR). The goal of the framework is twofold. First, it is meant to be a meta-model of IR, i.e. it subsumes various IR models under one common representation. Second, it allows different forms of search, such as feedback, associative retrieval and browsing, to be modeled at the same time. It is shown how the new integrated model gives insights and stimulates new ideas for IR algorithms. One of these new ideas is presented and evaluated, yielding promising experimental results.
In this paper, we introduce DegExt, a graph-based, language-independent keyphrase extractor, which extends the keyword extraction method described in Litvak and Last (Graph-based keyword extraction for single-document summarization. In: Proceedings of the workshop on multi-source multilingual information extraction and summarization, pp 17-24, 2008). We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx (Turney in Inf Retr 2: 303-336, 2000) and TextRank (Mihalcea and Tarau in TextRank: bringing order into texts. In: Proceedings of the conference on empirical methods in natural language processing. Barcelona, Spain, 2004). We evaluated DegExt on collections of benchmark summaries in two different languages: English and Hebrew. Our experiments on the English corpus show that DegExt significantly outperforms TextRank and GenEx in terms of precision and area under curve for summaries of 15 keyphrases or more, at the expense of a mostly non-significant decrease in recall and F-measure, when the extracted phrases are matched against the gold standard collection. Because DegExt tends to extract longer phrases than GenEx and TextRank, it outperforms both in terms of recall and F-measure when the single extracted words are considered. On the Hebrew corpus, DegExt performs the same as TextRank regardless of the number of keyphrases. An additional experiment shows that DegExt applied to the TextRank representation graphs outperforms the other systems in the text classification task. For documents in both languages, DegExt surpasses both GenEx and TextRank in terms of implementation simplicity and computational complexity.
Contrary to the traditional Bag-of-Words approach, we consider the Graph-of-Words (GoW) model, in which each document is represented by a graph that encodes relationships between the different terms. Based on this form...
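To make the Graph-of-Words idea concrete, here is a minimal sketch of one common way to build such a graph: terms become nodes, and two terms are linked when they occur close together. The function name `graph_of_words`, the sliding-window rule, and the `window` parameter are illustrative assumptions; the truncated abstract does not say which relationships the authors actually encode.

```python
from collections import defaultdict


def graph_of_words(tokens, window=3):
    """Build an undirected co-occurrence graph from a token list.

    Assumption: two terms are related when they fall inside the same
    sliding window of `window` tokens. Edge weights count how often
    that happens.
    """
    edges = defaultdict(int)
    for i, left in enumerate(tokens):
        # Look ahead at the next (window - 1) tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                edges[tuple(sorted((left, right)))] += 1
    return dict(edges)


doc = "graph models encode relationships between terms in a document".split()
gow = graph_of_words(doc, window=3)
```

Unlike a bag of words, this representation keeps `graph` and `models` linked but leaves `graph` and `relationships` unconnected, because the two never share a window.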
In many information retrieval and selection tasks it is valuable to score how much a text is about a certain entity and to compute how much the text discusses the entity with respect to a certain viewpoint. In this pa...
Recent works show that the graph structure of sentences, generated from dependency parsers, has potential for improving event detection. However, they often only leverage the edges (dependencies) between words, and di...
Text classification is an important topic in natural language processing. In recent years, both graph kernel methods and deep learning methods have been widely employed in text classification tasks. However, previous graph kernel algorithms focused too heavily on the graph structure itself, such as the shortest-path subgraph, while paying limited attention to the information in the text itself. Previous deep learning methods have often consumed substantial computational resources. Therefore, we propose a new graph kernel algorithm to address these disadvantages. First, we extract the textual information of the document using a term weighting scheme. Second, we collect the structural information on the document graph. Third, a graph kernel is used as the similarity measure for text classification. We compared our algorithm against eight baseline methods on three experimental datasets, including traditional deep learning methods and graph-based classification methods, and tested it on multiple indicators. The experimental results demonstrate that our algorithm outperforms the baseline methods in terms of accuracy. It also achieves a minimum reduction of 69% in memory consumption and a minimum decrease of 23% in runtime. Moreover, as we decrease the percentage of training data, our algorithm continues to achieve superior results compared to the deep learning methods. These results show that our algorithm can improve the efficiency of text classification and reduce the use of computing resources while maintaining high accuracy.
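The three steps named in this abstract (term weighting, structural information, kernel similarity) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the weighting scheme or the kernel, so the code below assumes smoothed TF-IDF for step one, adjacent-term edges for step two, and a simple weighted edge-overlap kernel for step three. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Step 1 (assumed scheme): smoothed TF-IDF weight per term, per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return weights


def document_edges(doc):
    """Step 2 (assumed structure): undirected edges between adjacent terms."""
    return {tuple(sorted(pair)) for pair in zip(doc, doc[1:]) if pair[0] != pair[1]}


def edge_kernel(doc_a, doc_b, w_a, w_b):
    """Step 3 (assumed kernel): sum over shared edges of the endpoint weights."""
    shared = document_edges(doc_a) & document_edges(doc_b)
    return sum(w_a[u] * w_a[v] * w_b[u] * w_b[v] for u, v in shared)


def normalized_kernel(i, j, docs, weights):
    """Cosine-style normalization so that a document scores 1.0 against itself."""
    k_ij = edge_kernel(docs[i], docs[j], weights[i], weights[j])
    k_ii = edge_kernel(docs[i], docs[i], weights[i], weights[i])
    k_jj = edge_kernel(docs[j], docs[j], weights[j], weights[j])
    return k_ij / math.sqrt(k_ii * k_jj) if k_ii and k_jj else 0.0


docs = [
    "graph kernel text classification".split(),
    "graph kernel image classification".split(),
    "deep learning needs memory".split(),
]
weights = tfidf_weights(docs)
similar = normalized_kernel(0, 1, docs, weights)
unrelated = normalized_kernel(0, 2, docs, weights)
```

In a full pipeline, the matrix of pairwise kernel values would typically be handed to a kernel classifier such as a support vector machine. That step is omitted here.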
Current graph-based approaches to automatic text summarization, such as LexRank and TextRank, assume a static graph which does not model how the input texts emerge. A suitable evolutionary text graph model may impart a better understanding of the texts and improve the summarization process. We propose a timestamped graph (TSG) model that is motivated by human writing and reading processes, and show how text units in this model emerge over time. In our model, the graphs used by LexRank and TextRank are specific instances of our timestamped graph with particular parameter settings. We apply timestamped graphs to the standard DUC multi-document text summarization task and achieve results comparable to the state of the art.
We propose to use graph-based diffusion techniques with data-dependent kernels to build unigram language models. Our approach entails building graphs in which each vertex corresponds uniquely to a word from a closed vocabulary, and the existence of an edge (with an appropriate weight) between two words indicates some form of similarity between them. In one of our constructions, we place an edge between two words if the number of times these words were seen in a training set differs by at most one count. This graph construction results in a similarity matrix with small intrinsic dimension, since words with the same counts have the same neighbors. Experimental results on a benchmark language modeling task show that our method is competitive with the Good-Turing estimator.
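The count-based graph construction described in this abstract is simple enough to sketch directly. The code below covers only that construction, with unweighted edges; the edge weights, the diffusion step and the resulting language model are not specified in the abstract and are not attempted. The function name `count_difference_graph`, the `max_diff` parameter and the toy data are illustrative assumptions.

```python
from collections import Counter


def count_difference_graph(training_tokens, vocabulary, max_diff=1):
    """Link two vocabulary words when their training counts differ by <= max_diff.

    Words in the closed vocabulary that never occur in training get count 0.
    Returns an adjacency mapping: word -> set of neighboring words.
    """
    counts = Counter(training_tokens)
    neighbors = {word: set() for word in vocabulary}
    for i, u in enumerate(vocabulary):
        for v in vocabulary[i + 1:]:
            if abs(counts[u] - counts[v]) <= max_diff:
                neighbors[u].add(v)
                neighbors[v].add(u)
    return neighbors


vocabulary = ["the", "of", "graph", "kernel", "zebra"]
training = "the the the the of of of of graph graph kernel".split()
graph = count_difference_graph(training, vocabulary)
```

Here `the` and `of` share the count 4 and are linked to each other, while the unseen word `zebra` (count 0) is linked only to `kernel` (count 1). Because adjacency depends only on counts, the graph has one neighborhood pattern per distinct count value, which is consistent with the small intrinsic dimension the abstract mentions.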
This paper introduces a graph-based algorithm for sequence data labeling, using random walks on graphs encoding label dependencies. The algorithm is illustrated and tested in the context of an unsupervised word sense ...