We introduce Chinese Whispers, a randomized graph-clustering algorithm whose running time is linear in the number of edges. After a detailed definition of the algorithm and a discussion of its strengths and weaknesses, the performance of Chinese Whispers is measured on natural language processing (NLP) problems as diverse as language separation, acquisition of syntactic word classes, and word sense disambiguation. In doing so, we exploit the fact that the small-world property holds for many graphs in NLP.
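The core update rule is simple enough to sketch. Below is a minimal, illustrative Python version, assuming the graph is given as a weighted adjacency dict; the iteration cap, the deterministic tie-breaking, and the toy graph are assumptions for the example, not details from the paper.

```python
# A minimal sketch of the Chinese Whispers idea over an undirected weighted
# graph given as {node: {neighbor: weight}}.
import random

def chinese_whispers(adj, iterations=20, seed=0):
    rng = random.Random(seed)
    # Each node starts in its own class.
    label = {node: node for node in adj}
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)  # randomized update order
        for node in nodes:
            # Sum edge weights per neighboring class ...
            scores = {}
            for neighbor, weight in adj[node].items():
                scores[label[neighbor]] = scores.get(label[neighbor], 0.0) + weight
            # ... and adopt the strongest class in the neighborhood.
            # (The paper breaks ties randomly; max here is deterministic.)
            if scores:
                label[node] = max(scores, key=scores.get)
    return label

# Tiny example: two triangles joined by one weak edge.
g = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 0.1},
    "d": {"c": 0.1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
print(chinese_whispers(g))
```

Each pass visits every edge a constant number of times, which is where the linear-in-edges runtime comes from.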
To determine how close two language models (e.g., n-gram models) are, we can use several distance measures. If we can represent the models as distributions, then the similarity is basically the similarity of distribu...
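As an illustration of the distributional view, here is a hedged sketch comparing two unigram models with the Jensen-Shannon divergence; the smoothing scheme and the toy corpora are assumptions for the example, not necessarily the measures used in the paper.

```python
# Jensen-Shannon divergence between two smoothed unigram models.
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=0.5):
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)  # additive smoothing
    return {w: (counts[w] + alpha) / total for w in vocab}

def js_divergence(p, q):
    def kl(a, b):
        return sum(a[w] * math.log(a[w] / b[w], 2) for w in a)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

t1 = "the cat sat on the mat".split()
t2 = "the dog sat on the log".split()
vocab = set(t1) | set(t2)
p, q = unigram_model(t1, vocab), unigram_model(t2, vocab)
print(js_divergence(p, q))  # 0 = identical, 1 = maximally different (base 2)
```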
The ability to accurately model the content structure of text is important for many natural language processing applications. This paper describes experiments with generative models for analyzing the discourse structu...
Document indexing and representation of term-document relations are very important for document clustering and retrieval. In this paper, we combine a graph-based dimensionality reduction method with a corpus-based association measure within the Generalized Latent Semantic Analysis framework. We evaluate the graph-based GLSA on the document clustering task.
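A rough sketch of such a combination, assuming PMI as the corpus-based association measure and a plain eigendecomposition of the term-association matrix as the graph-based dimensionality reduction step; both choices, and the toy corpus, are illustrative assumptions rather than the paper's exact setup.

```python
# Term vectors from the spectrum of a PMI term-association matrix,
# then document vectors as sums of their terms' vectors.
import numpy as np

docs = [["graph", "clustering", "nodes"],
        ["document", "clustering", "retrieval"],
        ["graph", "nodes", "edges"]]
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}

# Document-level co-occurrence counts between term pairs.
n = len(vocab)
cooc = np.zeros((n, n))
occ = np.zeros(n)
for d in docs:
    for w in set(d):
        occ[idx[w]] += 1
        for v in set(d):
            cooc[idx[w], idx[v]] += 1

# Pointwise mutual information as the edge weight of the term graph.
N = len(docs)
with np.errstate(divide="ignore"):
    pmi = np.log((cooc / N) / np.outer(occ / N, occ / N))
pmi[np.isneginf(pmi)] = 0.0   # never co-occurred: no edge
np.fill_diagonal(pmi, 0.0)

# Low-dimensional term vectors from the leading eigenvectors.
vals, vecs = np.linalg.eigh(pmi)
term_vectors = vecs[:, -2:]   # keep 2 dimensions for the toy example
doc_vectors = np.array([term_vectors[[idx[w] for w in set(d)]].sum(axis=0)
                        for d in docs])
print(doc_vectors)
```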
In this paper we present a graph-based approach to question answering. The method assumes a graph representation of question sentences and text sentences. Question answering rules are automatically learnt from a training corpus of questions and answer sentences with the answer annotated. The method is independent of the graph representation formalism chosen. A particular example is presented that uses a specific graph representation of the logical contents of sentences.
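To make the formalism-independence concrete, here is a hedged sketch in which sentences are sets of (head, relation, dependent) triples and a learnt rule is a question graph with a wh-placeholder; the rule format and the example graphs are our assumptions, not the paper's representation.

```python
# Bind the wh-placeholder to every sentence node whose remaining
# question edges are all present in the sentence graph.
def match(question, sentence, wh="WHO"):
    answers = set()
    candidates = {n for edge in sentence for n in (edge[0], edge[2])}
    for cand in candidates:
        grounded = {(h if h != wh else cand, r, d if d != wh else cand)
                    for (h, r, d) in question}
        if grounded <= sentence:  # question graph subsumed by sentence graph
            answers.add(cand)
    return answers

q = {("wrote", "subj", "WHO"), ("wrote", "obj", "Hamlet")}
s = {("wrote", "subj", "Shakespeare"), ("wrote", "obj", "Hamlet"),
     ("wrote", "tmod", "1601")}
print(match(q, s))  # {'Shakespeare'}
```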
We discuss several feature sets for novelty detection at the sentence level, using the data and procedure established in task 2 of the TREC 2004 novelty track. In particular, we investigate feature sets derived from graph representations of sentences and sets of sentences. We show that a highly connected graph produced by using sentence-level term distances and pointwise mutual information can serve as a source to extract features for novelty detection. We compare several feature sets based on such a graph representation. These feature sets allow us to increase the accuracy of an initial novelty classifier which is based on a bag-of-words representation and KL divergence. The final result ties with the best system at TREC 2004.
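For orientation, a minimal sketch of the kind of bag-of-words + KL-divergence baseline the abstract starts from: each incoming sentence is scored by the divergence of its smoothed unigram distribution from that of the sentences seen so far. The threshold and smoothing constant are illustrative assumptions.

```python
# KL-based novelty scoring over a stream of sentences.
import math
from collections import Counter

def smoothed_dist(counts, vocab, alpha=0.1):
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

def novelty_scores(sentences, threshold=0.4):
    seen = Counter()
    vocab = {w for s in sentences for w in s}
    results = []
    for s in sentences:
        if seen:
            p = smoothed_dist(Counter(s), vocab)   # this sentence
            q = smoothed_dist(seen, vocab)         # history so far
            score = sum(p[w] * math.log(p[w] / q[w]) for w in vocab)
        else:
            score = float("inf")  # first sentence is trivially novel
        results.append((score > threshold, score))
        seen.update(s)
    return results

sents = [["graph", "methods", "for", "nlp"],
         ["graph", "methods", "for", "nlp", "again"],
         ["stock", "prices", "fell", "sharply"]]
for novel, score in novelty_scores(sents):
    print(novel, round(score, 3))
```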
Summary form only given. A number of problems in information retrieval and natural language processing can be approached using graph theory. Some representative examples in IR include Brin and Page's PageRank and Kleinberg's HITS for document ranking using graph-based random walk models. In NLP, one could mention Pang and Lee's work on sentiment analysis using graph min-cuts, Mihalcea's work on word sense disambiguation, Zhu et al.'s label propagation algorithms, Toutanova et al.'s prepositional attachment algorithm, and McDonald et al.'s dependency parsing algorithm using minimum spanning trees. In this talk I will quickly summarize three graph-based algorithms developed recently at the University of Michigan: (a) lexrank, a method for multi-document summarization based on random walks on lexical centrality graphs, (b) TUMBL, a generic method using bipartite graphs for semi-supervised learning, and (c) biased lexrank, a semi-supervised technique for passage ranking for information retrieval. I will also discuss the applicability of such techniques to other problems in natural language processing and information retrieval.
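As a pointer to what lexrank computes, here is a compact sketch: a PageRank-style power iteration over a sentence-similarity graph. Cosine over term counts and the damping factor are the customary choices but are assumptions here, not a transcript of the talk.

```python
# Lexical-centrality scoring of sentences via power iteration.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def lexrank(sentences, damping=0.85, iters=50):
    bags = [Counter(s) for s in sentences]
    n = len(bags)
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    rows = [sum(row) or 1.0 for row in sim]  # row sums normalize the walk
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n +
                  damping * sum(scores[j] * sim[j][i] / rows[j]
                                for j in range(n))
                  for i in range(n)]
    return scores

sents = [["graphs", "rank", "sentences"],
         ["sentences", "rank", "graphs", "well"],
         ["bananas", "are", "yellow"]]
print(lexrank(sents))  # the two similar sentences score highest
```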
We present a graph-based semi-supervised learning algorithm to address the sentiment analysis task of rating inference. Given a set of documents (e.g., movie reviews) and accompanying ratings (e.g., "4 stars"), the task calls for inferring numerical ratings for unlabeled documents based on the perceived sentiment expressed by their text. In particular, we are interested in the situation where labeled data is scarce. We place this task in the semi-supervised setting and demonstrate that considering unlabeled reviews in the learning process can improve rating-inference performance. We do so by creating a graph on both labeled and unlabeled data to encode certain assumptions for this task. We then solve an optimization problem to obtain a smooth rating function over the whole graph. When only limited labeled data is available, this method achieves significantly better predictive accuracy than other methods that ignore the unlabeled examples during training.
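One standard way to obtain such a smooth function is sketched below under our own assumptions: hold the labeled ratings fixed and iterate each unlabeled node toward the weighted average of its neighbors. The paper solves an explicit optimization problem; this iterative averaging is a simple illustrative substitute, and the graph and ratings are toy data.

```python
# Harmonic/label-propagation style smoothing of ratings over a graph.
def propagate_ratings(adj, labeled, iters=100):
    ratings = dict(labeled)
    unlabeled = [n for n in adj if n not in labeled]
    for n in unlabeled:
        ratings[n] = sum(labeled.values()) / len(labeled)  # neutral start
    for _ in range(iters):
        for n in unlabeled:
            total = sum(adj[n].values())
            ratings[n] = sum(w * ratings[m] for m, w in adj[n].items()) / total
    return ratings

# Reviews d0 and d3 are labeled (1 and 4 stars); d1 and d2 are unlabeled
# and sit on a chain between them, so their ratings interpolate smoothly.
adj = {"d0": {"d1": 1.0}, "d1": {"d0": 1.0, "d2": 1.0},
       "d2": {"d1": 1.0, "d3": 1.0}, "d3": {"d2": 1.0}}
print(propagate_ratings(adj, {"d0": 1.0, "d3": 4.0}))  # d1 -> 2, d2 -> 3
```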
We study how two graph algorithms apply to topic-driven summarization in the scope of Document Understanding Conferences. The DUC 2005 and 2006 tasks were to summarize into 250 words a collection of documents on a topic consisting of a few statements or questions. Our algorithms select sentences for extraction. We measure their performance on the DUC 2005 test data, using the Summary Content Units made available after the challenge. One algorithm matches a graph representing the entire topic against each sentence in the collection. The other algorithm checks, for pairs of open-class words in the topic, whether they can be connected in the syntactic graph of each sentence. Matching performs better than connecting words, but a combination of both methods works best. They also both favour longer sentences, which makes summaries more fluent.
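The word-connecting test lends itself to a short sketch: given a sentence's syntactic graph, check for each pair of open-class topic words whether a path connects them. The dependency edges below are hand-written illustrations, not parser output.

```python
# Connectivity of open-class topic-word pairs in a sentence's
# dependency graph, viewed as an undirected graph.
import networkx as nx

def connected_pairs(topic_words, dep_edges):
    g = nx.Graph(dep_edges)
    pairs = [(a, b) for i, a in enumerate(topic_words)
             for b in topic_words[i + 1:]]
    return {(a, b): a in g and b in g and nx.has_path(g, a, b)
            for a, b in pairs}

# "The committee approved new funding" (toy dependency edges).
edges = [("approved", "committee"), ("approved", "funding"),
         ("funding", "new"), ("committee", "the")]
print(connected_pairs(["committee", "funding", "budget"], edges))
```

A natural refinement, in the spirit of the abstract, is to use the path length as a graded feature rather than a binary connectivity test.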
Summary form only given. A typical text-based natural language application (e.g., machine translation, summarization, information extraction) consists of a pipeline of preprocessing steps such as tokenization, stemming, part-of-speech tagging, named entity detection, chunking, and parsing. Information flows downstream through the preprocessing steps along a narrow pipe: each step transforms a single input string into a single best solution string. However, this narrow pipe is limiting for two reasons. First, since each of the preprocessing steps is error-prone, producing a single best solution can magnify error propagation down the pipeline. Second, the preprocessing steps are forced to resolve genuine ambiguity prematurely. While widening the pipeline can potentially benefit text-based language applications, it is imperative for spoken language processing, where the output from the speech recognizer is typically a word lattice/graph. In this talk, we review how such a goal has been accomplished in tasks such as spoken language understanding, speech translation, and multimodal language processing. We will also sketch methods that encode the preprocessing steps as finite-state transductions in order to exploit composition of finite-state transducers as a general constraint propagation method.
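A toy sketch of the "wider pipe": instead of passing a single best string downstream, keep the word lattice and compose the next step (here a tiny tag lexicon standing in for a finite-state transduction) with every edge, deferring the choice of a single path to the end. The lattice, weights, and lexicon are illustrative assumptions, not the talk's systems.

```python
# Composing a toy tagging transduction with a word lattice,
# then extracting the single cheapest path at the very end.
def compose(lattice, lexicon):
    """Expand each lattice edge with every tag the lexicon allows."""
    return [(s, t, word, tag, w + cost)
            for (s, t, word, w) in lattice
            for tag, cost in lexicon.get(word, [("UNK", 1.0)])]

def best_path(edges, start, end):
    """Cheapest path through the composed lattice (DP over states,
    assuming states are topologically numbered)."""
    best = {start: (0.0, [])}
    for s, t, word, tag, w in sorted(edges):
        if s in best:
            cost, path = best[s]
            if t not in best or cost + w < best[t][0]:
                best[t] = (cost + w, path + [(word, tag)])
    return best.get(end)

# Ambiguous recognizer output: "saw" vs. "sore" on the same span.
lattice = [(0, 1, "I", 0.1), (1, 2, "saw", 0.3), (1, 2, "sore", 0.9),
           (2, 3, "her", 0.2)]
lexicon = {"I": [("PRP", 0.0)], "saw": [("VBD", 0.1), ("NN", 0.6)],
           "sore": [("JJ", 0.2)], "her": [("PRP$", 0.1)]}
print(best_path(compose(lattice, lexicon), 0, 3))
```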