The aim of this paper is to apply and develop methods based on natural language processing for automatically testing the validity, reliability and coverage of various Swedish SNOMED-CT subsets, the Systematized NOmenc...
Sets of lexical items sharing a significant aspect of their meaning (concepts) are fundamental for linguistics and NLP. Unsupervised concept acquisition algorithms have been shown to produce good results, and are pref...
ISBN: 9781932432541 (print)
The first task of statistical computational linguistics, or any other type of data-driven processing of language, is the extraction of counts and distributions of phenomena. This is much more difficult for the type of complex structured data found in treebanks and in corpora with sophisticated annotation than for tokenized texts. Recent developments in data mining, particularly in the extraction of frequent subtrees from treebanks, offer some solutions. We have applied a modified version of the TreeMiner algorithm to a small treebank and present some promising results.
ISBN: 9781932432541 (print)
In this paper, we address the problem of event coreference resolution as specified in the Automatic Content Extraction (ACE) program. In contrast to entity coreference resolution, event coreference resolution has received little attention from researchers. We first illustrate the diverse scenarios of event coreference with an example. We then model event coreference resolution as a spectral graph clustering problem and evaluate the clustering algorithm on ground truth event mentions using the ECM F-Measure. We obtain ECM-F scores of 0.8363 and 0.8312, respectively, using two methods for computing coreference matrices.
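As a rough illustration of the clustering step, the sketch below splits a handful of event mentions into two groups from a symmetric coreference (affinity) matrix, using the sign of the second eigenvector of the graph Laplacian. It is a minimal two-way example under stated assumptions, not the authors' system: the matrix values are invented, the function name `spectral_bipartition` is hypothetical, and the abstract does not say which spectral variant or how many clusters were used.

```python
def spectral_bipartition(w, iterations=500):
    """Two-way spectral cut of a symmetric affinity matrix (list of lists).

    Returns one label (0 or 1) per node, taken from the sign of the
    Fiedler vector of the unnormalized Laplacian L = D - W.
    """
    n = len(w)
    if n < 2:
        return [0] * n
    degree = [sum(row) for row in w]
    lap = [[(degree[i] if i == j else 0.0) - w[i][j] for j in range(n)]
           for i in range(n)]
    # Shift so that (shift*I - L) is positive definite; power iteration on it
    # then favours the SMALLEST eigenvalues of L.
    shift = 2.0 * max(degree) + 1.0
    # Deterministic start vector with zero mean.
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        v = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n))
             for i in range(n)]
        # Project out the constant vector (eigenvalue 0 of L).
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0.0:
            return [0] * n
        v = [x / norm for x in v]
    return [0 if x >= 0 else 1 for x in v]


# Invented pairwise coreference scores for five event mentions:
# mentions 0-2 resemble each other, as do mentions 3-4.
scores = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
labels = spectral_bipartition(scores)
print(labels)
```

Which group receives label 0 is arbitrary; only the grouping itself is meaningful. A realistic setting would need more than two clusters, for example by clustering several eigenvectors at once.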
ISBN: 9781932432541 (print)
We present a graph-based model for representing the lexical cohesion of a discourse. In the graph structure, vertices correspond to the content words of a text and edges connecting pairs of words encode how closely the words are related semantically. We show that such a structure can be used to distinguish literal and non-literal usages of multi-word expressions.
ISBN: 9781932432541 (print)
We present ongoing work on a scalable, distributed implementation of over 200 million individual language models, each capturing a single user's dialect in a given language (multilingual users have several models). These have a variety of practical applications, ranging from spam detection to speech recognition, and dialectometrical methods on the social graph. Users should be able to view any content in their language (even if it is spoken by a small population), and to browse our site with an appropriately translated interface (automatically generated, for locales with little crowd-sourced community effort).
ISBN: 9781932432541 (print)
In this study we used bipartite spectral graph partitioning to simultaneously cluster varieties and sound correspondences in Dutch dialect data. While clustering geographical varieties with respect to their pronunciation is not new, the simultaneous identification of the sound correspondences giving rise to the geographical clustering presents a novel opportunity in dialectometry. Earlier methods aggregated sound differences and clustered on the basis of aggregate differences. The determination of the significant sound correspondences which co-varied with cluster membership was carried out on a post hoc basis. Bipartite spectral graph clustering simultaneously seeks groups of individual sound correspondences which are associated, even while seeking groups of sites which share sound correspondences. We show that the application of this method results in clear and sensible geographical groupings and discuss the concomitant sound correspondences.
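The idea of clustering sites and sound correspondences at the same time can be sketched on a toy incidence matrix. The code below follows a common formulation of bipartite spectral partitioning (scale the matrix by row and column degrees, then use the second singular vectors); the abstract does not spell out the exact steps, so this is an assumption. The data are invented, the function name `bipartite_spectral_split` is hypothetical, and a simple sign split stands in for the clustering of the embedding.

```python
import math


def bipartite_spectral_split(a, iterations=500):
    """Co-cluster rows (sites) and columns (sound correspondences) in two groups.

    `a[i][j]` is 1 if site i shows correspondence j, else 0. Every row and
    column is assumed to contain at least one nonzero entry.
    """
    rows, cols = len(a), len(a[0])
    d1 = [sum(a[i]) for i in range(rows)]
    d2 = [sum(a[i][j] for i in range(rows)) for j in range(cols)]
    # Degree-normalized matrix An = D1^(-1/2) A D2^(-1/2).
    an = [[a[i][j] / math.sqrt(d1[i] * d2[j]) for j in range(cols)]
          for i in range(rows)]
    # The top right singular vector of An is known in closed form.
    total = math.sqrt(sum(d2))
    v1 = [math.sqrt(d) / total for d in d2]
    # Power iteration on An^T An, deflating v1, yields the second one.
    v = [j - (cols - 1) / 2.0 for j in range(cols)]
    for _ in range(iterations):
        u = [sum(an[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(an[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        dot = sum(x * y for x, y in zip(v, v1))
        v = [x - dot * y for x, y in zip(v, v1)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        v = [x / norm for x in v]
    u = [sum(an[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    # Rescale to the joint embedding and split rows and columns by sign.
    row_labels = [0 if u[i] / math.sqrt(d1[i]) >= 0 else 1 for i in range(rows)]
    col_labels = [0 if v[j] / math.sqrt(d2[j]) >= 0 else 1 for j in range(cols)]
    return row_labels, col_labels


# Invented data: 4 sites x 4 sound correspondences.
incidence = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
site_labels, correspondence_labels = bipartite_spectral_split(incidence)
print(site_labels, correspondence_labels)
```

Sites and correspondences that receive the same label belong to the same co-cluster, which is what makes it possible to read off the sound correspondences associated with each geographical group.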
Text segmentation is important for many natural language processing tasks, such as passage retrieval and summarization. This paper uses a suffix tree model for text representation and introduces a new measure, subsequence-based coherence, which represents the coherence between sentences and makes use of word order information. It also introduces a text segmentation algorithm, subsequence-based maximum cut, and a passage labeling approach based on subsequences. Results on educational text segmentation show that our method outperforms some existing methods, and the passage labeling results are encouraging.
ISBN: 9781932432305 (print)
Computing the semantic similarity between terms relies on the existence and use of semantic resources. However, these resources, often composed of equivalent units, or synonyms, must first be analyzed and weighted in order to identify the reliability zones within them where semantic cohesiveness is stronger. We propose an original method for acquiring elementary synonyms based on the exploitation of structured terminologies and on the analysis of the syntactic structure of complex (multi-unit) terms and their compositionality. The acquired synonyms are then profiled using endogenous lexical and linguistic indicators (other types of relations, lexical inclusions, productivity), which are automatically inferred within the same terminologies. Additionally, synonymy relations are represented as a graph, whose structure is analyzed. In particular, we explore the usefulness of graph theory notions such as connected component, clique, density, bridge, articulation vertex, and centrality of vertices.
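Some of the graph notions listed above are easy to make concrete. The sketch below builds a small undirected synonymy graph and computes its connected components, the density of a vertex set, and its articulation vertices. The synonym pairs are invented for illustration and all function names are hypothetical; the articulation test is a brute-force check suitable only for small graphs, not the method used in the paper.

```python
from collections import defaultdict


def build_graph(pairs):
    """Undirected graph as a dict mapping each term to its set of synonyms."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    """List of vertex sets, one per connected component."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


def density(graph, nodes):
    """Share of possible edges among `nodes` that are present (1.0 = clique)."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(graph[v] & nodes) for v in nodes) / 2
    return edges / (n * (n - 1) / 2)


def articulation_vertices(graph):
    """Vertices whose removal increases the number of components."""
    base = len(connected_components(graph))
    found = set()
    for v in graph:
        reduced = {u: nbrs - {v} for u, nbrs in graph.items() if u != v}
        if len(connected_components(reduced)) > base:
            found.add(v)
    return found


# Invented synonym pairs, for illustration only.
pairs = [
    ("pain", "ache"), ("ache", "soreness"), ("pain", "soreness"),
    ("soreness", "tenderness"), ("tumor", "neoplasm"),
]
graph = build_graph(pairs)
print(connected_components(graph))
print(density(graph, {"pain", "ache", "soreness"}))
print(articulation_vertices(graph))
```

In this toy graph, a fully connected group such as pain, ache and soreness is the kind of zone where synonymy looks most reliable, while a term that alone holds a component together marks a weaker link.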
Coreference resolution has been shown to be beneficial in many natural language processing (NLP) applications. Over the past decades, various strategies have been proposed to address this problem. In order to employ these different strategies in a single model, we combine some of them using an ensemble learning method based on maximum entropy (ME). The coreference resolution strategies combined in this paper are the mention-ranking model, the mention-entity model and the graph-cut-based model, all of which are currently high-quality methods. Experiments on ACE 2004 Chinese data show that the proposed method performs better than each of the three basic models and effectively improves coreference resolution.