With billions of triples in the Linked Open Data cloud, which continues to grow exponentially, very challenging tasks are emerging around the exploitation of large-scale reasoning. A considerable amount of work has been done on using Information Retrieval methods to address these problems. However, although the applied models work at Web scale, they downgrade the semantics contained in an RDF graph by treating each physical resource as a 'bag of words (URIs/literals)'. Distributional statistical methods can address this problem by capturing the structure of the graph more efficiently. However, these methods continually run into efficiency and scalability problems on serial computing architectures due to their computational complexity. In this paper, we describe a parallelization algorithm for one such method (Random Indexing), based on the Message-Passing Interface (MPI), that enables efficient utilization of high-performance parallel computers. Our evaluation results show significant performance improvement.
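As a rough illustration of the Random Indexing idea mentioned in the abstract above, the following is a minimal serial sketch in Python. It is not the paper's MPI-based parallel algorithm. The dimensionality, the number of non-zero entries, the `ex:` resource names, and the choice to accumulate predicate and object index vectors into each subject's context vector are all assumptions made for illustration.

```python
import random
from collections import defaultdict

DIM = 64      # vector dimensionality (illustrative assumption)
NONZERO = 4   # number of +1/-1 entries per index vector (illustrative assumption)


def index_vector(term, dim=DIM, nonzero=NONZERO):
    """Return a sparse ternary index vector as {position: +1 or -1}.

    Seeding the generator with the term makes the vector reproducible,
    so nothing has to be stored per term.
    """
    rng = random.Random(term)
    positions = rng.sample(range(dim), nonzero)
    return {pos: rng.choice((-1, 1)) for pos in positions}


def build_context_vectors(triples, dim=DIM):
    """Accumulate, for each subject, the index vectors of its predicates and objects."""
    context = defaultdict(lambda: [0] * dim)
    for subj, pred, obj in triples:
        for neighbour in (pred, obj):
            for pos, sign in index_vector(neighbour, dim).items():
                context[subj][pos] += sign
    return dict(context)


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy graph: ex:a and ex:b share the same predicate/object pair.
triples = [
    ("ex:a", "ex:type", "ex:City"),
    ("ex:b", "ex:type", "ex:City"),
    ("ex:c", "ex:knows", "ex:d"),
]
vectors = build_context_vectors(triples)
```

In this sketch, resources that occur with the same predicates and objects receive identical context vectors, so `cosine(vectors["ex:a"], vectors["ex:b"])` is expected to be close to 1, while unrelated resources tend towards lower similarity. Because each triple only adds to one subject's vector, the accumulation step is the part a parallel version would distribute.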
We present the Potsdam natural language generation systems P1 and P2 of the GIVE-2.5 Challenge. The systems implement two different referring expression generation models from Garoufi and Koller (2011) while behaving ...
Topic models are an established technique for generating information about the subjects discussed in collections of documents. Latent Dirichlet Allocation (LDA) is a widely applied topic model. The topic models genera...
Objective: Information extraction and classification of clinical data are current challenges in natural language processing. This paper presents a cascaded method to deal with three different extractions and classifications in clinical data: concept annotation, assertion classification and relation classification. Materials and methods: A pipeline system was developed for clinical natural language processing that includes a proofreading process, with gold-standard reflexive validation and correction. The information extraction system is a combination of a machine learning approach and a rule-based approach. The outputs of this system are used for evaluation in all three tiers of the fourth i2b2/VA shared-task and workshop challenge. Results: Overall concept classification attained an F-score of 83.3% against a baseline of 77.0%, the optimal F-score for assertions about the concepts was 92.4%, and the relation classifier attained 72.6% for relationships between clinical concepts against a baseline of 71.0%. Micro-average results for the challenge test set were 81.79%, 91.90% and 70.18%, respectively. Discussion: The challenge in the multi-task test requires a distribution of time and workload across the individual tasks, so an overall performance evaluation on all three tasks is more informative than treating each task assessment as independent. The simplicity of the model developed in this work should be contrasted with the very large feature space of other participants in the challenge, who achieved only slightly better performance. There is a need to charge a penalty against the complexity of a model, as defined in message minimalisation theory, when comparing results. Conclusion: A complete pipeline system is presented for constructing language processing models that can be used to process multiple practical detection tasks of language structures in clinical records.
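The abstract above reports micro-averaged results. As a reminder of what that aggregation means, here is a minimal Python sketch that pools true positives, false positives and false negatives over all classes before computing the F-score. The class names and counts are hypothetical and are not taken from the paper or the challenge data.

```python
def micro_f1(counts):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute P, R and F1."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-class counts for two concept types.
example_counts = {
    "problem": {"tp": 8, "fp": 2, "fn": 2},
    "test": {"tp": 2, "fp": 0, "fn": 2},
}
score = micro_f1(example_counts)
```

Because the counts are pooled first, frequent classes weigh more heavily in the micro-average than rare ones, unlike a macro-average, which would average the per-class F-scores.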
The 2010 i2b2/VA workshop on Natural Language Processing Challenges for Clinical Records presented three tasks: a concept extraction task focused on the extraction of medical concepts from patient reports; an assertion classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. i2b2 and the VA provided an annotated reference standard corpus for the three tasks. Using this reference standard, 22 systems were developed for concept extraction, 21 for assertion classification, and 16 for relation classification. These systems showed that machine learning approaches could be augmented with rule-based systems to determine concepts, assertions, and relations. Depending on the task, the rule-based systems can either provide input for machine learning or post-process the output of machine learning. Ensembles of classifiers, information from unlabeled data, and external knowledge sources can help when the training data are inadequate.
In this paper we apply lightly-supervised training to a hierarchical phrase-based statistical machine translation system. We employ bitexts that have been built by automatically translating large amounts of monolingua...
ISBN: 193243254X (print)
The proceedings contain 12 papers. The topics discussed include: network analysis reveals structure indicative of syntax in the corpus of undeciphered Indus civilization inscriptions; bipartite spectral graph partitioning to co-cluster varieties and sound correspondences in dialectology; WikiWalk: random walks on Wikipedia for semantic relatedness; classifying Japanese polysemous verbs based on Fuzzy C-means clustering; measuring semantic relatedness with vector space models and random walks; ranking and semi-supervised classification on large scale graphs using Map-Reduce; opinion graphs for polarity and discourse classification; a cohesion graph-based approach for unsupervised recognition of literal and non-literal use of multiword expressions; social (distributed) language modeling, clustering and dialectometry; and quantitative analysis of treebanks using frequent subtree mining methods.
ISBN: 364214683X (print)
The proceedings contain 14 papers. The topics discussed include: learning finite state machines; developing computational morphology for low- and middle-density languages; selected operations and applications of n-tape weighted finite-state machines; OpenFst; morphological analysis of tone marked Kinyarwanda text; minimizing weighted tree grammars using simulation; reducing nondeterministic finite automata with SAT solvers; joining composition and trimming of finite-state transducers; porting Basque morphological grammars to foma, an open-source tool; describing Georgian morphology with a finite-state system; finite state morphology of the Nguni language cluster: modeling and implementation issues; a finite state approach to Setswana verb morphology; and Zulu: an interactive learning competition.
Kernel methods are considered the most effective techniques for various relation extraction (RE) tasks as they provide higher accuracy than other approaches. In this paper, we introduce new dependency tree (DT) kernel...
The coexistence of five languages with official status in the Iberian Peninsula (Basque, Catalan, Galician, Portuguese, and Spanish) has prompted collaborative efforts to share and cross-develop resources and materials for these languages of the region. However, comprehension boundaries do not exist only between these five languages; dialectal variation is also present, and in the case of Basque, for example, many written resources are only available in dialectal (or pre-standardization) form. At the same time, all the computational tools developed for Basque are based on the standard language ("Batua") and will not work correctly with other dialects, of which there are many. In this work we attempt to semiautomatically deduce relationships between standard Basque and its dialectal variants. Such an effort provides an opportunity to apply existing tools to texts issued before a unified standard Basque was developed, and so take advantage of a rich source of linguistic information.