In this paper, we introduce a novel graphbased technique for topic based multidocument summarization. We transform documents into a bipartite graph where one set of nodes represents entities and the other set of node...
详细信息
One research goal in Second language Acquisition (SLA) is to formulate and test hypotheses about errors and the environments in which they are made, a process which often involves substantial effort;large amounts of d...
详细信息
We describe the neural-network training framework used in the Kaldi speech recognition toolkit, which is geared towards training DNNs with large amounts of training data using multiple GPU-equipped or multi-core machi...
详细信息
Most of the recent work on machine learning-based temporal relation classification has been done by considering only a given pair of temporal entities (events or temporal expressions) at a time. Entities that have tem...
详细信息
This paper describes a CRF based token level language identification system entry to language Identification in CodeSwitched (CS) Data task of CodeSwitch 2014. Our system hinges on using conditional posterior probabil...
详细信息
We describe our entry in the EMNLP 2014 code-switching shared task. Our system is based on a sequential classifier, trained on the shared training set using various character- and word-level features, some calculated ...
详细信息
Background: Parsing, which generates a syntactic structure of a sentence (a parse tree), is a critical component of naturallanguageprocessing (NLP) research in any domain including medicine. Although parsers develop...
详细信息
Background: Parsing, which generates a syntactic structure of a sentence (a parse tree), is a critical component of naturallanguageprocessing (NLP) research in any domain including medicine. Although parsers developed in the general English domain, such as the Stanford parser, have been applied to clinical text, there are no formal evaluations and comparisons of their performance in the medical domain. methods: In this study, we investigated the performance of three state-of-the-art parsers: the Stanford parser, the Bikel parser, and the Charniak parser, using following two datasets: (1) A Treebank containing 1,100 sentences that were randomly selected from progress notes used in the 2010 i2b2 NLP challenge and manually annotated according to a Penn Treebank based guideline;and (2) the MiPACQ Treebank, which is developed based on pathology notes and clinical notes, containing 13,091 sentences. We conducted three experiments on both datasets. First, we measured the performance of the three state-of-the-art parsers on the clinical Treebanks with their default settings. Then we re-trained the parsers using the clinical Treebanks and evaluated their performance using the 10-fold cross validation method. Finally we re-trained the parsers by combining the clinical Treebanks with the Penn Treebank. Results: Our results showed that the original parsers achieved lower performance in clinical text (Bracketing F-measure in the range of 66.6%-70.3%) compared to general English text. After retraining on the clinical Treebank, all parsers achieved better performance, with the best performance from the Stanford parser that reached the highest Bracketing F-measure of 73.68% on progress notes and 83.72% on the MiPACQ corpus using 10-fold cross validation. When the combined clinical Treebanks and Penn Treebank was used, of the three parsers, the Charniak parser achieved the highest Bracketing F-measure of 73.53% on progress notes and the Stanford parser reached the highest F-measur
We describe a CRF based system for word-level language identification of code-mixed text. Our method uses lexical, contextual, character n-gram, and special character features, and therefore, can easily be replicated ...
详细信息
This paper describes the DCU-UVT team's participation in the language Identification in Code-Switched Data shared task in the workshop on Computational Approaches to Code Switching. Word-level classification exper...
详细信息
We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish-English, Mandarin-English, Nepali-English, and Modern Stand...
详细信息
暂无评论