In this paper, to support more precise Chinese Out-of-Vocabulary (OOV) term detection and Part-of-Speech (POS) guessing, a unified mechanism is proposed and formulated based on the fusion of multiple features and supe...
详细信息
ISBN:
(纸本)9781577355120
In this paper, to support more precise Chinese Out-of-Vocabulary (OOV) term detection and Part-of-Speech (POS) guessing, a unified mechanism is proposed and formulated based on the fusion of multiple features and supervised learning. Besides all the traditional features, the new features for statistical information and global contexts are introduced, as well as some constraints and heuristic rules, which reveal the relationships among OOV term candidates. Our experiments on the Chinese corpora from both People's Daily and SIGHAN 2005 have achieved the consistent results, which are better than those acquired by pure rule-based or statistics-based models. From the experimental results for combining our model with Chinese monolingual retrieval on the data sets of TREC-9, it is found that the obvious improvement for the retrieval performance can also be obtained.
Many unsupervised learning methods for recognising patterns in data streams are based on fixed length data sequences, which makes them unsuitable for applications where the data sequences are of variable length such a...
详细信息
ISBN:
(纸本)9781577355120
Many unsupervised learning methods for recognising patterns in data streams are based on fixed length data sequences, which makes them unsuitable for applications where the data sequences are of variable length such as in speech recognition, behaviour recognition and text classification. In order to use these methods on variable length data sequences, a pre-processing step is required to manually segment the data and select the appropriate features, which is often not practical in real-world applications. In this paper we suggest an unsupervised learning method that handles variable length data sequences by identifying structure in the data stream using text compression and the edit distance between 'words'. We demonstrate that using this method we can automatically cluster unlabelled data in a data stream and perform segmentation. We evaluate the effectiveness of our proposed method using both fixed length and variable length benchmark datasets, comparing it to the Self-Organising Map in the first case. The results show a promising improvement over baseline recognition systems.
Activity recognition aims to identify and predict human activities based on a series of sensor readings. In recent years, machine learning methods have become popular in solving activity recognition problems. A specia...
详细信息
ISBN:
(纸本)9781577355120
Activity recognition aims to identify and predict human activities based on a series of sensor readings. In recent years, machine learning methods have become popular in solving activity recognition problems. A special difficulty for adopting machine learning methods is the workload to annotate a large number of sensor readings as training data. Labeling sensor readings for their corresponding activities is a time-consuming task. In practice, we often have a set of labeled training instances ready for an activity recognition task. If we can transfer such knowledge to a new activity recognition scenario that is different from, but related to, the source domain, it will ease our effort to perform manual labeling of training data for the new scenario. In this paper, we propose a transfer learning framework based on automatically learning a correspondence between different sets of sensors to solve this transfer-learning in activity recognition problem. We validate our framework on two different datasets and compare it against previous approaches of activity recognition, and demonstrate its effectiveness.
Haseman-Elston (H-E) regression is a commonly used conventional approach for detecting quantitative trait loci (QTLs), which regulate the quantitative phenotype based on the Identical-By-Descent (IBD) information betw...
详细信息
ISBN:
(纸本)9783642238772
Haseman-Elston (H-E) regression is a commonly used conventional approach for detecting quantitative trait loci (QTLs), which regulate the quantitative phenotype based on the Identical-By-Descent (IBD) information between twins in Genome-wide scan. However, this approach only considers genetic effect at individual loci, but not any interaction between genes. A Pair-Wise H-E regression (PWH-E) and a Feature Screening Approach (FSA) are proposed in this paper to take gene-gene interaction into account when detecting QTLs. After testing these approaches with several series of simulation studies, they are applied to a real-world bone mineral density (BMD) dataset, and find three site specific sets of potential QTLs. Further comparison analyses show that our results not only corroborate the 14 findings from previous published studies, but also suggest 22 new QTLs of BMD.
Positive and unlabelled learning (PU learning) has been investigated to deal with the situation where only the positive examples and the unlabelled examples are available. Most of the previous works focus on identifyi...
详细信息
ISBN:
(纸本)9781577355120
Positive and unlabelled learning (PU learning) has been investigated to deal with the situation where only the positive examples and the unlabelled examples are available. Most of the previous works focus on identifying some negative examples from the unlabelled data, so that the supervised learning methods can be applied to build a classifier. However, for the remaining unlabelled data, which can not be explicitly identified as positive or negative (we call them ambiguous examples), they either exclude them from the training phase or simply enforce them to either class. Consequently, their performance may be constrained. This paper proposes a novel approach, called similarity-based PU learning (SPUL) method, by associating the ambiguous examples with two similarity weights, which indicate the similarity of an ambiguous example towards the positive class and the negative class, respectively. The local similarity-based and global similarity-based mechanisms are proposed to generate the similarity weights. The ambiguous examples and their similarity-weights are thereafter incorporated into an SVM-based learning phase to build a more accurate classifier. Extensive experiments on real-world datasets have shown that SPUL outperforms state-of-the-art PU learning methods.
Network alarm triage refers to grouping and prioritizing a stream of low-level device health information to help operators find and fix problems. Today, this process tends to be largely manual because existing rule-ba...
详细信息
ISBN:
(纸本)9781577355120
Network alarm triage refers to grouping and prioritizing a stream of low-level device health information to help operators find and fix problems. Today, this process tends to be largely manual because existing rule-based tools cannot easily evolve with the network. We present CueT, a system that uses interactive machine learning to constantly learn from the triaging decisions of operators. It then uses that learning in novel visualizations to help them quickly and accurately triage alarms. Unlike prior interactive machine learning systems, CueT handles a highly dynamic environment where the groups of interest are not known a priori and evolve constantly. Our evaluations with real operators anddata from a large network show that CueT significantly improves the speed and accuracy of alarm triage.
Patterns provide a simple, yet powerful means of describing formal languages. However, for many applications, neither patterns nor their generalized versions of typed patterns are expressive enough. This paper extends...
详细信息
ISBN:
(纸本)9783642244117;9783642244124
Patterns provide a simple, yet powerful means of describing formal languages. However, for many applications, neither patterns nor their generalized versions of typed patterns are expressive enough. This paper extends the model of (typed) patterns by allowing relations between the variables in a pattern. The resulting formal languages are called Relational Pattern Languages (RPLs). We study the problem of learning RPLs from positive data (text) as well as the membership problem for RPLs. These problems are not solvable or not efficiently solvable in general, but we prove positive results for interesting subproblems. We further introduce a new model of learning from a restricted pool of potential texts. Probabilistic assumptions on the process that generates words from patterns make the appearance of some words in the text more likely than that of other words. We prove that, in our new model, a large subclass of RPLs can be learned with high confidence, by effectively restricting the set of likely candidate patterns to a finite set after processing a single positive example.
This paper addresses one of the key components of the mining process: the geological prediction of natural resources from spatially distributed measurements. We present a novel approach combining undirected graphical ...
详细信息
ISBN:
(纸本)9781577355120
This paper addresses one of the key components of the mining process: the geological prediction of natural resources from spatially distributed measurements. We present a novel approach combining undirected graphical models with ensemble classifiers to provide 3D geological models from multiple sensors installed in an autonomous drill rig. Drill sensor measurements used for drilling automation, known as measurement-while-drilling (MWD) data, have the potential to provide an estimate of the geological properties of the rocks being drilled. The proposed method maps MWD parameters to rock types while considering spatial relationships, i.e., associating measurements obtained from neighboring regions. We use a conditional random field with local information provided by boosted decision trees to jointly reason about the rock categories of neighboring measurements. To validate the approach, MWD data was collected from a drill rig operating at an iron ore mine. Graphical models of the 3D structure present in real data sets possess a high number of nodes, edges and cycles, making them intractable for exact inference. We provide a comparison of three approximate inference methods to calculate the most probable distribution of class labels. The empirical results demonstrate the benefits of spatial modeling through graphical models to improve classification performance.
We propose an online classification approach for co-occurrence data which is based on a simple information theoretic principle. We further show how to properly estimate the uncertainty associated with each prediction ...
详细信息
ISBN:
(纸本)9781577355120
We propose an online classification approach for co-occurrence data which is based on a simple information theoretic principle. We further show how to properly estimate the uncertainty associated with each prediction of our scheme and demonstrate how to exploit these uncertainty estimates. First, in order to abstain highly uncertain predictions. And second, within an active learning framework, in orderto preserve classification accuracy while substantially reducing training set size. Our method is highly efficient in terms of run-time and memory footprint requirements. Experimental results in the domain of text classification demonstrate that the classification accuracy of our method is superior or comparable to other state-of-the-art online classification algorithms.
To support more precise query translation for English-Chinese Bi-Directional Cross-Language Information Retrieval (CLIR), we have developed a novel framework by integrating a semantic network to characterize the corre...
详细信息
ISBN:
(纸本)9781577355120
To support more precise query translation for English-Chinese Bi-Directional Cross-Language Information Retrieval (CLIR), we have developed a novel framework by integrating a semantic network to characterize the correlations between multiple inter-related text terms of interest and learn their inter-related statistical query translation models. First, a semantic network is automatically generated from large-scale English-Chinese bilingual parallel corpora to characterize the correlations between a large number of text terms of interest. Second, the semantic network is exploited to learn the statistical query translation models for such text terms of interest. Finally, these inter-related query translation models are used to translate the queries more precisely and achieve more effective CLIR. Our experiments on a large number of official public data have obtained very positive results.
暂无评论