ISBN: 9781424420957 (print)
Most state-of-the-art machine learning algorithms cannot provide a reliable measure of confidence in their classifications and predictions. This paper addresses the importance of reliability and confidence for classification, and presents a novel method, which we call TCM-RF, that combines the unexcelled ensemble method random forest (RF) with the transductive confidence machine (TCM). The new algorithm hedges the predictions of RF and gives a well-calibrated region prediction by using the proximity matrix generated by RF as a nonconformity measure of examples. The new method takes advantage of RF and possesses a more precise and robust nonconformity measure. It can deal with redundant and noisy data with mixed types of variables, and is less sensitive to parameter settings. Experiments on benchmark datasets show that it is more effective and robust than other TCMs. A further study on a real-world lymphoma microarray dataset shows its superiority over SVM, together with its ability to control the risk of error.
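As a rough illustration of how a transductive confidence machine turns nonconformity scores into a region prediction, the sketch below computes the usual conformal p-value for each candidate label and keeps every label whose p-value exceeds a chosen significance level. The function names and the input layout are hypothetical, and the scores are taken as given: the abstract does not state how TCM-RF derives them from the RF proximity matrix, so that step is not reproduced here.

```python
def conformal_p_value(reference_scores, candidate_score):
    """Share of examples (the candidate included) that are at least
    as nonconforming as the candidate example."""
    at_least = sum(1 for s in reference_scores if s >= candidate_score) + 1
    return at_least / (len(reference_scores) + 1)


def region_prediction(scores_by_label, candidate_scores, significance):
    """Return all labels whose p-value exceeds the significance level.

    scores_by_label[label]  : nonconformity scores of the training examples,
                              computed with the test example provisionally
                              assigned `label` (assumed to be precomputed).
    candidate_scores[label] : nonconformity score of the test example itself
                              under that provisional label.
    """
    region = []
    for label, candidate in candidate_scores.items():
        p = conformal_p_value(scores_by_label[label], candidate)
        if p > significance:
            region.append(label)
    return region
```

Lowering `significance` can only enlarge the returned region, since a label that passes a stricter threshold also passes a looser one; this is the mechanism by which a region predictor trades informativeness for a lower error rate.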
This paper proposes a new type of regularization in the context of the multi-class support vector machine for simultaneous classification and gene *** combining the huberized hinge loss function and the elastic net penalty, the proposed support vector machine can perform automatic gene selection and further encourage a grouping effect in the process of building classifiers, thus leading to sparse multi-classifiers with enhanced ***, a reasonable correlation between the two regularization parameters is proposed and an efficient solution path algorithm is *** of microarray classification are performed on the leukaemia data set to verify the obtained results.
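For orientation, the two ingredients named above can be written down in their commonly used binary-classification form. This is only a sketch under that assumption: the abstract does not give the paper's multi-class formulation, and the function names, the default `delta`, and the 1/2 factor on the ridge term are illustrative choices rather than the authors' definitions.

```python
def huberized_hinge(margin, delta=2.0):
    """Huberized hinge loss of a margin y * f(x).

    Zero beyond the margin, quadratic in a band of width `delta`
    below it, and linear further out, so the loss is differentiable
    everywhere, unlike the plain hinge loss.
    """
    if margin > 1.0:
        return 0.0
    if margin > 1.0 - delta:
        return (1.0 - margin) ** 2 / (2.0 * delta)
    return 1.0 - margin - delta / 2.0


def elastic_net_penalty(weights, lam1, lam2):
    """Elastic net penalty: an L1 term that drives weights to zero
    (gene selection) plus a squared L2 term that encourages
    correlated genes to receive similar weights (grouping effect)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam1 * l1 + 0.5 * lam2 * l2
```

The quadratic and linear pieces agree at `margin = 1 - delta`, where both equal `delta / 2`, which is what makes the loss smooth at the junction.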
Classifying a patient based on disease type, treatment prognosis, survivability, or other such criteria has become a major focus of genomics and proteomics. From the perspective of the general population of a particular kind of cell, one would like a classifier that applies to the whole population; however, the population is often so structurally diverse that a satisfactory classifier cannot be designed from the available sample data. In such a circumstance, it can be useful to identify cellular contexts within which a disease can be reliably diagnosed, which in effect means finding classifiers that apply to different sub-populations within the overall population. Using a model-based approach, this paper quantifies the effect of contexts on classification performance as a function of the classifier used and the sample size. The advantage of a model-based approach is that we can vary the contextual confusion as a function of the model parameters, thereby allowing us to compare classification performance in terms of the degree of discriminatory confusion caused by the contexts. We consider five popular classifiers: linear discriminant analysis, three-nearest-neighbor, linear support vector machine, polynomial support vector machine, and boosting. We contrast the case where classification is done with a single classifier that does not discriminate between the contexts with the case where context markers facilitate context separation before classifier design. We observe that little can be done if there is high contextual confusion, but when the contextual confusion is low, context separation can be beneficial, with the benefit depending on the classifier.
ISBN: 0769527272 (print)
Since microarray experiments measure tens of thousands of gene expression values simultaneously, they can be very useful in identifying the phenotypes of diseases. However, analyses of several microarray datasets that were produced independently with the same biological objectives can yield different results. One of the main reasons is the limited number of samples involved in a single microarray experiment. To increase classification accuracy, it is desirable to augment the sample size by integrating independently conducted microarray datasets and making maximal use of them. In this paper, we propose a two-stage approach that first integrates individual microarray datasets, to overcome the problem caused by the limited number of samples, and identifies informative genes, and then builds a classifier using only the informative genes. The classifier built from the larger sample obtained by integrating independent microarray datasets achieves high accuracy, sensitivity, and specificity on an independent test dataset.