ISBN: 9789898111531 (print)
In this paper, the observability of information in modern databases is investigated. Observable information is information that is explicitly stored in a database. Unobservable information is information hidden in various coding schemes, transaction streams, or free-text description fields. Credit risk management now tends to employ cutting-edge technologies and approaches from statistics and machine learning to achieve its goals. It is often forgotten that machine learning schemes can only use completely observable information. The issue is eventually addressed, but it should be addressed during the data warehouse design phase rather than when the data are used.
ISBN: 9781424420209 (print)
Making use of block information in Web IR and data mining tasks calls for a good understanding of the function of each block. Existing work on classifying block functions and judging block importance has not made full use of the image factor, and only simple image features have been considered. We regard images as strong indicators of Web page blocks with various functions and propose to learn block functions using the roles of images as part of the block features. Blocks are generated by Web page segmentation, and the roles of images are decided automatically by image classification. We experiment on 140 Web pages and demonstrate that utilizing the roles of images can significantly improve the classification quality when learning Web page block functions. We also measure the usefulness of different image roles and evaluate the effect of two page segmentation methods.
ISBN: 9780769534077 (print)
Sentiment classification aims to mine people's reviews of a certain event, topic, or product by automatically classifying the reviews as positive or negative opinions. With the rapid development of World Wide Web applications, sentiment classification has a huge opportunity to help people automatically analyze customers' opinions from web information. Automatic opinion mining will benefit both decision makers and ordinary people. It remains a complicated and challenging task. There are two main types of approaches to sentiment classification: machine learning methods and semantic orientation methods. Although some pioneering research has explored approaches for classifying English reviews, little work has been done on sentiment classification for Chinese reviews. This paper proposes a machine learning approach based on a string kernel for sentiment classification of reviews written in Chinese. Experiments on data show the capability of this approach.
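The abstract does not say which string kernel is used. As a rough illustration of the general idea only, the following sketch implements a simple k-spectrum kernel, which compares two strings by the contiguous character k-grams they share. The function names and the choice of `k = 2` are assumptions made for this example, not details from the paper. Working on characters means no word segmentation step is needed for Chinese text.

```python
import math
from collections import Counter


def spectrum_features(text: str, k: int) -> Counter:
    """Count every contiguous substring of length k in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 2) -> float:
    """Inner product of the k-gram count vectors of a and b."""
    fa, fb = spectrum_features(a, k), spectrum_features(b, k)
    return float(sum(count * fb[gram] for gram, count in fa.items()))


def normalized_kernel(a: str, b: str, k: int = 2) -> float:
    """Cosine-normalized kernel, so that string length matters less."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Hypothetical usage: similarity between two short Chinese reviews.
similarity = normalized_kernel("这个产品很好", "这个产品不好")
```

A kernel matrix built from such pairwise values could then be passed to any kernel-based classifier, such as a support vector machine with a precomputed kernel.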
ISBN: 9780769534077 (print)
Many enterprises incorporate information gathered from a variety of data sources into an integrated input for some learning task. For example, to design an automated diagnostic tool for some diseases, one may wish to integrate data gathered from many different hospitals. Analyzing and mining these distributed heterogeneous data sources requires distributed machine learning and data mining techniques. In this paper, a Modified Distributed Combining Algorithm is proposed to cluster disparate data sources that have diverse, possibly overlapping sets of features and need not share objects. First, all objects located at local sites are grouped using the K-Means or Fuzzy C-Means clustering algorithm, and the resulting centroids are taken as local models. Then, the set of centroids is transformed into a unified structure, and optimum values are assigned to missing attributes. Finally, the global cluster centroids are computed to identify a global cluster model based on cluster ensemble and centroid mapping. Experiments are carried out on various datasets from the UCI Machine Learning Repository to assess the efficiency of the proposed algorithm.
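The three steps can be sketched as follows. This is not the paper's algorithm: the abstract does not specify how the "optimum values" for missing attributes are chosen or how the ensemble and centroid mapping work, so the sketch assumes per-feature means for the missing values and a plain k-means over the unified centroids as a stand-in for the global step. Local centroids are assumed to be already computed at each site, and all names are hypothetical.

```python
from statistics import mean


def unify(centroids):
    """Map local centroids (dicts: feature -> value) onto the union of features.

    Missing attributes are filled with the mean of that feature over the
    centroids that do have it (an assumption made for this sketch).
    """
    features = sorted({f for c in centroids for f in c})
    fill = {f: mean(c[f] for c in centroids if f in c) for f in features}
    vectors = [[c.get(f, fill[f]) for f in features] for c in centroids]
    return features, vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic initialization (first k points)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Keep the old center when a group is empty.
        centers = [[mean(col) for col in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


# Hypothetical local models from two sites with overlapping features.
site_a = [{"age": 30.0, "bp": 120.0}, {"age": 60.0, "bp": 150.0}]
site_b = [{"age": 32.0, "chol": 180.0}, {"age": 62.0, "chol": 240.0}]

features, vectors = unify(site_a + site_b)
global_centroids = kmeans(vectors, k=2)
```

Only centroids leave the local sites in this arrangement, so the raw objects never need to be shared, which matches the setting the abstract describes.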
ISBN: 9783540699040 (print)
In the biometric verification system of a smart gun, the rightful user of a gun is recognized based on grip-pattern recognition. It was found that the verification performance of this system degrades strongly when the data for training and testing have been recorded in different sessions with a time lapse. This is due to the variations between the probe image and the gallery image of a subject. In this work, grip-pattern verification has been implemented using two classifiers: the likelihood-ratio classifier and the support vector machine. It has been shown that the support vector machine gives much better results than the likelihood-ratio classifier if there are considerable variations between the data for training and testing. However, once the variations are reduced by certain techniques and the data are thus better modelled during the training process, the support vector machine tends to lose its superiority.
ISBN: 9789898111210 (print)
Artificial neural networks have shown good performance in classification tasks. However, models used for learning in pattern classification are challenged when the differences between the patterns of the training set are small. Therefore, the choice of effective features is mandatory for obtaining good performance. Statistical and geometrical features alone are not suitable for recognition of hand-printed characters because variations in writing styles may deform character shapes. We address this problem by using a relational context feature combined with a local descriptor to train a neural network-based recognition system in a user-independent online character recognition application. Our feature extraction approach provides a rich representation of the global shape characteristics in a considerably compact form. This new relational feature provides higher distinctiveness and increases robustness with respect to character deformations. While enhancing the recognition accuracy, the feature extraction is computationally simple. We show that the ability to discriminate between handwritten Arabic characters is increased by adopting this mechanism in a feed-forward neural network architecture. Our experiments on Arabic character recognition show results comparable with state-of-the-art methods for online recognition of these characters.
ISBN: 9781424421961 (print)
In this paper, we present an automatic terminology extraction approach for Chinese multi-word terms. In this term extraction system, besides five linguistic rules acquired from an available term list by some machine learning methods, two statistical strategies are involved: a termhood measure based on the term distribution variation, and a unithood measure adopting the left and right entropy method to estimate the collocation variation degree. The candidates are ranked according to the values of the former. The latter is used to filter out prepositional phrases and some verb-object phrases that rarely appear as terms. In validation on a small-scale corpus in the computer domain, the precision reaches 91.5% on the top 2000 outputs.
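The left and right entropy idea can be illustrated with a minimal sketch. For each occurrence of a candidate term, the token immediately to its left and right is collected, and the Shannon entropy of each neighbor distribution is computed. High entropy on both sides suggests that the candidate appears in varied contexts and behaves as a unit. The function names, the boundary markers `<s>` and `</s>`, and the use of base-2 logarithms are assumptions of this example, and the paper's exact formulation and thresholds are not given in the abstract.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (in bits) of a list of neighboring tokens."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def left_right_entropy(tokens, candidate):
    """Return (left entropy, right entropy) of a multi-token candidate."""
    n = len(candidate)
    target = tuple(candidate)
    left, right = [], []
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == target:
            # Sentence boundaries count as their own neighbor symbols.
            left.append(tokens[i - 1] if i > 0 else "<s>")
            right.append(tokens[i + n] if i + n < len(tokens) else "</s>")
    return neighbor_entropy(left), neighbor_entropy(right)


# Toy corpus: the candidate ("x", "y") occurs three times.
corpus = "a x y b c x y d a x y e".split()
left_h, right_h = left_right_entropy(corpus, ("x", "y"))
```

In this toy corpus the right neighbors are all different while one left neighbor repeats, so the right entropy is higher than the left entropy.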
ISBN: 9783540884347 (print)
Identifying cancer molecular patterns robustly from high-dimensional protein expression data not only has a significant impact on clinical oncology, but also presents a challenge for statistical learning. Principal component analysis (PCA) is a widely used feature selection algorithm and is generally integrated with classic classification algorithms to conduct cancer molecular pattern discovery. However, its holistic mechanism prevents it from capturing local data characteristics in feature selection. This may increase misclassification rates and affect the robustness of cancer molecular diagnostics. In this study, we develop a nonnegative principal component analysis (NPCA) algorithm and propose an NPCA-based SVM algorithm with sparse coding for the cancer molecular pattern analysis of proteomics data. We report leading classification results from this novel algorithm in predicting cancer molecular patterns of three benchmark proteomics datasets, under 100 trials of 50% hold-out and leave-one-out cross-validation, by directly comparing its performance with that of the PCA-SVM, NMF-SVM, SVM, k-NN and PCA-LDA classification algorithms with respect to classification rates, sensitivities and specificities. Our algorithm also overcomes the overfitting problem in the SVM and PCA-SVM classifications and provides exceptional sensitivities and specificities.
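The evaluation above compares classifiers by sensitivity and specificity. As a reminder of how these two rates are defined, the following minimal sketch computes them from true and predicted labels. The function name, the label encoding (1 for the positive class) and the toy labels are assumptions of this example and do not come from the paper.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), the fraction of positives detected
    specificity = TN / (TN + FP), the fraction of negatives kept negative
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy labels: three positive samples and two negative samples.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Under a repeated hold-out protocol such as the one described, these rates would typically be computed per trial and then averaged.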
ISBN: 9783540884347 (print)
Random Forests, Support Vector Machines and k-Nearest Neighbors are successful and proven classification techniques that are widely used for different kinds of classification problems. One of them is the classification of genomic and proteomic data, which is known as a problem of extremely high dimensionality and therefore demands suitable classification techniques. In this domain, they are usually combined with gene selection techniques to provide optimal classification accuracy. Another reason for reducing the dimensionality of such datasets is interpretability: it is much easier to interpret a small set of ranked genes than 20 or 30 thousand unordered genes. In this paper, we present a classification ensemble of decision trees called Rotation Forest and evaluate its classification performance on small subsets of ranked genes for 14 genomic and proteomic classification problems. An important feature of Rotation Forest is demonstrated, namely robustness and high classification accuracy when using small sets of genes.
ISBN: 9783540884347 (print)
Prediction of protein stability upon amino acid substitution and discrimination of thermophilic proteins from mesophilic ones are important problems in designing stable proteins. We have developed a classification rule generator using information about the wild-type residue, the mutant residue, three neighboring residues and experimentally observed stability data. Utilizing the rules, we have developed a decision-tree-based method for discriminating between stabilizing and destabilizing mutants and for predicting protein stability changes upon single point mutations, which showed an accuracy of 82% and a correlation of 0.70, respectively. In addition, we have systematically analyzed the characteristic features of amino acid residues in 3075 mesophilic and 1609 thermophilic proteins belonging to 9 and 15 families, respectively, and developed methods for discriminating them. The neural-network-based method could discriminate them with a 5-fold cross-validation accuracy of 89% on a dataset of 4684 proteins and an accuracy of 91% on a test set of 707 proteins.