ISBN (Print): 3540464840
Determining the relevant features is a combinatorial task in various fields of machine learning, such as text mining, bioinformatics, and pattern recognition. Several methods have been developed to extract the relevant features, but no single method is clearly superior. Breiman proposed Random Forest, which classifies patterns using an ensemble of CART trees and compares favorably with other classifiers. Taking advantage of Random Forest and using the wrapper approach first introduced by Kohavi et al., we propose an algorithm named Dynamic Recursive Feature Elimination (DRFE) to find an optimal subset of features, reducing noise in the data and increasing classifier performance. In our method, we use Random Forest as the induction classifier and develop our own feature elimination function by adding extra terms to the feature scoring. We conducted experiments on two public datasets: colon cancer and leukemia cancer. The experimental results on these real-world data showed that the proposed method achieves a higher prediction rate than the baseline algorithm. The obtained results are comparable to, and sometimes better than, widely used classification methods in the feature selection literature.
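The recursive elimination loop underlying DRFE can be sketched as follows. This is a minimal illustration, not the paper's method: the relevance score here (absolute feature-label covariance) is a toy stand-in for the Random Forest-based scoring with extra terms, and all names and data are invented.

```python
# Toy sketch of recursive feature elimination (RFE): repeatedly retrain on
# the surviving features and drop the lowest-scoring one. The score below is
# a placeholder, NOT the DRFE scoring function.

def feature_scores(X, y):
    """Toy relevance score: |covariance(feature, label)| per feature."""
    n = len(y)
    y_mean = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        cov = sum((c - c_mean) * (t - y_mean) for c, t in zip(col, y)) / n
        scores.append(abs(cov))
    return scores

def recursive_feature_elimination(X, y, n_keep):
    """Drop the lowest-scoring feature until n_keep features remain."""
    active = list(range(len(X[0])))          # indices of surviving features
    while len(active) > n_keep:
        sub = [[row[j] for j in active] for row in X]
        scores = feature_scores(sub, y)
        worst = min(range(len(active)), key=lambda i: scores[i])
        del active[worst]
    return active

X = [[1, 0, 5], [2, 0, 3], [3, 1, 1], [4, 1, 2]]
y = [0, 0, 1, 1]
print(recursive_feature_elimination(X, y, 2))  # → [0, 2]
```

In the paper's setting, `feature_scores` would be replaced by Random Forest importances augmented with the authors' extra scoring terms.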
We describe the development of the ICSI-SRI speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2006 Meeting Rich Transcription (RT-06S) evaluation, highlighting improvements...
ISBN (Print): 3540459162
This paper presents a novel clustering model for mining patterns from imprecise electric load time series. The model consists of three components. First, it contains a process that handles the representation and preprocessing of imprecise load time series. Second, it adopts a similarity metric based on interval semantic separation (Interval SS) measurement. Third, it applies the similarity metric together with the k-means clustering method to construct clusters. The model gives a unified way to solve the imprecise time-series clustering problem, and it is applied in a real-world application: finding similar consumption patterns in the electricity industry. Experimental results demonstrate the applicability and correctness of the proposed model.
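The third component, plugging an interval similarity into k-means, can be sketched as below. The distance used here (midpoint plus half-width differences) is an illustrative stand-in, not the paper's Interval SS metric, and the load data are invented; only the k-means assignment step is shown.

```python
# Sketch: assigning interval-valued load series to cluster centroids under a
# simple interval distance. Each series is a list of (low, high) intervals.

def interval_dist(a, b):
    """Mean per-point difference of interval midpoints and half-widths
    (illustrative placeholder, NOT the paper's Interval SS measure)."""
    total = 0.0
    for (lo1, hi1), (lo2, hi2) in zip(a, b):
        mid = abs((lo1 + hi1) / 2 - (lo2 + hi2) / 2)
        wid = abs((hi1 - lo1) / 2 - (hi2 - lo2) / 2)
        total += mid + wid
    return total / len(a)

def assign(series, centroids):
    """One k-means assignment step under the interval distance."""
    return [min(range(len(centroids)),
                key=lambda k: interval_dist(s, centroids[k]))
            for s in series]

load = [[(10, 12), (20, 22)], [(11, 13), (19, 23)], [(50, 55), (60, 66)]]
centroids = [[(10, 12), (20, 22)], [(50, 55), (60, 66)]]
print(assign(load, centroids))  # → [0, 0, 1]
```

A full k-means run would alternate this assignment step with recomputing interval centroids until convergence.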
There are several approaches to solving the quantitative structure-activity relationship (QSAR) problem. These approaches are based either on statistical methods or on predictive data mining. Among the statistical methods, one should consider regression analysis, pattern recognition (such as cluster analysis, factor analysis, and principal components analysis), or partial least squares. Predictive data mining techniques use neural networks, genetic programming, or neuro-fuzzy knowledge. These approaches have low explanatory capability or none at all. This paper attempts to establish a new approach to solving QSAR problems using descriptive data mining. In this way, the relationship between the chemical properties and the activity of a substance can be comprehensibly modeled.
An electronic contract can encompass a large number of collateral contract documents in PDF format. These contract documents are of different contract document types and converted from different original formats. Data extraction, and thus data mining, for this kind of electronic contract is very difficult. In this paper, we present a novel method to automatically extract contract data from such electronic contracts. Our automatic electronic contract data extraction system comprises an administrator module, a PDF parser, a pattern recognition engine, and a contract data extraction engine. The administrator module provides templates for inputting document patterns and a list of contract data tags for each contract document type. It also constructs the pattern matrices and stores them in a database. The PDF parser converts the contract PDF document into a contract text document, inserting formatting bookmarks such as new page, paragraph, or line markers. The pattern recognition engine determines the list of contract document types in the electronic contract by comparing and matching the patterns of all known contract document types with the pattern of the contract text document. The contract data extraction engine retrieves the corresponding list of contract data tags and then extracts contract data accordingly for each contract document type on the list. Our automatic electronic contract data extraction system has been found to be very accurate, efficient, and useful in extracting contract data for data mining.
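The pattern-matching step can be illustrated with a toy classifier. Cosine similarity over keyword counts stands in for the paper's pattern matrices; the document types and keywords below are invented for illustration.

```python
# Toy sketch: identify a contract document's type by comparing its token
# pattern against stored keyword patterns of known types.

from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented keyword "patterns" for two hypothetical document types.
known_types = {
    "lease":    Counter(["lease", "tenant", "landlord", "premises"]),
    "purchase": Counter(["buyer", "seller", "goods", "price"]),
}

def classify(text):
    tokens = Counter(text.lower().split())
    return max(known_types, key=lambda t: cosine(tokens, known_types[t]))

print(classify("The tenant shall return the premises to the landlord"))  # → lease
```

A real system would match against full pattern matrices built by the administrator module rather than flat keyword bags.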
A number of events, such as hurricanes, earthquakes, and power outages, can cause large-scale failures in the Internet. These in turn cause anomalies in the interdomain routing process. The policy-based nature of the Border Gateway Protocol (BGP) further aggravates the effect of these anomalies, causing severe, long-lasting route fluctuations. In this work we propose an architecture for anomaly detection that can be implemented on individual routers. We use statistical pattern recognition techniques to extract meaningful features from BGP update message data. A time-series segmentation algorithm is then applied to the feature traces to detect the onset of an instability event. The performance of the proposed algorithm is evaluated using real Internet trace data. We show that instabilities triggered by events such as router misconfigurations, infrastructure failures, and worm attacks can be detected with a false alarm rate as low as 0.0083 alarms per hour. We also show that our learning-based mechanism is highly robust compared to methods such as exponentially weighted moving average (EWMA) based detection.
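The EWMA baseline the paper compares against can be sketched as follows. The smoothing factor, threshold, and update counts are illustrative choices, not values from the paper.

```python
# Sketch of EWMA-based anomaly detection on a BGP update-rate trace: flag a
# sample when it exceeds the smoothed estimate by a multiplicative threshold.

def ewma_alarms(counts, alpha=0.3, threshold=3.0):
    """Return indices where a sample exceeds `threshold` times the EWMA."""
    ewma = counts[0]
    alarms = []
    for i, x in enumerate(counts[1:], start=1):
        if ewma > 0 and x > threshold * ewma:
            alarms.append(i)
        ewma = alpha * x + (1 - alpha) * ewma   # update the running estimate
    return alarms

updates_per_min = [10, 12, 9, 11, 80, 75, 10, 9]   # invented trace
print(ewma_alarms(updates_per_min))  # → [4]
```

Note how the second spike (index 5) is missed once the EWMA has absorbed the first one, a known weakness of threshold-on-EWMA schemes that segmentation-based detectors aim to avoid.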
ISBN (Print): 0769525210
Finding a small set of representative instances for large datasets can bring various benefits to data mining practitioners, who can then (1) build a learner superior to one constructed from the whole massive dataset; and (2) avoid working on the whole original dataset all the time. We propose in this paper a scalable representative instance selection and ranking (SRISTAR, pronounced 3STAR) mechanism, which carries two unique features: (1) it provides a representative instance ranking list, so that users can always select instances from top to bottom based on the number of examples they prefer; and (2) it investigates the behaviors of the underlying examples for instance selection, and the selection procedure tries to optimize the expected future error. Given a dataset, we first cluster instances into small data cells, each of which consists of instances with similar behaviors. Then we progressively evaluate data cells and their combinations, and order them into a list such that the learners built from the top cells are more accurate.
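The cluster-then-rank idea can be sketched in miniature. The cell criterion here (shared label plus rounded feature value) is a toy stand-in for the paper's behavior-based data cells, and the ordering simply favors larger cells rather than optimizing expected future error.

```python
# Toy sketch: group instances into cells of similar behavior, then emit one
# representative per cell, with representatives of larger cells ranked first.

from collections import defaultdict

def rank_representatives(X, y):
    cells = defaultdict(list)
    for i, (x, label) in enumerate(zip(X, y)):
        key = (label, round(x))            # toy "behavior" signature
        cells[key].append(i)
    # Larger cells first: their representative covers more instances.
    ordered = sorted(cells.values(), key=len, reverse=True)
    return [members[0] for members in ordered]

X = [1.1, 0.9, 1.0, 5.2, 5.1, 9.0]   # invented 1-D data
y = [0, 0, 0, 1, 1, 1]
print(rank_representatives(X, y))  # → [0, 3, 5]
```

Users can then take a prefix of the ranked list of any length, mirroring the top-to-bottom selection the mechanism provides.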
The low error rate of the naive Bayes (NB) classifier has been described as surprising. It is known that class-conditional independence of the features is a sufficient but not necessary condition for the optimality of NB. This study examines the difference between the estimated error and the true error of NB, taking feature dependencies into account. Analytical results are derived for two binary features. Illustrative examples are also provided.
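The two-binary-feature setting can be made concrete with a small numerical example. The joint distribution below is invented for illustration; it is chosen so that within each class the features look independent marginally but are in fact dependent, making NB deviate from the Bayes-optimal rule.

```python
# Illustration: two binary features whose within-class dependence makes
# naive Bayes (NB) disagree with the Bayes-optimal classifier.

from itertools import product

# Invented P(x1, x2 | class); features are dependent within each class.
p_joint = {
    0: {(0, 0): .4, (0, 1): .1, (1, 0): .1, (1, 1): .4},
    1: {(0, 0): .1, (0, 1): .4, (1, 0): .4, (1, 1): .1},
}
prior = {0: .5, 1: .5}

def marginals(c):
    """P(x1=1|c) and P(x2=1|c) — all equal 0.5 in this example."""
    p1 = sum(p for (a, _), p in p_joint[c].items() if a == 1)
    p2 = sum(p for (_, b), p in p_joint[c].items() if b == 1)
    return p1, p2

def nb_predict(x):
    best, score = None, -1.0
    for c in (0, 1):
        p1, p2 = marginals(c)
        s = prior[c] * (p1 if x[0] else 1 - p1) * (p2 if x[1] else 1 - p2)
        if s > score:
            best, score = c, s
    return best

def bayes_predict(x):
    return max((0, 1), key=lambda c: prior[c] * p_joint[c][x])

disagree = [x for x in product((0, 1), repeat=2)
            if nb_predict(x) != bayes_predict(x)]
print(disagree)  # → [(0, 1), (1, 0)]
```

Here NB sees identical marginals for both classes and cannot separate them, while the optimal rule exploits the dependence, so the two rules differ on half the input space.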
Data mining refers to extracting interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases. Noisy and inconsistent data are commonplace in large databases and data warehouses, and mining such data with classical rough set theory is difficult. In this paper, the concept of the information granule is introduced. Knowledge with a given confidence is then described using information granules, and the roughness and simplicity of knowledge are discussed using an extension of rough set theory. Finally, an algorithm for attribute reduction based on information granules is presented. Experimental results show that the presented algorithm handles very large data well and effectively extracts simple knowledge from noisy and inconsistent data under a minimum confidence threshold.
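The core of rough-set attribute reduction can be sketched as follows. This shows only plain reduction on a noise-free decision table; the paper's information-granule extension with a confidence threshold is not modeled, and the table is invented.

```python
# Sketch of rough-set attribute reduction: an attribute can be dropped when
# the remaining attributes still separate every pair of rows that have
# different decisions (i.e., the decision stays consistent).

def discerns_all(rows, decisions, attrs):
    """True if rows equal on `attrs` always share the same decision."""
    seen = {}
    for row, d in zip(rows, decisions):
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, d) != d:
            return False
    return True

def reduce_attrs(rows, decisions):
    attrs = list(range(len(rows[0])))
    for a in list(attrs):                       # try dropping each attribute
        trial = [b for b in attrs if b != a]
        if trial and discerns_all(rows, decisions, trial):
            attrs = trial
    return attrs

rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]   # invented table
decisions = [0, 0, 1, 1]
print(reduce_attrs(rows, decisions))  # → [2]
```

Attribute 2 alone determines the decision here, so the other two are dropped; the granule-based variant would instead keep attributes whose rules meet the minimum confidence rather than demanding exact consistency.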
We have developed an e-learning platform that records all activity done by students. We aim to create an analysis framework that can be used to determine whether or not the platform can classify students according to their accumulated knowledge. In this process, attribute selection is one of the major initial steps that must be accomplished. In this paper we show what implications attribute selection may have for classification performance and which techniques lead to the most conclusive results.