ISBN (digital): 9783540734994
ISBN (print): 9783540734987
The recently introduced transductive confidence machines (TCMs) framework allows classifiers to be extended so that they satisfy the calibration property. This means that the error rate can be set by the user prior to classification. An analytical proof of the calibration property was given for TCMs applied in the on-line learning setting. However, the nature of this learning setting restricts the applicability of TCMs. In this paper we provide strong empirical evidence that the calibration property also holds in the off-line learning setting. Our results extend the range of applications in which TCMs can be applied. We may conclude that TCMs are appropriate in virtually any application domain.
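As a rough illustration of how a transductive confidence machine turns a user-chosen error rate into a prediction, the sketch below computes p-values for each candidate label and keeps every label whose p-value exceeds the significance level. The nonconformity score (distance to the nearest example with the same label), the one-dimensional integer data, and all function names are assumptions made for this example; the abstract does not say which underlying classifier or score the authors used.

```python
def nearest_same_label_distance(index, examples):
    """Assumed nonconformity score: distance to the nearest other example
    that carries the same label (larger means stranger)."""
    x, y = examples[index]
    distances = [
        abs(x - other_x)
        for j, (other_x, other_y) in enumerate(examples)
        if j != index and other_y == y
    ]
    return min(distances) if distances else float("inf")


def p_value(train, x_new, candidate_label):
    """Fraction of examples at least as nonconforming as the new one,
    after tentatively adding (x_new, candidate_label) to the training set."""
    extended = train + [(x_new, candidate_label)]
    scores = [nearest_same_label_distance(i, extended) for i in range(len(extended))]
    new_score = scores[-1]
    return sum(1 for s in scores if s >= new_score) / len(extended)


def prediction_set(train, x_new, labels, epsilon):
    """Labels that cannot be rejected at significance level epsilon."""
    return {label for label in labels if p_value(train, x_new, label) > epsilon}


# Off-line setting: the training set is fixed and not updated with test examples.
train = [(0, "a"), (2, "a"), (4, "a"), (100, "b"), (102, "b"), (104, "b")]
print(prediction_set(train, 3, ["a", "b"], epsilon=0.2))
```

Lowering `epsilon` makes the prediction set larger: the user trades a smaller tolerated error rate for less specific predictions.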
Mining maximal frequent itemsets in data streams is more difficult than mining them in static databases because data streams are huge, high-speed and continuous. In this paper, we propose a novel one-pass algorithm called FpMFI-DS, which mines all maximal frequent itemsets (MFIs) in Landmark windows or Sliding windows in data streams based on FP-Tree. A new FP-Tree structure is designed for storing all transactions in Landmark windows or Sliding windows in data streams. To improve the efficiency of the algorithm, a new pruning technique, extension support equivalency pruning (ESEquivPS), is incorporated into it. The experiments show that our algorithm is efficient and scalable. It is suitable for mining MFIs both in static databases and in data streams.
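To make the target of the mining task concrete, the following sketch enumerates maximal frequent itemsets by brute force: an itemset is frequent if at least `min_support` transactions contain it, and maximal if no frequent proper superset exists. This is only a definition-level illustration on a toy static database; it is not FpMFI-DS, uses no FP-Tree or window model, and the function names and data are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return every itemset contained in at least min_support transactions
    (exhaustive enumeration, exponential in the number of items)."""
    items = sorted({item for t in transactions for item in t})
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                frequent.append(frozenset(candidate))
    return frequent


def maximal_frequent_itemsets(transactions, min_support):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    frequent = frequent_itemsets(transactions, min_support)
    return [s for s in frequent if not any(s < other for other in frequent)]


transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
for itemset in maximal_frequent_itemsets(transactions, min_support=3):
    print(sorted(itemset))
```

In this toy database every pair is frequent while the full triple is not, so the pairs are exactly the maximal frequent itemsets.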
We develop a metric Psi, based upon the Rand index, for the comparison and evaluation of dimensionality reduction techniques. This metric is designed to test the preservation of neighborhood structure in derived lower dimensional configurations. We use a customer information data set to show how Psi can be used to compare dimensionality reduction methods, tune method parameters, and choose solutions when methods have a local optimum problem. We show that Psi is highly negatively correlated with an alienation coefficient K that is designed to test the recovery of relative distances. In general, a method with a good value of Psi also has a good value of K. However, the monotonic regression used by Nonmetric MDS produces solutions with good values of Psi, but poor values of K.
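The abstract does not define Psi itself, only that it builds on the Rand index. As background, the sketch below computes the plain Rand index between two groupings of the same objects: the fraction of object pairs on which the two groupings agree (both put the pair together, or both keep it apart). How Psi applies this idea to neighborhoods in the original and reduced spaces is not stated here, so nothing below should be read as the authors' metric; the function name is an assumption.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both groupings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        1
        for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agreements / len(pairs)


# Identical groupings up to renaming of the groups agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that cut across each other agree on few pairs.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```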
Association rule mining often results in an overwhelming number of rules. In practice, it is difficult for the end user to select the most relevant rules. To tackle this problem, various interestingness measures have been proposed. Nevertheless, choosing an appropriate measure remains a hard task, and the use of several measures may lead to conflicting information. In this paper, we give a unified view of objective interestingness measures. We define a new framework embedding a large set of measures, called SBMs, and we prove that the SBMs behave similarly. Furthermore, we identify the whole collection of rules that simultaneously optimize all the SBMs. We provide an algorithm to efficiently mine a reduced set of rules among those optimizing all the SBMs. Experiments on real datasets highlight the characteristics of such rules.
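For readers unfamiliar with objective interestingness measures, the sketch below computes three standard ones (support, confidence and lift) for a single rule over a toy transaction list. The abstract does not list which measures belong to the SBM framework, so these three are chosen only as familiar examples; the function name and data are assumptions, and the code assumes both the antecedent and the consequent occur in at least one transaction.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.
    Assumes antecedent and consequent each occur in at least one transaction."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_c = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_both / n                  # how often the whole rule holds
    confidence = count_both / count_a         # how often it holds when the antecedent holds
    lift = confidence / (count_c / n)         # confidence relative to the consequent's base rate
    return {"support": support, "confidence": confidence, "lift": lift}


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(rule_measures(transactions, {"bread"}, {"milk"}))
```

All three values are functions of the same three counts, which hints at why many objective measures can end up ranking rules in related ways.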
In recent years there has been a tremendous increase in the number of users maintaining online blogs on the Internet. Companies, in particular, have become aware of this medium of communication and have taken a keen interest in what is being said about them through such personal blogs. This has given rise to a new field of research directed towards mining useful information from the large amount of unformatted data present in online blogs and online forums. We discuss an implementation of such a blog mining application. The application is broadly divided into two parts, the indexing process and the search module. Blogs pertaining to different organizations are fetched from a particular blog domain on the Internet. After their textual content is analyzed, these blogs are assigned a sentiment rating. Specific data from such blogs, along with their sentiment ratings, are then indexed on the physical hard drive. The search module searches through these indexes at run time for the input organization name and produces a list of blogs conveying both positive and negative sentiments about the organization.
Fractal theory has been used for computer graphics, image compression and different fields of pattern recognition. In this paper, a fractal-based method for recognition of both on-line and off-line Farsi/Arabic handwritten digits is proposed. Our main goal is to verify whether fractal theory is able to capture discriminatory information from digits for the pattern recognition task. The digit classification problem (on-line and off-line) deals with patterns which do not have complex structure. So, a general-purpose fractal coder, introduced for image compression, is simplified to be utilized for this application. To do that, the contrast and luminosity information of each point in the input pattern are ignored during the coding process. Therefore, this approach can deal with on-line data and binary images of handwritten Farsi digits. In fact, our system represents the shape of the input pattern by searching for a set of geometrical relationships between parts of it. Some fractal-based features are directly extracted by the fractal coder. We show that the resulting features have invariant properties which can be used for object recognition.
Advances in wireless and mobile technology flood us with amounts of moving object data that preclude all means of manual data processing. The volume of data gathered from position sensors of mobile phones, PDAs, or vehicles defies human ability to analyze the stream of input data. On the other hand, vast amounts of gathered data hide interesting and valuable knowledge patterns describing the behavior of moving objects. Thus, new algorithms for mining moving object data are required to unearth this knowledge. An important function of the mobile objects management system is the prediction of the unknown location of an object. In this paper we introduce a data mining approach to the problem of predicting the location of a moving object. We mine the database of moving object locations to discover frequent trajectories and movement rules. Then, we match the trajectory of a moving object with the database of movement rules to build a probabilistic model of object location. Experimental evaluation of the proposal reveals prediction accuracy close to 80%. Our original contribution includes the elaboration of the location prediction model, the design of an efficient mining algorithm, the introduction of movement rule matching strategies, and a thorough experimental evaluation of the proposed model.
Data perturbation with random noise signals has been shown to be useful for data hiding in privacy-preserving data mining. Perturbation methods based on additive randomization allow accurate estimation of the Probability Density Function (PDF) via the Expectation-Maximization (EM) algorithm, but it has been shown that noise-filtering techniques can be used to reconstruct the original data in many cases, leading to security breaches. In this paper, we propose a generic PDF reconstruction algorithm that can be used on non-additive (and additive) randomization techniques for the purpose of privacy-preserving data mining. This two-step reconstruction algorithm is based on Parzen-Window reconstruction and Quadratic Programming over a convex set, the probability simplex. Our algorithm eliminates the usual need for the iterative EM algorithm and it is generic for most randomization models. The simplicity of our two-step reconstruction algorithm, without iteration, also makes it attractive for use when dealing with large datasets.
The support vector machine (SVM) has received much attention in feature selection recently because of its ability to incorporate kernels to discover nonlinear dependencies between features. However, it is known that the number of support vectors required in SVM typically grows linearly with the size of the training data set. Such a limitation of SVM becomes more critical when we need to select a small subset of relevant features from a very large number of candidates. To solve this issue, this paper proposes a novel algorithm, called the 'relevance feature vector machine' (RFVM), for nonlinear feature selection. The RFVM algorithm utilizes a highly sparse learning algorithm, the relevance vector machine (RVM), and incorporates kernels to extract important features with both linear and nonlinear relationships. As a result, our proposed approach can reduce many false alarms, e.g. the inclusion of irrelevant features, while still maintaining good selection performance. We compare the performance of RFVM with that of other state-of-the-art nonlinear feature selection algorithms in our experiments. The results confirm our conclusions.
We present a method, called equivalence learning, which applies a two-class classification approach to object pairs defined within a multi-class scenario. The underlying idea is that instead of classifying objects into their respective classes, we classify object pairs either as equivalent (belonging to the same class) or non-equivalent (belonging to different classes). The method is based on a vectorisation of the similarity between the objects and the application of a machine learning algorithm (SVM, ANN, LogReg, Random Forests) to learn the differences between equivalent and non-equivalent object pairs, and to define a unique kernel function that can be obtained via equivalence learning. Using a small dataset of archaeal, bacterial and eukaryotic 3-phosphoglycerate-kinase sequences, we found that the classification performance of equivalence learning slightly exceeds that of several simple machine learning algorithms at the price of a minimal increase in time and space requirements.
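The pair-based reformulation described above can be sketched as a data transformation: every pair of labelled objects becomes one training example whose features describe the similarity of the two objects and whose binary label says whether they share a class. The element-wise absolute difference used as the similarity vector, the numeric toy objects, and the function names are assumptions for illustration; the abstract does not specify the vectorisation the authors used, and the resulting pairs would then be passed to any two-class learner.

```python
from itertools import combinations


def abs_difference(u, v):
    """Assumed similarity vectorisation: element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(u, v)]


def to_pair_dataset(objects, labels, similarity_vector):
    """Turn a multi-class dataset into a two-class dataset over object pairs.
    Label 1 means equivalent (same class), 0 means non-equivalent."""
    pair_features, pair_labels = [], []
    for i, j in combinations(range(len(objects)), 2):
        pair_features.append(similarity_vector(objects[i], objects[j]))
        pair_labels.append(1 if labels[i] == labels[j] else 0)
    return pair_features, pair_labels


objects = [[0, 0], [1, 0], [5, 5]]
labels = ["x", "x", "y"]
features, targets = to_pair_dataset(objects, labels, abs_difference)
for f, t in zip(features, targets):
    print(f, t)
```

Because n objects yield n(n-1)/2 pairs, the pair dataset grows quadratically, which is one reason to expect some increase in time and space requirements.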