ISBN (digital): 9783540734994
ISBN (print): 9783540734987
We develop a metric Psi, based upon the Rand index, for the comparison and evaluation of dimensionality reduction techniques. This metric is designed to test the preservation of neighborhood structure in derived lower dimensional configurations. We use a customer information data set to show how Psi can be used to compare dimensionality reduction methods, tune method parameters, and choose solutions when methods have a local optimum problem. We show that Psi is highly negatively correlated with an alienation coefficient K that is designed to test the recovery of relative distances. In general, a method with a good value of Psi also has a good value of K. However, the monotonic regression used by Nonmetric MDS produces solutions with good values of Psi, but poor values of K.
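The abstract does not define Psi itself, so the following is only a minimal sketch of the plain Rand index that Psi is said to build on: the fraction of item pairs on which two labelings agree (both place the pair together, or both place it apart). The function name `rand_index` and the list-of-labels input format are assumptions made for illustration, not the authors' formulation.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]  # pair grouped together in A?
        same_b = labels_b[i] == labels_b[j]  # pair grouped together in B?
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)
```

A neighborhood-preservation measure in this spirit would presumably compare groupings derived from the original space with groupings derived from the reduced space, but how Psi does this is not stated in the abstract.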
Mining maximal frequent itemsets (MFIs) in data streams is more difficult than mining them in static databases because data streams are huge, high-speed, and continuous. In this paper, we propose a novel one-pass algorithm called FpMFI-DS, which mines all maximal frequent itemsets in Landmark windows or Sliding windows in data streams based on FP-Tree. A new structure of FP-Tree is designed for storing all transactions in Landmark windows or Sliding windows in data streams. To improve the efficiency of the algorithm, a new pruning technique, extension support equivalency pruning (ESEquivPS), is incorporated into it. The experiments show that our algorithm is efficient and scalable. It is suitable for mining MFIs both in static databases and in data streams.
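To make the target of the mining task concrete, here is a minimal brute-force sketch of what a maximal frequent itemset is: an itemset whose support meets the threshold and that has no frequent proper superset. This is not FpMFI-DS; it uses no FP-Tree, no windows, and no ESEquivPS pruning, and it enumerates every candidate, so it is only usable on tiny inputs. The function name and the absolute-count `min_support` parameter are assumptions for illustration.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force MFIs: frequent itemsets with no frequent proper superset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions)) if transactions else []
    frequent = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # Support = number of transactions containing the candidate.
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                frequent.append(candidate)
    # Keep only itemsets that are not strictly contained in another frequent one.
    return [s for s in frequent if not any(s < other for other in frequent)]
```

With the transactions `{a, b, c}`, `{a, b}`, `{a, c}`, `{b, d}` and a minimum support of 2, the frequent itemsets are `{a}`, `{b}`, `{c}`, `{a, b}`, and `{a, c}`, of which only `{a, b}` and `{a, c}` are maximal.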
Association rule mining often results in an overwhelming number of rules. In practice, it is difficult for the final user to select the most relevant rules. To tackle this problem, various interestingness measures have been proposed. Nevertheless, the choice of an appropriate measure remains a hard task, and the use of several measures may lead to conflicting information. In this paper, we give a unified view of objective interestingness measures. We define a new framework embedding a large set of measures called SBMs, and we prove that the SBMs have a similar behavior. Furthermore, we identify the whole collection of the rules simultaneously optimizing all the SBMs. We provide an algorithm to efficiently mine a reduced set of rules among the rules optimizing all the SBMs. Experiments on real datasets highlight the characteristics of such rules.
In recent years there has been a tremendous increase in the number of users maintaining online blogs on the Internet. Companies, in particular, have become aware of this medium of communication and have taken a keen interest in what is being said about them through such personal blogs. This has given rise to a new field of research directed towards mining useful information from the large amount of unformatted data present in online blogs and online forums. We discuss an implementation of such a blog mining application. The application is broadly divided into two parts, the indexing process and the search module. Blogs pertaining to different organizations are fetched from a particular blog domain on the Internet. After the textual content of these blogs is analyzed, they are assigned a sentiment rating. Specific data from such blogs, along with their sentiment ratings, are then indexed on the physical hard drive. The search module searches through these indexes at run time for the input organization name and produces a list of blogs conveying both positive and negative sentiments about the organization.
Fractal theory has been used for computer graphics, image compression and different fields of pattern recognition. In this paper, a fractal based method for recognition of both on-line and off-line Farsi/Arabic handwritten digits is proposed. Our main goal is to verify whether fractal theory is able to capture discriminatory information from digits for the pattern recognition task. The digit classification problem (on-line and off-line) deals with patterns which do not have complex structure. So, a general purpose fractal coder, introduced for image compression, is simplified to be utilized for this application. In order to do that, during the coding process, contrast and luminosity information of each point in the input pattern are ignored. Therefore, this approach can deal with on-line data and binary images of handwritten Farsi digits. In fact, our system represents the shape of the input pattern by searching for a set of geometrical relationships between its parts. Some fractal-based features are directly extracted by the fractal coder. We show that the resulting features have invariant properties which can be used for object recognition.
Advances in wireless and mobile technology flood us with amounts of moving object data that preclude all means of manual data processing. The volume of data gathered from position sensors of mobile phones, PDAs, or vehicles defies human ability to analyze the stream of input data. On the other hand, vast amounts of gathered data hide interesting and valuable knowledge patterns describing the behavior of moving objects. Thus, new algorithms for mining moving object data are required to unearth this knowledge. An important function of the mobile objects management system is the prediction of the unknown location of an object. In this paper we introduce a data mining approach to the problem of predicting the location of a moving object. We mine the database of moving object locations to discover frequent trajectories and movement rules. Then, we match the trajectory of a moving object with the database of movement rules to build a probabilistic model of object location. Experimental evaluation of the proposal reveals prediction accuracy close to 80%. Our original contribution includes the elaboration on the location prediction model, the design of an efficient mining algorithm, the introduction of movement rule matching strategies, and a thorough experimental evaluation of the proposed model.
Many applications require the discovery of items which occur frequently within multiple distributed data streams. Past solutions for this problem either require a high degree of error tolerance or can only provide results periodically. In this paper we introduce a new algorithm designed for continuously tracking frequent items over distributed data streams, providing either exact or approximate answers. We tested the efficiency of our method using two real-world data sets. The results indicated a significant reduction in communication cost when compared to naive approaches and an existing efficient algorithm called Top-K Monitoring. Since our method does not rely upon approximations to reduce communication overhead and is explicitly designed for tracking frequent items, it also shows increased quality in its tracking results.
Data perturbation with random noise signals has been shown to be useful for data hiding in privacy-preserving data mining. Perturbation methods based on additive randomization allow accurate estimation of the Probability Density Function (PDF) via the Expectation-Maximization (EM) algorithm, but it has been shown that noise-filtering techniques can be used to reconstruct the original data in many cases, leading to security breaches. In this paper, we propose a generic PDF reconstruction algorithm that can be used on non-additive (and additive) randomization techniques for the purpose of privacy-preserving data mining. This two-step reconstruction algorithm is based on Parzen-Window reconstruction and Quadratic Programming over a convex set, the probability simplex. Our algorithm eliminates the usual need for the iterative EM algorithm, and it is generic for most randomization models. The simplicity of our two-step reconstruction algorithm, without iteration, also makes it attractive for use when dealing with large datasets.
This paper presents a data preprocessing procedure to select support vector (SV) candidates. We select decision boundary region vectors (BRVs) as SV candidates. Without the need to use the decision boundary, BRVs can be selected based on a vector's nearest neighbor of opposite class (NNO). To speed up the process, two spatial approximation sample hierarchical (SASH) trees are used for estimating the BRVs. Empirical results show that our data selection procedure can reduce a full dataset to the number of SVs or only slightly higher. Training with the selected subset gives performance comparable to that of the full dataset. For large datasets, the overall time spent in selecting and training on the smaller dataset is significantly lower than the time used in training on the full dataset.
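The NNO idea can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so the version below rests on an assumed reading: for every point, find its nearest neighbor among points of the other class, and collect those neighbors as boundary-region candidates. It also replaces the SASH-tree approximation with an exhaustive O(n²) search, and the function name `boundary_candidates` is hypothetical.

```python
import math


def boundary_candidates(points, labels):
    """Return sorted indices of points that are some point's nearest
    neighbor of opposite class (NNO), found by exhaustive search."""
    selected = set()
    for p, label in zip(points, labels):
        best_j, best_d = None, math.inf
        for j, (q, other) in enumerate(zip(points, labels)):
            if other == label:
                continue  # only points of a different class qualify
            d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            selected.add(best_j)
    return sorted(selected)
```

On six one-dimensional points at 0, 1, 2 (class 0) and 3, 4, 5 (class 1), only the points at 2 and 3 are selected, which are the two points facing each other across the class boundary.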
We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach where the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, which is a latent variable providing a link between input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML to XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.