Exhaustive pattern extraction methods face real obstacles in terms of speed and control over the output: a very large number of patterns is extracted, many of which are redundant. Sampling-based pattern extraction methods, which allow the size of the output to be controlled while ensuring fast response times, address these two problems. However, these methods do not provide high-quality patterns: they return patterns that are very infrequent in the database, and they do not scale. To produce more frequent and more diverse patterns in the output, we propose integrating compression methods into sampling in order to select the most representative patterns from the sampled transactions. We show that our approach improves on the state of the art in terms of the diversity of the produced patterns.
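A minimal sketch of the general idea, not the authors' method: draw patterns by two-step sampling (pick a transaction, then an itemset inside it), then keep a small representative subset by greedily preferring patterns that cover transactions not yet covered. The function names, the coverage-based selection criterion, and the toy database are illustrative assumptions.

```python
import random

def sample_patterns(db, n_samples, max_len=3, seed=0):
    """Two-step pattern sampling: pick a transaction, then a random itemset from it."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        t = rng.choice(db)                                    # step 1: sample a transaction
        k = rng.randint(1, min(max_len, len(t)))
        samples.append(frozenset(rng.sample(sorted(t), k)))   # step 2: sample an itemset
    return samples

def select_representatives(patterns, db, budget):
    """Greedy compression-style selection (illustrative criterion): keep patterns
    that cover transactions not yet covered by previously kept patterns."""
    uncovered = set(range(len(db)))
    kept = []
    # consider the more frequent sampled patterns first
    for p in sorted(set(patterns), key=lambda p: -sum(p <= t for t in db)):
        gain = {i for i in uncovered if p <= db[i]}
        if gain:
            kept.append(p)
            uncovered -= gain
        if len(kept) >= budget or not uncovered:
            break
    return kept

db = [frozenset(t) for t in [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd'}]]
print(select_representatives(sample_patterns(db, 50), db, budget=3))
```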
The paper reports an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods. This was done after systematically evaluating a total of 48 combinations of methods, involving eight state-of-the-art classification algorithms and six commonly used feature selection methods, on 129 data sets from the UCI repository (some data sets with known high classification accuracy were excluded). In this paper, a data set for classification is called hard if none of the 48 combinations can achieve an AUC over 0.8 and none of them can achieve an F-measure value over 0.8; it is called easy otherwise. A total of 15 of the 129 data sets were found to be hard in this sense. The paper also compares the performance of the different methods and produces rankings of classification methods, separately on the hard data sets and on the easy data sets. It is the first to rank methods separately for hard data sets and for easy data sets. It turns out that the classifier rankings resulting from our experiments are somewhat different from those in the literature and hence offer new insights on method selection. It should be noted that the Random Forest method remains the best in all groups of experiments.
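A small sketch of the hardness criterion using scikit-learn; the two classifiers and two feature selectors below are placeholders rather than the paper's full 8 x 6 grid, and a bundled binary data set stands in for a UCI data set.

```python
# A data set is flagged "hard" if no combination reaches AUC > 0.8 and none
# reaches F-measure > 0.8 (the criterion from the abstract).
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)   # stand-in for one benchmark data set

selectors = [VarianceThreshold(), SelectKBest(f_classif, k=10)]
classifiers = [RandomForestClassifier(random_state=0),
               LogisticRegression(max_iter=5000)]

best_auc, best_f1 = 0.0, 0.0
for sel in selectors:
    for clf in classifiers:
        pipe = make_pipeline(StandardScaler(), sel, clf)
        scores = cross_validate(pipe, X, y, cv=5, scoring=("roc_auc", "f1"))
        best_auc = max(best_auc, scores["test_roc_auc"].mean())
        best_f1 = max(best_f1, scores["test_f1"].mean())

is_hard = best_auc <= 0.8 and best_f1 <= 0.8
print(f"best AUC={best_auc:.3f}, best F1={best_f1:.3f}, hard={is_hard}")
```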
In many real-world applications, information such as web click data, stock ticker data, sensor network data, phone call records, and traffic monitoring data appears in the form of data streams. Online monitoring of data streams has emerged as an important research undertaking. Estimating the frequency of the items on these streams is an important aggregation and summary technique for both stream mining and data management systems, with a broad range of applications. This paper reviews the state-of-the-art progress on methods for identifying frequent items in data streams. It describes different kinds of models for the frequent-items mining task. For general models such as the cash register and turnstile models, we classify existing algorithms into sampling-based, counting-based, and hashing-based categories. The processing techniques and data synopsis structure of each algorithm are described and compared using evaluation measures. As an extension of the general data stream model, four more specific models are then introduced: the time-sensitive model, the distributed model, the hierarchical and multi-dimensional model, and the skewed data model. The characteristics and limitations of the algorithms for each model are presented, and open issues awaiting further study and improvement are discussed.
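As one concrete example of the counting-based family the survey classifies, here is a minimal Misra-Gries style frequent-items summary (not tied to any specific algorithm reviewed): it keeps at most k-1 counters and underestimates each item's count by at most N/k over a stream of length N.

```python
from collections import Counter

class MisraGries:
    """Counting-based heavy-hitter summary with at most k-1 counters."""

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k - 1:
            self.counters[item] = 1
        else:
            # decrement every counter; drop those that reach zero
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, item):
        # undercounts the true frequency by at most N/k
        return self.counters.get(item, 0)

stream = list("abracadabra" * 100)
mg = MisraGries(k=4)
for x in stream:
    mg.update(x)
print(mg.counters)                       # surviving heavy-hitter candidates
print(Counter(stream).most_common(3))    # exact counts, for comparison
```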
Neighborhood discovery is a precursor to knowledge discovery in complex and large datasets such as temporal data, which is a sequence of data tuples measured at successive time instances. Hence, instead of mining the entire dataset, we are interested in dividing it into several smaller intervals of interest, which we call temporal neighborhoods. In this paper we propose a class of algorithms that generate temporal neighborhoods through unequal-depth discretization. We describe four novel algorithms: (a) Similarity based Merging (SMerg), (b) Stationary distribution based Merging (StMerg), (c) Greedy Merge (GMerg), and (d) Optimal Merging (OptMerg). The SMerg and StMerg algorithms are based on the robust framework of Markov models and the Markov stationary distribution, respectively. GMerg is a greedy approach, and the OptMerg algorithm is geared towards discovering optimal binning strategies for the most effective partitioning of the data into temporal neighborhoods; neither of these two algorithms uses Markov models. We identify temporal neighborhoods with distinct demarcations based on unequal-depth discretization of the data. We discuss detailed experimental results on both synthetic and real-world data. Specifically, we show (i) the efficacy of our algorithms through precision and recall of labeled bins, (ii) ground-truth validation on real-world traffic monitoring datasets, and (iii) knowledge discovery in the temporal neighborhoods, such as global anomalies. Our results indicate that we are able to identify valuable knowledge based on our ground-truth validation of real-world traffic data.
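A simplified sketch in the spirit of GMerg only: start with one bin per observation and repeatedly merge the closest pair of adjacent bins. The similarity measure (difference of bin means) and the stopping threshold are illustrative assumptions, not the paper's definitions.

```python
def greedy_merge(series, threshold):
    """series: numeric observations ordered by time.
    Start with one bin per point, then repeatedly merge the pair of adjacent
    bins whose means are closest, until the best pair differs by more than
    the threshold. Returns the resulting temporal neighborhoods."""
    bins = [[x] for x in series]

    def mean(b):
        return sum(b) / len(b)

    while len(bins) > 1:
        gaps = [abs(mean(bins[i]) - mean(bins[i + 1])) for i in range(len(bins) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > threshold:
            break
        bins[i:i + 2] = [bins[i] + bins[i + 1]]   # merge the closest adjacent bins
    return bins

traffic = [10, 11, 9, 40, 42, 41, 12, 10, 11]     # synthetic hourly counts
print(greedy_merge(traffic, threshold=5.0))
# yields three neighborhoods: low traffic, a high (rush-hour) burst, low traffic again
```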
The rough set (RS) theory can be seen as a new mathematical approach to vagueness and is capable of discovering important facts hidden in data. However, the traditional rough set approach ignores that the desired reducts are not necessarily unique, since several reducts could share the same value of the strength index. In addition, current RS algorithms can generate a set of classification rules efficiently, but they cannot generate rules incrementally when new objects are given, and most existing incremental approaches are not capable of dealing with large databases. Therefore, this study proposes an incremental rule-extraction algorithm to address these issues. With this algorithm, when a new object is added to an information system, it is unnecessary to re-compute the rule sets from scratch, and the complete, non-redundant rules can be generated quickly. In the case study, the results show that the incremental issues raised by newly added data are resolved and considerable computation time is saved. (C) 2012 Elsevier B.V. All rights reserved.
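An illustrative sketch of the incremental idea only, not the paper's rough-set algorithm: keep the current rule set and, when a new object arrives, revisit only the rules whose conditions match it instead of re-running extraction from scratch. The rule representation and the refinement flag are assumptions made for this sketch.

```python
def matches(rule_cond, obj):
    """True if the object satisfies every attribute-value pair of the rule condition."""
    return all(obj.get(a) == v for a, v in rule_cond.items())

def add_object(rules, obj, decision):
    """rules: list of dicts {'cond': {attr: value}, 'decision': d, 'support': n}.
    Incrementally incorporate one new labeled object."""
    touched = [r for r in rules if matches(r['cond'], obj)]
    if not touched:
        # no existing rule covers the object: add a maximally specific new rule
        rules.append({'cond': dict(obj), 'decision': decision, 'support': 1})
        return rules
    for r in touched:
        if r['decision'] == decision:
            r['support'] += 1             # consistent: simply strengthen the rule
        else:
            r['needs_refinement'] = True  # inconsistent: only this rule is revisited
    return rules

rules = [{'cond': {'temp': 'high'}, 'decision': 'flu', 'support': 3}]
add_object(rules, {'temp': 'high', 'cough': 'yes'}, 'flu')
add_object(rules, {'temp': 'low', 'cough': 'no'}, 'healthy')
print(rules)
```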
ISBN (print): 9781479942749
Many existing state-of-the-art top-N recommendation methods model users and items in the same latent space, and the recommendation scores are computed as the dot product between their vectors. These methods assume that the user preference is consistent across all the items that he/she has rated. This assumption is not necessarily true, since many users have multiple personas/interests and their preferences can vary with each such interest. To address this, a recently proposed method modeled users with multiple interests. In this paper, we build on this approach and model users using a much richer representation. We propose a method which models the user preference as a combination of a global preference and interest-specific preferences. The proposed method uses a nonlinear model for predicting the recommendation score, which is used to perform the top-N recommendation task. The recommendation score is computed as a sum of the scores from the components representing the global preference and the interest-specific preference. A comprehensive set of experiments on multiple datasets shows that the proposed model outperforms other state-of-the-art methods on the top-N recommendation task.
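A NumPy sketch of the scoring idea only: the recommendation score combines a global-preference term and an interest-specific term and is passed through a nonlinearity. The dimensions, the softmax interest weighting, and the tanh transfer are illustrative choices, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_interests, n_items = 8, 3, 20

W_global = rng.normal(size=d)                   # one user's global preference vector
W_interest = rng.normal(size=(n_interests, d))  # one vector per user interest
items = rng.normal(size=(n_items, d))           # item latent factors

def score(item_vec):
    """Global term plus a softly-selected interest-specific term, nonlinearly combined."""
    global_part = W_global @ item_vec
    weights = np.exp(W_interest @ item_vec)     # favor the interest most compatible
    weights /= weights.sum()                    # with this item (softmax weighting)
    interest_part = weights @ (W_interest @ item_vec)
    return np.tanh(global_part + interest_part)

scores = np.array([score(v) for v in items])
top_n = np.argsort(-scores)[:5]                 # top-N recommendation for this user
print(top_n, scores[top_n])
```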
Activation of the autonomic nervous system is a primary characteristic of human hedonic responses to sensory stimuli. For smells, general tendencies of physiological reactions have been described using classical statistics. However, these physiological variations are generally not quantified precisely; each psychophysiological parameter has most often been studied separately, and individual variability has not been systematically considered. The current study presents an innovative approach based on data mining, whose goal is to extract knowledge from a dataset. This approach uses a subgroup discovery algorithm which extracts rules that apply to as many olfactory stimuli and individuals as possible. These rules are described by intervals over a set of physiological attributes. The results made it possible both to quantify how each physiological parameter relates to odor pleasantness and perceived intensity and to describe the contribution of each individual to these rules. This approach can be applied to other fields of the affective sciences characterized by complex and heterogeneous datasets.
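A sketch of how an interval-based subgroup rule can be scored, here with weighted relative accuracy (WRAcc); the physiological attributes, intervals, and the "pleasant" target are made up for illustration, and this quality measure is an assumption rather than necessarily the one used in the study.

```python
def wracc(rows, rule, target):
    """rows: list of dicts; rule: {attr: (low, high)}; target: predicate on a row.
    Weighted relative accuracy = coverage * (precision within subgroup - base rate)."""
    covered = [r for r in rows
               if all(lo <= r[a] <= hi for a, (lo, hi) in rule.items())]
    if not covered:
        return 0.0
    p_cov = len(covered) / len(rows)
    p_target = sum(target(r) for r in rows) / len(rows)
    p_target_in_cov = sum(target(r) for r in covered) / len(covered)
    return p_cov * (p_target_in_cov - p_target)

rows = [
    {'heart_rate_delta': 2.1, 'skin_conductance': 0.8, 'pleasant': 1},
    {'heart_rate_delta': 5.0, 'skin_conductance': 0.2, 'pleasant': 0},
    {'heart_rate_delta': 1.5, 'skin_conductance': 0.9, 'pleasant': 1},
    {'heart_rate_delta': 4.2, 'skin_conductance': 0.3, 'pleasant': 0},
]
rule = {'heart_rate_delta': (0.0, 3.0), 'skin_conductance': (0.5, 1.0)}
print(wracc(rows, rule, target=lambda r: r['pleasant'] == 1))   # positive: informative rule
```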
We formulate a new data mining problem called storytelling as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and, hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large data sets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between gene sets in a bioinformatics data set, and relating publications in the PubMed index of abstracts.
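A greatly simplified storytelling search: chain set-to-set steps whose Jaccard similarity exceeds a threshold, guided A*-style by the Jaccard distance to the goal set. The real system uses CARTwheels redescriptions as the move operator; the threshold, heuristic, and toy vocabulary below are assumptions for illustration.

```python
import heapq

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def storytelling(vocab, start, goal, theta=0.3):
    """vocab: {name: frozenset}; start/goal: names in vocab.
    Returns a chain of names linking start to goal, where each consecutive pair
    overlaps with Jaccard similarity >= theta (an approximate redescription step)."""
    frontier = [(1 - jaccard(vocab[start], vocab[goal]), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, name, path = heapq.heappop(frontier)
        if name == goal:
            return path
        if name in seen:
            continue
        seen.add(name)
        for nxt, s in vocab.items():
            if nxt not in seen and jaccard(vocab[name], s) >= theta:
                h = 1 - jaccard(s, vocab[goal])               # distance-to-goal heuristic
                heapq.heappush(frontier, (cost + 1 + h, cost + 1, nxt, path + [nxt]))
    return None

vocab = {
    'A': frozenset({1, 2, 3, 4}),
    'B': frozenset({3, 4, 5, 6}),
    'C': frozenset({5, 6, 7, 8}),
    'D': frozenset({7, 8, 9, 10}),
}
print(storytelling(vocab, 'A', 'D'))   # A and D are disjoint; expect the chain A -> B -> C -> D
```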
The interactor normalization task (INT) is to identify genes that play the interactor role in protein-protein interactions (PPIs), to map these genes to unique IDs, and to rank them according to their normalized confidence. INT has two subtasks: gene normalization (GN) and interactor ranking. The main difficulties of INT GN are identifying genes across species and using full papers instead of abstracts. To tackle these problems, we developed a multistage GN algorithm and a ranking method, which exploit information in different parts of a paper. Our system achieved a promising AUC of 0.43471. Using the multistage GN algorithm, we have been able to improve system performance (AUC) by 1.719 percent compared to a one-stage GN algorithm. Our experimental results also show that with full text, versus abstract only, INT AUC performance was 22.6 percent higher.
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. One important problem in mining databases of trees is to find frequently occurring subtrees. Because of the combinatorial explosion, the number of frequent subtrees usually grows exponentially with the size of frequent subtrees and, therefore, mining all frequent subtrees becomes infeasible for large tree sizes. In this paper, we present CMTreeMiner, a computationally efficient algorithm that discovers only closed and maximal frequent subtrees in a database of labeled rooted trees, where the rooted trees can be either ordered or unordered. The algorithm mines both closed and maximal frequent subtrees by traversing an enumeration tree that systematically enumerates all frequent subtrees. Several techniques are proposed to prune the branches of the enumeration tree that do not correspond to closed or maximal frequent subtrees. Heuristic techniques are used to arrange the order of computation so that relatively expensive computation is avoided as much as possible. We study the performance of our algorithm through extensive experiments, using both synthetic data and data sets from real applications. The experimental results show that our algorithm is very efficient in reducing the search space and quickly discovers all closed and maximal frequent subtrees.
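A post-hoc illustration of the closed and maximal definitions that CMTreeMiner targets (the algorithm itself prunes during enumeration rather than filtering afterwards). The containment test is assumed to be supplied by the caller; plain itemsets stand in for subtrees here so that set inclusion can play that role.

```python
def closed_and_maximal(supports, contains):
    """supports: {pattern: support} over all frequent patterns.
    contains(p, q) decides whether q is a sub-pattern of p (for trees this would
    be a subtree test; here, set inclusion stands in for it)."""
    closed, maximal = [], []
    for q, s in supports.items():
        supers = [p for p in supports if p != q and contains(p, q)]
        if all(supports[p] < s for p in supers):   # no super-pattern with the same support
            closed.append(q)
        if not supers:                             # no frequent super-pattern at all
            maximal.append(q)
    return closed, maximal

supports = {frozenset('a'): 5, frozenset('ab'): 5, frozenset('abc'): 3}
closed, maximal = closed_and_maximal(supports, contains=lambda p, q: q < p)
print(closed)    # {'a','b'} and {'a','b','c'}: {'a'} is not closed (same support as {'a','b'})
print(maximal)   # only {'a','b','c'} has no frequent super-pattern
```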