We formulate a new data mining problem called storytelling as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and, hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large data sets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between gene sets in a bioinformatics data set, and relating publications in the PubMed index of abstracts.
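To make the chaining idea concrete, here is a minimal sketch of a story search. It is not the CARTwheels-in-A* implementation the abstract describes: the vocabulary sets themselves serve as next-move candidates, Jaccard similarity above a threshold `theta` stands in for an approximate redescription, and `1 - jaccard(current, goal)` is used as the A* heuristic. All names (`story`, `jaccard`, `theta`) are illustrative.

```python
import heapq
from itertools import count

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    u = a | b
    return len(a & b) / len(u) if u else 1.0

def story(start, goal, vocabulary, theta=0.2):
    """A* search for a chain of sets from start to goal in which every
    consecutive pair has Jaccard similarity >= theta (a stand-in for an
    approximate redescription).  Heuristic: 1 - jaccard(current, goal)."""
    tie = count()                                  # heap tiebreaker
    frontier = [(1.0 - jaccard(start, goal), next(tie), 0, start, [start])]
    seen = set()
    while frontier:
        _, _, g, cur, path = heapq.heappop(frontier)
        if jaccard(cur, goal) >= theta:
            return path + [goal]
        if frozenset(cur) in seen:
            continue
        seen.add(frozenset(cur))
        for nxt in vocabulary:                     # "next move" candidates
            if frozenset(nxt) not in seen and jaccard(cur, nxt) >= theta:
                h = 1.0 - jaccard(nxt, goal)
                heapq.heappush(frontier,
                               (g + 1 + h, next(tie), g + 1, nxt, path + [nxt]))
    return None

# Two disjoint sets related through a chain of overlapping vocabulary sets.
V = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]
print(story({1, 2}, {6, 7}, V))
# -> [{1, 2}, {1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {6, 7}]
```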
Matrix decomposition methods represent a data matrix as a product of two factor matrices: one containing basis vectors that represent meaningful concepts in the data and another describing how the observed data can be expressed as combinations of the basis vectors. Decomposition methods have been studied extensively, but many methods return real-valued matrices. Interpreting real-valued factor matrices is hard if the original data is Boolean. In this paper, we describe a matrix decomposition formulation for Boolean data, the Discrete Basis Problem. The problem seeks a Boolean decomposition of a binary matrix, thus allowing the user to easily interpret the basis vectors. We also describe a variation of the problem, the Discrete Basis Partitioning Problem. We show that both problems are NP-hard. For the Discrete Basis Problem, we give a simple greedy algorithm for solving it; for the Discrete Basis Partitioning Problem, we show how it can be solved using existing methods. We present experimental results for the greedy algorithm and compare it against other well-known methods. Our algorithm gives intuitive basis vectors, but its reconstruction error is usually larger than with the real-valued methods. We discuss the reasons for this behavior.
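The following sketch shows what a greedy Boolean decomposition might look like. It is not the paper's algorithm (which derives candidate basis vectors from association confidences): here the columns of X serve as candidates, and each round picks the candidate whose Boolean cover gains the most newly covered 1s net of wrongly covered 0s. `greedy_bmf` and its details are assumptions for illustration.

```python
import numpy as np

def greedy_bmf(X, k):
    """Greedy sketch of Boolean matrix factorization: approximate the 0/1
    matrix X (m x n) by the Boolean product of B (m x k) and C (k x n)."""
    m, n = X.shape
    B = np.zeros((m, k), dtype=int)
    C = np.zeros((k, n), dtype=int)
    covered = np.zeros_like(X)
    candidates = [X[:, j].copy() for j in range(n)]  # simplistic candidate pool
    for t in range(k):
        best, best_gain = None, 0
        for b in candidates:
            # Per column: newly covered 1s minus 0s this basis would turn on.
            gain_per_col = ((X == 1) & (covered == 0) & (b[:, None] == 1)).sum(0) \
                         - ((X == 0) & (b[:, None] == 1)).sum(0)
            use = gain_per_col > 0
            gain = gain_per_col[use].sum()
            if gain > best_gain:
                best, best_gain, best_use = b, gain, use
        if best is None:
            break
        B[:, t] = best
        C[t, best_use] = 1
        covered |= np.outer(best, best_use.astype(int))
    return B, C

X = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
B, C = greedy_bmf(X, 2)
print((B @ C > 0).astype(int))   # Boolean product approximating X
```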
The discovery of information encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually encoded in patterns that occur frequently in the sequences, also called motifs. Motif discovery has received much attention in the literature, and several algorithms have been proposed that are specifically tailored to motifs exhibiting some kind of "regular structure." Motivated by biological observations, this paper focuses on mining loosely structured motifs, i.e., more general kinds of motif in which several "exceptions" may be tolerated in pattern repetitions. To this end, an algorithm exploiting data structures conceived to efficiently handle pattern variability is presented and analyzed. Furthermore, a randomized variant with linear time and space complexity is introduced, and a theoretical guarantee on its performance is proved. Both algorithms have been implemented and tested on real data sets. Despite their ability to mine very complex kinds of pattern, the performance results demonstrate the genome-wide applicability of the proposed techniques.
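As a baseline for the "exceptions" idea, the sketch below locates occurrences of a candidate motif allowing a bounded number of substitutions. This is a naive quadratic scan, nothing like the paper's efficient data structures or the randomized linear-time variant; the function name and interface are assumptions.

```python
def occurrences(seq, motif, max_mismatches):
    """Positions where `motif` occurs in `seq` with at most
    `max_mismatches` substitutions -- a simple stand-in for the
    exceptions tolerated in loosely structured motifs."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

print(occurrences("ACGTACGAACGT", "ACGT", 1))   # -> [0, 4, 8]
```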
In this paper, we present a new tree mining algorithm, DRYADEPARENT, based on the hooking principle first introduced in DRYADE. In our experiments, we demonstrate that the branching factor and depth of the frequent patterns to be found are key complexity factors for tree mining algorithms, even though they are often overlooked in previous work. We show that DRYADEPARENT outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on data sets where the frequent tree patterns have a high branching factor.
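For reference, the two complexity drivers named above can be computed as follows for a tree pattern given as a child-list dictionary; this is a plain utility sketch, not part of DRYADEPARENT, and the representation is an assumption.

```python
def branching_and_depth(tree):
    """Max branching factor and depth of a tree given as {node: [children]}."""
    def depth(node):
        return 1 + max((depth(c) for c in tree.get(node, [])), default=0)
    bf = max((len(kids) for kids in tree.values()), default=0)
    root = next(iter(tree))   # assumes the first key is the root
    return bf, depth(root)

t = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
print(branching_and_depth(t))   # -> (3, 3)
```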
Data mining techniques have been widely used in various applications. However, the misuse of these techniques may lead to the disclosure of sensitive information. Researchers have recently made efforts at hiding sensitive association rules. Nevertheless, undesired side effects, e.g., nonsensitive rules falsely hidden and spurious rules falsely generated, may be produced in the rule hiding process. In this paper, we present a novel approach that strategically modifies a few transactions in the transaction database to decrease the supports or confidences of sensitive rules without producing the side effects. Since the correlation among rules can make it impossible to achieve this goal, we propose heuristic methods for increasing the number of hidden sensitive rules and reducing the number of modified entries. The experimental results show the effectiveness of our approach, i.e., undesired side effects are avoided in the rule hiding process. The results also show that, in most cases, all the sensitive rules are hidden without spurious rules being falsely generated. Moreover, our approach scales well with database size, and the influence of the correlation among rules on rule hiding is examined.
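The core move, lowering a sensitive rule's support by editing a few transactions, can be sketched as below. The paper's methods choose which transactions and items to modify so as to avoid side effects; this sketch just takes the first supporting transactions, treats `min_support` as an absolute count, and all names are illustrative.

```python
def hide_rule(transactions, rule, min_support):
    """Push the support of a sensitive rule X -> Y below min_support by
    deleting one consequent item from enough supporting transactions.
    Naive victim selection; real methods pick edits to avoid side effects."""
    antecedent, consequent = rule
    items = set(antecedent) | set(consequent)
    supporting = [t for t in transactions if items <= t]
    need_removed = len(supporting) - min_support + 1
    victim = next(iter(consequent))          # item to delete from Y
    for t in supporting[:max(0, need_removed)]:
        t.discard(victim)
    return transactions

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]
hide_rule(db, (("a",), ("b",)), min_support=2)
print(db)   # support of {a, b} is now below 2
```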
When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the software will soon fall into disuse. One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in question includes a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. This paper describes a general-purpose method called conditional anomaly detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms for learning the model that is used in conditional anomaly detection. Experiments with more than 13 different data sets compare our algorithms with several other more standard methods for outlier or anomaly detection.
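A simplified stand-in for this idea, assuming scikit-learn is available: an EM-fitted Gaussian mixture over the environmental attributes assigns each point to a component, and a per-component Gaussian over the result attributes supplies the conditional likelihood. The paper's three EM algorithms learn the environment-to-result mapping jointly; this two-stage version and all names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def conditional_anomaly_scores(env, res, n_components=3):
    """Score each row by how unlikely its result attributes `res` are
    *given* its environmental attributes `env` (higher = more anomalous)."""
    gm_env = GaussianMixture(n_components).fit(env)
    comp = gm_env.predict(env)
    scores = np.zeros(len(res))
    for c in range(n_components):
        mask = comp == c
        if mask.sum() < 2:                 # too few points to fit a Gaussian
            continue
        g = GaussianMixture(1).fit(res[mask])
        scores[mask] = -g.score_samples(res[mask])
    return scores

rng = np.random.default_rng(0)
env = rng.normal(size=(200, 2))
res = env.sum(1, keepdims=True) + rng.normal(scale=0.1, size=(200, 1))
res[0] += 5.0                              # anomalous given its environment
print(conditional_anomaly_scores(env, res).argmax())   # likely 0
```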
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in Web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponential in the number of vertices as the graph grows. In this paper, we efficiently enumerate them through the use of closed patterns of the adjacency matrix of the graph. For an undirected graph G without self-loops, we prove that 1) the number of closed patterns in the adjacency matrix of G is even, 2) the number of closed patterns is precisely double the number of maximal biclique subgraphs of G, and 3) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved by using efficient algorithms for mining closed patterns, which have been studied extensively in the data mining field. However, this direct use of existing algorithms causes duplicated enumeration. To achieve high efficiency, we propose an algorithm with O(mn) time delay for nonduplicated enumeration, in particular for enumerating those maximal bicliques with a large size, where m and n are the numbers of edges and vertices of the graph, respectively. We evaluate the high efficiency of our algorithm by comparing it to state-of-the-art algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a real-life protein interaction network. We also prove that if self-loops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even, but the maximal bicliques are exactly the same as those of the graph after removing all self-loops.
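The correspondence can be seen through the closure operator on vertex sets: for a vertex set S, let N(S) be the common neighbors of S; then a pair (N(N(S)), N(S)) is a maximal biclique. The brute-force sketch below applies this closure to small seed sets; it is exponential and duplicates work, exactly what the paper's O(mn)-delay algorithm avoids, and the names are illustrative.

```python
from itertools import combinations

def common_neighbors(adj, S):
    """Intersection of the adjacency sets of the vertices in S."""
    return set.intersection(*(adj[v] for v in S))

def maximal_bicliques(adj, max_seed=3):
    """Naive enumeration of maximal bicliques of a self-loop-free graph
    via the closure S -> N(N(S)); for illustration only."""
    found = set()
    for r in range(1, max_seed + 1):
        for S in combinations(adj, r):
            T = common_neighbors(adj, S)
            if not T:
                continue
            S_closed = common_neighbors(adj, T)
            found.add(frozenset({frozenset(S_closed), frozenset(T)}))
    return found

# Path graph 1-2-3-4, given as adjacency sets.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
for pair in maximal_bicliques(adj):
    print(sorted(sorted(side) for side in pair))
# -> [[1, 3], [2]] and [[2, 4], [3]]
```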
ISBN (print): 0769529941
In this paper, rather than proposing yet another fastest mining algorithm, we present a new approach to mining association rules. We propose a new algorithm, GRA (Gradational Reduction Approach), which adopts three mechanisms to improve mining performance. First, GRA uses a hash-based technique, Hash MAP, similar to a hash table, to increase access efficiency. Second, GRA uses an infrequent-itemset filtering mechanism to avoid generating a great number of infrequent sub-itemsets of transaction records. Third, in order to reduce the size of the database, GRA uses a gradational reduction mechanism that uses the frequent itemsets found so far as filtering information to erase infrequent items from the database at every phase. GRA thereby avoids generating a large number of infrequent itemsets and improves memory utilization.
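One pass of the gradational reduction idea can be sketched directly: items whose individual support falls below the threshold cannot appear in any frequent itemset, so they are erased from every transaction before the next phase. A plain `Counter` stands in for the Hash MAP structure; the function name and interface are assumptions.

```python
from collections import Counter

def gradational_reduce(db, min_support):
    """Erase items with support < min_support (an absolute count) from
    every transaction, shrinking the database for the next mining phase."""
    counts = Counter(item for t in db for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    return [t & frequent for t in db], frequent

db = [{"a", "b", "x"}, {"a", "b"}, {"a", "c"}, {"b", "c", "y"}]
db, freq = gradational_reduce(db, min_support=2)
print(db)   # 'x' and 'y' erased; only frequent items remain
```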
We propose a novel nonnegative matrix factorization model that aims at finding localized, part-based representations of nonnegative multivariate data items. Unlike the classical nonnegative matrix factorization (NMF) technique, this new model, denoted "nonsmooth nonnegative matrix factorization" (nsNMF), corresponds to the optimization of an unambiguous cost function designed to explicitly represent sparseness, in the form of nonsmoothness, which is controlled by a single parameter. In general, this method produces a set of basis and encoding vectors that are not only capable of representing the original data but also extract highly localized patterns, which generally lend themselves to improved interpretability. The properties of this new method are illustrated with several data sets. Comparisons to previously published methods show that the new nsNMF method has advantages in maintaining faithfulness to the data while achieving a high degree of sparseness for both the estimated basis and encoding vectors, and in better interpretability of the factors.
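A rough sketch of the idea, assuming the published nsNMF formulation in which a smoothing matrix S = (1 - theta) I + (theta / q) 11^T is placed between the factors, X ≈ W S H, with theta in [0, 1] controlling sparseness. The multiplicative updates below are the standard Frobenius NMF updates applied with W S (respectively S H) held fixed, a simplification of the published scheme; all names are illustrative.

```python
import numpy as np

def nsnmf(X, q, theta=0.5, iters=200, seed=0):
    """Sketch of nonsmooth NMF: X ~= W @ S @ H with smoothing matrix S."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, q))
    H = rng.random((q, n))
    S = (1 - theta) * np.eye(q) + (theta / q) * np.ones((q, q))
    eps = 1e-9                                    # avoids division by zero
    for _ in range(iters):
        WS = W @ S
        H *= (WS.T @ X) / (WS.T @ WS @ H + eps)   # update H with W @ S fixed
        SH = S @ H
        W *= (X @ SH.T) / (W @ SH @ SH.T + eps)   # update W with S @ H fixed
    return W, S, H

X = np.random.default_rng(1).random((6, 8))
W, S, H = nsnmf(X, q=3, theta=0.3)
print(np.linalg.norm(X - W @ S @ H))              # reconstruction error
```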
Classification of large data sets is an important data mining problem that has wide applications. Jumping Emerging Patterns ( JEPs) are those itemsets whose supports increase abruptly from zero in one data set to nonzero in another data set. In this paper, we propose a fast, accurate, and less complex classifier based on a subset of JEPs, called Strong Jumping Emerging Patterns ( SJEPs). The support constraint of SJEP removes potentially less useful JEPs while retaining those with high discriminating power. Previous algorithms based on the manipulation of border [ 1] as well as consEPMiner [ 2] cannot directly mine SJEPs. Here, we present a new tree-based algorithm for their efficient discovery. Experimental results show that: 1) the training of our classifier is typically 10 times faster than earlier approaches, 2) our classifier uses much fewer patterns than the JEP-Classifier [ 3] to achieve a similar ( and, often, improved) accuracy, and 3) in many cases, it is superior to other state-of-the-art classification systems such as Naive Bayes, CBA, C4.5, and bagged and boosted versions of C4.5. We argue that SJEPs are high-quality patterns which possess the most differentiating power. As a consequence, they represent sufficient information for the construction of accurate classifiers. In addition, we generalize these patterns by introducing Noise-tolerant Emerging Patterns (NEPs) and Generalized Noise-tolerant Emerging Patterns ( GNEPs). Our tree-based algorithms can be adopted to easily discover these variations. We experimentally demonstrate that SJEPs, NEPs, and GNEPs are extremely useful for building effective classifiers that can deal well with noise.
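The pattern definition translates directly into a (deliberately naive) enumeration: an itemset is a JEP if its support is zero in one class and nonzero in the other, and raising the positive-support threshold is exactly the SJEP support constraint. The paper mines these with an efficient tree-based algorithm rather than this exponential scan; the function name and interface are assumptions.

```python
from itertools import combinations

def jumping_emerging_patterns(pos, neg, max_len=2, min_count=1):
    """Itemsets with support >= min_count in `pos` and zero support in
    `neg`; min_count > 1 mimics the SJEP support constraint."""
    items = sorted({i for t in pos for i in t})
    jeps = []
    for r in range(1, max_len + 1):
        for c in combinations(items, r):
            s = set(c)
            pos_sup = sum(s <= t for t in pos)
            neg_sup = sum(s <= t for t in neg)
            if pos_sup >= min_count and neg_sup == 0:
                jeps.append((c, pos_sup))
    return jeps

pos = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
neg = [{"b"}, {"b", "c"}]
print(jumping_emerging_patterns(pos, neg, min_count=2))
# -> [(('a',), 3), (('a', 'b'), 2), (('a', 'c'), 2)]
```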