We formulate a new data mining problem called storytelling as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and, hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large data sets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between gene sets in a bioinformatics data set, and relating publications in the PubMed index of abstracts.
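To make the chaining idea concrete, here is a minimal sketch of a story search. It is not the CARTwheels-in-A* implementation the abstract describes: the vocabulary sets themselves serve as next-move candidates, Jaccard similarity above a threshold `theta` stands in for an approximate redescription, and `1 - jaccard(current, goal)` is used as the A* heuristic. All names (`story`, `jaccard`, `theta`) are illustrative.

```python
import heapq
from itertools import count

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    u = a | b
    return len(a & b) / len(u) if u else 1.0

def story(start, goal, vocabulary, theta=0.2):
    """A* search for a chain of sets from start to goal in which every
    consecutive pair has Jaccard similarity >= theta (a stand-in for an
    approximate redescription).  Heuristic: 1 - jaccard(current, goal)."""
    tie = count()                                  # heap tiebreaker
    frontier = [(1.0 - jaccard(start, goal), next(tie), 0, start, [start])]
    seen = set()
    while frontier:
        _, _, g, cur, path = heapq.heappop(frontier)
        if jaccard(cur, goal) >= theta:
            return path + [goal]
        if frozenset(cur) in seen:
            continue
        seen.add(frozenset(cur))
        for nxt in vocabulary:                     # "next move" candidates
            if frozenset(nxt) not in seen and jaccard(cur, nxt) >= theta:
                h = 1.0 - jaccard(nxt, goal)
                heapq.heappush(frontier,
                               (g + 1 + h, next(tie), g + 1, nxt, path + [nxt]))
    return None

# Two disjoint sets related through a chain of overlapping vocabulary sets.
V = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]
print(story({1, 2}, {6, 7}, V))
# -> [{1, 2}, {1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {6, 7}]
```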
Matrix decomposition methods represent a data matrix as a product of two factor matrices: one containing basis vectors that represent meaningful concepts in the data and another describing how the observed data can be expressed as combinations of the basis vectors. Decomposition methods have been studied extensively, but many methods return real-valued matrices. Interpreting real-valued factor matrices is hard if the original data is Boolean. In this paper, we describe a matrix decomposition formulation for Boolean data, the Discrete Basis Problem. The problem seeks a Boolean decomposition of a binary matrix, thus allowing the user to easily interpret the basis vectors. We also describe a variation of the problem, the Discrete Basis Partitioning Problem. We show that both problems are NP-hard. For the Discrete Basis Problem, we give a simple greedy algorithm for solving it; for the Discrete Basis Partitioning Problem, we show how it can be solved using existing methods. We present experimental results for the greedy algorithm and compare it against other well-known methods. Our algorithm gives intuitive basis vectors, but its reconstruction error is usually larger than with the real-valued methods. We discuss the reasons for this behavior.
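The following sketch shows what a greedy Boolean decomposition might look like. It is not the paper's algorithm (which derives candidate basis vectors from association confidences): here the columns of X serve as candidates, and each round picks the candidate whose Boolean cover gains the most newly covered 1s net of wrongly covered 0s. `greedy_bmf` and its details are assumptions for illustration.

```python
import numpy as np

def greedy_bmf(X, k):
    """Greedy sketch of Boolean matrix factorization: approximate the 0/1
    matrix X (m x n) by the Boolean product of B (m x k) and C (k x n)."""
    m, n = X.shape
    B = np.zeros((m, k), dtype=int)
    C = np.zeros((k, n), dtype=int)
    covered = np.zeros_like(X)
    candidates = [X[:, j].copy() for j in range(n)]  # simplistic candidate pool
    for t in range(k):
        best, best_gain = None, 0
        for b in candidates:
            # Per column: newly covered 1s minus 0s this basis would turn on.
            gain_per_col = ((X == 1) & (covered == 0) & (b[:, None] == 1)).sum(0) \
                         - ((X == 0) & (b[:, None] == 1)).sum(0)
            use = gain_per_col > 0
            gain = gain_per_col[use].sum()
            if gain > best_gain:
                best, best_gain, best_use = b, gain, use
        if best is None:
            break
        B[:, t] = best
        C[t, best_use] = 1
        covered |= np.outer(best, best_use.astype(int))
    return B, C

X = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
B, C = greedy_bmf(X, 2)
print((B @ C > 0).astype(int))   # Boolean product approximating X
```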
The discovery of information encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually encoded in patterns that occur frequently in the sequences, also called motifs. Motif discovery has received much attention in the literature, and several algorithms have been proposed that are specifically tailored to motifs exhibiting some kind of "regular structure." Motivated by biological observations, this paper focuses on mining loosely structured motifs, i.e., more general kinds of motif in which several "exceptions" may be tolerated in pattern repetitions. To this end, an algorithm exploiting data structures conceived to efficiently handle pattern variability is presented and analyzed. Furthermore, a randomized variant with linear time and space complexity is introduced, and a theoretical guarantee on its performance is proved. Both algorithms have been implemented and tested on real data sets. Despite their ability to mine very complex kinds of pattern, the performance results demonstrate the genome-wide applicability of the proposed techniques.
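As a baseline for the "exceptions" idea, the sketch below locates occurrences of a candidate motif allowing a bounded number of substitutions. This is a naive quadratic scan, nothing like the paper's efficient data structures or the randomized linear-time variant; the function name and interface are assumptions.

```python
def occurrences(seq, motif, max_mismatches):
    """Positions where `motif` occurs in `seq` with at most
    `max_mismatches` substitutions -- a simple stand-in for the
    exceptions tolerated in loosely structured motifs."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

print(occurrences("ACGTACGAACGT", "ACGT", 1))   # -> [0, 4, 8]
```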
In this paper, we present a new tree mining algorithm, DRYADEPARENT, based on the hooking principle first introduced in DRYADE. In our experiments, we demonstrate that the branching factor and depth of the frequent patterns to be found are key complexity factors for tree mining algorithms, even though they are often overlooked in previous work. We show that DRYADEPARENT outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on data sets where the frequent tree patterns have a high branching factor.
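For reference, the two complexity drivers named above can be computed as follows for a tree pattern given as a child-list dictionary; this is a plain utility sketch, not part of DRYADEPARENT, and the representation is an assumption.

```python
def branching_and_depth(tree):
    """Max branching factor and depth of a tree given as {node: [children]}."""
    def depth(node):
        return 1 + max((depth(c) for c in tree.get(node, [])), default=0)
    bf = max((len(kids) for kids in tree.values()), default=0)
    root = next(iter(tree))   # assumes the first key is the root
    return bf, depth(root)

t = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
print(branching_and_depth(t))   # -> (3, 3)
```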
Data mining techniques have been widely used in various applications. However, the misuse of these techniques may lead to the disclosure of sensitive information. Researchers have recently made efforts at hiding sensitive association rules. Nevertheless, undesired side effects, e.g., nonsensitive rules falsely hidden and spurious rules falsely generated, may be produced in the rule hiding process. In this paper, we present a novel approach that strategically modifies a few transactions in the transaction database to decrease the supports or confidences of sensitive rules without producing the side effects. Since the correlation among rules can make it impossible to achieve this goal, we propose heuristic methods for increasing the number of hidden sensitive rules and reducing the number of modified entries. The experimental results show the effectiveness of our approach, i.e., undesired side effects are avoided in the rule hiding process. The results also show that, in most cases, all the sensitive rules are hidden without spurious rules being falsely generated. Moreover, our approach scales well with database size, and the influence of the correlation among rules on rule hiding is examined.
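The core move, lowering a sensitive rule's support by editing a few transactions, can be sketched as below. The paper's methods choose which transactions and items to modify so as to avoid side effects; this sketch just takes the first supporting transactions, treats `min_support` as an absolute count, and all names are illustrative.

```python
def hide_rule(transactions, rule, min_support):
    """Push the support of a sensitive rule X -> Y below min_support by
    deleting one consequent item from enough supporting transactions.
    Naive victim selection; real methods pick edits to avoid side effects."""
    antecedent, consequent = rule
    items = set(antecedent) | set(consequent)
    supporting = [t for t in transactions if items <= t]
    need_removed = len(supporting) - min_support + 1
    victim = next(iter(consequent))          # item to delete from Y
    for t in supporting[:max(0, need_removed)]:
        t.discard(victim)
    return transactions

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]
hide_rule(db, (("a",), ("b",)), min_support=2)
print(db)   # support of {a, b} is now below 2
```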
When anomaly detection software is used as a data analysis tool, finding the hardest-to-detect anomalies is not the most critical task. Rather, it is often more important to make sure that those anomalies that are reported to the user are in fact interesting. If too many unremarkable data points are returned to the user labeled as candidate anomalies, the software will soon fall into disuse. One way to ensure that returned anomalies are useful is to make use of domain knowledge provided by the user. Often, the data in question includes a set of environmental attributes whose values a user would never consider to be directly indicative of an anomaly. However, such attributes cannot be ignored because they have a direct effect on the expected distribution of the result attributes whose values can indicate an anomalous observation. This paper describes a general-purpose method called conditional anomaly detection for taking such differences among attributes into account, and proposes three different expectation-maximization algorithms for learning the model that is used in conditional anomaly detection. Experiments with more than 13 different data sets compare our algorithms with several other more standard methods for outlier or anomaly detection.
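A simplified stand-in for this idea, assuming scikit-learn is available: an EM-fitted Gaussian mixture over the environmental attributes assigns each point to a component, and a per-component Gaussian over the result attributes supplies the conditional likelihood. The paper's three EM algorithms learn the environment-to-result mapping jointly; this two-stage version and all names are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def conditional_anomaly_scores(env, res, n_components=3):
    """Score each row by how unlikely its result attributes `res` are
    *given* its environmental attributes `env` (higher = more anomalous)."""
    gm_env = GaussianMixture(n_components).fit(env)
    comp = gm_env.predict(env)
    scores = np.zeros(len(res))
    for c in range(n_components):
        mask = comp == c
        if mask.sum() < 2:                 # too few points to fit a Gaussian
            continue
        g = GaussianMixture(1).fit(res[mask])
        scores[mask] = -g.score_samples(res[mask])
    return scores

rng = np.random.default_rng(0)
env = rng.normal(size=(200, 2))
res = env.sum(1, keepdims=True) + rng.normal(scale=0.1, size=(200, 1))
res[0] += 5.0                              # anomalous given its environment
print(conditional_anomaly_scores(env, res).argmax())   # likely 0
```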
Maximal biclique (also known as complete bipartite) subgraphs can model many applications in Web mining, business, and bioinformatics. Enumerating maximal biclique subgraphs from a graph is a computationally challenging problem, as the size of the output can become exponential in the number of vertices as the graph grows. In this paper, we efficiently enumerate them through the use of closed patterns of the adjacency matrix of the graph. For an undirected graph G without self-loops, we prove that 1) the number of closed patterns in the adjacency matrix of G is even, 2) the number of closed patterns is precisely double the number of maximal biclique subgraphs of G, and 3) for every maximal biclique subgraph, there always exists a unique pair of closed patterns that matches the two vertex sets of the subgraph. Therefore, the problem of enumerating maximal bicliques can be solved by using efficient algorithms for mining closed patterns, which have been studied extensively in the data mining field. However, this direct use of existing algorithms causes duplicated enumeration. To achieve high efficiency, we propose an algorithm with O(mn) time delay for nonduplicated enumeration, in particular for enumerating those maximal bicliques with a large size, where m and n are the numbers of edges and vertices of the graph, respectively. We evaluate the high efficiency of our algorithm by comparing it to state-of-the-art algorithms on three categories of graphs: randomly generated graphs, benchmarks, and a real-life protein interaction network. We also prove that if self-loops are allowed in a graph, then the number of closed patterns in the adjacency matrix is not necessarily even, but the maximal bicliques are exactly the same as those of the graph after removing all self-loops.
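The correspondence can be seen through the closure operator on vertex sets: for a vertex set S, let N(S) be the common neighbors of S; then a pair (N(N(S)), N(S)) is a maximal biclique. The brute-force sketch below applies this closure to small seed sets; it is exponential and duplicates work, exactly what the paper's O(mn)-delay algorithm avoids, and the names are illustrative.

```python
from itertools import combinations

def common_neighbors(adj, S):
    """Intersection of the adjacency sets of the vertices in S."""
    return set.intersection(*(adj[v] for v in S))

def maximal_bicliques(adj, max_seed=3):
    """Naive enumeration of maximal bicliques of a self-loop-free graph
    via the closure S -> N(N(S)); for illustration only."""
    found = set()
    for r in range(1, max_seed + 1):
        for S in combinations(adj, r):
            T = common_neighbors(adj, S)
            if not T:
                continue
            S_closed = common_neighbors(adj, T)
            found.add(frozenset({frozenset(S_closed), frozenset(T)}))
    return found

# Path graph 1-2-3-4, given as adjacency sets.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
for pair in maximal_bicliques(adj):
    print(sorted(sorted(side) for side in pair))
# -> [[1, 3], [2]] and [[2, 4], [3]]
```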
ISBN (print): 0769529941
In this paper, rather than proposing yet another fastest mining algorithm, we present a new approach to mining association rules. We propose a new algorithm, GRA (Gradational Reduction Approach), which adopts three mechanisms to improve mining performance. First, GRA uses a hash-based technique, Hash MAP, similar to a hash table, to increase access efficiency. Second, GRA uses an infrequent-itemset filtering mechanism to avoid generating a great number of infrequent sub-itemsets of transaction records. Third, in order to reduce the size of the database, GRA uses a gradational reduction mechanism that uses the frequent itemsets found so far as filtering information to erase infrequent items from the database at every phase. GRA thereby avoids generating a large number of infrequent itemsets and improves memory utilization.
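One pass of the gradational reduction idea can be sketched directly: items whose individual support falls below the threshold cannot appear in any frequent itemset, so they are erased from every transaction before the next phase. A plain `Counter` stands in for the Hash MAP structure; the function name and interface are assumptions.

```python
from collections import Counter

def gradational_reduce(db, min_support):
    """Erase items with support < min_support (an absolute count) from
    every transaction, shrinking the database for the next mining phase."""
    counts = Counter(item for t in db for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    return [t & frequent for t in db], frequent

db = [{"a", "b", "x"}, {"a", "b"}, {"a", "c"}, {"b", "c", "y"}]
db, freq = gradational_reduce(db, min_support=2)
print(db)   # 'x' and 'y' erased; only frequent items remain
```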
We propose a novel nonnegative matrix factorization model that aims at finding localized, part-based representations of nonnegative multivariate data items. Unlike the classical nonnegative matrix factorization (NMF) technique, this new model, denoted "nonsmooth nonnegative matrix factorization" (nsNMF), corresponds to the optimization of an unambiguous cost function designed to explicitly represent sparseness, in the form of nonsmoothness, which is controlled by a single parameter. In general, this method produces a set of basis and encoding vectors that are not only capable of representing the original data but also extract highly localized patterns, which generally lend themselves to improved interpretability. The properties of this new method are illustrated with several data sets. Comparisons to previously published methods show that the new nsNMF method has advantages in maintaining faithfulness to the data while achieving a high degree of sparseness for both the estimated basis and encoding vectors, and in better interpretability of the factors.
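A rough sketch of the idea, assuming the published nsNMF formulation in which a smoothing matrix S = (1 - theta) I + (theta / q) 11^T is placed between the factors, X ≈ W S H, with theta in [0, 1] controlling sparseness. The multiplicative updates below are the standard Frobenius NMF updates applied with W S (respectively S H) held fixed, a simplification of the published scheme; all names are illustrative.

```python
import numpy as np

def nsnmf(X, q, theta=0.5, iters=200, seed=0):
    """Sketch of nonsmooth NMF: X ~= W @ S @ H with smoothing matrix S."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, q))
    H = rng.random((q, n))
    S = (1 - theta) * np.eye(q) + (theta / q) * np.ones((q, q))
    eps = 1e-9                                    # avoids division by zero
    for _ in range(iters):
        WS = W @ S
        H *= (WS.T @ X) / (WS.T @ WS @ H + eps)   # update H with W @ S fixed
        SH = S @ H
        W *= (X @ SH.T) / (W @ SH @ SH.T + eps)   # update W with S @ H fixed
    return W, S, H

X = np.random.default_rng(1).random((6, 8))
W, S, H = nsnmf(X, q=3, theta=0.3)
print(np.linalg.norm(X - W @ S @ H))              # reconstruction error
```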
Classification of large data sets is an important data mining problem that has wide applications. Jumping Emerging Patterns ( JEPs) are those itemsets whose supports increase abruptly from zero in one data set to nonzero in another data set. In this paper, we propose a fast, accurate, and less complex classifier based on a subset of JEPs, called Strong Jumping Emerging Patterns ( SJEPs). The support constraint of SJEP removes potentially less useful JEPs while retaining those with high discriminating power. Previous algorithms based on the manipulation of border [ 1] as well as consEPMiner [ 2] cannot directly mine SJEPs. Here, we present a new tree-based algorithm for their efficient discovery. Experimental results show that: 1) the training of our classifier is typically 10 times faster than earlier approaches, 2) our classifier uses much fewer patterns than the JEP-Classifier [ 3] to achieve a similar ( and, often, improved) accuracy, and 3) in many cases, it is superior to other state-of-the-art classification systems such as Naive Bayes, CBA, C4.5, and bagged and boosted versions of C4.5. We argue that SJEPs are high-quality patterns which possess the most differentiating power. As a consequence, they represent sufficient information for the construction of accurate classifiers. In addition, we generalize these patterns by introducing Noise-tolerant Emerging Patterns (NEPs) and Generalized Noise-tolerant Emerging Patterns ( GNEPs). Our tree-based algorithms can be adopted to easily discover these variations. We experimentally demonstrate that SJEPs, NEPs, and GNEPs are extremely useful for building effective classifiers that can deal well with noise.
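The pattern definition translates directly into a (deliberately naive) enumeration: an itemset is a JEP if its support is zero in one class and nonzero in the other, and raising the positive-support threshold is exactly the SJEP support constraint. The paper mines these with an efficient tree-based algorithm rather than this exponential scan; the function name and interface are assumptions.

```python
from itertools import combinations

def jumping_emerging_patterns(pos, neg, max_len=2, min_count=1):
    """Itemsets with support >= min_count in `pos` and zero support in
    `neg`; min_count > 1 mimics the SJEP support constraint."""
    items = sorted({i for t in pos for i in t})
    jeps = []
    for r in range(1, max_len + 1):
        for c in combinations(items, r):
            s = set(c)
            pos_sup = sum(s <= t for t in pos)
            neg_sup = sum(s <= t for t in neg)
            if pos_sup >= min_count and neg_sup == 0:
                jeps.append((c, pos_sup))
    return jeps

pos = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
neg = [{"b"}, {"b", "c"}]
print(jumping_emerging_patterns(pos, neg, min_count=2))
# -> [(('a',), 3), (('a', 'b'), 2), (('a', 'c'), 2)]
```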