Exhaustive pattern extraction methods face real obstacles in terms of speed and control over the output: a very large number of patterns is extracted, many of which are redundant. Sampling-based pattern extraction methods, which allow the size of the output to be controlled while ensuring fast response times, address these two problems. However, these methods do not provide high-quality patterns: they return patterns that are very infrequent in the database, and they do not scale. To produce more frequent and more diverse patterns in the output, we propose integrating compression methods into sampling in order to select the most representative patterns from the sampled transactions. We show that our approach improves on the state of the art in terms of the diversity of the produced patterns.
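A minimal sketch of the general idea, not the authors' method: draw patterns by two-step sampling (pick a transaction, then an itemset inside it), then keep a small representative subset by greedily preferring patterns that cover transactions not yet covered. The function names, the coverage-based selection criterion, and the toy database are illustrative assumptions.

```python
import random

def sample_patterns(db, n_samples, max_len=3, seed=0):
    """Two-step pattern sampling: pick a transaction, then a random itemset from it."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        t = rng.choice(db)                                    # step 1: sample a transaction
        k = rng.randint(1, min(max_len, len(t)))
        samples.append(frozenset(rng.sample(sorted(t), k)))   # step 2: sample an itemset
    return samples

def select_representatives(patterns, db, budget):
    """Greedy compression-style selection (illustrative criterion): keep patterns
    that cover transactions not yet covered by previously kept patterns."""
    uncovered = set(range(len(db)))
    kept = []
    # consider the more frequent sampled patterns first
    for p in sorted(set(patterns), key=lambda p: -sum(p <= t for t in db)):
        gain = {i for i in uncovered if p <= db[i]}
        if gain:
            kept.append(p)
            uncovered -= gain
        if len(kept) >= budget or not uncovered:
            break
    return kept

db = [frozenset(t) for t in [{'a', 'b', 'c'}, {'a', 'b'}, {'b', 'c', 'd'}, {'a', 'c', 'd'}]]
print(select_representatives(sample_patterns(db, 50), db, budget=3))
```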
The paper reports an experimentally identified list of benchmark data sets that are hard for representative classification and feature selection methods. This was done after systematically evaluating a total of 48 combinations of methods, involving eight state-of-the-art classification algorithms and six commonly used feature selection methods, on 129 data sets from the UCI repository (some data sets with known high classification accuracy were excluded). In this paper, a data set for classification is called hard if none of the 48 combinations can achieve an AUC over 0.8 and none of them can achieve an F-measure value over 0.8; it is called easy otherwise. A total of 15 of the 129 data sets were found to be hard in this sense. The paper also compares the performance of the different methods and produces rankings of classification methods, separately on the hard data sets and on the easy data sets. It is the first to rank methods separately for hard data sets and for easy data sets. It turns out that the classifier rankings resulting from our experiments are somewhat different from those in the literature and hence offer new insights on method selection. It should be noted that the Random Forest method remains the best in all groups of experiments.
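A small sketch of the hardness criterion using scikit-learn; the two classifiers and two feature selectors below are placeholders rather than the paper's full 8 x 6 grid, and a bundled binary data set stands in for a UCI data set.

```python
# A data set is flagged "hard" if no combination reaches AUC > 0.8 and none
# reaches F-measure > 0.8 (the criterion from the abstract).
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)   # stand-in for one benchmark data set

selectors = [VarianceThreshold(), SelectKBest(f_classif, k=10)]
classifiers = [RandomForestClassifier(random_state=0),
               LogisticRegression(max_iter=5000)]

best_auc, best_f1 = 0.0, 0.0
for sel in selectors:
    for clf in classifiers:
        pipe = make_pipeline(StandardScaler(), sel, clf)
        scores = cross_validate(pipe, X, y, cv=5, scoring=("roc_auc", "f1"))
        best_auc = max(best_auc, scores["test_roc_auc"].mean())
        best_f1 = max(best_f1, scores["test_f1"].mean())

is_hard = best_auc <= 0.8 and best_f1 <= 0.8
print(f"best AUC={best_auc:.3f}, best F1={best_f1:.3f}, hard={is_hard}")
```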
In many real-world applications, information such as web click data, stock ticker data, sensor network data, phone call records, and traffic monitoring data appears in the form of data streams. Online monitoring of data streams has emerged as an important research undertaking. Estimating the frequency of the items on these streams is an important aggregation and summary technique for both stream mining and data management systems, with a broad range of applications. This paper reviews the state-of-the-art progress on methods for identifying frequent items in data streams. It describes different kinds of models for the frequent-items mining task. For general models such as the cash register and turnstile models, we classify existing algorithms into sampling-based, counting-based, and hashing-based categories. The processing techniques and data synopsis structure of each algorithm are described and compared using evaluation measures. As an extension of the general data stream model, four more specific models are then introduced: the time-sensitive model, the distributed model, the hierarchical and multi-dimensional model, and the skewed data model. The characteristics and limitations of the algorithms for each model are presented, and open issues awaiting further study and improvement are discussed.
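As one concrete example of the counting-based family the survey classifies, here is a minimal Misra-Gries style frequent-items summary (not tied to any specific algorithm reviewed): it keeps at most k-1 counters and underestimates each item's count by at most N/k over a stream of length N.

```python
from collections import Counter

class MisraGries:
    """Counting-based heavy-hitter summary with at most k-1 counters."""

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k - 1:
            self.counters[item] = 1
        else:
            # decrement every counter; drop those that reach zero
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, item):
        # undercounts the true frequency by at most N/k
        return self.counters.get(item, 0)

stream = list("abracadabra" * 100)
mg = MisraGries(k=4)
for x in stream:
    mg.update(x)
print(mg.counters)                       # surviving heavy-hitter candidates
print(Counter(stream).most_common(3))    # exact counts, for comparison
```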
Neighborhood discovery is a precursor to knowledge discovery in complex and large datasets such as temporal data, which is a sequence of data tuples measured at successive time instances. Hence, instead of mining the entire dataset, we are interested in dividing it into several smaller intervals of interest, which we call temporal neighborhoods. In this paper we propose a class of algorithms that generate temporal neighborhoods through unequal-depth discretization. We describe four novel algorithms: (a) Similarity based Merging (SMerg), (b) Stationary distribution based Merging (StMerg), (c) Greedy Merge (GMerg), and (d) Optimal Merging (OptMerg). The SMerg and StMerg algorithms are based on the robust framework of Markov models and the Markov stationary distribution, respectively. GMerg is a greedy approach, and the OptMerg algorithm is geared towards discovering optimal binning strategies for the most effective partitioning of the data into temporal neighborhoods; neither of these two algorithms uses Markov models. We identify temporal neighborhoods with distinct demarcations based on unequal-depth discretization of the data. We discuss detailed experimental results on both synthetic and real-world data. Specifically, we show (i) the efficacy of our algorithms through precision and recall of labeled bins, (ii) ground-truth validation on real-world traffic monitoring datasets, and (iii) knowledge discovery in the temporal neighborhoods, such as global anomalies. Our results indicate that we are able to identify valuable knowledge based on our ground-truth validation of real-world traffic data.
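A simplified sketch in the spirit of GMerg only: start with one bin per observation and repeatedly merge the closest pair of adjacent bins. The similarity measure (difference of bin means) and the stopping threshold are illustrative assumptions, not the paper's definitions.

```python
def greedy_merge(series, threshold):
    """series: numeric observations ordered by time.
    Start with one bin per point, then repeatedly merge the pair of adjacent
    bins whose means are closest, until the best pair differs by more than
    the threshold. Returns the resulting temporal neighborhoods."""
    bins = [[x] for x in series]

    def mean(b):
        return sum(b) / len(b)

    while len(bins) > 1:
        gaps = [abs(mean(bins[i]) - mean(bins[i + 1])) for i in range(len(bins) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > threshold:
            break
        bins[i:i + 2] = [bins[i] + bins[i + 1]]   # merge the closest adjacent bins
    return bins

traffic = [10, 11, 9, 40, 42, 41, 12, 10, 11]     # synthetic hourly counts
print(greedy_merge(traffic, threshold=5.0))
# yields three neighborhoods: low traffic, a high (rush-hour) burst, low traffic again
```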
The rough set (RS) theory can be seen as a new mathematical approach to vagueness and is capable of discovering important facts hidden in data. However, the traditional rough set approach ignores that the desired reducts are not necessarily unique, since several reducts could share the same value of the strength index. In addition, current RS algorithms can generate a set of classification rules efficiently, but they cannot generate rules incrementally when new objects are given, and most existing incremental approaches are not capable of dealing with large databases. Therefore, this study proposes an incremental rule-extraction algorithm to address these issues. With this algorithm, when a new object is added to an information system, it is unnecessary to re-compute the rule sets from scratch, and the complete, non-redundant rules can be generated quickly. In the case study, the results show that the incremental issues raised by newly added data are resolved and considerable computation time is saved. (C) 2012 Elsevier B.V. All rights reserved.
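An illustrative sketch of the incremental idea only, not the paper's rough-set algorithm: keep the current rule set and, when a new object arrives, revisit only the rules whose conditions match it instead of re-running extraction from scratch. The rule representation and the refinement flag are assumptions made for this sketch.

```python
def matches(rule_cond, obj):
    """True if the object satisfies every attribute-value pair of the rule condition."""
    return all(obj.get(a) == v for a, v in rule_cond.items())

def add_object(rules, obj, decision):
    """rules: list of dicts {'cond': {attr: value}, 'decision': d, 'support': n}.
    Incrementally incorporate one new labeled object."""
    touched = [r for r in rules if matches(r['cond'], obj)]
    if not touched:
        # no existing rule covers the object: add a maximally specific new rule
        rules.append({'cond': dict(obj), 'decision': decision, 'support': 1})
        return rules
    for r in touched:
        if r['decision'] == decision:
            r['support'] += 1             # consistent: simply strengthen the rule
        else:
            r['needs_refinement'] = True  # inconsistent: only this rule is revisited
    return rules

rules = [{'cond': {'temp': 'high'}, 'decision': 'flu', 'support': 3}]
add_object(rules, {'temp': 'high', 'cough': 'yes'}, 'flu')
add_object(rules, {'temp': 'low', 'cough': 'no'}, 'healthy')
print(rules)
```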
ISBN (print): 9781479942749
Many existing state-of-the-art top-N recommendation methods model users and items in the same latent space, and the recommendation scores are computed as the dot product between their vectors. These methods assume that the user preference is consistent across all the items that he/she has rated. This assumption is not necessarily true, since many users have multiple personas/interests and their preferences can vary with each such interest. To address this, a recently proposed method modeled users with multiple interests. In this paper, we build on this approach and model users using a much richer representation. We propose a method which models the user preference as a combination of a global preference and interest-specific preferences. The proposed method uses a nonlinear model for predicting the recommendation score, which is used to perform the top-N recommendation task. The recommendation score is computed as a sum of the scores from the components representing the global preference and the interest-specific preference. A comprehensive set of experiments on multiple datasets shows that the proposed model outperforms other state-of-the-art methods on the top-N recommendation task.
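A NumPy sketch of the scoring idea only: the recommendation score combines a global-preference term and an interest-specific term and is passed through a nonlinearity. The dimensions, the softmax interest weighting, and the tanh transfer are illustrative choices, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_interests, n_items = 8, 3, 20

W_global = rng.normal(size=d)                   # one user's global preference vector
W_interest = rng.normal(size=(n_interests, d))  # one vector per user interest
items = rng.normal(size=(n_items, d))           # item latent factors

def score(item_vec):
    """Global term plus a softly-selected interest-specific term, nonlinearly combined."""
    global_part = W_global @ item_vec
    weights = np.exp(W_interest @ item_vec)     # favor the interest most compatible
    weights /= weights.sum()                    # with this item (softmax weighting)
    interest_part = weights @ (W_interest @ item_vec)
    return np.tanh(global_part + interest_part)

scores = np.array([score(v) for v in items])
top_n = np.argsort(-scores)[:5]                 # top-N recommendation for this user
print(top_n, scores[top_n])
```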
Activation of the autonomic nervous system is a primary characteristic of human hedonic responses to sensory stimuli. For smells, general tendencies of physiological reactions have been described using classical statistics. However, these physiological variations are generally not quantified precisely; each psychophysiological parameter has most often been studied separately, and individual variability has not been systematically considered. The current study presents an innovative approach based on data mining, whose goal is to extract knowledge from a dataset. This approach uses a subgroup discovery algorithm which extracts rules that apply to as many olfactory stimuli and individuals as possible. These rules are described by intervals over a set of physiological attributes. The results made it possible both to quantify how each physiological parameter relates to odor pleasantness and perceived intensity and to describe the contribution of each individual to these rules. This approach can be applied to other fields of the affective sciences characterized by complex and heterogeneous datasets.
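A sketch of how an interval-based subgroup rule can be scored, here with weighted relative accuracy (WRAcc); the physiological attributes, intervals, and the "pleasant" target are made up for illustration, and this quality measure is an assumption rather than necessarily the one used in the study.

```python
def wracc(rows, rule, target):
    """rows: list of dicts; rule: {attr: (low, high)}; target: predicate on a row.
    Weighted relative accuracy = coverage * (precision within subgroup - base rate)."""
    covered = [r for r in rows
               if all(lo <= r[a] <= hi for a, (lo, hi) in rule.items())]
    if not covered:
        return 0.0
    p_cov = len(covered) / len(rows)
    p_target = sum(target(r) for r in rows) / len(rows)
    p_target_in_cov = sum(target(r) for r in covered) / len(covered)
    return p_cov * (p_target_in_cov - p_target)

rows = [
    {'heart_rate_delta': 2.1, 'skin_conductance': 0.8, 'pleasant': 1},
    {'heart_rate_delta': 5.0, 'skin_conductance': 0.2, 'pleasant': 0},
    {'heart_rate_delta': 1.5, 'skin_conductance': 0.9, 'pleasant': 1},
    {'heart_rate_delta': 4.2, 'skin_conductance': 0.3, 'pleasant': 0},
]
rule = {'heart_rate_delta': (0.0, 3.0), 'skin_conductance': (0.5, 1.0)}
print(wracc(rows, rule, target=lambda r: r['pleasant'] == 1))   # positive: informative rule
```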
We formulate a new data mining problem called storytelling as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and, hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large data sets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between gene sets in a bioinformatics data set, and relating publications in the PubMed index of abstracts.
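A greatly simplified storytelling search: chain set-to-set steps whose Jaccard similarity exceeds a threshold, guided A*-style by the Jaccard distance to the goal set. The real system uses CARTwheels redescriptions as the move operator; the threshold, heuristic, and toy vocabulary below are assumptions for illustration.

```python
import heapq

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def storytelling(vocab, start, goal, theta=0.3):
    """vocab: {name: frozenset}; start/goal: names in vocab.
    Returns a chain of names linking start to goal, where each consecutive pair
    overlaps with Jaccard similarity >= theta (an approximate redescription step)."""
    frontier = [(1 - jaccard(vocab[start], vocab[goal]), 0, start, [start])]
    seen = set()
    while frontier:
        _, cost, name, path = heapq.heappop(frontier)
        if name == goal:
            return path
        if name in seen:
            continue
        seen.add(name)
        for nxt, s in vocab.items():
            if nxt not in seen and jaccard(vocab[name], s) >= theta:
                h = 1 - jaccard(s, vocab[goal])               # distance-to-goal heuristic
                heapq.heappush(frontier, (cost + 1 + h, cost + 1, nxt, path + [nxt]))
    return None

vocab = {
    'A': frozenset({1, 2, 3, 4}),
    'B': frozenset({3, 4, 5, 6}),
    'C': frozenset({5, 6, 7, 8}),
    'D': frozenset({7, 8, 9, 10}),
}
print(storytelling(vocab, 'A', 'D'))   # A and D are disjoint; expect the chain A -> B -> C -> D
```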
The interactor normalization task (INT) is to identify genes that play the interactor role in protein-protein interactions (PPIs), to map these genes to unique IDs, and to rank them according to their normalized confidence. INT has two subtasks: gene normalization (GN) and interactor ranking. The main difficulties of INT GN are identifying genes across species and using full papers instead of abstracts. To tackle these problems, we developed a multistage GN algorithm and a ranking method, which exploit information in different parts of a paper. Our system achieved a promising AUC of 0.43471. Using the multistage GN algorithm, we have been able to improve system performance (AUC) by 1.719 percent compared to a one-stage GN algorithm. Our experimental results also show that with full text, versus abstract only, INT AUC performance was 22.6 percent higher.
Tree structures are used extensively in domains such as computational biology, pattern recognition, XML databases, computer networks, and so on. One important problem in mining databases of trees is to find frequently occurring subtrees. Because of the combinatorial explosion, the number of frequent subtrees usually grows exponentially with the size of frequent subtrees and, therefore, mining all frequent subtrees becomes infeasible for large tree sizes. In this paper, we present CMTreeMiner, a computationally efficient algorithm that discovers only closed and maximal frequent subtrees in a database of labeled rooted trees, where the rooted trees can be either ordered or unordered. The algorithm mines both closed and maximal frequent subtrees by traversing an enumeration tree that systematically enumerates all frequent subtrees. Several techniques are proposed to prune the branches of the enumeration tree that do not correspond to closed or maximal frequent subtrees. Heuristic techniques are used to arrange the order of computation so that relatively expensive computation is avoided as much as possible. We study the performance of our algorithm through extensive experiments, using both synthetic data and data sets from real applications. The experimental results show that our algorithm is very efficient in reducing the search space and quickly discovers all closed and maximal frequent subtrees.
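A post-hoc illustration of the closed and maximal definitions that CMTreeMiner targets (the algorithm itself prunes during enumeration rather than filtering afterwards). The containment test is assumed to be supplied by the caller; plain itemsets stand in for subtrees here so that set inclusion can play that role.

```python
def closed_and_maximal(supports, contains):
    """supports: {pattern: support} over all frequent patterns.
    contains(p, q) decides whether q is a sub-pattern of p (for trees this would
    be a subtree test; here, set inclusion stands in for it)."""
    closed, maximal = [], []
    for q, s in supports.items():
        supers = [p for p in supports if p != q and contains(p, q)]
        if all(supports[p] < s for p in supers):   # no super-pattern with the same support
            closed.append(q)
        if not supers:                             # no frequent super-pattern at all
            maximal.append(q)
    return closed, maximal

supports = {frozenset('a'): 5, frozenset('ab'): 5, frozenset('abc'): 3}
closed, maximal = closed_and_maximal(supports, contains=lambda p, q: q < p)
print(closed)    # {'a','b'} and {'a','b','c'}: {'a'} is not closed (same support as {'a','b'})
print(maximal)   # only {'a','b','c'} has no frequent super-pattern
```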