ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
This paper addresses the problem of relation information extraction and proposes a method for discovering relations among entities when those relations are buried in different nested structures of XML documents. The method first identifies and collects XML fragments that contain all the entity types given by users, then computes the similarity between fragments based on the semantics of their tags and on their structures, and clusters the fragments by similarity so that fragments containing the same relation are grouped together. Finally, it extracts relation instances and the patterns of their occurrence from each cluster. Experimental results show that the method correctly identifies and extracts relation information among the given entity types from all kinds of XML documents with meaningful tags.
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
Many processes experience abrupt changes in their dynamics. This causes problems for prediction algorithms that assume the dynamics of the sequence to be predicted are constant, or at least change only slowly over time. In this paper the problem of predicting sequences with sudden changes in dynamics is considered. For a model of multivariate Gaussian data, we derive the expected generalization error of the standard linear Fisher classifier in the situation where, after an unexpected task change, the classification algorithm learns on a mixture of old and new data. We show both analytically and experimentally that the optimal length of the learning sequence depends on the complexity of the task, the input dimensionality, and the power and periodicity of the changes. The proposed solution is to consider a collection of agents, in this case non-linear single-layer perceptrons, trained by a memetic-like learning algorithm. The most successful agents vote on the predictions. A grouped structure of the agent population helps to obtain favorable diversity in the population. The efficiency of the socially organized, evolving multi-agent system is demonstrated on an artificial problem.
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
A Classification Association Rule (CAR), a common type of mined knowledge in data mining, describes an implicative co-occurring relationship between a set of binary-valued data attributes (items) and a pre-defined class, expressed in the form of an "antecedent ⇒ consequent-class" rule. Classification Association Rule Mining (CARM) is a recent Classification Rule Mining (CRM) approach that builds an Association Rule Mining (ARM) based classifier using CARs. Regardless of which particular methodology is used to build it, a classifier is usually presented as an ordered CAR list, based on an applied rule ordering strategy. Five existing rule ordering mechanisms can be identified: (1) Confidence-Support-size-of-Antecedent (CSA), (2) size-of-Antecedent-Confidence-Support (ACS), (3) Weighted Relative Accuracy (WRA), (4) Laplace Accuracy, and (5) χ² Testing. In this paper, we divide these mechanisms into two groups: (i) those that follow the pure "support-confidence" framework, and (ii) those that assign additive scores. We then propose a hybrid rule ordering approach that combines one approach taken from (i) with another taken from (ii). The experimental results show that the proposed rule ordering approach performs well with respect to classification accuracy.
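As an illustration of what a rule ordering strategy looks like in practice, the following is a minimal Python sketch of two of the mechanisms named above (CSA and ACS) plus a Laplace accuracy score. The `Rule` class, the function names, and the tie-breaking directions (smaller antecedent preferred in CSA, larger antecedent first in ACS) are assumptions made for this example; the abstract does not specify them, and the hybrid approach proposed in the paper is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one classification association rule."""
    antecedent: frozenset  # set of items on the left-hand side
    consequent: str        # predicted class label
    confidence: float
    support: float


def order_csa(rules):
    """Confidence-Support-size-of-Antecedent ordering (assumed directions):
    higher confidence first, then higher support, then smaller antecedent."""
    return sorted(
        rules,
        key=lambda r: (-r.confidence, -r.support, len(r.antecedent)),
    )


def order_acs(rules):
    """size-of-Antecedent-Confidence-Support ordering (assumed directions):
    larger (more specific) antecedent first, then confidence, then support."""
    return sorted(
        rules,
        key=lambda r: (-len(r.antecedent), -r.confidence, -r.support),
    )


def laplace_accuracy(covered_correct, covered_total, n_classes):
    """Laplace expected accuracy of a rule: (n_c + 1) / (n_tot + k),
    where k is the number of classes."""
    return (covered_correct + 1) / (covered_total + n_classes)


r1 = Rule(frozenset({"a"}), "x", confidence=0.9, support=0.2)
r2 = Rule(frozenset({"a", "b"}), "x", confidence=0.9, support=0.3)
r3 = Rule(frozenset({"c", "d", "e"}), "y", confidence=0.8, support=0.5)

csa = order_csa([r1, r2, r3])
acs = order_acs([r1, r2, r3])
```

The two orderings differ on the same rule set: CSA ranks the high-confidence rules first, while ACS promotes the rule with the largest antecedent even though its confidence is lower.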
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
Programs for gene prediction in computational biology are examples of systems for which authentic test data are difficult to acquire, as obtaining them requires years of extensive research. This has led to test methods based on semi-artificially produced test data, often produced by ad hoc techniques complemented by statistical models such as Hidden Markov Models (HMMs). The quality of such a test method depends on how well the test data reflect the regularities in known data and how well they generalize these regularities. So far only very simplified and generalized artificial data sets have been tested, and a more thorough statistical foundation is required. We propose to use logic-statistical modelling methods for machine learning to analyze existing, manually marked-up data, integrated with the generation of new, artificial data. More specifically, we suggest using the PRISM system developed by Sato and Kameya. Based on logic programming extended with random variables and parameter learning, PRISM appears to be a powerful modelling environment that subsumes HMMs and a wide range of other methods, all embedded in a declarative language. We illustrate these principles here, showing parts of a model under development for genetic sequences, and report initial experiments producing test data for the evaluation of existing gene finders, exemplified by GENSCAN, HMMGene and ***.
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
Transduction is an inference mechanism "from particular to particular". Its application to classification tasks implies the use of both labeled (training) data and unlabeled (working) data to build a classifier whose main goal is to classify (only) the unlabeled data as accurately as possible. Unlike the classical inductive setting, no general rule valid for all possible instances is generated. Transductive learning is best suited to applications where the examples for which a prediction is needed are already known when the classifier is trained. Several approaches have been proposed in the literature for building transductive classifiers from data stored in a single table of a relational database. Nonetheless, no attention has been paid to applying the transduction principle in a (multi-)relational setting, where data are stored in multiple tables of a relational database. In this paper we propose a new transductive classifier, named TRANSC, which is based on a probabilistic approach to making transductive inferences from relational data. This new method works in a transductive setting and employs principled probabilistic classification in multi-relational data mining to face the challenges posed by some spatial data mining problems. Probabilistic inference allows us to compute the class probability and to return, in addition to the result of transductive classification, the confidence in the classification. The predictive accuracy of TRANSC has been compared to that of its inductive counterpart in an empirical study involving both a benchmark relational dataset and two spatial datasets. The results obtained are generally in favor of TRANSC, although the improvements are small.
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel unsupervised algorithm for outlier detection with a solid statistical foundation is proposed. First, we modify a nonparametric density estimate with a variable kernel to yield a robust local density estimate. Outliers are then detected by comparing the local density of each point to the local density of its neighbors. Experiments on several simulated data sets demonstrate that the proposed approach can outperform two widely used outlier detection algorithms (LOF and LOCI).
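The idea of comparing a point's local density with that of its neighbors can be sketched as follows. This is a generic density-ratio illustration, not the estimator from the paper: the Gaussian kernel, the bandwidth rule (distance to the farthest of the k nearest neighbors), the omitted normalizing constant (it cancels in the ratio), and all function names are assumptions made for this example.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    ranked = sorted(others, key=lambda j: math.dist(points[i], points[j]))
    return ranked[:k]


def local_density(points, i, neighbours):
    """Gaussian kernel density at points[i], using only its neighbours.

    The bandwidth h varies per point: it is the distance to the farthest
    neighbour, so sparse regions get wide kernels (an assumed rule).
    """
    dim = len(points[i])
    dists = [math.dist(points[i], points[j]) for j in neighbours]
    h = max(max(dists), 1e-12)  # guard against duplicate points
    total = sum(math.exp(-(d * d) / (2 * h * h)) for d in dists)
    return total / (len(neighbours) * h ** dim)


def outlier_scores(points, k=3):
    """Score = mean density of the neighbours / own density.

    Scores near 1 mean the point is about as dense as its surroundings;
    scores much larger than 1 suggest an outlier.
    """
    nbrs = [knn(points, i, k) for i in range(len(points))]
    dens = [local_density(points, i, nbrs[i]) for i in range(len(points))]
    return [
        (sum(dens[j] for j in nbrs[i]) / len(nbrs[i])) / dens[i]
        for i in range(len(points))
    ]


# Five clustered points plus one isolated point at index 5.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = outlier_scores(data, k=3)
```

On this toy data the isolated point receives by far the largest score, because its own density is tiny compared with that of the cluster points it is compared against.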
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
In this paper, we propose a filter-refinement scheme based on a new approach, the Sorted Extended Gaussian Image histogram (SEGI), to address the problems of the traditional EGI. Specifically, SEGI first constructs a 2D histogram based on the EGI histogram and the shell histogram. Then, SEGI extracts two kinds of descriptors from each 3D model: (i) the descriptor from the sorted histogram bins is used to perform approximate 3D model retrieval in the filter step, and (ii) the descriptor that records the relations between the histogram bins is used to refine the approximate results and obtain the final query results. The experiments show that SEGI outperforms most state-of-the-art approaches (e.g., EGI, shell histogram) on the public Princeton Shape Benchmark.
ISBN: 9783540734994 (digital)
ISBN: 9783540734987 (print)
One of the most important algorithms for mining data streams is VFDT. It uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the constructed tree. Gama et al. have extended VFDT in two directions: their system VFDTc can deal with continuous data and can use more powerful classification techniques at the tree leaves. In this paper, we revisit this problem and implement a system, fVFDT, on top of VFDT and VFDTc. We make the following four contributions: 1) We present a threaded binary search tree (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree whose processing time for inserting values is O(n log n), while VFDT's processing time is O(n^2). When a new example arrives, VFDTc needs to update O(log n) attribute tree nodes, whereas fVFDT needs to update only the one necessary node. 2) We improve the method for finding the best split-test point of a given continuous attribute. Compared with the method used in VFDTc, the processing time improves from O(n log n) to O(n). 3) Compared with VFDTc, fVFDT's number of candidate split tests decreases from O(n) to O(log n). 4) We improve the soft discretization method for use in data stream mining; it overcomes the problem of noisy data and improves classification accuracy.
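The Hoeffding inequality mentioned above is what lets a VFDT-style learner decide, after seeing only n examples, whether the best split attribute is reliably better than the runner-up. The standard bound is ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split measure and 1 − δ the desired confidence. The short Python sketch below shows only this decision rule; the function names and the sample gain values are hypothetical, and none of the fVFDT-specific improvements (TBST, soft discretization) are modeled.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with range `value_range` lies within this epsilon of the
    mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split when the observed gap between the two best attributes
    exceeds the bound, so the ranking is unlikely to be due to chance."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# Hypothetical gains for the two best attributes at a leaf.
eps_small_sample = hoeffding_bound(1.0, 1e-7, 100)
eps_large_sample = hoeffding_bound(1.0, 1e-7, 1000)
split_early = can_split(0.30, 0.15, 1.0, 1e-7, 100)
split_later = can_split(0.30, 0.15, 1.0, 1e-7, 1000)
```

Because ε shrinks as n grows, the same observed gap that is inconclusive after 100 examples becomes sufficient to split after 1000.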
ISBN: 9781424413119 (print)
Data mining is an important approach to knowledge discovery. It is the process of extracting patterns or predicting previously unknown and useful trends from large quantities of data by using knowledge from multidisciplinary fields such as statistics, pattern recognition, artificial intelligence, machine learning, databases, management information systems, and so on. The artificial neural network (ANN) is one of the techniques of data mining. It is a nonlinear adaptive dynamic system made of many cells that simulates the structure of biological neural systems. An ANN is capable of mapping highly nonlinear systems, of associative memory, and of abstract generalization. It can build a model by analyzing the patterns in the data and discover unknown knowledge. The present paper gives an engineering application of data mining based on neural networks. The back propagation (BP) neural network is used as the data mining algorithm. The effects of structural and technological parameters on the stress in the weld region of the shield engine rotor in a submarine are then analyzed. The mined data come from numerical simulations using the finite element method. The effects of the different parameters on the stress in the weld region are obtained from the data mining results. The discovered knowledge is beneficial for improving the safety of the structural strength design of the engine rotor.
ISBN: 0769529941 (print)
This paper presents a steganographic method in which a payload is embedded into the values of an attribute associated with files. A secret payload is embedded piece by piece by traversing the directories in a given directory tree. In each directory, a piece of the payload is embedded through an embedding algorithm specified by the user. Two embedding algorithms are currently available. The regular files in the directory tree are kept intact even after embedding is performed, whereas most previously proposed steganographic methods modify the contents of files to embed data. A prototype system has been developed for Linux. It has been shown that the proposed method works in principle.