Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be com...
详细信息
Because of the widespread diffusion of semistructured data in XML format, much research effort is currently devoted to support the storage and retrieval of large collections of such documents. XML documents can be compared as to their structural similarity, in order to group them into clusters so that different storage, retrieval, and processing techniques can be effectively exploited. In this scenario, an efficient and effective similarity function is the key of a successful data management process. We present an approach for detecting structural similarity between XML documents which significantly differs from standard methods based on graph-matching algorithms, and allows a significant reduction of the required computation costs. Our proposal roughly consists of linearizing the structure of each XML document, by representing it as a numerical sequence and, then, comparing such sequences through the analysis of their frequencies. First, some basic strategies for encoding a document are proposed, which can focus on diverse structural facets. Moreover, the theory of Discrete Fourier Transform is exploited to effectively and efficiently compare the encoded documents (i.e., signals) in the domain of frequencies. Experimental results reveal the effectiveness of the approach, also in comparison with standard methods.
Queries on semistructured data are hard to process due to the complex nature of the data and call for specialized techniques. Existing path- based indexes and query processing algorithms are not efficient for searchin...
详细信息
Queries on semistructured data are hard to process due to the complex nature of the data and call for specialized techniques. Existing path- based indexes and query processing algorithms are not efficient for searching complex structures beyond simple paths, even when the queries are high- selective. We introduce the definition of minimal infrequent structures ( MIS), which are structures that 1) exist in the data, 2) are not frequent with respect to a support threshold, and 3) all substructures of them are frequent. By indexing the occurrences of MIS, we can efficiently locate the high- selective substructures of a query, improving search performance significantly. An efficient data mining algorithm is proposed, which finds the minimal infrequent structures. Their occurrences in the XML data are then indexed by a lightweight data structure and used as a fast filter step in query evaluation. We validate the efficiency and applicability of our methods through experimentation on both synthetic and real data.
mining frequent patterns in databases is a fundamental and essential problem in data mining research. A continuity is a kind of causal relationship which describes a definite temporal factor with exact position betwee...
详细信息
ISBN:
(纸本)9780898715934
mining frequent patterns in databases is a fundamental and essential problem in data mining research. A continuity is a kind of causal relationship which describes a definite temporal factor with exact position between the records. Since continuities break the boundaries of records, the number of potential patterns will increase drastically. An alternative approach is to mine closed frequent continuities. mining closed frequent patterns has the same power as mining the complete set of frequent patterns, while substantially reducing redundant rules to be generated and increasing the effectiveness of mining. In this paper, we propose a method called projected window list technology for the mining of frequent continuities. We present a closed frequent continuity mining algorithm, ClosedPROWL. Experimental result shows that our algorithm is more efficient than previously proposed algorithms.
Block correlations are common semantic patterns in storage systems. They can be exploited for improving the effectiveness of storage caching, prefetching, data layout, and disk scheduling. Unfortunately, information a...
详细信息
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely ...
详细信息
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications.
In this paper, we describe a new method combining the polynomial neural network and decision tree techniques in order to derive comprehensible classification rules from clinical electroencephalograms (EEGs) recorded f...
详细信息
In this paper, we describe a new method combining the polynomial neural network and decision tree techniques in order to derive comprehensible classification rules from clinical electroencephalograms (EEGs) recorded from sleeping newborns. These EEGs are heavily corrupted by cardiac, eye movement, muscle, and noise artifacts and, as a consequence, some EEG features are irrelevant to classification problems. Combining the polynomial network and decision tree techniques, we discover comprehensible classification rules while also attempting to keep their classification error down. This technique is shown to outperform a number of commonly used machine learning technique applied to automatically recognize artifacts in the sleep EEGs.
暂无评论