An efficient low-level word image representation plays a crucial role in general cursive word recognition. This paper proposes a novel representation scheme, where a word image can be represented as two sequences of f...
ISBN (print): 3540305068
In the typical nonparametric approach to classification in instance-based learning and data mining, random data (the training set of patterns) are collected and used to design a decision rule (classifier). One of the best-known such rules is the k-nearest-neighbor decision rule (also known as lazy learning), in which an unknown pattern is classified into the majority class among its k nearest neighbors in the training set. This rule gives low error rates when the training set is large. However, in practice it is desirable to store as little of the training data as possible without sacrificing performance. It is well known that thinning (condensing) the training set with the Gabriel proximity graph is a viable partial solution to the problem. However, this raises the problem of efficiently computing the Gabriel graph of large training data sets in high-dimensional spaces. In this paper we report on a new approach to the instance-based learning problem. The new approach combines five tools: first, editing the data using Wilson-Gabriel editing to smooth the decision boundary; second, applying Gabriel thinning to the edited set; third, filtering this output with the ICF algorithm of Brighton and Mellish; fourth, using the Gabriel-neighbor decision rule to classify new incoming queries; and fifth, using a new data structure that allows the efficient computation of approximate Gabriel graphs in high-dimensional spaces. Extensive experiments suggest that our approach is the best on the market.
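To illustrate the Gabriel-graph machinery this abstract relies on, the sketch below builds a Gabriel graph by brute force and classifies a query by majority vote among its Gabriel neighbors. The function names (`gabriel_edges`, `gabriel_classify`) and the cubic-time construction are our own illustrative choices; they are not the paper's approximate data structure or its full editing/thinning pipeline.

```python
# Two points p and q are Gabriel neighbours iff no third point r lies strictly
# inside the ball whose diameter is the segment pq, i.e. there is no r with
# |p-r|^2 + |q-r|^2 < |p-q|^2 (an O(n^3) brute-force check, for illustration only).

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gabriel_edges(points):
    """Return the Gabriel graph of `points` as a list of index pairs."""
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = sq_dist(points[i], points[j])
            if all(sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) >= d_ij
                   for k in range(n) if k not in (i, j)):
                edges.append((i, j))
    return edges

def gabriel_classify(points, labels, query):
    """Majority vote among the training points that are Gabriel neighbours of `query`."""
    votes = {}
    for i, p in enumerate(points):
        d_pq = sq_dist(p, query)
        if all(sq_dist(p, r) + sq_dist(query, r) >= d_pq
               for j, r in enumerate(points) if j != i):
            votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)
```

For three collinear points, `gabriel_edges([(0, 0), (1, 0), (2, 0)])` yields only the two short edges, since the middle point blocks the long one.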
The task of extracting knowledge from text is an important research problem for information processing and document understanding. Approaches to capture the semantics of picture objects in documents constitute subject...
Steering an autonomous vehicle requires the permanent adaptation of behavior in relation to the various situations the vehicle is in. This paper describes research that implements such adaptation and optimization b...
The management of Virtual Organizations (VOs) brings some challenges. One of them is the appropriate and effective coordination and monitoring of distributed business processes. It is not easy or trivial to handle the...
The work presented in this paper is part of the cooperative research project AUTO-OPT carried out by twelve partners from the automotive industries. One major work package concerns the application of data mining metho...
Several cost-sensitive boosting algorithms have been reported as effective methods for dealing with the class imbalance problem. Misclassification costs, which reflect the different levels of class identification importance...
Monotonicity is a simple yet significant qualitative characteristic. We consider the problem of segmenting an array into up to K segments. We want segments to be as monotonic as possible and to alternate signs. We propo...
ISBN (print): 354026972X
In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and preprocessing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
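To make the idea concrete, here is a minimal, hypothetical sketch of such a generator: clean records are created first, and duplicates are derived from them by injecting a controlled single-character error, so the true match status is always known from the record identifiers. All value lists, field names, and the error model below are invented for illustration and are not the paper's actual generator.

```python
import random

# Tiny illustrative vocabularies (hypothetical, not from the paper's generator).
FIRST = ["alice", "bob", "carol"]
LAST = ["smith", "jones", "wong"]
STREET = ["main st", "park ave", "high st"]

def make_record(rec_id, rng):
    """Create one clean ('original') record with a traceable identifier."""
    return {"id": f"rec-{rec_id}-org",
            "first": rng.choice(FIRST),
            "last": rng.choice(LAST),
            "street": f"{rng.randint(1, 99)} {rng.choice(STREET)}"}

def corrupt(value, rng):
    """Inject one controlled error: a single-character substitution."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value))
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]

def generate(n_originals, dup_rate, rng):
    """Return originals plus duplicates; IDs encode the true linkage status."""
    originals = [make_record(i, rng) for i in range(n_originals)]
    duplicates = []
    for rec in originals:
        if rng.random() < dup_rate:
            dup = dict(rec)
            dup["id"] = rec["id"].replace("-org", "-dup")  # match status stays known
            field = rng.choice(["first", "last", "street"])
            dup[field] = corrupt(dup[field], rng)
            duplicates.append(dup)
    return originals + duplicates
```

Because each duplicate's identifier points back to its original, the generated set doubles as ground truth for evaluating deduplication algorithms, which is the key property the abstract highlights.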
In this paper we present a method to cluster large datasets that change over time using incremental learning techniques. The approach is based on the dynamic representation of clusters that involves the use of two set...