ISBN: 9783642029615 (print)
Many news sites have large collections of news pages generated dynamically and endlessly from underlying databases. Automatic extraction of news titles and contents from news pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from news pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and thus does not suffer from template updates, which usually invalidate the corresponding extractors. Empirical evaluation of our approach over 5,200 news Web pages collected from 13 important on-line news sites shows that our approach significantly improves the accuracy of news title extraction.
The relevance vector machine (RVM) is a state-of-the-art technique for constructing sparse kernel regression models [1,2,3,4]. It not only generates a much sparser model but also provides better generalization performance than the standar...
ISBN: 9783642018107 (print)
Coronary artery disease has been described as one of the curses of the western world, as it is one of the most important causes of mortality. Therefore, clinicians seek to improve diagnostic procedures, especially those that allow them to reach reliable early diagnoses. In the clinical setting, coronary artery disease diagnostics is typically performed in a stepwise manner. The four diagnostic levels consist of evaluation of (1) signs and symptoms of the disease and ECG (electrocardiogram) at rest, (2) sequential ECG testing during controlled exercise, (3) myocardial perfusion scintigraphy, and finally (4) coronary angiography, which is considered the "gold standard" reference method. Our study focuses on improving the diagnostic performance of the third diagnostic level. Myocardial scintigraphy is noninvasive; it results in a series of medical images that are relatively inexpensive to obtain. In clinical practice, these images are manually described (parameterized) by expert physicians. In this paper we present an innovative alternative to manual image evaluation: automatic image parameterization in multiple resolutions, based on texture description with specialized association rules. Extracted image parameters are combined into more informative composite parameters by means of principal component analysis, and finally used to build automatic classifiers with machine learning methods. Our experiments with synthetic datasets show that association-rule-based multi-resolution image parameterization equals or surpasses other state-of-the-art methods for finding multiple informative resolutions. Experimental results in coronary artery disease diagnostics confirm these findings, as our approach significantly improves the clinical results in terms of both the quality of image parameters and diagnostic performance.
ISBN: 9781932432466 (print)
Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross-language information retrieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, given a web page, the method consists of four steps: 1) preprocessing: parse the web page into a DOM tree and segment the inner text of each node into snippets; 2) seed mining: identify potential translation pairs (seeds) using a word-based alignment model which takes both translation and transliteration into consideration; 3) pattern learning: learn generalized patterns with the identified seeds; 4) pattern-based mining: extract all bilingual data in the page using the learned patterns. Our experiments on Chinese web pages produced more than 7.5 million pairs of bilingual sentences and more than 5 million pairs of bilingual terms, both with over 80% accuracy.
ISBN: 9781424427994 (print)
Contamination grade assessment is an important function of the online monitoring system for insulator leakage current (LC). The difficulty of the assessment lies in the nonlinear relationship among the electrical characteristic variables of the LC, the environmental factors, and the contamination condition of the insulator surface. In this paper, based on laboratory simulation experiments and field data, the parameters of a support vector machine (SVM) are optimized using the particle swarm optimization (PSO) algorithm, and an SVM pattern recognition model for assessing contamination grades is then constructed. The method takes advantage of the structural risk minimization of SVM and the fast global optimization ability of particle swarm, so that the mapping between the root mean square (R.M.S.) of the LC, the peak value of the LC, the amplitude and number of LC pulses, the ambient temperature and humidity, and the contamination grades can be established quickly by learning from sample data. Experimental results show that the contamination condition assessment method is effective. An online detection system for insulator contamination condition is then developed based on the assessment model.
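To illustrate the parameter search described in the abstract above, here is a minimal particle swarm optimization sketch. It is not the authors' implementation: the function names, the swarm settings (`inertia`, `c_personal`, `c_global`), the search bounds, and above all `toy_validation_error` are assumptions made for illustration. In a real setting the objective would be the cross-validated error of an SVM trained with the candidate parameters.

```python
import random


def pso(objective, bounds, n_particles=20, iters=50, seed=0,
        inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize `objective` over the box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position seen per particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    history = [gbest_val]
    for _ in range(iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (pbest[i][d] - pos[i][d])
                             + c_global * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def toy_validation_error(params):
    """Hypothetical stand-in for SVM validation error over (log10 C, log10 gamma)."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


best, best_val, history = pso(toy_validation_error, [(-3.0, 3.0), (-5.0, 1.0)])
print(best, best_val)
```

Searching in log space is a common choice for SVM penalty and kernel-width parameters, but it is an assumption here, as the abstract does not say how the parameters were scaled.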
ISBN: 9783540734987 (print)
The proceedings contain 67 papers. The topics discussed include: on concentration of discrete distributions with applications to supervised learning of classifiers; multi-source data modelling: integrating related data to improve model performance; an empirical comparison of ideal and empirical ROC-based reject rules; outlier detection with kernel density functions; generic probability density function reconstruction for randomization in privacy-preserving data mining; an incremental fuzzy decision tree classification method for mining data streams; on the combination of locally optimal pairwise classifiers; an agent-based approach to the multiple-objective selection of reference vectors; on applying dimension reduction for multi-labeled problems; nonlinear feature selection by relevance feature vector machine; a bounded index for cluster validity; and varying density spatial clustering based on a hierarchical tree.
ISBN: 9781424417391 (print)
Recognition of specific functionally important DNA sequence fragments is considered one of the most important problems in bioinformatics. One type of such fragments is promoters, i.e., short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. In this paper, a machine learning method called the Support Vector Machine (SVM) is used for classification of DNA sequences and promoter recognition. For optimal classification, 11 rules for mapping DNA sequences into the binary SVM feature space are analyzed. Classification is performed using a power series kernel function. Kernel parameters are optimized using a modification of the Nelder-Mead (downhill simplex) optimization method. The results of classification for Drosophila and human sequence datasets are presented.
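As an illustration of what mapping a DNA sequence into a binary feature space can look like, the sketch below uses a plain one-hot rule with four bits per nucleotide. The abstract does not list its 11 mapping rules, so this particular rule, the `ONE_HOT` table, and the all-zero treatment of unknown symbols are assumptions, not the paper's method.

```python
# Hypothetical mapping rule: each nucleotide becomes a 4-bit one-hot code.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_sequence(seq):
    """Map a DNA string to a flat binary feature vector (4 bits per base)."""
    features = []
    for base in seq.upper():
        # Assumption: symbols outside A/C/G/T (e.g. N) are encoded as all zeros.
        features.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return features


print(encode_sequence("ACGT"))
```

A fixed-length window of sequence encoded this way yields a fixed-length binary vector, which is the kind of input a kernel classifier such as an SVM expects.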
ISBN: 9780769532783 (print)
Two pattern recognition methods are introduced in this paper: an unsupervised learning algorithm, the fuzzy clustering method, and a supervised learning algorithm, the neural network. Pattern recognition becomes failure pattern recognition when it is used in the fault diagnosis of machines. Both the merits and the shortcomings of these two methods are discussed through a specific example in mechanical fault diagnosis.
ISBN: 9780769533223 (print)
Network intrusion detection aims at distinguishing attacks on the Internet from normal use of the Internet. This is a typical classification problem, so intrusion detection (ID) can be seen as a pattern recognition problem. In this paper, we build an intrusion detection system that uses AdaBoost, a prevailing machine learning algorithm, to construct the detection classifier. In the algorithm, RBF neural networks are used as weak classifiers. Because the training set is multi-attribute, non-linear, and massive, we use Isomap, a non-linear dimensionality reduction algorithm from pattern recognition, for feature extraction in order to improve training and classification speed. The feature dimension after feature extraction and the number of AdaBoost training rounds were studied experimentally. Finally, the experiments demonstrate the effectiveness of the method combining Isomap and AdaBoost.
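The boosting loop mentioned in the abstract above can be sketched as follows. This is a generic AdaBoost for labels in {-1, +1}, not the paper's system: decision stumps are swapped in for the RBF neural network weak classifiers, the Isomap step is omitted, and every function name and the toy data are assumptions for illustration.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                preds = [pol if row[f] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] >= thr else -pol


def adaboost(X, y, rounds=5):
    """Train an ensemble of (alpha, stump) pairs with reweighting each round."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error away from 0 and 1 so the log below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data (hypothetical): the first feature separates the classes.
X = [[1.0, 7.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
y = [-1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

The number of rounds and the dimension of the input features are the two quantities the abstract reports studying; here they correspond to the `rounds` argument and the length of each row in `X`.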
ISBN: 9781424422104 (print)
The committee machine approach has been shown to be useful in different applications. Protein primary structure data contain valuable information to extract. In this paper we mine these data and predict protein contact maps based on committee machines. A contact map is a simplified, two-dimensional representation of a protein's spatial structure. Contact map prediction is of great interest due to its application in fold recognition and in predicting protein tertiary structure. The results show that the performance of the committee is considerably better than that of a single model.
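For readers unfamiliar with contact maps, the sketch below shows how one is derived from known 3D residue coordinates and how several predicted maps could be combined. The 8.0 distance threshold, the use of one coordinate per residue, and majority voting as the committee rule are assumptions for illustration; the abstract does not state how its committee combines members.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 where two residues lie closer than `threshold`."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]


def committee_vote(maps):
    """Combine binary contact maps from several models by strict majority vote."""
    k = len(maps)
    n = len(maps[0])
    return [[1 if 2 * sum(m[i][j] for m in maps) > k else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical coordinates for three residues placed along one axis.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_map(coords))
```

With these coordinates the first two residues are in contact with each other and the third is in contact only with itself, which gives a symmetric matrix with ones on the diagonal.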