Authors:
Gong, F; Wang, BT; Chau, FT; Liang, YZ
Hong Kong Polytech Univ, Dept Appl Biol & Chem Technol, Hong Kong, Hong Kong, Peoples R China; Cent S Univ, Coll Chem & Chem Engn, Inst Chemometr & Intelligent Analyt Instruments, Res Ctr Modernizat Chinese Herbal Med, Changsha 410083, Peoples R China
Recently, the fingerprinting approach using chromatography has become one of the most potent tools for quality assessment of herbal medicine. Due to the complexity of chromatographic fingerprints and the irreproducibility of chromatographic instruments and experimental conditions, several chemometric approaches, such as variance analysis, peak alignment, correlation analysis, and pattern recognition, were employed in this work to deal with the chromatographic fingerprint. To facilitate the data preprocessing, a software package named Computer Aided Similarity Evaluation (CASE) was also developed. All chemometric algorithms for CASE were coded in MATLAB 5.3 for Windows. Data loading, removal, cutting, smoothing, compression, background and retention-time-shift correction, normalization, peak identification and matching, variation determination of common peaks/regions, similarity comparison, sample classification, and other data processes associated with chromatographic fingerprints are supported by the software. A case study on high-pressure liquid chromatographic (HPLC) fingerprints of 50 Rhizoma chuanxiong samples from different sources demonstrated that the chemometric approaches investigated in this work are reliable and user-friendly for preprocessing chromatographic fingerprints of herbal medicines for quality assessment.
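As a rough illustration of the kind of preprocessing CASE performs, the sketch below smooths a one-dimensional chromatogram, subtracts a crude linear baseline, and normalizes the result. The original software is MATLAB-based; this Python version, the function name preprocess_fingerprint, and the parameter choices are illustrative assumptions, not the authors' code.

# Hypothetical sketch of CASE-style smoothing, background correction,
# and normalization of a chromatographic fingerprint (illustrative only).
import numpy as np
from scipy.signal import savgol_filter

def preprocess_fingerprint(signal, window=11, polyorder=3):
    """Smooth a 1-D chromatogram, correct its background, and normalize it."""
    smoothed = savgol_filter(signal, window_length=window, polyorder=polyorder)
    # Crude background correction: subtract a linear baseline through the
    # endpoints (a stand-in for CASE's actual correction method).
    baseline = np.linspace(smoothed[0], smoothed[-1], smoothed.size)
    corrected = smoothed - baseline
    # Normalize so fingerprints from different runs are comparable.
    return corrected / np.abs(corrected).sum()

chromatogram = np.abs(np.random.randn(500)).cumsum()  # toy stand-in data
fp = preprocess_fingerprint(chromatogram)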
ISBN (print): 9781509044702
In the field of data science, data are usually considered independently of the problem to be solved. The originality of this paper consists in handling huge instances of combinatorial problems with data-mining technologies in order to reduce the complexity of their treatment. Such a task arises in Web combinatorial optimization problems such as Internet data-packet routing and web clustering. We focus in particular on the satisfiability of Boolean formulae, but the proposed idea could be adopted for any other complex problem. The aim is to explore the satisfiability instance using data-mining techniques in order to reduce its size prior to solving it. An estimated solution for the reduced instance is then computed using a hybrid algorithm based on the DPLL technique and a genetic algorithm, and is compared to the solution of the initial instance in order to validate the method's effectiveness. We performed experiments on the well-known BMC datasets and show the benefits of using data-mining techniques as a pretreatment prior to solving the problem.
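The abstract names DPLL as one half of the hybrid solver. A minimal DPLL sketch over DIMACS-style integer clauses is given below; the genetic-algorithm half and the data-mining reduction step are omitted, and the function is a toy illustration rather than the authors' implementation.

# Minimal DPLL over CNF clauses given as lists of signed integers
# (DIMACS-style literals); illustrative only.
def dpll(clauses, assignment=frozenset()):
    """Return a satisfying set of literals, or None if unsatisfiable."""
    clauses = [c for c in clauses if not any(l in assignment for l in c)]
    clauses = [[l for l in c if -l not in assignment] for c in clauses]
    if not clauses:
        return assignment            # all clauses satisfied
    if any(len(c) == 0 for c in clauses):
        return None                  # an empty clause means a conflict
    units = [c[0] for c in clauses if len(c) == 1]
    if units:                        # unit propagation
        return dpll(clauses, assignment | {units[0]})
    lit = clauses[0][0]              # branch on the first unassigned literal
    return dpll(clauses, assignment | {lit}) or dpll(clauses, assignment | {-lit})

print(dpll([[1, 2], [-1, 3], [-3, -2]]))  # a satisfiable toy instance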
ISBN (print): 9783642286575; 9783642286582
Data preprocessing includes data cleaning, data integration, data transformation, and data reduction. Data cleaning aims to remove unrelated or redundant items through two processes. Data integration involves three main problems, each of which can be solved by several kinds of methods. Data transformation includes data generalization, attribute construction, and standardization; three algorithms can be used to normalize the data. The last step, data reduction, compresses the data in order to improve the quality of mining models. All four steps are interrelated and should not be separated; they work together to improve the final result of data mining.
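The abstract does not name the three normalization algorithms; a common textbook trio is min-max scaling, z-score standardization, and decimal scaling, sketched below under that assumption.

# Three common normalization algorithms (assumed, not confirmed by the paper).
import numpy as np

def min_max(x, lo=0.0, hi=1.0):
    """Rescale values linearly into [lo, hi]."""
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def z_score(x):
    """Center to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    """Divide by the smallest power of ten that brings max |x| to at most 1."""
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / 10 ** j

x = np.array([120.0, 55.0, 300.0, 78.0])
print(min_max(x), z_score(x), decimal_scaling(x), sep="\n")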
ISBN (print): 9783319959290; 9783319959306
A large-scale agricultural Internet of Things (IoT) generates a large amount of data at every moment; after a certain period of time, the volume can reach hundreds of millions of records. It is very meaningful to analyze and mine such agricultural big data and to replace artificial experience with the analysis results. However, the agricultural production environment is complex, and the raw data collected contain a variety of anomalies, so they cannot be analyzed and mined directly. In this paper, a data preprocessing method based on time-series analysis is proposed, which can quickly and efficiently obtain a prediction model that is used to fill in and replace the abnormal data. On this basis, we add a data preprocessing layer to the traditional three-layer IoT architecture, located between the application layer and the transmission layer, and design a four-layer agricultural IoT system. The system not only realizes the basic functions of data acquisition, transmission, and storage, but also provides better data sources for subsequent analysis.
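As a hedged sketch of the preprocessing layer's role, the snippet below flags sensor readings that deviate strongly from a trailing moving average and replaces them with the predicted value. The moving-average predictor is a simple stand-in for the paper's actual time-series model, and all names and thresholds are illustrative.

# Replace anomalous readings with a time-series prediction (illustrative).
import numpy as np

def clean_series(values, window=5, threshold=3.0):
    """Replace points deviating more than threshold*sigma from the trailing
    moving average with the predicted (moving-average) value."""
    v = np.asarray(values, dtype=float).copy()
    for i in range(window, len(v)):
        hist = v[i - window:i]
        pred, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(v[i] - pred) > threshold * sigma:
            v[i] = pred  # fill/replace the abnormal reading
    return v

raw = [20.1, 20.3, 20.2, 20.4, 20.3, 95.0, 20.5, 20.4]  # toy sensor data
print(clean_series(raw))  # the 95.0 spike is replaced by the prediction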
ISBN (print): 9781457705847
Since Intrusion Detection Systems (IDSs) operate in real time, they should be lightweight so as to detect intrusions as fast as possible. Distance-Based Outlier Detection (DBOD) is one of the most widely used techniques for detecting outliers due to its simplicity and efficiency. Additionally, DBOD is an unsupervised approach, which overcomes the lack of training datasets with known intrusions. However, since IDSs usually deal with high-dimensional datasets, DBOD becomes subject to the curse of dimensionality. Furthermore, intrusion datasets should be normalized before calculating pairwise distances between observations. The purpose of this research is to conduct a comparative study of different normalization methods in conjunction with a well-known feature extraction technique, Principal Component Analysis (PCA), so that the efficiency of these methods as data preprocessing techniques can be investigated when applying DBOD to detect intrusions. Experiments were performed using two distance metrics: Euclidean distance and Mahalanobis distance. We further examined PCA using 7 threshold values indicating the number of principal components to retain according to their total contribution to the variability of the features. These approaches were evaluated on the KDD Cup 1999 intrusion detection (KDD-Cup) dataset. The main purpose of this study is to find the best attribute normalization method, along with the correct threshold value for PCA, so that a fast unsupervised IDS can discover intrusions effectively. The results recommend using the log normalization method combined with the Euclidean distance when performing PCA.
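A compact sketch of the evaluated pipeline, log normalization followed by PCA at a variance threshold and Euclidean distance-based outlier scoring, is given below. The threshold and neighbor count are illustrative choices, not the paper's recommended values.

# Log normalization -> PCA (variance threshold) -> distance-based outlier
# scoring with Euclidean distance (illustrative parameters).
import numpy as np
from sklearn.decomposition import PCA

def dbod_scores(X, var_threshold=0.95, k=5):
    Xn = np.log1p(np.abs(X))                  # log normalization
    Z = PCA(n_components=var_threshold).fit_transform(Xn)
    # Score each point by its mean Euclidean distance to its k nearest neighbors.
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, 1:k + 1].mean(axis=1)         # column 0 is the self-distance

X = np.random.rand(200, 10)
X[0] += 5                                     # plant one obvious outlier
print(dbod_scores(X).argmax())                # the planted outlier should score highest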
A prerequisite of any data analysis is the data itself, regardless of the analysis focus (visit-rate analysis, portal optimization, portal personalization, etc.), and the results of a selected analysis highly depend on the quality of the analyzed data. In the case of portal usage analysis, these data can be obtained by monitoring the web server log file. From these data we can create data matrices and a web map that serve for discovering the behaviour patterns of users. Data preparation from the log file represents the most time-consuming phase of the whole analysis. We carried out an experiment to find out which criteria make this time-consuming data preparation necessary, aiming to specify the steps that are indispensable for obtaining valid data from the log file. In particular, we focused on the reconstruction of the activities of the web visitor, an advanced and time-consuming data preprocessing technique. In this article we assess the impact of reconstructing the activities of a web visitor on the quantity and quality of the extracted rules that represent the web users' behaviour patterns. (C) 2010 Published by Elsevier Ltd.
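Reconstructing a visitor's activities is commonly done by grouping log records per visitor and splitting them at an inactivity timeout. The sketch below assumes a 30-minute timeout and an (ip, timestamp, url) record layout, both simplifications of what real log preprocessing requires.

# Session reconstruction from web server log records (illustrative layout).
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # a common inactivity threshold, assumed here

def reconstruct_sessions(records):
    """records: iterable of (ip, timestamp, url); returns (ip, hits) sessions."""
    by_ip = defaultdict(list)
    for ip, ts, url in sorted(records, key=lambda r: r[1]):
        by_ip[ip].append((ts, url))
    sessions = []
    for ip, hits in by_ip.items():
        current = [hits[0]]
        for prev, nxt in zip(hits, hits[1:]):
            if nxt[0] - prev[0] > TIMEOUT:   # gap too long: start a new session
                sessions.append((ip, current))
                current = []
            current.append(nxt)
        sessions.append((ip, current))
    return sessions

log = [("1.2.3.4", datetime(2010, 5, 1, 10, 0), "/index"),
       ("1.2.3.4", datetime(2010, 5, 1, 10, 5), "/news"),
       ("1.2.3.4", datetime(2010, 5, 1, 12, 0), "/index")]
print(len(reconstruct_sessions(log)))  # 2 sessions for this visitor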
ISBN (print): 9789082797039
The evolution of industry towards the 4.0 paradigm has motivated the adoption of Artificial Neural Networks (ANNs) for applications where predictive and maintenance tasks are performed. These tasks become difficult to carry out when rare events are present, because the resulting data imbalance can bias the training of the ANN. Conventional techniques addressing this problem are mainly based on resampling approaches; however, these are not always feasible when dealing with time-series forecasting tasks in industrial scenarios. For that reason, this work proposes the application of data preprocessing techniques specially designed for this scenario, a problem which has not been sufficiently covered in the state of the art. The considered techniques are applied to time-series data coming from Wastewater Treatment Plants (WWTPs). Our proposal significantly outperforms current strategies, showing a 68% improvement in terms of RMSE when rare events are addressed.
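The paper's preprocessing techniques are not detailed in this abstract. As one hedged illustration of handling rare events without resampling (which would break temporal order), rare-event samples can instead be upweighted before training; the quantile and weight below are illustrative assumptions.

# Upweight rare-event samples instead of resampling a time series
# (an assumed stand-in, not the paper's technique).
import numpy as np

def rare_event_weights(y, quantile=0.95, boost=10.0):
    """Give samples whose target exceeds the given quantile extra weight."""
    cutoff = np.quantile(y, quantile)
    return np.where(y > cutoff, boost, 1.0)

y = np.concatenate([np.random.normal(10, 1, 950), np.random.normal(40, 2, 50)])
w = rare_event_weights(y)
print((w > 1).sum())  # weights like these can be passed as sample_weight to many training APIs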
ISBN (print): 9783037856611
Because the Magnetic Suspension Gyro-total-station is vulnerable to outside interference, its measurements contain random drift for which no mathematical model can be established. The Vondrak filter, which does not require such a model, is therefore used to preprocess the measurements of the Magnetic Suspension Gyro-total-station. In this paper a high-precision astronomical baseline is established in Xi'an, and the gyro azimuth is tested on the baseline eight times. For each test, 40,000 north-seeking torque measurements from the first and second positions are processed with the Vondrak filter. The results show that data burrs are reduced after filtering and that the filtered values reflect the trend of gyro north-seeking. Compared with the root mean square (RMS) of the raw measurements, the RMS after Vondrak filtering is smaller and the data are denser. The Vondrak filter can effectively eliminate the random drift contained in the measurements, retain useful information to the maximum extent, and improve the accuracy of the true-north azimuth.
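The Vondrak filter balances fidelity to the data against the smoothness of third differences. For equally spaced data this is closely related to a Whittaker-style smoother, sketched below as an approximation; it is not the authors' exact implementation, and the smoothing parameter lam is illustrative.

# Whittaker-style smoother with third differences, as a rough stand-in
# for the Vondrak filter on equally spaced data (illustrative only).
import numpy as np

def vondrak_like_smooth(y, lam=100.0):
    n = len(y)
    # Third-difference operator D, shape (n - 3, n).
    D = np.diff(np.eye(n), n=3, axis=0)
    # Solve (I + lam * D^T D) x = y for the smoothed series x;
    # larger lam gives a smoother result.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

t = np.linspace(0, 4 * np.pi, 400)
noisy = np.sin(t) + 0.2 * np.random.randn(t.size)  # toy torque-like signal
smooth = vondrak_like_smooth(noisy)
print(noisy.std(), (noisy - smooth).std())         # the burrs are reduced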
Many classifiers and methods have been proposed to deal with the letter recognition problem. Among them, clustering is a widely used method, but clustering only once is not adequate. Here, we adopt data preprocessing and a re-kernel clustering method to tackle the letter recognition problem. In order to validate the effectiveness and efficiency of the proposed method, we introduce re-kernel clustering into Kernel Nearest Neighbor classification (KNN), Radial Basis Function Neural Networks (RBFNN), and Support Vector Machines (SVM). Furthermore, we compare re-kernel clustering with one-time kernel clustering, denoted as kernel clustering for short. Experimental results validate that re-kernel clustering forms fewer and more feasible kernels and attains higher classification accuracy.
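The abstract does not spell out how re-kernel clustering works. As one hedged reading, a first round of k-means produces candidate kernel centers and a second round clusters those centers to merge near-duplicates, yielding fewer, more feasible kernels for an RBF network; the cluster counts below are illustrative.

# Two-stage ("re-") clustering to consolidate kernel centers (assumed reading).
import numpy as np
from sklearn.cluster import KMeans

def re_kernel_centers(X, first_k=50, final_k=10, seed=0):
    """Cluster the data, then re-cluster the resulting centers."""
    centers = KMeans(n_clusters=first_k, random_state=seed,
                     n_init=10).fit(X).cluster_centers_
    return KMeans(n_clusters=final_k, random_state=seed,
                  n_init=10).fit(centers).cluster_centers_

X = np.random.rand(1000, 16)   # stand-in for letter feature vectors
kernels = re_kernel_centers(X)
print(kernels.shape)           # (10, 16): consolidated kernel centers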
ISBN (print): 9781728151656
Goal-oriented process enhancement and discovery (GoPED) was recently proposed to take advantage of goal-modeling capabilities in process-mining activities. Conventional process mining aims to discover underlying process models from historical, crowdsourced event logs in an activity-oriented fashion. GoPED, however, infers goal-aligned process models from event logs enhanced with goal-related attributes, selecting the historical behaviors that have yielded sufficient levels of satisfaction for the (often conflicting) goals of different stakeholders. Three algorithms are available to select the subset of event logs from three different perspectives. The main input to all three algorithms is a version of the event log (EnhancedLog) that is (1) structured as a table showing each case and its trace in one row and (2) enhanced with satisfaction levels of different goals for each row. Typical event logs are therefore not ready to be fed as-is to the GoPED algorithms. This paper proposes a scheme for manipulating original event logs and turning them into an EnhancedLog. Two tools were also developed and tested for this scheme: TraceMaker, which structures the log as explained above, and EnhancedLogMaker, which computes the satisfaction levels of goals for all cases in the structured log.
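A hedged sketch of the EnhancedLog shape described above, one row per case with its ordered trace and per-goal satisfaction levels, is given below. The goal-satisfaction function is a placeholder; GoPED's actual computation, and the TraceMaker/EnhancedLogMaker tools, differ from this toy version.

# Build an EnhancedLog-like table: one row per case, trace plus goal
# satisfaction levels (placeholder satisfaction functions).
from collections import defaultdict

def make_enhanced_log(events, goal_fns):
    """events: (case_id, timestamp, activity) tuples;
    goal_fns: {goal_name: trace -> satisfaction level in [0, 1]}."""
    traces = defaultdict(list)
    for case_id, ts, activity in sorted(events, key=lambda e: e[1]):
        traces[case_id].append(activity)
    return [{"case": c, "trace": t,
             **{g: fn(t) for g, fn in goal_fns.items()}}
            for c, t in traces.items()]

events = [("c1", 1, "register"), ("c1", 2, "pay"), ("c2", 1, "register")]
goals = {"completed": lambda t: 1.0 if "pay" in t else 0.0}
for row in make_enhanced_log(events, goals):
    print(row)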