ISBN (print): 9781665440899
The domestic mass data processing system in the aerospace field uses a simple big data sampling algorithm for data reduction in the data preprocessing stage. This paper analyzes the data curve distortion caused by that algorithm and proposes an optimization method. A big data sampling algorithm based on peak detection is then adopted so that the complete picture of massive historical data can be viewed quickly and with high fidelity, while ensuring the correctness of data interpretation after preprocessing. Verification with real test data shows that, in the data preprocessing stage of the domestic mass data processing system, the peak-detection-based sampling algorithm yields a high-fidelity data curve after sampling.
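For orientation, here is a minimal sketch of one common way to realize peak-preserving downsampling: split the series into buckets and keep each bucket's local minimum and maximum so spikes survive the reduction. The function name, the bucketing scheme, and the min/max rule are illustrative assumptions; the abstract does not specify the paper's actual peak-detection algorithm.

```python
import numpy as np

def peak_preserving_sample(values, n_buckets):
    """Downsample a 1-D series by keeping the local minimum and maximum of
    each bucket, so peaks in the curve survive the reduction.
    Hypothetical sketch; the paper's exact peak-detection rules may differ."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(0, len(values), n_buckets + 1, dtype=int)
    kept = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi <= lo:
            continue
        bucket = values[lo:hi]
        # keep both extrema of the bucket, preserving original order
        i_min = lo + int(np.argmin(bucket))
        i_max = lo + int(np.argmax(bucket))
        kept.extend({i_min, i_max})
    kept = sorted(set(kept))
    return kept, values[kept]

# usage: reduce a long telemetry curve to ~2000 points while keeping spikes visible
# idx, sampled = peak_preserving_sample(raw_telemetry, n_buckets=1000)
```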
Big data mining concerns large-scale data analysis and faces computational-cost challenges due to the exponential growth of digital technologies. Classical data mining algorithms suffer from deficiencies in computation, memory utilization, resource optimization, scale-up, and speed-up when applied to big data. Sampling is one of the most effective data reduction techniques: it reduces computational cost and improves scalability and computational speed for any data mining algorithm in both single- and multi-machine execution environments. This study proposes a Euclidean-distance-based method for stratum creation and a stratified random sampling-based big data mining model using the K-Means clustering (SSK-Means) algorithm in a single-machine execution environment. The SSK-Means algorithm achieves better cluster quality, speed-up, scale-up, and memory utilization than random-sampling-based K-Means and classical K-Means, as measured by the silhouette coefficient, Davies-Bouldin index, Calinski-Harabasz index, execution time, and speed-up ratio.
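A minimal sketch of the stratified-sampling-plus-K-Means workflow this abstract outlines, using NumPy and scikit-learn. The stratum rule shown here (equal-width bins over Euclidean distance to the global mean), the function name, and all parameters are assumptions for illustration; the paper's exact SSK-Means stratum construction may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def stratified_sample_kmeans(X, n_strata=10, sample_frac=0.1, k=5, seed=0):
    """Sketch of stratified-sampling K-Means: form distance-based strata,
    draw a proportional random sample from each stratum, then cluster the
    sample and label the full dataset with the learned centroids."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
    # assign each point to an equal-width distance stratum (assumed rule)
    cuts = np.linspace(dist.min(), dist.max(), n_strata + 1)[1:-1]
    strata = np.digitize(dist, cuts)
    sample_idx = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        n_take = max(1, int(len(members) * sample_frac))
        sample_idx.extend(rng.choice(members, size=n_take, replace=False))
    sample = X[np.array(sample_idx)]
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sample)
    return model.predict(X), model.cluster_centers_
```

Clustering only the stratified sample keeps the K-Means cost proportional to the sample size, which is the scalability argument the abstract makes.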
Big data samples are an important source for analytics. However, their mix of relevant and irrelevant information, unspecified variables and scales, noise, null values, and so forth imposes huge challenges on the analysis of relevance, features, causes, and evaluation. This paper proposes an evidential analytics approach to disclose information buried in big data samples. Technically, it models memberships composed of relevance preferences and replaces the data with these priors. Its operations include generating analytics baselines, reducing variables, identifying sparse features, and inducing rules by taking advantage of evidence. As an illustration, a case study of semiconductor manufacturing on the UCI SECOM dataset is presented. It discloses relevant signals, key factors, variable thresholds, sparse characteristics, and the causal effect of damage buried in normal samples. The contribution of this paper not only covers these achievements but also provides prior data for inference. (C) 2019 Elsevier Inc. All rights reserved.
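The abstract stays at a high level, so the following is only a loose, hypothetical sketch of one listed operation (variable reduction via relevance "memberships"), using absolute correlation with a label as a stand-in for the paper's evidential preference; every name, score, and threshold here is an assumption.

```python
import numpy as np

def relevance_memberships(X, y, keep_frac=0.2):
    """Illustrative sketch only: score each variable with a [0, 1]
    'relevance membership' (absolute correlation with the label, an
    assumption standing in for the paper's evidential modeling), skip
    near-constant (sparse) features, and keep the top fraction."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.nanstd(col) == 0:            # sparse / constant feature
            continue
        mask = ~np.isnan(col)
        c = np.corrcoef(col[mask], y[mask])[0, 1]
        scores[j] = 0.0 if np.isnan(c) else abs(c)
    n_keep = max(1, int(keep_frac * X.shape[1]))
    kept = np.argsort(scores)[::-1][:n_keep]
    return kept, scores
```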
Data accuracy is one of the main dimensions of data quality; it measures the degree to which data are correct. Knowing the accuracy of an organization's data reflects the level of reliability it can assign to them in decision-making processes. Measuring data accuracy in a big data environment involves comparing the data to be assessed with some "reference data" considered by the system to be correct. However, such a process can be complex or even impossible in the absence of appropriate reference data. In this paper, we focus on this problem and propose an approach to obtain the reference data, enabled by the emergence of big data technologies. Our approach is based on the upstream selection of a set of criteria that we define as "accuracy criteria". We furthermore use techniques such as big data sampling, schema matching, record linkage, and similarity measurement. The proposed model and experimental results give us more confidence in the importance of a data quality assessment solution and in configuring the accuracy criteria to automate the selection of reference data in a data lake.
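A small, hedged sketch of the accuracy-measurement step described above: link assessed records to reference records and count a record as accurate when its fields are sufficiently similar. The dict-based record layout, linkage by a shared key, and the similarity threshold are assumptions, not the paper's actual pipeline.

```python
from difflib import SequenceMatcher

def accuracy_against_reference(records, reference, key, fields, threshold=0.9):
    """Sketch of accuracy measurement against reference data. Assumed layout:
    both inputs are lists of dicts sharing a linkage key. A record counts as
    accurate when every assessed field is similar enough to the reference."""
    ref_index = {r[key]: r for r in reference}   # simple record linkage by key
    assessed, accurate = 0, 0
    for rec in records:
        ref = ref_index.get(rec[key])
        if ref is None:
            continue                             # no reference, cannot assess
        assessed += 1
        sims = [SequenceMatcher(None, str(rec[f]), str(ref[f])).ratio()
                for f in fields]
        if all(s >= threshold for s in sims):
            accurate += 1
    return accurate / assessed if assessed else float("nan")
```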
Particle filtering is a numerical Bayesian technique with great potential for solving sequential estimation problems involving non-linear and non-Gaussian models. Since the estimation accuracy achieved by particle filters improves as the number of particles increases, it is natural to consider as many particles as possible. MapReduce is a generic programming model that makes it possible to scale a wide variety of algorithms to big data. However, despite the application of particle filters across many domains, little attention has been devoted to implementing them with MapReduce. In this paper, we describe an implementation of a particle filter using MapReduce. We focus on the component that would otherwise be a bottleneck to parallel execution: resampling. We devise a new implementation of this component that requires no approximations, has O(N) spatial complexity and deterministic O((log N)^2) time complexity. Results demonstrate the utility of this new component and culminate in a particle filter with 2^24 particles distributed across 512 processor cores.
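For context, the sketch below shows a standard single-machine systematic resampling step; the cumulative sum it computes serially is the kind of operation a distributed implementation would need to parallelize (an assumption on our part; the abstract does not detail the paper's MapReduce construction, so this is only the serial baseline, not the O((log N)^2) algorithm).

```python
import numpy as np

def systematic_resample(weights, rng=None):
    """Serial sketch of the particle-filter resampling step: one random
    offset plus evenly spaced positions scanned against the cumulative
    weight distribution, returning indices of the surviving particles."""
    rng = rng or np.random.default_rng()
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    cumulative = np.cumsum(weights / weights.sum())
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(cumulative, positions)
```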
ISBN (print): 9783319557052; 9783319557045
Since big data contain more comprehensive probability distributions and richer causal relationships than conventional small datasets, discovering Bayesian network (BN) structure from big datasets is becoming more and more valuable for modeling and reasoning under uncertainty in many areas. Facing big data, most current BN structure learning algorithms have limitations. First, learning BN structure from big datasets is an expensive process with high computational cost, often ending in failure. Second, given any dataset as input, it is very difficult to choose one algorithm from numerous candidates that consistently achieves good learning accuracy. To address these issues, we introduce a novel approach called Adaptive Bayesian Network Learning (ABNL). ABNL begins with an adaptive sampling process that extracts a sufficiently large data partition from any big dataset for fast structure learning. Then, ABNL feeds the data partition to different learning algorithms to obtain a collection of BN structures. Lastly, ABNL adaptively chooses among these structures and merges them into a final network structure using an ensemble method. Experimental results on four big datasets show that ABNL delivers significantly better performance than whole-dataset learning and more accurate results than the baseline algorithms.
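A generic sketch of the three ABNL stages named above (sample a partition, run several structure learners, merge their outputs). The learner callables, the edge representation, and the majority-vote merge are assumptions standing in for ABNL's actual adaptive sampling and ensemble rules, which the abstract does not detail.

```python
import random

def abnl_style_ensemble(dataset, learners, sample_size, vote_threshold=0.5):
    """Illustrative sketch of an ABNL-like workflow: draw one data partition,
    run each structure learner on it (each learner is assumed to be a callable
    returning a set of directed edges as (parent, child) tuples), and keep the
    edges that enough learners agree on."""
    partition = random.sample(dataset, min(sample_size, len(dataset)))
    edge_votes = {}
    for learn in learners:
        for edge in learn(partition):
            edge_votes[edge] = edge_votes.get(edge, 0) + 1
    needed = vote_threshold * len(learners)
    return {edge for edge, votes in edge_votes.items() if votes >= needed}
```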