We describe our experience using PARED, an object-oriented system for the adaptive solution of PDEs in a distributed computing environment. PARED handles selective mesh refinement and coarsening, mesh repartitioning for load balancing, and interprocessor mesh migration. It runs on distributed-memory parallel computers such as the IBM SP and networks of workstations. In this paper, we report on the use of PARED to solve two- and three-dimensional PDEs. We show that our object-oriented technology provides great flexibility with a small overhead, supporting the highly desirable adaptive features of PARED.
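PARED's internals are not given in the abstract; as a loose illustration of the selective refinement and coarsening it supports, here is a minimal Python sketch of the per-element refine/coarsen decision driven by an error estimate. The thresholds, names, and API are hypothetical, not PARED's.

```python
# Hypothetical sketch of selective refine/coarsen marking, the kind of
# adaptivity decision a system like PARED makes each solve step.
REFINE_TOL, COARSEN_TOL = 1e-2, 1e-4   # illustrative thresholds

def adapt(elements, error_estimate):
    """Map each element to an action based on its a posteriori error."""
    actions = {}
    for elem in elements:
        err = error_estimate(elem)
        if err > REFINE_TOL:
            actions[elem] = "refine"     # large error: subdivide
        elif err < COARSEN_TOL:
            actions[elem] = "coarsen"    # tiny error: merge back
        else:
            actions[elem] = "keep"
    return actions

# Toy usage with synthetic element errors.
errors = {0: 5e-2, 1: 3e-3, 2: 5e-5}
print(adapt(errors.keys(), errors.get))  # {0: 'refine', 1: 'keep', 2: 'coarsen'}
```

After marking, a system like PARED would also repartition the refined mesh and migrate elements between processors to restore load balance.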
This paper presents a measurement- and simulation-based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The measurements were used to characterize the performance bottlenecks and throughput limits at the compute and I/O nodes, and to provide realistic input parameters to PioSim, a simulation environment we have developed to investigate parallel I/O performance issues in cluster systems. PioSim was used to obtain a detailed characterization of parallel I/O performance in the cluster system for different regular access patterns and system configurations. This paper also explores the use of local disks at the compute nodes for parallel I/O, and finds that the local-disk architecture outperforms the traditional architecture of parallel I/O over remote I/O-node disks, even when as much as 68-75% of the requests from each compute node go to remote disks.
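As a back-of-the-envelope view of why the local-disk architecture can win even with mostly remote requests, the toy model below compares aggregate bandwidth when many compute-node disks serve a mix of local and penalized remote requests against a few dedicated I/O-node disks. All node counts, bandwidths, and penalties are hypothetical, not the PSC measurements.

```python
# Hypothetical throughput model: local compute-node disks vs. dedicated
# I/O nodes. Remote requests pay a network slowdown factor.
def aggregate_throughput(n_disks, disk_mb_s, remote_fraction, net_penalty):
    local = (1 - remote_fraction) * disk_mb_s
    remote = remote_fraction * disk_mb_s / net_penalty
    return n_disks * (local + remote)

# 16 compute nodes with local disks, 75% of requests remote, 2x penalty,
# vs. 4 dedicated I/O-node disks serving everything without the penalty.
print(aggregate_throughput(16, 10.0, remote_fraction=0.75, net_penalty=2.0))  # 100.0
print(aggregate_throughput(4, 10.0, remote_fraction=0.0, net_penalty=1.0))    # 40.0
```

The point of the toy numbers: with enough disks in aggregate, the local architecture tolerates a large remote fraction, consistent with the 68-75% figure above.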
A delivery problem which reduces to the NP-complete set-partitioning problem is investigated. The sequential and parallel simulated annealing algorithms for solving the delivery problem are discussed. The objective is to...
ISBN (print): 9781665435741
Convolutional Neural Networks (CNNs), one of the most representative algorithms of deep learning, are widely used in various artificial intelligence applications. Convolution operations often account for most of the computational overhead of CNNs. The FFT-based algorithm can improve the efficiency of convolution by reducing its algorithmic complexity, and there has been considerable work on high-performance implementations of FFT-based convolution on many-core CPUs. However, none of it optimizes for the non-uniform memory access (NUMA) characteristics of many-core CPUs. In this paper, we present a NUMA-aware FFT-based convolution implementation on ARMv8 many-core CPUs with NUMA architectures. The implementation reduces the number of remote memory accesses through the data reordering of the FFT transformations and a three-level parallelization of the complex matrix multiplication. Experimental results on an ARMv8 many-core CPU with a NUMA architecture demonstrate that our NUMA-aware implementation performs much better than the state-of-the-art work in most cases.
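The paper's NUMA and ARMv8 specifics cannot be reproduced from the abstract, but the complexity reduction it builds on, the convolution theorem, is easy to show. Below is a minimal NumPy sketch of FFT-based (circular) 2-D convolution checked against direct convolution; shapes and data are illustrative.

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Circular 2-D convolution via the convolution theorem: pointwise
    multiply in the frequency domain, then invert. Cost falls from
    O(n^2 k^2) for direct convolution to O(n^2 log n)."""
    h, w = image.shape
    kernel_padded = np.zeros((h, w))
    kh, kw = kernel.shape
    kernel_padded[:kh, :kw] = kernel
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel_padded)))

# Verify against a direct circular convolution on a small example.
rng = np.random.default_rng(0)
img, ker = rng.standard_normal((8, 8)), rng.standard_normal((3, 3))
direct = np.zeros_like(img)
for i in range(8):
    for j in range(8):
        for di in range(3):
            for dj in range(3):
                direct[(i + di) % 8, (j + dj) % 8] += img[i, j] * ker[di, dj]
assert np.allclose(fft_conv2d(img, ker), direct)
```

A NUMA-aware implementation like the paper's additionally arranges the transformed data so each node's threads multiply matrices resident in their own local memory; that placement logic is platform code beyond this sketch.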
The dataflow model of computation in general, and its recent direction of combining dataflow processing with control-flow processing in particular, provide attractive alternatives for satisfying the computational demands of new applications without the shortcomings of traditional concurrent systems. This should motivate researchers to analyze the applicability of familiar concepts, such as scheduling and load balancing, within this new architectural framework. Effective execution of loop iterations as a means to improve performance and hardware utilization has received a great deal of attention in the past. In this paper we address the problem of scheduling and allocation of DOACROSS loops in a multithreaded dataflow environment. An extension of the staggered scheme, the cyclic staggered scheme, which produces a more balanced distribution of iterations among processors, is introduced, and its performance improvement in dataflow and control-flow environments is simulated and analyzed.
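The abstract does not spell out the cyclic staggered scheme itself, so the sketch below only illustrates the problem it targets: how the allocation of DOACROSS iterations changes pipelined execution time when each iteration may start only a fixed delay after its predecessor. It compares plain block and plain cyclic allocation under hypothetical work and delay values.

```python
def doacross_finish_time(order, n_procs, work=4, delay=1):
    """Simulate DOACROSS execution: iteration i may start `delay` time
    units after iteration i-1 starts, and each processor runs its own
    iterations back to back. order[i] is the processor for iteration i."""
    proc_free = [0] * n_procs            # when each processor is next idle
    start = [0] * len(order)
    for i, p in enumerate(order):
        dep_ready = 0 if i == 0 else start[i - 1] + delay
        start[i] = max(proc_free[p], dep_ready)
        proc_free[p] = start[i] + work
    return start[-1] + work

N, P = 64, 4
block = [i * P // N for i in range(N)]   # contiguous chunks per processor
cyclic = [i % P for i in range(N)]       # iteration i on processor i mod P
print("block :", doacross_finish_time(block, P))   # serializes badly
print("cyclic:", doacross_finish_time(cyclic, P))  # pipelines across procs
```

Staggered variants refine this further by sizing per-processor chunks unevenly to balance the pipeline; the cyclic staggered scheme introduced here is the paper's improvement in that direction.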
ISBN (print): 9781509036820
There is an ongoing effort to develop tools that apply distributed computational resources to tackle large problems or reduce the time to solve them. In this context, the Alternating Direction Method of Multipliers (ADMM) arises as a method that can exploit distributed resources like the dual ascent method while retaining the robustness and improved convergence of the augmented Lagrangian method. Traditional approaches to accelerating the ADMM on multiple cores are problem-specific and often require multi-core programming. By contrast, we propose a problem-independent scheme for accelerating the ADMM that does not require the user to write any parallel code. We show that this scheme, an interpretation of the ADMM as a message-passing algorithm on a factor graph, can automatically exploit fine-grained parallelism on both GPUs and shared-memory multi-core computers, and achieves significant speedups in application domains as diverse as combinatorial optimization, machine learning, and optimal control. Specifically, we obtain 10-18x speedups using a GPU, and 5-9x using multiple CPU cores, over a serial, optimized C version of the ADMM, which is similar to the typical speedups reported for existing GPU-accelerated libraries, including cuFFT (19x), cuBLAS (17x), and cuRAND (8x).
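The factor-graph machinery is beyond an abstract-sized example, but the serial iteration it parallelizes is standard ADMM. Here is a minimal NumPy sketch for the lasso problem min_x 0.5*||Ax - b||^2 + lam*||x||_1: a ridge solve, a soft-threshold, and a dual update. The problem instance and parameters are illustrative, not from the paper.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """Standard ADMM for the lasso: x-update (ridge solve with a cached
    Cholesky factor), z-update (soft-thresholding), scaled dual update."""
    n = A.shape[1]
    Atb = A.T @ b
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))   # factor once
    x = z = u = np.zeros(n)
    for _ in range(iters):
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft-threshold
        u = u + x - z
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(admm_lasso(A, b, lam=1.0), 2))   # recovers the 3-sparse signal
```

In the message-passing view described above, updates like these decompose over factor-graph nodes, which is what exposes the fine-grained parallelism for GPUs and multi-core CPUs.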
ISBN (print): 9781509021406
As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. To operate on such massive data, we present the first distributed-memory parallel implementation of the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
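The distributed algorithm itself is not reproducible from the abstract, but the serial computation it parallelizes, a truncated higher-order SVD (HOSVD) form of the Tucker decomposition, fits in a short NumPy sketch. Tensor shapes and ranks below are illustrative.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: the chosen mode becomes the rows of a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mult(T, M, mode):
    """Mode-n product T x_n M, where M has shape (new_dim, T.shape[mode])."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: factor U_n from the leading left singular vectors
    of each unfolding; core G = T x_1 U_1^T x_2 U_2^T ... x_N U_N^T."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for mode, U in enumerate(factors):
        core = mode_mult(core, U.T, mode)
    return core, factors

# Build a 20x30x10 tensor with exact multilinear rank (3, 4, 2), then
# compress and reconstruct: the error should be near machine precision.
rng = np.random.default_rng(2)
T = rng.standard_normal((3, 4, 2))
for mode, dim in enumerate((20, 30, 10)):
    T = mode_mult(T, rng.standard_normal((dim, T.shape[mode])), mode)
core, factors = hosvd(T, ranks=(3, 4, 2))
R = core
for mode, U in enumerate(factors):
    R = mode_mult(R, U, mode)
print("relative error:", np.linalg.norm(R - T) / np.linalg.norm(T))
print("compression:", T.size / (core.size + sum(U.size for U in factors)))
```

The distributed version described above performs these same unfoldings, SVD computations, and mode products on tensors laid out across processors without redistribution.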
ISBN (print): 9781728117393
This article discusses the results of practical research on migrating the architecture of an integration middleware layer to a distributed stream processing platform. The stream processing framework was selected on the basis of a comparative analysis of the available open-source solutions.
ISBN (print): 9781538637906
The application of Support Vector Machines (SVMs) to data streams is growing with the increasing real-time processing requirements in classification tasks such as anomaly detection and real-time image processing. However, the dynamic live data with high volume and fast arrival rates in data streams make it challenging to apply SVMs to stream processing. Existing SVM implementations are mostly designed for batch processing and hardly satisfy the efficiency requirements of stream processing because of their inherent complexity. To address these challenges, we propose a high-efficiency distributed SVM framework over data streams (HDSVM), which consists of two main algorithms: an incremental learning algorithm and a distributed algorithm. First, we propose a partial-support-vector-reserving incremental learning algorithm (PSVIL). By selecting a subset of support vectors based on their distances to the classification hyperplane, instead of the universal set, to update the SVM, the algorithm achieves lower time overhead while preserving accuracy. Second, we propose a distribution-retaining partition and fast aggregation distributed algorithm (DRPFA) for SVM. The real-time data is partitioned according to its original distribution using clustering instead of random partitioning, and historical support vectors are partitioned based on their distances to the classification hyperplane. Under this partition strategy, the global hyperplane can be obtained by averaging the parameters of the local hyperplanes. Extensive experiments on Apache Storm show that the proposed HDSVM achieves lower time overhead and similar accuracy compared with the state of the art: speedup is increased by 2-8 times within a 1% accuracy deviation.
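PSVIL and DRPFA are not fully specified in the abstract; the sketch below only illustrates the two stated ideas: reserving the support vectors nearest the classification hyperplane, and aggregating local models by averaging hyperplane parameters. The fraction kept and the toy models are hypothetical.

```python
import numpy as np

def reserve_near_hyperplane(X, y, w, b, keep=0.3):
    """Keep the fraction of points closest to the hyperplane w.x + b = 0,
    the points most likely to matter for the next incremental update."""
    dist = np.abs(X @ w + b) / np.linalg.norm(w)
    mask = dist <= np.quantile(dist, keep)
    return X[mask], y[mask]

def aggregate_hyperplanes(local_models):
    """Average local (w, b) pairs into one global hyperplane."""
    w = np.mean([w for w, _ in local_models], axis=0)
    b = np.mean([b for _, b in local_models])
    return w, b

# Toy usage with a synthetic batch and two hypothetical local models.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
y = np.sign(X[:, 0] - 0.5 * X[:, 1])
X_keep, y_keep = reserve_near_hyperplane(X, y, np.array([1.0, -0.5]), 0.0)
w_g, b_g = aggregate_hyperplanes([(np.array([1.0, -0.5]), 0.0),
                                  (np.array([0.9, -0.6]), 0.1)])
print(len(X_keep), w_g, b_g)
```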
Recently, the GEMINI Holographic Particle Image Velocimetry (HPIV) system developed in the Laser Flow Diagnostics (LFD) lab at Kansas State University has been successfully applied to volumetric 3-D flow velocity measurement. Due to the 3-D nature of this application, very large computation and communication requirements are imposed. An innovative algorithm, the Concise Cross Correlation (CCC), is employed in the system to extract the velocity field from the holograms of the test flows. With CCC we achieved a compression ratio of 10^4 and a processing speed 1000 times faster than with traditional 3-D FFT-based correlation. To further accelerate processing for fully time- and space-resolved measurement, parallel processing is necessary. We present our design for a distributed system supporting this previously unparallelized application, and comment on our experiences implementing a master-slave distributed version of CCC using MPI. Brief experimental results on Gigabit Ethernet and multi-processor Pentium Xeon systems are given.
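The CCC algorithm itself is specific to the hologram pipeline, but the master-slave MPI structure mentioned here is a standard pattern. Below is a minimal mpi4py skeleton of a master handing chunks to workers and collecting results; the tags, chunk sizes, and placeholder work function are illustrative, not the actual CCC code.

```python
# Run with e.g.: mpiexec -n 4 python master_worker.py
from mpi4py import MPI

TAG_WORK, TAG_STOP = 1, 2

def process_chunk(chunk):
    return sum(chunk)        # placeholder for per-chunk correlation work

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:                # master: deal out chunks, gather results
    chunks = [list(range(i, i + 10)) for i in range(0, 100, 10)]
    results, active = [], 0
    for w in range(1, size):           # prime every worker once
        if chunks:
            comm.send(chunks.pop(), dest=w, tag=TAG_WORK); active += 1
        else:
            comm.send(None, dest=w, tag=TAG_STOP)
    status = MPI.Status()
    while active:
        results.append(comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG,
                                 status=status))
        w = status.Get_source()
        if chunks:                     # keep the finishing worker busy
            comm.send(chunks.pop(), dest=w, tag=TAG_WORK)
        else:
            comm.send(None, dest=w, tag=TAG_STOP); active -= 1
    print("total:", sum(results))
else:                        # worker: loop until told to stop
    status = MPI.Status()
    while True:
        chunk = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        comm.send(process_chunk(chunk), dest=0, tag=TAG_WORK)
```

This self-scheduling variant also provides the dynamic load balancing that unevenly sized hologram regions would require.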