ISBN (print): 9783642552243
In this paper we investigate the performance-energy balance of a variety of concurrent architectures, from general-purpose and digital signal multicore systems to graphics processors (GPUs), representative of current technology. This analysis employs the conjugate gradient method, an important algorithm for the iterative solution of linear systems that consists mainly of the sparse matrix-vector product and other (minor) vector kernels. To allow a fair comparison, we leverage simple implementations of the numerical method and underlying kernels, and rely only on those optimizations applied by the target compiler.
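The kernel structure named in this abstract can be illustrated with a minimal Python sketch of an unpreconditioned conjugate gradient iteration built around a sparse matrix-vector product. This is not the authors' implementation: the function names (`spmv`, `dot`, `conjugate_gradient`), the CSR storage layout, and the tolerance are assumptions chosen for illustration, and the matrix is assumed to be symmetric positive definite.

```python
from typing import List, Tuple

# Assumed CSR layout: (nonzero values, column indices, row pointers).
CSR = Tuple[List[float], List[int], List[int]]


def spmv(a: CSR, x: List[float]) -> List[float]:
    """Sparse matrix-vector product y = A x for a matrix stored in CSR form."""
    vals, cols, rowptr = a
    return [
        sum(vals[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
        for i in range(len(rowptr) - 1)
    ]


def dot(u: List[float], v: List[float]) -> float:
    """Inner product, one of the minor vector kernels."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a: CSR, b: List[float],
                       tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Solve A x = b for symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, equal to b for the zero start vector
    p = list(r)          # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # stop once the residual norm is small
            break
        ap = spmv(a, p)              # dominant cost: one SpMV per iteration
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # axpy update
        r = [ri - alpha * api for ri, api in zip(r, ap)]   # axpy update
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Usage: the 2 x 2 system [[4, 1], [1, 3]] x = [1, 2] in CSR form.
example_matrix: CSR = ([4.0, 1.0, 1.0, 3.0], [0, 1, 0, 1], [0, 2, 4])
example_solution = conjugate_gradient(example_matrix, [1.0, 2.0])
```

Each iteration performs one sparse matrix-vector product, two inner products, and three vector updates, which is why the sparse product tends to dominate both runtime and energy in studies of this kind.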
ISBN (digital): 9783642551956
ISBN (print): 9783642551956
Due to physical constraints on processor frequency scaling, current computer systems are equipped with more and more processing units. Therefore, parallel computing has become an important paradigm in recent years. AMPL is a comprehensive algebraic modeling language for formulating optimization problems. However, AMPL itself does not support defining tasks to be executed in parallel. Although in recent years parallelism has often been provided by solvers, which take advantage of multiple processing units, in many cases it is more efficient to formulate the problem in a decomposed way and apply various problem-specific enhancements. Moreover, as the number of cores keeps growing, it is possible to use both types of parallelism. This paper presents the design of Parampl, a simple tool for parallel execution of AMPL programs. Parampl introduces explicit asynchronous execution of AMPL subproblems from within the program code. Such an extension implies a new view of AMPL programs, in which a programmer is able to define complex, parallelized optimization tasks and formulate algorithms that solve optimization subproblems in parallel.
ISBN (print): 9783319111940; 9783319111933
There is no dedicated thread mapping method for Many Integrated Core (MIC) heterogeneous systems in the traditional multithreaded programming model. Unreasonable thread mapping prevents the promising computing power of the MIC coprocessor from being fully exploited. To fully exploit the computing potential of the MIC coprocessor, this paper discusses effective multithread mapping strategies by comparing the computing performance of various mapping methods and analyzing the performance differences between them. To further exploit the high computing power of MIC heterogeneous systems, specific program porting and performance optimization strategies are explored using a k-means application program. Experimental results show that the proposed mapping and parallel optimization strategies are effective and can guide programmers in porting and optimizing applications for MIC heterogeneous parallel systems.
ISBN (print): 9781479979783
The amount of data generated by traditional business activities has resulted in data warehouses with sizes of up to petabytes. The ability to analyze this torrent of data will become the basis of competition and growth for individual firms through ever-narrower segmentation of customers, improved decision-making, and the unearthing of valuable insights that would otherwise remain hidden. For this purpose, the large volume of data to be processed requires high-performance analytical systems running in distributed environments. Because the data is so large, it affects the types of algorithms we are willing to consider. Standard analytics algorithms therefore need to be adapted to take advantage of cloud computing models, which provide scalability and flexibility. This work illustrates an implementation of a parallel version of multiple linear regression that can extract coefficients from large amounts of data, based on the MapReduce framework at large scale. Parallel processing of multiple linear regression is based on the QR decomposition and the ordinary least squares method adapted to MapReduce. Our platform is deployed on the Amazon EMR cloud service. Experimental results demonstrate that our parallel version of multiple linear regression can efficiently handle very large datasets on commodity hardware, with good performance on different evaluation criteria, including the number, size, and structure of machines in the cluster.
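One common way to combine QR decomposition with a map/reduce pattern is the "tall and skinny QR" idea: each mapper factorizes its own block of rows, and a reducer factorizes the stacked triangular factors. The Python sketch below illustrates that idea only; it is not the authors' implementation and does not use the Hadoop or Amazon EMR APIs. The function names (`r_factor`, `map_block`, `reduce_blocks`, `solve_ols`), the use of modified Gram-Schmidt, and the assumption that the design matrix has full column rank are all choices made for this illustration.

```python
from typing import List

Matrix = List[List[float]]


def r_factor(a: Matrix) -> Matrix:
    """Upper-triangular R of a thin QR factorization (modified Gram-Schmidt)."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # column-wise copy
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = sum(v * v for v in cols[j]) ** 0.5
        r[j][j] = norm
        if norm == 0.0:
            continue  # linearly dependent column: its row of R stays zero
        q = [v / norm for v in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(qi * ci for qi, ci in zip(q, cols[k]))
            cols[k] = [ci - r[j][k] * qi for ci, qi in zip(cols[k], q)]
    return r


def map_block(x_rows: Matrix, y_vals: List[float]) -> Matrix:
    """Map step: factorize one block of the augmented matrix [X | y]."""
    return r_factor([row + [y] for row, y in zip(x_rows, y_vals)])


def reduce_blocks(r_factors: List[Matrix]) -> Matrix:
    """Reduce step: stack the per-block R factors and factorize once more."""
    return r_factor([row for r in r_factors for row in r])


def solve_ols(r_aug: Matrix) -> List[float]:
    """Back-substitute R beta = Q^T y, both read from the augmented R factor."""
    p = len(r_aug) - 1
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        s = r_aug[i][p] - sum(r_aug[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = s / r_aug[i][i]
    return beta


# Usage: y = 1 + 2 x, with the rows split across two hypothetical mappers.
block_a = map_block([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0])
block_b = map_block([[1.0, 3.0], [1.0, 4.0], [1.0, 5.0]], [7.0, 9.0, 11.0])
coefficients = solve_ols(reduce_blocks([block_a, block_b]))
```

Each mapper emits only a small (p + 1) by (p + 1) triangle, where p is the number of regressors, so the volume of data passed to the reducer does not depend on the number of rows in a block.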
ISBN (print): 9781479961238
Biomedical imaging, digital communications, acoustic signal processing, and many other digital signal processing (DSP) applications are increasingly present in the digital world. They process growing data volumes, represented with ever greater accuracy, and use complex algorithms with time constraints that must be satisfied. Consequently, they are characterized by a high demand for computing power. To satisfy this need, it is now inevitable to use parallel and heterogeneous architectures to speed up processing; the best examples are today's supercomputers such as "Tianhe-2" and "Titan" in the Top500 ranking. These architectures, with their multi-core nodes supported by many-core accelerators, offer a good response to this problem. However, they are still hard to program for performance, for many reasons: expressing parallelism, task synchronization, memory management, handling hardware specifications, load balancing, and so on. In the present work, we characterize DSP applications and propose a programming model based on their distinctive features in order to implement them easily and efficiently on heterogeneous clusters.
ISBN (print): 9783642552243
This paper explores the possibilities of using a GPU for complex 3D finite difference computation. We propose a new approach to this topic using surface memory and compare it with 3D stencil computations carried out via shared memory, which is currently considered to be the best approach. The case study was performed for the extensive computation of collisions between heavy nuclei in terms of relativistic hydrodynamics.
ISBN (digital): 9783319111971
ISBN (print): 9783319111971; 9783319111964
While the GPU is becoming a compelling acceleration solution for a range of scientific applications, most existing work on climate models has achieved only limited speedup. This is due to partial porting of the huge code base and the memory-bound nature of these models. In this work, we design and implement a customized GPU-based acceleration of the Princeton Ocean Model (gpuPOM). Based on Nvidia's state-of-the-art GPU architectures (K20X and K40m), we completely rewrite the original model from Fortran into CUDA-C. Several acceleration methods are presented, including optimizing memory access on a single GPU and overlapping communication and boundary operations among multiple GPUs. The experimental results show that gpuPOM achieves a 6.9-fold to 17.8-fold speedup on one K40m GPU and a 5.8-fold to 15.5-fold speedup on one K20X GPU compared with different Intel CPUs. Further experiments on multiple GPUs indicate that the performance of gpuPOM on a super-workstation containing 4 GPUs is equivalent to that of a powerful cluster consisting of 34 pure CPU nodes with over 400 CPU cores.
ISBN (digital): 9783642551956
ISBN (print): 9783642551956
We present FooPar, an extension for highly efficient parallel computing in the multi-paradigm programming language Scala. Scala offers concise and clean syntax and integrates functional programming features. Our framework FooPar combines these features with parallel computing techniques. FooPar is designed to be modular and supports easy access to different communication backends for distributed memory architectures as well as high-performance math libraries. In this article we use it to parallelize matrix-matrix multiplication and show its scalability by an isoefficiency analysis. In addition, results based on an empirical analysis on two supercomputers are given. We achieve close-to-optimal performance with respect to theoretical peak performance. Based on this result we conclude that FooPar allows programmers to fully access Scala's design features without suffering from performance drops when compared to implementations purely based on C and MPI.
ISBN (print): 9781450328890
This paper formulates the problem of predicting program runtime from algorithm parameters and the characteristics of the computational system used to run the algorithm. It is suggested to build a model representing runtime as a function of algorithm parameters and computational system characteristics. This is followed by determination of the features to be used for recovering the functional dependence. A two-step solution method using linear and non-linear machine learning algorithms is proposed. The paper examines peculiarities of software algorithms and suggests a method for processing experimental data provided by computational systems. It also features a comparative analysis of runtime prediction results for the solution of several linear algebra problems on 84 personal computers and servers using a number of machine learning algorithms. Use of a random forest combined with the linear least squares method shows an error of less than 15% for most computational systems of similar architecture. Copyright 2014 ACM.
ISBN (digital): 9783319111940
ISBN (print): 9783319111940; 9783319111933
In this paper, embeddings of a family of 3D meshes in locally twisted cubes are studied. Let LTQ_n(V, E) denote the n-dimensional locally twisted cube. We find two major results in this paper: (1) For any integer n >= 4, two node-disjoint 3D meshes of size 2 x 2 x 2^(n-3) can be embedded into LTQ_n with dilation 1 and expansion 2. (2) For any integer n >= 6, four node-disjoint 4 x 2 x 2^(n-5) meshes can be embedded into LTQ_n with dilation 1 and expansion 4. Further, an embedding algorithm can be constructed based on our embedding method. The obtained results are optimal in the sense that the dilations of the embeddings are 1.
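The expansion figures quoted in this abstract can be checked with simple arithmetic. The Python sketch below assumes the usual definition of expansion as the ratio of host node count to guest node count, which the abstract itself does not spell out, and it relies on the fact that an n-dimensional locally twisted cube has 2^n nodes. The function name `expansion` is hypothetical.

```python
from math import prod
from typing import Tuple


def expansion(n: int, mesh_dims: Tuple[int, ...]) -> float:
    """Ratio of host nodes (2**n for an n-dimensional cube) to guest mesh nodes."""
    host_nodes = 2 ** n
    guest_nodes = prod(mesh_dims)
    return host_nodes / guest_nodes


# Result (1): each 2 x 2 x 2^(n-3) mesh has 2^(n-1) nodes.
ratio_two_meshes = expansion(4, (2, 2, 2 ** (4 - 3)))

# Result (2): each 4 x 2 x 2^(n-5) mesh has 2^(n-2) nodes.
ratio_four_meshes = expansion(6, (4, 2, 2 ** (6 - 5)))
```

Under this definition, two node-disjoint meshes with expansion 2, or four with expansion 4, together use every node of the host cube exactly once.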