ISBN (print): 9781728112466
The Matrix Chain Ordering Problem is a well-studied optimization problem that seeks the parenthesization minimizing the number of arithmetic operations required to compute a chain of matrix multiplications. Existing algorithms include the O(N^3) dynamic programming algorithm of Godbole (1973) and the faster O(N log N) algorithm of Hu and Shing (1982). We show that both may produce suboptimal parenthesizations on modern machines, as they do not take into account inter-processor communication costs, which often dominate the running time. Further, the optimal solution may change when fast matrix multiplication algorithms are used. We show that the O(N^3) dynamic programming algorithm easily adapts to provide optimal solutions for modern matrix multiplication algorithms, and we obtain an adaptation of the O(N log N) algorithm that guarantees a constant approximation.
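The O(N^3) dynamic program the abstract refers to can be sketched with a pluggable per-multiplication cost function; the same recurrence then accommodates alternative cost models (e.g. communication-aware or fast-multiplication costs) simply by swapping that function. This is a minimal textbook sketch, not the authors' adapted algorithm, and the cost-function hook is our illustrative assumption.

```python
def classical_cost(p, q, r):
    # Arithmetic cost of multiplying a (p x q) matrix by a (q x r) matrix.
    return p * q * r

def matrix_chain_order(dims, cost=classical_cost):
    """Godbole-style O(N^3) dynamic program.

    dims[i], dims[i+1] are the dimensions of matrix i. Returns the minimum
    total cost and a split table from which the parenthesization can be
    reconstructed. Any cost(p, q, r) model may be plugged in.
    """
    n = len(dims) - 1                       # number of matrices in the chain
    best = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # increasing chain lengths
        for i in range(n - length + 1):
            j = i + length - 1
            best[i][j] = float("inf")
            for k in range(i, j):           # try every split point
                c = (best[i][k] + best[k + 1][j]
                     + cost(dims[i], dims[k + 1], dims[j + 1]))
                if c < best[i][j]:
                    best[i][j] = c
                    split[i][j] = k
    return best[0][n - 1], split
```

For dims = [10, 100, 5, 50], the optimal order is (AB)C at a classical cost of 7500 operations; replacing `classical_cost` with a communication-aware model can shift the chosen split, which is exactly the effect the paper studies.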
We study technologies for testing the IOPS and data-transfer speed of disk arrays in mass storage systems. We propose a parallel testing technology for high-performance disk arrays and realize the testing wo...
Parallel efficiency is a fundamental research topic in high-performance computing. This paper focuses on parallel computing on a high-performance computing cluster with the CASTEP program, and discusses multi-core parall...
We present the results of an investigation of the dynamics of the asteroid (99942) Apophis, which will undergo a very close encounter with the Earth on April 13, 2029. The region of possible motions of the asteroid is considered on the time interval (2004, 2040). In addition, it is shown that extending the observational interval (2004, 2006) until 2008 allowed us to reduce significantly the area of possible motions. All investigations were performed by numerical methods with the help of algorithms and software developed by us in a parallel programming environment using the SKIF Cyberia multiprocessor computer of Tomsk State University.
ISBN (print): 0780320182
Parallaxis is a machine-independent language for data-parallel programming, based on sequential Modula-2. Programming in Parallaxis is done at a level of abstraction with virtual processors and virtual connections, which may be defined by the application programmer. This paper describes Parallaxis-III, the current version of the language definition, together with a number of parallel sample algorithms.
ISBN (print): 0818678836
Parallelism suffers from a lack of programming languages that are both simple to handle and able to exploit the power of present parallel computers. If the expression of parallelism is too high level, compilers have to perform complex optimizations, often leading to poor performance. On the other hand, too low a level of parallelism transfers the difficulties toward the programmer. In this paper, we propose a new programming language that integrates both a synchronous data-parallel programming model and an asynchronous execution model. The synchronous data-parallel programming model allows safe program design. The asynchronous execution model yields efficient execution on present MIMD architectures without any program transformation. Our language relies on a logical instruction ordering exploited by specific send/receive communications. It allows expressing only the effective data dependences between processors. This ability is reinforced by allowing unmatched send/receive pairs, which is useful for irregular algorithms. A sparse vector computation exemplifies the language's potential.
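The key mechanism described above, sends and receives matched through a logical instruction ordering, can be illustrated with a toy buffer. This is a hypothetical sketch of the idea, not the paper's language: each send is tagged with the sender's logical step, a receive names the exact (sender, step) dependence it needs, and a send with no matching receive simply stays buffered, mimicking the unmatched sends allowed for irregular algorithms.

```python
class LogicalChannel:
    """Toy message store keyed by (sender, logical step)."""

    def __init__(self):
        self.buffer = {}                    # (sender, step) -> value

    def send(self, sender, step, value):
        # Post a value produced by `sender` at logical step `step`.
        self.buffer[(sender, step)] = value

    def receive(self, sender, step):
        # Consume exactly the dependence identified by the logical ordering.
        return self.buffer.pop((sender, step))

ch = LogicalChannel()
ch.send("P0", step=1, value=3.5)    # P0's step-1 result
ch.send("P0", step=2, value=7.0)    # never received: allowed to stay unmatched
x = ch.receive("P0", step=1)        # another processor pulls only what it needs
```

Because messages are identified by logical position rather than arrival order, execution can proceed asynchronously while the program is still reasoned about synchronously.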
Parallel computing is notoriously challenging due to the difficulty of developing correct and efficient programs. With the arrival of multi-core processors for desktop systems, desktop applications must now be paralle...
Electromagnetic researchers are often faced with long execution times, and therefore algorithmic- and implementation-level optimization can dramatically increase the overall performance of electromagnetism simulation usi...
ISBN (print): 0818684038
Irregular particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more and more supercomputing platforms that can support different programming models become available to users, from tightly-coupled hardware-coherent machines to clusters of workstations or SMPs, to truly deliver on its ease-of-programming advantages to application users it is important that the shared address space model not only perform and scale well in the tightly-coupled case but also port well in performance across the range of platforms (as the message passing model can). For tree-based N-body applications, this is currently not true: while the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, in particular on platforms with less aggressive, commodity-oriented communication architectures (even though it takes less than 3 percent of the time in most sequential executions). We therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model: an SGI Origin2000 (an aggressive hardware cache-coherent machine with physically distributed memory), an SGI Challenge bus-based shared memory multiprocessor, an Intel Paragon running a shared virtual memory protocol in software at page granularity, and a Wisconsin Typhoon-zero in which the granularity of coherence can be varied using hardware support but the protocol runs in software (in the last case using both a page-based and a fine-grained protocol). We find that the algorithms used successfully and widely distributed so far for the first two platforms cause overall application performance to be very poor on the latter two commodit...
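To make the "tree building phase" concrete, here is a minimal sequential Barnes-Hut-style quadtree build. This is a generic textbook sketch, not any of the five parallel methods the paper compares; the point is that parallel versions must coordinate many concurrent insertions into shared cells, which is why this phase stresses the communication architecture far more than its small sequential share of the runtime suggests.

```python
class Node:
    """A square cell of the quadtree; a leaf holds at most one particle."""

    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half   # cell center and half-width
        self.body = None        # leaf payload: an (x, y) particle
        self.children = None    # four sub-cells once the leaf splits

    def quadrant(self, x, y):
        return (x >= self.cx) * 2 + (y >= self.cy)

    def child_for(self, x, y):
        if self.children is None:           # lazily create the four sub-cells
            h = self.half / 2
            self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]
        return self.children[self.quadrant(x, y)]

def insert(node, x, y):
    # Note: coincident particles would recurse forever; real codes cap depth.
    if node.children is None and node.body is None:
        node.body = (x, y)                  # empty leaf: store the particle
        return
    if node.body is not None:               # occupied leaf: split and push down
        bx, by = node.body
        node.body = None
        insert(node.child_for(bx, by), bx, by)
    insert(node.child_for(x, y), x, y)
```

In a shared address space parallelization, many processors perform `insert` concurrently, so interior cells become contended shared data, the source of the bottleneck studied above.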
ISBN (print): 9781479961238
In this paper, we share our experiences in using two important yet different High Performance Computing (HPC) architectures to evaluate two HPC algorithms. The first architecture is an Intel x64 ISA based homogeneous multicore with a Uniform Memory Access (UMA) shared-memory Symmetric Multi-Processing system. The second architecture is an IBM Power ISA based heterogeneous multicore with a Non-Uniform Memory Access (NUMA) distributed-memory Asymmetric Multi-Processing system. The two HPC algorithms predict biological molecular structures, specifically RNA secondary structures. The first algorithm is a parallelized version of a popular serial RNA secondary structure prediction algorithm called PKNOTS. The second is a new parallel-by-design algorithm that we have developed called MARSs. Using real Ribo-Nucleic Acid (RNA) sequences, we conducted large-scale experiments involving hundreds of sequences with the two algorithms. Based on the thousands of data points collected in these experiments, we report the observed performance metrics for both algorithms on the two architectures. Through our experiments, we infer that architectures with specialized co-processors for number crunching, along with a high-speed memory bus and dedicated bus controllers, generally perform better than general-purpose multi-processor architectures. In addition, we observed that algorithms that are intrinsically parallel by design scale and perform better by taking advantage of the underlying parallel architecture. We further share best practices for handling scalability with regard to workload size. We believe our results are applicable to other HPC applications on similar HPC architectures.
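The dynamic-programming structure underlying RNA secondary structure prediction can be illustrated with the simple Nussinov base-pair-maximization recurrence. This is emphatically not PKNOTS or MARSs (PKNOTS handles pseudoknots and thermodynamic energy models); it only shows the O(N^3) diagonal-by-diagonal fill pattern that makes such predictors natural parallelization targets, since every cell on one diagonal can be computed independently.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(seq):
    """Maximum number of nested base pairs in an RNA sequence."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):               # each diagonal is independent
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j],       # i left unpaired
                       dp[i][j - 1])       # j left unpaired
            if (seq[i], seq[j]) in PAIRS:
                best = max(best, dp[i + 1][j - 1] + 1)   # i pairs with j
            for k in range(i + 1, j):      # bifurcation into two substructures
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]
```

A parallel-by-design predictor in the spirit of the paper would distribute the cells of each anti-diagonal across cores and synchronize between diagonals, which is why memory bandwidth and bus architecture matter so much for its scaling.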