The particle-particle method for N-body problems is one of the most commonly used methods in computer-driven physics simulation. These algorithms are, in general, very simple to design and code, and highly parallelizable. In this article, we present the most important approaches for applying three areas of performance improvement to these algorithms when they are executed on high performance computing (HPC) clusters: 1) sequential optimization (a single core in a node of the cluster), 2) shared memory parallelism (a single node with multiple CPUs available, just like a multiprocessor), and 3) distributed memory parallelism (the whole cluster). For each improvement area we present the techniques employed and the performance gain obtained. We also show how some (sequential/classical) code optimizations are almost essential for obtaining at least acceptable parallel performance and scalability.
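As a rough illustration of what a particle-particle (direct-sum) step looks like, the sketch below computes pairwise gravitational accelerations in O(N^2) time. It is not the authors' code: the function name `accelerations`, the gravitational constant `g`, and the softening term `eps` are assumptions made for this example. The inner loop visits each pair once and applies Newton's third law, which is one simple instance of the kind of sequential optimization the abstract refers to.

```python
import math


def accelerations(pos, mass, g=1.0, eps=0.0):
    """Direct particle-particle sum of gravitational accelerations.

    pos  : list of [x, y, z] positions
    mass : list of masses, same length as pos
    g    : gravitational constant (hypothetical default of 1.0)
    eps  : softening term added to r^2 to avoid division by zero
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        # Visit each unordered pair (i, j) exactly once.
        for j in range(i + 1, n):
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                # Equal and opposite contributions (Newton's third law).
                acc[i][k] += g * mass[j] * d[k] * inv_r3
                acc[j][k] -= g * mass[i] * d[k] * inv_r3
    return acc
```

The outer loop over `i` is the natural unit to divide among threads or cluster nodes, although the pairwise update to `acc[j]` then needs care to avoid conflicting writes.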
ISBN: 9780889868113 (print)
Progress in semiconductor technology makes it possible to implement a large system on a single chip, and physical problems such as IR drop and electromigration have become serious for VLSI circuits. To alleviate these problems, VLSI circuit optimization that includes a fast and accurate circuit simulator is necessary. This paper describes a fast and accurate parallel transient simulator for RLC power grid circuits that runs on a GPU (Graphics Processing Unit) using CUDA. The GPU is a processor specialized for graphics processing, and its architecture is quite distinctive. This paper proposes a transient simulation method that takes the features of the GPU architecture into account. Experimental results show that the proposed transient simulator runs 173 times faster than simulation on a CPU, and that the simulation error between the proposed simulator and a simulator executed on the CPU is under 0.01%.
ISBN: 9780889866386 (print)
Efficient range query processing support is required for Distributed Hash Table (DHT)-based P2P networks, as consistent hashing destroys the inherent order of numeric keys. In this paper, we present a lightweight and efficient mechanism, called Range Hash Tree (RHT), to support range queries on DHTs. In RHT, the key space is partitioned into small ranges such that the data items within one range can be stored on one node. To resolve a range query quickly, we introduce a scalable encoding algorithm that can represent the partitioning status of the key space with a small RHT Signature. Compared to other approaches, RHT provides a bounded query delay, which is independent of the size of the range query and the number of matching data items. Our experiments show that RHT works efficiently on both uniform and highly skewed data distributions.
ISBN: 9780889866379 (print)
Active and passive replication are powerful techniques to improve the quality of multimedia streaming. Most systems follow either the active or the passive approach. A well-known example of active replication is Content Distribution Networks [8], which replicate data to predefined static locations. In contrast, P2P file sharing networks [2, 1] use passive replication, where identical content is usually provided by different peers. We suggest a system that combines both techniques using Proxy Affinity, Request Affinity and Replication Affinity, considering user preferences, user behaviour, hardware resources and network capabilities.
ISBN: 9780889866386 (print)
A future is a parallel programming language construct that enables programmers to specify potentially asynchronous computations. We present and empirically evaluate a novel implementation of futures for Java. Our futures implementation is a JVM extension that couples estimates of future computational granularity with underlying resource availability to enable automatic and adaptive decisions of when to spawn futures in parallel or to execute them sequentially. Our system builds on, combines, and extends (i) lazy task creation and (ii) a JVM sampling infrastructure previously used solely for dynamic and adaptive compilation. We empirically evaluate our system using different benchmarks, triggers for automatic spawning of futures, processor availability, and JVM configurations. We show that our future implementation for Java is efficient and scalable for fine-grained Java futures without requiring programmer intervention.
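The spawn-or-inline decision described above can be illustrated with a small sketch. It is written in Python with the standard `concurrent.futures` module rather than as a JVM extension, and it is not the authors' mechanism: the helper `spawn_or_inline` and its parameters `estimated_cost`, `threshold`, and `idle_workers` are hypothetical stand-ins for the granularity estimates and resource information that the paper obtains from JVM sampling.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def spawn_or_inline(executor, fn, arg, estimated_cost, threshold, idle_workers):
    """Return a Future for fn(arg).

    The call is handed to the executor only when the estimated cost is at
    least `threshold` and a worker is believed to be idle. Otherwise it runs
    immediately in the caller's thread and an already-completed Future is
    returned. All decision inputs are assumptions for illustration.
    """
    if estimated_cost >= threshold and idle_workers > 0:
        return executor.submit(fn, arg)

    # Fine-grained work: run it sequentially and wrap the outcome.
    done = Future()
    try:
        done.set_result(fn(arg))
    except Exception as exc:
        done.set_exception(exc)
    return done


def square(x):
    return x * x


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        coarse = spawn_or_inline(pool, square, 3, estimated_cost=10,
                                 threshold=5, idle_workers=1)
        fine = spawn_or_inline(pool, square, 4, estimated_cost=1,
                               threshold=5, idle_workers=1)
        print(coarse.result(), fine.result())
```

Because both paths return a `Future`, the calling code does not need to know which decision was made, which mirrors the abstract's goal of avoiding programmer intervention.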
ISBN: 9780889868113 (print)
We are developing MegaScript, a task-parallel script language for executing large-scale workflows on widely distributed heterogeneous environments. For efficient execution of this language, we have proposed a multi-layered task scheduling scheme: the upper layer performs rough global scheduling, and the lower layer performs precise local scheduling. However, the cost of local scheduling is still a serious issue. Therefore, we propose an adaptive scheduling scheme appropriate to this kind of workflow. The scheme adaptively switches between DAG scheduling and independent task scheduling, reducing the scheduling cost for independent task sets in the workflow. The results of our evaluation show that our scheme achieved a 540 times speedup of total scheduling time when each host executes 100 tasks on average, without serious extension of the makespan (less than 7%).
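The idea of switching between two schedulers can be sketched as follows. This is not the MegaScript scheduler: the function names, the greedy least-loaded heuristic for independent tasks, and the simple list scheduler for dependent tasks are all assumptions chosen to keep the example short. It only shows the dispatch: a task set with no dependency edges takes the cheaper path that skips dependency analysis entirely.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def schedule_independent(costs, n_hosts):
    """Greedy placement: largest task first, onto the least-loaded host."""
    load = [0.0] * n_hosts
    plan = {}
    for task in sorted(costs, key=costs.get, reverse=True):
        host = min(range(n_hosts), key=load.__getitem__)
        plan[task] = host
        load[host] += costs[task]
    return plan, max(load)


def schedule_dag(costs, preds, n_hosts):
    """List scheduling in topological order, respecting predecessors."""
    free_at = [0.0] * n_hosts
    finish = {}
    plan = {}
    graph = {t: preds.get(t, ()) for t in costs}
    for task in TopologicalSorter(graph).static_order():
        ready = max((finish[p] for p in preds.get(task, ())), default=0.0)
        host = min(range(n_hosts), key=free_at.__getitem__)
        start = max(ready, free_at[host])
        finish[task] = start + costs[task]
        free_at[host] = finish[task]
        plan[task] = host
    return plan, max(finish.values(), default=0.0)


def schedule(costs, preds, n_hosts):
    """Pick the cheaper scheduler when the task set has no dependencies.

    costs : dict mapping task name to estimated run time
    preds : dict mapping task name to a set of predecessor task names
    Returns (plan, makespan), where plan maps task name to host index.
    """
    if not any(preds.get(t) for t in costs):
        return schedule_independent(costs, n_hosts)
    return schedule_dag(costs, preds, n_hosts)
```

For example, `schedule({"a": 3, "b": 2, "c": 1}, {}, 2)` takes the independent path, while adding `{"b": {"a"}}` as `preds` routes the same tasks through the DAG path.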
ISBN: 9780889866379 (print)
In this paper, the design and implementation of a recently developed clustering algorithm, NNCA [1] (Nearest Neighbour Clustering Algorithm), is proposed in conjunction with a Fast K Nearest Neighbour (FKNN) strategy to further reduce processing time. The parallel algorithm (PNNCA) can cluster the pixels of retinal images into those belonging to blood vessels and those not belonging to blood vessels in a reasonable time.
ISBN: 9780889868113 (print)
Multi-dimensional fixed-point fast Fourier transform (FFT) methods were developed and tested on two general-purpose many-core architecture platforms. The first is the highly parallel, fine-grained eXplicit Multi-Threaded (XMT) single-chip parallel architecture, which targets reducing single-task completion time. The second is an 8x dual-core AMD Opteron 8220. The results show that the former outperforms the latter not only in speedup and ease of programming, but also with small data sets (small-scale parallelism). One of our results on XMT was a super-linear speedup (by a factor larger than the number of processors), observed under some rather unusual circumstances.
ISBN: 9780889866379 (print)
We introduce a parallelized molecular dynamics (MD) simulation adapted for the IBM Blue Gene/L supercomputer. We begin by describing the parallel MD code. Next, we discuss how parallel MD was tuned for Blue Gene/L. We then show the results for some test targets, related to disease-associated proteins, that we have run on Blue Gene/L, and the efficiency we have achieved. Finally, we mention some future directions that we envisage as a continuation of this project.
ISBN: 9780889866386 (print)
The Multi-Level Computing Architecture (MLCA) is a novel parallel System-on-a-Chip architecture targeted at multimedia applications. It features a top-level controller that automatically extracts task-level parallelism using techniques similar to those by which superscalar processors extract instruction-level parallelism. This allows the MLCA to support a simple programming model that is similar to sequential programming. To help programmers port multimedia applications to the MLCA programming model easily and efficiently, a compilation environment is designed. This compilation environment enhances parallelism in MLCA programs by applying three simple code transformations that are based on known compiler optimizations. In this paper, we describe the MLCA, its programming model, its compilation environment and an evaluation of its performance. Our experimental evaluation with three real multimedia applications and an MLCA simulator shows that the MLCA is a viable architecture and that scaling speedups can be obtained using the compilation environment with little programmer effort.