ISBN: 9783540854500 (print)
This work is devoted to the numerical resolution of the 4D Vlasov equation using an adaptive mesh of phase space. We previously proposed a parallel algorithm designed for distributed-memory architectures. The underlying numerical scheme makes a parallelization based on block-based mesh partitioning possible. The efficiency of this algorithm relies on maintaining a good load balance at low cost during the whole simulation. In this paper, we propose a dynamic load-balancing mechanism based on a geometric partitioning algorithm. This mechanism is deeply integrated into the parallel algorithm in order to minimize overhead. Performance measurements on a PC cluster show the good quality of our load balancing and confirm the relevance of our approach.
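The abstract does not specify the geometric partitioner. As a hedged illustration only (the function name and greedy strategy are assumptions, not the paper's algorithm), one simple geometric scheme splits blocks, ordered along one axis, into contiguous groups of near-equal total work:

```python
def partition_blocks(weights, nparts):
    """Greedy contiguous partition of per-block work weights into nparts groups.
    Returns a list of lists of block indices (a sketch, not the paper's method)."""
    total = sum(weights)
    target = total / nparts
    parts, current, acc = [], [], 0.0
    for i, w in enumerate(weights):
        current.append(i)
        acc += w
        remaining_parts = nparts - len(parts) - 1
        remaining_items = len(weights) - i - 1
        # close the current group once it reaches the target, as long as
        # every remaining group can still receive at least one block
        if acc >= target and remaining_parts > 0 and remaining_items >= remaining_parts:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts
```

Rebalancing then amounts to recomputing the cut points whenever measured per-block work drifts.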
ISBN: 9780769530994 (print)
Parallel I/O technology is one of the key technologies for high-performance computers. Firstly, this paper introduces the I/O systems of typical machines from the newest Top500 list. Secondly, a new distributed shared parallel I/O system for high-performance computers (DSPIO) is put forward, and some key technologies implemented in the system are discussed. Finally, a prototype system is built. The experimental results show that this architecture offers high I/O bandwidth and good scalability, and is well suited to high-performance computing.
ISBN: 9783540695004 (print)
Traditionally, the block-based medial axis transform (BB-MAT) and the chessboard distance transform (CDT) were viewed as two completely different image computation problems, especially in three-dimensional (3D) space. We achieve the computation of the 3D CDT problem by implementing the 3D BB-MAT algorithm first. For a 3D binary image of size N^3, our parallel algorithm runs in O(log N) time using N^3 processors on the concurrent-read exclusive-write (CREW) parallel random access machine (PRAM) model to solve both the 3D BB-MAT and 3D CDT problems. In addition, we have implemented a message passing interface (MPI) program on an AMD Opteron Model 270 cluster system to verify the proposed parallel algorithm, since the PRAM model is not available in the real world. The experimental results show that the speedup saturates when more than four processors are used, regardless of the problem size.
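The paper's algorithm is a parallel 3D PRAM/MPI formulation, which the abstract does not detail. As a minimal sequential illustration of the CDT itself (a 2D chamfer-style simplification, not the authors' method), the chessboard distance of every object pixel to the nearest background pixel can be computed exactly with two raster scans:

```python
def chessboard_dt(img):
    """img: list of rows with 1 = object, 0 = background.
    Returns the L-infinity (chessboard) distance of each pixel to the
    nearest background pixel, via a forward and a backward chamfer pass."""
    h, w = len(img), len(img[0])
    INF = h + w  # safe upper bound on any chessboard distance in the image
    d = [[0 if img[y][x] == 0 else INF for x in range(w)] for y in range(h)]
    # forward pass: propagate from already-visited (top/left) neighbors
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    d[y][x] = min(d[y][x], d[ny][nx] + 1)
    # backward pass: propagate from bottom/right neighbors
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            for dy, dx in ((1, 1), (1, 0), (1, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    d[y][x] = min(d[y][x], d[ny][nx] + 1)
    return d
```

Because all eight chessboard neighbors carry unit weight, the two-pass chamfer scan yields the exact L-infinity distance map.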
ISBN: 9780769530994 (print)
We consider power dissipation during simple switching in information processing. By considering a general two-level system, we show that the energy dissipated during errorless switching has a minimum of kT ln 2 and increases linearly with switching speed. We also find the optimal switching function, which minimizes heat dissipation for a given error rate. We present some estimates and compare them with results for CMOS technology.
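To put the kT ln 2 lower bound (Landauer's limit) in perspective, a quick numerical estimate at room temperature:

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K (exact SI value)
T = 300.0           # room temperature, K

# Minimum energy dissipated per errorless switch of a two-level system
e_min = k_B * T * math.log(2)
print(f"{e_min:.3e} J")  # prints 2.871e-21 J
```

This is several orders of magnitude below the switching energies of present CMOS gates, which is what leaves room for the trade-offs the abstract describes.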
ISBN: 9781424418985 (print)
The proceedings contain 52 papers. The topics discussed include: fast custom instruction identification by convex subgraph enumeration; bit matrix multiplication in commodity processors; security processor with quantum key distribution; dynamically reconfigurable regular expression matching architecture; an efficient implementation of a phase unwrapping kernel on reconfigurable hardware; a parallel hardware architecture for connected component labeling based on fast label merging; design space exploration of a cooperative MIMO receiver for reconfigurable architectures; dynamic holographic reconfiguration on a four-context ODRGA; FPGA-based hardware accelerator of the heat equation with applications on infrared; FPGA-based singular value decomposition for image processing applications; accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs; and reconfigurable acceleration of microphone array algorithms for speech enhancement.
ISBN: 9781424421015 (print)
A ubiquitous processor, HCgorilla, followed the Java CPU for multimedia processing and built in RNGs (random number generators) for cipher processing. HCgorilla thus had an execution stage composed of several units for this sophisticated processing. Since the units of the execution stage were physically separate, each function took a different latency. This required instruction scheduling similar to that of regular superscalar processors. In this paper, we describe an improvement of HCgorilla that solves this issue. Specifically, the execution stage composed of arithmetic units is wave-pipelined as a whole. This completely merges the parallel structure without physical separation. The waved multifunctional execution unit is effective in realizing wide-range dynamic ILP (instruction-level parallelism) at a rate higher than regular superscalar processors.
H.264/AVC is the latest video coding standard, adopting variable block size motion estimation (VBS-ME), quarter-pixel accuracy, motion vector prediction, and multi-reference frames for motion estimation. These new features result in much higher computation requirements than previous coding standards. In this paper we propose a novel most significant bit (MSB) first bit-serial architecture for full-search block-matching VBS-ME and compare it with systolic implementations. Since the nature of MSB-first processing enables early termination of the sum of absolute differences (SAD) calculation, the average hardware performance can be enhanced. Five different designs, one- and two-dimensional systolic and tree implementations along with the bit-serial one, are compared in terms of performance, pixel memory bandwidth, occupied area, and power consumption.
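The early-termination idea can be illustrated in software. This sketch is a pixel-serial approximation of the paper's bit-serial MSB-first hardware (the function names and the pixel-serial ordering are assumptions for illustration): a SAD accumulation aborts as soon as the running sum can no longer beat the best candidate found so far.

```python
def sad_early_exit(block_a, block_b, best_so_far):
    """SAD of two equal-length pixel lists, or None if the running sum
    already reaches best_so_far (early termination)."""
    total = 0
    for a, b in zip(block_a, block_b):
        total += abs(a - b)
        if total >= best_so_far:  # cannot improve on the current best: abort
            return None
    return total

def full_search(candidate_blocks, current_block):
    """Full-search block matching: return (index, SAD) of the best candidate,
    skipping hopeless candidates early."""
    best, best_idx = float("inf"), -1
    for i, cand in enumerate(candidate_blocks):
        s = sad_early_exit(cand, current_block, best)
        if s is not None:
            best, best_idx = s, i
    return best_idx, best
```

In the hardware design the same effect is obtained per bit plane from the MSB down, which bounds the partial SAD even earlier than this pixel-by-pixel version.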
ISBN: 9780769534343 (print)
This paper presents a case for exploiting the synergy of dedicated and opportunistic network resources in a distributed hosting platform for data stream processing applications. Our previous studies have demonstrated the benefits of combining dedicated reliable resources with opportunistic resources for high-throughput computing applications, where timely allocation of the processing units is the primary concern. Since distributed stream processing applications demand a large volume of data transmission between the processing sites at a consistent rate, adequate control over the network resources is important here to assure a steady flow of processing. In this paper, we propose a system model for the hybrid hosting platform, where stream processing servers installed at distributed sites are interconnected with a combination of dedicated links and the public Internet. Decentralized algorithms have been developed for allocating the two classes of network resources among the competing tasks, with the objective of higher task throughput and better utilization of expensive dedicated resources. Results from an extensive simulation study show that, with proper management, systems exploiting the synergy of dedicated and opportunistic resources yield considerably higher task throughput, and thus a higher return on investment, than systems using expensive dedicated resources alone.
ISBN: 9780769530994 (print)
We consider sorting problems based on compare-and-exchange operations on partially connected mesh networks, where n nodes are arranged in a sequence and each connects to its k nearest neighbors on both sides. Each node holds a distinct key, and these keys need to be sorted into a given order. We present a sequential algorithm with 3/8 kn^2 + O(n log n) time complexity and a parallel algorithm with 3/2 kn + O(log n) time complexity.
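For the special case k = 1 (each node connected only to its immediate neighbors), compare-and-exchange sorting on such a linear array reduces to the classic odd-even transposition sort, sketched below; the paper's algorithms for general k are not reproduced here.

```python
def odd_even_transposition_sort(keys):
    """Sort by compare-and-exchange between adjacent positions only,
    alternating even and odd neighbor pairs; n phases suffice."""
    a = list(keys)
    n = len(a)
    for phase in range(n):
        start = phase % 2  # 0: pairs (0,1),(2,3),...  1: pairs (1,2),(3,4),...
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:           # compare-and-exchange with right neighbor
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

On real hardware each phase runs all pairs in parallel, giving O(n) parallel time for k = 1, consistent with the 3/2 kn + O(log n) bound above at k = 1.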
ISBN: 9783540681052 (print)
As multicore systems continue to gain ground in the high-performance computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. Compared to the standard approach, as in LAPACK, this may result in an out-of-order execution of the tasks that completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of BLAS operations.
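The dependency-driven scheduling idea can be sketched generically: tasks become runnable as soon as all of their prerequisites have completed, which permits out-of-order execution. The kernels below are placeholders, not the actual tile QR kernels, and a real runtime would dispatch ready tasks to idle cores rather than run them serially.

```python
from collections import deque

def schedule(tasks, deps):
    """tasks: dict name -> callable; deps: dict name -> set of prerequisite names.
    Runs each task once all of its prerequisites have completed; returns the
    execution order (one valid topological order of the task DAG)."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, ds in remaining.items():
        for d in ds:
            dependents[d].append(t)
    ready = deque(t for t, ds in remaining.items() if not ds)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()  # a parallel runtime would hand this to any free core
        order.append(t)
        for u in dependents[t]:
            remaining[u].discard(t)
            if not remaining[u]:  # last prerequisite finished: u is runnable
                ready.append(u)
    return order
```

In the tile QR setting, the task names would be the per-tile factorization and update kernels, and the dependency sets encode which tiles each kernel reads and writes.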