ISBN (print): 0769507840
Current processor allocation techniques for highly parallel systems rely on centralized, front-end-based algorithms. As a result, the applied strategies are restricted to static allocation, low parallelism and weak fault tolerance. To lift these restrictions, we are investigating a distributed approach to the processor allocation problem in large distributed-memory machines. A contiguous and a noncontiguous version of a distributed dynamic processor allocation strategy are proposed and studied in this paper. Simulations compare the performance of the proposed strategies with that of well-known centralized algorithms. We also present the results of experiments on a Siemens hpcLine Primergy Server with 96 nodes, which show that distributed allocation is feasible with current technologies.
ISBN (print): 0769523129
An extended particle swarm optimizer (EPSO) is proposed in this paper. In this new algorithm, both the local and the global best positions found so far are involved in the particle's velocity update. EPSO integrates two paradigms of canonical PSO and is expected to yield steadier optimization progress. The experimental results indicate that EPSO is an applicable means of utilizing the population information and that it deserves further investigation.
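As a rough illustration of a velocity update that draws on both a local and a global best, the sketch below adds one attraction term per best position. The abstract does not give the update formula, so the three-term form, the function name `epso_velocity`, and the coefficients `w`, `c1`, `c2`, `c3` are all assumptions made for illustration, not the authors' definition.

```python
import random

def epso_velocity(x, v, pbest, lbest, gbest, rng,
                  w=0.7, c1=1.4, c2=0.7, c3=0.7):
    """Return a new velocity vector for one particle.

    Assumed form (not taken from the paper):
        v' = w*v + c1*r1*(pbest - x) + c2*r2*(lbest - x) + c3*r3*(gbest - x)
    where pbest is the particle's own best position, lbest the best in its
    neighborhood, and gbest the best in the whole swarm.
    """
    new_v = []
    for xi, vi, pi, li, gi in zip(x, v, pbest, lbest, gbest):
        # Fresh random weights per dimension, as in canonical PSO.
        r1, r2, r3 = rng.random(), rng.random(), rng.random()
        new_v.append(w * vi
                     + c1 * r1 * (pi - xi)
                     + c2 * r2 * (li - xi)
                     + c3 * r3 * (gi - xi))
    return new_v

def move(x, v):
    """Advance a position by one step along its velocity."""
    return [xi + vi for xi, vi in zip(x, v)]
```

Setting `c3=0` or `c2=0` reduces the sketch to a local-best-only or global-best-only update, which is one way to read the abstract's "integration of two paradigms".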
This paper proposes a new parallel search procedure for the dynamic multi-objective traveling salesman problem. We design a multi-objective TSP in a stochastic dynamic environment. The proposed procedure first uses parall...
ISBN (print): 9798350387117; 9798350387124
Nonuniform grid refinement plays a fundamental role in simulating realistic flows with a multitude of length scales. We introduce the first GPU-optimized implementation of this technique in the context of the lattice Boltzmann method (LBM). Our approach focuses on enhancing GPU performance while minimizing memory access bottlenecks. We employ kernel fusion techniques to optimize memory access patterns, reduce synchronization overhead, and minimize kernel launch latencies. Additionally, our implementation ensures efficient memory management, resulting in lower memory requirements than baseline LBM implementations that were designed for distributed systems. By enabling grid refinement on a single GPU, our implementation allows simulations of unprecedented domain size (e.g., 1596 x 840 x 840) on a single A100-40 GB GPU. We validate our code against published experimental data. Our optimization improves the performance of the baseline algorithm by 1.3-2x. We also compare against current state-of-the-art solutions for grid refinement LBM and show an order of magnitude speedup.
ISBN (print): 0818677937
Sparse matrix problems require a communication paradigm different from those used in conventional distributed-memory multiprocessors. In this paper, we show how fine-grain communication can help obtain high performance on the experimental distributed-memory multiprocessor EM-X, developed at ETL, which can handle fine-grain communication very efficiently. The sparse matrix kernel Conjugate Gradient (CG) is selected for the experiments. Among the steps in CG, the study focuses on sparse matrix-vector multiplication. Several communication methods are developed for performance comparison, including coarse-grain and fine-grain implementations. Fine-grain communication allows exact data access in an unstructured problem, which reduces the amount of communication. While CG presents bottlenecks in the form of a large number of fine-grain remote reads, the multithreaded principles of execution are designed to tolerate such latency. Experimental results indicate that the performance of the fine-grain read implementation is comparable to that of the coarse-grain implementation on 64 processors. The results demonstrate that fine-grain communication can be a viable and efficient approach to unstructured sparse matrix problems on large-scale distributed-memory multiprocessors.
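To make "exact data access" concrete, the following sketch performs a sparse matrix-vector product in CSR form and lists which vector entries a processor would have to read remotely. It assumes a square matrix, a contiguous block-row partition, and a vector partitioned the same way as the rows; none of these details come from the abstract, and the function names are hypothetical.

```python
def spmv_csr(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

def remote_columns(indptr, indices, row_start, row_end):
    """Entries of x needed by rows [row_start, row_end) but owned elsewhere.

    Assumption: the owner of rows [row_start, row_end) also owns
    x[row_start:row_end], so any other column index is a remote read.
    """
    needed = set()
    for k in range(indptr[row_start], indptr[row_end]):
        needed.add(indices[k])
    return sorted(c for c in needed if not (row_start <= c < row_end))

# A small 4x4 example matrix in CSR form.
indptr = [0, 2, 3, 5, 6]
indices = [0, 3, 1, 0, 2, 3]
data = [2.0, 1.0, 3.0, 1.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]

y = spmv_csr(indptr, indices, data, x)
first_half = remote_columns(indptr, indices, 0, 2)
second_half = remote_columns(indptr, indices, 2, 4)
```

In this toy case each of two processors needs only one remote entry of `x`, whereas a coarse-grain scheme that ships the whole remote block would transfer two.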
ISBN (print): 0818675829
In this paper, an efficient, run-time, statistical scheme for estimating the execution time of a task is presented, in order to facilitate run-time matching and scheduling in a distributed heterogeneous computing environment. This scheme is based upon a nonparametric regression technique, where the execution time estimate for a task is computed from past observations. Furthermore, this technique is able to compensate for different parameters upon which the execution time depends, and does not require any knowledge of the architecture of the target machine. It is also able to make accurate predictions when erroneous data is present in the set of observations, and has been experimentally shown to produce estimates with very low error, even with few past values from which to calculate a new estimate.
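One simple nonparametric estimator that predicts from past observations is k-nearest-neighbor regression, sketched below. The abstract does not say which nonparametric technique is used, so the choice of k-NN, the Euclidean distance, the plain average, and the names `estimate_time` and `history` are assumptions for illustration only.

```python
import math

def estimate_time(history, params, k=3):
    """Estimate execution time as the mean of the k closest past runs.

    history: list of (parameter_tuple, observed_seconds) pairs.
    params:  parameter tuple describing the task to be scheduled.
    No model of the target machine is used; only observations matter.
    """
    if not history:
        raise ValueError("no past observations to estimate from")
    # Sort past runs by distance between their parameters and the query.
    ranked = sorted(history, key=lambda obs: math.dist(obs[0], params))
    nearest = ranked[:k]
    return sum(t for _, t in nearest) / len(nearest)

# Hypothetical observations: (problem size,) -> seconds.
history = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
prediction = estimate_time(history, (2.0,), k=3)
```

A plain average like this is sensitive to outliers; replacing it with a median of the neighbors would be one way to tolerate erroneous observations, though that too is a design choice made here rather than something stated in the source.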
ISBN (print): 081864222X
In this paper, we consider the following problem: given a point z in the interior of a simple polygon P with n vertices, find all the boundary points of P that are 'visible' from z. We present two parallel algorithms for this problem on a two-dimensional reconfigurable mesh. The first algorithm runs in O(log^2 n) time using n^(1/2) × n^(1/2) processors, and the second runs in O(1) time using n × n processors.
ISBN (print): 9780769534718
Service Level Agreements (SLAs) are currently one of the major research topics in Grid Computing. Among the many system components that support SLA-aware Grid-based workflows, the SLA mapping module occupies an important position. Mapping light-communication workflows is one main part of the mapping module. With the previously proposed mapping algorithm, the mapping module may become the bottleneck of the system when many requests arrive in a short period of time. This paper presents a parallel mapping algorithm for light-communication SLA-based workflows that can cope with this problem. Performance measurements provide evaluation results on the quality of the method.
ISBN (print): 0769516866
Despite the large I/O capabilities of modern cluster architectures with local disks on each node, applications are mostly unable to exploit them fully. This is especially problematic for data-intensive applications, which often suffer from low I/O performance. As one solution to this problem, a Distribution I/O Management (DIOM) system has been developed to manage a transparent distribution of data across cluster nodes and then allow applications to access this data purely from local disks. To be effective, however, this distribution process requires semantic information about both the application and the input data. This work therefore extends DIOM to include independent specifications for both data formats and application I/O patterns, thereby decoupling them. This work is driven by an application from nuclear medical imaging, the reconstruction of PET images, for which DIOM has proven to be an adequate solution, enabling truly scalable I/O and thereby improving the overall application performance.
In multicomputers, an appropriate data distribution is crucial for reducing communication overhead and therefore improving overall performance. For this reason, data parallel languages provide programmers with primitives, such as BLOCK and CYCLIC, that can be used to distribute data across the distributed memory. However, the languages do not guide the programmer on how the distribution should be performed to maximize performance. Therefore, this paper presents an analysis of data distribution methods for overlapping computation and communication in the Gaussian elimination algorithm. The analysis indicates that both BLOCK and CYCLIC distributions have their own merits; however, BLOCK_CYCLIC, with its hybrid characteristic, consistently outperforms its counterparts.
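The three distributions named above can be sketched as row-to-processor mappings. The formulas below follow the usual textbook definitions of BLOCK, CYCLIC and BLOCK_CYCLIC for a one-dimensional row distribution; the function names and the helper `active_rows`, which counts the rows each processor still updates after a given pivot step of Gaussian elimination, are illustrative assumptions rather than the paper's own code.

```python
from math import ceil

def block_owner(i, n, p):
    """BLOCK: n rows split into p contiguous chunks of ceil(n / p) rows."""
    return i // ceil(n / p)

def cyclic_owner(i, p):
    """CYCLIC: rows dealt out to processors round-robin."""
    return i % p

def block_cyclic_owner(i, p, b):
    """BLOCK_CYCLIC: chunks of b rows dealt out round-robin."""
    return (i // b) % p

def active_rows(owners, k, p):
    """Rows with index > k per processor, i.e. rows still updated
    after pivot step k of Gaussian elimination."""
    counts = [0] * p
    for i, owner in enumerate(owners):
        if i > k:
            counts[owner] += 1
    return counts

n, p = 8, 4
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
block_load = active_rows(block, 3, p)
cyclic_load = active_rows(cyclic, 3, p)
```

With 8 rows on 4 processors, after pivot step 3 the BLOCK layout leaves two processors with no rows to update, while CYCLIC keeps one row on each. BLOCK_CYCLIC sits between the two, trading some of that balance for larger contiguous chunks.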