ISBN: 0769525091 (print)
In this paper, a technique is proposed to deal with the problem of poor locality and false sharing in irregular codes on shared memory multiprocessors (SMPs). This technique is based on the locality model for irregular codes previously developed and extensively proven by the authors on mono-processors and multiprocessors. In the model, locality is established at run time from parameters that describe the structure of the sparse matrix that characterizes the irregular accesses. As an example of irregular code with false sharing, a particular implementation of the sparse matrix-vector product (SpM x V) was selected. The problem of increasing locality and decreasing false sharing for an irregular problem is formulated as a graph. An adequate distribution of the graph among processors, followed by a reordering of the nodes inside each processor, produces the solution. The results show important improvements in the behavior of the irregular accesses: reductions in execution time and improved program scalability.
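For readers unfamiliar with the kernel, the following is a minimal sketch of a sparse matrix-vector product. It assumes the common CSR (compressed sparse row) storage layout; the abstract does not say which layout or implementation the authors used, and the function and variable names here are illustrative only. The indirect read `x[col_idx[k]]` is the kind of irregular access whose locality depends on the structure of the sparse matrix.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed layout)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0  # accumulate locally, then write y[i] once
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Irregular access: the index into x comes from the matrix structure.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (hypothetical data):
# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

In a shared-memory version, how rows (or nonzeros) are distributed among processors determines which entries of `x` and `y` each processor touches, which is where locality and false sharing come into play.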
In this paper we investigate the problem of finding a delay- and degree-bounded maximum sum of nodes application level multicast tree. We then prove that the problem is NP-hard, and its relationship with the well-studied ...
ISBN: 0769525091 (print)
This is an overview of the material to be discussed in the invited keynote presentation by H. J. Siegel; it summarizes our research in [2, 16, and 17]. The resources in parallel computer systems (including heterogeneous clusters) should be allocated to the computational applications in a way that maximizes some system performance measure. However, allocation decisions and the associated performance prediction are often based on estimated values of application and system parameters. The actual values of these parameters may differ from the estimates; for example, the estimates may represent only average values, the models used to generate the estimates may have limited accuracy, and there may be changes in the environment. Thus, an important research problem is the development of resource management strategies that can guarantee a particular system performance given such uncertainties. To address this problem, we have designed a model for deriving the degree of robustness of a resource allocation: the maximum amount of collective uncertainty in system parameters within which a user-specified level of system performance (QoS) can be guaranteed. The model will be presented, and we will demonstrate its ability to select the most robust resource allocation from among those that otherwise perform similarly (based on the primary performance criterion). The model's use in allocation heuristics will also be demonstrated. This model is applicable to different types of computing and communication environments, including parallel, distributed cluster, grid, Internet, embedded, and wireless.
ISBN: 0769524869 (print)
The three-dimensional discrete cosine transform (3D DCT) has been widely used in many applications such as video compression. On the other hand, the k-ary n-cube is one of the most popular interconnection networks used in many recent multicomputers. As direct calculation of the 3D DCT is very time consuming, many researchers have been working on developing algorithms and special-purpose architectures for its fast computation. This paper proposes a parallel algorithm for efficient calculation of the 3D DCT on k-ary n-cube multicomputers. The time complexity of the proposed algorithm is O(N) for an N x N x N input data cube, while direct calculation of the 3D DCT has a complexity of O(N^6).
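The O(N^6) figure for direct calculation can be made concrete with a small sketch: each of the N^3 output coefficients is a sum over all N^3 input samples. The code below compares that direct form with the standard separable (axis-by-axis) evaluation, which gives the same result. It assumes an unnormalized DCT-II, all function names are illustrative, and it is not the parallel k-ary n-cube algorithm of the paper, whose details the abstract does not give.

```python
import math


def dct_1d(v):
    """Unnormalized 1D DCT-II of a list (assumed convention)."""
    n = len(v)
    return [
        sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def dct_3d_direct(x):
    """Direct 3D DCT: N^3 outputs, each a sum over N^3 inputs, so O(N^6)."""
    n = len(x)
    c = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for k in range(n)]
         for i in range(n)]
    return [[[sum(x[i][j][k] * c[i][u] * c[j][v] * c[k][w]
                  for i in range(n) for j in range(n) for k in range(n))
              for w in range(n)] for v in range(n)] for u in range(n)]


def dct_3d_separable(x):
    """Same transform computed as 1D DCTs along each axis in turn."""
    n = len(x)
    # Axis 2: transform every innermost vector.
    a = [[dct_1d(x[i][j]) for j in range(n)] for i in range(n)]
    # Axis 1: transform along j for fixed (i, k).
    b = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for k in range(n):
            col = dct_1d([a[i][j][k] for j in range(n)])
            for j in range(n):
                b[i][j][k] = col[j]
    # Axis 0: transform along i for fixed (j, k).
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for j in range(n):
        for k in range(n):
            col = dct_1d([b[i][j][k] for i in range(n)])
            for i in range(n):
                out[i][j][k] = col[i]
    return out


# Small hypothetical 3 x 3 x 3 data cube.
n = 3
cube = [[[float(i * 9 + j * 3 + k) for k in range(n)] for j in range(n)]
        for i in range(n)]
direct = dct_3d_direct(cube)
separable = dct_3d_separable(cube)
max_diff = max(abs(direct[u][v][w] - separable[u][v][w])
               for u in range(n) for v in range(n) for w in range(n))
```

With this naive 1D transform, the separable form costs O(N^4) sequentially; the independent 1D transforms along each axis are what a parallel algorithm can distribute across processors.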
ISBN: 076952429X (print)
In order to utilize the tremendous computing power of graphics hardware and to automatically adapt to the fast and frequent changes in its architecture and performance characteristics, this paper implements an automatic tuning system to generate high-performance matrix-multiplication implementations on graphics hardware. The automatic tuning system uses a parameterized code generator to generate multiple versions of matrix multiplication, whose performance is empirically evaluated by actual execution on the target platform. An ad-hoc search engine is employed to search over the implementation space for the version that yields the best performance. In contrast to similar systems on CPUs, which utilize cache blocking, register tiling, and instruction scheduling as tuning strategies, this paper identifies and exploits several tuning strategies that are unique to graphics hardware. These tuning strategies include optimizing for multiple render targets, SIMD instructions with data packing, and overcoming limitations on instruction count and dynamic branch instructions. The generated implementations have performance comparable to an expert's manually tuned version, in spite of the significant overhead incurred by the use of the high-level BrookGPU language.
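The generate-measure-select loop described above can be sketched as follows. This is only an illustration of empirical tuning in general: it uses a CPU-style block size as the single tuning parameter and an exhaustive search, both of which are assumptions made here, and it does not model the GPU-specific strategies or the BrookGPU code generator of the paper. All names are hypothetical.

```python
import time


def blocked_matmul(a, b, block):
    """Square matrix product with loop blocking; `block` is the tuning parameter."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


def autotune(candidates, n=32, repeats=2):
    """Time each candidate on the target machine and return the fastest one."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


candidates = [4, 8, 16, 32]
best_block, timings = autotune(candidates)
```

The selected value depends on the machine the code runs on, which is the point of empirical tuning: the search is repeated on each target platform instead of relying on a fixed performance model.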
The proceedings contain 41 papers from Parallel Computing Technologies: 8th International Conference, PaCT 2005. The topics discussed include: on evaluating the performance of security protocols; timed equivalence for timed event structures; similarity of generalized resources in Petri nets; real-time event structures and Scott domains; early-stopping k-set agreement in synchronous systems prone to any number of process crashes; allowing atomic objects to coexist with sequentially consistent objects; an approach to the implementation of the dynamical priorities method; information flow analysis for VHDL; and composing fine-grained parallel algorithms for spatial dynamics simulation.
ISBN: 095391402X (print)
Challenging problems such as single- or multi-objective optimization often require running thousands of simulations to complete the task. In this context, the use of properly trained Artificial Neural Networks (ANNs) as a substitute for the hydraulic simulator can drastically reduce the computing time required by the new algorithms. In this paper we show different ANN architectures used to reproduce the behaviour of a complex water distribution system (WDS). The different input/output variables to be considered in each stage are discussed, as well as the training of the ANNs. A parallel version of the training algorithm is presented, which significantly reduces the training time. All these techniques have been applied successfully to train an ANN to reproduce precisely the behaviour of a complex WDS such as the Valencia network model.
Author: Sun, JC (Chinese Acad Sci, Inst Software, R&D Ctr Parallel Software, Beijing 100080, Peoples R China)
ISBN: 3540292357 (print)
In this paper, the problem of partitioning parallel dodecahedrons in 3D is examined. Two schemes are introduced and their convergence rates discussed. A parallel fast solver was implemented and tested experimentally, with the performance results presented.
ISBN: 9783540320715 (digital)
ISBN: 3540292357 (print)
Several algorithms for distributed mutual exclusion are discussed, a unified model of them in terms of a finite-population queuing system is proposed, and a simulation-based performance study is presented under the assumption that they use multicast communication where possible. To formally represent the algorithms for simulation, a class of extended Petri nets is used. The simulation was carried out in the Winsim simulation system, which is based on this class of Petri nets.
In this study, we have successfully developed a grid-enabled software distributed shared memory called Teamster-G. This system provides users with not only a shared memory programming interface but also a transparent ...