The message passing interface (MPI) is designed as an architecture-independent interface for parallel programming in the shared-nothing, message-passing paradigm. We briefly summarize basic requirements for a high-quality implementation of MPI for efficient programming of SMP clusters and related architectures, and discuss possible mild extensions of the topology functionality of MPI which, while retaining a high degree of architecture independence, can make MPI more useful and efficient for message-passing programming of SMP clusters. We show that the discussed extensions can all be implemented on top of MPI with very little environmental support.
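As an illustration of how node-aware functionality can be layered on top of standard MPI with little environmental support, the minimal sketch below groups processes by SMP node using only MPI_Get_processor_name and MPI_Comm_split; the hashing helper and its naming are illustrative assumptions, not the extensions proposed in the paper.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: derive a per-node "color" by hashing the processor
   name, using only standard MPI calls. Distinct node names could in
   principle collide; a production version would disambiguate further. */
static int node_color(void) {
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);
    unsigned int h = 5381u;                  /* simple djb2 string hash */
    for (int i = 0; i < len; i++)
        h = h * 33u + (unsigned char)name[i];
    return (int)(h & 0x7fffffff);            /* MPI_Comm_split needs a non-negative color */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* One sub-communicator per SMP node, ranks ordered by world rank. */
    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_color(), world_rank, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("world rank %d is rank %d of %d on its SMP node\n",
           world_rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}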
The mobile agent model has the potential to provide a flexible framework for facing the challenges of high-performance computing, especially when targeted towards heterogeneous distributed architectures. We developed a framework for supporting the programming and execution of mobile-agent-based distributed applications, the MAGDA (Mobile Agents Distributed Applications) toolset. It supplements mobile agent technology with a set of features for supporting parallel programming in a dynamic heterogeneous distributed environment.
UPC is an explicit parallel extension of ANSI C which has been gaining increasing attention from vendors and users. In this paper, we consider the low-level monitoring and experimental performance evaluation of a new implementation of the UPC compiler on the SGI Origin family of NUMA architectures. These systems offer many opportunities for a high-performance implementation of UPC. They also offer, thanks to their many hardware monitoring counters, the opportunity for low-level performance measurements to guide compiler implementations. Early UPC compilers face the challenge of meeting the syntax and semantics requirements of the language. As a result, such compilers tend to focus on correctness rather than on performance. In this paper, we report on the performance of selected applications and kernels under this new compiler. The measurements were designed to help shed some light on the next steps that should be taken by UPC compiler developers to harness the full performance and usability potential of UPC on these architectures.
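Since UPC extends ANSI C with shared data and parallel loop constructs, a tiny generic kernel of the kind such compilers must translate is sketched below; it is illustrative only and not one of the paper's benchmarks, and it assumes a static-THREADS compilation environment.

#include <upc.h>
#include <stdio.h>

#define N 1024

/* Shared arrays distributed cyclically across threads (default layout). */
shared double a[N], b[N], c[N];

int main(void) {
    int i;

    /* Owner-computes initialization: the affinity expression &a[i] makes
       the thread with affinity to a[i] execute iteration i. */
    upc_forall (i = 0; i < N; i++; &a[i]) {
        a[i] = (double)i;
        b[i] = 2.0 * i;
    }
    upc_barrier;

    /* Shared vector addition, again distributed by affinity. */
    upc_forall (i = 0; i < N; i++; &c[i])
        c[i] = a[i] + b[i];
    upc_barrier;

    if (MYTHREAD == 0)   /* reading c[10] may be a remote access */
        printf("c[10] = %f (THREADS = %d)\n", c[10], THREADS);
    return 0;
}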
In this paper, we analyze the performance of the floating-point digital signal processor (DSP) TMS320C6711 for an implementation of video coding motion estimation. Two relevant motion estimation techniques were implemented: BMA (block matching algorithm) and BMGT (block matching using geometric transforms). These were combined with fast block matching algorithms to speed up the process. In order to increase the DSP performance, we optimized several programming mechanisms: the level of code parallelism, hand-designed assembly code, and efficient usage of internal memory as cache. This implementation has shown that real-time motion estimation of the BMA type can be implemented on this DSP. However, BMGT-type motion estimation cannot be performed by one DSP alone in real-time applications, due to its high computational complexity.
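For reference, the core of a BMA is the search for the displacement minimizing a block distortion measure. The sketch below shows an exhaustive full-search variant with the sum of absolute differences in plain C, as an illustrative baseline rather than the optimized DSP implementation; the fast algorithms mentioned above prune this candidate set.

#include <stdlib.h>
#include <limits.h>

/* Full-search block matching over a +/-R window using SAD.
   Frames are assumed to be row-major 8-bit luma; the caller must ensure
   the search window stays inside the reference frame. */
typedef struct { int dx, dy; } MotionVector;

static int sad(const unsigned char *cur, const unsigned char *ref,
               int stride, int bx, int by, int dx, int dy, int B) {
    int s = 0;
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++)
            s += abs(cur[(by + y) * stride + (bx + x)] -
                     ref[(by + y + dy) * stride + (bx + x + dx)]);
    return s;
}

MotionVector full_search(const unsigned char *cur, const unsigned char *ref,
                         int stride, int bx, int by, int B, int R) {
    MotionVector best = {0, 0};
    int best_sad = INT_MAX;
    for (int dy = -R; dy <= R; dy++)
        for (int dx = -R; dx <= R; dx++) {
            int s = sad(cur, ref, stride, bx, by, dx, dy, B);
            if (s < best_sad) { best_sad = s; best.dx = dx; best.dy = dy; }
        }
    return best;   /* displacement of the best-matching reference block */
}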
ISBN (print): 9780769520261
Typical grid computing scenarios involve many distributed hardware and software components. The more components that are involved, the more likely it is that one of them may fail. In order for grid computing to succeed, there must be a simple mechanism to determine which component failed and why. Instrumentation of all grid applications and middleware is an important part of the solution to this problem. However, it must be possible to control and adapt the amount of instrumentation data produced so that users and tools are not flooded by it. We describe a scalable, high-performance instrumentation activation mechanism that addresses this problem.
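A minimal sketch of the kind of activation check such a mechanism relies on is shown below; the function and level names are hypothetical, not the paper's API, and a real system would forward events to a monitoring service rather than printing to stderr.

#include <stdarg.h>
#include <stdio.h>

/* Events are produced only when the current level for a component is at or
   above the level requested at the call site, so the volume of data can be
   adapted at run time without recompiling instrumented code. */
enum instr_level { INSTR_OFF = 0, INSTR_ERRORS = 1, INSTR_EVENTS = 2, INSTR_TRACE = 3 };

static volatile enum instr_level current_level[8];   /* one slot per component */

void instr_set_level(int component, enum instr_level lvl) {
    current_level[component] = lvl;    /* e.g. driven remotely by a monitoring service */
}

void instr_event(int component, enum instr_level lvl, const char *fmt, ...) {
    if (current_level[component] < lvl)
        return;                        /* cheap early exit when deactivated */
    va_list ap;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);         /* stand-in for shipping the event to a collector */
    va_end(ap);
}

/* Usage sketch:
     instr_set_level(0, INSTR_EVENTS);
     instr_event(0, INSTR_TRACE, "job %d: file transfer started\n", 42);  // suppressed
     instr_event(0, INSTR_ERRORS, "job %d: transfer failed\n", 42);       // emitted   */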
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine xGEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as a metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix with 128 Trident lanes, the speedup of using matrix-vector operations in the standard Golub and Kahan algorithm is around 1.5 times over using vector operations. However, using matrix operations in the xGEBRD subroutine gives a speedup of around 3 times over vector operations, and 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.
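For orientation, one left Householder step of the Golub and Kahan algorithm is sketched below in plain C, making explicit the matrix-vector product and rank-1 update that the abstract maps onto the Trident lanes; it is an illustrative sketch of the textbook step, not the simulated code.

#include <math.h>

/* Annihilate column k of A below the diagonal with a Householder reflector
   H = I - tau*v*v^T and apply H to the trailing submatrix.
   A is m x n, row-major in a[i*n + j]. */
void left_householder_step(double *a, int m, int n, int k) {
    double norm = 0.0;
    for (int i = k; i < m; i++) norm += a[i*n + k] * a[i*n + k];
    norm = sqrt(norm);
    if (norm == 0.0) return;

    double alpha = (a[k*n + k] >= 0.0) ? -norm : norm;  /* sign chosen for stability */
    double v0 = a[k*n + k] - alpha;                     /* v = x - alpha*e1 */
    double tau = -1.0 / (alpha * v0);                   /* equals 2 / (v^T v) */

    for (int j = k; j < n; j++) {
        /* w_j = v^T A(k:m-1, j)  -- the matrix-vector product */
        double w = v0 * a[k*n + j];
        for (int i = k + 1; i < m; i++) w += a[i*n + k] * a[i*n + j];
        /* rank-1 update of column j: A(:, j) -= tau * w_j * v */
        a[k*n + j] -= tau * w * v0;
        if (j > k)
            for (int i = k + 1; i < m; i++) a[i*n + j] -= tau * w * a[i*n + k];
    }
    a[k*n + k] = alpha;                                 /* bidiagonal entry */
    for (int i = k + 1; i < m; i++) a[i*n + k] = 0.0;   /* introduced zeros */
}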
Increasing system complexity of SOC applications leads to an increasing demand for powerful embedded DSP processors. To increase the computational power of DSP processors, the number of pipeline stages has been increased to reach higher frequencies, and the number of instructions executed in parallel has been increased to raise the computational bandwidth. To program the parallel units, the VLIW (very long instruction word) has been introduced. Programming all parallel units at the same time leads either to an expanded program memory port or to the limitation that only a few units can be used in parallel. To overcome this limitation, this paper proposes a scalable long instruction word (xLIW). The xLIW concept allows full usage of the available units in parallel with optimal code density. An included instruction buffer reduces the power dissipation at the program memory ports during loop handling. The xLIW concept is part of a development project for a configurable DSP.
In this paper, we present first experiences with the integration of MPI-based numerical software into an advanced programming environment for building parallel and distributed high-performance applications, which is under development in the context of Italian national research projects. This programming environment, named ASSIST, is based on a combination of the concepts of structured parallel programming and component-based programming. Some activities within the projects are devoted to the definition, implementation, and testing of a methodology for the integration of a parallel numerical library into ASSIST. The goal is to provide a set of efficient, accurate, and reliable tools that can easily be used as building blocks for high-performance scientific applications. We focus on the integration of existing and widely used MPI-based numerical library modules. To this aim, we propose a general approach to embedding MPI computations in the ASSIST basic programming unit. This approach has been tested using the MPICH implementation of MPI for networks of workstations. Some modifications have been applied to the MPICH process startup procedure in order to make it compliant with the ASSIST environment. Results of experiments concerning the integration of routines from a well-known FFT package are discussed.
ISBN (print): 9781581137620
Recent efforts in adapting computer networks into system-on-chip (SOC) designs, or networks-on-chip, fall short of traditional computer systems for the lack of an effective programming model, while not taking full advantage of the almost unlimited on-chip bandwidth. In this paper, we propose a new programming model, called context-flow, that is simple, safe, and highly parallelizable, yet transparent to the underlying architectural details. An SOC platform architecture is then designed to support this programming model while fully exploiting the physical proximity between the processing elements. We demonstrate the performance efficiency of this architecture over bus-based and packet-switch-based networks through two case studies using a multiprocessor architecture simulator.
Exploiting thread-level parallelism (TLP) is a promising way to improve the performance of applications with the advent of general-purpose, cost-effective uniprocessor and shared-memory multiprocessor systems. In this paper, we describe the OpenMP implementation in the Intel® C++ and Fortran compilers for Intel platforms. We present our major design considerations and decisions in the Intel compiler for generating efficient multithreaded code guided by OpenMP directives and pragmas. We describe several transformation phases in the compiler for the OpenMP parallelization. In addition to compiler support, the OpenMP runtime library is a critical part of the Intel compiler. We present runtime techniques developed in the Intel OpenMP runtime library for exploiting thread-level parallelism, as well as for integrating the OpenMP support with other forms of threading, termed sibling parallelism. The performance results for a set of benchmarks show good speedups over well-optimized serial code performance on Intel® Pentium- and Itanium-processor-based systems.
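A small example of the directive-driven parallelism referred to above is sketched below: an OpenMP-aware compiler outlines such loops into threaded routines and inserts calls into its runtime library for team creation, scheduling, and the reduction. This is generic OpenMP C, not one of the benchmarks measured in the paper.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Worksharing loop: iterations are divided among the thread team. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;

    /* Parallel reduction: each thread accumulates a private partial sum,
       which the runtime combines at the end of the region. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f, computed with up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}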