With the shrinking of transistors continuing to follow Moore's Law and the non-scalability of conventional out-of-order processors, multi-core systems are becoming the design choice for industry. The burden of extracting performance is thus largely shifted from the hardware to the programmer/compiler camp, who now have to expose thread-level parallelism (TLP) to the underlying system in the form of explicitly parallel applications. Unfortunately, parallel programming is hard and error-prone. The programmer has to parallelize the work, perform the data placement, and deal with thread synchronization. Systems that support speculative multithreaded execution, such as Thread Level Speculation (TLS), offer an interesting alternative since they relieve the programmer of the burden of parallelizing applications and correctly synchronizing them. Since systems that support speculative multithreading usually treat all threads equally, they are energy-inefficient. This inefficiency stems from the fact that speculation occasionally fails and, thus, power is spent on threads that will have to be discarded. In this paper we propose a power allocation scheme for TLS systems, based on Dynamic Voltage and Frequency Scaling (DVFS), that tries to remedy this inefficiency. More specifically, we propose a profitability-based power allocation scheme, where we "steal" power from non-profitable threads and use it to speed up more useful ones. We evaluate our techniques for a state-of-the-art TLS system and show that, with minimal hardware support, they lead to improvements in ED of up to 39.6% with an average of 21.2%, for a subset of the SPEC 2000 Integer benchmark suite.
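The idea of moving power from unprofitable threads to useful ones can be illustrated with a small sketch. This is not the scheme evaluated in the paper: the `profitability` score, the per-thread power floor, the proportional split, and the cubic power-to-frequency relation are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SpecThread:
    name: str
    # Hypothetical score in [0, 1]: estimated chance that the thread's work commits.
    profitability: float


def allocate_power(threads, budget, floor):
    """Split a fixed power budget across threads.

    Every thread keeps a minimum `floor`; the remaining power is shared
    in proportion to profitability, so low-scoring threads give up power
    to high-scoring ones. All three parameters are illustrative.
    """
    if floor * len(threads) > budget:
        raise ValueError("budget too small for the per-thread floor")
    spare = budget - floor * len(threads)
    total = sum(t.profitability for t in threads)
    if total == 0:
        # No thread looks profitable: fall back to an equal split.
        return {t.name: budget / len(threads) for t in threads}
    return {t.name: floor + spare * t.profitability / total for t in threads}


def relative_frequency(power, nominal_power):
    """Frequency scale factor under the assumption that power grows as f**3
    (voltage scaled together with frequency); a common first-order DVFS model."""
    return (power / nominal_power) ** (1.0 / 3.0)
```

For example, with a budget of 10 units, a floor of 1 unit, and two threads scored 0.9 and 0.1, the first thread receives 8.2 units and the second 1.8, while the total stays at the budget.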
ISBN: (print) 9781479905874
Heterogeneous parallel systems are becoming mainstream computing platforms nowadays. One of the main challenges the development community is currently facing is how to fully exploit the available computational power when porting existing programs or developing new ones with available techniques. In this direction, several design space exploration methods have been presented and extensively adopted. However, defining the feasible design space of a dynamic dataflow program still remains an open issue. This paper proposes a novel methodology for defining such a space through a serial execution. Homotopy theoretic methods are used to demonstrate how the design space of a program can be reconstructed from its serial execution trajectory. Moreover, the concept of dependencies graph of a dataflow program defined in the literature is extended with the definition of two new kinds of dependencies - the Guard Enable and Disable - and the 3-tuple notion needed to represent them.
The mpC (message-passing C) language was developed to write efficient and portable programs for a wide range of distributed memory machines. It supports both task and data parallelism, allows both static and dynamic process and communication structures, enables optimizations aimed at both communication and computation, and supports modular parallel programming and the development of a library of parallel programs. The language is an ANSI C superset based on the notion of a network comprising processor nodes of different types and performances, connected with links of different bandwidths. The user can describe a network topology, create and discard networks, and distribute data and computations over networks. The mpC programming environment uses the topological information at run-time to ensure the efficient execution of the application. This paper describes the implementation of network management in the mpC programming environment.
Parallel programming is difficult. The need for correct and efficient parallel programs is important, and one way to meet this requirement is to work on the refinement chain. Beginning with a specification written in TLA+ (for instance), we can transform it, or refine it, into finer-grained specifications. At some step, enough structure will have appeared so that we can bridge a gap to fill this structure. We introduce a more concrete version of TLA+, CTLA, where structuring concerns are to be expressed, but where distributing, mapping or implementation problems are avoided. Indeed, we firmly believe that it is a mistake to go immediately from TLA+ to a real language like CC++, since the ditch is still too wide. A numerical example supports our claim.
Although Java was not specifically designed for the computationally intensive numeric applications that are the typical fodder of highly parallel machines, its widespread popularity and portability make it an interesting candidate vehicle for massively parallel programming. With the advent of high-performance optimizing Java compilers, the open question is: How can Java programs best exploit massive parallelism? The authors have been contemplating this question via libraries of Java routines for specifying and coordinating parallel codes. It would be most desirable to have these routines written in 100%-Pure Java; however, a more expedient solution is to provide Java wrappers (stubs) to existing parallel coordination libraries, such as MPI. MPI is an attractive alternative because, like Java, it is portable. We discuss both approaches here. In undertaking this study, we have also identified some minor modifications of the current language specification that would make 100%-Pure Java parallel programming more natural.
ISBN: (print) 9781424465330
The Pilot library offers a new method for programming parallel clusters in C. Formal elements from Communicating Sequential Processes (CSP) were used to realize a process/channel model of parallel computation that reduces opportunities for deadlock and other communication errors. This simple model, plus an application programming interface (API) fashioned on C's formatted I/O, is designed to make the library easy for novice scientific C programmers to learn. Optional runtime services, including deadlock detection, help the programmer to debug communication issues. Pilot forms a thin layer on top of standard Message Passing Interface (MPI), preserving the latter's portability and efficiency, with little performance impact. MPI's powerful collective operations can still be accessed within the conceptual model.
Specifying dataflow applications efficiently is one of the greatest challenges facing Network-on-Chip (NoC) simulation and exploration. BTS (Behavior-level Traffic Simulation) was proposed to specify behavior-level applications more efficiently than the conventional message-passing programming model does. To alleviate the complexity of parallel programming, BTS has the computation tasks implemented as sequential modules with data shared among them. Parameterization was also proposed in BTS to produce pseudo messages pointing to the shared data, and to fulfill data-driven scheduling. As substitutes for conventional parallel applications, BTS-based ones inherit their computation models and the underlying scheduling schemes. The pseudo messages are consistent in function and size with those of their ancestors. BTS-based applications and conventional ones will therefore produce identical traffic and identical results for NoC simulation. Case studies showed that BTS may speed up application specification by reusing existing sequential codes, especially domain-specific languages implemented as libraries of sequential sub-routines.
This paper addresses the issues of programming a multi-level parallel *** computer has an architecture that combines multi-level parallelism for efficient *** exploit the full potential of this architecture, special features are added to its programming language along with special functions in its *** base language is similar to *** keep the original OpenCL hierarchical (global, local and private) memory organization while extending OpenCL with features and library functions for message passing and remote function *** also add short vector types and operations that are frequently used in graphics and image *** features and library functions facilitate effective parallel programming using a combination of multi-level parallelism.
This paper is devoted to the research of bitmap image processing based on wavelet functions. The Daubechies wavelet function was used as a mathematical model for filtering, compression and smoothing of two-dimensional signals, because the analysis of existing wavelet functions showed that the Daubechies wavelet family is the most effective for image processing. OpenMP parallel programming in C/C++ was used to parallelize the computations in image processing problems.
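The row-level decomposition that such a parallelization typically relies on can be sketched as follows. This is an illustration only, not the paper's code: it is written in Python with a thread pool rather than OpenMP in C/C++, it uses the four-tap Daubechies (D4) filter with periodic boundary handling as an assumed choice, and the function names `d4_row` and `d4_rows_parallel` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)

# Daubechies D4 low-pass (scaling) filter coefficients.
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# Matching high-pass (wavelet) filter: g[k] = (-1)**k * h[3 - k].
G = [H[3], -H[2], H[1], -H[0]]


def d4_row(row):
    """One level of the D4 transform on a single row, with periodic wrap-around.

    Returns the approximation coefficients followed by the detail coefficients.
    """
    n = len(row)
    if n < 4 or n % 2 != 0:
        raise ValueError("row length must be even and at least 4")
    approx, detail = [], []
    for i in range(0, n, 2):
        window = [row[(i + k) % n] for k in range(4)]
        approx.append(sum(h * x for h, x in zip(H, window)))
        detail.append(sum(g * x for g, x in zip(G, window)))
    return approx + detail


def d4_rows_parallel(image, workers=4):
    """Transform every row independently; rows share no data, so they can be
    handed to separate workers (the analogue of a parallel loop over rows)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(d4_row, image))
```

Because each row is processed independently, the same loop structure maps directly onto an OpenMP parallel loop over rows. In CPython, threads will not speed up this pure-Python arithmetic, so the sketch shows the decomposition rather than a performance gain.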
Parallel programming has to date remained inaccessible to the average scientific programmer. Parallel programming languages are generally foreign to most scientific applications programmers, who only speak Fortran. Automatic parallelization techniques have so far proved unsuccessful in extracting large amounts of parallelism from sequential codes and do not encourage development of new, inherently parallel algorithms. In addition, there is a lack of consistency of programmer interface across architectures, which requires programmers to invest a lot of effort in porting code from one parallel machine to another. This paper discusses the object-oriented Fortran language and support routines developed at Mississippi State in support of parallelizing complex field simulations. This interface is based on Fortran to ease its acceptance by scientific programmers and is implemented on top of the Unix operating system for portability.