Recently, networked and cluster computing have become very popular. This paper is an introduction to aCe C, a new C-based parallel language for architecture-adaptive programming. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them with the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should also be an ideal tool for teaching parallel programming. In this paper, the authors focus on some fundamental features of aCe C.
Typically, only technical arguments such as performance, cost, or scalability are discussed when programming models and languages for high-performance computing facilities are under consideration. In this paper, we investigate the impact of human factors, such as personal preferences, perceptions, and experience, on making technical decisions. We queried the large HPC community of the Sharcnet project in Ontario about general preferences and about the detailed usage of language features and programming style. The main result of our study is that, as often claimed in the past but never proven, shared-memory programming models and architectures appear to be the ideal for the majority of users, even if the main architecture of the project is a distributed-memory cluster. However, experience appears to be able to quickly overcome initial difficulties in using message passing.
The data-flow graph (DFG) of a parallel application is frequently used to make scheduling decisions, based on the information that it models (dependencies among the tasks and the volume of exchanged data). In the case of MPI-based programs, the DFG may be built at run time by overloading the data-exchange primitives. This article presents a library that enables the generation of the DFG of an MPI program, and its use to analyze network contention on a test application: the Linpack benchmark. It is the first step towards the automatic mapping of an MPI program onto a distributed architecture.
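The interception idea described above can be sketched in a few lines. Real MPI tracing libraries typically hook the profiling interface (PMPI); the plain-Python stand-in below is purely illustrative (the names `raw_send`, `traced_send`, and `dfg_edges` are hypothetical, not the library's API), and shows how overloading the send primitive accumulates DFG edges weighted by bytes exchanged.

```python
from collections import defaultdict

# Hypothetical stand-in for the underlying MPI send primitive.
def raw_send(src, dst, payload):
    pass  # the real data transfer would happen here

# Edge weights of the data-flow graph: (src, dst) -> total bytes exchanged.
dfg_edges = defaultdict(int)

def traced_send(src, dst, payload):
    """Overloaded send: record the dependency edge, then forward the call."""
    dfg_edges[(src, dst)] += len(payload)
    raw_send(src, dst, payload)

# Simulated communication pattern of a tiny run.
traced_send(0, 1, b"x" * 1024)
traced_send(0, 1, b"x" * 512)
traced_send(1, 2, b"y" * 256)

print(sorted(dfg_edges.items()))
# [((0, 1), 1536), ((1, 2), 256)]
```

Because the wrapper only adds bookkeeping before delegating, the traced program behaves identically to the untraced one, which is what makes run-time DFG construction transparent to the application.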
The availability of low-cost commodity multiprocessor machines is changing the nature of mainstream programming. To remain competitive, the discipline must accommodate small-scale, dual- and quad-processor machines. These small-scale parallel systems require software engineering principles capable of encapsulating the complex issues of parallel programming. This paper discusses a technique that provides a simple model for incorporating parallel programming in a scheduler. This model can dynamically adjust to single-processor and small-scale multiprocessor environments.
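A minimal sketch of such an adaptive scheduler, using Python's standard library (the function name `run_tasks` is illustrative, not the paper's model): the worker count follows however many processors the host exposes, so the same code degrades gracefully to a single-processor machine and scales up on a dual or quad.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks):
    """Run the given zero-argument callables with a pool sized to the
    machine: one worker on a uniprocessor, more on multiprocessor hosts."""
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: f(), tasks))

# Four independent tasks; default argument pins each i at definition time.
results = run_tasks([lambda i=i: i * i for i in range(4)])
print(results)  # [0, 1, 4, 9]
```

The encapsulation point from the abstract is that callers submit plain tasks and never see the processor count; the scheduler alone decides the degree of parallelism.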
Programming and design skills in parallel computing related to systems-on-chip (SOC) will become increasingly important, since future SOCs will have multiple processors interconnected via on-chip networks (NOC). Unfortunately, there are no easy-to-use tools for learning about and experimenting with multiprocessor (MP)SOCs/NOCs; one must instead use ad hoc combinations of tools, methodologies, and sample applications from very different sources. In this paper we introduce a parallel computing learning set (Parle) for configurable shared-memory MPSOCs/NOCs and the corresponding theoretical parallel random access machines (PRAM). The learning set consists of an experimental optimizing compiler for the high-level parallel programming language e; an assembler, linker, loader, and simulator with a graphical user interface and statistical tools; and sample e/assembler code. Using the set, a student or designer can easily write simple parallel programs, compile and load them into a configurable MPSOC/NOC platform, execute and debug them, gather statistics, and explore performance, utilization, and gate-count estimations with different architectural parameters. The learning set runs on Mac OS X systems and is available for non-profit educational purposes.
Although deadlock is not completely avoidable in distributed and parallel programming, we here describe the theory and practice of a system that allows us to limit deadlock to situations in which there are true circular data dependences, or in which processes fail that compute data needed at other processes. This allows us to guarantee the absence of deadlock in SPMD computations, absent process failure. Our system guarantees an optimal ordering of communication statements. We gratefully acknowledge the support of the US National Science Foundation under Award CISE EIA 9810708, without which this work would not have been possible.
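The guarantee above rests on the classical observation that, absent process failure, deadlock requires a circular data dependence, i.e. a cycle in the dependence graph. A small sketch of that check (an illustrative wait-for graph and depth-first cycle search, not the authors' system):

```python
def has_circular_dependence(deps):
    """Return True iff the dependence graph {proc: [procs it waits on]}
    contains a cycle -- the only situation (given live processes) in
    which deadlock remains possible."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on stack / done
    color = {n: WHITE for n in deps}

    def visit(n):
        color[n] = GRAY
        for m in deps.get(n, []):
            if color.get(m, WHITE) == GRAY:
                return True             # back edge: circular dependence
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in deps)

# Process 0 waits on 1, and 1 waits on 2: acyclic, so a deadlock-free
# ordering of the communication statements exists.
print(has_circular_dependence({0: [1], 1: [2], 2: []}))   # False
# 0 waits on 1 while 1 waits on 0: a true circular data dependence.
print(has_circular_dependence({0: [1], 1: [0]}))          # True
```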
ISBN (print): 0769515738
Summary form only given, as follows. This talk briefly reviews some of the most popular high-level and low-level parallel programming languages used for scientific computing. We will report our experiences of using these languages in our research and compare the performance of several parallel scientific equation solvers implemented in different parallel languages. Major features and comparisons of these languages will be discussed. Some insights into when and where these languages should be used will be provided.
ISBN (print): 1595930582
While past research has discussed several advantages of multiprocessor-system-on-a-chip (MPSOC) architectures over complex single-core based systems, from both the area-utilization and design-verification perspectives, compilation issues for these architectures have received relatively little attention. Programming MPSOCs can be challenging, as several potentially conflicting issues, such as data locality, parallelism, and load balance across processors, must be considered simultaneously. Most of the compilation techniques discussed in the literature for parallel architectures (not necessarily for MPSOCs) are loop based, i.e., they consider each loop nest in isolation. One key problem with such loop-based techniques is that they fail to capture the interactions between the different loop nests in the application. This paper takes a more global approach to the problem and proposes a compiler-driven data-locality optimization strategy in the context of embedded MPSOCs. An important characteristic of the proposed approach is that, in deciding the workloads of the processors (i.e., in parallelizing the application), it considers all the loop nests in the application simultaneously. The authors' experimental evaluation with eight embedded applications showed that the global scheme brings significant power/performance benefits over the conventional loop-based scheme.
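The contrast between the loop-based and global schemes can be made concrete with a toy model (illustrative only, not the authors' algorithm): if each nest is partitioned in isolation, a data block may land on different processors in successive nests, forcing remote accesses, while a single assignment chosen across all nests keeps each block on one processor.

```python
def block_migrations(assignments):
    """Count how often a data block is mapped to a different processor
    than in the previous loop nest -- a simple proxy for lost locality.
    `assignments` is a list (one entry per nest) of {block: processor}."""
    moves = 0
    for prev, cur in zip(assignments, assignments[1:]):
        moves += sum(1 for b in prev if prev[b] != cur[b])
    return moves

blocks = [0, 1, 2, 3]
# Loop-based scheme: each nest partitioned independently (second nest
# happens to shift every block to the other processor).
per_nest = [{b: b % 2 for b in blocks}, {b: (b + 1) % 2 for b in blocks}]
# Global scheme: one assignment decided once and reused by every nest.
global_scheme = [{b: b % 2 for b in blocks}] * 2

print(block_migrations(per_nest))       # 4 (every block migrates)
print(block_migrations(global_scheme))  # 0
```

In the real compiler the objective is of course richer (load balance and parallelism constrain the choice), but the migration count illustrates why considering all nests at once pays off.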