Parallel programming has to date remained inaccessible to the average scientific programmer. Parallel programming languages are generally foreign to most scientific applications programmers, who speak only Fortran. Automatic parallelization techniques have so far proved unsuccessful in extracting large amounts of parallelism from sequential codes, and they do not encourage the development of new, inherently parallel algorithms. In addition, the lack of a consistent programmer interface across architectures forces programmers to invest considerable effort in porting code from one parallel machine to another. This paper discusses the object-oriented Fortran language and support routines developed at Mississippi State to support the parallelization of complex field simulations. The interface is based on Fortran to ease its acceptance by scientific programmers and is implemented on top of the Unix operating system for portability.
Specifying dataflow applications efficiently is one of the greatest challenges facing Network-on-Chip (NoC) simulation and exploration. BTS (Behavior-level Traffic Simulation) was proposed to specify behavior-level applications more efficiently than the conventional message-passing programming model does. To alleviate the complexity of parallel programming, BTS implements the computation tasks as sequential modules with data shared among them. BTS also introduces parameterization to produce pseudo messages that point to the shared data and to realize data-driven scheduling. As substitutes for conventional parallel applications, BTS-based applications inherit their computation models and the underlying scheduling schemes, and the pseudo messages match the original messages in both function and size. Consequently, BTS-based applications and conventional ones produce identical traffic and identical results for NoC simulation. Case studies showed that BTS can accelerate application specification by reusing existing sequential codes, especially domain-specific languages implemented as libraries of sequential subroutines.
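The pseudo-message idea in the abstract above can be sketched as follows; this is an illustrative toy (all names, such as `PseudoMsg` and `send`, are hypothetical, not taken from BTS): tasks stay sequential and share data directly, while only small descriptors travel between them, carrying the key and size the real message would have had, so a traffic simulator observes the same message pattern.

```python
# Illustrative sketch of pseudo messages driving data-driven scheduling.
# Payloads live in shared memory; only (key, size) descriptors are "sent".
from collections import namedtuple

PseudoMsg = namedtuple("PseudoMsg", ["key", "size"])

shared = {}          # data shared among the sequential task modules
traffic_log = []     # (src, dst, size) tuples -- what a NoC simulator sees

def send(src, dst, key, payload, inboxes):
    shared[key] = payload                       # payload stays in shared memory
    traffic_log.append((src, dst, len(payload)))
    inboxes[dst].append(PseudoMsg(key, len(payload)))

def ready(task, inboxes, needed):
    return len(inboxes[task]) >= needed[task]   # data-driven firing rule

inboxes = {"filter": [], "sink": []}
needed = {"filter": 1, "sink": 1}

send("source", "filter", "blk0", [1, 2, 3, 4], inboxes)
if ready("filter", inboxes, needed):
    msg = inboxes["filter"].pop()
    result = [x * 2 for x in shared[msg.key]]   # sequential computation module
    send("filter", "sink", "blk1", result, inboxes)

print(traffic_log)       # same routes and sizes a real message-passing run produces
print(shared["blk1"])
```

Because the descriptors preserve the size and destination of the original messages, the logged traffic is identical to the conventional parallel version, which is the property the abstract relies on.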
ISBN (print): 9781728151267
High-performance embedded computing is developing rapidly, since applications in most domains require a large and increasing amount of computing power. On the hardware side, this requirement is met by the introduction of heterogeneous systems, with highly parallel accelerators that are designed to take care of the computation-heavy parts of an application. There is today a plethora of accelerator architectures, including GPUs, many-cores, FPGAs, and domain-specific architectures such as AI accelerators. They all have their own programming models, which are typically complex, low-level, and involve explicit parallelism. This yields error-prone software that puts functional safety at risk, which is unacceptable for safety-critical embedded applications. In this position paper we argue that high-level executable modelling languages tailored for parallel computing can help in the software design for high-performance embedded applications. In particular, we consider the data-parallel model to be a suitable candidate, since it allows very abstract parallel algorithm specifications free from race conditions. Moreover, we promote the Action Language for fUML (and thereby fUML) as a suitable host language.
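The race-freedom claim for the data-parallel model can be illustrated with a short sketch (in Python for concreteness; the function names are ours, not from fUML): algorithms are written as pure element-wise maps and associative reductions, so there is no shared mutable state, and every element may be computed in parallel without synchronization.

```python
# Sketch of race-free data-parallel specification: pure maps and reductions.
from functools import reduce
from operator import add

def saxpy(a, xs, ys):
    # Each output element depends only on inputs at the same index,
    # so all elements can be evaluated in parallel on any accelerator.
    return [a * x + y for x, y in zip(xs, ys)]

def dot(xs, ys):
    # Map then reduce with an associative operator: a parallel tree sum
    # computes the same result in O(log n) steps.
    return reduce(add, (x * y for x, y in zip(xs, ys)), 0)

print(saxpy(2, [1, 2, 3], [10, 20, 30]))   # -> [12, 24, 36]
print(dot([1, 2, 3], [4, 5, 6]))           # -> 32
```

Since no operation writes to data another operation reads, the specification is abstract about *how* the work is distributed, which is exactly what lets a compiler target GPUs, many-cores, or FPGAs from one model.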
Desktop computing remains indispensable in scientific exploration, largely because it provides people with devices for human interaction and environments for interactive job execution. However, with today's rapidly growing data volume and task complexity, it is increasingly hard for individual workstations to meet the demands of interactive scientific data processing. The increasing cost of such interactive processing is hindering the productivity of end-to-end scientific computing workflows. While existing distributed computing systems allow people to aggregate desktop workstation resources for parallel computing, the burden of explicit parallel programming and parallel job execution often prevents scientists from taking advantage of such platforms. In this paper, we discuss the need for transparent desktop parallel computing in scientific data processing. As an initial step toward this goal, we present our ongoing work on the automatic parallelization of the scripting language R, a popular tool for statistical computing. Our preliminary results suggest that a reasonable speedup can be achieved on real-world sequential R programs without requiring any code modification.
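The transparent-parallelization idea above targets R's apply-style functions; as a language-neutral sketch (written in Python, with a hypothetical `sapply` standing in for the R primitive), a runtime can intercept a data-independent apply and route it through a worker pool while preserving sequential semantics, so user code needs no modification.

```python
# Sketch: a drop-in apply whose parallel path returns the same results,
# in the same order, as the sequential path (safe when f is side-effect-free).
from concurrent.futures import ThreadPoolExecutor

def sapply(f, xs, parallel=False, workers=4):
    if not parallel:
        return [f(x) for x in xs]            # original sequential semantics
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))         # order-preserving parallel map

stats = lambda x: x * x   # stand-in for a per-element statistical kernel
assert sapply(stats, range(5)) == sapply(stats, range(5), parallel=True)
print(sapply(stats, range(5), parallel=True))   # -> [0, 1, 4, 9, 16]
```

A real implementation would use separate processes or machines to get actual speedup for CPU-bound R code; the point of the sketch is only that the substitution can be invisible to the user.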
Presents a collection of slides covering the following: NVIDIA CUDA; CUDA toolkit; CUDA libraries; closely coupled CPU-GPU; CUDA many-core and multi-core support; nvcc CUDA compiler; CUBLAS; and CUFFT.
The authors report their experiences with the Gauss elimination algorithm on several parallel machines. Several different software designs are demonstrated, ranging from a simple shared-memory implementation to the use of a message-passing programming model. It is found that efficient use of local memory is critical to obtaining good performance on scalable machines. Machines with large coherent caches appear to require the least software effort to obtain effective performance.
This paper analyzes the consequences of existing network structure for the design of a protocol for a radical COMA (Cache Only Memory Architecture). Parallel computing today faces two significant challenges: the difficulty of programming and the need to leverage existing "off-the-shelf" hardware. The difficulty of programming parallel computers can be split into two problems: distributing the data and distributing the computation. Parallelizing compilers address both problems but have limited application outside the domain of loop-intensive "scientific" code. Conventional COMAs provide an adaptive, self-distributing solution to data distribution but do not address computation distribution. Our proposal leverages parallelizing compilers and then extends COMA to provide adaptive self-distribution of both data and computation. The radical COMA protocols can be implemented in hardware, software, or a combination of both. When, however, the implementation is constrained to operate in a cluster computing environment (that is, to use only existing, already installed hardware), the protocols must be reengineered to accommodate the deficiencies of the hardware. This paper identifies the critical quantities of various existing network structures and discusses their repercussions for protocol design. A new protocol is presented in detail.
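The "adaptive, self-distributing" behavior of a COMA that the abstract builds on can be illustrated with a toy model (this is a generic illustration, not the paper's protocol; all class and method names are ours): memory blocks have no fixed home node; a read replicates the block into the reader's attraction memory, and a write invalidates other copies, so data migrates to wherever the computation touches it.

```python
# Toy COMA-style self-distribution: blocks replicate on read, and a write
# invalidates all other copies, so data follows the computation.
class Node:
    def __init__(self):
        self.am = {}                 # attraction memory: block -> value

class Coma:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def read(self, who, blk):
        if blk not in self.nodes[who].am:                # local miss
            owner = next(n for n in self.nodes if blk in n.am)
            self.nodes[who].am[blk] = owner.am[blk]      # replicate on read
        return self.nodes[who].am[blk]

    def write(self, who, blk, val):
        for n in self.nodes:                             # invalidate copies
            n.am.pop(blk, None)
        self.nodes[who].am[blk] = val

m = Coma(3)
m.write(0, "x", 7)     # block "x" lives only at node 0
m.read(2, "x")         # node 2 now holds a copy as well
print(sorted(i for i, n in enumerate(m.nodes) if "x" in n.am))  # -> [0, 2]
```

The paper's contribution is extending this data-only adaptivity to computation as well, and reengineering the protocol around the measured properties of commodity cluster networks; the toy above captures only the baseline data-distribution behavior.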
The multicore revolution is now happening on both desktop and server systems and is expected to soon enter the embedded space. For decades, hardware manufacturers have been able to deliver more powerful CPUs through higher clock speeds and advanced memory systems. However, the frequency is no longer increasing; instead, the number of cores on each CPU is. Software development for embedded uniprocessor systems is completely dominated by imperative-style programming, deeply rooted in C and in the scheduling of threads and processes. We believe that the multicore challenge requires new methodologies and new tools to make efficient use of the hardware. Dataflow programming, which has received considerable attention over the years, is a promising candidate for the design and implementation of certain classes of applications on parallel hardware, such as complex media coding, network processing, imaging and digital signal processing, and embedded control. This talk discusses current problem areas within the embedded domain and presents the Open Dataflow framework. Traditionally, very little work has been done on real-time analysis and design of dataflow systems. The difficulties involved, which relate to the high degree of dynamism, are discussed, and some research ideas are presented.
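The dataflow style advocated above can be sketched in a few lines (an illustrative toy in Python, not the Open Dataflow framework): actors are sequential kernels that fire when enough tokens sit on their input queues, so concurrency comes from the graph structure rather than from threads written into the source code.

```python
# Toy dataflow runtime: actors fire when their token count is reached.
from collections import deque

class Actor:
    def __init__(self, fn, needed):
        self.fn, self.needed, self.inbox = fn, needed, deque()

def run(graph, edges, max_steps=100):
    # graph: name -> Actor; edges: name -> downstream name (None = sink)
    outputs = []
    for _ in range(max_steps):
        fired = False
        for name, a in graph.items():
            if len(a.inbox) >= a.needed:                  # firing rule
                args = [a.inbox.popleft() for _ in range(a.needed)]
                token = a.fn(*args)                       # sequential kernel
                dst = edges[name]
                (graph[dst].inbox if dst else outputs).append(token)
                fired = True
        if not fired:                                     # quiescent: done
            break
    return outputs

g = {"scale": Actor(lambda x: 2 * x, 1), "add": Actor(lambda x: x + 1, 1)}
e = {"scale": "add", "add": None}
for t in [1, 2, 3]:
    g["scale"].inbox.append(t)
print(run(g, e))   # -> [3, 5, 7]
```

The real-time analysis difficulty the talk raises is visible even here: how often an actor fires depends on token arrivals at run time, so static timing guarantees require restricting the model (e.g., to fixed production/consumption rates).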
ISBN (print): 9781479913725
We have developed two new approaches to teaching parallel computing to undergraduates using higher-level tools that lead to ease of programming, good software design, and scalable programs. The first approach uses a new software environment that creates a higher level of abstraction for parallel and distributed programming based on a pattern programming approach. The second approach uses compiler directives to describe how a program should be parallelized. We studied whether these tools help students better grasp the concepts of parallel computing across the two campuses of the University of North Carolina Wilmington and the University of North Carolina Charlotte, connected by a televideo network. We also taught MPI and OpenMP in the traditional fashion, so that we could ask the students to compare and contrast the approaches. An external evaluator conducted three surveys during the semester and analyzed the data. In this paper, we discuss the techniques we used, the assignments we gave the students, and the results of what we learned.
Due to the huge computing resources the grid can provide, researchers have used the grid to run very large-scale applications over a large number of computing and I/O nodes. However, since the computing nodes in a grid are spread geographically over a wide area, communication latency varies significantly between nodes. Thus, running existing parallel applications over the whole grid can result in worse performance even with a larger number of computing nodes. Hence, in the grid environment, parallel applications usually still run on a single cluster. It is expected that the emerging lambda network technology can be used for the backbone networks of grids and improve the communication performance between computing nodes. In this paper, we show the potential benefit of the lambda network for parallel applications in a grid environment. Our measurement results reveal that the NAS parallel benchmarks over a lambda grid can achieve more than 50% higher performance than in the single-cluster case. In addition, the results show that parallel programming libraries such as MPI still need to be improved with respect to tolerance of network delay and topology awareness.