ISBN (Print): 9781581137620
Recent efforts to adapt computer-network techniques to systems-on-chip (SOC), the network-on-chip approach, represent a step backward from traditional computer systems in that they lack an effective programming model, while also failing to take full advantage of the almost unlimited on-chip bandwidth. In this paper, we propose a new programming model, called context-flow, that is simple, safe, and highly parallelizable, yet transparent to the underlying architectural details. An SOC platform architecture is then designed to support this programming model while fully exploiting the physical proximity between the processing elements. We demonstrate the performance advantage of this architecture over bus-based and packet-switch-based networks through two case studies using a multiprocessor architecture simulator.
The authors report their experiences with the Gauss elimination algorithm on several parallel machines. Several different software designs are demonstrated, ranging from a simple shared memory implementation to the use of a message passing programming model. It is found that the efficient use of local memory is critical to obtaining good performance on scalable machines. Machines with large coherent caches appear to require the least software effort in order to obtain effective performance.
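The contrast between the shared-memory and message-passing designs can be made concrete. The C sketch below (our illustration, not the authors' code) parallelizes the forward-elimination step with OpenMP, since the row updates under a fixed pivot are independent of one another.

#define N 512

/* Forward elimination on an N x N system Ax = b (no pivoting, for brevity).
 * For a fixed pivot k, the updates of rows k+1..N-1 are independent, so the
 * loop over i can be shared among threads; this mirrors the "simple shared
 * memory" design mentioned in the abstract. */
void gauss_eliminate(double A[N][N], double b[N])
{
    for (int k = 0; k < N; k++) {
        #pragma omp parallel for
        for (int i = k + 1; i < N; i++) {
            double factor = A[i][k] / A[k][k];
            for (int j = k; j < N; j++)
                A[i][j] -= factor * A[k][j];
            b[i] -= factor * b[k];
        }
    }
}

A message-passing variant would instead distribute rows across processes and broadcast the pivot row at each step, which is where the careful use of local memory that the authors identify as critical comes into play.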
This paper analyzes the consequences of existing network structure for the design of a protocol for a radical COMA (Cache Only Memory Architecture). Parallel computing today faces two significant challenges: the difficulty of programming and the need to leverage existing "off-the-shelf" hardware. The difficulty of programming parallel computers can be split into two problems: distributing the data and distributing the computation. Parallelizing compilers address both problems, but have limited application outside the domain of loop-intensive "scientific" code. Conventional COMAs provide an adaptive, self-distributing solution to data distribution, but do not address computation distribution. Our proposal leverages parallelizing compilers and then extends COMA to provide adaptive self-distribution of both data and computation. The radical COMA protocols can be implemented in hardware, software, or a combination of both. When, however, the implementation is constrained to operate in a cluster computing environment (that is, to use only existing, already installed hardware), the protocols have to be reengineered to accommodate the deficiencies of the hardware. This paper identifies the critical quantities of various existing network structures and discusses their repercussions for protocol design. A new protocol is presented in detail.
Desktop computing remains indispensable in scientific exploration, largely because it provides people with devices for human interaction and environments for interactive job execution. However, with today's rapidly growing data volumes and task complexity, it is increasingly hard for individual workstations to meet the demands of interactive scientific data processing. The increasing cost of such interactive processing is hindering the productivity of end-to-end scientific computing workflows. While existing distributed computing systems allow people to aggregate desktop workstation resources for parallel computing, the burden of explicit parallel programming and parallel job execution often prevents scientists from taking advantage of such platforms. In this paper, we discuss the need for transparent desktop parallel computing in scientific data processing. As an initial step toward this goal, we present our ongoing work on the automatic parallelization of the scripting language R, a popular tool for statistical computing. Our preliminary results suggest that a reasonable speedup can be achieved on real-world sequential R programs without requiring any code modification.
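Although the paper targets R, the underlying idea can be sketched in C: the user writes an ordinary sequential loop over independent data items, and the system emits an equivalent parallel version without any change to the user's source. The function names below are purely illustrative and do not come from the paper.

#include <math.h>

/* What the user writes: a sequential, element-wise computation whose
 * iterations are independent of one another. */
void score_sequential(const double *in, double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = log(in[i] + 1.0);
}

/* What an automatic parallelizer could emit: the same loop annotated for
 * parallel execution, with no modification to the original source. */
void score_parallel(const double *in, double *out, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = log(in[i] + 1.0);
}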
ISBN (Print): 9781728151267
High-performance embedded computing is developing rapidly, since applications in most domains require a large and increasing amount of computing power. On the hardware side, this requirement is met by the introduction of heterogeneous systems, with highly parallel accelerators designed to take care of the computation-heavy parts of an application. There is today a plethora of accelerator architectures, including GPUs, many-cores, FPGAs, and domain-specific architectures such as AI accelerators. They all have their own programming models, which are typically complex, low-level, and involve explicit parallelism. This yields error-prone software that puts functional safety at risk, which is unacceptable for safety-critical embedded applications. In this position paper we argue that high-level executable modelling languages tailored for parallel computing can help in the software design of high-performance embedded applications. In particular, we consider the data-parallel model to be a suitable candidate, since it allows very abstract parallel algorithm specifications that are free from race conditions. Moreover, we promote the Action Language for fUML (and thereby fUML) as a suitable host language.
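As a language-neutral sketch of the data-parallel style the paper advocates (the paper itself uses the Action Language for fUML, not C), the fragment below specifies a computation as a pure per-element function plus a map. Because each output depends only on the corresponding input, the specification is race-free regardless of how it is scheduled.

/* Data-parallel specification: a pure per-element function plus a map.
 * The element function reads only its own input and writes only its own
 * output, so the specification is free of race conditions however it is
 * mapped to CPU threads, GPU kernels, or FPGA pipelines. */
typedef float (*elem_fn)(float);

static float clip(float x) { return x > 1.0f ? 1.0f : x; }

void map_elements(elem_fn f, const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)   /* each iteration is independent */
        out[i] = f(in[i]);
}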
Presents a collection of slides covering the following: NVIDIA CUDA; CUDA toolkit; CUDA libraries; closely coupled CPU-GPU; CUDA many-core and multi-core support; nvcc CUDA compiler; CUBLAS; and CUFFT.
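As a minimal, hedged illustration of the CUBLAS library mentioned in the slides (not code taken from them), the host-side C function below offloads a single-precision AXPY, y = alpha*x + y, to the GPU; error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Compute y = alpha * x + y on the GPU via CUBLAS (error checks omitted). */
void saxpy_gpu(int n, float alpha, const float *x, float *y)
{
    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   /* level-1 BLAS call */
    cublasDestroy(handle);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}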
The multicore revolution is now happening on both desktop and server systems and is expected to soon enter the embedded space. For the last few decades hardware manufacturers have been able to deliver more powerful CPUs through higher clock speeds and advanced memory systems. However, the frequency is no longer increasing; instead, the number of cores on each CPU is. Software development for embedded uniprocessor systems is completely dominated by imperative-style programming, deeply rooted in C and in the scheduling of threads and processes. We believe that the multicore challenge requires new methodologies and new tools to make efficient use of the hardware. Dataflow programming, which has received considerable attention over the years, is a promising candidate for the design and implementation of certain classes of applications, such as complex media coding, network processing, imaging and digital signal processing, and embedded control, on parallel hardware. This talk discusses current problem areas within the embedded domain and presents the Open Dataflow framework. Traditionally, very little work has been done on real-time analysis and design of dataflow systems. The difficulties involved, which relate to the high degree of dynamism, are discussed, and some research ideas are presented.
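Open Dataflow itself is built around the CAL actor language; purely as a language-neutral sketch in C, the fragment below captures the essence of a dataflow actor: it fires when input tokens are available, consumes them from an input FIFO, and produces tokens on an output FIFO, sharing no other state with other actors.

#include <stddef.h>

/* A minimal token FIFO (capacity and overflow checks omitted in this sketch). */
typedef struct { int buf[64]; size_t head, tail; } fifo_t;

static int  fifo_count(const fifo_t *f)     { return (int)(f->tail - f->head); }
static int  fifo_pop(fifo_t *f)             { return f->buf[f->head++ % 64]; }
static void fifo_push(fifo_t *f, int v)     { f->buf[f->tail++ % 64] = v; }

/* Actor "scale": fires whenever an input token is present, consumes one token,
 * and produces one scaled token. It touches no state outside its own FIFOs,
 * which is what makes dataflow programs natural to place on parallel hardware. */
void scale_fire(fifo_t *in, fifo_t *out, int factor)
{
    while (fifo_count(in) > 0)
        fifo_push(out, factor * fifo_pop(in));
}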
ISBN (Print): 9781479913725
We have developed two new approaches to teaching parallel computing to undergraduates using higher-level tools that lead to ease of programming, good software design, and scalable programs. The first approach uses a new software environment that creates a higher level of abstraction for parallel and distributed programming based upon a pattern programming approach. The second approach uses compiler directives to describe how a program should be parallelized. We studied whether these tools help students better grasp the concepts of parallel computing, teaching across the two campuses of the University of North Carolina Wilmington and the University of North Carolina Charlotte over a televideo network. We also taught MPI and OpenMP in the traditional fashion so that we could ask the students to compare and contrast the approaches. An external evaluator conducted three surveys during the semester and analyzed the data. In this paper, we discuss the techniques we used, the assignments we gave the students, and the results of what we learned.
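A representative example of the compiler-directive approach (our illustration, not one of the course assignments) is an OpenMP reduction such as the numerical estimate of pi below; a single directive tells the compiler how to split the loop and combine the partial sums.

#include <stdio.h>

/* Estimate pi by integrating 4/(1+x^2) over [0,1]; the directive describes
 * how to parallelize the loop and safely accumulate the shared sum. */
int main(void)
{
    const long n = 100000000;
    const double h = 1.0 / (double)n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi ~= %.12f\n", sum * h);
    return 0;
}

Compiled with, for example, gcc -fopenmp, the loop runs in parallel; if the directive is ignored, the same source still runs correctly as a sequential program.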
Due to the huge computing resources the grid can provide, researchers have used the grid to run very large-scale applications over a large number of computing and I/O nodes. However, since the computing nodes in a grid are spread geographically over a wide area, communication latency varies significantly between nodes. Thus, running existing parallel applications over the whole grid can result in worse performance even with a larger number of computing nodes. Hence, in the grid environment, parallel applications usually still run on a single cluster. It is expected that the emerging lambda network technology can be used for the backbone networks of grids and improve the communication performance between computing nodes. In this paper, we show the potential benefit of the lambda network for parallel applications in the grid environment. Our measurement results reveal that the NAS Parallel Benchmarks over a lambda grid can achieve more than 50% higher performance than in the single-cluster case. In addition, the results show that parallel programming libraries such as MPI still need to be improved with respect to tolerance of network delay and topology awareness.
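One common way to improve tolerance of network delay, sketched here as generic MPI code rather than the benchmark code used in the measurements, is to post non-blocking transfers and overlap them with local computation.

#include <mpi.h>

/* Overlap a halo exchange with local work using non-blocking MPI calls,
 * one way to hide wide-area latency on a lambda grid (illustrative only). */
void exchange_and_compute(double *halo_out, double *halo_in, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... perform computation that does not depend on halo_in ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... now compute on the received halo data ... */
}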
Multiprocessor architectures are converging. For the present, we urgently need to adopt common standards for message passing programming. For the future, one can expect scalable virtual shared memory machines to dominate. The author discusses: communication strategies; dedicated components; programming environments; and programming. An example listing of a ranking program is given that would require such a generation of machine to execute efficiently.