Data races are a common problem on shared-memory parallel computers, including multicores. Analysis programs called race detectors help find and eliminate them. However, current race detectors are geared for specific concurrency libraries. When programmers use libraries unknown to a given detector, the detector becomes useless or requires extensive reprogramming. We introduce a new synchronization detection mechanism that is independent of concurrency libraries. It dynamically detects synchronization constructs based on a characteristic code pattern. The approach is non-intrusive and applicable to various concurrency libraries. Experimental results confirm that the approach identifies synchronizations and detects data races regardless of the concurrency libraries involved. With this mechanism, race detectors can be written once and need not be adapted to particular libraries.
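To see why conventional detectors are tied to particular libraries, consider a simplified lockset check in the style of the well-known Eraser algorithm. This is not the mechanism proposed in the abstract above; it is a minimal sketch of the kind of detector that must be told which calls acquire and release locks. The trace format, the operation names, and the function name `lockset_races` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Each trace event is (thread_id, operation, target). The event format and the
# operation names "acquire", "release", "read", "write" are hypothetical.
Event = Tuple[int, str, str]


def lockset_races(trace: List[Event]) -> Set[str]:
    """Return variables touched by several threads with no common lock held.

    Simplified lockset check: it omits the state machine the full Eraser
    algorithm uses for initialization and read-shared data.
    """
    held: Dict[int, Set[str]] = {}        # locks currently held by each thread
    candidates: Dict[str, Set[str]] = {}  # locks held on every access so far
    accessors: Dict[str, Set[int]] = {}   # threads that accessed each variable
    racy: Set[str] = set()

    for thread, op, target in trace:
        locks = held.setdefault(thread, set())
        if op == "acquire":
            locks.add(target)
        elif op == "release":
            locks.discard(target)
        elif op in ("read", "write"):
            if target not in candidates:
                candidates[target] = set(locks)
            else:
                candidates[target] &= locks
            accessors.setdefault(target, set()).add(thread)
            # Report only when more than one thread is involved.
            if not candidates[target] and len(accessors[target]) > 1:
                racy.add(target)
    return racy


example_trace: List[Event] = [
    (1, "acquire", "m"), (1, "write", "x"), (1, "release", "m"),
    (2, "acquire", "m"), (2, "write", "x"), (2, "release", "m"),
    (1, "write", "y"), (2, "write", "y"),  # no lock held: reported
]
racy_variables = lockset_races(example_trace)
```

The check only works if something has already translated library calls into "acquire" and "release" events. That translation is the library-specific step that the abstract proposes to replace with dynamic detection of synchronization constructs.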
The statistical language R is favoured by many biostatisticians for processing microarray data. In recent times, the quantity of data that can be obtained in experiments has risen significantly, making previously fast analyses time consuming or even not possible at all with the existing software infrastructure. High performance computing (HPC) systems offer a solution to these problems but at the expense of increased complexity for the end user. The Simple parallel R Interface is a library for R that aims to reduce the complexity of using HPC systems by providing biostatisticians with drop-in parallelised replacements of existing R functions. In this paper we describe parallel implementations of two popular techniques: exploratory clustering analyses using the random forest classifier and feature selection through identification of differentially expressed genes using the rank product method.
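The rank product statistic mentioned above can be sketched briefly. The example below is written in Python rather than R and is not the SPRINT implementation; the function name `rank_products` and the input layout (gene name mapped to one fold change per replicate) are assumptions. It computes, for each gene, the geometric mean of its ranks across replicates, with rank 1 assigned to the largest fold change. Ties are not handled, and the significance estimate, typically obtained by permutation, is omitted.

```python
import math
from typing import Dict, List


def rank_products(fold_changes: Dict[str, List[float]]) -> Dict[str, float]:
    """Geometric mean of each gene's per-replicate rank (rank 1 = largest value)."""
    genes = list(fold_changes)
    n_reps = len(next(iter(fold_changes.values())))
    log_sum = {gene: 0.0 for gene in genes}

    for rep in range(n_reps):
        # Rank genes within this replicate, largest fold change first.
        ordered = sorted(genes, key=lambda g: fold_changes[g][rep], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_sum[gene] += math.log(rank)  # sum logs to avoid large products

    return {gene: math.exp(log_sum[gene] / n_reps) for gene in genes}


# Hypothetical data: three genes, two replicates.
scores = rank_products({"a": [3.0, 2.5], "b": [1.0, 3.0], "c": [0.5, 0.1]})
```

Genes with small rank products are consistently near the top of every replicate, which makes them candidates for differential expression.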
Directive-based incremental parallelism is an uncomplicated and expressive parallelisation practice and has led to wide adoption of OpenMP. However, the OpenMP specification does not provide a binding for the Java language, and the OpenMP threading model finds limited use in the development of applications with a Graphical User Interface (GUI). This paper studies a semantic interpretation of OpenMP in the context of an object-oriented environment. It proposes novel concepts to extend OpenMP for GUI applications, based on the distinction between parallelism and concurrency. We present a compiler-runtime system for OpenMP-like directives in Java, enhanced with GUI-related constructs. Building on the productivity gains of OpenMP's incremental parallelism approach, the GUI-related constructs enable the developer to introduce concurrency incrementally. We present and discuss the performance of programs written using our system by comparing them with previous attempts and with traditional approaches to parallelisation and concurrency, using the parallel Java Grande Forum (JGF) benchmarks and a set of GUI applications. (C) 2013 Elsevier B.V. All rights reserved.
The computing landscape has shifted towards multicore architectures. To learn about software development, it is increasingly important for students to gain hands-on parallel programming experience in multicore environ...
Multilevel flash memory contains blocks of cells that represent data by the amount of charge stored in them. The cell writing (or programming) process applies specified voltages in a sequential manner, injecting charge to achieve a desired level. Reducing a cell level requires a costly block erasure, so programming only increases cell levels. Parallel programming, whereby a common voltage is applied to a group of cells to inject charge simultaneously, simplifies circuitry and increases programming speed. However, cell-to-cell variations and a limited number of programming rounds can adversely affect its precision. In this paper, we consider algorithms for efficient cell programming. Since cell levels are quantized to a discrete set of values, our objective is to minimize the number of cells that are not quantized to their target levels. For a specified number of programming rounds, we derive an optimal parallel programming algorithm with complexity that is polynomial in the number of cells. We extend the algorithm to account for intercell interference, where the voltage applied to a cell can affect the level of adjacent cells. We then consider noisy programming of a single cell, with and without feedback about the cell level. In both scenarios, we present an algorithm that, for a given number of programming rounds, minimizes the probability of an incorrect cell level.
Transactional Memory (TM) promises to simplify parallel programming by replacing locks with atomic transactions. Despite much recent progress in TM research, there is very little experience using TM to develop realistic parallel programs from scratch. In this article, we present the results of a detailed case study comparing teams of programmers developing a parallel program from scratch using transactional memory and locks. We analyze and quantify in a realistic environment the development time, programming progress, code metrics, programming patterns, and ease of code understanding for six teams who each wrote a parallel desktop search engine over a fifteen week period. Three randomly chosen teams used Intel's Software Transactional Memory compiler and Pthreads, while the other teams used just Pthreads. Our analysis is exploratory: Given the same requirements, how far did each team get? The TM teams were among the first to have a prototype parallel search engine. Compared to the locks teams, the TM teams spent less than half the time debugging segmentation faults, but had more problems tuning performance and implementing queries. Code inspections with industry experts revealed that TM code was easier to understand than locks code, because the locks teams used many locks (up to thousands) to improve performance. Learning from each team's individual success and failure story, this article provides valuable lessons for improving TM.
This paper presents an algorithm for the indirect solution of optimal control problems that contain mixed state and control variable inequality constraints. The necessary conditions for optimality lead to an inequality constrained two-point BVP with index-1 differential-algebraic equations (BVP-DAEs). These BVP-DAEs are solved using a multiple shooting method where the DAEs are approximated using a single-step linearly implicit Runge-Kutta (Rosenbrock-Wanner) method. An interior-point Newton method is used to solve the residual equations associated with the multiple shooting discretization. The elements of the residual equations, and the Jacobian of the residual equations, are constructed in parallel. The search direction for the interior-point method is computed by solving a sparse bordered almost block diagonal (BABD) linear system. Here, a parallel-structured orthogonal factorization algorithm is used to solve the BABD system. Examples are presented to illustrate the efficiency of the parallel algorithm. It is shown that an American National Standards Institute C implementation of the parallel algorithm achieves significant speedup with the increase in the number of processors used. Copyright (c) 2013 John Wiley & Sons, Ltd.
Parallel computational frameworks for high-performance computing are central to the advancement of simulation-based studies in science and engineering. Finding and fixing bugs in these frameworks can be time consuming. If left unchecked, these bugs diminish the amount of new science performed. A systematic study of the Uintah Computational Framework investigates debugging approaches, leveraging the framework's modular structure.
The spiking neural network architecture (SpiNNaker) project aims to deliver a massively parallel million-core computer whose interconnect architecture is inspired by the connectivity characteristics of the mammalian brain, and which is suited to the modeling of large-scale spiking neural networks in biological real time. Specifically, the interconnect allows the transmission of a very large number of very small data packets, each conveying explicitly the source, and implicitly the time, of a single neural action potential or "spike." In this paper, we review the current state of the project, which has already delivered systems with up to 2500 processors, and present the real-time event-driven programming model that supports flexible access to the resources of the machine and has enabled its use by a wide range of collaborators around the world.
We introduce a new, faster molecular dynamics (MD) engine into the CHARMM software package. The new MD engine is faster in both serial (i.e., single CPU core) and parallel execution. Serial performance is approximately two times higher than in the previous version of CHARMM. The newly programmed parallelization method allows the MD engine to scale to hundreds of CPU cores. (c) 2013 Wiley Periodicals, Inc.