In this paper, we present "rules of thumb" for the efficient and straightforward parallelization of cellular neural networks (CNNs) processing image data on cluster architectures. The rules result from the application and optimization of the simple but effective structural data parallel approach, which is based on the SPMD model. Digital gray-scale images were used to evaluate the optimized parallel cellular neural network program. The process of parallelizing the algorithm employs HPF to generate an MPI-based program.
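As a rough illustration of the structural data parallel approach, the sketch below uses hypothetical Java helpers (the paper itself relies on HPF to generate an MPI-based program, which is not reproduced here) to show how a gray-scale image could be split into contiguous row blocks, one per SPMD process, each extended by one overlapping border row so every cell can still read its 3x3 CNN neighborhood.

```java
/**
 * Minimal sketch of a row-block decomposition for an SPMD CNN simulation.
 * Hypothetical helper, not the HPF/MPI code generated in the paper: each of
 * the nprocs processes owns a contiguous band of image rows plus one ghost
 * row above and below for the 3x3 CNN neighborhood.
 */
public final class RowBlockDecomposition {

    /** Inclusive first and last image row owned by the given rank. */
    public static int[] ownedRows(int totalRows, int rank, int nprocs) {
        int base = totalRows / nprocs;
        int rest = totalRows % nprocs;              // first 'rest' ranks get one extra row
        int first = rank * base + Math.min(rank, rest);
        int last  = first + base + (rank < rest ? 1 : 0) - 1;
        return new int[] { first, last };
    }

    /** Rows the rank must also hold locally (ghost rows) for a 3x3 template. */
    public static int[] rowsWithGhosts(int totalRows, int rank, int nprocs) {
        int[] owned = ownedRows(totalRows, rank, nprocs);
        int first = Math.max(0, owned[0] - 1);              // one border row above
        int last  = Math.min(totalRows - 1, owned[1] + 1);  // one border row below
        return new int[] { first, last };
    }

    public static void main(String[] args) {
        // Example: a 512-row image split across 4 SPMD processes.
        for (int rank = 0; rank < 4; rank++) {
            int[] owned = ownedRows(512, rank, 4);
            int[] held  = rowsWithGhosts(512, rank, 4);
            System.out.printf("rank %d owns rows %d..%d, holds %d..%d%n",
                    rank, owned[0], owned[1], held[0], held[1]);
        }
    }
}
```

In the message-passing program this sketches, the border rows would be exchanged with neighboring processes after each CNN iteration; that communication step is omitted here.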
ISBN (print): 9781665475075
The conventional model of parallel programming today involves either copying data across cores (and then having to track its most recent value), or not copying and requiring deep software stacks to perform even the simplest operation on data that is “remote”, i.e., out of the range of loads and stores from the current core. As application requirements grow to larger data sets, with more irregular access to them, both conventional approaches start to exhibit severe scaling limitations. This paper reviews some growing evidence of the potential value of a new model of computation that skirts between the two: data does not move (i.e., is not copied), but computation instead moves to the data. Several different applications involving large sparse computations, streaming of data, and complex mixed mode operations have been coded for a novel platform where thread movement is handled invisibly by the hardware. The evidence to date indicates that parallel scaling for this paradigm can be significantly better than any mix of conventional models.
Transactional memory (TM) provides an easy-to-use and high-performance parallel programming model for the upcoming chip-multiprocessor systems. Several researchers have proposed alternative hardware and software TM implementations. However, the lack of transaction-based programs makes it difficult to understand the merits of each proposal and to tune future TM implementations to the common case behavior of real applications. This work addresses this problem by analyzing the common case transactional behavior for 35 multithreaded programs from a wide range of application domains. We identify transactions within the source code by mapping existing primitives for parallelism and synchronization management to transaction boundaries. The analysis covers basic characteristics such as transaction length, distribution of read-set and write-set size, and the frequency of nesting and I/O operations. The measured characteristics provide key insights into the design of efficient TM systems for both non-blocking synchronization and speculative parallelization.
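To make the mapping from synchronization primitives to transaction boundaries concrete, here is a toy Java sketch (hypothetical class and method names, not the paper's actual instrumentation): each lock-protected critical section is treated as one transaction, and the addresses read and written inside it are recorded so read-set and write-set sizes can be reported at commit.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy illustration, not the paper's tool: a lock-based critical section is
 * treated as a transaction, and every address read or written inside it is
 * recorded so that read-set and write-set sizes can be reported at commit.
 */
final class TransactionProfile {
    private final Set<Long> readSet = new HashSet<>();
    private final Set<Long> writeSet = new HashSet<>();

    // Called where the original program acquires a lock (transaction begin).
    void begin() { readSet.clear(); writeSet.clear(); }

    // Instrumented memory accesses inside the critical section.
    void onRead(long address)  { readSet.add(address); }
    void onWrite(long address) { writeSet.add(address); }

    // Called where the original program releases the lock (transaction end).
    void commit() {
        System.out.printf("transaction: %d reads, %d writes%n",
                readSet.size(), writeSet.size());
    }
}
```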
ISBN (print): 9781581135626
Run-time errors in concurrent programs are generally due to the wrong usage of synchronization primitives such as monitors. Conventional validation techniques such as testing become ineffective for concurrent programs since the state space increases exponentially with the number of concurrent processes. In this paper, we propose an approach in which 1) the concurrency control component of a concurrent program is formally specified, 2) it is verified automatically using model checking, and 3) the code for the concurrency control component is automatically generated. We use monitors as the synchronization primitive to control access to a shared resource by multiple concurrent processes. Since our approach decouples the concurrency control component from the rest of the implementation, it is scalable. We demonstrate the usefulness of our approach by applying it to a case study on Airport Ground Traffic Control. We use the Action Language to specify the concurrency control component of a system. Action Language is a specification language for reactive software systems. It is supported by an infinite-state model checker that can verify systems with boolean, enumerated and unbounded integer variables. Our code generation tool automatically translates the verified Action Language specification into a Java monitor. Our translation algorithm employs symbolic manipulation techniques and the specific notification pattern to generate an optimized monitor class by eliminating the context switch overhead introduced as a result of unnecessary thread notification. Using counting abstraction, we show that we can automatically verify the monitor specifications for an arbitrary number of threads.
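As a hedged sketch of the kind of Java monitor such a translation might produce (hypothetical class and condition names, not the tool's actual output, and written with java.util.concurrent.locks Conditions rather than the classic per-thread notification objects), the code below follows the spirit of the specific notification pattern: each guard gets its own Condition, so a state change wakes only the threads whose guard may now hold instead of notifying every waiter.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical monitor guarding a shared resource with a bounded number of
 * slots. Waiters for "a slot is free" and waiters for "a slot is in use"
 * block on separate Condition objects, so releasing a slot never wakes
 * threads that are waiting on the other guard.
 */
public final class SlotMonitor {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFree  = lock.newCondition();
    private final Condition slotInUse = lock.newCondition();
    private final int capacity;
    private int used = 0;

    public SlotMonitor(int capacity) { this.capacity = capacity; }

    public void acquire() throws InterruptedException {
        lock.lock();
        try {
            while (used == capacity) slotFree.await(); // guard: a slot is free
            used++;
            slotInUse.signal();                        // wake only the relevant waiters
        } finally {
            lock.unlock();
        }
    }

    public void release() {
        lock.lock();
        try {
            if (used == 0) throw new IllegalStateException("nothing to release");
            used--;
            slotFree.signal();                         // wake only blocked acquirers
        } finally {
            lock.unlock();
        }
    }

    /** Blocks until at least one slot is in use (a consumer-style guard). */
    public void awaitInUse() throws InterruptedException {
        lock.lock();
        try {
            while (used == 0) slotInUse.await();
        } finally {
            lock.unlock();
        }
    }
}
```

Splitting the waiters across conditions in this way is what avoids the unnecessary thread notifications, and hence the context switches, that a single notifyAll-style monitor would incur.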
ISBN (print): 9781450320658
The first version of the MPI standard was released in November 1993. At the time, many of the authors of this standard, myself included, viewed MPI as a temporary solution, to be used until it is replaced by a good programming language for distributed memory systems. Almost twenty years later, MPI is the main programming model for High-Performance Computing, and practically all HPC applications use MPI, which is now in its third generation; nobody expects MPI to disappear in the coming decade. The talk will discuss some plausible reasons for this situation, and the implications for research on new programming models for Extreme-Scale Computing.
ISBN (print): 0769505892
Distributed shared memory (DSM) machines provide the shared memory paradigm and achieve high performance by the caching of shared data. However, they suffer from cache miss and remote access latency with coarse-grain patterns. In this paper we suggest the combination of bulk transfer and prefetching as a new latency hiding technique in DSM machines. The purpose of bulk transfer is to replicate remote data into local memory and thus reduce remote accesses. Adaptive granularity (AG) was used for bulk transfer. Prefetching is added to fetch replicated data to the cache at the right time. We could apply simple prefetch scheduling as in uniprocessors since bulk transfer converts remote accesses into local ones. Simulation results show the reduced latency and the potential of AG as a preferable architecture for prefetching in DSM machines.
Incremental stack-copying is a technique which has been successfully used to support efficient parallel execution of a variety of search-based AI systems, e.g., logic-based and constraint-based systems. The idea of incremental stack-copying is to only copy the difference between the data areas of two agents, instead of copying them entirely, when distributing parallel work. In order to further reduce the communication during stack-copying and make its implementation efficient on message-passing platforms, a new technique, called stack-splitting, has recently been proposed. In this paper, we describe a scheme to effectively combine stack-splitting with incremental stack-copying, to achieve superior parallel performance in a non-shared memory environment. We also describe a scheduling scheme for this incremental stack-splitting strategy. These techniques are currently being implemented in the PALS system, a parallel constraint logic programming system.
The problem of adjusting M of N > M parallel activities into a procedure chain is one type of project scheduling problem. For the case M = 3, a new polynomial-time algorithm is proposed to minimize the total tardiness criterion. To derive this algorithm and search for the optimal procedure chain, we propose a Normal Chain Theory based on the relationships among the activities' time parameters, together with the property that the optimal chain contains the activities with the minimum earliest finish times. Analysis of the algorithm shows that its time complexity is O(N log N).
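For reference, total tardiness is the standard scheduling objective; with $C_j$ denoting the completion time and $d_j$ the due date of activity $j$ (symbols assumed here for illustration, not taken from the paper), the criterion to be minimized reads:

```latex
% Total tardiness of N activities: the sum of each activity's lateness,
% clipped at zero (finishing early earns no credit).
\[
  T_{\text{total}} \;=\; \sum_{j=1}^{N} \max\bigl(0,\; C_j - d_j\bigr)
\]
```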
Some important issues in engineering the requirements of a distributed software system and methods that facilitate software system design for distributed or parallel implementations are discussed. The issues are presented from a knowledge engineering perspective and are divided into four levels: acquisition; representation; structuring; and design. The acquisition level entails the methods for eliciting system requirements data (attributes and relationships of software entities) from the end-user group using a model of context classes. The representation level deals with the language paradigm for representing the attributes and relationships of the software entities. The structuring level addresses methods for rearranging and grouping the software objects of the context classes into related clusters. The design level deals with methods for mapping or transforming the clusters of software objects into specification modules to facilitate distributed design. To this end, the design level uses an object-based paradigm for specifying the attributes and abstract behavior of the objects within the modules.