Due to the increased use of parallelism in computer systems, designers need tools to evaluate job performance and match problems with candidate architectures. In this paper, jobs are modeled using continuous job profiles. Five classes of continuous job profiles are analyzed in terms of the number of processors used under the processor sweeping scheduling discipline. Characteristics of each job class are discussed, and analytical results are used to develop design guidelines to determine how speedup and efficiency are affected by the addition of processing elements in a multiple processor environment.
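As a rough illustration of the speedup/efficiency tradeoff the abstract's design guidelines address, the sketch below uses Amdahl's law as a stand-in for the paper's continuous job profiles (the profiles themselves are not reproduced here, so this is an assumed model, not the paper's):

```python
def amdahl_speedup(serial_fraction, p):
    """Speedup of a job with a fixed serial fraction on p processors
    (Amdahl's law, used here as an illustrative job model)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def efficiency(serial_fraction, p):
    """Efficiency = speedup per processor; it falls as processors are added."""
    return amdahl_speedup(serial_fraction, p) / p

# Adding processors raises speedup but lowers efficiency.
for p in (1, 2, 4, 8, 16):
    s = amdahl_speedup(0.1, p)
    print(f"p={p:2d}  speedup={s:5.2f}  efficiency={s / p:.2f}")
```

The diminishing efficiency column is exactly the kind of quantity a designer would weigh when deciding whether another processing element is worth adding.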
A model of parallel computation is introduced which employs the PRAM as a sub-model, while simultaneously being more reflective of realistic parallel architectures by accounting for and providing abstract control over communication and synchronization costs. Cost control is achieved via the representation of general degrees of locality ('neighborhoods' of activity). The model organizes 'control asynchrony' via an implicit hierarchy relation, and restricts 'communication asynchrony' in order to obtain determinate algorithms.
This paper describes the design of a linear array for use in MIMD, SIMD, pipeline and SPMD modes of programming for image processing applications, and gives some evaluation results. A great variety of algorithms, from low-level algorithms used for pixel operations to high-level algorithms used for region treatments or possibly for symbolic processing, may be implemented on this single structure. The different memory banks of this structure are connected to the processors through a one-stage interconnection network. This network is linearly expandable and provides direct access from each processor to a subset of memory banks. The processors have been designed so that they can operate in different modes. They can manage various memory organisations through memory address calculation capabilities. Each processor is structured around several general-purpose processing units, as in the VLIW model. The optimal number of these units is discussed.
Nonuniform memory access time (referred to as NUMA) is an important feature in the design of large-scale shared memory multiprocessors. However, one implication of the NUMA architecture is that locality of the data is crucial to performance. In the absence of any system support, the programmer is forced to explicitly manage these locality issues, significantly increasing the development time for parallel applications. The authors have enhanced BBN's Uniform System programming environment to allow a programmer to easily separate data structures based on expected reference patterns, instead of requiring an understanding of the NUMA memory architecture as the basis for allocation decisions. An operating system providing dynamic page placement can exploit the more homogeneous allocation of data structures to virtual pages.
The paper explores the effectiveness of multiply-twisted hypercube networks for parallel computing by considering interprocessor communication problems. It presents SIMD parallel data broadcasting, census, and shortest path finding algorithms for multiply-twisted hypercube networks. The data broadcasting algorithms take ⌈(n+1)/2⌉ communication steps to broadcast a message from a processor to all other processors in Q_n^MT (an n-dimensional multiply-twisted hypercube and its graph). Each processor performs at most one receive operation to receive the broadcast message from exactly one of its adjacent processors, and at most one send operation to send the broadcast message to a subset of its adjacent processors. The data broadcasting algorithms can easily be converted into census algorithms with similar performance. Moreover, it is shown that the data broadcasting algorithms can be used to send a message from any processor P_i to any other processor P_j in a multiply-twisted hypercube in no more than L(i,j)+1 communication steps, where L(i,j) is the length of the shortest path from P_i to P_j.
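For comparison, here is a minimal sketch of the classical binomial-tree broadcast on an ordinary (untwisted) n-dimensional hypercube, which needs n steps; the paper's contribution is to cut this to roughly half by exploiting the multiply-twisted topology. The adjacency rule of Q_n^MT itself is not reproduced here:

```python
def hypercube_broadcast(n, source=0):
    """Binomial-tree broadcast on an ordinary n-dimensional hypercube.

    Returns the step at which each of the 2**n processors receives the
    message; the whole broadcast takes n communication steps."""
    recv_step = {source: 0}
    for d in range(n):                      # one hypercube dimension per step
        for p in list(recv_step):
            neighbor = p ^ (1 << d)         # partner across dimension d
            if neighbor not in recv_step:   # each processor receives exactly once
                recv_step[neighbor] = d + 1
    return recv_step

steps = hypercube_broadcast(4)
assert len(steps) == 16 and max(steps.values()) == 4   # all 16 reached in 4 steps
```

Each processor receives from exactly one neighbor and forwards to a subset of its neighbors, matching the send/receive discipline the abstract describes, just on the plain hypercube.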
In a real-time application that supports imprecise computation, each task is logically composed of a hard task and a soft task. The hard task must be completed before its deadline. The soft task is an optional task that may not be executed to completion if insufficient computational resources are available. In the presented model, each task may be parallelized and executed on multiple processors, with a multiprocessing overhead that is assumed to be a linear function of the degree of parallelism. It is shown that the problem of time allocation in such a real-time application can be formulated and solved as a linear programming problem. An algorithm is given for constructing a multiprocessor schedule from the linear programming solution; this algorithm guarantees that the multiprocessing overhead generated in the schedule does not exceed a linear upper bound.
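The schedule-construction step can be illustrated with McNaughton's classical wrap-around rule, which builds a preemptive schedule on m processors from per-task time allocations; this is a standard technique offered only as an analogy to the paper's construction, and it ignores the multiprocessing overhead term:

```python
def wrap_schedule(times, m):
    """McNaughton-style wrap-around schedule: lay task pieces end to end
    and cut at the makespan bound max(max t_i, sum t_i / m).
    Returns, per processor, a list of (task, start, end) triples."""
    bound = max(max(times), sum(times) / m)
    procs = [[] for _ in range(m)]
    proc, t = 0, 0.0
    for task, dur in enumerate(times):
        remaining = dur
        while remaining > 1e-12:
            piece = min(remaining, bound - t)     # fill current processor
            procs[proc].append((task, t, t + piece))
            remaining -= piece
            t += piece
            if t >= bound - 1e-12:                # wrap to the next processor
                proc, t = proc + 1, 0.0
    return procs
```

Because every task's total time is at most the bound, a task that wraps onto the next processor never overlaps itself in time, so the preemptive schedule is valid.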
This paper presents Proteus, an architecture-independent language suitable for prototyping parallel and distributed programs. Proteus is a high-level imperative notation based on sets and sequences, with a single construct for the parallel composition of processes. Although a shared-memory model is the basis for communication between processes, this memory can be partitioned into shared and private variables. Parallel processes operate on individual copies of private variables, which are independently updated and may be merged into the shared state at specifiable barrier synchronization points. Several examples are given to illustrate how the various parallel programming models, such as synchronous data-parallelism and asynchronous control-parallelism, can be expressed in terms of this foundation. This common foundation allows prototypes to be tested, evolved and finally implemented through refinement techniques targeting specific architectures.
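A toy sketch of the private-copy/merge-at-barrier semantics described above, written in Python rather than Proteus notation; the merge rule shown (summing per-process deltas) is an illustrative choice, not one prescribed by the paper:

```python
def parallel_step(shared, workers, merge):
    """Proteus-style step (a sketch): each worker updates its own private
    copy of the shared state; the copies are merged back into the shared
    state only at the barrier, via a specifiable merge rule."""
    privates = [dict(shared) for _ in workers]   # one private copy per process
    for copy_, work in zip(privates, workers):
        work(copy_)                              # independent, unordered updates
    return merge(shared, privates)               # barrier: merge into shared state

def bump(private):
    private["count"] += 1        # touches its private copy only

# Merge rule (an assumed example): add each process's delta to the base value.
merged = parallel_step({"count": 0}, [bump, bump, bump],
                       lambda base, privs: {"count": base["count"]
                           + sum(p["count"] - base["count"] for p in privs)})
assert merged == {"count": 3}
```

Because no process sees another's private updates before the barrier, the result is the same regardless of the order in which the workers run, which is the determinacy the language design is after.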
Most current parallel computers are made from tens of processors that may communicate with each other (fairly slowly) by means of static intercommunication paths. However, in the future the use of optical communication media and wafer scale integration will facilitate construction of computers with thousands of simple processors each of which may communicate simultaneously with any other. What will these computers run? The authors present three novel algorithms that solve problems by communication (quicksort, tessellation and fractals). One, fractal image generation, works purely by iterative communication, which is interesting because studies of the human brain indicate that the connections between neurons are mainly responsible for our powers of thought. These algorithms combine features of both imperative and connectionist programming styles and arose from a systematic study of goal-directed program transformation, including the target architecture with the program specification. They are examples of a whole class of such algorithms which the authors expect will be developed similarly.
A systematic methodology for mapping a class of iterative algorithms onto processor array architectures is presented. The concept of the array architecture is combined efficiently with existing knowledge of compiler techniques. The initial description of the iterative algorithm is taken to be Fortran-like nested loops, without requiring transformation into any intermediate form such as uniform recurrence equations. The principles of Lamport's coordinate method are used. Depending on the systolic or non-systolic structure of the algorithm and/or the multidimensional mapping, the resulting architectures can be either regular arrays or piecewise regular arrays. The important subclass of algorithms known as weak single assignment codes is treated in an optimal way.
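Lamport's approach schedules loop iterations along hyperplanes of mutually independent points. A minimal sketch for a doubly nested loop in which (i, j) depends on (i-1, j) and (i, j-1), so all iterations with equal i + j can execute in parallel (this dependence pattern is assumed purely for illustration):

```python
def wavefronts(n, m):
    """Group the iterations of a doubly nested loop into hyperplane
    wavefronts i + j = t; iterations in the same wavefront carry no
    dependence on each other and may run concurrently."""
    fronts = {}
    for i in range(n):
        for j in range(m):
            fronts.setdefault(i + j, []).append((i, j))
    # Wavefronts must execute in increasing t to respect the dependences.
    return [fronts[t] for t in sorted(fronts)]

assert wavefronts(3, 3)[0] == [(0, 0)]   # the corner runs alone, then fronts widen
```

An n-by-m iteration space thus runs in n + m - 1 parallel steps instead of n * m sequential ones, which is the kind of mapping a processor array exploits directly.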
The cache or local memory thrashing problem arises very often in parallel processing architectures where each processor has its own local cache or memory and a write-back protocol is employed for cache coherence. To avoid large amounts of data moving back and forth between the caches or local memories of different processors, supporting techniques in the parallel compiler need to be developed. Based on the relations between array element accesses and enclosing loop indices in a nested parallel construct, the authors present approaches to reduce data movement between caches or local memories for parallel programs. By analyzing the array subscript expressions, the compiler lets each processor execute the iterations of parallel loops that operate on the data already in its cache or local memory at execution time. This particularly benefits parallel programs in which a parallel loop is enclosed by a sequential loop and array elements are repeatedly used in different iterations of the parallel loop.