Dense linear algebra computations such as matrix factorization require the technique of 'block-partitioned algorithms' for their efficient implementation on memory-hierarchy processors. For scalar-based distributed memory multiprocessors, the register, cache and off-processor memory levels of the memory hierarchy all affect the optimal block-partition size for such algorithms. Most studies on matrix factorization and similar algorithms have assumed the block-partition size or panel width for the algorithm, w, to be the same as the matrix distribution block size, r, where a rectangular block-cyclic matrix distribution is being employed. Here the choice of w=r is essentially determined by the off-processor memory level of the memory hierarchy, with the value of w being a tradeoff between communication startup overhead and load balance considerations. In this paper, we re-examine this assumption in the context of LU and Cholesky factorization of block-cyclic distributed matrices on scalar-based distributed memory multiprocessors, such as the Fujitsu AP1000. Here considerations of the register and cache levels of the hierarchy require a large w. We find that the choice of w, given w=r, leads to a tradeoff between load balance and optimal use of the register and cache levels of the hierarchy (rather than communication startup), and that this tradeoff substantially limits performance. We then briefly describe 'distributed panels' versions of these algorithms, where generally w>r, which effectively diminish this tradeoff to an O(w/N) fraction of the overall computation, where N is the matrix size. Two variants of these versions, one with single rows/columns being communicated, and one with single block rows/columns being communicated, are analyzed for their load balance properties. The results of the distributed panels versions of the algorithms on the Fujitsu AP1000, a scalar-based distributed memory multiprocessor, are given, which give significantly superior performa
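The load-balance side of the tradeoff described in this abstract can be illustrated with a minimal sketch of a one-dimensional block-cyclic distribution. The function names, the use of a single dimension, and the "max over mean" imbalance measure are assumptions made for illustration; they are not taken from the paper.

```python
def block_cyclic_owner(i, r, p):
    """Index of the process owning global row (or column) i.

    Assumes a 1-D block-cyclic distribution with block size r over p processes.
    """
    return (i // r) % p


def trailing_counts(n, k, r, p):
    """Number of rows with index >= k held by each of the p processes.

    These rows stand in for the trailing submatrix that remains to be
    updated after k rows of an n x n matrix have been factorized.
    """
    counts = [0] * p
    for i in range(k, n):
        counts[block_cyclic_owner(i, r, p)] += 1
    return counts


def imbalance(counts):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 1.0


if __name__ == "__main__":
    n, k, p = 64, 16, 4
    for r in (1, 4, 16):
        counts = trailing_counts(n, k, r, p)
        print(f"r={r:2d} counts={counts} imbalance={imbalance(counts):.3f}")
```

Under these assumptions, a larger distribution block size r leaves some processes with no trailing rows at all part-way through the factorization, which is the load-balance cost that a large w incurs when w=r is forced.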
This paper presents the performance analysis of realizing median filtering on a distributed multiprocessor system. The results of the performance analysis give a good indication of the performance gain from using a multiprocessor rather than a uniprocessor for median filtering. This performance gain is proportional to the problem size, as shown by varying the size of the image. Furthermore, the analysis makes clear that the computation time and inter-processor communications scale well with the number of processors in the system. However, the overall system performance does not scale in the same way, because the initialization overhead dominates the computation time as the number of processors increases beyond a certain point. It is because of this relationship that optimal performance is achieved with a certain number of processors. It is also found that this number varies with the problem size. In addition, the subimage model is found to be an acceptable approach for this type of processing, as only the necessary parts of the image are sent to the other processors. The master and slave scheme proves to be easy for programming, control and data manipulation. As a whole, this type of non-linear processing seems to fit well into the MIMD architecture.
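The subimage model mentioned in this abstract can be sketched as follows, assuming a 3x3 median window, row-strip partitioning, one halo row on each side of a strip, and replicated image borders. None of these details are stated in the abstract, and the master/slave message passing is simulated here by a sequential loop rather than real inter-processor communication.

```python
def median3x3_rows(img, lo, hi):
    """3x3 median filter for rows lo..hi-1 of img, replicating the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(lo, hi):
        row = []
        for x in range(w):
            # Gather the 3x3 neighbourhood, clamping coordinates at the edges.
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(sorted(window)[4])  # middle of 9 sorted values
        out.append(row)
    return out


def split_with_halo(img, workers):
    """Cut img into row strips, each extended by one halo row where available.

    Returns (subimage, offset, count) tuples: the slave filters `count` rows
    of `subimage`, starting at row `offset`.
    """
    h = len(img)
    tasks = []
    for k in range(workers):
        start, end = k * h // workers, (k + 1) * h // workers
        top, bottom = max(0, start - 1), min(h, end + 1)
        tasks.append((img[top:bottom], start - top, end - start))
    return tasks


def filter_distributed(img, workers):
    """Master: send only the needed subimages, then reassemble the results."""
    result = []
    for sub, offset, count in split_with_halo(img, workers):
        # In a real system this call would run on a slave processor.
        result.extend(median3x3_rows(sub, offset, offset + count))
    return result


if __name__ == "__main__":
    image = [[(7 * y + 3 * x) % 11 for x in range(5)] for y in range(6)]
    whole = median3x3_rows(image, 0, len(image))
    print(filter_distributed(image, 3) == whole)
```

Because each strip carries its halo rows, every slave has the neighbours it needs, so the reassembled output matches filtering the whole image on one processor.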
In heterogeneous environments, very sophisticated modelling techniques and tools are required for the analysis of interactions between hardware and software subsystems. After a detailed discussion about performance ev...
The main objective of this workshop is to investigate parallel and distributed architectures, algorithms and data structures for the processing of spatial data. The aim is to identify advantages and disadvantages of t...
In this paper we describe an approach to coping with parallelism in symbolic applications. Our purpose is to build a parallel symbolic system suited to homogeneous and heterogeneous distributed memory parallel systems. ...
In page-based distributed shared memory systems, false sharing is caused by locating several consistency units in the same transportation unit. We introduce a structuring scheme which combines the transparent access ...
In distributed task ready queue organizations, task routing refers to how ready tasks are assigned to processors in the system and task scheduling refers to how these tasks are scheduled on the assigned processor. In ...
Computing is often viewed as a tool, and a computing system is typically treated as a toolbox of applications for performing operations such as calculations and data storage and transfer. Currently, humans develop thes...
The main algorithms for sequential and parallel discrete event simulations are introduced. A set of different simulators is evaluated and compared using a transputer-based multicomputer: sequential, parallel conservat...
The objective of Open Distributed Processing (ODP) is to support the construction of distributed systems in a multi-vendor environment through the provision of an architectural framework that such systems must adhere to. However, without a means to assess conformance, the value of this architecture is limited. This paper describes a conformance assessment methodology suitable for Open Distributed Processing; the methodology includes both testing and specification checking. We also discuss the scope of the methodology, which can be seen to support both de jure and de facto standards.