In the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 216-228 (July 1991), we presented several new results in the theory of homogeneous multiprocessor scheduling. A directed acyclic graph (DAG) of tasks was to be scheduled. Tasks were assumed to be parallelizable: as more processors are applied to a task, the time taken to compute it decreases, yielding some speedup. Because of communication, synchronization, and task-scheduling overheads, this speedup increases less than linearly with the number of processors applied. The optimal scheduling problem is to determine the number of processors assigned to each task, and the task sequencing, so as to minimize the finishing time. Using optimal control theory, in the special case where the speedup function of each task is p^α (where p is the amount of processing power applied to the task), a closed-form solution for task graphs formed from parallel and series connections was derived. This paper considerably extends these techniques to arbitrary DAGs and applies them to matrix arithmetic compilation. The optimality conditions impose nonlinear constraints on the flow of processing power from predecessors to successors, and on the finishing times of siblings. This paper presents a fast algorithm for determining and solving these nonlinear equations. The algorithm exploits the structure of the finishing-time equations to run a conjugate gradient minimization efficiently, leading to the optimal solution. The algorithm has been tested on a variety of DAGs, and the results presented show that it is superior to alternative heuristic approaches.
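A minimal sketch (not taken from the paper) of the equal-finishing-time condition the abstract alludes to: under the speedup model, a task of work w run on p processors finishes in time w / p^α, and for two independent tasks sharing a fixed processor budget P, the makespan is minimized when both tasks finish simultaneously, which gives the closed-form split p1/p2 = (w1/w2)^(1/α). The function name and example values are illustrative assumptions.

```python
def split_two_parallel_tasks(w1, w2, P, alpha):
    """Closed-form allocation (p1, p2) with p1 + p2 = P that equalizes
    the finishing times w1 / p1**alpha and w2 / p2**alpha.

    Setting w1/p1**alpha == w2/p2**alpha gives (p1/p2)**alpha == w1/w2,
    i.e. p1/p2 == (w1/w2)**(1/alpha).
    """
    r = (w1 / w2) ** (1.0 / alpha)  # optimal ratio p1 / p2
    p1 = P * r / (1.0 + r)
    p2 = P - p1
    return p1, p2

# Illustrative values: an 8-unit task and a 1-unit task on 10 processors,
# with sublinear speedup alpha = 0.5.
p1, p2 = split_two_parallel_tasks(w1=8.0, w2=1.0, P=10.0, alpha=0.5)
t1 = 8.0 / p1 ** 0.5
t2 = 1.0 / p2 ** 0.5
# t1 and t2 agree: both tasks finish at the same time, the sibling
# finishing-time constraint mentioned in the abstract.
```

For arbitrary DAGs these equal-finishing-time and flow constraints no longer have a closed form, which is why the paper resorts to a structured conjugate gradient minimization.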
ISBN (print): 9780818655104
Multithreaded architectures context switch between instruction streams to hide memory access latency. Although this improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce interconnect traffic is to co-locate threads that share data on the same processor. Multiple threads sharing data in the cache should reduce compulsory and invalidation misses, thereby improving execution time. To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained nearly constant across all placement algorithms, for all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the critical factor affecting performance. Our results were due to one or both of the following reasons: (1) the sequential and uniform access of shared data by the application's threads and (2) the insignificant number of data references that require interconnect access, relative to the total number of instructions.
We introduce Autonomous SIMD (ASIMD) massively parallel architecture, then look at the flexibility, cost, and effectiveness of MIMD and ASIMD parallel systems. We show that ASIMD systems such as the MasPar MP-1 and MP...