Efficient scheduling of processes on the processors of a Network of Workstations (NOW) is essential for good system performance. The design of such schedulers involves a complex interaction between several system and workload parameters. Two operations, waiting for a message and arrival of a message, can be used to take remedial actions that guide the behavior of the system towards coscheduling using local information. An intensive implementation and evaluation exercise in studying these systems is presented.
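
As a concrete illustration of the two local events named above, the following sketch shows one way a node-local scheduler could react to them: spin briefly on a message wait before blocking, and boost the receiver's priority when a message arrives. This is a simplified simulation under assumed parameters (the spin limit, the priority boost, and the Process/mailbox structures are all illustrative), not the scheduler evaluated in the paper.

    import time
    from collections import deque

    SPIN_LIMIT_US = 200        # assumed spin threshold before blocking (illustrative)
    ARRIVAL_BOOST = 1          # assumed priority boost applied on message arrival

    class Process:
        def __init__(self, pid):
            self.pid = pid
            self.priority = 0
            self.state = "runnable"    # runnable | blocked

    def wait_for_message(proc, mailbox, spin_limit_us=SPIN_LIMIT_US):
        # Two-phase wait: spin briefly, betting that the sender is coscheduled,
        # then block and yield the processor -- the first local heuristic.
        deadline = time.perf_counter() + spin_limit_us / 1e6
        while time.perf_counter() < deadline:
            if mailbox:
                return mailbox.popleft()     # message arrived while spinning
        proc.state = "blocked"               # give up the CPU so another process can run
        return None

    def deliver_message(proc, mailbox, msg):
        # On arrival, wake the receiver and boost its priority so it tends to run
        # while its communicating peer is running -- the second local heuristic.
        mailbox.append(msg)
        if proc.state == "blocked":
            proc.state = "runnable"
        proc.priority += ARRIVAL_BOOST

    # Tiny demonstration: p1 waits, times out and blocks; a later message wakes it.
    p1, box = Process(1), deque()
    assert wait_for_message(p1, box) is None and p1.state == "blocked"
    deliver_message(p1, box, "data")
    assert p1.state == "runnable" and box[0] == "data"
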
A key issue for parallel systems is the development of useful programming abstractions that can coexist with good performance. We describe a communication library that supports an object-based abstraction with a bulk-synchronous communication style; this is the first time such a library has been proposed and implemented. By restricting the library to the exclusive use of barrier synchronization, we are able to design a simple and easy-to-use object system. By exploiting established techniques based on the bulk-synchronous parallel (BSP) model, we are able to design algorithms and library implementations that work well across platforms.
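
The bulk-synchronous style described here can be illustrated with a toy object store in which remote writes made during a superstep become visible only at the collective barrier. The class and method names (ToyBSP, put, get, sync) are generic BSP vocabulary chosen for the sketch, not the library's actual API.

    class ToyBSP:
        # Toy bulk-synchronous object store: remote writes made during a superstep
        # are buffered and become visible only after the collective barrier (sync).
        def __init__(self, nprocs):
            self.nprocs = nprocs
            self.store = [{} for _ in range(nprocs)]   # per-process object space
            self.pending = []                          # (dest, name, value) buffered puts

        def put(self, dest, name, value):
            # Buffered during the superstep; not yet visible at 'dest'.
            self.pending.append((dest, name, value))

        def get(self, pid, name):
            # Reads see only values installed by a previous superstep's barrier.
            return self.store[pid].get(name)

        def sync(self):
            # Barrier: deliver all buffered communication, then start a new superstep.
            for dest, name, value in self.pending:
                self.store[dest][name] = value
            self.pending.clear()

    bsp = ToyBSP(nprocs=4)
    bsp.put(dest=1, name="x", value=42)
    assert bsp.get(1, "x") is None     # not visible before the barrier
    bsp.sync()
    assert bsp.get(1, "x") == 42       # visible after the barrier
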
A simple, asynchronous, space-efficient scheduling algorithm for shared memory machines was developed. The algorithm combined the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. The algorithm was applied in the context of a native, user-level implementation of Posix standard threads (Pthreads). Its performance was evaluated using a set of C-based benchmarks and was compared with two other schedulers. The new algorithm covered a range of scheduling granularities and space requirements, and allowed the user to trade off a program's space requirement against its scheduling granularity.
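
The combination of work stealing with a depth-first order can be sketched as follows: each worker pops its own most recently spawned task (for locality), while an idle worker steals the task that comes earliest in the serial depth-first order (which keeps the live task set close to the serial schedule's footprint). This is a single-threaded toy under assumed data structures, not the paper's Pthreads runtime, and it omits granularity control entirely.

    from collections import deque

    class DFScheduler:
        # Simplified sketch of a space-conscious work stealer: own work is taken
        # LIFO (depth-first, good locality); steals target the deque whose oldest
        # task is earliest in the serial depth-first order.
        def __init__(self, nworkers):
            self.deques = [deque() for _ in range(nworkers)]

        def spawn(self, worker, task, df_rank):
            # df_rank: position of the task in the serial depth-first order (lower = earlier).
            self.deques[worker].append((df_rank, task))

        def next_task(self, worker):
            if self.deques[worker]:
                return self.deques[worker].pop()[1]        # own work, LIFO for locality
            victims = [d for d in self.deques if d]
            if not victims:
                return None
            victim = min(victims, key=lambda d: d[0][0])   # deque whose oldest task is earliest
            return victim.popleft()[1]                     # steal from the FIFO end

    s = DFScheduler(nworkers=2)
    s.spawn(0, "A", df_rank=0)
    s.spawn(0, "B", df_rank=1)
    assert s.next_task(0) == "B"    # local pop: most recently spawned task
    assert s.next_task(1) == "A"    # idle worker steals the depth-first-earliest task
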
A malleable task is a computational unit which may be executed on any arbitrary number of processors, its execution time depending on the amount of resources allotted to it. In accordance with the standard behavior of parallel applications, we assume that the malleable tasks are monotonic, i.e., that the execution time decreases with the number of processors while the computational work increases. This paper presents a new approach for scheduling a set of independent malleable tasks which leads to a worst-case guarantee of √3 for the minimization of the parallel execution time, or makespan. It improves all other existing practical results, including the two-phase method introduced by Turek et al. The main idea is to transfer the difficulty of a two-phase method from the scheduling part to the allotment selection. We show how to formulate this last problem as a knapsack optimization problem. The scheduling problem is then solved by a dual approximation which leads to a simple structure of two consecutive shelves.
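
One way to write the allotment-selection step as a knapsack-style optimization, under simplifying assumptions (monotonic tasks, a dual-approximation guess λ of the optimal makespan, two shelves of heights λ and λ/2), is sketched below. The notation γ_i(d) for the fewest processors letting task i finish within time d, and w_i(p) for its work on p processors, is introduced here for illustration and may differ from the paper's exact formulation.

    % lambda: dual-approximation guess of the optimal makespan
    % gamma_i(d): fewest processors on which task i completes within time d
    % w_i(p) = p * t_i(p): work of task i on p processors;  m: number of processors
    % S: tasks placed on the tall shelf (height lambda); the rest go on the short shelf (height lambda/2)
    \min_{S \subseteq T} \; \sum_{i \in S} w_i\bigl(\gamma_i(\lambda)\bigr)
        + \sum_{i \notin S} w_i\bigl(\gamma_i(\lambda/2)\bigr)
    \quad \text{subject to} \quad \sum_{i \in S} \gamma_i(\lambda) \le m .
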
We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis. Our method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the data for the 1-dimensional FFTs are contiguous. We present an I/O complexity analysis for this method as well as empirical results for a DEC 2100 server, an SGI Origin 2000, and a Beowulf cluster, each of which has a parallel disk system.
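
The dataflow of the dimension-by-dimension method can be shown with its in-core analogue: transform the contiguous dimension with 1-D FFTs, then transpose so the next dimension becomes contiguous, and repeat. In the out-of-core setting the transpose is the parallel-disk transpose described above; this numpy sketch shows only the ordering of operations, not the I/O scheduling.

    import numpy as np

    def fft_dim_by_dim(x):
        # Apply 1-D FFTs along the contiguous (last) axis, then rotate the axes so
        # that the next dimension becomes contiguous; repeat once per dimension.
        for _ in range(x.ndim):
            x = np.fft.fft(x, axis=-1)             # 1-D FFTs along the contiguous axis
            x = np.moveaxis(x, -1, 0).copy()       # "transpose": next axis is now contiguous
        return x

    # The result matches a library multidimensional FFT on a small in-core example.
    a = np.random.rand(4, 8, 16)
    assert np.allclose(fft_dim_by_dim(a), np.fft.fftn(a))
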
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms; we provide an algorithmic explanation of this phenomenon. We demonstrate that carrying the recursive layout down to the level of individual matrix elements is counter-productive, and that a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. We evaluate five recursive layouts with successively increasing complexity of address computation, and show that addressing overheads can be kept under control even for the most computationally demanding of these layouts. Finally, we provide a critique of the Cilk system that we used to parallelize our code.
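
A hybrid layout of the kind reported above as the better performer (tiles arranged in a recursive Z-order, elements within a tile in canonical row-major order) can be sketched by the address computation below. The tile size and function names are illustrative assumptions, not the paper's parameters.

    def morton_index(bi, bj, bits=16):
        # Z-order (Morton) number of tile (bi, bj): interleave the bits of the two
        # tile coordinates, which is one family of recursive layouts.
        z = 0
        for k in range(bits):
            z |= ((bi >> k) & 1) << (2 * k + 1)
            z |= ((bj >> k) & 1) << (2 * k)
        return z

    def offset(i, j, tile=64):
        # Linear offset of element (i, j) under the hybrid layout: tiles in
        # recursive Morton order, elements inside a tile in row-major order.
        # The tile size of 64 is an illustrative assumption.
        bi, bj = i // tile, j // tile            # which tile
        ti, tj = i % tile, j % tile              # position within the tile
        return morton_index(bi, bj) * tile * tile + ti * tile + tj

    # Elements of a tile are contiguous; tile (1,1) follows tiles (0,0), (0,1), (1,0).
    assert offset(0, 0) == 0
    assert offset(0, 1) == 1
    assert offset(64, 64, tile=64) == 3 * 64 * 64
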
Fast commodity network-connected PC or workstation clusters are becoming increasingly popular. This popularity can be attributed to their ability to provide high-performance parallel computing on a relatively inexpensive platform. An accurate global clock is invaluable for these systems, both for measuring network performance and for coordinating distributed applications. Typically, however, these systems do not include dedicated clock synchronization support. Previous clock synchronization methods are generally not suitable here, either because they require extra, non-commodity hardware or because they provide insufficient clock accuracy. In this paper we present and evaluate an adaptive clock synchronization algorithm. We have implemented and tested the algorithm on our Myrinet-based PC cluster, where it is regularly used as part of a parallel performance tool. The algorithm has several important features. First, it does not require any extra hardware support. Second, we show that it imposes very low overhead on the system and has microsecond-level accuracy. Finally, our results indicate that adding the ability to adaptively adjust the clock's re-synchronization period causes almost no extra overhead while achieving much better global clock accuracy.
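
The general shape of software-only synchronization with an adaptive re-synchronization period can be sketched as follows: estimate the clock offset from the probe with the smallest round trip, then lengthen the period when observed drift stays small and shorten it otherwise. The constants and the remote_clock stand-in are illustrative; this is the generic pattern, not the paper's algorithm.

    import time

    def estimate_offset(remote_clock, local_clock=time.perf_counter, probes=8):
        # Keep the probe with the smallest round-trip time and assume the remote
        # timestamp was taken at its midpoint.  'remote_clock' stands in for a
        # request/reply over the cluster network.
        best_rtt, best_offset = float("inf"), 0.0
        for _ in range(probes):
            t0 = local_clock()
            remote = remote_clock()              # in reality: send request, read reply
            t1 = local_clock()
            rtt = t1 - t0
            if rtt < best_rtt:
                best_rtt = rtt
                best_offset = remote - (t0 + rtt / 2.0)
        return best_offset, best_rtt

    def next_resync_period(period, drift, tolerance=50e-6,
                           min_period=0.5, max_period=60.0):
        # Adaptive re-synchronization: back off (longer period, lower overhead)
        # when the observed drift stayed well inside the tolerance, tighten otherwise.
        # All constants here are illustrative assumptions.
        if abs(drift) < tolerance / 2:
            return min(period * 2.0, max_period)
        return max(period / 2.0, min_period)

    # Toy use with a simulated remote clock that runs 1 ms ahead.
    offset, rtt = estimate_offset(lambda: time.perf_counter() + 1e-3)
    print(f"estimated offset {offset * 1e6:.1f} us, best rtt {rtt * 1e6:.1f} us")
    print("next period:", next_resync_period(10.0, drift=5e-6))
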
In this paper we demonstrate that parallelism and fill can be traded off in orders for Gaussian elimination. While the well-known nested dissection algorithm produces very parallel elimination orders, we show that by reducing the parallelism it is possible to reduce the fill that the orders generate. In particular, we present a new "less parallel nested dissection" algorithm (LPND). We prove that, unlike standard nested dissection, when applied to a chordal graph LPND finds a zero-fill elimination order. Our implementation of LPND generates less fill than state-of-the-art implementations of the nested dissection (METIS), minimum-degree (AMD), and hybrid (BEND) algorithms on a large body of test matrices, at the cost of a small reduction in the parallelism in the orders that it produces. We have also implemented a nested dissection algorithm that is different from METIS and that uses the same separator algorithm used by our implementation of LPND. This algorithm, like LPND, generates less fill than METIS, and on large graphs generates significantly less fill than AMD. The latter comparison is notable because, although it is known that for certain classes of graphs minimum-degree produces asymptotically more fill than nested dissection, minimum-degree is believed to produce low-fill orderings in practice. Our experiments contradict this belief.
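
For reference, the classical nested dissection scheme that LPND modifies can be sketched generically: find a separator, order the two parts first, and eliminate the separator last, so the parts share no fill and can be factored in parallel. The separator routine below is a toy for path graphs; the sketch does not implement LPND's parallelism-for-fill trade-off.

    def nested_dissection(vertices, neighbors, find_separator):
        # Classical nested dissection ordering: recursively split the graph with a
        # separator, order the two parts first and the separator vertices last.
        # 'neighbors' is passed through for separator routines that need adjacency.
        if len(vertices) <= 2:                      # small base case: any order is fine
            return list(vertices)
        sep, left, right = find_separator(vertices, neighbors)
        order = nested_dissection(left, neighbors, find_separator)
        order += nested_dissection(right, neighbors, find_separator)
        order += list(sep)                          # separator eliminated last
        return order

    # Toy separator for a path graph 0-1-...-(n-1): the middle vertex separates it.
    def path_separator(vertices, neighbors):
        vs = sorted(vertices)
        mid = vs[len(vs) // 2]
        return [mid], [v for v in vs if v < mid], [v for v in vs if v > mid]

    n = 7
    nbrs = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    print(nested_dissection(list(range(n)), nbrs, path_separator))
    # [0, 2, 1, 4, 6, 5, 3]: subpath middles come late, the global separator 3 last
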
A computer system is useless unless it can interact with the outside world through input/output (I/O) devices. I/O systems are complex, including aspects such as memory-mapped operations, interrupts, and bus bridges. Often, I/O behavior is described for isolated devices without a formal description of how the complete I/O system behaves. The lack of an end-to-end system description makes the tasks of system programmers and hardware implementors more difficult to do correctly. This paper proposes a framework for formally describing I/O architectures called Wisconsin I/O (WIO). WIO extends work on memory consistency models (that formally specify the behavior of normal memory) to handle considerations such as memory-mapped operations, device operations, interrupts, and operations with side effects. Specifically, WIO asks each processor or device that can issue k operation types to specify ordering requirements in a k×k table. A system obeys WIO if there always exists a total order of all operations that respects processor and device ordering requirements and has the value of each "read" equal to the value of the most recent "write" to that address. This paper then presents examples of WIO specifications for systems with various memory consistency models including sequential consistency (SC), SPARC TSO, an approximation of Intel IA-32, and Compaq Alpha. Finally, we present a directory-based implementation of an SC system, and we sketch a proof which shows that the implementation conforms to its WIO specification.
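
The correctness condition quoted above (existence of a total order respecting the per-issuer k×k ordering table and returning, for each read, the most recent write to the same address) can be illustrated with a brute-force checker for tiny traces. The operation and table encodings are assumptions made for this sketch, not WIO's formal notation.

    from itertools import permutations

    def obeys_wio(ops, ordering_table):
        # Does some total order of all operations (1) respect each issuer's pairwise
        # ordering requirements, given as a table (earlier_type, later_type) -> bool,
        # and (2) return for every read the most recent prior write to that address?
        # Exponential toy checker, suitable only for very small traces.
        def ordered_ok(seq):
            for a in range(len(seq)):
                for b in range(a + 1, len(seq)):
                    x, y = seq[a], seq[b]
                    # Same issuer, y was issued before x, and the table requires
                    # y's type to stay before x's type: this order is invalid.
                    if (x["issuer"] == y["issuer"] and x["seq"] > y["seq"]
                            and ordering_table.get((y["type"], x["type"]), False)):
                        return False
            return True

        def values_ok(seq):
            mem = {}
            for op in seq:
                if op["type"] == "write":
                    mem[op["addr"]] = op["value"]
                elif op["type"] == "read" and mem.get(op["addr"]) != op["value"]:
                    return False
            return True

        return any(ordered_ok(seq) and values_ok(seq) for seq in permutations(ops))

    # Tiny example with an SC-like table: all same-issuer pairs must keep issue order.
    table = {(a, b): True for a in ("read", "write") for b in ("read", "write")}
    trace = [
        {"issuer": "P0", "seq": 0, "type": "write", "addr": "x", "value": 1},
        {"issuer": "P1", "seq": 0, "type": "read",  "addr": "x", "value": 1},
    ]
    print(obeys_wio(trace, table))   # True: the order W(x)=1 then R(x)=1 satisfies both rules
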