Search results - Inner Mongolia University Library

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Deng, Xiaotie; Dymond, Patrick (York Univ, North York, Ont, Canada)

ISBN: (print) 9780897918091

The effectiveness of private caches for processors was studied. Since the time for all processors to access the shared memory simultaneously is usually much longer than the time for a processor to access its own private cache, scheduling with private caches falls into the distributed memory model where the lower bound applies. The effectiveness of private caches was shown by proving that a version of the Dynamic Equi-partition Scheduling Policy (DEQ) achieves a mean response time within five times the optimal mean response time in the cache clock time for a large class of parallel jobs well accepted in the parallel scheduling community. This shows an improvement of system performance by using private caches over that of purely shared memory.

Keywords: parallel processing systems
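
As a rough illustration of the equi-partition idea behind DEQ, the sketch below splits processors evenly among jobs, caps each job at the parallelism it can use, and redistributes the surplus. The function name `equi_partition` and the integer `demands` input are assumptions made for illustration; the abstract does not specify the paper's exact policy or its cache model, so this is the generic rule only.

```python
def equi_partition(num_procs, demands):
    """Allocate processors to jobs by equal shares, capped at each job's demand.

    demands[i] is the largest number of processors job i can use (assumed input).
    """
    alloc = [0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = num_procs
    while active and remaining > 0:
        share = remaining // len(active)
        if share == 0:
            # Fewer processors than unsatisfied jobs: one each, in index order.
            for i in active[:remaining]:
                alloc[i] += 1
            break
        # Jobs whose outstanding demand fits within the equal share.
        capped = [i for i in active if demands[i] - alloc[i] <= share]
        if capped:
            for i in capped:
                remaining -= demands[i] - alloc[i]
                alloc[i] = demands[i]
            active = [i for i in active if i not in capped]
        else:
            # Every job can use a full share; any remainder is handled next pass.
            for i in active:
                alloc[i] += share
            remaining -= share * len(active)
    return alloc


print(equi_partition(8, [1, 2, 10]))  # [1, 2, 5]
```

With 8 processors and demands of 1, 2 and 10, the two small jobs are satisfied first and the surplus goes to the large job.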

Deterministic sorting and randomized median finding on the BSP model

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Gerbessiotis, Alexandros V.; Siniolakis, Constantinos J. (Oxford Univ, Oxford, United Kingdom)

ISBN: (print) 9780897918091

We present new BSP algorithms for deterministic sorting and randomized median finding. We sort n general keys by using a partitioning scheme that achieves the requirements of efficiency (one-optimality) and insensitivity against data skew (the accuracy of the splitting keys depends solely on the step distance, which can be adapted to meet the worst-case requirements of our application). Although we employ sampling in order to realize efficiency, we can give a precise worst-case estimation of the maximum imbalance which might occur. We also investigate optimal randomized BSP algorithms for the problem of finding the median of n elements that require, with high probability, 3n/(2p)+o(n/p) comparisons, for a wide range of values of n and p. Experimental results for the two algorithms are also presented.

Keywords: parallel algorithms
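
The splitter-based partitioning mentioned in the abstract can be pictured with a generic regular-sampling sort, simulated sequentially below. This is not the authors' algorithm: the function name `sample_sort`, the way splitters are picked from the sample, and the `step` parameter (standing in for the "step distance") are assumptions for illustration.

```python
from bisect import bisect_right


def sample_sort(keys, p, step):
    """Partition keys into p sorted buckets using regularly sampled splitters."""
    n = len(keys)
    # Each simulated processor sorts its own block of roughly n/p keys.
    blocks = [sorted(keys[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Regular sample: every step-th key of each locally sorted block.
    sample = sorted(k for b in blocks for k in b[step - 1::step])
    # Pick p-1 splitters evenly spaced in the sorted sample.
    splitters = [sample[j * len(sample) // p] for j in range(1, p)] if sample else []
    buckets = [[] for _ in range(p)]
    for b in blocks:
        for k in b:
            # Keys between consecutive splitters go to the same bucket.
            buckets[bisect_right(splitters, k)].append(k)
    return [sorted(b) for b in buckets]


keys = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0, 11, 10]
buckets = sample_sort(keys, p=3, step=2)
print([len(b) for b in buckets])
```

A smaller `step` gives a denser sample and therefore tighter control over the largest bucket, at the price of more sample keys to sort.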

Dynamic load balancing framework for unstructured adaptive computations on distributed-memory multiprocessors

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Sohn, Andrew; Biswas, Rupak; Simon, Horst D. (New Jersey Inst of Technology, Newark, United States)

The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view each time the computational mesh is adapted. JOVE has been implemented on an SP2 in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. Furthermore, JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.

Keywords: parallel processing systems
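
For context, the reported 24-fold speedup on 64 processors can be restated as parallel efficiency, using the usual definition of speedup divided by processor count. The helper name `parallel_efficiency` is an assumption; the abstract itself reports only the speedup.

```python
def parallel_efficiency(speedup, num_procs):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_procs


print(parallel_efficiency(24, 64))  # 0.375
```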

Steady state analysis of diffracting trees

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Shavit, Nir; Upfal, Eli; Zemach, Asaph (Tel Aviv Univ, Israel)

Diffracting trees are an effective and highly scalable distributed-parallel technique for shared counting and load balancing. This paper presents the first steady-state combinatorial model and analysis for diffracting trees, and uses it to answer several critical algorithmic design questions. Our model is simple and sufficiently high level to overcome many implementation-specific details, and yet, as we will show, it is rich enough to accurately predict empirically observed behaviors. As a result of our analysis we were able to identify starvation problems in the original diffracting tree algorithm and modify it to create a more stable version. We are also able to identify the range in which the diffracting tree performs most efficiently, and the ranges in which its performance degrades. We believe our model and modeling approach open the way to steady-state analysis of other distributed-parallel structures such as counting networks and elimination trees.

Keywords: parallel processing systems
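
The shared-counting use of such trees can be sketched with a plain counting tree of toggle-bit balancers, simulated sequentially. This is a simplification: it leaves out the diffraction mechanism that gives diffracting trees their name, as well as all concurrency. The class name `CountingTree` and its layout are assumptions for illustration.

```python
class CountingTree:
    """Binary tree of toggle balancers with one counter per leaf."""

    def __init__(self, depth):
        self.depth = depth
        self.width = 1 << depth
        self.toggles = {}  # balancer path -> next output bit (default 0)
        # Leaf i hands out the values i, i + width, i + 2 * width, ...
        self.next_value = list(range(self.width))

    def get_and_increment(self):
        path = ()
        leaf = 0
        for level in range(self.depth):
            bit = self.toggles.get(path, 0)
            self.toggles[path] = bit ^ 1  # flip the balancer for the next token
            leaf |= bit << level          # the root decides the lowest-order bit
            path += (bit,)
        value = self.next_value[leaf]
        self.next_value[leaf] += self.width
        return value


tree = CountingTree(depth=2)
print([tree.get_and_increment() for _ in range(6)])  # [0, 1, 2, 3, 4, 5]
```

Successive tokens are spread over the four leaves in turn, so each leaf counter sees only a quarter of the traffic while the returned values stay consecutive.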

Optimal latency - throughput tradeoffs for data parallel pipelines

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Subhlok, Jaspal; Vondran, Gary (Carnegie Mellon Univ, Pittsburgh, PA, United States)

ISBN: (print) 9780897918091

This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general and realistic model of inter-task communication, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.

Keywords: parallel processing systems
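
The distinction between latency and throughput can be made concrete with a toy model of a pipeline whose modules may be replicated. The sketch ignores inter-task communication, which the paper does model, and does not attempt the dynamic-programming mapping itself. The function name `pipeline_metrics` and its inputs are assumptions for illustration.

```python
def pipeline_metrics(stage_times, replicas):
    """Return (latency, throughput) for a chain of replicated modules.

    stage_times[i]: time one copy of module i spends on one data set.
    replicas[i]:    number of copies of module i working on different data sets.
    """
    # One data set passes through every module once.
    latency = sum(stage_times)
    # The slowest module, after replication, limits the aggregate rate.
    throughput = min(r / t for t, r in zip(stage_times, replicas))
    return latency, throughput


print(pipeline_metrics([2.0, 4.0], [1, 1]))  # (6.0, 0.25)
print(pipeline_metrics([2.0, 4.0], [1, 2]))  # (6.0, 0.5)
```

Replicating the slow module doubles throughput while leaving latency unchanged, which is why the two criteria have to be optimized separately.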

Analysis of dag-consistent distributed shared-memory algorithms

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Blumofe, Robert D.; Frigo, Matteo; Joerg, Christopher F.; Leiserson, Charles E.; Randall, Keith H. (Univ of Texas at Austin, Austin, TX, United States)

In this paper, we analyze the performance of parallel multithreaded algorithms that use dag-consistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a work-stealing thread scheduler and the BACKER coherence algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), then the expected execution time of a 'fully strict' multithreaded computation on P processors, each with an LRU cache of C pages, is O(T_1(C)/P + mCT_∞), where T_1(C) is the total work of the computation including page faults, T_∞ is its critical-path length excluding page faults, and m is the minimum page transfer time. As a corollary to this theorem, we show that the expected number of page faults incurred by a computation executed on P processors, each with an LRU cache of C pages, is F_1(C) + O(CPT_∞), where F_1(C) is the number of serial page faults. Finally, we give simple bounds on the number of page faults and the space requirements for 'regular' divide-and-conquer algorithms. We use these bounds to analyze parallel multithreaded algorithms for matrix multiplication and LU-decomposition.

Keywords: parallel algorithms
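
The serial page-fault count that appears in the bounds (the number of faults of a one-processor execution with an LRU cache of C pages) can be measured on a page-access trace with a short simulation. The function name `lru_page_faults` and the list-of-page-ids trace format are assumptions for illustration; the sketch does not model BACKER or work stealing.

```python
from collections import OrderedDict


def lru_page_faults(trace, cache_pages):
    """Count page faults for a serial access trace under LRU replacement.

    Assumes cache_pages >= 1.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)        # hit: mark as most recently used
        else:
            faults += 1
            if len(cache) >= cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return faults


print(lru_page_faults([1, 2, 3, 1, 2, 3], cache_pages=2))  # 6
print(lru_page_faults([1, 2, 3, 1, 2, 3], cache_pages=3))  # 3
```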

Fully dynamic search trees for an extension of the BSP model

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Baumker, Armin; Dittrich, Wolfgang (Univ of Paderborn, Paderborn, Germany)

ISBN: (print) 9780897918091

We present parallel algorithms that maintain a 2-3 tree under insertions and deletions. The algorithms are designed for an extension of Valiant's BSP model, BSP*, that rewards blockwise communication and reduction of the overhead involved in communication. The BSP*-model was introduced by Baumker et al. in [2]. Our analysis of the data structure goes beyond standard asymptotic analysis: we use Valiant's notion of c-optimality. Intuitively, c-optimal algorithms tend to speedup p/c with growing input size (p denotes the number of processors), where the communication time is asymptotically smaller than the computation time. Our first approach allows 1-optimal searching and amortized c-optimal insertion and deletion for a small constant c. The second one allows 2-optimal searching, and c-optimal deletion and insertion for a small constant c. Both results hold with probability 1-o(1) for wide ranges of BSP*-parameters, where the ranges become larger with growing input sizes. The first approach allows much larger ranges. Further, both approaches are memory efficient: their total amount of memory used is proportional to the size m of the set being stored. Our results improve previous results by supporting a fully dynamic search tree rather than a static one, and by significantly reducing the communication time. Further, our algorithms use blockwise communication.

Keywords: parallel algorithms

Simple randomized mergesort on parallel disks

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Barve, Rakesh D.; Grove, Edward F.; Vitter, Jeffrey Scott (Duke Univ, Durham, NC, United States)

ISBN: (print) 9780897918091

We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the forecasting technique of [Knu73], our algorithm is able to read in, at any time, the `right' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the `right' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. Our analysis technique involves a novel reduction to various maximum occupancy problems. We prove that the expected I/O performance of SRM is efficient under varying sizes of memory and that it compares favorably in practice to disk-striped mergesort (DSM). Our studies indicate that SRM outperforms DSM even when the number D of parallel disks is fairly small.

Keywords: parallel processing systems
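
The cost model in the abstract (blocks of B records, at most one block per disk per I/O operation) gives a simple best case for one pass over the file when blocks are spread evenly over the D disks. The helper below computes that count. The name `striped_read_ios` and the perfectly even layout are assumptions for illustration; SRM's forecasting and flushing are not modelled.

```python
import math


def striped_read_ios(num_records, block_size, num_disks):
    """Parallel I/O operations needed to read a file spread evenly over all disks."""
    blocks = math.ceil(num_records / block_size)
    # Each operation moves at most one block per disk.
    return math.ceil(blocks / num_disks)


print(striped_read_ios(num_records=1000, block_size=10, num_disks=4))  # 25
```

A merge that cannot keep all D disks busy in every operation falls short of this count, which is the inefficiency the forecast-and-flush approach is meant to address.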

Anticipatory instruction scheduling

Proceedings of the 1996 8th annual ACM symposium on parallel algorithms and architectures

Authors: Sarkar, Vivek; Simons, Barbara (Application Development Technology Inst, San Jose, CA, United States)

ISBN: (print) 9780897918091

Modern processors have many levels of parallelism arising from multiple functional units and pipeline stages. In this paper, we consider the interplay between instruction scheduling performed by a compiler and instruction lookahead performed by hardware. Anticipatory instruction scheduling is the process of rearranging instructions within each basic block so as to minimize the overall completion time of a set of basic blocks in the presence of hardware instruction lookahead, while preserving safety by not moving any instructions beyond basic block boundaries. Anticipatory instruction scheduling delivers many of the benefits of global instruction scheduling by accounting for instruction overlap across basic block boundaries arising from hardware lookahead, without compromising safety (as in some speculative scheduling techniques) or serviceability of the compiled program. We present the first provably optimal algorithm for a special case of anticipatory instruction scheduling for a trace of basic blocks on a machine with arbitrary size lookahead windows. We extend this result for the version of the problem in which a trace of basic blocks is contained within a loop. In addition, we discuss how to modify these special-case optimal algorithms to obtain heuristics for the more general (but NP-hard) problems that occur in practice.

Keywords: parallel processing systems

Efficient low-contention parallel algorithms

Journal of Computer and System Sciences, 1996, vol. 53, no. 3, pp. 417-442

Authors: Gibbons, PB; Matias, Y; Ramachandran, V (Univ Texas, Dept Comp Sci, Austin, TX 78712, USA)

The queue-read, queue-write (QRQW) parallel random access machine (PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type noncombining networks. This paper describes fast, low-contention, work-optimal, randomized QRQW PRAM algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known EREW PRAM algorithms for these problems, while avoiding the high-contention steps typical of CRCW PRAM algorithms. An illustrative experiment demonstrates the performance advantage of a new QRQW random permutation algorithm when compared with the popular EREW algorithm. Finally, this paper presents new randomized algorithms for integer sorting and general sorting. (C) 1996 Academic Press, Inc.

Keywords: parallel algorithms
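
The queue-read, queue-write cost rule can be illustrated by computing the contention of a single step: the largest number of processors touching any one memory location. The function name `qrqw_step_contention` and the flat list of accessed addresses are assumptions for illustration; the sketch leaves out the local-computation part of a step's cost.

```python
from collections import Counter


def qrqw_step_contention(addresses):
    """Maximum number of accesses to any single location in one step."""
    return max(Counter(addresses).values(), default=0)


step = [0, 0, 1, 2, 0]             # three processors hit location 0
print(qrqw_step_contention(step))  # 3
```

Under this rule the step above is charged time proportional to 3. An EREW PRAM would forbid it outright, since contention must not exceed 1 there, while a CRCW PRAM would charge it unit time.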
