Search results - Inner Mongolia University Library

Data parallel programming with the Khoros data services library

10 IPPS/SPDP '98 Workshops Held in Conjunction with the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing

Authors: Kubica, S; Robey, T; Moorman, C (Khoral Res Inc, Albuquerque, NM 87110, USA)

ISBN: (print) 3540643591

The distributed data service library allows developers to control the distribution of data in a parallel program simply by setting attributes on a distributed data object. This interface provides the power of a data parallel programming paradigm by abstracting all the low-level communication required for effecting a data distribution. The simplicity of the interface should also facilitate the porting of serial routines to run on parallel architectures. The emphasis in the development of distributed data services has been to provide a framework which addresses the common problems encountered in writing even the simplest data parallel programs, while still being extensible to problems which are less frequently encountered, and often more complicated. Distributed data services allows developers writing such applications to use low-level MPI calls as necessary. In this way, distributed data services combines the convenience of a data parallel language such as HPF with the flexibility of the message passing library MPI. © Springer-Verlag Berlin Heidelberg 1998.

Keywords: parallel architectures

Breaking the barriers: Two models for MPI programming

International Conference on Parallel Architectures and Compilation Techniques

Authors: Roda, J; Rodriguez, C; Morales, DG; Almeida, F; Pulido, P; Dorta, D (Univ La Laguna, Dept Estadist IO & Computac, Tenerife, Spain)

ISBN: (print) 0818685913

The asynchronous nature of many MPI/PVM programs does not fit the BSP model. The barrier synchronization imposed by the model restricts the range of available algorithms and their performance. Through the suppression of barriers and the generalization of the concept of superstep we propose two new models, the BSP-like and the BSP Without Barriers (BSPWB) models. While the BSP-like model extends the BSP* model to programs written using collective operations, the more general BSPWB model admits the MPI/PVM parallel asynchronous programming style. Like LogP, the model encourages locality, but it is simpler to use. The parameters of the models and their quality are evaluated on a distributed-shared memory machine, the Origin 2000, and on a distributed memory machine, the CRAY T3E. The time spent in an h-relation depends more strongly on the communication pattern than on the number of processors. The total variation of the h-relation time across both the patterns and processor numbers is smaller than sixty nanoseconds. To illustrate the proposed models, two different applications are considered: a Parallel Sort using Regular Sampling (PSRS) and a parallel Dynamic Programming Algorithm solving the Single Resource Allocation Problem (SRAP). The PSRS is a synchronous algorithm with a rich set of collective communication patterns and coarse grain communications. At the opposite extreme, the SRAP is a fine grain communication algorithm using permutation patterns. The computational results prove the accuracy of the models. The prediction of the communication times is robust even for the SRAP, where communication is dominated by small messages.
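
For orientation, the cost of one superstep in the classic BSP model, which the abstract above sets out to relax, can be sketched as follows. This is the textbook BSP formula only, not the BSP-like or BSPWB models proposed in the paper. The function names `h_relation` and `bsp_superstep_cost`, the parameter names, and the sample numbers are illustrative assumptions.

```python
from typing import Sequence


def h_relation(sent: Sequence[int], received: Sequence[int]) -> int:
    """Size h of an h-relation: the largest number of messages that any
    single processor sends or receives during the superstep."""
    return max(max(s, r) for s, r in zip(sent, received))


def bsp_superstep_cost(
    work: Sequence[float],
    sent: Sequence[int],
    received: Sequence[int],
    g: float,
    latency: float,
) -> float:
    """Classic BSP estimate for one superstep: w + g*h + L.

    work     -- local computation time of each processor
    sent     -- messages sent by each processor
    received -- messages received by each processor
    g        -- per-message communication gap (assumed machine parameter)
    latency  -- barrier synchronization cost L (assumed machine parameter)
    """
    return max(work) + g * h_relation(sent, received) + latency


if __name__ == "__main__":
    # Two processors exchanging an unbalanced number of messages.
    print(bsp_superstep_cost([10.0, 12.0], [3, 1], [1, 3], g=2.0, latency=5.0))
```

The barrier term `latency` is charged in every superstep regardless of how little the processors actually need to synchronize, which is the overhead that barrier-free variants aim to avoid.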

Keywords: Dynamic programming

PC-based shared memory architecture and language

Journal of Supercomputing, 1998, vol. 12, no. 1-2, pp. 119-136

Authors: Houzet, D; Fatni, A (Univ Toulouse 3, IRIT, ENSEEIHT, INP, F-31071 Toulouse, France)

Image processing applications require both computing and communication power. The object of the GFLOPS project was to study all aspects of the design of computers that provide both. The project's aim was to develop a parallel architecture as well as its software environment to implement these applications efficiently. A development environment, especially a C data-parallel language, has been built for this purpose. The C// parallel language presented here simplifies the use of such architectures by providing the programmer with a global name space and a control mechanism to exploit the fine and medium grain parallelism of applications. The main advantage of our paradigm is that it allows a unique framework to express both data and control parallelism. We have implemented this programming environment on the GFLOPS machine, which supports up to 512 processor nodes. The nodes are PC motherboards connected over a scalable and cost-effective network, via the PCI bus, at a constant cost per node. The aim is to obtain a scalable virtually shared memory machine at low cost. In this paper we discuss the design of the GFLOPS machine and its C// parallel language, and evaluate the effectiveness of the mechanisms incorporated. The analysis of the architecture's behaviour was conducted with microbenchmarks and image processing algorithms, written in C.

Keywords: image processing; language; parallel architecture; evaluation

The matrix template library: A generic programming approach to high performance numerical linear algebra

2nd International Symposium on Computing in Object-Oriented Parallel Environments, ISCOPE 1998

Authors: Siek, Jeremy G.; Lumsdaine, Andrew (Computer Science Department, University of Illinois, Urbana, IL 61801, United States)

ISBN: (print) 3540653872

We present a unified approach for building high-performance numerical linear algebra routines for large classes of dense and sparse matrices. As with the Standard Template Library [1], we separate algorithms from data structures using generic programming techniques. Such an approach does not hinder high performance; rather, writing portable high-performance codes is enabled because the performance-critical code can be isolated from the algorithms and data structures. We address the performance portability problem for architecture-dependent algorithms such as matrix-matrix multiply. Recently, code generation systems, such as PHiPAC [2] and ATLAS [3], have allowed algorithms to be tuned to particular architectures. Our approach is to use template metaprograms [4] to directly express performance-critical, architecture-dependent sections of code. © Springer-Verlag Berlin Heidelberg 1998.

Keywords: Matrix algebra

Aggressive dynamic execution of multimedia kernel traces

1st Merged International Parallel Processing Symposium/Symposium on Parallel and Distributed Processing (IPPS/SPDP 1998)

Authors: Bishop, B; Owens, R; Irwin, MJ (Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802, USA)

ISBN: (print) 0818684038

There has been relatively little analytical work on processor optimizations for multimedia applications. With the introduction of MMX by Intel, it is clear that this is an area of increasing importance. Building on previous work [4, 5, 6, 7, 13, 14], we propose optimizations for multimedia architectures that support independent parallel execution of instructions within dynamically assembled traces, resulting in dramatic performance improvements. Specifically, we propose simplified instruction scheduling and register renaming algorithms due to constraints on trace formation. In addition, we suggest specific instruction pool and trace cache parameters. We constructed a simulator in order to measure the benefits of these processor optimizations for multimedia applications. The simulated machine, which could fetch/decode 2 instructions per cycle, performed better than a superscalar machine that could fetch/decode 8 instructions per cycle. Execution rates as high as 7.3 instructions per cycle were achieved for the benchmarks simulated, assuming 16 instructions per trace.

Keywords: parallel processing systems

Achieving portability and efficiency through automatic optimisation: An investigation in parallel image processing

4th International Conference on Parallel Processing, Euro-Par 1998

Authors: Crookes, D.; Morrow, P.J.; Brown, T.J.; McAleese, S.G.; Roantree, D.; Spence, I.T.A. (Department of Computer Science, Queen's University of Belfast, Belfast BT7 1NN, United Kingdom; Department of Computing Science, University of Ulster at Coleraine, Coleraine BT52 7EQ, United Kingdom)

ISBN: (print) 3540649522

This paper discusses the main achievements of the EPIC project, whose aim was to design a high level programming environment with an associated implementation for portable parallel image processing. The project was funded as part of the EPSRC Portable Software Tools for Parallel Architectures (PSTPA) programme. The paper summarises new portable programming abstractions for image processing, and outlines the automatically optimising implementation which achieves portability of application code and efficiency of implementation on a closely coupled distributed memory parallel system. The paper includes timings for optimised and unoptimised versions of typical image processing algorithms. Its main conclusion is that it is possible to achieve portability with efficiency, for a specific application, by adopting a high level algebraic programming model together with a transformation-based optimiser which reclaims the loss of efficiency that an algebraic approach traditionally entails.

Keywords: parallel architectures

An adaptive parallel computer vision system

International Journal of Pattern Recognition and Artificial Intelligence, 1998, vol. 12, no. 3, pp. 311-334

Authors: Kim, JM; Kim, Y; Kim, SD; Han, TD; Yang, SB (Yonsei Univ, Dept Comp Sci, Seoul 120749, South Korea)

An approach for designing a hybrid parallel system that can perform different levels of parallelism adaptively is presented. An adaptive parallel computer vision system (APVIS) is proposed to attain this goal. The APVIS is constructed by tightly integrating two different types of parallel architectures, i.e., a multiprocessor based system (MBS) and a memory based processor array (MPA), into a single machine. One important feature of the APVIS is that the programming interface for executing data parallel code on the MPA is the same as the usual subroutine calling mechanism. Thus the existence of the MPA is transparent to programmers. The goal of this research is to design an underlying base architecture that can be optimally executed for a broad range of vision tasks. A performance model is provided to show the effectiveness of the APVIS. It turns out that the proposed APVIS can provide significant performance improvement and cost effectiveness for highly parallel applications having a mixed set of parallelisms. In addition, an example application composed of a series of vision algorithms, from low-level and medium-level processing steps, is mapped onto the MPA. Consequently, the APVIS with a few or tens of MPA modules can perform the chosen example application in real time when multiple images arrive successively with an inter-arrival time of a few seconds.

Keywords: computer vision; parallel processing; SIMD; multiprocessor; performance model

Which comes first: The architecture or the algorithm?

International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems

Author: Gropp, W (Argonne Natl Lab, Div Math & Comp Sci, Argonne, IL 60439, USA)

ISBN: (print) 0818684240

There is a constant tension between the designers of algorithms and architectures. Each designs for the other's previous generation. Several factors are making this an increasingly inadequate approach. On the hardware side, the growing disparity between the performance of memory and CPU has pushed many architectures to hierarchical memories that reward significant data reuse (and punish algorithms that use data items a small number of times). On the algorithmic side, at least for a large class of algorithms in scientific computing, the emphasis on increasingly efficient algorithms, measured by the amount of work (floating point operations) per solution value, has led to algorithms that touch data only a few times. Further, algorithmic techniques that lead to adaptive methods (computing only with as much data as is required to accurately represent the solution) often lead to irregular or unpredictable accesses to memory. While these trends in architectures and algorithms are away from each other, there are other trends in algorithms that provide a greater degree of memory locality without sacrificing algorithmic optimality. These are hierarchical methods, such as multigrid and general domain decomposition. These hierarchical methods place some requirements on architectures, primarily that there be no distinguished collection of processes/threads in a computation. Another opportunity in algorithms is the potential for exploiting split-phase or two-step operations. Most algorithms are designed under the assumption that parallel operations, such as a scan or reduction, are single step (blocking in MPI terms), but this is often not required. We conclude by noting that both algorithmic and architecture research can benefit from a more continuous dialog.

Keywords: Computer systems programming

Parallel implementation of Schönhage’s integer GCD algorithm

3rd International Symposium on Algorithmic Number Theory, ANTS 1998

Author: Cesari, Giovanni (Università degli Studi di Trieste, DEEI, Trieste I-34100, Italy)

ISBN: (print) 3540646574

We present a parallel implementation of Schönhage’s integer GCD algorithm on distributed memory architectures. Results are generalized for the extended GCD algorithm. Experiments on sequential architectures show that Schönhage’s algorithm outperforms other GCD algorithms implemented in two well known multiple-precision packages for input sizes larger than about 50000 bytes. In the extended case this threshold drops to 10000 bytes. In these input ranges a parallel implementation provides additional speed-up. Parallelization is achieved by distributing matrix operations and by using parallel implementations of the multiple-precision integer multiplication algorithms. We use parallel Karatsuba and parallel 3-primes FFT multiplication algorithms implemented in CALYPSO, a computer algebra library for parallel symbolic computation that we have developed. Schönhage’s parallel algorithm is analyzed by using a message-passing model of computation. Experimental results on distributed memory architectures, such as the Intel Paragon, confirm the analysis. © Springer-Verlag Berlin Heidelberg 1998.
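
As background for the multiplication step mentioned above, here is a minimal sequential sketch of Karatsuba multiplication. It is not the CALYPSO implementation and it is not parallel. The function name `karatsuba` and the `cutoff_bits` threshold are illustrative assumptions. The three recursive products are independent of each other, which is the property a parallel version could exploit.

```python
def karatsuba(x: int, y: int, cutoff_bits: int = 64) -> int:
    """Multiply two non-negative integers with Karatsuba's method,
    which replaces four half-size products by three."""
    if x < 0 or y < 0:
        raise ValueError("operands must be non-negative")

    # Small operands: fall back to the built-in multiplication.
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y

    # Split both operands at the same bit position.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    # Three independent sub-products.
    low = karatsuba(x_lo, y_lo, cutoff_bits)
    high = karatsuba(x_hi, y_hi, cutoff_bits)
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi, cutoff_bits) - low - high

    # Recombine: x*y = high * 2^(2*half) + cross * 2^half + low.
    return (high << (2 * half)) + (cross << half) + low


if __name__ == "__main__":
    a, b = 3 ** 500, 7 ** 400
    print(karatsuba(a, b) == a * b)
```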

Keywords: Message passing

Detecting data races in Cilk programs that use locks

Proceedings of the 1998 10th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA

Authors: Cheng, G.-I.; Feng, M.; Leiserson, Ch.E.; Randall, K.H.; Stark, A.F. (MIT Lab for Computer Science, Cambridge, MA, United States)

ISBN: (print) 9780897919890

When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple 'determinacy' races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k ≪ n locks simultaneously, ALL-SETS runs in O(n^k T α(V, V)) time and O(n^k V) space, where α is Tarjan's functional inverse of Ackermann's function. Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the 'umbrella' locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in O(kT α(V, V)) time using O(kV) space. We also prove that any 'abelian' Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is data-race free. Thus, the Nondeterminator-2's two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.
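
The definition of a data race in the first sentence of the abstract can be checked directly by brute force, as in the sketch below. This is neither ALL-SETS nor BRELLY: it compares every pair of accesses and therefore runs in quadratic time. The `Access` record, the `find_data_races` function, and the `parallel` predicate (which is assumed to be supplied by the caller and to say whether two threads may run in parallel) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Access:
    thread: str            # identifier of the thread performing the access
    location: str          # shared memory location being accessed
    is_write: bool         # True if the access modifies the location
    locks: FrozenSet[str]  # locks held by the thread during the access


def find_data_races(
    accesses: Sequence[Access],
    parallel: Callable[[str, str], bool],
) -> List[Tuple[Access, Access]]:
    """Return every pair of accesses that satisfies the definition:
    same location, at least one write, logically parallel threads,
    and no lock held in common."""
    races = []
    for a, b in combinations(accesses, 2):
        if (
            a.location == b.location
            and (a.is_write or b.is_write)
            and parallel(a.thread, b.thread)
            and not (a.locks & b.locks)
        ):
            races.append((a, b))
    return races


if __name__ == "__main__":
    # Assumption for this demo: any two distinct threads run in parallel.
    def distinct(s: str, t: str) -> bool:
        return s != t

    w = Access("t1", "x", True, frozenset({"A"}))
    r_locked = Access("t2", "x", False, frozenset({"A"}))
    r_unlocked = Access("t2", "x", False, frozenset())
    print(find_data_races([w, r_locked], distinct))    # protected by lock A
    print(find_data_races([w, r_unlocked], distinct))  # no common lock
```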

Keywords: parallel algorithms
