Out-of-core rendering techniques are necessary for viewing large disk-resident volume data sets produced by many scientific applications or high-resolution imaging systems. Traditional visualizers can provide real-time performance but require all of the data to be viewed to reside in RAM. We describe a multithreaded implementation of an out-of-core isosurface renderer that does not impose such restrictions and yet provides performance that scales well with the size of the data. Our renderer uses an interval tree data structure on disk with a layout that reduces disk seeks, so that only the relevant data are read from disk. The resulting disk latencies are hidden by using prefetching and multithreading to overlap the rendering computations and disk accesses. Our renderer outperforms the out-of-core isosurface renderer of the well-known VTK toolkit by about one order of magnitude, and by several orders of magnitude when compared against the VTK toolkit's optimized in-core algorithm on large representative CT scan data. The multithreaded version also scales well with the number of threads.
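As an illustration (not taken from the paper), the interval-tree "stabbing" query that underlies active-cell extraction for isosurfacing can be sketched in-memory as follows: each cell contributes the interval [min value, max value] of its scalar field, and a query for an isovalue returns exactly the cells whose interval contains it. All names here are illustrative; the paper's contribution is the on-disk layout of such a tree, which this sketch does not model.

```python
class IntervalTree:
    """Centered interval tree over (lo, hi, payload) triples."""

    def __init__(self, intervals):
        if not intervals:
            self.center = None
            return
        # Split around the median left endpoint.
        points = sorted(lo for lo, hi, _ in intervals)
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        mid = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        # Intervals crossing the center, sorted for early termination.
        self.by_lo = sorted(mid, key=lambda iv: iv[0])            # ascending lo
        self.by_hi = sorted(mid, key=lambda iv: -iv[1])           # descending hi
        self.left = IntervalTree(left) if left else None
        self.right = IntervalTree(right) if right else None

    def stab(self, q):
        """Return payloads of all intervals containing q
        (the 'active cells' for isovalue q)."""
        if self.center is None:
            return []
        out = []
        if q < self.center:
            for lo, hi, p in self.by_lo:
                if lo > q:
                    break                 # remaining intervals start after q
                out.append(p)
            if self.left:
                out += self.left.stab(q)
        elif q > self.center:
            for lo, hi, p in self.by_hi:
                if hi < q:
                    break                 # remaining intervals end before q
                out.append(p)
            if self.right:
                out += self.right.stab(q)
        else:
            out = [p for _, _, p in self.by_lo]
        return out
```

A query touches only the crossing intervals it reports plus one root-to-leaf path, which is what makes the structure attractive for reading only relevant cells.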
We explore the creation of a metacomputer by the aggregation of independent sites. Joining a metacomputer is voluntary, and hence it has to be an endeavor that mutually benefits all parties involved. We identify proportional-share allocation as a key component of such a mutual benefit. Proportional-share allocation is the basis for enforcing the agreement reached among the sites on how to use the metacomputer's resources. We introduce a resource manager that provides proportional-share allocation over a cluster of workstations, assuming master-slave applications. This manager is novel because it performs non-preemptive proportional scheduling of multiple processors. A prototype has been implemented, and we report preliminary results. Finally, we discuss how tickets (first-class entities that encapsulate allocation endowments) can be used in practice to enforce the metacomputer agreement, and also how they can ease the site selection performed by the application.
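For readers unfamiliar with ticket-based proportional sharing, a minimal single-processor sketch (not the paper's multiprocessor, non-preemptive scheduler) is classic stride scheduling: each client advances by a stride inversely proportional to its ticket count, and the client with the smallest "pass" value runs next. Names and constants are illustrative.

```python
import heapq

def stride_schedule(tickets, quanta):
    """Simulate stride scheduling: clients receive CPU quanta in
    proportion to their ticket allocations.

    tickets -- dict mapping client name to its ticket count
    quanta  -- number of scheduling decisions to simulate
    """
    STRIDE1 = 10_000  # global constant; stride = STRIDE1 / tickets
    heap = [(STRIDE1 // t, STRIDE1 // t, name) for name, t in tickets.items()]
    heapq.heapify(heap)  # entries: (pass_value, stride, name)
    counts = {name: 0 for name in tickets}
    for _ in range(quanta):
        pass_val, stride, name = heapq.heappop(heap)  # smallest pass runs
        counts[name] += 1
        heapq.heappush(heap, (pass_val + stride, stride, name))
    return counts
```

A client holding 3 of 4 tickets receives almost exactly 75% of the quanta; the paper's problem is harder because whole master-slave computations must be placed without preemption across many processors.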
Recently the GEMINI Holographic Particle Image Velocimetry (HPIV) system developed in the Laser Flow Diagnostics (LFD) lab at Kansas State University has been successfully applied to volumetric 3D flow velocity measurement. Due to the 3D nature of this application, very large computation and communication requirements are imposed. An innovative algorithm, the Concise Cross Correlation (CCC), is employed in the system to extract the velocity field from the hologram of the test flows. With CCC we achieved a compression ratio of 10^4 and a processing speed 1000 times faster than with traditional 3D FFT-based correlation. To further accelerate processing for fully time- and space-resolved measurement, parallel processing is necessary. We present our design for a distributed system supporting this previously unparallelized application, and comment on our experiences implementing a master-slave distributed version of CCC using MPI. Brief experimental results on Gigabit Ethernet and multiprocessor Pentium Xeon systems are given.
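The displacement-by-correlation idea at the heart of any PIV interrogation (though not the authors' compressed CCC variant, whose details the abstract does not give) can be sketched in one dimension: slide one intensity record against the other and take the shift with maximal correlation as the displacement estimate.

```python
def best_shift(a, b):
    """Estimate the displacement between two 1-D intensity records by
    brute-force cross correlation: return the shift s maximizing
    C(s) = sum_i a[i] * b[i + s].  A toy stand-in for PIV interrogation."""
    n = len(a)
    best_c, best_s = float('-inf'), 0
    for s in range(-(n - 1), n):
        # Only index pairs where both a[i] and b[i+s] exist contribute.
        c = sum(a[i] * b[i + s] for i in range(max(0, -s), min(n, n - s)))
        if c > best_c:
            best_c, best_s = c, s
    return best_s
```

In 3D the same search runs over volumes, which is why FFT-based correlation (and the authors' far cheaper CCC) matters, and why the per-interrogation independence makes master-slave parallelization natural.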
This paper presents a parallel adaptive version of the block-based Gauss-Jordan algorithm used in numerical analysis to invert matrices. This version includes a characterization of the workload of processors and a mechanism for its adaptive folding/unfolding. The application is implemented and evaluated with MARS in dedicated and non-dedicated environments. The results show that an absolute efficiency of 92% is possible on a cluster of DEC/ALPHA processors interconnected by a Gigaswitch network, and an absolute efficiency of 67% can be obtained on an Ethernet network of SUN-Sparc4 workstations. Moreover, the adaptability of the algorithm is evaluated on a non-dedicated meta-system comprising both pools of machines.
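For reference, the scalar Gauss-Jordan inversion that the block-based parallel variant applies at the level of sub-matrix blocks can be sketched as follows (a sequential illustration only; the paper's contribution is the adaptive parallel organization, not this kernel):

```python
def gauss_jordan_inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with
    partial pivoting.  A is a list of row lists; returns the inverse."""
    n = len(A)
    # Augment A with the identity: M = [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]          # normalize pivot row
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]                  # right half is A^-1
```

In the block formulation each of these row operations becomes a block multiply/update, which is what exposes the parallelism distributed across processors.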
The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves excellent load balance at the cost of minimal extra overhead.
We consider a variety of dynamic, hardware-based methods for exploiting load/store parallelism, including mechanisms that use memory dependence speculation. While previous work has also investigated such methods [19,4], this has been done primarily for split, distributed window processor models. We focus on centralized, continuous-window processor models (the common configuration today). We confirm that exploiting load/store parallelism can greatly improve performance. Moreover, we show that much of this performance potential can be captured if the addresses of the memory locations accessed by both loads and stores can be used to schedule loads. However, using addresses to schedule load execution may not always be an option due to complexity, latency, and cost considerations. For this reason, we also consider configurations that use just memory dependence speculation to guide load execution. We consider a variety of methods and show that speculation/synchronization can be used to effectively exploit virtually all load/store parallelism. We demonstrate that this technique is competitive with, or better than, one that uses addresses for scheduling loads. We conclude by discussing why our findings differ, in part, from those reported for split, distributed window processor models.
Both inherently sequential code and limitations of analysis techniques prevent full parallelization of many applications by parallelizing compilers. Amdahl's Law tells us that as parallelization becomes increasingly effective, any unparallelized loop becomes an increasingly dominant performance bottleneck. We present a technique for speeding up the execution of unparallelized loops by cascading their sequential execution across multiple processors: only a single processor executes the loop body at any one time, and each processor executes only a portion of the loop body before passing control to another. Cascaded execution allows otherwise idle processors to optimize their memory state for the eventual execution of their next portion of the loop, resulting in significantly reduced overall loop body execution times. We evaluate cascaded execution using loop nests from wave5, a Spec95fp benchmark application, and a synthetic benchmark. Running on a PC with 4 Pentium Pro processors and an SGI Power Onyx with 8 R10000 processors, we observe overall speedups of 1.35 and 1.7, respectively, for the wave5 loops we examined, and speedups as high as 4.5 for individual loops. Our extrapolated results using the synthetic benchmark show a potential for speedups as large as 16 on future machines.
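The control-passing structure of cascaded execution can be sketched with a token passed between threads: each worker owns a fixed portion of the loop body, runs it when it holds the token, and hands the token on, so sequential semantics are preserved while idle workers are free to warm their caches. This is a minimal behavioral sketch with illustrative names; the real payoff (prefetching into processor-local memory) is not modeled here.

```python
import threading

def cascaded_execution(parts, n_iters, state):
    """Run n_iters iterations of a loop whose body is split into
    len(parts) portions, one portion per worker thread.  One Event per
    worker acts as a token: exactly one portion executes at a time, in
    the original sequential order."""
    events = [threading.Event() for _ in parts]
    events[0].set()  # worker 0 opens the first iteration

    def worker(idx):
        for it in range(n_iters):
            events[idx].wait()                       # acquire the token
            events[idx].clear()
            parts[idx](it, state)                    # run this body portion
            events[(idx + 1) % len(parts)].set()     # pass control on

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(parts))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

Because the token enforces a total order, the observable effect equals the sequential loop; the speedup in the paper comes entirely from what each processor does with its idle time.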
In this paper we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water; NAS 3D-FFT, SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.
We propose an efficient reconfigurable parallel prefix counting network based on the recently proposed technique of shift switching with domino logic, in which the charge/discharge signals propagate along the switch chain producing semaphores, resulting in a network that is fast and highly hardware-compact. The proposed architecture for prefix counting N-1 bits features a total delay of (4 log N + √N - 2)·Td, where Td is the delay for charging or discharging a row of two prefix sum units of eight shift switches. Simulation results reveal that Td does not exceed 1 ns under 0.8-micron CMOS technology. Our design is faster than any design known to us for N ≤ 2^10. Yet another important and novel feature of the proposed architecture is that it requires very simple controls, partially driven by semaphores, significantly reducing hardware complexity and fully utilizing the inherent speed of the process.
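To make the prefix-counting problem concrete (independently of the shift-switching hardware, which the abstract does not detail), here is the logarithmic-depth parallel-scan recurrence in software: after step d, each position holds the sum of the 2^d inputs ending at it, so log N steps yield all prefix population counts.

```python
def prefix_count(bits):
    """Compute inclusive prefix population counts of a bit vector using
    the logarithmic-depth scan recurrence (each pass doubles the span
    already summed at every position)."""
    n = len(bits)
    x = list(bits)
    d = 1
    while d < n:
        # Conceptually all positions update in parallel in one step.
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(n)]
        d *= 2
    return x  # x[i] == number of 1s among bits[0..i]
```

A hardware prefix-counting network evaluates the same dependence structure, with the log N factor in the paper's delay formula reflecting exactly this doubling of spans per stage.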