Out-of-core rendering techniques are necessary for viewing large disk-resident volume data sets produced by many scientific applications or high-resolution imaging systems. Traditional visualizers can provide real-time performance but require all of the data to be viewed to reside in RAM. We describe a multithreaded implementation of an out-of-core isosurface renderer that does not impose such restrictions and yet provides performance that scales well with the size of the data. Our renderer uses an interval tree data structure on disk with a layout that reduces disk seeks, so that only the relevant data are read from disk. The resulting disk latencies are hidden by using prefetching and multithreading to overlap the rendering computations and disk accesses. Our renderer outperforms the out-of-core isosurface renderer of the well-known VTK toolkit by about one order of magnitude, and by several orders of magnitude when compared against the VTK toolkit's optimized in-core algorithm on large representative CT scan data. The multithreaded version also scales well with the number of threads.
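As an illustration (not taken from the paper), the interval-tree "stabbing" query that underlies active-cell extraction for isosurfacing can be sketched in-memory as follows: each cell contributes the interval [min value, max value] of its scalar field, and a query for an isovalue returns exactly the cells whose interval contains it. All names here are illustrative; the paper's contribution is the on-disk layout of such a tree, which this sketch does not model.

```python
class IntervalTree:
    """Centered interval tree over (lo, hi, payload) triples."""

    def __init__(self, intervals):
        if not intervals:
            self.center = None
            return
        # Split around the median left endpoint.
        points = sorted(lo for lo, hi, _ in intervals)
        self.center = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        mid = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        # Intervals crossing the center, sorted for early termination.
        self.by_lo = sorted(mid, key=lambda iv: iv[0])            # ascending lo
        self.by_hi = sorted(mid, key=lambda iv: -iv[1])           # descending hi
        self.left = IntervalTree(left) if left else None
        self.right = IntervalTree(right) if right else None

    def stab(self, q):
        """Return payloads of all intervals containing q
        (the 'active cells' for isovalue q)."""
        if self.center is None:
            return []
        out = []
        if q < self.center:
            for lo, hi, p in self.by_lo:
                if lo > q:
                    break                 # remaining intervals start after q
                out.append(p)
            if self.left:
                out += self.left.stab(q)
        elif q > self.center:
            for lo, hi, p in self.by_hi:
                if hi < q:
                    break                 # remaining intervals end before q
                out.append(p)
            if self.right:
                out += self.right.stab(q)
        else:
            out = [p for _, _, p in self.by_lo]
        return out
```

A query touches only the crossing intervals it reports plus one root-to-leaf path, which is what makes the structure attractive for reading only relevant cells.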
We explore the creation of a metacomputer by the aggregation of independent sites. Joining a metacomputer is voluntary, and hence it has to be an endeavor that mutually benefits all parties involved. We identify proportional-share allocation as a key component of such a mutual benefit. Proportional-share allocation is the basis for enforcing the agreement reached among the sites on how to use the metacomputer's resources. We introduce a resource manager that provides proportional-share allocation over a cluster of workstations, assuming master-slave applications. This manager is novel because it performs non-preemptive proportional scheduling of multiple processors. A prototype has been implemented, and we report preliminary results. Finally, we discuss how tickets (first-class entities that encapsulate allocation endowments) can be used in practice to enforce the metacomputer agreement, and also how they can ease the site selection performed by the application.
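For readers unfamiliar with ticket-based proportional sharing, a minimal single-processor sketch (not the paper's multiprocessor, non-preemptive scheduler) is classic stride scheduling: each client advances by a stride inversely proportional to its ticket count, and the client with the smallest "pass" value runs next. Names and constants are illustrative.

```python
import heapq

def stride_schedule(tickets, quanta):
    """Simulate stride scheduling: clients receive CPU quanta in
    proportion to their ticket allocations.

    tickets -- dict mapping client name to its ticket count
    quanta  -- number of scheduling decisions to simulate
    """
    STRIDE1 = 10_000  # global constant; stride = STRIDE1 / tickets
    heap = [(STRIDE1 // t, STRIDE1 // t, name) for name, t in tickets.items()]
    heapq.heapify(heap)  # entries: (pass_value, stride, name)
    counts = {name: 0 for name in tickets}
    for _ in range(quanta):
        pass_val, stride, name = heapq.heappop(heap)  # smallest pass runs
        counts[name] += 1
        heapq.heappush(heap, (pass_val + stride, stride, name))
    return counts
```

A client holding 3 of 4 tickets receives almost exactly 75% of the quanta; the paper's problem is harder because whole master-slave computations must be placed without preemption across many processors.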
Recently the GEMINI Holographic Particle Image Velocimetry (HPIV) system developed in the Laser Flow Diagnostics (LFD) lab at Kansas State University has been successfully applied to volumetric 3D flow velocity measurement. Due to the 3D nature of this application, very large computation and communication requirements are imposed. An innovative algorithm, the Concise Cross Correlation (CCC), is employed in the system to extract the velocity field from the hologram of the test flows. With CCC we achieved a compression ratio of 10^4 and a processing speed 1000 times faster than with traditional 3D FFT-based correlation. To further accelerate processing for fully time- and space-resolved measurement, parallel processing is necessary. We present our design for a distributed system supporting this previously unparallelized application, and comment on our experiences implementing a master-slave distributed version of CCC using MPI. Brief experimental results on Gigabit Ethernet and multiprocessor Pentium Xeon systems are given.
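The displacement-by-correlation idea at the heart of any PIV interrogation (though not the authors' compressed CCC variant, whose details the abstract does not give) can be sketched in one dimension: slide one intensity record against the other and take the shift with maximal correlation as the displacement estimate.

```python
def best_shift(a, b):
    """Estimate the displacement between two 1-D intensity records by
    brute-force cross correlation: return the shift s maximizing
    C(s) = sum_i a[i] * b[i + s].  A toy stand-in for PIV interrogation."""
    n = len(a)
    best_c, best_s = float('-inf'), 0
    for s in range(-(n - 1), n):
        # Only index pairs where both a[i] and b[i+s] exist contribute.
        c = sum(a[i] * b[i + s] for i in range(max(0, -s), min(n, n - s)))
        if c > best_c:
            best_c, best_s = c, s
    return best_s
```

In 3D the same search runs over volumes, which is why FFT-based correlation (and the authors' far cheaper CCC) matters, and why the per-interrogation independence makes master-slave parallelization natural.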
This paper presents a parallel adaptive version of the block-based Gauss-Jordan algorithm used in numerical analysis to invert matrices. This version includes a characterization of the workload of processors and a mechanism for its adaptive folding/unfolding. The application is implemented and evaluated with MARS in dedicated and non-dedicated environments. The results show that an absolute efficiency of 92% is possible on a cluster of DEC/ALPHA processors interconnected by a Gigaswitch network, and an absolute efficiency of 67% can be obtained on an Ethernet network of SUN-Sparc4 workstations. Moreover, the adaptability of the algorithm is evaluated on a non-dedicated meta-system comprising both pools of machines.
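For reference, the scalar Gauss-Jordan inversion that the block-based parallel variant applies at the level of sub-matrix blocks can be sketched as follows (a sequential illustration only; the paper's contribution is the adaptive parallel organization, not this kernel):

```python
def gauss_jordan_inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with
    partial pivoting.  A is a list of row lists; returns the inverse."""
    n = len(A)
    # Augment A with the identity: M = [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]          # normalize pivot row
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]                  # right half is A^-1
```

In the block formulation each of these row operations becomes a block multiply/update, which is what exposes the parallelism distributed across processors.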
The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves excellent load balance at the cost of minimal extra overhead.
We consider a variety of dynamic, hardware-based methods for exploiting load/store parallelism, including mechanisms that use memory dependence speculation. While previous work has also investigated such methods [19,4], this has been done primarily for split, distributed window processor models. We focus on centralized, continuous-window processor models (the common configuration today). We confirm that exploiting load/store parallelism can greatly improve performance. Moreover, we show that much of this performance potential can be captured if the addresses of the memory locations accessed by both loads and stores can be used to schedule loads. However, using addresses to schedule load execution may not always be an option due to complexity, latency, and cost considerations. For this reason, we also consider configurations that use just memory dependence speculation to guide load execution. We consider a variety of methods and show that speculation/synchronization can be used to effectively exploit virtually all load/store parallelism. We demonstrate that this technique is competitive with, or better than, one that uses addresses for scheduling loads. We conclude by discussing why our findings differ, in part, from those reported for split, distributed window processor models.
Both inherently sequential code and limitations of analysis techniques prevent full parallelization of many applications by parallelizing compilers. Amdahl's Law tells us that as parallelization becomes increasingly effective, any unparallelized loop becomes an increasingly dominant performance bottleneck. We present a technique for speeding up the execution of unparallelized loops by cascading their sequential execution across multiple processors: only a single processor executes the loop body at any one time, and each processor executes only a portion of the loop body before passing control to another. Cascaded execution allows otherwise idle processors to optimize their memory state for the eventual execution of their next portion of the loop, resulting in significantly reduced overall loop body execution times. We evaluate cascaded execution using loop nests from wave5, a Spec95fp benchmark application, and a synthetic benchmark. Running on a PC with 4 Pentium Pro processors and an SGI Power Onyx with 8 R10000 processors, we observe overall speedups of 1.35 and 1.7, respectively, for the wave5 loops we examined, and speedups as high as 4.5 for individual loops. Our extrapolated results using the synthetic benchmark show a potential for speedups as large as 16 on future machines.
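The control-passing structure of cascaded execution can be sketched with a token passed between threads: each worker owns a fixed portion of the loop body, runs it when it holds the token, and hands the token on, so sequential semantics are preserved while idle workers are free to warm their caches. This is a minimal behavioral sketch with illustrative names; the real payoff (prefetching into processor-local memory) is not modeled here.

```python
import threading

def cascaded_execution(parts, n_iters, state):
    """Run n_iters iterations of a loop whose body is split into
    len(parts) portions, one portion per worker thread.  One Event per
    worker acts as a token: exactly one portion executes at a time, in
    the original sequential order."""
    events = [threading.Event() for _ in parts]
    events[0].set()  # worker 0 opens the first iteration

    def worker(idx):
        for it in range(n_iters):
            events[idx].wait()                       # acquire the token
            events[idx].clear()
            parts[idx](it, state)                    # run this body portion
            events[(idx + 1) % len(parts)].set()     # pass control on

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(parts))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

Because the token enforces a total order, the observable effect equals the sequential loop; the speedup in the paper comes entirely from what each processor does with its idle time.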
In this paper we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water; NAS 3D-FFT, SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.
We propose an efficient reconfigurable parallel prefix counting network based on the recently proposed technique of shift switching with domino logic, in which the charge/discharge signals propagate along the switch chain producing semaphores, resulting in a network that is fast and highly hardware-compact. The proposed architecture for prefix counting N-1 bits features a total delay of (4 log N + √N - 2)·Td, where Td is the delay for charging or discharging a row of two prefix sum units of eight shift switches. Simulation results reveal that Td does not exceed 1 ns under 0.8-micron CMOS technology. Our design is faster than any design known to us for N ≤ 2^10. Yet another important and novel feature of the proposed architecture is that it requires very simple controls, partially driven by semaphores, significantly reducing hardware complexity and fully utilizing the inherent speed of the process.
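To make the prefix-counting problem concrete (independently of the shift-switching hardware, which the abstract does not detail), here is the logarithmic-depth parallel-scan recurrence in software: after step d, each position holds the sum of the 2^d inputs ending at it, so log N steps yield all prefix population counts.

```python
def prefix_count(bits):
    """Compute inclusive prefix population counts of a bit vector using
    the logarithmic-depth scan recurrence (each pass doubles the span
    already summed at every position)."""
    n = len(bits)
    x = list(bits)
    d = 1
    while d < n:
        # Conceptually all positions update in parallel in one step.
        x = [x[i] + (x[i - d] if i >= d else 0) for i in range(n)]
        d *= 2
    return x  # x[i] == number of 1s among bits[0..i]
```

A hardware prefix-counting network evaluates the same dependence structure, with the log N factor in the paper's delay formula reflecting exactly this doubling of spans per stage.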