In this paper, we consider techniques for designing and analyzing distributed security transactions. We present a layered approach, with a high-level security transaction layer running on top of a lower-level secure t...
We investigate the performance of the time warp kernel APSIS when running on various communication layers, in particular on a wide-area grid. Several cancellation strategies are tried, among them the lazy cancellation...
ISBN: 0769520324 (print)
Distributed systems such as clusters of PCs are low-cost alternatives for running parallel rendering systems, but they have high communication overhead and limited memory capacity on each processing node. In this paper, we focus on the strategy for distributing the parallel rendering work among the PCs. A good distribution strategy provides better load balance and avoids the need to replicate data in the relatively small memory of each PC. Our goal is to study different distribution strategies within the scope of the Parallel ZSweep algorithm, introducing another work distribution strategy into PZSweep: work stealing. This strategy allows decentralized control of the work to be done and provides dynamic load redistribution. We propose two different algorithms to select the processor from which work is stolen and show that the simplest one, nearest neighbor, is the most efficient. We also show that the load redistribution schemes depend strongly on the initial load distribution: with an interleaved assignment, our system can outperform the original Parallel ZSweep algorithm. We conclude that, for running large datasets on a cluster of PCs, Parallel ZSweep requires a dynamic load distribution strategy.
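As an illustration only, the nearest-neighbor idea behind work stealing can be sketched as follows. The abstract does not describe PZSweep's data structures or network topology, so the ring layout, the per-node task queues, and the function names `nearest_neighbor_victim` and `steal` are all assumptions made for this sketch.

```python
from collections import deque


def nearest_neighbor_victim(thief, queues):
    """Return the index of the closest non-empty queue, or None.

    Assumption: nodes are arranged in a ring, so distance is measured
    by index offset in either direction.
    """
    n = len(queues)
    for distance in range(1, n):
        for candidate in ((thief + distance) % n, (thief - distance) % n):
            if queues[candidate]:
                return candidate
    return None


def steal(thief, queues):
    """Move one task from the victim's tail to the thief's queue.

    Returns True if a task was stolen, False if every other queue is empty.
    """
    victim = nearest_neighbor_victim(thief, queues)
    if victim is None:
        return False
    queues[thief].append(queues[victim].pop())
    return True


# Hypothetical initial load: node 0 is idle, nodes 1 and 3 hold work.
queues = [deque(), deque(["t1", "t2", "t3"]), deque(), deque(["t4"])]
steal(0, queues)
```

In this sketch an idle node decides locally whom to ask, which reflects the decentralized control the abstract mentions; a real system would also have to account for communication cost and task granularity.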
We present an implementation of a very fast parallel complex FFT on M2, the second generation of the MorphoSys reconfigurable computation platform, which targets streamed applications such as multimedia and DSP. The proposed mapping comprises fast presorting, cascaded radix-2 stages, and postreordering. Data and twiddle factors are 16-bit real and 16-bit imaginary in 2's complement format, and scaling is performed to avoid overflow. The mapping is tested on our cycle-accurate simulator, "mulate", and the performance is encouragingly better than that of other architectures such as Imagine and VIRAM. Moreover, the performance scales with FFT size. Since there is no functionality specifically tailored to FFT, the results demonstrate the capability of the MorphoSys architecture to extract parallelism from streamed applications. Further rationales are given based on the concepts of scalar operand networks and memory hierarchy.
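For readers unfamiliar with the structure named above, a textbook iterative radix-2 FFT with a bit-reversal presort is sketched below. It is not the M2 mapping: it uses floating-point complex numbers rather than 16-bit fixed point, performs no overflow scaling, and omits the postreordering step, whose details the abstract does not give. The function names are chosen for this sketch.

```python
import cmath


def bit_reverse_permute(values):
    """Presort: place element i at the index given by reversing i's bits."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, v in enumerate(values):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = v
    return out


def fft_radix2(values):
    """Forward DFT of a power-of-two-length sequence via cascaded radix-2 stages."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute([complex(v) for v in values])
    span = 2
    while span <= n:  # one pass per radix-2 stage
        half = span // 2
        step = cmath.exp(-2j * cmath.pi / span)  # twiddle factor increment
        for start in range(0, n, span):
            w = 1 + 0j
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b          # butterfly, upper output
                data[start + k + half] = a - b   # butterfly, lower output
                w *= step
        span *= 2
    return data
```

The butterflies inside each stage are independent of one another, which is the kind of parallelism a platform such as the one described could exploit.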
ISBN: 0769520324 (print)
A long-standing challenge in the field of volume rendering is to obtain high-quality images in near real-time. This is particularly important for scientific datasets, where highly precise results are required to ensure accurate data interpretation. This paper presents RZSweep, a new hardware-assisted technique for volume rendering of regular datasets based on the plane-sweep paradigm. Although some research has been done in the past for irregular datasets, this paper presents the first attempt to explore the capabilities of the plane-sweep paradigm for regular datasets. RZSweep is an exact, object-order, direct volume rendering (DVR), back-to-front projection algorithm in which a plane sweeps through the volume in depth order, projecting all the data elements within the user-specified threshold onto the image plane. The algorithm uses a hardware rendering pipeline to composite the final image. The space complexity of the algorithm is bounded by the number of vertices in the largest diagonal plane of the dataset. A prototype of the algorithm renders a dataset of size 256^3 in near real-time, 0.77 seconds, on a single off-the-shelf commodity PC. Every data element within the specified threshold contributes to the final image in its correct spatial order, which is guaranteed by the use of a heap.
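The role of the heap in enforcing back-to-front order can be illustrated with a minimal sketch. It is not RZSweep itself: the `(depth, value)` representation, the convention that larger depth means farther from the viewer, and the reading of "within the threshold" as `value >= threshold` are all assumptions made here.

```python
import heapq


def back_to_front(elements, threshold):
    """Yield (depth, value) pairs that pass the threshold, farthest first.

    Assumptions: larger depth means farther from the viewer, and an
    element passes when value >= threshold.
    """
    # heapq is a min-heap, so depths are negated to pop the farthest first.
    heap = [(-depth, value) for depth, value in elements if value >= threshold]
    heapq.heapify(heap)
    while heap:
        neg_depth, value = heapq.heappop(heap)
        yield -neg_depth, value


# Hypothetical samples: (depth, scalar value)
samples = [(1.0, 5), (3.0, 7), (2.0, 1)]
ordered = list(back_to_front(samples, threshold=4))
```

Popping from the heap hands elements to the compositing stage in depth order regardless of the order in which they were inserted.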
ISBN: 0769519148 (print)
Advances in sensor technology, electronic devices, digital processing hardware, and computation systems have vastly increased the volume of data available for collection at scientific research sites. The Starfire Optical Range of the Air Force Research Laboratory Directed Energy Directorate is responsible for the research and development of advanced techniques for atmospheric compensation for large telescopes. Recently completed experiments produced more than 100 TB of collected data. The collection, processing, archival, and analysis of these data required an entirely new approach from the facility staff. This paper presents the architecture, hardware and software components, integration difficulties, and performance results in a "lessons learned" format focused on specific implementation choices, interoperability lessons and issues, and results and capabilities.
A high-radix composite algorithm for the computation of the powering function (X^Y) is presented. The algorithm consists of a sequence of overlapped operations: (i) digit-recurrence logarithm, (ii) left-to-right carry-free (LRCF) multiplications, and (iii) online exponential. A redundant number system is used, and the selection in (i) and (iii) is done by rounding, except in the first iteration, when selection by table look-up is necessary to guarantee the convergence of the recurrences. A sequential implementation of the algorithm is proposed, and the execution times and hardware requirements are estimated for single- and double-precision floating-point computations, for radix r=128, showing that powering can be computed with performance similar to that of high-radix CORDIC algorithms.
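The three-step composition rests on the identity X^Y = exp(Y · ln X) for X > 0. The sketch below mirrors only that decomposition, using ordinary floating-point library calls; it does not model the digit-recurrence, redundant number system, or overlapped operation described in the abstract, and the function name `power_via_log_exp` is chosen for illustration.

```python
import math


def power_via_log_exp(x, y):
    """Compute x**y for x > 0 as exp(y * ln(x)), in three explicit steps."""
    if x <= 0:
        raise ValueError("x must be positive for the logarithm to be defined")
    log_x = math.log(x)       # step (i): logarithm
    product = y * log_x       # step (ii): multiplication
    return math.exp(product)  # step (iii): exponential


result = power_via_log_exp(2.0, 10.0)  # close to 1024, up to rounding error
```

Because rounding error in the logarithm is amplified by the multiplication and exponential, hardware implementations generally carry extra internal precision through the intermediate steps.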
The X4CP32 is a parallel/reconfigurable microprocessor with two programming levels. Although it is a general-purpose microprocessor, it has the reliable performance of a reconfigurable architecture. We present its architecture and programming levels, and discuss the powerful interaction between parallel programming and reconfiguration. We show two performance-optimized implementations of matrix multiplication using both the parallel and the reconfigurable paradigms, and a parallel implementation of miner intelligent agents.
Clusters built from commodity components, a low-cost alternative for distributed parallel processing, have brought high-performance computing to a wide range of users. However, the existing widespread tools for distributed parallel programming, such as message-passing libraries, do not address the new software engineering requirements that have emerged as applications have grown more complex. Haskell# is a parallel programming language intended to reconcile higher abstraction and modularity with scalable performance. We demonstrate the use of Haskell# in programming three SPMD benchmark programs for which lower-level MPI implementations are available.
One of the main challenges to the wide use of the Internet is the scalability of servers, that is, their ability to handle increasing demand. Scalability in stateful servers, which include e-commerce and other transaction-oriented servers, is even more difficult, since transaction data must be kept across requests from the same user. One common strategy for achieving scalability is to employ clustered servers, where the load is distributed among the various servers. However, as a consequence of the workload characteristics and the need to keep data coherent among the servers that compose the cluster, load imbalance arises among the servers, reducing the efficiency of the cluster as a whole. We propose and evaluate a strategy for load balancing in stateful clustered servers. Our strategy is based on control theory and yields significant gains over configurations that do not employ it, reducing the response time by up to 50% and increasing the throughput by up to 16%.