ISBN: (Print) 9780769537474
In this paper, we discuss procedures for making a Viterbi decoder faster. Our implementation on an Intel CPU with the SSE4 parallel-processing instruction set, combined with several other methods, achieves a decoding speed of 47.05 Mbps (up from 0.64 Mbps originally). The DVB-T mode used in Taiwan needs 13.27 Mbps for real-time reception, so our software Viterbi decoder takes only 28% of the CPU load.
Overlapping Reconfiguration is currently the most efficient method to reconfigure an interconnection network, but is only valid for systems that apply distributed routing. This paper proposes a solution which enables...
ISBN: (Print) 9780769537474
The Graphics Processing Unit (GPU), with many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. But this computing capacity cannot be fully utilized for memory-intensive applications, which are limited by off-chip memory bandwidth and latency. Stencil computation has abundant parallelism and low computational intensity, which makes it a useful architectural evaluation benchmark. In this paper, we propose memory optimizations for mgrid, a stencil-based application from the SPEC 2K benchmarks. By exploiting data locality across the three-level memory hierarchy and tuning the thread granularity, we reduce the pressure on off-chip memory bandwidth. To hide the long off-chip memory access latency, we further prefetch data during computation through double buffering. To fully exploit the CPU-GPU heterogeneous system, we redistribute the computation between these two computing resources. Through all these optimizations, we gain a 24.2x speedup over the simple mapping version, and as much as a 34.3x speedup over a CPU implementation.
ISBN: (Print) 9781424437504
Tools are becoming increasingly important for efficiently utilizing the computing power available in contemporary large-scale systems. The drastic increase in the size and complexity of systems requires tools to be scalable while producing meaningful and easily digestible information that helps the user pinpoint problems at scale. The goal of this tutorial is to introduce state-of-the-art performance tools from three different organizations to a diverse audience. Together these tools provide a broad spectrum of capabilities necessary to analyze the performance of scientific and engineering applications on a variety of large- and small-scale systems.
ISBN: (Print) 9780769536804
With the ever-increasing demand for high-quality 3D image processing in markets such as cinema and gaming, the capabilities of graphics processing units (GPUs) have advanced tremendously. Although GPU-based cluster computing, which uses GPUs as the processing units, is one of the most promising high-performance parallel computing platforms, there is currently no programming environment, interface, or library designed to use these multiple computing resources to compute tasks in parallel. This paper proposes CaravelaMPI, a new message-passing interface targeted at GPU cluster computing, providing a unified and transparent interface to manage both communication and GPU execution. Experimental results show that the transparent interface of CaravelaMPI makes it possible to program GPU-based clusters efficiently, not only decreasing the required programming effort but also increasing the performance of GPU-based cluster computing platforms.
ISBN: (Print) 9780769537474
Caches play a major role in the performance of high-speed computer systems. Trace-driven simulation is the most widely used method to evaluate cache architectures. However, as cache designs move to more complicated architectures and trace sizes grow larger and larger, traditional simulation methods are no longer practical due to their long simulation cycles. Several techniques have been proposed to reduce the simulation time of sequential trace-driven simulation. This paper considers the use of a general-purpose GPU to accelerate cache simulation, exploiting set-partitioning as the main source of parallelism. We develop more efficient parallel simulation techniques by introducing more knowledge into the Compute Unified Device Architecture (CUDA) program on the GPU. Our experimental results show that the new algorithm gains a 2.76x performance improvement over the traditional CPU-based sequential algorithm.
ISBN: (Print) 9780769537474
Large data sets are replicated at more than one site for better availability to the nodes in a grid. Downloading a data set from these replicated locations has practical difficulties due to network traffic, congestion, frequently changing server performance, and so on. To speed up the download, complex server selection techniques based on network and server loads are used. However, consistent performance is not guaranteed due to the shared nature of network links and the load on them, which can vary unpredictably. In this paper, we present a bandwidth-sensitive co-allocation scheme for parallel downloading under grid economics. The proposed technique aims to serve grid applications efficiently and economically in data grids. Taking the cost factor into consideration, we present a novel mechanism for server selection, dynamic file decomposition, and co-allocation. With costs taken into account, our server selection mechanism, combined with various techniques, is able to significantly reduce economic costs. We compared our scheme with existing schemes, and the preliminary results show a notable improvement in the overall completion time of data transfer.
ISBN: (Print) 9780769537474
In in-network storage wireless sensor networks, sensed data are stored locally for the long term and retrieved on demand instead of in real time. To maximize data survival, the sensed data are normally stored distributively at multiple nearby nodes. This raises the problem of how to check and guarantee the integrity of distributed data storage under resource constraints. In this paper, a technique called Two Granularity Linear Code (TGLC), which consists of Intra-codes and Inter-codes, is presented. An efficient and lightweight data integrity check scheme based on TGLC is proposed. Data integrity can be checked by anyone who holds the short Inter-codes, and the checking credential is a short, dynamically generated Intra-code. The proposed scheme is efficient and lightweight with respect to storage and communication overhead, and yet checking validity is maintained. Our conclusion is justified by extensive analysis.
Web servers often need to manage encrypted transfers of *** encryption activity is computationally intensive, and exposes a significant degree of parallelism. At the same time, cheap multicore processors are readily a...
ISBN: (Print) 9780769537474
For applications like 3D seismic migration, it is quite important to improve the I/O performance of a cluster computing system. Such seismic data processing applications are I/O-intensive: for example, a large 3D data volume cannot be held entirely in memory, so the input data files have to be divided into many fine-grained chunks. Intermediate results are written out at various stages during execution, and final results are written out by the master process. This paper describes a novel method for optimizing the parallel I/O data access strategy and load balancing for this particular program model. The optimization, based on an application-defined API, reduces the number of I/O operations and the amount of communication compared to the original model. This is done by forming groups of threads with "group roots" that read input data (determined by an index retrieved from the master process) and then send it to their group members; in the original model, each process or thread reads the whole input data and outputs its own results. Moreover, loads are balanced through on-line dynamic scheduling of access requests to the migration data. In the actual performance test, the performance improvement is often more than 60% compared with the original model.