Ray tracing can produce high-quality images. However, its use has been limited by its high demands on computational power and memory bandwidth, especially in the case of satellite imagery. In this ...
Field-programmable gate arrays (FPGAs) are widely used in reliability-critical systems because of their reconfigurability. However, with shrinking device feature sizes and increasing die area, today's FPGAs can ...
The wide application of general-purpose graphics processing units (GPGPUs) demands substantial manual effort to port and optimize algorithms for them. However, most existing automatic approaches to generating GPGPU code fail to apply optimization strategies tailored to a specific computation or to reuse constantly evolving manual optimizations. In this paper, we present a computation-pattern-driven approach to computation-specific GPGPU code generation and optimization, which in turn reuses manual optimizations to a certain extent. We propose language extensions to OpenMP, in the form of high-level data structure attributes, to assist computation pattern matching and to give users intuitive performance tuning parameters expressed as data structure attributes. We illustrate the feasibility of this approach with three important computation dwarfs in scientific computing: dense matrix, sparse matrix, and structured mesh computation. We also build a prototype OpenMP-to-CUDA translator that consists of computation pattern recognition and code template instantiation. The experimental results demonstrate the performance benefits of the computation-pattern-driven method. To the best of our knowledge, this is the first work on reusing manual optimizations for GPGPUs with a computation-pattern-driven approach.
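The two translator stages named in the abstract above, pattern recognition followed by template instantiation, can be pictured with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's design: the `layout` attribute, the function names `recognize_pattern` and `generate_kernel`, and the two CUDA kernel templates (a dense and a CSR sparse matrix-vector product) are hypothetical, and the structured mesh pattern is left out for brevity.

```python
from string import Template

# Hypothetical hand-optimized CUDA kernel templates, one per computation pattern.
# The kernels are plain matrix-vector products chosen only for illustration.
KERNEL_TEMPLATES = {
    "dense_matrix": Template(
        "__global__ void $name(const float *A, const float *x, float *y, int n) {\n"
        "  int row = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (row < n) {\n"
        "    float acc = 0.0f;\n"
        "    for (int j = 0; j < n; ++j) acc += A[row * n + j] * x[j];\n"
        "    y[row] = acc;\n"
        "  }\n"
        "}\n"
    ),
    "sparse_matrix": Template(
        "__global__ void $name(const int *row_ptr, const int *col_idx,\n"
        "                      const float *val, const float *x, float *y, int n) {\n"
        "  int row = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (row < n) {\n"
        "    float acc = 0.0f;\n"
        "    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)\n"
        "      acc += val[k] * x[col_idx[k]];\n"
        "    y[row] = acc;\n"
        "  }\n"
        "}\n"
    ),
}


def recognize_pattern(attrs):
    """Map (hypothetical) data structure attributes to a computation pattern name."""
    layout = attrs.get("layout")
    if layout == "dense":
        return "dense_matrix"
    if layout == "csr":
        return "sparse_matrix"
    raise ValueError(f"no known computation pattern for attributes {attrs!r}")


def generate_kernel(name, attrs):
    """Instantiate the code template of the recognized pattern with a kernel name."""
    return KERNEL_TEMPLATES[recognize_pattern(attrs)].substitute(name=name)


if __name__ == "__main__":
    print(generate_kernel("spmv_csr", {"layout": "csr"}))
```

Keeping the optimized kernels in a table keyed by pattern is one way a manual optimization could be reused: improving a template would benefit every loop that matches the pattern.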
With the prevalence of multi-core processors, embedded clusters increasingly deploy SMP nodes to gain more computing power. A crucial issue is that MPI interprocess communication has been suffering the contr...
With the further development and wide acceptance of cloud computing, many companies and colleges have decided to deploy it in their own data centers, an arrangement known as a private cloud. Since private clouds have som...
In contrast with public clouds, private clouds have some unique features, especially with respect to workflow scheduling. The tradeoff between power and performance remains one of the key concerns. Building on our previous research, in this paper we propose a hybrid energy-efficient scheduling algorithm using dynamic migration. The experiments show that it not only reduces response time and conserves more energy, but also achieves a higher level of load balancing.
Machine translation (MT), with its broad potential use, has gained increased attention from both researchers and software vendors. To generate high-quality translations, however, MT decoders can be highly computationally intensive. With significant raw computing power, multi-core microprocessors have the potential to speed up MT software on desktop machines. However, retrofitting existing MT decoders is nontrivial. Race conditions and atomicity issues are among the complications that make parallelization difficult. In this article, we show that, when parallelizing a state-of-the-art MT decoder, it is much easier to overcome such difficulties with a process-based parallelization method, called functional task parallelism, than with conventional thread-based methods. We achieve a 7.60 times speedup on an 8-core desktop machine while making significantly fewer changes to the original sequential code than a multithreaded approach would require.
Thread-Level Speculation (TLS) is a technique that aims to boost the performance of sequential programs running on Chip Multiprocessors (CMPs) by automatically parallelizing them. It relieves programmers of the heavy task of parallel programming, but its performance may suffer from frequent squashing caused by inter-thread data dependency violations. In this paper, we propose a Network-on-Chip (NoC) for CMPs that employs a priority-aware packet arbitration policy. Packet scheduling guided by this policy reduces the occurrence of TLS squashes. Simulation results with five applications show that our policy reduces squashes by 22% in the best case and 15% on average. Moreover, our priority-aware approach could be generalized to similar scenarios in which different threads running on a CMP have different priorities.
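As a rough illustration of priority-aware arbitration, the sketch below models a single router output port that always grants the highest-priority waiting packet. The abstract does not say how priorities are assigned, so the rule used here (a lower number means a less speculative thread, which is served first, with ties broken by arrival order) is an assumption, and the names `Packet`, `PriorityArbiter`, and `drain` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Packet:
    # Assumption: lower value = less speculative thread = higher priority.
    priority: int
    # Arrival sequence number; breaks ties in FIFO order.
    seq: int
    # Payload is excluded from comparisons.
    payload: str = field(compare=False)


class PriorityArbiter:
    """Toy arbiter for one output port: grants the best-priority packet first."""

    def __init__(self):
        self._queue = []
        self._arrivals = count()

    def enqueue(self, priority, payload):
        heapq.heappush(self._queue, Packet(priority, next(self._arrivals), payload))

    def grant(self):
        """Return the payload of the next packet to forward, or None if idle."""
        if not self._queue:
            return None
        return heapq.heappop(self._queue).payload


def drain(arbiter):
    """Grant packets until the arbiter is empty and return the service order."""
    order = []
    while (payload := arbiter.grant()) is not None:
        order.append(payload)
    return order


if __name__ == "__main__":
    arb = PriorityArbiter()
    arb.enqueue(2, "t2-load")
    arb.enqueue(0, "t0-store")
    arb.enqueue(1, "t1-load")
    arb.enqueue(0, "t0-load")
    print(drain(arb))
```

With a plain FIFO arbiter the most speculative packet (`t2-load`) would be forwarded first because it arrived first; here the packets of thread 0 overtake it.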