In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory. Unlike earlier methods ...
详细信息
In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared-memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning across the multiple time steps in stencil computations so that circular queues can be implemented by both shared memory and registers effectively in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups up to 2.93X over methods that use circular queues implemented with shared-memory only.
In this paper, we present some generalisations of previous multicollision finding methods and apply these against a new type of tree-based hash functions. We also show that the very general class of hash functions fir...
详细信息
In large projects parallelization of existing programs or refactoring of source code is time consuming as well as error-prone and would benefit from tool support. However, existing automatic transformation systems are...
详细信息
In large projects parallelization of existing programs or refactoring of source code is time consuming as well as error-prone and would benefit from tool support. However, existing automatic transformation systems are not extensively used because they either require tedious definitions of source code transformations or they lack general adaptability. In our approach, a programmer changes code inside a project, resulting in before and after source code versions. The difference (the generated transformation) is stored in a database. When presented with some arbitrary code, our tool mines the database to determine which of the generalized transformations possibly apply. Our system is different from a pure compiler based (semantics preserving) approach as we only suggest code modifications. Our contribution is a set of generalizing annotations that we have found by analyzing recurring patterns in open source projects. We show the usability of our system and the annotations by finding matches and applying generated transformations in real-world applications.
As semi-conductor technologies move down to the nanometer scale, leakage power has become a significant component of the total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm t...
详细信息
As semi-conductor technologies move down to the nanometer scale, leakage power has become a significant component of the total power consumption. In this paper, we present a leakage-aware modulo scheduling algorithm to achieve leakage energy saving for applications with loops on Very Long Instruction Word (VLIW) architectures. The proposed algorithm is designed to maximize the idleness of function units integrated with the dual-threshold domino logic, and reduce the number of transitions between the active and sleep modes. We have implemented our technique in the Trimaran compiler and conducted experiments using a set of embedded benchmarks from DSPstone and Mibench on the cycle-accurate VLIW simulator of Trimaran. The results show that our technique achieves significant leakage energy saving compared with a previously published DAG-based (Directed Acyclic Graph) leakage-aware scheduling algorithm.
The 15th ACM SIGPLAN International Conference on Functional programming (ICFP) took place on September 27–29, 2010 in Baltimore, Maryland. After the conference, the programme committee, chaired by Stephanie Weirich, ...
The 15th ACM SIGPLAN International Conference on Functional programming (ICFP) took place on September 27–29, 2010 in Baltimore, Maryland. After the conference, the programme committee, chaired by Stephanie Weirich, selected several outstanding papers and invited their authors to submit to this special issue of Journal of Functional programming. Umut A. Acar and James Cheney acted as editors for these submissions. This issue includes the seven accepted papers, each of which provides substantial new material beyond the original conference version. The selected papers reflect a consensus by the program committee that ICFP 2010 had a number of strong papers that link core functional programming ideas with other areas, such as multicore, embedded systems, and data compression.
Structural (manual or automated) testing today often overlooks typical programming faults because of inherent flaws in the simple criteria applied (e.g. branch or all-uses). Dedicated testing strategies that address s...
详细信息
While using a single GPU is fairly easy, using multiple CPUs and GPUs potentially distributed over multiple machines is hard because data needs to be kept consistent using message exchange and the load needs to be bal...
详细信息
Access control is a vital security mechanism in today's operating systems and the security policies dictating the security relevant behaviors is lengthy and complex for example in Security-Enhanced Linux (SELinux)...
详细信息
The efficient use of future MPSoCs with f 000 or more processor cores requires new means of resource-aware programming to deal with increasing imperfections such as process variation, fault rates, aging effects, and p...
详细信息
Happens-before detectors are precise but can be too conservative to detect certain data races in repeated test runs as they are sensitive to thread interleaving. By making the opposite tradeoffs, lockset detectors can...
详细信息
ISBN:
(纸本)9781612843568
Happens-before detectors are precise but can be too conservative to detect certain data races in repeated test runs as they are sensitive to thread interleaving. By making the opposite tradeoffs, lockset detectors can detect more races but are not precise (by reporting false positives). For both types of detectors, happens-before detectors run more slowly as they use expensive vector clocks. Existing hybrid race detectors (combining lockset and happens-before) alleviate some of the limitations in both analysis techniques at the cost of additional analysis overhead. Recently, due to FastTrack, epoch-based happens-before and lockset detectors now exhibit comparable performance. It is the time to rethink how to design a hybrid race detector to balance precision and coverage, by leveraging the lightweightness of epoch clocks. Acculock is the first such a solution. Acculock analyzes a program by reasoning about the subset of the happens-before relation observed with lock acquires and releases excluded, thereby reducing its sensitivity to thread interleaving. When such a weaker happens-before relation is violated, Acculock applies a new efficient lockset algorithm to enforce a lock-based synchronization discipline by distinguishing the locks protecting reads and writes. The key motivation behind is to ensure that Acculock can improve happens-before detectors by discovering also data races in alternate thread interleavings when analyzing one program execution while limiting false warnings thus incurred in a controlled manner. In addition, Acculock achieves these objectives by maintaining comparable performance as FastTrack, the fastest happens-before detector. All these properties of Acculock are validated and confirmed by comparing it against six other detectors, all implemented in Jikes RVM using 11 benchmark programs.
暂无评论