GPUs have become a ubiquitous choice as coprocessors because of their excellent concurrent-processing capability. In GPU architectures, shared memory plays an important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications with regular access patterns, optimizing for shared memory is not easy: it often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage may even underutilize GPU resources. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), shared memory remains hard to exploit, since these models lack inherent support for describing shared memory optimizations and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data-management hints to the compiler. Meanwhile, we devise a compiler framework that automatically selects optimal parameters for shared arrays using the polyhedral model. We further propose optimization techniques to expose higher memory- and instruction-level parallelism. Experimental results show that our shared-memory-centric approach effectively improves the performance of five typical GPU applications across four widely used platforms by 3.7x on average, without burdening programmers with many pragmas.
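The abstract does not give the compiler's actual selection algorithm; as a hypothetical illustration of the kind of parameter choice involved, the sketch below picks the largest square tile for a shared array that fits a per-block shared-memory budget. The function name, the 48 KB budget, and the power-of-two search are all illustrative assumptions, not details from the paper:

```python
def pick_tile_size(bytes_per_elem, smem_budget=48 * 1024, max_tile=64):
    """Largest power-of-two square tile (tile x tile elements) whose
    footprint fits a per-block shared-memory budget (assumed 48 KB)."""
    tile = max_tile
    while tile > 1 and tile * tile * bytes_per_elem > smem_budget:
        tile //= 2
    return tile

# 4-byte floats: 64*64*4 = 16 KB fits the 48 KB budget
print(pick_tile_size(4))   # -> 64
# 64-byte elements: only a 16x16 tile (16 KB) fits
print(pick_tile_size(64))  # -> 16
```

A real compiler framework would additionally weigh occupancy and reuse distance, which is what makes the selection nontrivial in the first place.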
Mobile edge computing has shown its potential in serving emerging latency-sensitive mobile applications in ultra-dense 5G networks by offloading computation workloads from the remote cloud data center to the nearby network edge. However, current computation offloading studies in the heterogeneous edge environment face multifaceted challenges: dependencies among computational tasks, resource competition among multiple users, and diverse long-term objectives. Mobile applications typically consist of several functionalities, and one large category of applications can be viewed as a series of sequential tasks. In this study, we first proposed a novel multiuser computation offloading framework for long-term sequential tasks. Then, we presented a comprehensive analysis of the task offloading process in the framework and formally defined the multiuser sequential task offloading problem. Next, we decoupled the long-term offloading problem into multiple single-time-slot offloading problems and proposed a novel adaptive method to solve them. We further showed the substantial performance advantage of our proposed method on the basis of extensive experiments.
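The adaptive method itself is not specified in the abstract; as a hypothetical sketch of the decoupling idea, the toy below makes an independent local-versus-edge decision in each time slot by comparing estimated latencies. The latency model (transmit time plus execution time) and all parameter names are illustrative assumptions:

```python
def offload_decision(cycles, data_bits, f_local, f_edge, rate):
    """Per-slot choice: run locally or offload, minimizing latency."""
    t_local = cycles / f_local                    # local execution time
    t_edge = data_bits / rate + cycles / f_edge   # upload + edge execution
    return ("edge", t_edge) if t_edge < t_local else ("local", t_local)

def schedule(slots, **params):
    """Decouple a long-term task sequence into independent per-slot decisions."""
    return [offload_decision(c, b, **params)[0] for c, b in slots]

# Two sequential tasks: (required CPU cycles, input data in bits).
# The first is compute-heavy (offload pays off); the second is data-heavy.
slots = [(4e9, 1e6), (1e8, 1e9)]
print(schedule(slots, f_local=1e9, f_edge=10e9, rate=50e6))  # ['edge', 'local']
```

A full solution would also model task dependencies and contention among users for the edge server, which is exactly what makes the multiuser problem hard.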
Improving the performance of sparse matrix-vector multiplication (SpMV) is important but difficult because of its irregular memory access. General-purpose GPUs (GPGPUs) provide high computing ability and substantial bandwidth that cannot be fully exploited by SpMV due to this irregularity. In this paper, we propose two novel methods to optimize memory bandwidth for SpMV on GPGPUs. First, a new storage format is proposed to exploit the memory bandwidth of the GPU architecture more efficiently; it ensures that the format holds as many non-zeros as possible, which suits the memory bandwidth of the GPU. Second, we propose a cache blocking method to improve the performance of SpMV on the GPU architecture. The sparse matrix is partitioned into sub-blocks that are stored in CSR format. With the blocking method, the corresponding part of the vector x can be reused in the GPU cache, so the time spent accessing global memory for x is greatly reduced. Experiments are carried out on three GPU platforms: GeForce 9800 GX2, GeForce GTX 480, and Tesla K40. Experimental results show that both methods efficiently improve the utilization of GPU memory bandwidth and the performance of the GPU.
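For reference, here is a minimal CPU-side sketch of CSR SpMV and of the column-blocking idea (process one column range at a time so that only a block-sized slice of x needs to stay cache-resident). This illustrates the general technique, not the paper's GPU kernels or its new storage format:

```python
import numpy as np

def spmv_csr(vals, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR form."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def spmv_csr_colblocked(vals, col_idx, row_ptr, x, block=2):
    """Same product, computed one column block at a time; each pass
    touches only x[c0:c1], which is what enables cache reuse of x."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for c0 in range(0, len(x), block):
        c1 = c0 + block
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if c0 <= col_idx[k] < c1:
                    y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(vals, col_idx, row_ptr, x))  # [3. 3. 9.]
```

In a practical blocked implementation each column block would be stored as its own CSR sub-matrix rather than filtered on the fly, as the abstract describes.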
We investigate the dependence of the switching process on the perpendicular magnetic anisotropy (PMA) constant in perpendicular spin transfer torque magnetic tunnel junctions (P-MTJs) using micromagnetic simulations. It is found that the final stable states of the magnetization distribution of the free layer after switching can be divided into three different states based on different PMA constants: vortex, uniform, and steady. Different magnetic states can be attributed to a trade-off among demagnetization, exchange, and PMA energies. The generation of the vortex state is also related to the non-uniform stray field from the polarizer, and the final stable magnetization is sensitive to the PMA constant. The vortex and uniform states have different switching processes, and the switching time of the vortex state is longer than that of the uniform state due to hindrance by the vortex.
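The trade-off mentioned above follows the standard micromagnetic energy balance; for orientation (standard physics, not values specific to this work), the competing terms and the uniaxial PMA contribution can be written as

```latex
E_{\mathrm{tot}} = E_{\mathrm{demag}} + E_{\mathrm{exch}} + E_{\mathrm{PMA}},
\qquad
E_{\mathrm{PMA}} = \int_V K_u \left[\,1 - (\mathbf{m}\cdot\hat{\mathbf{z}})^2\right] dV,
```

where K_u is the PMA constant and m the unit magnetization. A larger K_u penalizes in-plane components and favors a uniform out-of-plane state, while demagnetization energy favors flux-closure configurations such as the vortex, consistent with the state transitions reported above.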
As the gap between processing capability and bandwidth requirement of microprocessor increases, optical interconnects are used more and more widely in chip-to-chip data links. Trade-offs are made among latency, area, ...
Increasingly there is a need to process graphs that are larger than the available memory on today's machines. Systems have been developed with graph representations that are efficient and compact for out-of-core processing. A necessary task in these systems is memory management. This paper presents a system called Cacheap which automatically and efficiently manages the available memory to maximize the speed of graph processing, minimize the amount of disk access, and maximize the utilization of memory for graph data. Cacheap has a simple interface that can be easily adopted by existing graph engines. The paper describes the new system, uses it in recent graph engines, and demonstrates its integer-factor improvements in the speed of large-scale graph processing.
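Cacheap's actual interface and policy are not given in the abstract; as a hypothetical sketch of what an automatic memory manager for out-of-core graph data might look like, the toy below caches graph partitions under a byte budget with LRU eviction. All names and the policy are invented for illustration:

```python
from collections import OrderedDict

class PartitionCache:
    """Toy LRU cache for graph partitions under a fixed memory budget."""
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.cache = OrderedDict()          # partition id -> (data, size)
        self.disk_reads = 0

    def get(self, pid, load_from_disk, size):
        if pid in self.cache:               # hit: refresh recency
            self.cache.move_to_end(pid)
            return self.cache[pid][0]
        self.disk_reads += 1                # miss: load, evicting LRU entries
        data = load_from_disk(pid)
        while self.used + size > self.budget and self.cache:
            _, (_, old_size) = self.cache.popitem(last=False)
            self.used -= old_size
        self.cache[pid] = (data, size)
        self.used += size
        return data

cache = PartitionCache(budget_bytes=2)
load = lambda pid: f"edges-of-{pid}"
for pid in [0, 1, 0, 2, 0]:                # partition 1 gets evicted by 2
    cache.get(pid, load, size=1)
print(cache.disk_reads)  # 3: partitions 0, 1, 2 each loaded from disk once
```

A system like the one described would go further, e.g. sizing the budget automatically against other memory consumers, but a narrow get-style interface is what lets such a manager drop into existing graph engines easily.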
The wide application of General Purpose Graphic Processing Units (GPGPUs) results in large manual efforts on porting and optimizing algorithms on them. However, most existing automatic ways of generating GPGPU code fa...
Recently, Wang et al. presented a new construction of attribute-based signature with a policy-and-endorsement mechanism. The existential unforgeability of their scheme was claimed to be based on the strong Diffie-Hellman assumption in the random oracle model. Unfortunately, by carefully revisiting the design and security proof of Wang et al.'s scheme, we show that their scheme cannot provide unforgeability; namely, a forger whose attributes do not satisfy a given signing predicate can also generate valid signatures. We also point out the flaws in Wang et al.'s proof.
Much research has been done on the dependability evaluation of computer systems. However, much of it has gone no further than studying the fault coverage of such systems, with little focus on the relationship between fault coverage and overall system dependability. In this paper, a Markovian dependability model for triple-modular-redundancy (TMR) systems is presented. Fully considering the effects of fault coverage, working time, and the constant failure rate of a single module on the dependability of the target TMR system, the model is built on a stepwise degradation strategy. Through the model, the relationship between fault coverage and system dependability is determined. Moreover, the dependability of the system can be dynamically and precisely predicted at any given time once the fault coverage is set. This benefits dependability evaluation and improvement, and is helpful for system design and maintenance.
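The paper's exact Markov model is not reproduced here; as a hedged sketch of the idea, a minimal three-state chain (all three modules up; degraded after a covered fault; failed) with per-module failure rate lam and coverage c has the closed-form reliability below, which reduces to the classic TMR formula 3e^(-2*lam*t) - 2e^(-3*lam*t) when c = 1. The state definitions and transition rates are our illustrative assumptions:

```python
import math

def tmr_reliability(lam, c, t):
    """Reliability of a toy 3-state Markov chain:
    S3 -> S2 at rate 3*lam*c (covered fault, stepwise degradation),
    S3 -> F at rate 3*lam*(1-c), S2 -> F at rate 2*lam.
    Solving the chain gives
    R(t) = P(S3) + P(S2)
         = e^{-3 lam t} + 3c (e^{-2 lam t} - e^{-3 lam t})."""
    e2, e3 = math.exp(-2 * lam * t), math.exp(-3 * lam * t)
    return e3 + 3 * c * (e2 - e3)

lam, t = 1e-4, 1000.0
classic = 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)
# Perfect coverage recovers the classic TMR reliability
print(abs(tmr_reliability(lam, 1.0, t) - classic) < 1e-12)  # True
# Imperfect coverage strictly lowers reliability
print(tmr_reliability(lam, 0.9, t) < classic)  # True
```

This makes the abstract's point concrete: once c is fixed, R(t) can be evaluated at any working time t, and the coverage term directly quantifies how much dependability is lost to uncovered faults.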
Since the invention of the first industrial robot in 1959, the missions of robots have evolved from basic mechanical transfer or assistance to a diverse range of tasks through close interactions with environment, their human counterparts and robot peers. Through adaptation to uncertain and dynamic environments, legged robots can achieve coordinated locomotion in rough terrain, even