Authors: Cohen, Albert; Wu, Chenggang
INRIA, École Normale Supérieure, Département d'Informatique, 45 rue d'Ulm, 75005 Paris, France
Chinese Academy of Sciences, Institute of Computing Technology, State Key Laboratory of Computer Architecture, No. 6 Kexueyuan South Road, Haidian District, 100190 Beijing, China
Algorithm-specific, that is, semantic-specific optimizations have been observed to bring significant performance gains, especially for a diverse set of multi/many-core architectures. However, current programming model...
In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the use of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning multiple time steps in stencil computations, so that circular queues can be implemented effectively with both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93x over methods that use circular queues implemented with shared memory only.
Test power consumption is becoming a major concern in low power integrated circuits (ICs). This paper presents a revised low power compression architecture for scan test. In this paper, the variance in power consumptio...
Memory profiling is the process of collecting memory address traces during the execution of a program, then analyzing and characterizing the memory behavior of the program offline. With the trend that there will be mo...
With the dramatic increase in network speed during the past ten years, network processing efficiency has been significantly decreased. In this paper, we propose a network accelerating scheme, which employs cache locki...
It is an important task to tune performance for sparse matrix vector multiplication (SpMV), but it is also a difficult task because of its irregularity. In this paper, we propose a cache blocking method to improve the...
ISBN: 9781457720642 (print)
Desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud. It is becoming one of the fastest growing segments in the cloud computing market. However, as far as we know, little work has been done to understand the behavior of desktop cloud. On one hand, desktop cloud workloads differ from conventional data center workloads in that they are rich in interactive operations, and they differ from traditional non-virtualized desktop workloads in that they have an extra layer in the software stack: the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktop and high performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on the effectiveness of conventional general-purpose processors for desktop cloud workloads, we have studied the behavior of desktop cloud workloads and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite. We evaluate a Xen-based virtualization platform. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C and PARSEC, but behave similarly to the data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have a high instruction cache miss rate (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average). They also have much higher TLB miss rates and lower utilization of off-chip memory bandwidth than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. In this paper, we provide some preliminary discussion of potential architectural and micro-architectural enhancements. We hope that the performance numbers presented i
Security has become an important characteristic for many real-time systems. Due to the lack of enough and stable energy supply in battery-powered embedded systems, one of the foremost challenges is the mismatch betwee...
Modern video applications such as video codecs are memory-intensive. As an emerging non-volatile memory technology, phase change memory (PCM) will benefit video applications due to its high density, low leakage power ...