ISBN: 9781450321532 (print)
Power efficiency is increasingly critical to battery-powered smartphones. Because users value their experience above all, we propose that power optimization should directly respect the user experience. We conduct a statistical sample survey and study the correlation among the user experience, system runtime activities, and the minimal required frequency of an application processor. This study motivates an intelligent self-adaptive scheme, SmartCap, which automatically identifies the most power-efficient state of the application processor according to system activities. Compared to prior Linux power adaptation schemes, SmartCap saves 11% to 84% of power, depending on the application, with little decline in user experience.
ISBN: 9781450320146 (print)
Sparse matrix-vector multiplication (SpMV) is an important kernel in both traditional high-performance computing and emerging data-intensive applications. So far, SpMV libraries have been optimized by either application-specific or architecture-specific approaches, which makes them too complicated to be used extensively in real applications. In this work we develop a Sparse Matrix-vector multiplication Auto-Tuning system (SMAT) to bridge the gap between specific optimizations and general-purpose usage. SMAT provides users with a unified programming interface in compressed sparse row (CSR) format and automatically determines the optimal format and implementation for any input sparse matrix at runtime. For this purpose, SMAT leverages a learning model, generated in an off-line stage by a machine learning method with a training set of more than 2000 matrices from the UF sparse matrix collection, to quickly predict the best combination of the matrix feature parameters. Our experiments show that SMAT achieves up to 51 GFLOPS in single precision and 37 GFLOPS in double precision on mainstream x86 multi-core processors, both more than 3 times faster than the Intel MKL library. We also demonstrate its adaptability in an algebraic multi-grid solver from the Hypre library, with a reported performance improvement of more than 20%.
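For readers unfamiliar with the CSR format named in this abstract, the following is a minimal sketch of a plain CSR SpMV kernel. It is not SMAT's implementation and shows none of its format selection or tuning; the function name `csr_spmv` and the array names `values`, `col_idx`, and `row_ptr` are illustrative assumptions.

```python
from typing import List, Sequence


def csr_spmv(
    values: Sequence[float],
    col_idx: Sequence[int],
    row_ptr: Sequence[int],
    x: Sequence[float],
) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1
    (All names are hypothetical, chosen for this sketch.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the nonzeros that belong to this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
example_values = [4.0, 1.0, 2.0, 3.0, 5.0]
example_col_idx = [0, 2, 1, 0, 2]
example_row_ptr = [0, 2, 3, 5]
example_y = csr_spmv(example_values, example_col_idx, example_row_ptr, [1.0, 2.0, 3.0])
```

The inner loop touches only stored nonzeros, which is why the best storage format depends on the sparsity pattern of the input matrix.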
The world faces an energy problem. Oil supply is gradually running out. Its use is polluting the planet with greenhouse gas. Most alternative energy sources also pose some environmental problems. Hence the efficient u...
The computer architecture simulator is an important tool for computer architecture researchers. Recent developments in parallel architectures pose great challenges to computer simulation. On the target side, as processors move towards multi-core and many-core designs, the complexity of the target system doubles at the pace of Moore's law as the number of simulated target cores grows; on the host side, the speed of sequential simulation has stalled because the speed of a single host processor has stalled. For these two reasons, sequential simulation can no longer meet the challenge of new parallel architectures. In this paper, we describe the necessity and feasibility of parallel simulation for parallel computer architectures using two examples: a many-core processor simulator and a many-core cluster simulator. For the many-core processor simulator, we use parallel discrete event simulation (PDES) to speed it up 10.9 times without loss of accuracy. For the many-core cluster simulator, we simulate a cluster at 1024-core scale, with MPI/Pthreads runtime support.
MPI Alltoall is an important collective operation. In multicore clusters, many processes run on a node. On the one hand, shared memory can be adopted to optimize Alltoall communication of small messages through leader-based schemes. However, because these schemes adopt a fixed number of leader processes, they cannot achieve optimal performance for all small messages. On the other hand, processes within a node contend for the same network resource. Alltoall communication of large messages uses many synchronization messages, and the contention increases their latency many times over, so the synchronization overhead cannot be ignored. To solve these problems, two optimizations are presented. For small messages, the PLP method adopts a variable number of leader processes. For large messages, the LSS method reduces the number of synchronization messages from 3N to 2√N. The evaluations validate both methods. For small messages, the PLP method always obtains optimal performance. For large messages, the LSS method yields an almost constant percentage improvement: performance improves by 25% for 32 KB and 64 KB messages.
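To give a feel for the reduction from 3N to 2√N synchronization messages, the sketch below simply evaluates both expressions. The abstract does not define N, so the code treats it as an opaque positive count; the function names are hypothetical and nothing here models the LSS method itself.

```python
import math


def sync_messages_baseline(n: int) -> int:
    """Synchronization message count stated for the baseline: 3N."""
    return 3 * n


def sync_messages_lss(n: int) -> float:
    """Synchronization message count stated for LSS: 2 * sqrt(N)."""
    return 2 * math.sqrt(n)


# Illustrative values of N only; they are not taken from the paper.
comparison = {
    n: (sync_messages_baseline(n), sync_messages_lss(n))
    for n in (16, 64, 256)
}
```

Because one count grows linearly and the other with the square root of N, the gap between them widens as N grows.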
As an emerging non-volatile memory technology, phase change memory (PCM) is a promising alternative to traditional memories such as DRAM. Despite its non-volatility, high density, low standby power, and resilience to soft errors, PCM has limited write endurance, or lifetime: each PCM cell can be overwritten only a finite number of times. More importantly, the limited lifetime gives malicious attackers an opportunity to intentionally increase write traffic into PCM. In this paper, from the standpoint of attackers, we propose random stream attack (RSAK) methods for phase change memory used in video applications. Experimental results show that, compared to natural video sequences, RSAK incurs higher total write traffic and therefore a shorter lifetime. RSAK also gives hints on how to build more secure PCM for video applications that can counter malicious write streams.
In modern processor systems, on-chip Last Level Caches (LLCs) are used to bridge the speed gap between CPUs and off-chip memory. In recent years, the LRU policy effectiveness in low level caches has been questioned. A...
ISBN: 9781577356332 (print)
In this paper, we address the column-based low-rank matrix approximation problem using a novel parallel approach. Our approach is based on the divide-and-combine idea. We first perform column selection on submatrices of the original data matrix in parallel, and then combine the selected columns into the final output. Our approach enjoys a theoretical relative-error upper bound. In addition, our column-based low-rank approximation partitions data in a deterministic way and makes no assumptions about matrix coherence. Compared with traditional methods, our approach scales to large-scale matrices. Finally, experiments on both simulated and real-world data show that our approach is both efficient and effective.
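The divide-and-combine structure described in this abstract can be sketched as follows. This is only an outline under stated assumptions: the abstract does not give the per-submatrix selection rule, so the sketch substitutes a simple stand-in (keep the columns with the largest squared norm), and it runs the blocks sequentially rather than in parallel. All function names and parameters are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # row-major: a[i][j] is row i, column j


def select_columns(a: Matrix, cols: List[int], k: int) -> List[int]:
    """Stand-in selection rule (an assumption, not the paper's method):
    keep the k columns of this block with the largest squared norm."""
    norms = [sum(row[j] ** 2 for row in a) for j in cols]
    # Sort by descending norm; break ties by column index for determinism.
    ranked = sorted(zip(cols, norms), key=lambda p: (-p[1], p[0]))
    return sorted(j for j, _ in ranked[:k])


def divide_and_combine(a: Matrix, num_blocks: int, k_per_block: int) -> List[int]:
    """Split columns into contiguous blocks deterministically, select
    within each block independently, then concatenate the selections."""
    n_cols = len(a[0])
    block_size = -(-n_cols // num_blocks)  # ceiling division
    selected: List[int] = []
    for start in range(0, n_cols, block_size):
        cols = list(range(start, min(start + block_size, n_cols)))
        # Each block is independent, so this call could run in parallel.
        selected.extend(select_columns(a, cols, k_per_block))
    return selected


example_matrix = [
    [1.0, 0.0, 3.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
]
example_selection = divide_and_combine(example_matrix, num_blocks=2, k_per_block=1)
```

Because the per-block selections do not depend on one another, the loop body is the natural unit to hand to parallel workers.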
Cloud computing is a recently developed technology for complex systems with massive service sharing, which is different from the resource sharing of grid computing systems. In a cloud environment, service requ...
ISBN: 9781450321532 (print)
Voltage emergencies have become a major challenge for multi-core processors because core-to-core resonance may endanger all cores and thereby jeopardize system reliability. We observe that applications following the SPMD (Single Program, Multiple Data) programming model tend to trigger domain-wide voltage resonance, because multiple threads sharing the same function body exhibit similar power activity. When threads are judiciously relocated among the cores, voltage droops can be greatly reduced. We propose "Orchestrator", a sensor-free, non-intrusive scheme for multi-core architectures that smooths voltage droops. Orchestrator focuses on inter-core voltage interactions and maximally leverages thread diversity to keep voltage droops from reinforcing one another across cores. Experimental results show that Orchestrator can reduce voltage emergencies by up to 64% on average while also improving performance.