GPUs have recently been explored as a general-purpose computing platform well suited to accelerating compute-intensive EDA applications. In this paper we describe a GPU-based one- to n-detection fault simulator for both stuck-at and transition faults, which demonstrates a 20X speedup over a commercial CPU-based fault simulator. We further show new fault-simulation-based test selection applications enabled by this accelerated fault simulation. Our results demonstrate that the tests selected by these applications achieve higher fault coverage for 1-to-n detections, with steeper fault coverage curves and better delay test quality, than tests deterministically generated by commercial ATPG tools.
Web workloads are known to vary dynamically over time, which poses a challenge for resource allocation among applications. In this paper, we argue that existing dynamic resource allocation based on resource utilization has drawbacks in virtualized servers. Dynamic resource allocation driven directly by real-time user experience is more reasonable and has practical significance. To address this problem, we propose a system architecture that combines real-time measurement and analysis of user experience for resource allocation. We evaluate our proposal using Webbench. The experimental results show that these techniques can judiciously allocate system resources.
An ISS (Instruction Set Simulator) plays an important role in pre-silicon software development for ASIPs. However, traditional simulation is too slow to effectively support full-scale software development. In this paper, we propose a hybrid simulation framework that improves on previous simulation methods by aggressively utilizing host machine resources. Through binary instrumentation, the instructions of an ASIP application are categorized into two types: custom and basic instructions. In hybrid simulation, only custom instructions are simulated on the ISS, while basic instructions execute fast and natively on the host machine. We implemented this framework for an industrial ASIP to validate our approach. Experimental results show that when the implemented ISS, named GS-Sim, is applied to practical multimedia decoders, an average simulation speed of up to 1058.5 MIPS can be achieved, which is 34.7 times that of a state-of-the-art dynamic binary translation simulator and, to the best of our knowledge, the fastest reported.
Supply voltage fluctuation caused by inductive noise has become a critical problem in microprocessor design. A voltage emergency occurs when the supply voltage variation exceeds the acceptable voltage margin, jeopardizing microprocessor reliability. Existing techniques assume that all voltage emergencies lead to incorrect program execution and prudently activate rollbacks or flushes to recover, incurring high performance overhead. We observe that not all voltage emergencies result in externally visible errors, which can be exploited to avoid unnecessary protection. In this paper, we propose a substantial-impact-filter based method to tolerate voltage emergencies, built on three key techniques: 1) analyzing the architecture-level masking of voltage emergencies during program execution; 2) proposing a metric, the intermittent vulnerability factor for intermittent timing faults (IVF_itf), to quantitatively estimate the vulnerability of microprocessor structures (the load/store queue and register file) to voltage emergencies; and 3) proposing a substantial-impact-filter based method to handle voltage emergencies. Experimental results demonstrate that our approach regains nearly 57% of the performance lost to the once-occur-then-rollback approach.
As more and more Web applications emerge on the server end, the Web browser on the client end has become a host for a variety of applications beyond rendering static Web pages. This places growing performance demands on the Web browser, for which user experience is very important, and the situation is even more pressing on handheld devices. Some efforts, such as redesigning the Web browser from scratch, have been made to address this problem. In this paper, we instead optimize the main processes of the Web browser on a state-of-the-art 64-core architecture, Godson-T, developed at the Chinese Academy of Sciences, as multi-/many-core architectures become the mainstream processors in the coming years. We start a new core to process each new tab when facing intensive URL requests, and we use the scratch-pad memory (SPM) of each core as a local buffer for the HTML source data to be processed, reducing off-chip memory accesses and exploiting data locality; we additionally use a Data Transfer Agent (DTA) to transfer HTML data for backup. Experiments on a cycle-accurate simulator show that starting each tab process on a new core obtains 5.7% to 50% speedup, depending on the number of cores used to process the corresponding URL requests; with the on-chip scratch-pad memory of each core used to store the HTML data, more speedup is achieved as the number of cores increases. When the DTA is used to transfer the HTML data, the backup of HTML data achieves 2X to 5X speedups depending on the data volume.
ISBN (print): 9781612843568
Dynamic Binary Translation (DBT) has been widely used in various applications. Although new architectures and micro-architectures often create performance opportunities for programmers and compilers, such opportunities may not be exploited by legacy executables. For example, the additional general-purpose and XMM registers in the Intel64 architecture do not benefit IA-32 binaries. In this paper, we design and develop a DBT system that dynamically promotes stack variables in the source binaries to the additional registers of the target architecture. One of the most challenging problems is how to deal with the possible but rare memory aliases between promoted stack variables and other implicit memory references. We devised a runtime alias detection approach based on the page protection mechanism in Linux, together with a novel stack switching method, to catch memory aliases at run-time. This approach is much less expensive than traditional approaches such as inserting address-checking instructions. On an Intel64 platform, our DBT system with speculative stack variable promotion has sped up several SPEC CPU2006 benchmarks in IA-32 code, with the largest performance gain exceeding 45%.
Currently, with the evolution of virtualization technology, the cloud computing model has become more and more popular. However, people still have concerns about the runtime integrity and data security of cloud computing platforms, as well as the service efficiency of such platforms. At the same time, to our knowledge, the design theory of the trusted virtual computing environment and its core system software for such network-based computing platforms is still at an exploratory stage. In this paper, we argue that efficiency and isolation are the two key properties of a trusted virtual computing environment. To guarantee these two properties, based on the design principles of splitting, customizing, reconstructing, and isolation-based enhancement of the platform, we introduce TRainbow, a novel trusted virtual computing platform developed by our research group. With its two creative mechanisms, namely capacity flowing among VMs and VM-based kernel reconstruction, TRainbow provides great improvements (up to 42%) in service performance and an isolated, reliable computing environment for Internet-oriented, large-scale, concurrent services.
MPI All-to-all communication is widely used in many high performance computing (HPC) applications. In All-to-all communication, each process sends a distinct message to every other participating process. In multicore clusters, processes within a node simultaneously contend for the same network resources during All-to-all communication. Moreover, All-to-all communication of large messages requires many small synchronization messages. Under contention, their latency is orders of magnitude larger than without contention; as a result, the synchronization overhead increases significantly and accounts for a large proportion of the whole All-to-all latency. In this paper, we analyze the considerable overhead of synchronization messages. Based on the analysis, an optimization is presented that reduces the number of synchronization messages from 3N to 2√N. Evaluations on a 240-core cluster show that performance is improved by an almost constant ratio, which is mainly determined by message size and independent of system scale. The performance of All-to-all communication is improved by 25% for 32K and 64K byte messages. For an FFT application, performance is improved by 20%.
In today's computing systems, not only hardware but also software follows a directly-discarded mode, which may result in huge waste. A major challenge in green computing is the recyclability of the computing system. To address this challenge in the software field, this paper proposes a design idea for green software embodying adaptability and recyclability. The adaptable and recyclable strategy consists of two phases: the first compresses the increasingly deep software stack; the second sustains functionality recycling and code reuse. Adaptability and recyclability mean automatically decomposing complex software into parts that are easy to reuse, and automatically selecting the feasible parts for the software being written. We also explore the system software design path to adaptability and recyclability, in our previous work as well as in the future.
Due to complex abstractions implemented over shared data structures protected by locks, a conventional symmetric multithreaded operating system kernel such as Linux struggles to achieve high scalability on the emerging mu...