ISBN: (print) 9781612843568
Dynamic Binary Translation (DBT) is widely used in a variety of applications. Although new architectures and micro-architectures often create performance opportunities for programmers and compilers, legacy executables may be unable to exploit them. For example, the additional general-purpose and XMM registers in the Intel64 architecture do not benefit IA-32 binaries. In this paper, we design and develop a DBT system that dynamically promotes stack variables in the source binaries to the additional registers of the target architecture. One of the most challenging problems is handling the possible but rare memory aliases between promoted stack variables and other implicit memory references. We devise a runtime alias detection approach, based on the page protection mechanism in Linux and a novel stack switching method, to catch memory aliases at run time. This approach is much less expensive than traditional approaches such as inserting address-checking instructions. On an Intel64 platform, our DBT system with speculative stack variable promotion speeds up several SPEC CPU2006 benchmarks compiled as IA-32 code, with the largest performance gain exceeding 45%.
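The speculate-and-recover idea in this abstract can be illustrated with a toy model. The sketch below is not the authors' system: it does not call the real Linux page protection interface and does not model stack switching. `ToyMemory`, `Translator`, and `AliasFault` are hypothetical names, and the exception stands in for the fault a protected page would raise. It only shows why a rare alias can be caught without checking every address: promoted variables live in "registers", their pages are marked protected, and an implicit reference that touches such a page triggers a one-time write-back.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


class AliasFault(Exception):
    """Stands in for the fault raised when a protected page is touched."""


class ToyMemory:
    """Address -> value store with page-granular protection (hypothetical)."""

    def __init__(self):
        self.cells = {}
        self.protected_pages = set()

    def protect(self, addr):
        self.protected_pages.add(addr // PAGE_SIZE)

    def unprotect(self, addr):
        self.protected_pages.discard(addr // PAGE_SIZE)

    def load(self, addr):
        if addr // PAGE_SIZE in self.protected_pages:
            raise AliasFault(addr)
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        if addr // PAGE_SIZE in self.protected_pages:
            raise AliasFault(addr)
        self.cells[addr] = value


class Translator:
    """Keeps promoted stack slots in 'registers' and recovers on an alias."""

    def __init__(self, mem):
        self.mem = mem
        self.promoted = {}  # stack address -> value held in a register
        self.faults = 0

    def promote(self, addr):
        # Copy the slot into a register, then protect its page.
        self.promoted[addr] = self.mem.load(addr)
        self.mem.protect(addr)

    def write_promoted(self, addr, value):
        # Direct accesses to a promoted slot never touch memory.
        self.promoted[addr] = value

    def demote_all(self):
        # Lift protection first, then write every register back to memory.
        for addr in self.promoted:
            self.mem.unprotect(addr)
        for addr, value in self.promoted.items():
            self.mem.store(addr, value)
        self.promoted.clear()

    def implicit_load(self, addr):
        # An indirect reference: no address check on the fast path.
        try:
            return self.mem.load(addr)
        except AliasFault:
            self.faults += 1
            self.demote_all()
            return self.mem.load(addr)
```

In this model the common case pays nothing for alias checking; only the rare aliasing access takes the slow path, which mirrors the cost argument made against inserting address-checking instructions.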
An 8-bit bit-parallel RSFQ microprocessor, named HUTU, is proposed. It can execute 28 different instructions, each eight bits long. A Harvard-type architecture is adopted for parallel processing between the control unit and the datapath. The control unit uses an asynchronous timing method to avoid pipeline flushing and to reduce area. Concurrent-flow clocking is adopted in the datapath for high performance. The simulation results show that the elements of HUTU operate correctly.
Current QoS-aware automatic service composition queries over a network of web services are often one-time in nature. After a network of web services is built, such queries are issued once, and answers are found from t...
Strategies for partitioning an application's data and computation play a fundamental role in determining the efficiency of parallelization. This paper describes a sophisticated strategy for partitioning data and computation known as multi-partitioning, which can support the best parallelization for some applications, such as line sweep computations. However, implementing multi-partitioning is very difficult and, as far as we know, no automatic parallelizing compiler supports this partitioning strategy. Although the dHPF compiler implements multi-partitioning as a special extension of block-style HPF partitioning, it still requires the programmer to analyze the application and decide on the data distribution scheme. In this paper, we present a global tiling transformation algorithm and a tile-to-processor mapping strategy, called hyper-diagonal modular mapping, to implement the multi-partitioning strategy. Experiments with NPB2.3-serial SP show that the code generated by the compiler achieves scalable performance.
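The property that makes multi-partitioning suit line sweeps can be shown with a small sketch. The abstract does not define hyper-diagonal modular mapping, so the code below uses the simpler, well-known 2D diagonal mapping as a stand-in; `diagonal_owner`, `owners_in_slab`, and `is_multipartitioning` are hypothetical helper names. With a p x p grid of tiles on p processors, every row and every column of tiles contains each processor exactly once, so all processors stay busy at every step of a sweep along either dimension.

```python
def diagonal_owner(i, j, p):
    """Processor that owns tile (i, j) in a p x p tile grid (2D diagonal mapping)."""
    return (j - i) % p


def owners_in_slab(axis, index, p):
    """Owners of the p tiles in one slab perpendicular to the sweep axis."""
    if axis == 0:
        return [diagonal_owner(index, j, p) for j in range(p)]
    return [diagonal_owner(i, index, p) for i in range(p)]


def is_multipartitioning(p):
    """True if every slab along both axes contains each processor exactly once."""
    everyone = list(range(p))
    return all(
        sorted(owners_in_slab(axis, index, p)) == everyone
        for axis in (0, 1)
        for index in range(p)
    )
```

For p = 3 the tile rows are owned by processors [0, 1, 2], [2, 0, 1], and [1, 2, 0], which is a Latin square. A plain block partitioning lacks this property: in a sweep across the blocks, only the processors owning the current slab would be active.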
The logic design of a 16-bit bit-slice shifter for 64-bit superconducting rapid single-flux-quantum (RSFQ) microprocessors is proposed. The shifter supports three types of shift operations: logical shift, arithmetic shift, and rotate. Each 64-bit input operand is divided into four 16-bit slices. To simulate the digital function and timing of the proposed 16-bit bit-slice shifter, we design a logic-level simulation model based on the Open Dataset of CONNECT Cell Library for AIST ADP2. From the simulation, circuit information such as the number of Josephson junctions, the area, and the latency of the 16-bit bit-slice shifter can be obtained. The simulation results show that the proposed 16-bit bit-slice shifter works correctly.
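A functional view of slice-wise shifting can be sketched in software. This is a behavioral illustration only, not the proposed RSFQ circuit or its timing: the abstract does not state the shift direction or how slices are sequenced, so the sketch assumes right shifts and computes each 16-bit output slice from two neighboring input slices. `split_slices`, `join_slices`, and `shift_right` are hypothetical names.

```python
SLICE_BITS = 16
NUM_SLICES = 4
SLICE_MASK = (1 << SLICE_BITS) - 1
WORD_MASK = (1 << (SLICE_BITS * NUM_SLICES)) - 1


def split_slices(word):
    """Split a 64-bit word into four 16-bit slices, least significant first."""
    return [(word >> (SLICE_BITS * k)) & SLICE_MASK for k in range(NUM_SLICES)]


def join_slices(slices):
    """Reassemble a 64-bit word from four 16-bit slices."""
    word = 0
    for k, s in enumerate(slices):
        word |= (s & SLICE_MASK) << (SLICE_BITS * k)
    return word


def shift_right(word, amount, mode):
    """Right shift by 0..63 bits; mode is 'logical', 'arithmetic', or 'rotate'."""
    if mode not in ("logical", "arithmetic", "rotate"):
        raise ValueError("unknown mode")
    if not 0 <= amount < SLICE_BITS * NUM_SLICES:
        raise ValueError("amount out of range")
    slices = split_slices(word & WORD_MASK)
    whole, bits = divmod(amount, SLICE_BITS)
    sign = slices[-1] >> (SLICE_BITS - 1)

    def source(idx):
        # Slices past the top are filled according to the shift type.
        if idx < NUM_SLICES:
            return slices[idx]
        if mode == "rotate":
            return slices[idx - NUM_SLICES]
        if mode == "arithmetic" and sign:
            return SLICE_MASK
        return 0

    out = []
    for k in range(NUM_SLICES):
        low = source(k + whole) >> bits
        high = (source(k + whole + 1) << (SLICE_BITS - bits)) & SLICE_MASK
        out.append(low | high)
    return join_slices(out)
```

The three operations differ only in what fills the positions above the most significant slice: zeros for a logical shift, copies of the sign bit for an arithmetic shift, and the wrapped-around low slices for a rotate.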
Recently, deep neural networks (DNNs) have been widely applied in mobile intelligent applications. The inference for the DNNs is usually performed in the cloud. However, it leads to a large overhead of transmitting da...
Weakly supervised object detection (WSOD) focuses on training an object detector with only image-level annotations, and is challenging due to the gap between the supervision and the objective. Most existing approaches...
Rapid single-flux-quantum (RSFQ) is expected to be the next-generation integrated circuit technology because of its ultra-high speed and ultra-low power consumption. We propose datapath circuits for an 8-bit bit-parallel RSFQ microprocessor. The proposed datapath circuits process 8-bit data each clock cycle. Seven instructions are executed in the datapath: ADD, ADDI, IN, OUT, LOADI, SRL, and MOV. The datapath circuits consist of eight input ports, eight output ports, five multiplexers (MUXs), two 8-bit data registers, and one 8-bit bit-parallel arithmetic logic unit (ALU). The datapath circuits contain 12 pipeline stages and 2993 Josephson junctions (JJs), based on the Open Dataset of CONNECT Cell Library for AIST ADP2 and excluding wiring cells. We perform digital simulation of the proposed datapath circuits. The simulation results show correct operation at the assumed frequency of 20 GHz.
We observe that methods based on channel or position attention mechanisms perform differently across object scales, and some state-of-the-art detectors applying feature pyramids are integrated with various ...
Face recognition is widely used in practice. However, different visual environments require different methods, and face recognition remains difficult in complex environments. Therefore, this paper mainly experiments c...