The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunit...
详细信息
The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers are faced with a unique opportunity to introducing new architecture features as well as adequate compiler technology -- together they may have profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amendable performance optimization techniques on a many-core architecture Godson-T. Godson-T architecture has several unique features that are chosen for this study: 1) chip-level global addressable memory in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory based synchronization (e.g., full-empty bit for fine-grain synchronization). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., timed tiling and variants), we developed and implement a number of many-core-based optimization for Godson-T. Our experimental study shows good performance in both execution time speedup and scalability, validate the value of globally accessed SPM and fine-grain synchronization mechanism (full-empty bits) under the Godson-T, and provides some useful guidelines for future compiler technology of many-core chip architectures.
The Godson-3A microprocessor is a quad-core version of the scalable Godson-3 multi-core series. It is physically implemented based on the 65 nm CMOS process. This 174 mm2 chip consists of 425 million transistors. The ...
详细信息
The Godson-3A microprocessor is a quad-core version of the scalable Godson-3 multi-core series. It is physically implemented based on the 65 nm CMOS process. This 174 mm2 chip consists of 425 million transistors. The maximum frequency is 1GHz with a maximum power consumption of 15 W. The main challenges of Godson-3A physical implementation include very large scale, high frequency requirement, sub-micron technology effects and aggressive time schedule. This paper describes the design methodology of the physical implementation of Godson-3A, with particular emphasis on design methods for high frequency, clock tree design, power management, and on-chip variation (OCV) issue.
The authors studied the scientific application LU decomposition deeply. A speedup model for LU decomposition was proposed, and an algorithm for LU decomposition based on bit reverse xor (BRX) was implemented. Then a d...
详细信息
The authors studied the scientific application LU decomposition deeply. A speedup model for LU decomposition was proposed, and an algorithm for LU decomposition based on bit reverse xor (BRX) was implemented. Then a dynamic absolute balance policy (DABP) algorithm was presented. In order to estimate the algorithms of 2 dimensional (2D) scatter, BRX and DABP, two different estimation functions were given and they were used to estimate the load balance problem of the algorithms. These two functions verify that the DABP algorithm has the best load balance. The simulations of the three algorithms were performed on the many-core architecture Godson-T. The experiments prove that the speedup of the DABP algorithm is 46 and it is the best performance of the three algorithms.
MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion where all the data are split into blocks and Maps are generated for each block to...
详细信息
To obtain the efficiency of DBMS, HadoopDB combines Hadoop and DBMS, and claims the superiority over Hadoop in terms of performance. However, the approach of HadoopDB is simply putting MapReduce onto unmodified single...
详细信息
Test data volume and test power are two major concerns when testing modern large circuits. Recently, selective encoding of scan slices is proposed to compress test data. This encoding technique, unlike many other comp...
详细信息
Due to complex abstractions implemented over shared data structures protected by locks, conventional symmetric multithreaded operating system kernel such as Linux is hard to achieve high scalability on the emerging mu...
详细信息
ISBN:
(纸本)9781424464425
Due to complex abstractions implemented over shared data structures protected by locks, conventional symmetric multithreaded operating system kernel such as Linux is hard to achieve high scalability on the emerging multi-core architectures, which integrate more and more cores on a single die. This paper presents GenerOS - a general asymmetric operating system kernel for multi-core systems. In principal, GenerOS partitions processing cores into application core, kernel core and interrupt core, each of which is dedicated to a specified function. In implementation, we conduct a delicate modification to Linux kernel and provide the same interface as Linux kernel so that GenerOS is compatible with legacy applications. The better performance of GenerOS mainly benefits from: (1) Applications run on their own cores with minimal interrupt and kernel support; (2) Every kernel service is encapsulated in to a serial process so that there will be fewer contentions than conventional symmetric kernel; (3) A slim schedule policy is used in the kernel core to support schedule between system calls with low overhead. Experiments with two typical workloads on 16-core AMD machine show that GenerOS behaves better than original Linux kernel when there are more processing cores (19.6% for TPC-H using oracle database management system and 42.8% for httperf using apache web server).
Multi-core processors are commonly available now, but most traditional computer architectural simulators still use single-thread execution. In this paper we use parallel discrete event simulation (PDES) to speedup a c...
详细信息
Bugs are becoming unavoidable in complex integrated circuit design. It is imperative to identify the bugs as soon as possible through post-silicon debug. For post-silicon debug, observability is one of the biggest cha...
详细信息
In current virtualized cloud platforms, resource provisioning strategy is still a big challenge. Provisioning will gain low resource utilization based on peak workload, and provisioning based on average work loads wil...
详细信息
暂无评论