Search Results - Inner Mongolia University Library

Authors: Yuan, Nan; Yu, Lei; Fan, Dong-Rui. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100039, China

ISBN: (print) 9783642245671

This paper presents the design and implementation of a runtime system (named "GodRunner") on the Godson-T many-core processor to support task-level parallelism efficiently and flexibly. GodRunner abstracts the underlying hardware resources, providing an easy-to-use programming interface. A two-grade task management mechanism is proposed to support both coarse-grained and fine-grained multithreading efficiently. Two load-balanced scheduling policies are combined flexibly in GodRunner. The software-controlled task management makes GodRunner more configurable and extensible than hard-wired ones. The experiment shows that the tasking overhead in GodRunner is as small as hundreds of cycles, which is hundreds of times faster than conventional Pthread-based multithreading on an SMP machine. Furthermore, our approach scales well and supports fine-grained tasks as small as 20k cycles optimally. © 2011 Springer-Verlag Berlin Heidelberg.

Keywords: computer architecture

A unified online fault detection scheme via checking of stability violation

2009 Design, Automation and Test in Europe Conference and Exhibition, DATE '09

Authors: Guihai, Yan; Yinhe, Han; Xiaowei, Li. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University, Chinese Academy of Sciences, Beijing 100039, China

ISBN: (print) 9783981080155

In ultra-deep submicron technology, two of the paramount reliability concerns are soft errors and device aging. Although intensive studies have addressed these two challenges, most have so far treated them separately, thereby failing to reach better performance-cost tradeoffs. To support a more efficient design tradeoff, we present a new fault model, Stability Violation, derived from analysis of signal behavior. Furthermore, we propose a unified fault detection scheme - Stability Violation based Fault Detection (SVFD), by which soft errors (both Single Event Upset and Single Event Transient), aging delay, and delay faults can be uniformly handled. SVFD can greatly facilitate soft error-resistant and aging-aware designs. SVFD is validated by conducting a set of intensive Hspice simulations targeting 65nm CMOS technology. Experimental results show that SVFD has more robust fault detection capability than previous schemes at comparable overhead in terms of area, power, and performance. © 2009 EDAA.

Keywords: Fault detection

Accelerating Iterative Big Data Computing Through MPI

Journal of Computer Science & Technology, 2015, Vol. 30, No. 2, pp. 283-294

Authors: 梁帆; 鲁小亿. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, U.S.A.

Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance because of the inefficient overlapping of computation and communication when running iterative big data applications. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design with better overlapping of computation and communication as DataMPI-Iteration, an MPI-based library, for iterative big data computing. Our performance evaluation shows DataMPI-Iteration can achieve 9X-21X speedup over Apache Hadoop, and 2X-3X speedup over Apache Spark for PageRank and K-means.

Keywords: iterative computation; DataMPI; Spark; Hadoop; MapReduce

Testable critical path selection considering process variation

Authors: Fu, Xiang; Li, Huawei; Li, Xiaowei. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100039, China

Critical path selection is very important in delay testing. Critical paths found by conventional static timing analysis (STA) tools are inadequate to represent the real timing of the circuit, since neither the testability of paths nor the statistical variation of cell delays caused by process variation is considered. This paper proposes a novel path selection method that considers process variation. The circuit is first simplified by eliminating non-critical edges under a statistical timing model, and then divided into sub-circuits, each of which has only one prime input (PI) and one prime output (PO). Critical paths are selected only in critical sub-circuits. The concepts of partially critical edges (PCEs) and completely critical edges (CCEs) are introduced to speed up the path selection procedure. Two path selection strategies are also presented to search for a testable critical path set that covers all the critical edges. The experimental results show that the proposed circuit division approach is efficient in path number reduction, and that PCEs and CCEs play an important role as a guideline during path selection. Copyright © 2010 The institute of Electronics.

Keywords: Delay circuits

Green challenges to system software in data centers

Frontiers of Materials Science, 2011, Vol. 5, No. 3, pp. 353-368

Authors: Yuzhong SUN; Yiqiang ZHAO; Ying SONG; Vajun YANG; Haifeng FANG; Hongyong ZANG; Yaqiong LI; Yunwei GAO. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100190, China

With the increasing demand for and wide application of high-performance commodity multi-core processors, both the quantity and scale of data centers are growing dramatically, bringing heavy energy consumption. Researchers and engineers have applied much effort to reducing hardware energy consumption, but software is the true consumer of power and another key to making better use of energy. System software is critical to better energy utilization, because it is not only the manager of hardware but also the bridge and platform between applications and hardware. In this paper, we summarize some trends that can affect the efficiency of data centers. Meanwhile, we investigate the causes of software inefficiency. Based on these studies, major technical challenges and corresponding possible solutions to attain green system software in programmability, scalability, efficiency and software architecture are discussed. Finally, some of our research progress on trusted energy-efficient system software is briefly introduced.

Keywords: green software multi-core data center, power efficient system software

A routing algorithm for random error tolerance in network-on-chip

12th International Conference on Human-Computer Interaction, HCI International 2007

Authors: Zhang, Lei; Li, Huawei; Li, Xiaowei. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China; Graduate University, Chinese Academy of Sciences, Beijing 100080, China

ISBN: (print) 9783540731092

In DSM and nanometer technology, more and more new fault types will appear that are difficult to predict and avoid. Applying fault-tolerant algorithms to achieve reliable on-chip communication is one of the most important issues in Network-on-Chip (NoC) design. This paper reviews the main on-chip fault-tolerant communication algorithms and then proposes a new routing algorithm with end-to-end feedback. The average transmission latency, power consumption and reliability are compared with those of other techniques. As experiments show, the proposed algorithm has lower latency and lower power consumption than the others, and it can provide high reliability. © Springer-Verlag Berlin Heidelberg 2007.

Keywords: Routing algorithms

Fetching primary and redundant instructions in turn for a fault-tolerant embedded microprocessor

14th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2008

Authors: Zhang, Shijian; Hu, Weiwu. Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China; Graduate School of the Chinese Academy of Sciences, Beijing 100039, China

ISBN: (print) 9780769534480

With the development of semiconductor technology, microprocessors become more and more susceptible to transient faults. Some proposed schemes support redundant execution of a program in a superscalar processor for fault tolerance. However, they require a huge queue to accommodate interim states, which enlarges the hardware cost significantly. This paper analyzes the effect of halving a processor's instruction fetch bandwidth on a program's performance. We find that the performance degradation resulting from halving instruction fetch bandwidth declines when instruction latency is lengthened, branch prediction accuracy deteriorates or cache miss rate increases. Since an embedded microprocessor is characterized by long instruction latency, high branch misprediction rate and high cache miss rate, a fault-tolerant scheme is proposed in which two threads fetch instructions in turn and execute in the same processor core simultaneously without any extra queue. The simulation results from eight embedded applications show that the performance penalty of our solution ranges from 6.5% to 30.1%, with an average of 22.5%, which is lower than that of the other proposed schemes. The experiment also indicates that our scheme can effectively detect faults occurring in the entire pipeline with short fault detection latency and minimal hardware cost. Our solution is well suited to realizing a reliable embedded microprocessor. © 2008 IEEE.

Keywords: Fault detection

Building algorithmically nonstop fault tolerant MPI programs

18th International Conference on High Performance Computing, HiPC 2011

Authors: Wang, Rui; Yao, Erlin; Chen, Mingyu; Tan, Guangming; Balaji, Pavan; Buntinas, Darius. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Mathematics and Computer Science, Argonne National Laboratory, United States

ISBN: (print) 9781457719516

With the growing scale of high-performance computing (HPC) systems, today and more so tomorrow, faults are a norm rather than an exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, where even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a more-or-less accepted fact that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the previous stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose in this paper a nonstop fault tolerance scheme at the application level and describe its implementation. When a failure occurs during the execution of applications, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the correct solution can be recovered algorithmically at a very low cost. In order to implement the scheme, some new fault-tolerant features of the Message Passing Interface (MPI) have been investigated and utilized in the MPICH implementation of MPI. We also describe a case study using High Performance Linpack (HPL) with these new features and evaluate the performance of both our new scheme and ABFT recovery. Experimental results show the advantage of our new scheme over ABFT recovery even at a small scale. © 2011 IEEE.

Keywords: Message passing

Register-based implementation of the sparse general matrix-matrix multiplication on GPUs

ACM SIGPLAN Notices, 2018, Vol. 53, No. 1, pp. 407-408

Authors: Liu, Junhong; He, Xin; Liu, Weifeng; Tan, Guangming. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Department of Computer Science, Norwegian University of Science and Technology, Norway

General sparse matrix-matrix multiplication (SpGEMM) is an essential building block in a number of applications. In our work, we fully utilize GPU registers and shared memory to implement an efficient and load balance...

ISBN: (print) 9781450349116

Keywords: Matrix algebra

I/O lower bounds for auto-tuning of convolutions in CNNs

26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2021

Authors: Zhang, Xiaoyang; Xiao, Junmin; Tan, Guangming. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Science, China

ISBN: (print) 9781450382946

Convolution is the most time-consuming part of the computation of convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependency and the increase in the number of model samples, the convolution suffers from high data movement (i.e., memory access) overhead. This work provides comprehensive analysis and methodologies to minimize the communication for the convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm which consists of several different sub-computations. Based on the proposed theory, we establish the data movement lower bound results for two main convolution algorithms in CNNs, namely the direct convolution and the Winograd algorithm, which represent the direct and indirect implementations of a convolution respectively. Next, derived from the I/O lower bound results, we design near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting data reuse. Furthermore, in order to push the performance of the near I/O-optimal dataflow strategies further, an aggressive auto-tuning design based on I/O lower bounds is proposed to search for an optimal parameter configuration for the direct convolution and the Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experimental evaluation results on the direct convolution and the Winograd algorithm show that our dataflow strategies with the auto-tuning approach can achieve about 3.32× performance speedup on average over cuDNN. In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, our auto-tuning method based on I/O lower bounds not only finds the optimal parameter configuration faster, but our solution also has higher performance than the optimal solution provided

Keywords: Convolution
