检索结果-内蒙古大学图书馆

PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Journal of Computer Science & Technology 2012年第2期27卷 240-255页

作者：徐新海杨学军薛京灵林宇斐林一松 National Laboratory for Parallel and Distributed Processing School of ComputerNational University of Defense Technology Programming Languages and Compilers Group School of Computer Science and Engineering University of New South Wales

GPGPUs are increasingly being used to as performance accelerators for HPC （High Performance Computing） applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world＇s fastest supercomputer in the TOP500 list, built at NUDT （National University of Defense Technology） last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability guarantees required by many HPC applications. By analyzing the SIMT （single-instruction, multiple-thread） characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based faulttolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC （a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs） shows that PartialRC reduces significantly the fault recovery overheads incurred by FullRC, by 73.5% when errors occur earlier during execution and 74.6% when errors occur later on average. In addition, PartialRC also reduces error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.

关键词： GPGPU partial recomputing fault tolerance CUDA checkpointing

来源：评论

学校读者我要写书评

暂无评论

Memory access characterization of scientific applications on GPU and its implication on low power optimization

Memory access characterization of scientific applications on...

引用

2011 International Conference on Computational and Information Sciences, ICCIS 2011

作者： Wang, Guibin National Laboratory for Parallel and Distributed Processing National University of Defense Technology China

ISBN: (纸本)9780769545011

Following current IC design technology trend, modern GPUs integrate more and more processing cores, and the speed gap between processor and memory system becomes even larger. As the number of cores continually increases, the available bandwidth per core decreases correspondingly. Therefore, memory access performance has been one of the most critical bottlenecks for better performance. This paper analyzes the impact of memory system on performance and scalability for GPU with several scientific applications using a cycle-accurate simulator. Two observations we make are (1) that memory bandwidth has relatively greater impact on performance than memory latency, because the latter factor could be well hidden with tremendous concurrent executing threads supported in modern GPU architecture, and (2) that through examining the performance scalability of variable active cores, using the maximum hardware-supported cores may not bring in better performance, especially for the memory-intensive applications. In the end we suggest a better power-efficient exploitation of GPU is to make judicious concurrency-throttling based on the memory usage in application. © 2011 IEEE.

关键词： Graphics processing unit

来源：评论

学校读者我要写书评

暂无评论

Data parallelism exploiting for H.264 encoder

Data parallelism exploiting for H.264 encoder

引用

International Conference on Multimedia and Signal processing

作者： Wen, Mei Ren, Ju Wu, Nan Su, Huayou Xun, Changqing Zhang, Chunyuan Parallel and Distributed Processing Laboratory National University of Defense Technology Changsha China

ISBN: (纸本)9780769543567

Real-time H.264 encoding of high-definition (HD) video (up to 1080p) is a challenge workload to most existing programmable processors. Instead, the novel programmable parallel processors such as stream processor, Graphic processor unit (GPU) and DSP offer a different and very promising technology for these demands. Thus, parallel computing for H.264 encoding on these processors is becoming a hot research point. It's challenged, because most emerging parallel processors focus on supporting Data Level parallel (DLP), while the dependency inherently existing in traditional H.264 encoding algorithm significantly restricts exploiting DLP. Facing the challenge, this paper presents data parallel processing methods for key modules of H.264 encoder which can eliminate the dependency restriction. The result shows that key modules including Intra-prediction, Inter-prediction and CAVLC achieve significant speedup on stream processor by using these data parallel processing methods.© 2011 IEEE.

关键词： parallel processing systems

来源：评论

学校读者我要写书评

暂无评论

Approximate Iteration Detection and Precoding in Massive MIMO

引用

China Communications 2018年第5期15卷 183-196页

作者： Chuan Tang Yerong Tao Yancang Chen Cang Liu Luechao Yuan Zuocheng Xing Luoyang Electronic Equipment Test Center LuoYang 471000China National Laboratory for Parallel and Distributed Processing National University of Defense TechnologyChangsha 410073China

Massive multiple-input multiple-output provides improved energy efficiency and spectral efficiency in 5 G. However it requires large-scale matrix computation with tremendous complexity, especially for data detection and precoding. Recently, many detection and precoding methods were proposed using approximate iteration methods, which meet the demand of precision with low complexity. In this paper, we compare these approximate iteration methods in precision and complexity, and then improve these methods with iteration refinement at the cost of little complexity and no extra hardware resource. By derivation, our proposal is a combination of three approximate iteration methods in essence and provides remarkable precision improvement on desired vectors. The results show that our proposal provides 27%-83% normalized mean-squared error improvement of the detection symbol vector and precoding symbol vector. Moreover, we find the bit-error rate is mainly controlled by soft-input soft-output Viterbi decoding when using approximate iteration methods. Further, only considering the effect on soft-input soft-output Viterbi decoding, the simulation results show that using a rough estimation for the filter matrix of minimum mean square error detection to calculating log-likelihood ratio could provideenough good bit-error rate performance, especially when the ratio of base station antennas number and the users number is not too large.

关键词： massive MIMO detection and precoding matrix inversion iteration refinement soft Viterbi decoding

来源：评论

学校读者我要写书评

暂无评论

The design of QoS management framework based on CORBA A/V STREAM architecture 4

The design of QoS management framework based on CORBA A/V ST...

引用

4th International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, HPC-Asia 2000

作者： Chen, Xiaomei Lu, Xichen Wang, Huaimin National Laboratory of Parallel and Distributed Processing National University of Defense Technology China

ISBN: (纸本)0769505902

As distributed multimedia applications become more widely diffused, flexibility for QoS management is increasingly essential. We put forward a QoS management framework called QoSMF. In order to solve the heterogeneous problems and get more flexible and reusable characteristics, we implement the QoSMF model into the CORBA A/V STREAM architecture. With the QoSMF, we can easily change the QoS strategies and arithmetic without wholly reprogramming and redesigning for different types of applications. © 2000 IEEE.

关键词： Common object request broker architecture (CORBA)

来源：评论

学校读者我要写书评

暂无评论

Communication-aware task partition and voltage scaling for energy minimization on heterogeneous parallel systems

Communication-aware task partition and voltage scaling for e...

引用

2011 12th International Conference on parallel and distributed Computing, Applications and Technologies, PDCAT 2011

作者： Wang, Guibin Song, Wei National Laboratory for Parallel and Distributed Processing National University of Defense Technology China

ISBN: (纸本)9780769545646

Heterogeneous parallel systems have become popular in general purpose computing and even high performance computing fields. There are many studies focused on harnessing heterogeneous parallel processing for better performance. However the energy optimization for heterogeneous system has not been well studied. Owing to the differences in performance and energy consumption, the energy optimization technique for heterogeneous system is different from the existing methods designed for homogeneous system. Besides typical voltage scaling method, reasonable task partitioning is also an essential method for optimizing energy consumption on heterogeneous systems. Through partitioning a data parallel task and mapping sub-tasks onto several processors, one could achieve better performance and reduced energy consumption. As the computation cost reduces with specific accelerators, the communication overhead becomes more prominent. Therefore, the task partition optimization should holistically consider the computation improvement and communication overhead to achieve higher energy efficiency. Typically, task partition and voltage scaling are not orthogonal and influence the effect of each other in the energy optimization problem. In order to harness both two knobs efficiently, this paper proposes an integer linear programming (ILP) based energy-optimal solution designed for heterogeneous system. We present a case study of optimizing MGRID benchmark on a typical CPU-GPU heterogeneous system. The experimental results demonstrate that the proposed method could exploit the heterogeneity in different processors and achieve improved energy efficiency. © 2011 IEEE.

关键词： Energy efficiency

来源：评论

学校读者我要写书评

暂无评论

Implementation of ternary Shor's algorithm based on vibrational states of an ion in anharmonic potential

引用

Chinese Physics B 2015年第3期24卷 157-165页

作者：刘威陈书明张见吴春旺吴伟陈平形 College of Computer National University of Defense Technology Science and Technology on Parallel and Distributed Processing Laboratory (PDL) National University of Defense Technology College of Science National University of Defense Technology

It is widely believed that Shor＇s factoring algorithm provides a driving force to boost the quantum computing ***, a serious obstacle to its binary implementation is the large number of quantum gates. Non-binary quantum computing is an efficient way to reduce the required number of elemental gates. Here, we propose optimization schemes for Shor＇s algorithm implementation and take a ternary version for factorizing 21 as an example. The optimized factorization is achieved by a two-qutrit quantum circuit, which consists of only two single qutrit gates and one ternary controlled-NOT gate. This two-qutrit quantum circuit is then encoded into the nine lower vibrational states of an ion trapped in a weakly anharmonic potential. Optimal control theory（OCT） is employed to derive the manipulation electric field for transferring the encoded states. The ternary Shor＇s algorithm can be implemented in one single step. Numerical simulation results show that the accuracy of the state transformations is about 0.9919.

关键词： ternary Shor's algorithm anharmonic ion trapping optimal control theory vibrational state

来源：评论

学校读者我要写书评

暂无评论

A prototype of Web-based distributed simulation environment 4

A prototype of Web-based distributed simulation environment

引用

4th International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, HPC-Asia 2000

作者： Dang, Gang Wang, Xiaoqing Zhang, Wensong Jin, Shiyao National Laboratory of Parallel and Distributed Processing National University of Defense Technology China

ISBN: (纸本)0769505902

Nowadays, many simulation environments not only can not reuse existing simulation models and tools, but also depend on operating systems and hardware platforms, and even more they lack the capability to execute over the Internet and the Web. The paper addresses these disadvantages by putting forward a prototype of a distributed simulation environment based on the Web (DSEW). It introduces both the technology of software components and the technology of integration with Java and CORBA, which enables portability, scalability, reusability, interoperability and visualization of the simulation application, and can also shorten the development period and communicate efficiently. © 2000 IEEE.

关键词： Common object request broker architecture (CORBA)

来源：评论

学校读者我要写书评

暂无评论

PIBUS: A network memory-based peer-to-peer IO buffering service

引用

6th International IFIP-TC6 Networking Conference, NETWORKING 2007

作者： Zhang, Yiming Li, Dongsheng Chu, Rui Xiao, Nong Lu, Xicheng National Laboratory for Parallel and Distributed Processing NUDT Changsha Hunan410073 China

ISBN: (纸本)9783540726050

This paper proposes a network memory-based P2P IO BUffering Service (PIBUS), which buffers blocks for IO-intensive applications in P2P network memory like a 2-level disk cache. PIBUS reduces the IO overhead when local cache is missed due to speed advantage of network memory over disks, and improves hit ratio based on accurate classification of IO behaviors. © IFIP International Federation for Information processing 2007.

关键词： Peer to peer networks

来源：评论

学校读者我要写书评

暂无评论

An adaptive data objects placement algorithm for non-uniform capacities

引用

3rd International Conference on Grid and Cooperative Computing, GCC2004

作者： Liu, Zhong Zhou, Xing-Ming National Laboratory for Parallel and Distributed Processing Changsha410073 China

ISBN: (纸本)3540235647

The capacities of storage nodes usually are non-uniform and storage nodes are dynamically changed in large-scale distributed storage systems. In this paper, a novel dynamic data objects placement algorithm is proposed;data objects are always distributed among the storage nodes according to their capabilities. When storage nodes are changed, it affords to immediately rebalance data objects distribution according to weight of storage nodes. Simulation results indicates that data objects are always distributed among the storage nodes according to their capabilities and data objects are migrated throughout all storage nodes in parallel, resulting in minimum amount of replacement of objects. © Springer-Verlag Berlin Heidelberg 2004.

关键词： Multiprocessing systems

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：