Search Results - Inner Mongolia University Library

International Conference on Multimedia, Communication and Computing Application, MCCA 2014

Authors: Yang, Y.W.; Wang, Y.W.; Gong, Y.Z.; Wang, Y.W. Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China

ISBN (print): 9781138027756

Library functions and system calls are major obstacles to automatic testing. Input/output (I/O) functions are a common set of library functions. When the program under test contains I/O functions, testers have to interact with the test procedure, which makes automatic testing inefficient. Unlike normal functions, I/O functions operate on a specific I/O device, and this association defeats the ordinary stub method. To solve this problem, this paper proposes a solution based on I/O device modeling and I/O function modeling. The semantics of the I/O functions are stored in the I/O function model, and the operations of I/O functions are similar to the I/O device model. The final state and properties of the I/O device are represented by the two models. Finally, the stub code is automatically generated by path-oriented stub generation technology. Experiments show that this method is effective. © 2015 Taylor & Francis Group, London.

Keywords: Semantics
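
The idea of pairing a device model with a function stub can be illustrated with a minimal Python sketch. This is not the paper's path-oriented stub generator: `InputDeviceModel`, `make_input_stub`, and `program_under_test` are hypothetical names, and the modeled device is simply an in-memory list of lines standing in for standard input.

```python
class InputDeviceModel:
    """Hypothetical model of an input device: queued lines plus a read position."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.position = 0  # device state: number of lines consumed so far

    def read_line(self):
        if self.position >= len(self.lines):
            raise EOFError("modeled device has no more input")
        line = self.lines[self.position]
        self.position += 1
        return line


def make_input_stub(device):
    """Return a stand-in for input() that reads from the modeled device."""
    def stub(prompt=""):
        return device.read_line()
    return stub


def program_under_test(read=input):
    """Toy program that would normally wait for a tester to type two numbers."""
    a = int(read())
    b = int(read())
    return a + b


device = InputDeviceModel(["3", "4"])
result = program_under_test(read=make_input_stub(device))
print(result, device.position)
```

With the stub in place the program runs without human interaction, and `device.position` records the final state of the modeled device after the run.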

Persistent B+-Trees in non-volatile main memory

41st International Conference on Very Large Data Bases, VLDB 2015

Authors: Chen, Shimin; Jin, Qin. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Computer Science Department, School of Information, Renmin University of China, China

Computer systems in the near future are expected to have Non-Volatile Main Memory (NVMM), enabled by a new generation of Non-Volatile Memory (NVM) technologies, such as Phase Change Memory (PCM), STT-MRAM, and Memristor. The non-volatility property has the promise to persist in-memory data structures for instantaneous failure recovery. However, realizing such promise requires a careful design to ensure that in-memory data structures are in known consistent states after failures. This paper studies persistent in-memory B+-Trees as B+-Trees are widely used in database and data-intensive systems. While traditional techniques, such as undo-redo logging and shadowing, support persistent B+-Trees, we find that they incur drastic performance overhead because of extensive NVM writes and CPU cache flush operations. PCM-friendly B+-Trees with unsorted leaf nodes help mediate this issue, but the remaining overhead is still large. In this paper, we propose write atomic B+-Trees (wB+-Trees), a new type of main-memory B+-Trees, that aim to reduce such overhead as much as possible. wB+-Tree nodes employ a small indirect slot array and/or a bitmap so that most insertions and deletions do not require the movement of index entries. In this way, wB+-Trees can achieve node consistency either through atomic writes in the nodes or by redo-only logging. We model fast NVM using DRAM on a real machine and model PCM using a cycle-accurate simulator. Experimental results show that compared with previous persistent B+-Tree solutions, wB+-Trees achieve up to 8.8x speedups on DRAM-like fast NVM and up to 27.1x speedups on PCM for insertions and deletions while maintaining good search performance. Moreover, we replaced Memcached's internal hash index with tree indices. Our real machine Memcached experiments show that wB+-Trees achieve up to 3.8x improvements over previous persistent tree structures with undo-redo logging or shadowing. © 2015 VLDB Endowment 21508097/15/03.

Keywords: Dynamic random access storage
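
The node layout described in the abstract (entries left in place, with ordering kept in a small indirect slot array and validity in a bitmap) can be sketched as follows. This is a toy, assumption-laden illustration rather than the authors' wB+-Tree: the class name `SlotArrayLeaf` and its methods are hypothetical, and the sketch does not model persistence, cache-line flushes, atomic writes, or node splits.

```python
from bisect import bisect_left


class SlotArrayLeaf:
    """Toy leaf node: entries never move; key order lives in the slot array."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = [None] * capacity  # unsorted (key, value) storage
        self.bitmap = 0                   # bit i set => entries[i] is valid
        self.slots = []                   # entry indices in ascending key order

    def _keys(self):
        return [self.entries[i][0] for i in self.slots]

    def insert(self, key, value):
        free = next((i for i in range(self.capacity)
                     if not (self.bitmap >> i) & 1), None)
        if free is None:
            raise OverflowError("node full; a real tree would split here")
        # Step 1: write the entry into an unused position (not yet visible).
        self.entries[free] = (key, value)
        # Step 2: publish it by replacing the slot array and setting the bit.
        pos = bisect_left(self._keys(), key)
        self.slots = self.slots[:pos] + [free] + self.slots[pos:]
        self.bitmap |= 1 << free

    def delete(self, key):
        keys = self._keys()
        pos = bisect_left(keys, key)
        if pos == len(keys) or keys[pos] != key:
            return False
        idx = self.slots[pos]
        self.slots = self.slots[:pos] + self.slots[pos + 1:]
        self.bitmap &= ~(1 << idx)  # the entry itself is left untouched
        return True

    def search(self, key):
        keys = self._keys()
        pos = bisect_left(keys, key)
        if pos < len(keys) and keys[pos] == key:
            return self.entries[self.slots[pos]][1]
        return None

    def items(self):
        return [self.entries[i] for i in self.slots]
```

In this sketch an insertion touches one entry position plus the small metadata (slot array and bitmap), which is the property the abstract credits for avoiding the movement of index entries.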

HDCat: Effectively identifying hot data in large-scale I/O streams with enhanced temporal locality

15th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2015

Authors: Chen, Jiahao; Deng, Yuhui; Huang, Zhan. Affiliations: Department of Computer Science, Jinan University, Guangzhou 510632, China; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

ISBN (digital): 9783319271224

ISBN (print): 9783319271217

Hot data is very important for optimizing modern computer systems. For example, the identified hot data can be employed to extend the lifespan of flash memory. However, it is very challenging to effectively identify hot data with low memory consumption and low runtime overhead. This paper proposes a Hot Data Catcher (HDCat) which can effectively identify hot data in large-scale I/O streams by leveraging enhanced temporal locality. HDCat only maintains a hot data queue and a candidate hot data queue to record the data access pattern by tracking a limited data set, thus effectively reducing the memory consumption. Furthermore, HDCat adopts a D-bit counter and a recency-bit to leverage both the frequency and recency contained in the data stream. Additionally, HDCat can significantly reduce the conversion between hot data and cold data. Real traces are used to evaluate the proposed approach. Experimental results demonstrate that HDCat significantly outperforms the state-of-the-art Multi-hash algorithm and the two-level LRU algorithm. © Springer International Publishing Switzerland 2015.

Keywords: Hash functions

Optimizing Parallel Kinetic Monte Carlo Simulation by Communication Aggregation and Scheduling

2015 Forum on Big Data Technology and Applications

Authors: Baodong Wu; Shigang Li; Yunquan Zhang. Affiliation: State Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences

The Kinetic Monte Carlo (KMC) algorithm has been widely applied to the simulation of radiation damage, grain growth and chemical reactions. To simulate at a large temporal and spatial scale, domain decomposition is commonly used to parallelize the KMC algorithm. However, through experimental analysis, we find that the communication overhead is the main bottleneck which affects the overall performance and limits the scalability of the parallel KMC algorithm on large-scale clusters. To alleviate the above problems, we present a communication aggregation approach to reduce the total number of messages and eliminate the communication redundancy, and further utilize neighborhood collective operations to optimize the communication scheduling. Experimental results show that the optimized KMC algorithm exhibits better performance and scalability compared with the well-known open-source library SPPARKS. On a 32-node Xeon E5-2680 cluster (640 cores in total), the optimized algorithm reduces the total execution time by 16%, reduces the communication time by 50% on average, and achieves a 24-fold speedup over the single-node (20 cores) execution.

Keywords: Domain decomposition; Communication aggregation; Communication scheduling; Neighborhood collectives
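
The general idea of communication aggregation can be sketched in plain Python without any MPI dependency. The abstract does not define what counts as redundant communication, so this sketch assumes that an earlier update to a site is redundant once a later update to the same site exists. The function name `aggregate_messages` and the `(dest_rank, site_id, new_state)` tuple layout are hypothetical.

```python
from collections import defaultdict


def aggregate_messages(updates):
    """Combine per-site updates into one message per destination rank.

    updates: iterable of (dest_rank, site_id, new_state) tuples in send order.
    Returns {dest_rank: [(site_id, state), ...]} with one list per destination.
    """
    latest = defaultdict(dict)
    for dest, site, state in updates:
        # Assumption: a later update to the same site supersedes earlier ones.
        latest[dest][site] = state
    return {dest: sorted(sites.items()) for dest, sites in latest.items()}


updates = [(1, 5, "A"), (2, 7, "B"), (1, 5, "C"), (1, 6, "A")]
messages = aggregate_messages(updates)
print(len(updates), "updates ->", len(messages), "messages")
```

Here four individual updates collapse into two messages, one per destination, and the superseded update to site 5 is dropped before sending.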

Achieving high throughput and low delay in mobile data networks by accurately predicting queue lengths

12th ACM International Conference on Computing Frontiers, CF 2015

Authors: Liu, Ke; Lee, Jack Y.B. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Department of Information Engineering, Chinese University of Hong Kong, Hong Kong

ISBN (print): 9781450333580

Knowledge of the queue length for a radio link in a mobile data network has a significant effect on the performance of the communication protocol TCP. If the queue length can be accurately estimated and regulated to a target value, then low end-to-end delay and high bandwidth utilization can be achieved. One method for estimating and regulating the queue length is the queue-length-based congestion control (QCC) algorithm. However, this algorithm estimates the queue length over one RTT interval prior to transmission, and the actual queue length after that time can differ significantly, because the bandwidth can vary substantially between the neighboring propagation delays, which could result in a false positive in the queue length adaption, thereby affecting the QoS performance. To address this problem, we propose PQ-TCP, a method that predicts the queue length directly by predicting the bandwidth variations over the ensuing period of time equal to the propagation delay and using post-bandwidth analysis to minimize the prediction error. Trace-driven simulations are used to show that the QoS performance of PQ-TCP is superior to that of current QCC algorithms. PQ-TCP achieves the lowest RTT while maintaining nearly 90% bandwidth utilization for a small target queue length of 5 packets. © Copyright 2015 ACM.

Keywords: Quality of service

Optimizing CPU cache performance for Pregel-like graph computation

International Conference on Data Engineering Workshops

Authors: Songjie Niu; Shimin Chen. Affiliation: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences

ISBN (print): 9781479984435

In-memory graph computation systems have been used to support many important applications, such as PageRank on the web graph and social network analysis. In this paper, we study the CPU cache performance of graph computation. We have implemented a graph computation system, called GraphLite, in C/C++ based on the description of Pregel. We analyze the CPU cache behavior of the internal data structures and operations of graph computation. Then we exploit CPU cache prefetching techniques to improve the cache performance. Real machine experimental results show that our solution achieves 1.9-2.2x speedups compared to the baseline implementation.

Keywords: Prefetching; Arrays; Computational modeling; Aggregates; Web pages; Programming
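
For readers unfamiliar with the Pregel model mentioned in the abstract, the following is a minimal sketch of vertex-centric PageRank organized into supersteps. It is not GraphLite's API and says nothing about cache prefetching: the function name `pagerank_supersteps` and its parameters are hypothetical, and the sketch assumes every vertex appears as a key in `out_edges` and has at least one outgoing edge.

```python
def pagerank_supersteps(out_edges, supersteps=30, damping=0.85):
    """Pregel-style PageRank: each superstep, vertices send rank shares
    along their out-edges, then every vertex combines its incoming messages."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in out_edges}  # messages summed per target
        for v, targets in out_edges.items():
            share = rank[v] / len(targets)
            for t in targets:
                inbox[t] += share            # "send" a message to vertex t
        rank = {v: (1 - damping) / n + damping * inbox[v] for v in out_edges}
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(graph)
print({v: round(r, 3) for v, r in ranks.items()})
```

The inner loop that scatters shares into `inbox` follows the graph's edges rather than memory order, which is the kind of irregular access pattern that motivates studying CPU cache behavior in such systems.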

A co-evolutionary particle swarm optimization with dynamic topology for solving multi-objective optimization problems

Advances in Modelling and Analysis A, 2016, Vol. 53, No. 1, pp. 145-159

Authors: Wu, Daqing; Tang, Lixiang; Li, Haiyan; Ouyang, LiJun. Affiliations: Computer Science and Technology Institute, University of South China, Hengyang, Hunan, China; Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200240, China; Zigong 643000, China; Key Laboratory of Guangxi High Schools for Complex System and Computational Intelligence, Guangxi University for Nationalities, Nanning 530006, China; Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, Anhui Province 230039, China; Department of Business Administration, Hunan University of Finance and Economics, Hunan 410205, China

This paper proposes a multi-objective particle swarm optimization (PSO) algorithm with dynamic topology, named DTPSO, for solving multi-objective problems. One of the main drawbacks of the classical multi-objective particle swarm optimization algorithm is low diversity. To overcome this disadvantage, DTPSO uses two dynamic local best particles to lead the search particles, with multiple populations to deal with multiple objectives, and maintains the diversity of newly found non-dominated solutions by partitioning the search space into a fixed number of cells. The proposed DTPSO is validated through comparisons with two other multi-objective algorithms using established benchmarks and metrics. Simulation results demonstrate that DTPSO shows competitive, if not better, performance compared to the other algorithms. © 2016, AMSE Press. All rights reserved.

Keywords: Multiobjective optimization
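
Two building blocks mentioned in the abstract, non-dominated solutions and a partition of the space into a fixed number of cells, can be sketched generically. This is standard Pareto-dominance bookkeeping under a minimization assumption, not DTPSO itself; the function names and the uniform grid used in `cell_index` are hypothetical choices made for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one
    (all objectives are assumed to be minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


def cell_index(point, lower, upper, cells_per_dim):
    """Map a point to its cell in a uniform grid over [lower, upper]."""
    idx = []
    for x, lo, hi in zip(point, lower, upper):
        frac = (x - lo) / (hi - lo)
        idx.append(min(int(frac * cells_per_dim), cells_per_dim - 1))
    return tuple(idx)


points = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
front = non_dominated(points)
cells = {p: cell_index(p, (0, 0), (4, 4), 4) for p in front}
print(front, cells)
```

Counting how many archive members fall into each cell gives a simple crowding measure, which is one common way such a grid is used to preserve diversity.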

An Intra-Server Interconnect Fabric for Heterogeneous Computing

Journal of Computer Science & Technology, 2014, Vol. 29, No. 6, pp. 976-988

Authors: 曹政 (Zheng Cao); 刘小丽; 李强; 刘小兵; 王展; 安学军. Affiliation: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences

With the increasing diversity of application needs and computing units, servers with heterogeneous processors are becoming more and more widespread. However, the conventional SMP/ccNUMA server architecture introduces a communication bottleneck between heterogeneous processors and only uses heterogeneous processors as coprocessors, which limits the efficiency and flexibility of using heterogeneous processors. To solve this problem, this paper proposes an intra-server interconnect fabric that supports both intra-server peer-to-peer interconnection and I/O resource sharing among heterogeneous processors. By connecting processors and I/O devices with the proposed fabric, heterogeneous processors can communicate directly with each other and run in stand-alone mode with shared intra-server resources. We design the proposed fabric by extending the de facto system I/O bus protocol PCIe (Peripheral Component Interconnect Express) and implement it with a single chip, cZodiac. By making full use of PCIe's original advantages, the interconnection and the I/O sharing mechanism are lightweight and efficient. Evaluations carried out on both the FPGA (Field Programmable Gate Array) prototype and the cycle-accurate simulator demonstrate that our design is feasible and scalable. In addition, our design is suitable not only for the heterogeneous server but also for the high-density server.

Keywords: heterogeneous system; interconnection; I/O virtualization; PCI-express

A small-footprint accelerator for large-scale neural networks

Authors: Chen, Tianshi; Zhang, Shijin; Liu, Shaoli; Du, Zidong; Luo, Tao; Gao, Yuan; Liu, Junjie; Wang, Dongsheng; Wu, Chengyong; Sun, Ninghui; Chen, Yunji; Temam, Olivier. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; TNLIST, Tsinghua University, Beijing 100084, China; Inria, Saclay, France; CAS Center for Excellence in Brain Science, China

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have been focusing on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87× faster, and it can reduce the total energy by 21.08×. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications. © 2015 ACM.

Keywords: Deep neural networks
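
The headline figures in the abstract can be turned into efficiency ratios with simple arithmetic. The derived values below are plain quotients of the quoted numbers (452 GOP/s, 485 mW, 3.02 mm²), not results reported by the authors, and the variable names are chosen here for illustration only.

```python
# Figures quoted in the abstract.
throughput_gops = 452.0  # GOP/s
power_w = 0.485          # 485 mW expressed in watts
area_mm2 = 3.02          # mm^2

# Derived ratios (simple division, not values stated in the abstract).
gops_per_watt = throughput_gops / power_w
gops_per_mm2 = throughput_gops / area_mm2

print(f"{gops_per_watt:.1f} GOP/s per W, {gops_per_mm2:.1f} GOP/s per mm^2")
```

These ratios make it easier to compare the design against other accelerators on energy efficiency and area efficiency rather than raw throughput alone.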

A polynomial algorithm to performance analysis of concurrent systems via petri nets and ordinary differential equations

Authors: Ding, Zuohua; Zhou, Yuan; Zhou, Meng Chu. Affiliations: Laboratory of Intelligent Computing and Software Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China; Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China; Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102-1982, United States

In this paper, a new method is proposed to evaluate the performance of concurrent systems. A concurrent system consisting of multiple processes that communicate via message passing mechanisms is modeled by a Petri net, which is in turn represented by a set of ordinary differential equations (ODEs) of a restricted type. The equations describe the system state changes, and the solutions, also called state measures, can be used for the performance analysis such as estimating response time, throughput and efficiency. This method can avoid a state explosion problem encountered by the conventional methods based on Continuous-Time Markov Chains. Its application to an IBM business system is given as an example. © 2004-2012 IEEE.

Keywords: Ordinary differential equations
