Hot data is very important for optimizing modern computer systems. For example, the identified hot data can be employed to extend the lifespan of flash memory. However, it is very challenging to effectively identify h...
The Kinetic Monte Carlo (KMC) algorithm has been widely applied to the simulation of radiation damage, grain growth and chemical reactions. To simulate at large temporal and spatial scales, domain decomposition is commonly used to parallelize the KMC algorithm. However, through experimental analysis, we find that communication overhead is the main bottleneck that affects overall performance and limits the scalability of the parallel KMC algorithm on large-scale clusters. To alleviate this problem, we present a communication aggregation approach that reduces the total number of messages and eliminates communication redundancy, and we further use neighborhood collective operations to optimize communication scheduling. Experimental results show that the optimized KMC algorithm exhibits better performance and scalability than the well-known open-source library SPPARKS. On a 32-node Xeon E5-2680 cluster (640 cores in total), the optimized algorithm reduces the total execution time by 16%, reduces the communication time by 50% on average, and achieves a 24-fold speedup over single-node (20-core) execution.
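The aggregation idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the update format `(dest_rank, site_id, value)` and the function name `aggregate_updates` are assumptions made for illustration, and no MPI library is used. In a real code the per-neighbor buffers would then be exchanged in one step, for example with an MPI-3 neighborhood collective such as `MPI_Neighbor_alltoallv`; the abstract does not say which call is used.

```python
def aggregate_updates(updates):
    """Group boundary-site updates by destination rank (hypothetical format).

    Each update is (dest_rank, site_id, value). Only the latest value per
    site is kept, so repeated updates to the same site are sent once.
    Returns {dest_rank: [(site_id, value), ...]} with sites in sorted order.
    """
    per_dest = {}
    for dest, site, value in updates:
        # A later update to the same site overwrites the earlier one.
        per_dest.setdefault(dest, {})[site] = value
    return {dest: sorted(sites.items()) for dest, sites in per_dest.items()}


# Five boundary updates produced during one sweep (made-up data).
updates = [(1, 10, "A"), (2, 7, "B"), (1, 10, "V"), (1, 11, "A"), (2, 7, "A")]
messages = aggregate_updates(updates)

naive_count = len(updates)        # one message per update
aggregated_count = len(messages)  # one message per neighboring rank
print(naive_count, aggregated_count, messages)
```

In this toy case five individual messages collapse into two, one per neighbor, and the stale values for sites 10 and 7 are never transmitted.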
Knowledge of the queue length for a radio link in a mobile data network has a significant effect on the performance of the communication protocol TCP. If the queue length can be accurately estimated and regulated to a...
ISBN: 9781479984435 (print)
In-memory graph computation systems have been used to support many important applications, such as PageRank on the web graph and social network analysis. In this paper, we study the CPU cache performance of graph computation. We have implemented a graph computation system, called GraphLite, in C/C++ based on the description of Pregel. We analyze the CPU cache behavior of the internal data structures and operations of graph computation. We then exploit CPU cache prefetching techniques to improve cache performance. Experiments on a real machine show that our solution achieves 1.9-2.2x speedups over the baseline implementation.
This paper proposes a multi-objective particle swarm optimization (PSO) algorithm with dynamic topology, named DTPSO, for solving multi-objective problems. One of the main drawbacks of classical multi-objective particl...
With the increasing diversity of application needs and computing units, servers with heterogeneous processors are becoming more and more widespread. However, the conventional SMP/ccNUMA server architecture introduces a communication bottleneck between heterogeneous processors and uses heterogeneous processors only as coprocessors, which limits the efficiency and flexibility of using them. To solve this problem, this paper proposes an intra-server interconnect fabric that supports both intra-server peer-to-peer interconnection and I/O resource sharing among heterogeneous processors. By connecting processors and I/O devices with the proposed fabric, heterogeneous processors can communicate directly with each other and run in stand-alone mode with shared intra-server resources. We design the proposed fabric by extending the de facto system I/O bus protocol PCIe (Peripheral Component Interconnect Express) and implement it in a single chip, cZodiac. By making full use of PCIe's original advantages, the interconnection and the I/O sharing mechanism are lightweight and efficient. Evaluations carried out on both an FPGA (Field Programmable Gate Array) prototype and a cycle-accurate simulator demonstrate that our design is feasible and scalable. In addition, our design is suitable not only for heterogeneous servers but also for high-density servers.
Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especia...
In this paper, a new method is proposed to evaluate the performance of concurrent systems. A concurrent system consisting of multiple processes that communicate via message passing mechanisms is modeled by a Petri net...
ISBN: 9781467379069 (print)
The Physical Unclonable Function (PUF) has broad application prospects in the field of hardware security. The arbiter PUF is a typical kind of strong PUF. However, due to its deterministic logic, attackers can use modeling techniques to break it in a short time. Therefore, this paper proposes an Obfuscation-logic-based PUF (OPUF) design. A Boolean obfuscation module is proposed to obfuscate the logic that selects the path segments in the arbiter PUF. In this way, the nondeterminacy of the PUF is improved and the computational complexity of modeling attacks is significantly increased, making the OPUF much more resistant to modeling attacks. Both the theoretical analysis and the experimental results show that the proposed OPUF design has good stability and randomness.
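To make the modeling-attack concern concrete, the sketch below uses the widely cited additive delay model of an arbiter PUF, in which the response is the sign of a weighted sum over parity features of the challenge. The `obfuscate` function is purely hypothetical: the abstract does not specify the Boolean obfuscation module, so the XOR/AND rule here is an assumption chosen only to show a nonlinear mapping placed in front of the path-selection bits. The weights are made-up values, not measured delays.

```python
def parity_features(select_bits):
    """Additive delay model features: phi[i] = prod_{j >= i} (1 - 2*c[j]).

    A trailing constant 1 acts as the bias term.
    """
    phi = [1] * (len(select_bits) + 1)
    for i in range(len(select_bits) - 1, -1, -1):
        phi[i] = phi[i + 1] * (1 - 2 * select_bits[i])
    return phi


def arbiter_response(weights, select_bits):
    """Response bit: 1 if the weighted feature sum is positive, else 0."""
    total = sum(w * p for w, p in zip(weights, parity_features(select_bits)))
    return 1 if total > 0 else 0


def obfuscate(challenge):
    """HYPOTHETICAL obfuscation (not from the paper).

    Select bit i = c[i] XOR (c[i-1] AND c[i-2]), with cyclic indexing.
    """
    n = len(challenge)
    return [challenge[i] ^ (challenge[i - 1] & challenge[i - 2]) for i in range(n)]


weights = [1.0, -2.0, 0.5, 0.25]           # made-up stage delay differences
plain = arbiter_response(weights, [0, 1, 0])
hidden_bits = obfuscate([1, 1, 0, 1])      # internal select bits differ from input
print(plain, hidden_bits)
```

Because the plain response is a linear threshold function of `parity_features(challenge)`, a linear classifier can learn it from challenge-response pairs. Inserting a nonlinear Boolean stage means the features an attacker computes from the external challenge no longer match the bits that actually select the path segments, which is the intuition behind the increased modeling cost.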
The decades-old synchronous memory bus interface has restricted many innovations in the memory system, which is facing various challenges (or walls) in the era of multi-core and big data. In this paper, we argue that a message-based interface should be adopted to replace the traditional bus-based interface in the memory system. A novel message-interface-based memory system called MIMS is proposed. The key innovation of MIMS is that processors communicate with the memory system through a universal and flexible message packet interface. Each message packet is allowed to encapsulate multiple memory requests (or commands) and additional semantic information. The memory system is made more intelligent and active by equipping it with a local buffer scheduler, which is responsible for processing packets, scheduling memory requests, preparing responses, and executing specific commands with the help of semantic information. Under the MIMS framework, many previous innovations in memory architecture, as well as new optimization opportunities such as address compression and the combination of continuous requests, can be naturally incorporated. Experimental results on a 16-core cycle-detailed simulation system show that, with accurate-granularity messages, MIMS can improve system performance by 53.21% and reduce the energy-delay product (EDP) by 55.90%. Furthermore, it can improve effective bandwidth utilization by 62.42% and reduce memory access latency by 51% on average.
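As a rough illustration of a packet that carries several memory requests plus semantic information, the sketch below merges requests to adjacent addresses before packing them. All names (`MemoryRequest`, `MessagePacket`, `combine_contiguous`, `build_packet`) and the `priority` hint are hypothetical; the abstract does not describe the MIMS packet layout, and this sketch ignores ordering constraints between reads and writes that real hardware would have to respect.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRequest:
    op: str       # "R" for read, "W" for write
    address: int  # starting byte address
    size: int     # length in bytes


@dataclass
class MessagePacket:
    requests: list
    semantic: dict = field(default_factory=dict)  # free-form hints


def combine_contiguous(requests):
    """Merge same-type requests whose address ranges are back to back."""
    merged = []
    for req in sorted(requests, key=lambda r: (r.op, r.address)):
        last = merged[-1] if merged else None
        if (last is not None and last.op == req.op
                and last.address + last.size == req.address):
            # Extend the previous request instead of adding a new one.
            merged[-1] = MemoryRequest(last.op, last.address, last.size + req.size)
        else:
            merged.append(req)
    return merged


def build_packet(requests, **semantic):
    """Pack combined requests and semantic hints into one message."""
    return MessagePacket(combine_contiguous(requests), dict(semantic))


packet = build_packet(
    [
        MemoryRequest("R", 0x1040, 64),
        MemoryRequest("R", 0x1000, 64),
        MemoryRequest("W", 0x1080, 64),
        MemoryRequest("R", 0x2000, 64),
    ],
    priority="high",
)
print(len(packet.requests), packet.semantic)
```

Here four requests become three: the two reads at 0x1000 and 0x1040 are combined into a single 128-byte read, while the write and the distant read are carried unchanged in the same packet.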