This paper presents the design and implementation of a highly efficient double-precision general matrix multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach: we first develop a performance model for this architecture and then use it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly optimized inner kernel, GEBP, developed in assembly language. We obtain GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture, with its performance-critical block sizes determined analytically, and (2) optimizing its computations through loop unrolling, instruction scheduling and software-implemented register rotation, and by taking advantage of A64 instructions that support efficient FMA operations, data transfers and prefetching. We compare our DGEMM implemented in OpenBLAS with another implemented in ATLAS (also built on a highly optimized GEBP in assembly). Our implementation outperforms the one in ATLAS, improving the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These results translate into relative efficiency improvements of 7.79% on one core and 7.70% on eight cores. In addition, the efficiency of our implementation on one core is very close to the theoretical upper bound of 91.5% obtained from micro-benchmarking. Our parallel implementation achieves good performance and scalability under varying thread counts across the range of matrix sizes evaluated.
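The blocking idea behind a GEBP-style kernel can be sketched in a few lines of Python. This is only an illustration of the loop structure (a block of A multiplied by a panel of B), not the assembly kernel described in the abstract. The function names and the block sizes `mc` and `kc` are assumptions chosen for the example; the abstract derives its block sizes analytically and does not state them.

```python
# Illustrative only: GEBP-style loop blocking in pure Python.
# mc and kc are placeholder block sizes, not the values derived in the paper.

def gebp(a_block, b_panel, c_panel):
    """Accumulate C_panel += A_block * B_panel in place (lists of lists)."""
    for i, a_row in enumerate(a_block):
        c_row = c_panel[i]
        for p, a_ip in enumerate(a_row):
            b_row = b_panel[p]
            for j, b_pj in enumerate(b_row):
                c_row[j] += a_ip * b_pj


def blocked_dgemm(a, b, mc=2, kc=2):
    """Return A * B by sweeping mc-by-kc blocks of A against kc-by-n panels of B."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for p0 in range(0, k, kc):                       # rows p0 .. p0+kc of B
        b_panel = b[p0:p0 + kc]
        for i0 in range(0, m, mc):                   # rows i0 .. i0+mc of A
            a_block = [row[p0:p0 + kc] for row in a[i0:i0 + mc]]
            # The slice shares the row lists of c, so updates land in c itself.
            gebp(a_block, b_panel, c[i0:i0 + mc])
    return c
```

In a tuned implementation the block would be sized to stay resident in cache while the panel streams past it, which is what raises the compute-to-memory access ratio; the sketch only shows the traversal order.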
This paper reports our experience optimizing the performance of a high-order, highly accurate Computational Fluid Dynamics (CFD) application (HOSTA) on the state-of-the-art multicore processor and the emerging Intel Many...
ISBN: 9781479982172 (print)
Smart objects such as smart watches and smart phones are changing our daily lives. With their rich sensing, communication and computation abilities, they integrate the cyber world with the physical world. Focusing on this wide-ranging integrated network, we introduce a statistics-based strategy to obtain a special kind of link between objects: the statistical probability communication link. To maximize the information spread probability for grouped people, this paper introduces a distributed yet efficient algorithm, named DMPID, for finding a sub-network through which to spread people-oriented information. The DMPID algorithm takes both the size of the selection and the information spread probability into account and strikes a balance between the two. Extensive simulations show that the DMPID algorithm performs well in different distributed networks.
ISBN: 9781479982424 (print)
MapReduce has become a popular model for large-scale data processing in recent years. However, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. In this paper, we present DREAMS, a framework that provides run-time partitioning skew mitigation. Unlike previous approaches that try to balance the workload of reducers by repartitioning the intermediate data assigned to each reduce task, DREAMS copes with partitioning skew by adjusting task run-time resource allocation. We show that this approach allows DREAMS to eliminate the overhead of data repartitioning. Through experiments using both real and synthetic workloads running on an 11-node virtualized Hadoop cluster, we show that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving job performance by up to 20.3%.
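To make the contrast concrete, the sketch below first shows how hash partitioning can produce skewed reducer inputs, and then sizes each reducer's resources in proportion to its partition instead of moving data. It is not the DREAMS implementation: the function names, the linear cost model (time = records / resource units) and the proportional rule are all assumptions made for illustration.

```python
import zlib


def partition_sizes(keys, num_reducers):
    """Count records per reducer under deterministic hash partitioning."""
    sizes = [0] * num_reducers
    for key in keys:
        sizes[zlib.crc32(key.encode("utf-8")) % num_reducers] += 1
    return sizes


def proportional_allocation(sizes, total_resource):
    """Split total_resource across reducers in proportion to partition size."""
    total = sum(sizes)
    if total == 0:
        return [total_resource / len(sizes)] * len(sizes)
    return [total_resource * size / total for size in sizes]


def estimated_times(sizes, allocation):
    """Hypothetical linear cost model: time = records / resource units."""
    return [size / res if res > 0 else 0.0 for size, res in zip(sizes, allocation)]


# One hot key dominates the input, so one reducer receives most of the records.
keys = ["hot"] * 90 + ["key%d" % i for i in range(10)]
sizes = partition_sizes(keys, 4)
allocation = proportional_allocation(sizes, 8.0)
```

Under this assumed model, proportional allocation equalizes the estimated completion times of all non-empty reducers without repartitioning any intermediate data.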
It is shown by particle-in-cell simulations that a narrow electron beam with high energy and charge density can be generated in a subcritical-density plasma by two consecutive laser pulses. Although the first laser pulse dissipates rapidly, the second pulse can propagate for a long distance in the thin wake channel created by the first pulse and can further accelerate the preaccelerated electrons therein. Given that the second pulse also self-focuses, the resulting electron beam has a narrow waist and high charge and energy densities. Such beams are useful for enhancing the target-back space-charge field in target normal sheath acceleration of ions and bremsstrahlung sources, among others.
As system scale increases, the inherent reliability of supercomputers keeps falling. The cost of fault handling and task recovery is rising so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall" and is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, for the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers drops from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
ISBN: 9781479982172 (print)
The performance of an ad-hoc network is greatly limited by collisions due to hidden terminals. In this paper, we propose a receiver tracking contention (RTC) scheme, which achieves high throughput by allowing receivers to assist in channel contention. In RTC, the link is the basic unit of channel access contention. Specifically, the transmitter contends for the channel and the receiver announces potential collisions. Based on the INT message coding scheme, a transmitter and its corresponding receiver can be well coordinated. With this mechanism, hidden terminals are avoided and exposed terminals are encouraged to transmit simultaneously. Based on OFDM modulation, RTC packs several subcarriers into a subcontention unit and runs channel contention over multiple subcontention units. Furthermore, each subcontention unit maintains a transmission set, into which collision-free links are allowed to merge. The transmission set of a subcontention unit can thus be aggregated after each contention period. When subcontention unit i has the smallest index among the non-empty subcontention units, the transmission set of unit i wins the channel contention and the transmitters of unit i start to transmit in the following data transmission period. Analysis and simulation results show that RTC achieves a throughput gain over Back2f of as much as 190%.
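The transmission-set rule described above can be sketched as follows. This is a simplified model, not the RTC protocol itself: the pairwise `conflicts` set standing in for collision detection, the string link names, and both function names are assumptions made for the example.

```python
def merge_link(transmission_set, link, conflicts):
    """Add link to the set only if it collides with no link already in it.

    conflicts is a set of frozenset pairs of links that cannot share the channel
    (a hypothetical stand-in for the receiver's collision announcement).
    """
    if any(frozenset((link, other)) in conflicts for other in transmission_set):
        return False
    transmission_set.add(link)
    return True


def winning_unit(transmission_sets):
    """Return (index, links) of the lowest-indexed non-empty unit, or None."""
    for index, links in enumerate(transmission_sets):
        if links:
            return index, links
    return None


units = [set(), set(), set()]
conflicts = {frozenset(("A->B", "C->D"))}
merge_link(units[1], "A->B", conflicts)   # accepted
merge_link(units[1], "C->D", conflicts)   # rejected: collides with A->B
merge_link(units[1], "E->F", conflicts)   # accepted: collision-free
merge_link(units[2], "C->D", conflicts)   # accepted in a different unit
winner = winning_unit(units)
```

Here unit 0 is empty, so unit 1 wins and both of its collision-free links transmit in the same data period.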
Open source software has become highly popular and is of great importance for most software engineering activities. To facilitate software organization and retrieval, tagging is extensively used in open source communities. However, finding the desired software through tags in communities such as Freecode and ohloh is still challenging because of tag insufficiency. In this paper, we propose TRG (tag recommendation based on semantic graph), a novel approach to discovering and enriching tags of open source software. First, we propose a semantic graph to model the semantic correlations between tags and the words in software descriptions. Then, based on the graph, we design an effective algorithm to recommend tags for software. Through comprehensive experiments on large-scale open source software datasets, comparing against several typical related works, we demonstrate the effectiveness and efficiency of our method in recommending proper tags.
ISBN: 9781479986989 (print)
Partitioning links rather than nodes is effective for overlapping community detection (OCD) on complex networks. However, it incurs high CPU and memory overheads because the volume of links is huge, especially when the network is complex. In this paper, we propose a symmetric non-negative matrix factorization (SNMF) based link partition method, called SNMF-Link, to overcome this deficiency. In particular, SNMF-Link represents data in a lower-dimensional space spanned by the node-link incidence matrix. By solving a lighter SNMF problem, SNMF-Link learns the clustering indicator of each link. Since the traditional multiplicative update rule (MUR) based optimization algorithm for SNMF suffers from slow convergence, we apply the augmented Lagrangian method (ALM) to optimize SNMF efficiently. Experimental results show that SNMF-Link is much more efficient than representative clustering algorithms without reducing OCD performance.
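The sketch below illustrates two ingredients mentioned above: the node-link incidence matrix, and why partitioning links yields overlapping node communities. It does not implement the SNMF or ALM optimization. The link labels are assumed to be given (for instance, by some link clustering step), and the function names are hypothetical.

```python
def incidence_matrix(num_nodes, edges):
    """Rows are nodes, columns are links; an entry is 1 when the node is an endpoint."""
    matrix = [[0] * len(edges) for _ in range(num_nodes)]
    for col, (u, v) in enumerate(edges):
        matrix[u][col] = 1
        matrix[v][col] = 1
    return matrix


def node_communities(edges, link_labels):
    """Map each link-cluster label to the set of nodes its links touch."""
    communities = {}
    for (u, v), label in zip(edges, link_labels):
        communities.setdefault(label, set()).update((u, v))
    return communities


# Two triangles sharing node 2; each link belongs to exactly one cluster.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
link_labels = [0, 0, 0, 1, 1, 1]   # assumed output of a link partition step
incidence = incidence_matrix(5, edges)
communities = node_communities(edges, link_labels)
```

Every link has a single label, yet node 2 ends up in both communities because it is an endpoint of links from both clusters.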
ISBN: 9781479999620 (print)
Data distribution is a key technology for resource convergence and sharing in distributed environments. To better meet the requirement for real-time data distribution in dynamic networks, a trace routing algorithm named CRAWL, based on a hybrid two-layered topology, is proposed. The algorithm contains an overlay topology named CBDLO, the upper layer of which consists of multiple distributed balanced binary trees corresponding to different properties and the lower layer of which is an unstructured topology. CRAWL forwards data on the lower unstructured topology by random walk, so that the data can be sent to the corresponding upper-topology entry. It also includes a matching algorithm named CDM, which matches data properties in parallel on the upper distributed balanced binary trees and transmits the matched data to the nodes that are interested in it. The experimental results show that the algorithm can effectively support large-scale data distribution in a dynamic network and reduce distribution overhead and matching delays.
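The lower-layer forwarding step can be illustrated with a plain random walk that stops when it reaches an entry node of the upper layer. This is a generic sketch rather than the CRAWL algorithm: the adjacency dictionary, the hop limit `max_hops`, and the function name are assumptions for the example, and the upper-layer tree matching (CDM) is not modeled.

```python
import random


def random_walk_to_entry(adjacency, start, entries, max_hops, rng=None):
    """Forward along random neighbours until an entry node is hit or hops run out.

    Returns (path, reached), where path lists every visited node in order.
    """
    rng = rng or random.Random()
    path = [start]
    current = start
    while current not in entries and len(path) - 1 < max_hops:
        neighbours = adjacency[current]
        if not neighbours:
            break                      # dead end: nowhere left to forward
        current = rng.choice(neighbours)
        path.append(current)
    return path, current in entries


# Small unstructured topology; node 2 is assumed to be an upper-layer entry.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
path, reached = random_walk_to_entry(adjacency, 0, {2}, 50, random.Random(7))
```

The hop limit bounds the forwarding cost when no entry is found; a real deployment would need some policy for that case, which the abstract does not describe.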