Heavy ion experiments were performed on a D flip-flop (DFF) and a TMR flip-flop (TMRFF) fabricated in a 65-nm bulk CMOS process. The experimental results show that the TMRFF has an SEU cross section about 92% lower than that of the standard DFF design in static test mode. In dynamic test mode, the TMRFF shows a much stronger frequency dependence than the DFF design, which reduces its advantage over the DFF at higher operating frequencies. At 160 MHz, the TMRFF is only 3.2× harder than the standard DFF. Such a small improvement in the SEU performance of the TMR design may warrant reconsideration of its use in hardened designs.
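To make the numbers above concrete, here is a minimal Python sketch of the generic TMR idea (three redundant copies plus a majority vote) and of what a 92% cross-section decrease means as a hardness ratio. The function names are hypothetical, and the voter is a textbook bitwise majority, not the circuit evaluated in the paper.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies (generic TMR voter)."""
    return (a & b) | (a & c) | (b & c)


def hardness_factor(decrease_fraction: float) -> float:
    """Baseline-to-hardened cross-section ratio for a fractional decrease."""
    return 1.0 / (1.0 - decrease_fraction)


# A single upset in one copy is masked by the voter.
stored = 0b1011
upset = stored ^ 0b0100  # one bit flipped in one of the three copies
voted = majority_vote(stored, upset, stored)

# A 92% decrease in static mode corresponds to roughly a 12.5x ratio,
# compared with the 3.2x quoted at 160 MHz in dynamic mode.
static_factor = hardness_factor(0.92)
```

Under these assumptions the static-mode advantage is about 12.5×, which is the figure the 3.2× dynamic-mode result should be compared against.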
Monte Carlo (MC) simulation plays an important part in dose calculation for radiotherapy treatment planning. Because the accuracy of MC simulation depends on the number of simulated particle histories, it is very time-consuming. The Intel Many Integrated Core (MIC) architecture, which consists of more than 50 cores and supports many parallel programming models, provides an efficient alternative for accelerating MC dose calculation. This paper implements the OpenMP-based MC Dose Planning Method (DPM) for radiotherapy treatment problems on the Intel MIC architecture. The implementation has been verified on a target MIC coprocessor with 57 cores. The results demonstrate that the OpenMP-based DPM implementation produces very accurate results and achieves a maximum speedup of 10.53 times over the original DPM on a Xeon E5-2670 CPU. The speedup and efficiency of the implementation on different numbers of MIC cores are also reported.
It is shown by particle-in-cell simulations that a narrow electron beam with high energy and charge density can be generated in a subcritical-density plasma by two consecutive laser pulses. Although the first laser pulse dissipates rapidly, the second pulse can propagate for a long distance in the thin wake channel created by the first pulse and can further accelerate the preaccelerated electrons therein. Given that the second pulse also self-focuses, the resulting electron beam has a narrow waist and high charge and energy densities. Such beams are useful for enhancing the target-back space-charge field in target normal sheath acceleration of ions and bremsstrahlung sources, among others.
ISBN: 9781467372220 (print)
Hyperspectral remote sensing is one of the frontier techniques in remote sensing research. Applying the sparse coding model to hyperspectral remote sensing image processing is a hot topic in hyperspectral information processing. To improve the accuracy of hyperspectral image classification, we propose a classification method based on spatial-spectral joint contextual sparse coding. First, a dictionary is learned from samples selected from the ground-truth reference data. Then, the sparse coefficients of each pixel are calculated using the learned dictionary. Finally, the sparse coefficients are fed to the classifier to obtain the classification result. A visible and near-infrared hyperspectral remote sensing image collected by Tiangong-1 over Chaoyang District, Beijing, is used to evaluate the performance of the proposed approach. Experimental results show that the proposed method yields the best classification performance among the compared methods, with an overall accuracy of 95.74% and a Kappa coefficient of 0.9476.
This paper presents the design and implementation of a highly efficient Double-precision General Matrix Multiplication (DGEMM) based on OpenBLAS for 64-bit ARMv8 eight-core processors. We adopt a theory-guided approach by first developing a performance model for this architecture and then using it to guide our exploration. The key enabler for a highly efficient DGEMM is a highly optimized inner kernel, GEBP, developed in assembly language. We have obtained GEBP by (1) maximizing its compute-to-memory access ratios across all levels of the memory hierarchy in the ARMv8 architecture, with its performance-critical block sizes determined analytically, and (2) optimizing its computations through loop unrolling, instruction scheduling and software-implemented register rotation, and by taking advantage of A64 instructions to support efficient FMA operations, data transfers and prefetching. We have compared our DGEMM implemented in OpenBLAS with another implemented in ATLAS (also built on a highly optimized GEBP in assembly). Our implementation outperforms the one in ATLAS by improving the peak performance (efficiency) of DGEMM from 3.88 Gflops (80.9%) to 4.19 Gflops (87.2%) on one core and from 30.4 Gflops (79.2%) to 32.7 Gflops (85.3%) on eight cores. These results translate into substantial performance (efficiency) improvements of 7.79% on one core and 7.70% on eight cores. In addition, the efficiency of our implementation on one core is very close to the theoretical upper bound of 91.5% obtained from micro-benchmarking. Our parallel implementation achieves good performance and scalability under varying thread counts across the range of matrix sizes evaluated.
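The quoted improvements can be reproduced from the efficiency figures in the abstract. The short Python check below is only an arithmetic sketch; the helper name `relative_gain_percent` is hypothetical, and the inputs are the efficiencies stated above.

```python
def relative_gain_percent(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Efficiencies (percent of peak) quoted in the abstract: ATLAS vs. OpenBLAS.
one_core_gain = relative_gain_percent(80.9, 87.2)
eight_core_gain = relative_gain_percent(79.2, 85.3)

print(f"one core:    {one_core_gain:.2f}%")
print(f"eight cores: {eight_core_gain:.2f}%")
```

The 7.79% and 7.70% figures follow from the efficiency ratios; the rounded Gflops values give slightly different ratios, presumably because of rounding in the reported throughput.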
In IaaS cloud environments, peak memory demand caused by hotspot applications in a virtual machine (VM) often results in performance degradation both inside and outside that VM. Solutions such as host swapping and ballooning have been proposed for memory consolidation and overcommitment. These solutions, however, do not address guest swapping inside the VM. Even when the host holds sufficient free memory pages, the guest OS cannot use them directly because of the semantic gap between the VMM and the guest. Our goal is to alleviate the performance degradation by reducing the disk I/O operations generated by guest swapping. Based on an in-depth analysis of the behavioral features of guest swapping, we design HybridSwap, a distributed, scalable framework that organizes surplus memory on all hosts within a data center into virtual pools for swapping. The framework builds a synthetic swapping mechanism in a peer-to-peer fashion, in which a VM can adaptively choose suitable pools for swapping. We implement a prototype of HybridSwap and evaluate it with different benchmarks. The results demonstrate that our solution improves guest swapping efficiency. In some cases, it shows a 2-5 times performance improvement over the baseline setup.
ISBN: 9781467377898 (print)
The coupling of microwaves into apertures plays an important part in many fields of electromagnetic physics and engineering. When the apertures are very narrow, Finite Difference Time Domain (FDTD) simulation of the coupling is very time-consuming. As a many-core architecture, Intel's Many Integrated Core (MIC) architecture offers 512-bit vector units and more than 200 threads. In this paper, we parallelize FDTD simulation of microwave pulse coupling into narrow slots on the Intel MIC architecture. In the implementation, the OpenMP parallel programming model is used to exploit thread parallelism, while loop unrolling and SIMD intrinsic functions are used for vectorization. Compared with the serial version on an Intel Xeon E5-2670 CPU, the implementation on a MIC coprocessor with 57 cores obtains a speedup of 11.57 times. The experimental results also demonstrate that the parallelization scales well. The influence of the binding between OpenMP threads and MIC hardware threads on performance is also reported.
On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, and efficient software stacks that provide a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform an extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed on the system.
The interconnection network plays an important role in scalable high performance computer (HPC) systems. The TH Express-2 interconnect has been used in the MilkyWay-2 system to provide high-bandwidth and low-latency interprocessor communications, and continuous efforts are devoted to the development of our proprietary interconnect. This paper describes the state of the art of our proprietary interconnect, with particular emphasis on the design of the network interface. Several key features are introduced, such as user-level communication, remote direct memory access, offloaded collective operations, and hardware-reliable end-to-end communication. The design of a low-level message-passing infrastructure and of upper-level message-passing services is also proposed. Preliminary performance results demonstrate the efficiency of the TH interconnect interface.
ISBN: 9781479982424 (print)
MapReduce has become a popular model for large-scale data processing in recent years. However, existing MapReduce schedulers still suffer from an issue known as partitioning skew, where the output of map tasks is unevenly distributed among reduce tasks. In this paper, we present DREAMS, a framework that provides run-time partitioning skew mitigation. Unlike previous approaches that try to balance the workload of reducers by repartitioning the intermediate data assigned to each reduce task, DREAMS copes with partitioning skew by adjusting task run-time resource allocation. We show that our approach allows DREAMS to eliminate the overhead of data repartitioning. Through experiments using both real and synthetic workloads running on an 11-node virtualised Hadoop cluster, we show that DREAMS can effectively mitigate the negative impact of partitioning skew, thereby improving job performance by up to 20.3%.
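The partitioning skew problem described above can be illustrated with a toy Python sketch. It only demonstrates how a key-based partitioner produces uneven reducer loads; it does not model DREAMS or its resource allocation. The partitioner, function names, and sample keys are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def partition(key: str, num_reducers: int) -> int:
    """Deterministic toy partitioner (Python's built-in hash() is salted per process)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer_loads(keys: Iterable[str], num_reducers: int) -> List[int]:
    """Count how many intermediate records each reducer receives."""
    loads = [0] * num_reducers
    for key, count in Counter(keys).items():
        loads[partition(key, num_reducers)] += count
    return loads


def skew_ratio(loads: List[int]) -> float:
    """Largest reducer load divided by the mean load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# One hot key ("a") dominates the map output, so one reducer is overloaded.
keys = ["a"] * 8 + ["b", "c", "d", "e"]
loads = reducer_loads(keys, 4)
ratio = skew_ratio(loads)
```

In this example one reducer receives three times the mean load. Repartitioning approaches would try to even out `loads`, whereas the abstract describes DREAMS as leaving the partitions in place and giving the heavier reduce task more resources at run time.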