ISBN: 9781479938018 (print)
BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM as well as additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability. The resulting implementations deliver what we believe to be the best open source performance for these architectures, achieving both impressive performance and excellent scalability.
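The loop structure described above can be illustrated with a minimal sketch. It assumes NumPy and uses hypothetical names (`micro_kernel`, `gemm_blocked`) and illustrative block sizes (`mc`, `kc`, `nc`, `mr`, `nr`); it is not the BLIS API and omits packing and threading. The three outer loops correspond to the loops around the inner kernel, and the two inner loops are the ones exposed around the micro-kernel.

```python
import numpy as np


def micro_kernel(a_panel, b_panel, c_tile):
    """Update a small tile of C in place with a sequence of rank-1 updates.

    a_panel has shape (mr, kc), b_panel has shape (kc, nr), and c_tile is a
    (mr, nr) view into C. In a real port this is the only routine that
    would be customized per architecture.
    """
    for p in range(a_panel.shape[1]):
        c_tile += np.outer(a_panel[:, p], b_panel[p, :])


def gemm_blocked(A, B, C, mc=8, kc=8, nc=16, mr=4, nr=4):
    """Compute C += A @ B with five loops around the micro-kernel.

    Block sizes are illustrative assumptions, not tuned values.
    """
    m, k = A.shape
    n = B.shape[1]
    for jc in range(0, n, nc):            # outer loop over columns of C
        for pc in range(0, k, kc):        # outer loop over the shared dimension
            Bp = B[pc:pc + kc, jc:jc + nc]
            for ic in range(0, m, mc):    # outer loop over rows of C
                Ap = A[ic:ic + mc, pc:pc + kc]
                Cb = C[ic:ic + mc, jc:jc + nc]  # view, so updates reach C
                # The two loops below are the ones exposed inside the
                # former inner kernel; either is a candidate for threading.
                for jr in range(0, Bp.shape[1], nr):
                    for ir in range(0, Ap.shape[0], mr):
                        micro_kernel(Ap[ir:ir + mr, :],
                                     Bp[:, jr:jr + nr],
                                     Cb[ir:ir + mr, jr:jr + nr])
    return C
```

Because every loop iterates over independent or accumulating blocks, the sketch makes it visible where parallelism could be applied both within and around the inner kernel.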
We have studied the vibrational modes and Raman spectra of P-doped Si nanocrystals using pseudopotential density functional theory and the Placzek approximation. We find that Si nanocrystal vibrations are largely unaffected by the introduction of P dopants. However, the Raman spectra of doped nanocrystals are enhanced relative to those of pristine nanocrystals, and demonstrate a strong dependence on dopant position. Thus, Raman has the potential of being developed as a tool for probing the location of the dopant within the nanocrystal. Our analysis shows that vibrational modes involving atoms in the vicinity of the dopant give the largest contributions to the Raman spectra.
Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths while driving down error rates, for example to within 1% for Illumina HiSeq reads. Moreover, the errors are not uniformly distributed across reads, and a large percentage of reads are in fact error-free. The ability to predict such perfect reads can have a significant impact on the run-time complexity of applications. In this paper, we present a simple and fast method based on k-spectrum analysis to identify error-free reads. Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with a precision of more than 90%. Although filtering out the reads identified as erroneous by our method reduces the coverage by about 7% on average, the coverage pattern across the genome remains similar. The filtration process can be customized at several levels of stringency depending on the needs of the downstream application.
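The abstract does not give the details of the method, so the following is only a generic sketch of a k-spectrum filter under stated assumptions: a read is treated as likely error-free when every one of its k-mers occurs at least `min_count` times in the dataset. The function names, `k`, and `min_count` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Build the k-spectrum: how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def is_probably_error_free(read, counts, k, min_count):
    """Return True if every k-mer in the read is frequent in the dataset.

    The threshold rule is an assumption for illustration; a sequencing
    error typically creates k-mers that are rare in the spectrum.
    """
    if len(read) < k:
        return False  # too short to judge
    return all(counts[read[i:i + k]] >= min_count
               for i in range(len(read) - k + 1))


# Usage: three identical reads and one read with a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
spectrum = kmer_counts(reads, k=3)
kept = [r for r in reads if is_probably_error_free(r, spectrum, 3, 2)]
```

Raising `min_count` makes the filter more stringent, which mirrors the idea of adjusting the stringency to the downstream application.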
ISBN: 9781479974351 (print)
In this paper, visual attention transfer is formulated as a nonlocal diffusion equation. Unlike other diffusion-based methods, a nonlocal diffusion tensor is introduced to account for both the diffusion strength and direction. Along the principal direction, the diffusion should be suppressed to preserve the dissimilarity between the foreground and background; in other directions, the diffusion should be boosted to combine similar regions and highlight the salient object as a whole. The final saliency map is obtained through a two-stage diffusion, and quantitative and visual comparisons are carried out on two large benchmark databases. Experimental results demonstrate the superior performance of our method.
Authors:
Xianmin Xu, Xiaoping Wang
LSEC, Institute of Computational Mathematics and Scientific/Engineering Computing, NCMIS, AMSS, Chinese Academy of Sciences, Beijing 100190, China
Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
We study the macroscopic behavior of two-phase flow in porous media from a phase-field model. A dissipation law is first derived from the phase-field model by homogenization. For a simple channel geometry at the pore scale, the scaling relation of the averaged dissipation rate with the velocity of the two-phase flow can be obtained explicitly from the model, which then gives the force-velocity relation. It is shown that, for a homogeneous channel surface, Darcy's law is still valid with a significantly modified permeability that includes the contribution from the contact line slip. For chemically patterned surfaces, the dissipation rate has a non-Darcy linear scaling with the velocity, which is related to a depinning force for the patterned surface. Our result offers a theoretical understanding of the previously observed non-Darcy behavior of multiphase flow in both simulations and experiments.
Based on the PMHSS preconditioning matrix, we construct a class of rotated block triangular preconditioners for block two-by-two matrices of real square blocks, and analyze the eigen-properties of the corresponding preconditioned matrices. Numerical experiments show that these rotated block triangular preconditioners can be competitive with, and even more efficient than, the PMHSS preconditioner when they are used to accelerate Krylov subspace iteration methods for solving block two-by-two linear systems whose coefficient matrices may have nonsymmetric sub-blocks.
Authors:
CUI Long, MING PingBing
LSEC, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences
We study the effect of "ghost forces" for a quasicontinuum method in three dimensions with a planar interface. "Ghost forces" are the inconsistency of the quasicontinuum method across the interface between the atomistic region and the continuum region. Numerical results suggest that "ghost forces" may lead to a negligible error in the solution but a finite-size error in the gradient of the solution. The error has a layer-like profile, and the interfacial layer width is of O(ε). The error in certain components of the displacement gradient decays algebraically from O(1) to O(ε) away from the interface. A surrogate model is proposed and analyzed, which suggests the same scenario for the effect of "ghost forces". Our analysis is based on the explicit solution of the surrogate model.
With molecular dynamics simulations, we systematically uncover a new kind of intrinsic thermal resistance that exists in two-dimensional materials under uneven external perturbation, by using partly encased graphene as a typical *** with lattice dynamics analysis, we demonstrate that this intrinsic thermal resistance originates from the softening of flexural phonons partly in graphene induced by inhomogeneous external potential field or substrates which serve as *** the interface between graphene sections with and without external potential field, in-plane phonon modes can transmit well, whereas low-frequency flexural phonon modes are reflected, leading to this nontrivial intrinsic thermal resistance in the individual single-layer *** intrinsic thermal resistance closely depends on coupling strength between graphene and substrates, and could be significant when the coupling is ***, it is suppressed at high *** is also found that this intrinsic thermal resistance depends on the size of the system to some extent, and a length-independent value is ***, we demonstrate that thermal rectification can be realized by including the uneven external *** study provides new insight to better understand thermal transport in two-dimensional materials.
Learning Bayesian networks is NP-hard. Even with recent progress in heuristic and parallel algorithms, modeling capabilities still fall short of the scale of the problems encountered. In this paper, we present a massively parallel method for Bayesian network structure learning, and demonstrate its capability by constructing genome-scale gene networks of the model plant Arabidopsis thaliana from over 168.5 million gene expression values. We report strong scaling efficiency of 75% and demonstrate scaling to 1.57 million cores of the Tianhe-2 supercomputer. Our results constitute three and five orders of magnitude increase over previously published results in the scale of data analyzed and computations performed, respectively. We achieve this through algorithmic innovations, using efficient techniques to distribute work across all compute nodes, all available processors and coprocessors on each node, all available threads on each processor and coprocessor, and vectorization techniques to maximize single thread performance.
Sensitivity analysis (SA) is a fundamental tool of uncertainty quantification (UQ). Adjoint-based SA is the optimal approach in many large-scale applications, such as the direct numerical simulation (DNS) of combustion. However, one of the challenges of the adjoint workflow for time-dependent applications is the storage and I/O requirements for the application state. During the time-reversal portion of the workflow, the forward state is required in last-in-first-out order. The resulting storage requirements at exascale are enormous. To mitigate this requirement, the application state is regenerated from checkpoints over short windows of application time. This approach drastically reduces the total volume of stored data, allows the state in the regeneration window to be cached in memory and on local SSDs, may accelerate application execution by reducing output frequency, and reduces the power overhead from I/O. We explore variations of this workflow, applied to a proxy for the SA of turbulent combustion, by varying the number of checkpoints, the state storage, and other regeneration options to find efficient implementations that minimize compute time or power consumption.
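The checkpoint-and-regenerate idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow used in the paper: `forward_step` is a hypothetical placeholder for one application time step, `window` is an assumed fixed checkpoint spacing, and the adjoint step is replaced by simply recording the order in which forward states are consumed.

```python
def forward_step(state):
    """Hypothetical stand-in for one forward time step of the application."""
    return state + 1


def reverse_sweep(initial_state, n_steps, window):
    """Visit forward states in last-in-first-out order using checkpoints.

    Only one state per window is stored during the forward pass. During
    the reverse pass, each window is regenerated from its checkpoint and
    held in memory just long enough to be consumed in reverse.
    """
    # Forward pass: keep a checkpoint at the start of every window.
    checkpoints = {}
    state = initial_state
    for t in range(n_steps):
        if t % window == 0:
            checkpoints[t] = state
        state = forward_step(state)

    # Reverse pass: walk the windows from last to first.
    visited = []
    for start in sorted(checkpoints, reverse=True):
        end = min(start + window, n_steps)
        # Regenerate the states for time steps start .. end - 1.
        states = [checkpoints[start]]
        for _ in range(start + 1, end):
            states.append(forward_step(states[-1]))
        # Consume them in reverse; an adjoint step would go here.
        for offset in range(len(states) - 1, -1, -1):
            visited.append((start + offset, states[offset]))
    return visited
```

In this sketch the stored data shrinks from `n_steps` states to roughly `n_steps / window` checkpoints, at the cost of recomputing each window once during the reverse pass.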