Seven application programs were used in the Technology Insertion for 2003 benchmark testing process to determine what new high-performance computing capability should be procured. One of the most intensive parts of engineering and scientific computation is the solution of a system of simultaneous linear equations. We survey the seven benchmark application programs to identify which linear solvers they use and where those solvers originated.
Wide vector units in Intel's Xeon Phi accelerator cards can significantly boost application performance when used effectively. However, there is a lack of performance tools that provide programmers accurate information about the level of vectorization in their codes. This paper presents VecMeter, an easy-to-use tool to measure vectorization on the Xeon Phi. VecMeter utilizes binary instrumentation and therefore does not require source code modifications. This paper describes the design of VecMeter, demonstrates its accuracy, defines a metric for quantifying vectorization, and provides an example where the tool can guide code optimization to improve performance by up to 33%.
In this paper, a systematic study of the effects of complexity of prediction methodology on its accuracy for a set of real applications on a variety of HPC systems is performed. Results indicate that the use of any si...
Accelerators are becoming prevalent in high-performance computing as a way of achieving increased computational capacity within a smaller power budget. Effectively utilizing the raw compute capacity made available by these systems, however, remains a challenge because porting and optimizing code for novel accelerator hardware can require a substantial investment of programmer time. In this paper we present a methodology for isolating and modeling the performance of common performance-critical patterns of code (so-called idioms) and other relevant behavioral characteristics from large-scale HPC applications which are likely to perform favorably on Intel Xeon Phi. The benefits of the methodology are twofold: (1) it directs programmer efforts toward the regions of code most likely to benefit from porting to the Xeon Phi and (2) it provides speedup estimates for porting those regions of code. We then apply the methodology to the stencil idiom, showing performance improvements of up to 4.7× on stencil-based benchmark codes.
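For readers unfamiliar with the stencil idiom named in the abstract above, the following is a minimal sketch of one common form, a 5-point Jacobi sweep over a 2D grid. It is illustrative only: the function name `jacobi_step`, the list-of-lists grid, and the averaging weights are assumptions made for this example and are not taken from the paper or its benchmark codes.

```python
def jacobi_step(grid):
    """Apply one 5-point stencil sweep to the interior of a 2D grid.

    `grid` is a list of equal-length lists of floats (hypothetical layout).
    Boundary cells are copied unchanged; each interior cell becomes the
    average of its four nearest neighbours.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary values carry over as-is
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


if __name__ == "__main__":
    # Hot boundary (1.0) around a cold interior cell (0.0).
    g = [[1.0, 1.0, 1.0],
         [1.0, 0.0, 1.0],
         [1.0, 1.0, 1.0]]
    print(jacobi_step(g)[1][1])
```

Each output cell depends only on a fixed neighbourhood of input cells, which is the regular access pattern that makes loops of this shape candidates for wide vector units.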
In order to achieve a high level of performance, data-intensive applications such as the real-time processing of surveillance feeds from unmanned aerial vehicles will require the strategic application of multi/many-core processors and coprocessors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program design decisions, memory traces gathered through binary instrumentation can be used to understand the low-level interactions between a data-intensive code and the memory subsystem of a multi-core processor or many-core coprocessor. Toward this end, this paper introduces threading support for PMaC's Efficient Binary Instrumentation Toolkit for Linux/x86 (PEBIL) and compares PEBIL's threading model to the threading models of two other popular Linux/x86 binary instrumentation platforms, Pin and Dyninst, on both theoretical and empirical grounds. The empirical comparisons are based on experiments which collect memory address traces for the OpenMP-threaded implementations of the NASA Advanced Supercomputing Parallel Benchmarks (NPBs). This work shows that the overhead of collecting full memory address traces for multithreaded programs is higher in PEBIL (7.7x) than in Pin (4.7x), both of which are significantly lower than Dyninst (897x). This work also shows that PEBIL, uniquely, is able to take advantage of interval-based sampling of a memory address trace by rapidly disabling and re-enabling instrumentation at the transitions into and out of sampling periods, which significantly decreases the overhead of memory address trace collection. For collecting the memory address streams of each of the NPBs at a 10% sampling rate, PEBIL incurs an average slowdown of 2.9x compared to 4.4x with Pin and 897x with Dyninst.
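The idea of interval-based sampling described above can be sketched as follows. This is a simplified model, not PEBIL's mechanism: it filters an address stream that has already been collected, whereas the abstract describes disabling and re-enabling the instrumentation itself so that the unsampled intervals run at near-native speed. The function name `sample_trace` and the parameters `period` and `sample_len` are hypothetical.

```python
def sample_trace(addresses, period=1000, sample_len=100):
    """Keep only the addresses that fall inside a sampling window.

    The stream is divided into intervals of `period` accesses; the first
    `sample_len` accesses of each interval are recorded and the rest are
    skipped. With the defaults, this corresponds to a 10% sampling rate.
    """
    sampled = []
    for index, address in enumerate(addresses):
        if index % period < sample_len:  # inside a sampling period
            sampled.append(address)
    return sampled


if __name__ == "__main__":
    # Synthetic stream of 10,000 sequential 8-byte accesses (made-up data).
    stream = [0x1000 + 8 * i for i in range(10_000)]
    kept = sample_trace(stream)
    print(len(kept), "of", len(stream), "addresses recorded")
```

In this model the sampling rate is simply `sample_len / period`; the savings reported in the abstract come from not executing instrumentation code at all outside the sampling windows, which a post-hoc filter like this one does not capture.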
A novel unitary quantum lattice algorithm is developed to explore quantum turbulence. Because of its low memory requirements and its near-perfect parallelization to the full 12,288 cores on the Cray XT5, simulations were run on spatial grids of up to 5,760^3. The Gross-Pitaevskii equation, which describes the ground state of a Bose-Einstein condensate (BEC), is solved and it is found that the incompressible kinetic energy spectrum exhibits 3 distinct power laws: a classical Kolmogorov k^(-5/3) spectrum at scales much larger than the individual quantum vortex cores, and a quantum Kelvin wave cascade spectrum of k^(-3) at scales of the order of the quantum cores. In the adjoining semiclassical regime, there is a steeper spectral decay transitioning between the classical and quantum regimes. However, its spectral exponent does not seem to be universal. This is the first first-principles simulation yielding the universal quantum Kelvin cascade exponent.
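A spectral exponent such as the k^(-5/3) and k^(-3) values quoted above is commonly estimated as the slope of a straight-line fit in log-log space. The sketch below shows that generic procedure on synthetic data; it is not the authors' analysis, and the function name `fit_power_law_exponent` and the test spectrum are assumptions for illustration.

```python
import math


def fit_power_law_exponent(k_values, spectrum):
    """Least-squares slope of log(E) against log(k).

    For E(k) = A * k**alpha the returned value approximates alpha.
    All inputs must be strictly positive.
    """
    xs = [math.log(k) for k in k_values]
    ys = [math.log(e) for e in spectrum]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Synthetic Kolmogorov-like spectrum, E(k) = 2 * k^(-5/3) (made-up data).
    ks = list(range(1, 65))
    energy = [2.0 * k ** (-5.0 / 3.0) for k in ks]
    print(fit_power_law_exponent(ks, energy))
```

In practice the fit would be restricted to the wavenumber range of a single regime, since the abstract reports different exponents at different scales.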
Lattice Boltzmann algorithms are a mesoscopic representation of nonlinear continuum physics (like the Navier-Stokes, magnetohydrodynamics (MHD), and Gross-Pitaevskii equations) which are ideal for parallel supercomputers because they transform the difficult nonlinear convective macroscopic derivatives into purely local moments of distribution functions. The macroscopic nonlinearities are recovered by relaxation distribution functions in the collision operator whose dependence on the macroscopic velocity is algebraically nonlinear and thus purely local. Unlike standard computational fluid dynamics codes, there is no loss in parallelization in handling arbitrary geometric boundaries, e.g., using bounce-back rules from kinetic theory. By encoding detailed balance into the collision operator through the introduction of a discrete H-function, the lattice Boltzmann algorithm can be made unconditionally stable for arbitrarily high Reynolds numbers. It is shown that this approach is a special case of a quantum lattice Boltzmann algorithm that entangles local qubits through unitary collision operators and which is ideally parallelized on quantum computer architectures. Here we consider turbulence simulations using 2,048 PEs on a 1,600^3 spatial grid. A connection is found between the rate of change of enstrophy and the onset of laminar-to-turbulent flows.
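The locality argument above can be made concrete with a small sketch. The code below is a one-dimensional, three-velocity (D1Q3) lattice Boltzmann step with a plain single-relaxation-time (BGK) collision and bounce-back walls. It is a textbook simplification chosen for brevity: it does not implement the entropic, H-function-based collision operator or the quantum variant described in the abstract, and the names `equilibrium`, `lbm_step`, and `tau` are assumptions for this example.

```python
# D1Q3 lattice: rest, right-moving, and left-moving populations.
WEIGHTS = (2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0)
VELOCITIES = (0, 1, -1)


def equilibrium(rho, u):
    """Relaxation distribution; algebraically nonlinear in the local velocity u."""
    return [
        w * rho * (1.0 + 3.0 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
        for w, c in zip(WEIGHTS, VELOCITIES)
    ]


def lbm_step(f, tau=1.0):
    """One collide-and-stream step; f[x] holds [f_rest, f_right, f_left]."""
    n = len(f)
    # Collision: uses only the data stored at each node (purely local).
    post = []
    for cell in f:
        rho = sum(cell)                  # zeroth moment: density
        u = (cell[1] - cell[2]) / rho    # first moment: velocity
        feq = equilibrium(rho, u)
        post.append([fi - (fi - fe) / tau for fi, fe in zip(cell, feq)])
    # Streaming: shift populations to neighbours; reflect at the walls.
    new = [[0.0, 0.0, 0.0] for _ in range(n)]
    for x in range(n):
        new[x][0] = post[x][0]
        if x + 1 < n:
            new[x + 1][1] = post[x][1]
        else:
            new[x][2] = post[x][1]       # bounce-back at the right wall
        if x - 1 >= 0:
            new[x - 1][2] = post[x][2]
        else:
            new[x][1] = post[x][2]       # bounce-back at the left wall
    return new


if __name__ == "__main__":
    # Uniform fluid at rest with a small density bump in the middle.
    f = [equilibrium(1.1 if x == 8 else 1.0, 0.0) for x in range(16)]
    for _ in range(50):
        f = lbm_step(f)
    print(sum(sum(cell) for cell in f))  # total mass is conserved
```

The collision loop touches only one node at a time and the streaming loop only nearest neighbours, which is the property the abstract credits for the method's parallel efficiency; the bounce-back branches show that wall handling does not change that communication pattern.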