ISBN: (Print) 9781450305525
Heterogeneous architecture is becoming an important way to build massively parallel computer systems, e.g. the CPU-GPU heterogeneous systems ranked in the Top500 list. However, it is a challenge to efficiently exploit the massive parallelism of both applications and architectures on such heterogeneous systems. In this paper we present a practice on how to exploit and orchestrate parallelism at the algorithm level to take advantage of the underlying parallelism at the architecture level. A potential Petaflops application, cryo-EM 3D reconstruction, is selected as an example. We exploit all possible parallelism in cryo-EM 3D reconstruction, and leverage a self-adaptive dynamic scheduling algorithm to create a proper parallelism mapping between the application and the architecture. The parallelized programs are evaluated on a subsystem of the Dawning Nebulae supercomputer, whose nodes are each composed of two Intel six-core Xeon CPUs and one Nvidia Fermi GPU. The experiments confirm that hierarchical parallelism is an efficient pattern of parallel programming for utilizing the capabilities of both CPU and GPU in a heterogeneous system. The CUDA kernels run more than 3 times faster than the OpenMP-parallelized ones using 12 cores (threads). Relative to the CPU-only version, the hybrid CPU-GPU program further improves the whole application's performance by 30% on average.
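The self-adaptive dynamic scheduling idea in this abstract can be sketched as follows. This is an illustrative toy, not the paper's actual code: a scheduler that assigns work chunks to whichever worker ("cpu" or "gpu" here, both just Python callables) currently shows the higher measured throughput, re-estimating after every chunk. All names are hypothetical.

```python
# Illustrative sketch of self-adaptive dynamic scheduling (hypothetical
# names; not the paper's implementation). Each worker is a callable; the
# scheduler greedily picks the worker with the best throughput estimate
# and updates that estimate from the observed execution time.
import time

def self_adaptive_schedule(chunks, workers):
    """chunks: list of work items; workers: dict name -> callable(chunk)."""
    # Start with equal throughput estimates (chunks per second).
    throughput = {name: 1.0 for name in workers}
    results, assignments = [], []
    for chunk in chunks:
        # Greedy choice: pick the currently fastest worker.
        name = max(throughput, key=throughput.get)
        start = time.perf_counter()
        results.append(workers[name](chunk))
        elapsed = max(time.perf_counter() - start, 1e-9)
        # Exponential moving average keeps the estimate adaptive.
        throughput[name] = 0.5 * throughput[name] + 0.5 * (1.0 / elapsed)
        assignments.append(name)
    return results, assignments
```

A real CPU-GPU mapping would dispatch to OpenMP threads and CUDA kernels rather than Python callables, but the feedback loop (measure, update estimate, reassign) is the same shape.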
ISBN: (Print) 9781424452026
Increasing demand for computing power in scientific and engineering applications has spurred the deployment of high-performance computing clusters. According to the TOP500 list, an industry-respected report of the most powerful computer systems, the high-performance computing market entered the Teraflop era in 2005 (the entry point on the list became greater than 1 Teraflop) and anticipates entering the Petaflop era in 2015. Future HPC systems that are capable of running large-scale parallel applications will span thousands to tens of thousands of nodes, all connected together via high-speed connectivity solutions to form Peta- to multi-Petaflop clusters. There are several architectural approaches to interconnecting nodes to construct an HPC system, such as the use of a Fat-Tree (CBB) or a 3-D Torus. The overall number of communication links grows with the size of the system, and the physical medium for those links has become a growing concern for large-scale platforms, as it tends to impact the system architecture, the system reliability, and cost. In this paper we review the requirements for HPC system cabling, in particular optical cables. According to the CPU, memory, and interconnect roadmaps for the next few years, optical cable capabilities will become the main limitation for building systems at any scale.
ISBN: (Print) 9780897913195
A description is given of the VMP-MC design, a distributed parallel multicomputer based on the VMP multiprocessor design that is intended to provide a set of building blocks for configuring machines from one to several thousand processors. VMP-MC uses a memory hierarchy based on shared caches, ranging from on-chip caches to board-level caches connected by buses to a high-speed fiber-optic ring. In addition to describing the building-block components of this architecture, the authors identify the key performance issues associated with the design and provide a performance evaluation of these issues using trace-driven simulation and measurements from the VMP.
ISBN: (Print) 9781467314824; 9781467314817
In this paper, a new structure of ANL logic, named TPANL, is presented to achieve higher performance, lower power consumption, and glitch elimination. Various ANL logic families suffer from output glitches due to race problems. Our proposed TPANL logic eliminates output glitches and reduces glitch power by using two-phase non-overlapping clocks. TPANL's speedup is mainly due to reduced capacitance at each evaluation node of a dynamic circuit. This logic works in both the strong-inversion and subthreshold operating regions, at 10 GHz and 12.5 MHz respectively. Unlike the NonInv./Inv. pipeline of other ANL logics, TPANL is based on a NonInv./NonInv. pipeline and therefore avoids the voltage drops on NMOS Inv. stages in the subthreshold region. Simulation results for a 4-bit CLA adder show 27% and 72.9% reductions in power consumption, as well as 60% and 50% performance improvements, in the strong-inversion region relative to ANL and DPANL respectively. The 4-bit CLA adder with TPANL logic in the subthreshold region consumes about 92 nW.
ISBN: (Print) 1595937188
Grid computing technology is evolving from emergence to stable, production status. In the computational-science field, the grid computing approach to sharing storage and computing element resources is common, thanks to middleware such as the Globus Toolkit and grid environments such as the Condor project. In many application domains, such as environmental science, bioinformatics, and interactive multimedia delivery, the shared resource consists of tagged data and content rather than hardware components. We provide a grid-aware component, built on our Resource Broker Service, that implements a wrapper over the widely used OpenDAP scientific data access protocol, ensuring effective and efficient grid-based content distribution and contributing to the "grand challenge" of developing grid-aware software infrastructure for data-intensive environmental applications.
ISBN: (Print) 0769516866
Contemporary computing systems, especially large-scale systems such as Grids, promise ultra-fast ubiquitous utility computing, always available at the flip of a switch. A major unresolved issue is the organization and efficient usage of such infrastructure in a commercial context where several entities compete for shared resources. This has long been resolved for conventional utility resources such as gas and electricity through commoditization, a variety of market designs, customization, and decision support for the resulting portfolios of assets and commitments. This paper reviews the state of Grid commercialization and compares it to the commercialization of conventional resources. We draw specific lessons for commercialized Grids and detail them as architecture requirements at each level of the architecture stack. We provide an example to illustrate the benefits of commercialized resources in terms of the financial clarity they bring to decisions for different user groups, namely application users and IT managers.
ISBN: (Print) 0769516866
Grid applications must increasingly self-adapt dynamically to changing environments. In most cases, adaptation has been implemented in an ad hoc fashion, on a per-application basis. This paper describes work that generalizes adaptation so that it can be reused across applications by providing an adaptation framework. This framework uses a software architectural model of the system to analyze whether the application requires adaptation, and allows repairs to be written in the context of the architectural model and propagated to the running system. We exemplify our framework by applying it to the domain of load-balancing a client-server system, and report on an experiment conducted using the framework which illustrates that this approach maintains architectural requirements.
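The monitor-analyze-repair loop this abstract describes can be sketched in a few lines. This is a minimal toy, assuming hypothetical names throughout: the "architectural model" is just a dict of server loads, the invariant is a load bound, and the repair strategy spills excess load onto a replica.

```python
# Minimal sketch of architecture-model-based adaptation (hypothetical
# names; not the paper's framework). The model is checked against an
# invariant (load <= max_load); violations trigger a repair written
# against the model, which here stands in for the running system.

def check_and_repair(model, max_load):
    """model: dict server -> load. Returns the list of repair actions taken."""
    actions = []
    # Analysis step: find servers violating the architectural constraint.
    overloaded = [s for s, load in model.items() if load > max_load]
    for server in overloaded:
        # Repair strategy: move the excess load onto a fresh replica.
        replica = server + "-replica"
        excess = model[server] - max_load
        model[server] = max_load
        model[replica] = model.get(replica, 0) + excess
        actions.append(("add_replica", server, replica))
    return actions
```

A real framework would propagate each action to the deployed system (e.g. spawning a server and rerouting clients) rather than mutating a dict, but the separation of constraint checking from repair is the point being illustrated.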
High-performance computing has gradually shifted from the realm of research into development and partially even into the production cycles of industry; high-performance computers therefore have to be integrated into p...
In many application domains, FPGAs are now promoted as a way of getting around the restrictions that specific CPU designs place on system scalability. However, in the current state of the art, programming FPGAs remains essenti...
ISBN: (Print) 9781728161495
In this work, we propose and evaluate a Network-on-Chip (NoC) augmented with light-weight processing elements to provide a lean dataflow-style system. We show that contemporary NoC routers can frequently experience long periods of idle time, with less than 10% link utilization in HPC applications. By repurposing the temporal and spatial slack of the NoC, the proposed platform, SnackNoC, is able to compute linear algebra kernels efficiently within the communication layer at minimal additional resource cost. SnackNoC 'Snack' application kernels are programmed with a producer-consumer data model that uses the NoC slack to store and transmit intermediate data between processing elements. SnackNoC is demonstrated in a multi-program environment that continually executes linear algebra kernels on the NoC simultaneously with chip multiprocessor (CMP) applications on the processor cores. Linear algebra kernels are computed up to 6.15x faster on SnackNoC than on an Intel Haswell-EP x86 processing core. The cost of executing 'Snack' kernels in parallel with the CMP applications is a minimal runtime impact of 0.01% to 0.83%, due to higher link utilization, and an uncore area overhead of 1.1%.
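The producer-consumer data model this abstract describes can be illustrated with a small sketch. This is a toy, not SnackNoC itself: per-row "processing elements" produce partial results onto a queue that stands in for NoC slack, and a consumer element drains it to assemble a matrix-vector product. All names are hypothetical.

```python
# Illustrative producer-consumer sketch of dataflow-style computation
# spread across light-weight processing elements (here plain Python,
# with a deque standing in for NoC slack used as intermediate storage).
# Not the SnackNoC implementation; names are hypothetical.
from collections import deque

def run_dataflow(matrix, vector):
    """Matrix-vector product assembled from per-row partial results."""
    link = deque()  # models the NoC slack holding intermediate data
    # Producer PEs: each emits one partial dot product onto the link.
    for row in matrix:
        link.append(sum(a * b for a, b in zip(row, vector)))
    # Consumer PE: drains the link in order to assemble the result.
    return [link.popleft() for _ in range(len(matrix))]
```

In the paper's setting the producers and consumer would be elements embedded in the routers and the "link" would be genuinely idle network capacity; the sketch only shows the data model, not the hardware.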