With the rapid progress of high-performance cluster applications, data transfer between clusters in distant locations becomes more important. But it is difficult to transfer data using parallel TCP streams on long di...
ISBN: 9781424416936 (print)
General-Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC because of its high peak performance. However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications, its real performance is not necessarily higher than that of the current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. This is because the GPU performance can be severely limited by such restrictions as memory size and bandwidth and programming using graphics-specific APIs. To overcome this problem, we propose a model-based, adaptive library for 2D FFT that automatically achieves optimal performance using available heterogeneous CPU-GPU computing resources. To find optimal load distribution ratios between CPUs and GPUs, we construct a performance model that captures the respective contributions of CPU vs. GPU, and predicts the total execution time of 2D FFT for arbitrary problem sizes and load distributions. The performance model divides the FFT computation into several small substeps, and predicts the execution time of each step using profiling results. Preliminary evaluation with our prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement using both CPUs and GPUs can be as high as 50% compared to using either a CPU core or a GPU.
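The idea of choosing a load distribution ratio from a performance model can be illustrated with a minimal sketch. It is not the library's actual model: the linear cost terms, the function names `predicted_time` and `best_ratio`, and the parameters `cpu_time_full`, `gpu_time_full`, and `transfer_time_full` are all assumptions made for illustration. The sketch assumes the CPU and GPU work concurrently, so the predicted time is the slower of the two sides.

```python
def predicted_time(ratio, cpu_time_full, gpu_time_full, transfer_time_full):
    """Toy model: predicted wall time when a fraction `ratio` of the work goes to the GPU.

    All three *_full arguments are hypothetical profiled times (in seconds)
    for processing the whole problem on one device.
    """
    cpu_part = (1.0 - ratio) * cpu_time_full
    # Assumption: GPU work pays a host-device transfer cost proportional to its share.
    gpu_part = ratio * (gpu_time_full + transfer_time_full)
    # CPU and GPU are assumed to run concurrently, so the slower side dominates.
    return max(cpu_part, gpu_part)


def best_ratio(cpu_time_full, gpu_time_full, transfer_time_full, steps=1000):
    """Scan candidate ratios in [0, 1] and return the one with the lowest predicted time."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda r: predicted_time(r, cpu_time_full, gpu_time_full, transfer_time_full),
    )
```

For example, `best_ratio(2.0, 1.0, 1.0)` balances a CPU that needs 2 s for the full problem against a GPU that needs 1 s of compute plus 1 s of transfer, and so sends half of the work to each side.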
ISBN: 9781424416936 (print)
A method is presented for modeling application performance on parallel computers in terms of the performance of microkernels from the HPC Challenge benchmarks. Specifically, the application run time is expressed as a linear combination of inverse speeds and latencies from microkernels or system characteristics. The model parameters are obtained by an automated series of least squares fits using backward elimination to ensure statistical significance. If necessary, outliers are deleted to ensure that the final fit is robust. Typically three or four terms appear in each model: at most one each for floating-point speed, memory bandwidth, interconnect bandwidth, and interconnect latency. Such models allow prediction of application performance on future computers from easier-to-make predictions of microkernel performance. The method was used to build models for four benchmark problems involving the PARATEC and MILC scientific applications. These models not only describe performance well on the ten computers used to build the models, but also do a good job of predicting performance on three additional computers with newer design features. For the four application benchmark problems with six predictions each, the relative root mean squared error in the predicted run times varies between 13 and 16%. The method was also used to build models for the HPL and G-FFTE benchmarks in HPCC, including functional dependences on problem size and core count from complexity analysis. The model for HPL predicts performance even better than the application models do, while the model for G-FFTE systematically underpredicts run times.
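A model of the kind described above, with at most one term per resource, can be written schematically as follows. The symbols are our own notation and are not taken from the paper: $S_{\text{flop}}$ is a floating-point speed, $B_{\text{mem}}$ and $B_{\text{net}}$ are memory and interconnect bandwidths, $L_{\text{net}}$ is an interconnect latency, and the coefficients $c$ are the fitted parameters, some of which backward elimination may drop.

```latex
T_{\text{app}} \approx
  \frac{c_{\text{flop}}}{S_{\text{flop}}}
  + \frac{c_{\text{mem}}}{B_{\text{mem}}}
  + \frac{c_{\text{net}}}{B_{\text{net}}}
  + c_{\text{lat}}\, L_{\text{net}}
```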
ISBN: 9781424416936 (print)
The complexity of today's embedded real-time systems is continuously growing with high demands on dependability, resource-efficiency, and reusability. Two solution approaches address these needs: First, in the component-based software engineering (CBSE) paradigm, software is decomposed into self-contained components with explicit interactions and context dependencies. Connectors represent the abstraction of interactions between these components. Second, components can be shifted from software to reconfigurable hardware, typically field programmable gate arrays (FPGAs), in order to meet real-time constraints. This paper proposes a component-based concept to support efficient hardware/software co-design: A hardware component together with the hardware/software connector can seamlessly replace a software component with the same functionality, while the particularities of the alternative interaction are encapsulated in the component connector. Our approach provides for tools that can generate all necessary interaction mechanisms between hardware and software components. A proof-of-concept application demonstrates the advantages of our concept: rapid change and comparison of different partitioning decisions due to automated and faultless generation of the hardware/software connectors.
ISBN: 9781424442379 (print)
The application and research area of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of new knowledge from large multimedia data streams and archives. In recent years, there has been a tremendous growth in the MMCA application domain (for real-time and off-line execution scenarios alike), and this growth is likely to continue in the near future. Multimedia applications operated in a real-time environment pose very strict requirements on the obtained processing times, while off-line applications have to perform within 'tolerable' time frames. To meet these requirements, large-scale multimedia applications typically are being executed on Grid systems consisting of large collections of compute clusters. For optimized use of resources, it is essential to determine the optimal number of compute nodes per cluster, properly dealing with the perceived computation versus communication ratio. This ratio generally depends on the characteristics of the application at hand, and on the software and hardware specifics of the computational environment. Motivated by these observations, in this paper we develop a simple and easy-to-implement method to determine the "optimal" number of parallel compute nodes. The method is based on the classical binary search method for non-linear optimization, and does not depend on the, usually unknown, specifics of the system. Extensive experimental validation on a real distributed system shows that our method is indeed highly effective.
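A binary search for the best node count can be sketched as follows. This is only an illustration under stated assumptions, not the paper's procedure: it assumes the run time is strictly unimodal in the number of nodes (it first falls as computation is spread out, then rises as communication dominates), and both `optimal_node_count` and the toy cost function `toy_run_time` are hypothetical names introduced here. In practice `measure_time` would run or time the real application.

```python
def optimal_node_count(measure_time, low, high):
    """Return the node count in [low, high] that minimizes measure_time(n).

    Assumes measure_time is strictly unimodal on the range, so comparing
    two neighbouring node counts tells us on which side the minimum lies.
    """
    while low < high:
        mid = (low + high) // 2
        if measure_time(mid) <= measure_time(mid + 1):
            high = mid       # minimum is at mid or to its left
        else:
            low = mid + 1    # minimum is strictly to the right of mid
    return low


def toy_run_time(nodes):
    """Hypothetical cost model: compute time shrinks with nodes, communication grows."""
    return 100.0 / nodes + 2.0 * nodes
```

With this toy model, `optimal_node_count(toy_run_time, 1, 64)` needs only a handful of probes instead of timing all 64 configurations, and it uses nothing but observed run times, which matches the abstract's point that the system specifics need not be known.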
ISBN: 9780769534718 (print)
The UK Engineering and Physical Sciences Research Council (EPSRC) funded project "Meeting the Design Challenges of nanoCMOS Electronics" (nanoCMOS) is developing a research infrastructure for collaborative electronics research across multiple institutions in the UK with especially strong industrial and commercial involvement. Unlike other domains, the electronics industry is driven by the necessity of protecting the intellectual property of the data, designs and software associated with next generation electronics devices and therefore requires fine-grained security. Similarly, the project also demands seamless access to large scale high performance compute resources for atomic scale device simulations and the capability to manage the hundreds of thousands of files and the metadata associated with these simulations. Within this context, the project has explored a wide range of authentication and authorization infrastructures facilitating compute resource access and providing fine-grained security over numerous distributed file stores and files. We conclude that no single security solution meets the needs of the project. This paper describes the experiences of applying X.509-based certificates and public key infrastructures, VOMS, PERMIS, Kerberos and the Internet2 Shibboleth technologies for nanoCMOS security. We outline how we are integrating these solutions to provide a complete end-to-end security framework meeting the demands of the nanoCMOS electronics domain.
Distributed storage systems apply erasure-tolerant codes to guarantee reliable access to data despite failures of storage resources. While many codes can be mapped to XOR operations and efficiently implemented on comm...
Conference proceedings front matter may contain various advertisements, welcome messages, committee or program information, and other miscellaneous conference information. This may in some cases also include the cover art, table of contents, copyright statements, title-page or half title-pages, blank pages, venue maps or other general information relating to the conference that was part of the original conference proceedings.
Grid schedulers which need to decide on which sites the jobs are best allocated require controlled and predictable service. Fair-share scheduling has become widely used but lacks a formal model and depends on the curr...