This paper provides an overview of the current status of methods that may be used to induce parallel properties into the temporal axis for time dependent problems described by differential *** extension to problems with two spatial dimensions is also included.
The JEM-EUSO space observatory will be launched and attached to the Japanese module of the International Space Station (ISS) in 2016. Its aim is to observe UV photon tracks produced by Ultra High Energy Cosmic Rays (UHECR) and Extremely High Energy Cosmic Rays (EHECR) developing in the atmosphere and producing Extensive Air Showers (EAS). JEM-EUSO will use our atmosphere as a huge calorimeter to detect the electromagnetic and hadronic components of the EAS. For High Energy Physics (HEP) experiments, the huge amount of data and the complex analysis algorithms require the use of advanced GRID computational resources. Therefore, a complete infrastructure is needed for simulation and data analysis, within the GRID architecture and computing infrastructure intended for the JEM-EUSO space mission software. Moreover, solutions for a complete installed cluster system, with its software and data repositories, as well as many other details, are pointed out.
ISBN: (print) 9781467309752
Deploying an application onto a target platform for high performance often demands manual tuning by experts. As machine architectures become increasingly complex, tuning becomes even more challenging and calls for systematic approaches. In our earlier work we presented a prototype that efficiently combines expert knowledge, static analysis, and runtime observation for bottleneck detection, and employs refactoring and compiler feedback for mitigation. In this study, we develop a software tool that facilitates fast searching for bottlenecks and effective mitigation of problems in the major dimensions of computing (e.g., computation, communication, and I/O). The impact of our approach is demonstrated by tuning the LBMHD code and a Poisson solver code, representing traditional scientific codes, and a graph analysis code in UPC, representing emerging programming paradigms. In the experiments, our framework detects intricate bottlenecks of memory access, I/O, and communication with a single run of the application. Moreover, the automated solution implementation yields significant overall performance improvement on the target platforms. The improvement for LBMHD is up to 45%, and the speedup for the UPC code is up to a factor of 5. These results suggest that our approach is a concrete step towards systematic tuning of high performance computing applications.
ISBN: (print) 9781467309752
DCAF is a directly connected, arbitration-free photonic crossbar that is realized by taking advantage of multiple photonic layers connected with photonic vias. To evaluate DCAF, we developed a detailed implementation model for the network and analyzed its power and performance on a variety of benchmarks, including SPLASH-2 and synthetic traces. Our results demonstrate that the overhead required by arbitration is non-trivial, especially at high loads. Eliminating the need for arbitration, sizing the buffers carefully, and retransmitting lost packets when there is contention together result in a 44% reduction in average packet latency without additional power overhead. We also use an analytical model for ScaLAPACK QR decomposition and find that a 64-processor DCAF could outperform a 1024-node cluster connected with 40 Gbps links on matrices up to 500 MB in size.
Performance optimization, especially in the field of HPC, is an integral part of today's software development process. One powerful way of optimizing applications is to analyze their event traces. Yet, the comparison of traces of multiple application runs is cumbersome. The impact of optimizations in the source code or the usage of different compiler flags has to be tracked manually. The challenge is to automatically identify exactly those areas that changed in the large amount of trace data. We propose a novel solution that combines sequence alignment algorithms with call graph analysis to compare and highlight traces event-wise. Our approach is able to automatically detect differences by aligning event traces. Fine-grained execution time differences can be extracted and displayed in performance charts. The results of our implementation are presented and discussed.
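As a rough illustration of what event-wise trace alignment can look like, the sketch below aligns two traces with the classic Needleman-Wunsch global alignment algorithm and then lists the differences. It is not the method of the paper above, which also uses call graph analysis. The trace representation (a list of `(event_name, duration)` tuples), the scoring values, and the function names `align_traces` and `diff_report` are all assumptions made for this example.

```python
def align_traces(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two event traces (Needleman-Wunsch) by event name.

    Each trace is a list of (event_name, duration) tuples (assumed format).
    Returns a list of pairs; None marks an event missing from one trace.
    """
    def sub(x, y):
        return match if x[0] == y[0] else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # event only in trace a
                score[i][j - 1] + gap,   # event only in trace b
            )

    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def diff_report(pairs):
    """List only the aligned positions where the two runs differ."""
    report = []
    for x, y in pairs:
        if x is None:
            report.append(("added", y[0], y[1]))
        elif y is None:
            report.append(("removed", x[0], -x[1]))
        elif x[0] != y[0]:
            report.append(("changed", x[0] + "->" + y[0], y[1] - x[1]))
        elif x[1] != y[1]:
            report.append(("timing", x[0], y[1] - x[1]))
    return report
```

Calling `diff_report(align_traces(run_a, run_b))` on two runs of the same program yields only the events that were added, removed, or changed in duration, which is the kind of information that could feed a performance chart.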
Current cloud service description languages envision the ability to automatically combine cloud service offerings across multiple abstraction layers, i.e., software, platform, and infrastructure service offerings, to achieve a common shared business goal. However, little effort has been spent in this direction. This paper formalizes the problem of automatically combining cloud services and shows its computationally intensive nature. To address this problem, we propose a Resource Description Framework (RDF)-based prototype implementation that leverages a batch process to automatically construct possible combinations of cloud services. Using this approach, we are able to analyze, in a timely fashion, possible combinations of cloud services that may fit particular customer needs.
Energy will be a major limiting factor in future multi-core architectures, so optimizing performance per watt should be a key driver for next-generation massive-core architectures. Recent studies show that heterogeneous chips integrating different core architectures, such as CPU and GPU, on a single die are the most promising solution. We investigated how energy efficiency and scalability are affected by the power constraints imposed on contemporary hybrid CPU-GPU processors. Analytical models were developed to extend Amdahl's Law by accounting for energy limitations before examining the three processing modes available to heterogeneous processors, i.e., symmetric, asymmetric, and simultaneous asymmetric. The analysis shows clearly that greater parallelism is the most important factor affecting power consumption.
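The abstract does not give the paper's models, so the following is only a minimal sketch of how Amdahl's Law can be combined with a power term for the symmetric case. It assumes n identical cores, each drawing one unit of power when active and a fraction k of that when idle; the sequential phase runs on one core while the other n - 1 idle. The names `amdahl_speedup` and `perf_per_watt` and the parameter k are assumptions made for this example, not symbols from the paper.

```python
def amdahl_speedup(f, n):
    """Classic Amdahl's Law: speedup on n cores with parallel fraction f."""
    return 1.0 / ((1.0 - f) + f / n)


def perf_per_watt(f, n, k):
    """Performance per watt relative to one core running the whole program.

    Assumed model: an active core draws power 1, an idle core draws k.
    Sequential phase: time (1 - f), power 1 + (n - 1) * k.
    Parallel phase:   time f / n,   power n.
    Performance per watt equals 1 / energy under this normalization.
    """
    energy = (1.0 - f) * (1.0 + (n - 1) * k) + f
    return 1.0 / energy


if __name__ == "__main__":
    # Compare how the parallel fraction affects both metrics on 16 cores.
    for f in (0.5, 0.9, 0.99):
        print(f, amdahl_speedup(f, 16), perf_per_watt(f, 16, 0.3))
```

Under these assumptions the energy simplifies to 1 + (n - 1) k (1 - f), so the efficiency loss comes entirely from idle cores during the sequential phase and shrinks as the parallel fraction f grows.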
The aim of this paper is to present a new hybrid solver for linear feasibility systems that uses a block-parallel scheme combined with a new variable-weight projection operator, which takes into account the distances to the half-spaces onto which it projects. The solver can tackle very large, dense systems. The results of our study show that a specialized variant of the solver is more resource-efficient than other variants at solving a certain class of dense systems. Furthermore, we show results suggesting that the distribution scheme does not greatly affect the number of iterations required to reach a solution.
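To make the idea of a distance-weighted projection concrete, here is a minimal sequential sketch of a simultaneous projection method for a system A x <= b. It is not the operator or the block-parallel scheme of the paper above: the specific weighting rule (weights proportional to the distance to each violated half-space) and the names `project_step` and `solve_feasibility` are assumptions made for this example.

```python
import math


def project_step(A, b, x, tol=1e-9):
    """One simultaneous projection step toward {x : A x <= b}.

    Returns (new_x, feasible). Each violated row contributes the move that
    would project x onto its half-space, weighted by its share of the total
    violation distance (an assumed weighting rule).
    """
    moves, dists = [], []
    for row, bi in zip(A, b):
        norm2 = sum(a * a for a in row)
        if norm2 == 0.0:
            continue  # an all-zero row carries no direction to project along
        viol = sum(a * xi for a, xi in zip(row, x)) - bi
        if viol > tol:
            dists.append(viol / math.sqrt(norm2))           # distance to half-space
            moves.append([-viol / norm2 * a for a in row])  # projection minus x
    if not dists:
        return list(x), True
    total = sum(dists)
    new_x = list(x)
    for d, move in zip(dists, moves):
        w = d / total  # convex weights: farther half-spaces pull harder
        for j, mj in enumerate(move):
            new_x[j] += w * mj
    return new_x, False


def solve_feasibility(A, b, x0, max_iter=1000, tol=1e-9):
    """Iterate project_step until all rows hold within tol or max_iter is hit."""
    x = list(x0)
    for _ in range(max_iter):
        x, feasible = project_step(A, b, x, tol)
        if feasible:
            break
    return x
```

In a block-parallel setting, the loop over rows in `project_step` is the natural place to split work, since each row's move can be computed independently before the weighted sum is formed.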
Large Web search engines are constructed as a collection of services that are deployed on dedicated clusters of distributed-memory processors. In particular, efficient user query throughput relies heavily on result cache services devoted to maintaining the answers to the most frequent queries. Load balancing and fault tolerance are critical to this service. This paper proposes the design of a result cache service based on consistent hashing and a strategy for enabling fault tolerance. Performance is evaluated using actual queries from a commercial search engine. The results show that the proposed cache service outperforms baseline approaches, decreases the average query response time, increases query throughput, and efficiently recovers performance after processor failures.
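The following is a generic sketch of consistent hashing with virtual nodes, shown only to illustrate why the technique suits a result cache that must survive processor failures. It is not the design proposed in the paper above. The class name `Ring`, the use of MD5 for ring positions, the virtual-node count, and the node names are all assumptions made for this example.

```python
import bisect
import hashlib


def _ring_hash(key):
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Consistent-hash ring mapping query strings to cache nodes."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Several virtual points per node spread its load around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._points, (_ring_hash(f"{node}#{i}"), node))

    def remove(self, node):
        # On failure, only the keys owned by this node move elsewhere.
        self._points = [p for p in self._points if p[1] != node]

    def lookup(self, query, replicas=1):
        """Return up to `replicas` distinct nodes, walking clockwise."""
        if not self._points:
            return []
        start = bisect.bisect(self._points, (_ring_hash(query), ""))
        owners = []
        for k in range(len(self._points)):
            node = self._points[(start + k) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == replicas:
                    break
        return owners
```

If a cached answer is stored on the first two nodes returned by `lookup(query, replicas=2)`, then after the first node fails the same query maps to the second one, so the replica is already in place and queries owned by other nodes are not remapped.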
Modern computer systems are becoming increasingly heterogeneous, comprising multi-core CPUs, GPUs, and other accelerators. Current programming approaches for such systems usually require the application developer to use a combination of several programming models (e.g., MPI with OpenCL or CUDA) in order to exploit the full compute capability of a system. In this paper, we present dOpenCL (distributed OpenCL), a uniform approach to programming distributed heterogeneous systems with accelerators. dOpenCL extends the OpenCL standard such that arbitrary computing devices installed on any node of a distributed system can be used together within a single application. dOpenCL allows moving data and program code to these devices in a transparent, portable manner. Since dOpenCL is designed as a fully-fledged implementation of the OpenCL API, it allows running existing OpenCL applications in a heterogeneous distributed environment without any modifications. We describe in detail the mechanisms that are required to implement OpenCL for distributed systems, including a device management mechanism for running multiple applications concurrently. Using three application studies, we compare the performance of dOpenCL with MPI+OpenCL and a standard OpenCL implementation.