In this paper we present the design, implementation and evaluation of a framework that uses JavaSpaces [1] to support this type of opportunistic adaptive parallel/distributed computing over networked clusters in a non...
In this paper we focus on the implementation of large scientific applications with LB_Migrate, a dynamic load balancing library. The library employs dynamic loop scheduling techniques to address performance degradation caused by load imbalance, provides a flexible interface to the native data structures of the application, and performs data migration. The library is reusable and is not application specific. For initial testing, the library was employed in three applications: the profiling of an automatic quadrature routine, the simulation of a hybrid model for image denoising, and N-body simulations. We discuss the original applications without the library and the changes needed to interface them with the library, and we present experimental results. Performance results indicate that the library adds minimal overhead, up to 6%, which varies from application to application. However, the benefits gained from the use of the library are substantial.
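The abstract does not say which dynamic loop scheduling techniques the library uses. As an illustration of the general idea only, the sketch below implements guided self-scheduling, one well-known technique of this family, in which each idle worker takes a chunk of ceil(remaining / workers) iterations. The function name `guided_chunks` and its parameters are hypothetical and are not part of LB_Migrate.

```python
import math


def guided_chunks(total_iterations, num_workers):
    """Yield (start, size) chunks using guided self-scheduling.

    Assumption: this is a generic textbook scheme, not LB_Migrate's API.
    Each chunk takes ceil(remaining / num_workers) iterations, so chunks
    start large (low scheduling overhead) and shrink toward the end
    (better balance when workers finish at different times).
    """
    start = 0
    remaining = total_iterations
    while remaining > 0:
        size = math.ceil(remaining / num_workers)
        yield start, size
        start += size
        remaining -= size


if __name__ == "__main__":
    for start, size in guided_chunks(100, 4):
        print(f"iterations [{start}, {start + size}) -> next idle worker")
```

Shrinking chunks trade a few extra scheduling requests near the end of the loop for a smaller worst-case idle time, which is the imbalance that dynamic loop scheduling aims to reduce.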
Declustering is a well-known technique for achieving high query performance on parallel databases. We propose novel General Disk Module (GDM) based declustering algorithms, GDM Cartesian and GDM Circle, for distributing uniformly distributed multidimensional datasets of any dimension across parallel disks. We compare the performance of the new approaches with several existing declustering algorithms, using variable numbers of disks and variable shapes and dimensions of the datasets. Our results show that the new approaches significantly outperform the others for almost all configurations tested.
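The abstract does not define GDM Cartesian or GDM Circle, so the sketch below does not attempt them. It only illustrates the family of modulo-based declustering schemes they build on, assuming the usual generalized form in which a grid cell is assigned to disk (a_1 x_1 + ... + a_d x_d) mod M. The coefficient values and the helper names `gdm_disk` and `disks_touched` are hypothetical.

```python
from collections import Counter
from itertools import product


def gdm_disk(cell, coefficients, num_disks):
    """Assign a grid cell to a disk with a generalized modulo mapping.

    Assumption: disk = (sum of a_i * x_i) mod M. With all coefficients
    equal to 1 this reduces to the classic Disk Modulo scheme.
    """
    return sum(a * x for a, x in zip(coefficients, cell)) % num_disks


def disks_touched(lower, upper, coefficients, num_disks):
    """Count how many cells of an inclusive range query land on each disk."""
    ranges = [range(lo, hi + 1) for lo, hi in zip(lower, upper)]
    return Counter(
        gdm_disk(cell, coefficients, num_disks) for cell in product(*ranges)
    )


if __name__ == "__main__":
    # A 2x2 range query on 4 disks: the busiest disk determines query cost.
    for coeffs in [(1, 1), (1, 2)]:
        load = disks_touched((0, 0), (1, 1), coeffs, 4)
        print(coeffs, dict(load), "max load:", max(load.values()))
```

For this small query, coefficients (1, 2) spread the four cells over four disks, while (1, 1) places two cells on the same disk. Comparing the maximum per-disk load across many queries is one common way to evaluate declustering schemes.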
Summary form only given. QR methods for solving Toeplitz tridiagonal systems are well developed with applications in numerous interdisciplinary fields. There is a strong motivation to develop faster, more efficient and, more importantly, scalable algorithms to factor such systems due to their significance in many scientific applications. We present two parallel QR factorization algorithms used to solve Toeplitz tridiagonal systems. QR factorization is accomplished using Householder reflections and Givens rotations. These parallel algorithms exhibit high scalability and near linear to superlinear speedup on large system sizes when implemented on a distributed system.
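The abstract gives no algorithmic detail, so the following is only a sequential sketch of the underlying building block: QR factorization of a tridiagonal system by Givens rotations, followed by back substitution. It is not the parallel algorithm of the paper, and the function names `toeplitz_tridiagonal` and `givens_solve` are hypothetical. A dense list-of-lists matrix is used purely for readability.

```python
import math


def toeplitz_tridiagonal(n, sub, diag, sup):
    """Build an n x n Toeplitz tridiagonal matrix as a list of rows."""
    return [
        [diag if i == j else sub if i == j + 1 else sup if j == i + 1 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def givens_solve(a, b):
    """Solve a x = b for a nonsingular tridiagonal a via Givens-rotation QR."""
    n = len(a)
    r = [row[:] for row in a]  # becomes R (upper triangular, bandwidth 2)
    y = b[:]                   # becomes Q^T b
    for k in range(n - 1):
        x1, x2 = r[k][k], r[k + 1][k]
        h = math.hypot(x1, x2)
        if h == 0.0:
            continue  # nothing to eliminate in this column
        c, s = x1 / h, x2 / h
        # The rotation mixes rows k and k+1; only columns k..k+2 are nonzero.
        for j in range(k, min(k + 3, n)):
            u, v = r[k][j], r[k + 1][j]
            r[k][j] = c * u + s * v
            r[k + 1][j] = -s * u + c * v
        u, v = y[k], y[k + 1]
        y[k] = c * u + s * v
        y[k + 1] = -s * u + c * v
    # Back substitution on R, which has at most two superdiagonals.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = y[i] - sum(r[i][j] * x[j] for j in range(i + 1, min(i + 3, n)))
        x[i] = acc / r[i][i]
    return x


if __name__ == "__main__":
    a = toeplitz_tridiagonal(5, -1.0, 2.0, -1.0)
    print(givens_solve(a, [0.0, 0.0, 0.0, 0.0, 6.0]))
```

Each rotation touches only two adjacent rows and three columns, so the sequential cost is linear in the system size. That locality is also what makes partitioned, parallel variants plausible.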
Performing rigorous analysis of parallel and distributed systems (PDS) specifications is one of the important tasks during the early stages of development. Ambiguities and errors left unchecked during the analysis phase can creep into the design and development phases, resulting in cost and schedule overruns and a less reliable end product. Commercial off-the-shelf CASE (Computer Aided Software Engineering) tools can play an important role in the analysis and design phases. However, techniques must be developed to address the shortcomings of CASE tools. A set of such techniques is presented in this paper. CASE tools can be used to gather PDS specifications in the form of analysis models. The techniques presented in this paper deal with the problem of performing rigorous analysis of PDS specifications originally developed using a CASE tool. The approach is based on integrating a CASE tool with a verification tool based on coloured Petri nets (CPNs). CPNs can be used to model and analyze concurrency in the specification and design phases. Dynamic simulations of CPN models can be used to conduct performance/performability analysis as well as risk assessment studies.
In this paper, we investigate the optimal guaranteed cost control problem for a class of uncertain delta operator systems with both state and input delays. Based on a Lyapunov-Krasovskii functional in the delta domain, sufficient conditions for the existence of a guaranteed cost controller for this class of delta operator systems are presented in terms of linear matrix inequalities (LMIs). The proposed method unifies some previous related results for uncertain continuous and discrete systems within the delta operator systems framework.
ISBN: (print) 9781424444106
The execution of applications in dependable systems requires a high level of instrumentation for automatic control. In this paper we present a monitoring solution for complex application execution. The monitoring solution is dynamic, offering real-time information about systems and applications. The complex applications are described using workflows. We show that the management process for application execution is improved using monitoring information. The environment consists of distributed dependable systems that offer flexible support for complex application execution. Our experimental results highlight the performance of the proposed monitoring tool, the MonALISA framework.
Exploiting thread-level parallelism (TLP) is a promising way to improve the performance of applications with the advent of general-purpose, cost-effective uni-processor and shared-memory multiprocessor systems. In this paper, we describe the OpenMP implementation in the Intel® C++ and Fortran compilers for Intel platforms. We present our major design considerations and decisions in the Intel compiler for generating efficient multithreaded code guided by OpenMP directives and pragmas. We describe several transformation phases in the compiler for OpenMP parallelization. In addition to compiler support, the OpenMP runtime library is a critical part of the Intel compiler. We present runtime techniques developed in the Intel OpenMP runtime library for exploiting thread-level parallelism, as well as for integrating the OpenMP support with other forms of threading, termed sibling parallelism. The performance results of a set of benchmarks show good speedups over well-optimized serial code on Intel® Pentium- and Itanium-processor based systems.
The design of microprocessor chips for high-end computing systems is moving towards many-core architectures with 10s or 100+ processing units. An important class of target applications for such architectures is scientific numerical computation, much of which is intrinsically deterministic: for a given input, a fixed output (result) should be produced no matter how the program is parallelized. It is critical that the read-after-write data dependencies in such programs be implemented correctly and efficiently via fine-grain data synchronization. In this paper, we investigate the parallelization of three representative scientific computation kernels using fine-grain data synchronization supported by a recently proposed architectural mechanism for many-core chips, called the synchronization state buffer (SSB). Using detailed simulation on a simulator for the IBM 160-core Cyclops-64 chip architecture with the SSB extension, our experiments demonstrate a significant performance advantage for parallelization schemes based on fine-grain data synchronization on scientific workloads.
ISBN: (print) 9781450320634
With the current trend of increasing the number of processing elements (PEs) on a single chip, on-chip networks provide a fast and reliable interconnect technology for highly parallel applications. Yet, the end-to-end data throughput at the software layer on a NoC (Network-on-Chip) platform often cannot match the native hardware speed without an efficient hardware/software interface. In this paper, we present a high-throughput PE-to-PE communication unit with a corresponding driver layer for NoC-based many-core architectures. The proposed communication unit, with application-level flow control, can handle complicated inter-PE communication for practical parallel applications. The maximum throughput of a unidirectional transmission with the application-level flow control protocol is 2687.3 Mbps (normalized to an operating frequency of 100 MHz), where the native NoC speed is 3200 Mbps. By comparison, a software-based protocol is rated at only 148.5 Mbps. The communication unit is also area-efficient at only 19.2K gates, which is roughly 3.2% of a single in-order RISC-based PE. Copyright 2013 ACM.
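The figures quoted above can be related to one another with simple arithmetic. The short script below uses only numbers stated in the abstract; the derived quantities (link efficiency, speedup over the software protocol, bits per cycle, and implied PE size) are back-of-the-envelope estimates, not results reported by the authors, and the variable names are illustrative.

```python
# Values taken from the abstract.
NATIVE_MBPS = 3200.0       # native NoC speed
HW_UNIT_MBPS = 2687.3      # communication unit with application-level flow control
SOFTWARE_MBPS = 148.5      # software-based protocol
CLOCK_MHZ = 100.0          # normalization frequency
UNIT_GATES = 19_200        # area of the communication unit
UNIT_SHARE_OF_PE = 0.032   # "roughly 3.2%" of one in-order RISC PE

# Derived estimates (not reported in the abstract).
efficiency = HW_UNIT_MBPS / NATIVE_MBPS          # fraction of native speed reached
speedup = HW_UNIT_MBPS / SOFTWARE_MBPS           # gain over the software protocol
native_bits_per_cycle = NATIVE_MBPS / CLOCK_MHZ  # Mbps / MHz = bits per cycle
implied_pe_gates = UNIT_GATES / UNIT_SHARE_OF_PE # approximate size of one PE

if __name__ == "__main__":
    print(f"efficiency vs native NoC: {efficiency:.1%}")
    print(f"speedup vs software protocol: {speedup:.1f}x")
    print(f"native link width: {native_bits_per_cycle:.0f} bits/cycle")
    print(f"implied PE size: about {implied_pe_gates:,.0f} gates")
```

Under these figures the unit reaches about 84% of the native NoC speed and is roughly 18 times faster than the software protocol.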