ISBN: 0769523129 (print)
This paper presents PPerfGrid, a tool that addresses the challenges involved in the exchange of heterogeneous parallel computing performance data. Parallel computing performance data exists in a wide variety of schemas and formats, from basic text files to relational databases to XML, and it is stored on geographically dispersed host systems of various platforms. PPerfGrid uses Grid Services to address these challenges. PPerfGrid exposes Application and Execution semantic objects as Grid services and publishes their location and characteristics in a registry. PPerfGrid clients access this registry, locate the PPerfGrid sites with performance data they are interested in, and bind to a set of Grid services that represent this data. This set of Application and Execution Grid services provides a uniform, virtual view of the data available in a particular PPerfGrid session. PPerfGrid addresses scalability by allowing specific questions to be asked about a data store, thereby narrowing the scope of the data returned to a client. In addition, by using a Grid services approach, the Application and Execution Grid services involved in a particular query can be dynamically distributed across several hosts, thereby taking advantage of parallelism and improving scalability. We describe our PPerfGrid prototype and include data from preliminary prototype performance tests.
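The locate-bind-query flow described above can be sketched as follows. This is a hypothetical illustration in Python, not PPerfGrid's actual (Grid services) API; all class and method names are invented for the sketch.

```python
# Hypothetical sketch of the PPerfGrid client flow: consult a registry,
# bind to a service wrapping an execution's data, and ask a specific
# question so only the relevant data is returned to the client.

class Registry:
    """Stands in for the registry where sites publish their data's characteristics."""
    def __init__(self):
        self._sites = {}                  # site name -> advertised metadata

    def publish(self, site, metadata):
        self._sites[site] = metadata

    def locate(self, application):
        # return the sites that advertise performance data for this application
        return [s for s, md in self._sites.items()
                if application in md["applications"]]

class ExecutionService:
    """Stands in for one Execution Grid service wrapping a data store."""
    def __init__(self, records):
        self._records = records           # e.g. {"wall_time": 12.5}

    def query(self, metric):
        # a specific question narrows what crosses the wire to the client
        return self._records.get(metric)

# A client session: locate, bind, query.
registry = Registry()
registry.publish("siteA", {"applications": ["sweep3d"]})
registry.publish("siteB", {"applications": ["cg-solver"]})

sites = registry.locate("sweep3d")                 # find matching sites
service = ExecutionService({"wall_time": 12.5})    # bind to its service
answer = service.query("wall_time")                # only this metric returns
```

The point of the sketch is the scalability argument in the abstract: the client never pulls a whole data store, only the answer to a narrow question.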
In this paper, we describe a prototype software framework that implements a formalized methodology for partitioning computationally intensive applications between reconfigurable hardware blocks of different granularity. A hybrid-granularity reconfigurable generic architecture is considered for this methodology, so that the methodology is applicable to a large variety of hybrid reconfigurable architectures. Although the proposed framework is parametric with respect to the mapping procedures for the fine- and coarse-grain reconfigurable units, we provide mapping algorithms for these types of hardware. The experimental results show the effectiveness of the functionality-partitioning framework. We have validated the framework using two real-world applications, an OFDM transmitter and a JPEG encoder. For the OFDM transmitter, a maximum clock-cycle decrease of 82% relative to an all-fine-grain mapping solution is achieved. The performance improvement for the JPEG encoder is 44%.
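The core decision the abstract describes — which kernels go to the coarse-grain units and which to the fine-grain fabric — can be illustrated with a simple greedy sketch. This is not the paper's algorithm; the cycle counts, areas, and the greedy policy are all invented for illustration.

```python
# Illustrative sketch: assign each kernel to the fine- or coarse-grain
# fabric, preferring whichever needs fewer cycles, while respecting the
# coarse-grain units' limited capacity. All numbers are made up.

def partition(kernels, coarse_capacity):
    placement, used = {}, 0
    # consider kernels with the biggest coarse-grain cycle savings first
    for name, info in sorted(kernels.items(),
                             key=lambda kv: kv[1]["fine"] - kv[1]["coarse"],
                             reverse=True):
        gain = info["fine"] - info["coarse"]
        if gain > 0 and used + info["area"] <= coarse_capacity:
            placement[name] = "coarse"
            used += info["area"]
        else:
            placement[name] = "fine"
    return placement

kernels = {
    "fft":  {"fine": 900, "coarse": 200, "area": 4},  # word-level op: wins on coarse grain
    "huff": {"fine": 150, "coarse": 400, "area": 2},  # bit-level op: better on fine grain
    "dct":  {"fine": 700, "coarse": 250, "area": 3},
}
placement = partition(kernels, coarse_capacity=8)
```

A real framework would, as the abstract says, take the two mapping procedures as parameters rather than hard-coding a cost table like this.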
In this paper we present a method to optimize the overhead in dynamically reconfigurable computing systems. Applications are considered to be partitioned into algorithmic blocks. Our method reduces the overhead incurred when reconfiguration between those blocks is required. For each block a variety of specifications is constructed using high-level algorithmic transformations based on a partitioning method for nested loop programs. The partitioning method allows an efficient verification against the given design constraints. The specifications differ in resource usage and execution time. The reconfiguration costs are reduced by finding the best-matching specifications of the algorithmic blocks. The specifications with the lowest reconfiguration cost are selected for implementation, using the matching information as input for the implementation tools. Finally, we present an optimal solution for a reconfigurable 2D mean filter. Two configurations with different filter sizes and word widths were implemented according to the matching specifications. We reduced the required logic area compared to the non-reconfigurable implementation and significantly reduced the reconfiguration costs.
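The matching step — picking one specification per block so that consecutive configurations are as similar as possible — can be sketched as a small search. The cost model below (distance in area and word width) is a hypothetical stand-in, not the paper's actual reconfiguration cost metric.

```python
# Sketch of specification matching: each algorithmic block has several
# functionally equivalent specifications; choose the pair whose resource
# profiles differ least, since similar profiles mean cheaper
# reconfiguration between the two blocks. Cost model is hypothetical.

from itertools import product

def reconfig_cost(spec_a, spec_b):
    # assume cost grows with how much the resource usage differs
    return (abs(spec_a["area"] - spec_b["area"])
            + abs(spec_a["width"] - spec_b["width"]))

def best_match(specs_block1, specs_block2):
    return min(product(specs_block1, specs_block2),
               key=lambda pair: reconfig_cost(*pair))

# two blocks, each with two candidate specifications
block1 = [{"area": 100, "width": 8},  {"area": 140, "width": 16}]
block2 = [{"area": 90,  "width": 16}, {"area": 150, "width": 16}]
a, b = best_match(block1, block2)
```

Here the wider, larger specification of block 1 is chosen because it nearly matches a specification of block 2, even though it is not block 1's smallest option — mirroring the abstract's point that specifications are selected for matching, not in isolation.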
Current processors exploit out-of-order execution and branch prediction to improve instruction-level parallelism. When a branch prediction is wrong, processors flush the pipeline and squash all the speculative work. However, control-flow independent instructions compute the same results when they re-enter the pipeline down the correct path. If these instructions are not squashed, the branch misprediction penalty can be significantly reduced. In this paper we present a novel mechanism that detects control-flow independent instructions, executes them before the branch is resolved, and avoids their re-execution in the case of a branch misprediction. The mechanism can detect and exploit control-flow independence even for instructions that are far away from the corresponding branch and even outside the instruction window. Performance figures show that the proposed mechanism can exploit control-flow independence for nearly 50% of the mispredicted branches, which results in a performance improvement that ranges from 14% to 17.8% for realistic configurations of forthcoming microprocessors.
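What "control-flow independent" means can be seen in a toy example. The Python below merely models the idea at source level (real detection happens on machine instructions in hardware): work after the reconvergence point that does not consume a branch-dependent value produces identical results on either path, so it need not be redone after a misprediction.

```python
# Toy model of control-flow independence. The function body stands in
# for an instruction stream; `branch_taken` is the branch outcome.

def run(branch_taken):
    # branch-dependent region
    t = 1 if branch_taken else 2
    # --- reconvergence point: both paths continue here ---
    a = 10 * 3        # control-flow independent: does not use t
    b = a + 7         # control-flow independent
    c = b + t         # data-dependent on the branch: consumes t
    return a, b, c

correct = run(True)        # the path the branch actually takes
speculated = run(False)    # the mispredicted path

# a and b match across paths: their speculative results could be kept
# after a misprediction; only c (and its dependents) must re-execute.
```

This is exactly the saving the abstract quantifies: if roughly half of mispredicted branches have reusable downstream work, the flush penalty shrinks accordingly.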
Breakthrough-quality scientific discoveries in the new millennium (such as those expected in computational biology and other fields), along with optimal engineering designs, have created a demand for High-End Computing (HEC) systems with sustained performance requirements at a petaflop scale and beyond. Despite the very pessimistic (if not negative) views on parallel computing systems that prevailed in the 1990s, there seems to be no other viable alternative for such HEC systems. In this talk, we present a fresh look at the problems facing the design of petascale parallel computing systems. We review several fundamental issues that such HEC parallel computing systems must resolve. These issues include execution models that support dynamic and adaptive multithreading, fine-grain synchronization, and global name-space and memory consistency. Related issues in parallel programming, dynamic compilation models, and system software design will also be discussed. Present solutions and future directions will be discussed based on (1) application demand (e.g., computational biology and others), (2) the recent trend as demonstrated by the HTMT, HPCS, and the Blue Gene Cyclops (e.g., Cyclops-64) architectures, and (3) a historical perspective on influential models such as dataflow, along with concepts learned from these models.
Buffered CoScheduled (BCS) MPI is a novel implementation of MPI based on global synchronization of all system activities. BCS-MPI imposes a model where all processes and their communication are tightly scheduled at a very fine granularity. Thus, BCS-MPI provides a system that is much more controllable and deterministic. BCS-MPI leverages this regular behavior to provide a simple yet powerful monitoring and debugging subsystem that streamlines the analysis of parallel software. This subsystem, called Monitoring and Debugging System (MDS), provides exhaustive process and communication scheduling statistics. This paper covers in detail the design and implementation of the MDS subsystem, and demonstrates how the MDS can be used to monitor and debug not only parallel MPI applications but also the BCS-MPI runtime system itself. Additionally, we show that this functionality need not come at a significant performance loss.
The proceedings contain 36 papers. The topics discussed include: distributed simulation of vehicular networks; consistency overhead using HLA for collaborative work; concurrency control frameworks for interactive sharing of data spaces; using web services and data mediation/storage services to enable command and control to simulation interoperability; a version of MASM portable across different UNIX systems and different hardware architectures; using consistent global checkpoints to synchronize processes in distributed simulation; dealing with global guards in a distributed simulation of colored Petri Nets; and 3D mesh compression using an efficient neighborhood-based segmentation.
In this paper we test the suitability of Java for implementing a scalable Web Service that solves a set of problems related to peer-to-peer interactions between Web Services that are behind firewalls or not generally accessible. In particular, we describe how to enable reliable and long-running conversations through firewalls between Web Service peers that have no accessible network endpoints. Our solution is to implement in Java a Web Services Dispatcher (WSD), an intermediary service that forwards messages and can facilitate message exchanges by supporting SOAP RPC over HTTP and WS-Addressing for asynchronous messaging. We describe how Web Service clients that have no network endpoints, such as applets, can become Web Service peers by using an additional message store-and-forward service ("mailbox"). We then conduct a set of experiments to evaluate the performance of the Java implementation in realistic Web Service scenarios, involving intercontinental tests between France and the US.
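The mailbox idea is worth unpacking: a peer with no reachable endpoint cannot have messages pushed to it, so an intermediary holds them until the peer polls over a connection it initiates itself. The paper's implementation is in Java; the sketch below is a deliberately minimal, language-neutral illustration in Python, with invented names, omitting the dispatcher, SOAP, and HTTP layers entirely.

```python
# Minimal store-and-forward "mailbox" sketch: the dispatcher deposits
# messages for an unreachable peer; the firewalled peer later pulls
# them over an outbound request it opens itself.

from collections import defaultdict, deque

class Mailbox:
    def __init__(self):
        self._queues = defaultdict(deque)   # peer id -> pending messages

    def deposit(self, peer_id, message):
        # called by the intermediary when the target peer is unreachable
        self._queues[peer_id].append(message)

    def poll(self, peer_id):
        # called by the peer itself (e.g. an applet) over an outbound
        # connection; returns the oldest pending message, if any
        q = self._queues[peer_id]
        return q.popleft() if q else None

mbox = Mailbox()
mbox.deposit("applet-42", "<soap:Envelope>hello</soap:Envelope>")
msg = mbox.poll("applet-42")     # the peer pulls; firewall sees only
                                 # an outbound request from the peer
```

In the real system the poll would be an HTTP request carrying WS-Addressing headers, so long-running conversations survive even though the peer is never directly addressable.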
Handling very large datasets has been a key problem addressed in real-time distributed rendering research. With the advent of the programmable Graphics Processing Unit (GPU), it is now possible and even profitable to move many application-specific computations to the GPU. It has been shown that modern GPUs outperform standard PC-platform CPUs on a broad class of computations by over a factor of seven [9]. Given the low costs and high processing speeds of GPUs, there is a trend towards using clusters of CPU/GPU systems. Configuring and programming these clusters for efficient distribution of data and computations is a major challenge. Which computations can be offloaded from the CPU to a GPU? The answer is not simple, as it depends on the following four factors: the GPU's processing capacity, the CPU's internal bandwidth, the GPU-CPU communication bandwidth, and the external network bandwidth. All these factors are subject to change with every generation of hardware. Additions and alternatives to the traditional data-parallel architectures are now needed to exploit the full capability of such clusters using functional parallelism. In this paper, we present a number of architectural configurations that could be adopted on such clusters. Specifically, we demonstrate the use of one such architecture: the application of a GPU-based pipelined architecture to our work on real-time processing and rendering of large point datasets, which demands complex computations. We also introduce a list of application and system parameters that are necessary to determine an optimal distribution of computation on the GPUs of a graphics cluster.
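The offload question can be framed as a back-of-the-envelope throughput model: in a pipelined configuration, the slowest of the transfer and compute stages bounds the achievable rate. The sketch below is a hypothetical model with invented numbers, not the paper's parameter list, but it shows how the factors above interact.

```python
# Rough model of one pipeline stage in a CPU/GPU cluster: whether the
# GPU offload pays off depends on upload bandwidth (CPU -> GPU link),
# GPU compute capacity, and the network bandwidth to the next node.
# All figures are hypothetical, not measurements from the paper.

def stage_time(work_flops, data_bytes, result_bytes,
               gpu_flops, link_bw, net_bw):
    upload  = data_bytes / link_bw       # CPU -> GPU transfer (s)
    compute = work_flops / gpu_flops     # GPU kernel time (s)
    network = result_bytes / net_bw      # ship results onward (s)
    # stages overlap in a pipeline, so the slowest one sets throughput
    return max(upload, compute, network)

# example: compute-heavy task with modest data -> GPU is the bottleneck,
# so offloading more work to other GPUs in the cluster would help
t = stage_time(work_flops=2e9, data_bytes=8e6, result_bytes=1e6,
               gpu_flops=1e10, link_bw=4e9, net_bw=1e8)
```

Swapping in each new hardware generation's figures changes which term of the `max` dominates — which is exactly why the abstract stresses that the answer shifts with every generation.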
The use of numerical methods for the solution of large electromagnetic (EM) problems is nowadays common practice. Among the several available techniques, the Finite Difference Time Domain (FDTD) method is one of the most frequently adopted, thanks to its versatility and its ability to deal with complex structures. Unfortunately, the method requires a huge computational effort, so that the study of large simulation domains and the investigation of many practical EM aspects cannot be afforded using traditional computers. Moreover, the need for high accuracy when the simulation domain is modeled implies a small mesh size, with a non-negligible impact on the memory requirement if a uniform mesh scheme is adopted. The proposed variable-mesh FDTD (VM-FDTD) algorithm allows a natural and efficient parallel implementation, guaranteeing the possibility of managing large memory requirements. Results on relevant topics such as antenna characterization and the interaction between humans and EM sources are presented, demonstrating the algorithm's ability to deal with complex EM problems.
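The memory argument is easiest to see in one dimension: the standard leapfrog FDTD update can run on a non-uniform grid, so fine cells are spent only where accuracy demands them. The sketch below is a minimal 1-D illustration in normalized units with an arbitrary hand-built mesh; it is not the paper's VM-FDTD algorithm or its parallel decomposition.

```python
# Minimal 1-D FDTD on a non-uniform mesh: leapfrog updates of E and H
# where each update divides by the local cell size. dx lists the cell
# widths; fine cells sit only in the middle of the domain.

def fdtd_1d(steps, dx, dt=0.25):
    n = len(dx)
    E = [0.0] * (n + 1)                  # E at cell edges
    H = [0.0] * n                        # H at cell centers
    for t in range(steps):
        if t == 0:
            E[n // 2] = 1.0              # impulse source mid-domain
        for k in range(n):               # H update: local primary cell
            H[k] += dt * (E[k + 1] - E[k]) / dx[k]
        for k in range(1, n):            # E update: averaged (dual) cell
            E[k] += dt * (H[k] - H[k - 1]) / (0.5 * (dx[k - 1] + dx[k]))
    return E, H

# coarse cells at the edges, a refined region in the middle: the fine
# resolution is paid for only where the geometry needs it
dx = [2.0] * 8 + [0.5] * 8 + [2.0] * 8
E, H = fdtd_1d(steps=20, dx=dx)
```

Note the time step is chosen against the smallest cell (here dt/dx_min = 0.5, within the 1-D Courant limit); the stability penalty of local refinement is one of the trade-offs a variable-mesh scheme must manage.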