ISBN: 9781728165820 (print)
Emerging general-purpose graphics processing units (GPGPUs) use a memory hierarchy very similar to that of modern multi-core processors: they typically have multiple levels of on-chip caches and a DDR-like off-chip main memory. In such massively parallel architectures, caches are expected to reduce the average data access latency by reducing the number of off-chip memory accesses; however, our extensive experimental studies confirm that not all applications utilize the on-chip caches efficiently. Even though GPGPUs are adopted to run a wide range of general-purpose applications, conventional cache management policies cannot achieve optimal performance across the different memory characteristics of those applications. This paper first investigates the underlying reasons for the inefficiency of common cache management policies in GPGPUs. To address those issues, we then propose (i) a characterization mechanism to analyze each kernel at runtime and (ii) a selective caching policy to manage the flow of cache accesses. Evaluation results on the studied platform show that our proposed dynamically reconfigurable cache hierarchy improves system performance by up to 105% (27% on average) over a wide range of modern GPGPU applications, which is within 10% of the optimal improvement.
ISBN: 1892512416 (print)
Evolving distributed applications to reflect structural changes or to adapt to specific conditions of the run-time environment is a difficult issue, especially if end-users require continuous service. This latter constraint implies that changes must be performed with minimal penalty on service provisioning. The set of tools and services that allow this goal to be achieved is usually designated as dynamic reconfiguration capabilities. A major issue in dynamic reconfiguration is ensuring application consistency after a reconfiguration. Classical transactional models provide a way to identify application calculations and to ensure the consistency of these calculations despite faults. As a first step, we propose an extended transaction model to ensure such consistency when a reconfiguration occurs. We argue that ensuring consistency can be done easily by using an extended transaction model with a strong isolation property. We then discuss possible solutions for non-transactional applications.
ISBN: 0818667052 (print)
We address a significant problem in parallel processing research, namely, how to port existing sequential programs to run efficiently on parallel machines (the 'dusty deck' problem). Conventional domain-independent techniques are inadequate for solving this problem because they miss significant opportunities for parallelism. We present experimental evidence to support our claim, analyze why current techniques are inadequate, and propose a knowledge-based reverse engineering approach for attacking this problem.
ISBN: 9780769543284 (print)
This paper studies the impact of using automatic data-layout techniques on the process of coding the well-known multigrid MG NAS parallel benchmark. We describe the sequential problem in detail, and discuss the parallel version and its optimizations. Then, we implement the parallel algorithm using Hitmap, a highly-efficient modular library for hierarchical tiling and mapping of arrays. We describe how to use the library plug-in system to add a new data-layout module that encapsulates a generalization of the data-alignment policy of the MG benchmark. The module system applies this policy to automatically adapt the data distribution and communication code to any grain level. The impact of using these techniques is qualitatively and quantitatively described in terms of development effort and performance. Our results show that it is possible to introduce flexible automatic data-layout techniques in current parallel compiler technology, without sacrificing performance.
ISBN: 9781538637906 (print)
The rechargeable sensor network is promising for various applications. However, improving network performance is challenging, because the energy depletion of the sensor nodes will result in abnormal death of the nodes. In this paper, we propose a hybrid framework to model the abnormal death of the sensor nodes. Based on the Markov fluid queue theory, the model includes three parts, namely utilizing a Markov process to simulate the charging behavior, a queuing model to trace the working mechanism of rechargeable sensor nodes, and a continuous fluid process to indicate the energy level of sensor nodes. The numerical results show that our model can effectively predict the probability of abnormal death and stationary energy consumption of the sensor nodes.
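The interplay between a Markov charging process and a continuous energy level can be illustrated with a toy discrete-time simulation. This is only a sketch under stated assumptions, not the authors' model: the function name `simulate_node`, all parameter values, and the discrete time step are hypothetical, and the queuing part of the framework is omitted. A two-state Markov chain switches charging on and off, and the energy level rises or falls at a constant rate, clipped to the battery range.

```python
import random


def simulate_node(steps, capacity=10.0, charge_rate=1.5, drain_rate=1.0,
                  p_start=0.2, p_stop=0.3, seed=0):
    """Toy model of one rechargeable sensor node (all parameters hypothetical).

    A two-state Markov chain (charging / not charging) modulates a
    fluid-like energy level that is clipped to [0, capacity].
    Returns the fraction of steps the node spends at zero energy.
    """
    rng = random.Random(seed)
    charging = False
    energy = capacity
    dead_steps = 0
    for _ in range(steps):
        # Markov transition of the charging state.
        if charging:
            if rng.random() < p_stop:
                charging = False
        elif rng.random() < p_start:
            charging = True
        # Net rate: harvest minus consumption while charging, pure drain otherwise.
        rate = (charge_rate - drain_rate) if charging else -drain_rate
        energy = min(capacity, max(0.0, energy + rate))
        if energy == 0.0:
            dead_steps += 1
    return dead_steps / steps


if __name__ == "__main__":
    print("fraction of time with depleted battery:", simulate_node(100_000))
```

The returned fraction is a crude empirical stand-in for the abnormal-death probability that the abstract says the analytical model predicts; it makes no claim about the paper's numerical results.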
ISBN: 9781479927289 (print)
Transactional Memory (TM) is regarded by many researchers as a promising solution to ease parallel programming on multicore processors. This model provides the scalability of fine-grained locking while avoiding common issues of traditional mechanisms, such as deadlocks. Over almost twenty years of research, several TM systems and benchmarks have been proposed. However, TM is not yet widely adopted by the scientific community to develop parallel applications because of unanswered questions in the literature, such as "how to identify if a parallel application can exploit TM to achieve better performance?" or "what are the reasons of poor performances of some TM applications?". In this work, we contribute to answering those questions through a comparative evaluation of a set of TM applications on four different state-of-the-art TM systems. Moreover, we identify some of the most important TM characteristics that directly impact the performance of TM applications. Our results can be useful for identifying optimization opportunities.
The NuMesh system defines a high-speed communication substrate optimized for off-line routing. By determining possible communication paths at compile time, highly efficient hardware and software constructs can be exploited to yield superior network performance. These communication paths can be independently tuned to allow more utilized paths greater bandwidth. Although communication paths are scheduled, data need not be sent during every scheduled cycle. Flow-control protocols allow for empty communication cycles as well as for data backup in the network. Limited gate delays between NuMesh registers as well as single-cycle message transfers allow for a high clock frequency and low network latency. A highly pipelined architecture for this communication is presented and a mechanism for efficient flow-controlled communication is discussed. A unique communication protocol is presented and shown to provide single-cycle transfers between nodes. An overview of the necessary compiler support is also provided. Preliminary results and a description of the current hardware and software status are listed.
ISBN: 1601320841 (print)
In this paper, we propose a novel recovery approach to deal with lost and orphan messages in a distributed computing environment. The proposed scheme considers the complex issue of handling concurrent failures. It avoids the complex recovery scheme associated with the asynchronous approach: after the system recovers from a failure, processes can restart from their respective most recent checkpoints (thus avoiding the domino effect), irrespective of any lost or orphan messages among these checkpoints. It substantially reduces the re-computation time per process after a failure occurs.
ISBN: 1892512416 (print)
Traditional distributed simulations are small-scale and individual, whereas today's distributed simulations are large-scale and complex. Complex distributed simulations therefore require significant advances both in the underlying network technologies and in the ability of simulations to exploit new networking capabilities such as flexibility, reusability, interoperability, performance, and scalability. This paper first describes a design study for a chemical contamination diffusion model (CCD Model), a candidate model developed on HLA (High Level Architecture). Next, it discusses the modifications that were made to the RTI (Run-Time Infrastructure), to the simulations that comprised the federation, and to the FOM (Federation Object Model) to incorporate these capabilities in the operation of the simulation. We tested the ModCCD to estimate its effectiveness and found that it reduced network traffic. Our proposed model is more effective when federates (or the federation) contain large numbers of entities that have limited regions of interaction with each other in a distributed environment.
ISBN: 9781728165820 (print)
This work aims at distilling a systematic methodology to modernize existing sequential scientific codes with a limited re-designing effort, turning an old codebase into modern code, i.e., parallel and robust code. We propose an automatable methodology to parallelize scientific applications designed with a purely sequential programming mindset, thus possibly using global variables, aliasing, random number generators, and stateful functions. We demonstrate the methodology by way of an astrophysical application, where we model at the same time the kinematic profiles of 30 disk galaxies with a Monte Carlo Markov Chain (MCMC), which is sequential by definition. The parallel code exhibits a 12 times speedup on a 48-core platform.
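One common obstacle named above, a shared random number generator, can be illustrated with a small sketch. This is not the authors' methodology, and it shows only one generic refactoring pattern: each chain receives its own seeded generator instead of drawing from global state, so independent chains can be dispatched to a worker pool and still reproduce the serial result. The names `run_chain` and `run_all`, the standard normal target, and all parameter values are assumptions made for illustration.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def run_chain(seed, steps=2000, step_size=1.0):
    """Metropolis sampler for a standard normal target with a private RNG."""
    rng = random.Random(seed)  # per-chain generator replaces the global one
    x = 0.0
    total = 0.0
    for _ in range(steps):
        proposal = x + rng.uniform(-step_size, step_size)
        # Log acceptance ratio for a target density proportional to exp(-x^2 / 2).
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        total += x
    return total / steps  # sample mean of this chain


def run_all(seeds, workers=4):
    # Chains share no state, so they can be mapped onto a pool in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_chain, seeds))


if __name__ == "__main__":
    seeds = list(range(30))  # one chain per object, mirroring the 30 galaxies
    parallel = run_all(seeds)
    serial = [run_chain(s) for s in seeds]
    print("parallel matches serial:", parallel == serial)
```

Each chain remains sequential internally; only the set of independent chains is distributed. A thread pool keeps the sketch simple, but CPU-bound Python code would normally need `ProcessPoolExecutor`, which offers the same `map` interface, to obtain a real speedup.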