ISBN: 9781479987818 (print)
The ability to reliably distribute simulations across a distributed system and seamlessly integrate them as a workflow, regardless of their level of abstraction, is critical to improving the quality of product manufacturing. This paper presents the DIVIDER architecture for managing and maintaining real-time performance simulations integrated through SOAs. The described approach captures features present in complex workflow patterns, such as asynchronous arbitrary cycles, and estimates the worst-case execution time in the context of the interfering execution environment.
ISBN: 0769517722 (print)
Recent developments in the international arena have meant that the technology is now mature enough to bring together the elements required to implement a grid computing facility. This paper examines the requirements and applications for an eScience infrastructure, with particular reference to developments in Europe.
ISBN: 9781509012336 (print)
For a long time the Instruction Set Architecture (ISA) has been the firm contract between software and hardware. This firm contract plays an important role by decoupling the development of software from hardware micro-architectural features, enabling both to evolve independently. Nonetheless, it also condemns the ISA to become larger, more cluttered and less efficient as new instructions are incorporated over the years and deprecated instructions are left untouched to preserve legacy compatibility. In this work we propose OpenISA, a flexible ISA that enables both the software and the hardware to evolve independently, and discuss how OpenISA 1.0 was designed to enable efficient software emulation of OpenISA on alien ISAs, which is key to freeing the user from hardware lock-in. Our results show that software compiled to OpenISA can later be emulated on x86 and ARM processors with very little overhead, achieving near-native performance: the overhead is under 10% for the majority of programs.
ISBN: 9781538658154 (print)
Cloud computing is continuously increasing in popularity as key features such as scalability, pay-per-use and availability continue to evolve. It is also becoming a competitive platform for running high-performance computing (HPC) and parallel applications due to the increasing performance of virtualized, highly available instances. However, migrating HPC applications to the cloud still requires native fault-tolerant solutions to fully leverage cloud features and maximize resource utilization at the best cost, particularly for long-running parallel applications where faults can cause invalid states or data loss. Such faults require re-executing applications, which increases completion time and cost. We propose Resilience as a Service (RaaS), a fault-tolerant framework for HPC applications running in the cloud. In this paper the RADIC architecture (Redundant Array of Distributed Independent Fault Tolerance Controllers) is used to provide clouds with a highly available, distributed and scalable fault-tolerant service. The paper explores how traditional HPC protection and recovery mechanisms must be redesigned to natively leverage cloud properties and the multiple alternatives they offer for implementing rollback recovery protocols using virtual machines, containers, object and block storage, or database services. Results show that RaaS restores and completes the application execution using available resources while reducing overhead up to 8% for different fault-tolerant configuration alternatives.
ISBN: 9781467308243; 9781467308267 (print)
Lazy hardware transactional memory has been shown to be more efficient at extracting available concurrency than its eager counterpart. However, it poses scalability challenges at commit time, as the existence of conflicts among concurrent transactions is not known prior to commit. Non-conflicting transactions may have to wait before committing, severely affecting performance in certain workloads. Early conflict detection can be employed to allow such transactions to commit simultaneously. In this paper we show that the potential of this technique has not yet been fully utilized, with design choices in prior work severely burdening common-case transactional execution to avoid some relatively uncommon correctness concerns. The paper quantifies the severity of the problem and develops pi-TM, an early conflict detection, lazy conflict resolution design. This design highlights how, with modest extensions to existing directory-based coherence protocols, information regarding possible conflicts can be effectively used to achieve true parallelism at commit without burdening the common case. We leverage the observation that contention is typically seen on only a small fraction of the shared data accessed by coarse-grained transactions. Pessimistic invalidation of such lines when committing or aborting therefore enables fast common-case execution. Our results show that pi-TM performs consistently well and, in particular, far better than previous work on early conflict detection in lazy HTM. We also identify a pathological scenario that lazy designs with early conflict detection suffer from and propose a simple hardware workaround to sidestep it.
ISBN: 0897913949 (print)
An analytic model is introduced that not only explains the behavior seen in small-scale simulation studies, but also makes it possible to extrapolate forward to evaluate the efficiency of limited pointers directories in large-scale systems. The model shows that miss rates inherent to invalidation-based consistency schemes are relatively high (typically 10% to 60%) for actively shared data, across a variety of workloads. It is found that limited pointers schemes that resort to broadcasting invalidations when the pointers are exhausted perform very poorly in large-scale machines, even if there are sufficient pointers most of the time. On the other hand, no-broadcast strategies that limit the degree of caching to the number of pointers in an entry have only a modest impact on the cache miss rate and network traffic under a wide range of workloads, including those in which data blocks are actively accessed by a large number of processors.
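To make the two overflow policies concrete, the following is a minimal Python sketch of a single limited-pointer directory entry. It is an illustration under stated assumptions, not the paper's model: the class name `LimitedPointerEntry`, its fields, and the oldest-first choice of which sharer to evict in the no-broadcast case are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LimitedPointerEntry:
    """One directory entry with a fixed number of sharer pointers (hypothetical sketch)."""
    num_pointers: int
    num_processors: int
    broadcast_on_overflow: bool
    sharers: list = field(default_factory=list)
    overflowed: bool = False

    def add_sharer(self, proc):
        """Record a read by `proc`; return the processors invalidated to make room."""
        if proc in self.sharers or self.overflowed:
            return []
        if len(self.sharers) < self.num_pointers:
            self.sharers.append(proc)
            return []
        if self.broadcast_on_overflow:
            # Broadcast policy: stop tracking sharers precisely.
            self.overflowed = True
            return []
        # No-broadcast policy: cap the degree of caching at num_pointers.
        # Evicting the oldest sharer is an assumption made for this sketch.
        victim = self.sharers.pop(0)
        self.sharers.append(proc)
        return [victim]

    def write(self, writer):
        """Return the processors that receive an invalidation when `writer` writes."""
        if self.overflowed:
            targets = [p for p in range(self.num_processors) if p != writer]
        else:
            targets = [p for p in self.sharers if p != writer]
        self.sharers = [writer]
        self.overflowed = False
        return targets
```

With two pointers and eight processors, a third reader makes the broadcast entry send seven invalidations on the next write, while the no-broadcast entry invalidates one sharer at read time and a single remaining sharer at write time. The broadcast cost grows with machine size, whereas the no-broadcast cost stays bounded by the pointer count.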
ISBN: 9781509012336 (print)
In this paper we present VCube-PS, a topic-based Publish/Subscribe system built on top of a virtual hypercube-like topology. Membership information and published messages to subscribers (members) of a topic group are broadcast over dynamically built spanning trees rooted at the message's source. For a given topic, delivery of published messages respects causal order. Performance results of experiments conducted on the PeerSim simulator confirm the efficiency of VCube-PS in terms of scalability, latency, and the number and size of messages when compared to an approach based on a single-rooted tree that is not built dynamically.
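The idea of a spanning tree rooted at the message's source can be sketched on a plain, fault-free hypercube. The Python below is a minimal illustration only; it is not the VCube-PS algorithm, which works on a virtual hypercube-like topology with group membership. The function names `children` and `build_tree` are hypothetical.

```python
def children(node, source, dim):
    """Children of `node` in a binomial spanning tree rooted at `source`.

    Node identifiers are integers in [0, 2**dim). Working on the relative
    identifier (node XOR source) lets any node act as the root.
    """
    rel = node ^ source
    first = rel.bit_length()  # only set bits above the highest bit already set
    return [(rel | (1 << k)) ^ source for k in range(first, dim)]


def build_tree(source, dim):
    """Return a {node: parent} map by expanding the tree level by level."""
    parent = {source: None}
    frontier = [source]
    while frontier:
        next_frontier = []
        for node in frontier:
            for child in children(node, source, dim):
                parent[child] = node
                next_frontier.append(child)
        frontier = next_frontier
    return parent


tree = build_tree(source=5, dim=3)
```

Because the tree is derived from the source identifier, each publisher gets its own tree. Every node is reached exactly once over hypercube edges, so a broadcast in a cube of 2^dim nodes takes 2^dim - 1 messages and at most dim hops.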
ISBN: 9780769534237 (print)
This work presents an implementation of the Neocognitron neural network using a high-performance computing architecture based on a GPU (Graphics Processing Unit). Neocognitron is an artificial neural network, proposed by Fukushima and collaborators, consisting of several hierarchical stages of neuron layers organized in two-dimensional matrices called cellular planes. For the high-performance computation of a face recognition application using Neocognitron, CUDA (Compute Unified Device Architecture) was used as the API (Application Programming Interface) between the CPU and the GPU, an NVIDIA GeForce 8800 GTX with 128 ALUs. The face image databases used were a face database created at UFSCar and the CMU-PIE (Carnegie Mellon University Pose, Illumination and Expression) database. Load balancing was achieved by using cellular connections as threads organized in blocks, following the CUDA development philosophy. The results showed the feasibility of this type of device as a massively parallel data processing tool, and that the smaller the granularity and the data dependency of the parallel processing, the better its performance.
ISBN: 9781467308243; 9781467308267 (print)
Phase change memory (PCM) has recently emerged as a promising technology to meet the fast-growing demand for large-capacity memory in modern computer systems. In particular, multi-level cell (MLC) PCM, which stores multiple bits in a single cell, offers high density with low per-byte fabrication cost. However, despite many advantages, such as good scalability and low leakage, PCM suffers from exceptionally slow write operations, which makes it challenging to integrate into the memory hierarchy. In this paper, we propose architectural innovations to improve the access time of MLC PCM. Due to cell process variation, composition fluctuation and the relatively small differences among resistance levels, MLC PCM typically employs an iterative write scheme to achieve precise control, which suffers from large write access latency. To address this issue, we propose write truncation (WT) to reduce the number of write iterations with the assistance of an extra error correction code (ECC). We also propose form switch (FS) to reduce the storage overhead of the ECC. By storing highly compressible lines in SLC form, FS improves read latency as well. Our experimental results show that WT and FS improve the effective write/read latency by 57%/28%, respectively, and achieve a 26% performance improvement over the state of the art.
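The intuition behind write truncation can be sketched with a toy model of an iterative program-and-verify write. The Python below is an illustration under simplifying assumptions, not the paper's mechanism: it assumes each cell needs a known number of rounds, and that every cell left unfinished counts as exactly one error the ECC can correct. The function name `iterations_until_done` and the sample data are hypothetical.

```python
def iterations_until_done(iters_needed, ecc_correctable=0):
    """Rounds of program-and-verify before a line write is declared complete.

    iters_needed[i] is the (hypothetical) number of rounds cell i needs to
    reach its target resistance. Without truncation (ecc_correctable=0) the
    write waits for the slowest cell. With truncation it stops as soon as
    the number of unfinished cells fits within the ECC correction budget.
    """
    if not iters_needed:
        raise ValueError("a line must contain at least one cell")
    rounds = 0
    while True:
        rounds += 1
        unfinished = sum(1 for n in iters_needed if n > rounds)
        if unfinished <= ecc_correctable:
            return rounds


cells = [1, 2, 2, 3, 8, 2, 1, 9]               # two slow outlier cells
baseline = iterations_until_done(cells)         # waits for the slowest cell
truncated = iterations_until_done(cells, 2)     # leaves two cells to the ECC
```

In this toy line most cells finish within three rounds, so tolerating the two outliers cuts the write from nine rounds to three. The figures come from the made-up sample data and say nothing about the latencies reported in the paper.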
ISBN: 9781509012336 (print)
The development of new technologies is ushering in a new era characterized, among other factors, by the rise of sophisticated mobile devices containing CPUs and GPUs. This emerging scenario of heterogeneous mobile architectures raises challenging issues regarding the use of the available computing resources. Such issues are mainly related to the intrinsic complexity of coordinating these processors in order to increase application performance. To address them, this paper presents a high-level programming model for implementing parallel patterns that can be executed in a coordinated way by heterogeneous mobile architectures. A comparative analysis of performance and programming complexity is presented, contrasting code generated automatically from the proposed programming model with low-level, manually optimized implementations.