ISBN (print): 9781665435741
The linear structure of blockchain ensures data security and credibility, but it has also become the performance bottleneck of the entire system, limiting growth of the transaction processing rate. The inherent concurrency of directed acyclic graph (DAG) technology solves these problems, but it introduces new ones: block total ordering and ledger consistency. In this paper, we propose a Layer-based DAG (L-DAG) blockchain, which avoids complex total-ordering algorithms by keeping blocks ordered between and within layers during the generation process. We introduce a proportional-integral-derivative (PID) controller that dynamically adjusts the width of layers based on the in-degree and out-degree of blocks to achieve ledger consistency. We extend the Practical Byzantine Fault Tolerance (PBFT) protocol in parallel based on the L-DAG structure and successfully apply it to consortium blockchain scenarios. The L-DAG blockchain structure is implemented on Hyperledger Fabric. Experimental results show that as the number of consensus threads increases, the TPS of L-DAG-based PBFT grows with near-linear efficiency.
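The PID mechanism named in this abstract is standard control theory; a minimal sketch of how such a controller might adjust layer width is shown below. The gains, the setpoint, and the use of average block in-degree as the measured signal are illustrative assumptions, not the paper's actual parameters.

```python
# Sketch of a PID controller steering DAG layer width (hypothetical values).
class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint       # desired average in-degree
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# measurement = average in-degree of blocks in the most recent layer
controller = PID(kp=0.5, ki=0.1, kd=0.05, setpoint=4.0)
width = 8
adjustment = controller.update(measurement=6.0)   # in-degree too high
width = max(1, round(width + adjustment))         # narrow the next layer
```

Under this toy configuration an in-degree above the setpoint produces a negative correction, shrinking the next layer; repeated updates accumulate in the integral term so a persistent offset keeps pushing the width toward equilibrium.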
Several variants of parallel multipole-based algorithms have been implemented to further research in fields such as computational chemistry and astrophysics. We present a distributed-parallel implementation of a multipole-based algorithm that is portable to a wide variety of applications and parallel platforms. Performance data are presented for loosely coupled networks of workstations as well as for more tightly coupled distributed multiprocessors, demonstrating the portability and scalability of the application to a large number of processors.
ISBN (print): 9781728174457
In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. FPGAs can be used for communication as well as computation thanks to their I/O capabilities. The difficulty of FPGA development has kept HPC scientists from utilizing FPGAs for their applications; however, High Level Synthesis (HLS) allows them to do so at an acceptable cost. In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) that enables the high-speed interconnects of FPGAs to be utilized from OpenCL. CIRCUS fuses computation and communication into a single pipeline, hiding the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.
ISBN (print): 9780769548180
More and more network devices and chips apply multi-core architectures to meet increasing performance demands, but the lack of efficient program-level parallelism and workload allocation in packet processing systems greatly limits the utilization of multi-core architectures. In this paper, we propose a parallel packet processing runtime system and explore an affinity-based packet scheduler with the goals of improving load balancing and reducing cache misses. The system handles the allocation of processing tasks, which simplifies the implementation of new applications. The experimental results show that the task distributor and scheduler achieve a good compromise between load balancing and cache affinity in the parallel packet processing system.
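The load-balancing/cache-affinity compromise described in this abstract can be sketched with a toy dispatcher: packets of a flow stay on the core that last handled them unless that core's queue grows far beyond the shortest one. The flow key, queue model, and imbalance threshold are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of an affinity-based packet scheduler.
def dispatch(packets, n_cores, imbalance_limit=4):
    queues = [[] for _ in range(n_cores)]
    affinity = {}  # flow -> core that last handled it (cache affinity)
    for pkt in packets:
        flow = (pkt["src"], pkt["dst"])
        core = affinity.get(flow, hash(flow) % n_cores)
        shortest = min(range(n_cores), key=lambda c: len(queues[c]))
        if len(queues[core]) - len(queues[shortest]) > imbalance_limit:
            core = shortest  # migrate the flow to relieve the hot core
        affinity[flow] = core
        queues[core].append(pkt)
    return queues

# one heavy flow and one light flow across 2 cores
packets = [{"src": 1, "dst": 2}] * 10 + [{"src": 3, "dst": 4}] * 2
queues = dispatch(packets, n_cores=2)
```

Under this toy workload the heavy flow is migrated once its home queue exceeds the threshold, so the final queue lengths stay close while most packets of each flow remain on one core.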
ISBN (print): 9780769546766
Application portability between different multicore architecture-parallel programming paradigm/tool pairs is a big problem nowadays, often leading to a complete rewrite of an application when switching from one architecture-paradigm pair to another. This is caused by a wide variety of architectural properties requiring different optimization techniques for different architectures, typically hiding the essence of the (parallel) computation defined by the application. In this paper, we introduce the Multi-Core Portability Abstraction (MCPA), which simplifies the portability and implementation of parallel applications making use of shared memory. It abstracts away typical architecture-dependent effects caused by latency, synchronization, and partitioning, and acts as an executable intermediate abstraction/reference implementation as well as a tool for analyzing the intrinsic parallelism of an application and the relative goodness of architectures in executing it. We give a short application example with performance measurements.
ISBN (print): 9781665497473
To amortize the cost of MPI communications, distributed-parallel HPC applications can overlap network communications with computations in the hope of improving global application performance. When using this technique, computations and communications run at the same time, and computation usually also performs data movements of its own. Since the data for computations and for communications share the same memory system, memory contention may occur when computations are memory-bound and large messages are transmitted through the network at the same time. In this paper we propose a model to predict the memory bandwidth available to computations and to communications when they are executed side by side, according to data locality and taking contention into account. Elaborating the model allowed us to better understand where bottlenecks arise in the memory system and which strategies the memory system applies under contention. The model was evaluated on many platforms with different characteristics and showed an average prediction error below 4%.
ISBN (print): 9780769546766
In this paper we evaluate the performance of the Chapel programming language from the perspective of its language primitives and features, where the microbenchmarks are synthesized from our lessons learned in developing molecular dynamics simulation programs in Chapel. Experimental results show that most language building blocks have performance comparable to corresponding hand-written C code, while complex applications can achieve up to 70% of the performance of the C implementation. We identify several causes of overhead that can be further optimized by the Chapel compiler. This work not only helps Chapel users understand the performance implications of using Chapel, but also provides useful feedback for Chapel developers to build a better compiler.
ISBN (print): 9781665440660
One-way Wave Equation Migration (OWEM) is a classic seismic imaging method offering a good trade-off between quality and compute cost in most geological cases. In recent years, GPU-based heterogeneous architectures have gained popularity for seismic imaging. In this paper, we present a generic design for asynchronous processing and data management. By applying this design, we present an efficient GPU implementation of OWEM combining OpenACC and CUDA. Our approach improves upon classic designs by exploiting asynchronous compute and data transfer between CPU and GPU over high-speed NVLink, completely masking the cost of MPI communications and I/O. Using 3,018 GPUs, our fine-tuned OWEM can process 11,172 seismic shots in less than 75 minutes. By tuning CPU and GPU clock frequencies, we achieve around 30% energy savings with only a 4% loss of performance on the PANGEA III supercomputer. We believe our design, combined with the energy-aware tuning, will be beneficial to many GPU applications.
ISBN (print): 9781509021406
Faults are commonplace in large-scale systems. These systems experience a variety of faults: transient, permanent, and intermittent. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error, so that a recovery algorithm should be invoked, or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given a fault signature (a set of attributes comprising system and application state), we use machine learning to create a model which predicts whether a multi-bit permanent/transient main memory fault will likely result in an error. We present the design elements, such as the fault injection methodology for covering important data structures, the application and system attributes which should be used for learning the model, the supervised learning algorithms (and potentially ensembles), and important metrics. We use three applications - NWChem, LULESH and SVM - as examples to demonstrate the effectiveness of the proposed fault modeling methodology.
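The supervised-learning step this abstract describes can be illustrated with a toy classifier over labelled fault-injection runs. The k-nearest-neighbour model, the feature names, and the training data below are illustrative assumptions, not the paper's actual attributes or algorithm.

```python
# Hypothetical sketch: predict whether a memory fault leads to an
# application error from a fault signature (stdlib only).
from collections import Counter
import math

def knn_predict(train, signature, k=3):
    """train: list of (feature_vector, label); label True = caused an error."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], signature))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# illustrative features: (bits flipped, seconds since page access, page read-only)
train = [
    ((1, 0.1, 0), False), ((4, 0.2, 0), True),
    ((3, 0.1, 0), True),  ((1, 5.0, 1), False),
]
prediction = knn_predict(train, (4, 0.15, 0))  # -> True on this toy data
```

A real model would be trained on thousands of injected faults and would weigh attributes like data-structure type and time-to-next-read; the point here is only the shape of the signature-in, error-likelihood-out interface.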
ISBN (print): 9781538643686
Developments in machine learning and graph analytics have seen these fields establish themselves as pervasive in a wide range of applications. Non-volatile memory (NVRAM) offers higher capacity and retains information in case of power loss, and is therefore expected to be adopted for such applications. However, the asymmetric access latencies of NVRAM greatly degrade performance. The focus of this paper is to reduce the effect of memory access latency on emerging machine learning and graph workloads. The proposed mechanism uses software tagging of application data structures to control on-chip cache evictions based on data type and reuse patterns in an NVRAM-based multicore system. Learner models are developed that are capable of predicting cache allocations for a variety of machine learning and graph applications. The optimized learning model yields an average performance benefit of 21% compared to a system that does not optimize for the write latency challenges in NVRAM.