The KNN (k-nearest neighbor) algorithm is an important method that exhibits great performance in many fields. It is a commonly used step in graph convolutional networks (GCN) when graph structure is not available. Howeve...
ISBN: (Print) 9798400708930
The current sharding schemes show some shortcomings, such as poor performance in handling cross-shard transactions and inefficient transaction verification. A sharding blockchain system with output shard Batch processing and Parallel transaction Verification (BPPV-Chain) is proposed in this study. The core idea of the proposed scheme is as follows. The output shard verifies and processes the input availability certificates generated by the input shards in a batch manner, and it generates the transaction availability certificates for the different input shards when handling cross-shard transactions. The input shard unlocks or spends UTXOs according to the transaction availability certificates to complete the cross-shard collaboration. On that basis, the communication complexity of cross-shard transactions is reduced. Moreover, a parallel transaction verification scheme is presented to increase the efficiency of transaction verification. In this scheme, UTXOs are verified serially to prevent double spending, while the signatures and values of multiple transactions are checked in parallel. As indicated by the experimental results, BPPV-Chain outperforms existing sharding blockchain systems, especially when the percentage of cross-shard transactions does not exceed 80%. Furthermore, BPPV-Chain ensures linear growth of throughput as the number of shards increases, confirming its scalability.
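The split between stateful and stateless checks in the parallel verification scheme described above can be illustrated with a short sketch. The data structures and thread-pool approach below are illustrative assumptions, not BPPV-Chain's actual implementation: signature and value checks run in parallel because they touch no shared state, while UTXO spending is applied serially to rule out double spends.

```python
# Hypothetical sketch of serial UTXO checking plus parallel stateless checks;
# names and data structures are illustrative, not taken from BPPV-Chain itself.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Tx:
    tx_id: str
    inputs: list          # identifiers of the UTXOs this transaction spends
    input_value: int
    output_value: int
    signature_ok: bool    # stand-in for a real signature check

def check_stateless(tx: Tx) -> bool:
    """Signature and value checks touch no shared state, so they can run in parallel."""
    return tx.signature_ok and tx.input_value >= tx.output_value

def verify_batch(txs: list, utxo_set: set) -> list:
    """Verify a batch: stateless checks in parallel, UTXO spending serially."""
    with ThreadPoolExecutor() as pool:
        stateless_ok = list(pool.map(check_stateless, txs))

    accepted = []
    for tx, ok in zip(txs, stateless_ok):
        if not ok:
            continue
        # Serial pass over the shared UTXO set prevents double spending.
        if all(u in utxo_set for u in tx.inputs):
            for u in tx.inputs:
                utxo_set.remove(u)
            accepted.append(tx.tx_id)
    return accepted

# Usage: the second transaction tries to double-spend "u1" and is rejected.
utxos = {"u1", "u2"}
batch = [Tx("t1", ["u1"], 10, 9, True), Tx("t2", ["u1"], 10, 8, True)]
print(verify_batch(batch, utxos))   # -> ['t1']
```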
ISBN: (Print) 9783031693434; 9783031693441
Modern organizations place increasing emphasis on production job scheduling in order to keep their performance indicators stable. This research addresses the dynamic scheduling of jobs in a production system with a single input queue and parallel machines. The processing times and the times between job arrivals are assumed to be probabilistic. Jobs belong to different classes, and a due date is assigned to each job at the time of its arrival. A machine needs to be set up every time it switches production from one job class to another. This article considers a set of alternative priority rules for dynamic job scheduling using discrete-event simulation. The priority heuristics are compared with respect to several performance metrics in a series of simulation experiments. The behaviour of the scheduling heuristics is assessed under the influence of various parameters. Moreover, managerial insights for scheduling decisions in industry are offered based on the numerical results.
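A minimal sketch of how such alternative priority rules could be expressed and compared follows. The rule names (FIFO, SPT, EDD, and a setup-avoiding rule) and the job fields are assumptions for illustration; the paper's actual rule set and simulation model may differ.

```python
# Illustrative dispatching by alternative priority rules for a single input queue.
import random
from dataclasses import dataclass

@dataclass
class Job:
    job_class: str
    arrival: float
    processing_time: float
    due_date: float

PRIORITY_RULES = {
    "FIFO": lambda job, last_class: job.arrival,
    "SPT":  lambda job, last_class: job.processing_time,   # shortest processing time
    "EDD":  lambda job, last_class: job.due_date,           # earliest due date
    # Prefer jobs of the class already set up on the machine to avoid a setup.
    "SIMSET": lambda job, last_class: (job.job_class != last_class, job.due_date),
}

def next_job(queue, rule, last_class):
    """Pick the next job from the input queue according to the chosen rule."""
    return min(queue, key=lambda j: PRIORITY_RULES[rule](j, last_class))

# Usage: generate a few jobs with probabilistic attributes and dispatch one per rule.
random.seed(0)
queue = [Job(random.choice("AB"), 0.0, random.expovariate(1.0),
             random.uniform(5, 20)) for _ in range(5)]
for rule in PRIORITY_RULES:
    print(rule, next_job(queue, rule, last_class="A"))
```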
ISBN: (Print) 9798350386783; 9798350386776
The popularity of multicore processors and the rise of High Performance Computing as a Service (HPCaaS) have made parallel programming essential to fully utilize the performance of multicore systems. OpenMP, a widely adopted shared-memory parallel programming model, is favored for its ease of use. However, assisting and accelerating the automation of its parallelization remains challenging. Although existing automation tools such as Cetus and DiscoPoP simplify parallelization, they still have limitations when dealing with complex data dependencies and control flows. Inspired by the success of deep learning in the field of Natural Language Processing (NLP), this study adopts a Transformer-based model to tackle the problem of automatically parallelizing code with OpenMP directives. We propose a novel Transformer-based multimodal model, ParaMP, to improve the accuracy of OpenMP directive classification. The ParaMP model not only takes into account the sequential features of the code text but also incorporates the code's structural features, enriching the model's input features by representing the Abstract Syntax Trees (ASTs) corresponding to the code in the form of binary trees. In addition, we built the BTCode dataset, which contains a large number of C/C++ code snippets and their corresponding simplified AST representations, to provide a basis for model training. Experimental evaluation shows that our model outperforms other existing automated tools and models in key performance metrics such as F1 score and recall. This study demonstrates a significant improvement in the accuracy of OpenMP directive classification achieved by combining the sequential and structural features of code text, providing valuable insight into applying deep learning techniques to programming tasks.
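The abstract does not spell out the exact AST-to-binary-tree transformation ParaMP uses, but one common way to encode an n-ary AST as a binary tree is the left-child/right-sibling scheme sketched below; the node labels and the tiny example AST are illustrative assumptions.

```python
# Left-child/right-sibling encoding of an n-ary AST as a binary tree (sketch).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ASTNode:
    label: str
    children: List["ASTNode"]

@dataclass
class BinNode:
    label: str
    left: Optional["BinNode"] = None    # first child
    right: Optional["BinNode"] = None   # next sibling

def to_binary(node: ASTNode) -> BinNode:
    b = BinNode(node.label)
    prev = None
    for child in node.children:
        cb = to_binary(child)
        if prev is None:
            b.left = cb           # first child hangs on the left
        else:
            prev.right = cb       # later children chain as right siblings
        prev = cb
    return b

# Usage: a tiny AST for a loop body such as "for (...) a[i] = b[i] + c[i];"
ast = ASTNode("ForStmt", [
    ASTNode("Init", []), ASTNode("Cond", []), ASTNode("Inc", []),
    ASTNode("Assign", [ASTNode("a[i]", []), ASTNode("b[i]+c[i]", [])]),
])
print(to_binary(ast).left.label)   # -> "Init"
```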
Monte Carlo (MC) methods, due to their strong geometric simulation capabilities, comprehensive physical modeling, and minimal simulation approximation, are widely applied in areas such as radiation transport, physical...
ISBN: (Print) 9783031751691; 9783031751707
Adders are essential components of modern digital circuits, and their primary design goal is to achieve high speed. However, power consumption and chip area are also important considerations in modern circuit design. Optimizing digital adder performance plays a crucial role in enhancing the speed of binary operations within complex circuits. Various architectures address the carry propagation bottleneck, each with its own strengths and weaknesses. Choosing the most appropriate architecture depends on the specific application requirements, ensuring optimal performance within the available resource constraints. This paper provides a comprehensive analysis of various adder topologies and their performance characteristics. By carefully considering the trade-offs between delay, power consumption, and area, engineers can choose the optimal architecture for their specific application requirements, leading to significant improvements in digital system performance and efficiency. The analyzed adder topologies include the Ripple Carry Adder (RCA), Carry Lookahead Adder (CLA), Carry Skip Adder (CSK), Carry Select Adder (CSLA), Carry Increment Adder (CIA), Brent-Kung Adder (BKA), and Kogge-Stone Adder. The analysis is conducted using HDL on the Xilinx ISE 14.7 platform.
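The carry propagation bottleneck these topologies attack can be seen in a small bit-level model. The Python sketch below is an illustration, not the HDL used in the paper's evaluation: it contrasts a ripple-carry adder, where each carry waits on the previous one, with a carry-lookahead adder, where generate/propagate terms let every carry be derived from the initial carry.

```python
# Bit-level contrast of ripple-carry vs. carry-lookahead addition (LSB first).
def ripple_carry_add(a_bits, b_bits):
    """Each carry depends on the previous one: delay grows linearly with width."""
    carry, s = 0, []
    for a, b in zip(a_bits, b_bits):
        s.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return s, carry

def carry_lookahead_add(a_bits, b_bits):
    """Generate/propagate terms let all carries be computed from the initial carry."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate
    carries = [0]
    for i in range(len(a_bits)):
        # c[i+1] = g[i] OR (p[i] AND c[i]); in hardware these expressions are
        # flattened into wide two-level logic rather than evaluated sequentially.
        carries.append(g[i] | (p[i] & carries[i]))
    s = [p[i] ^ carries[i] for i in range(len(a_bits))]
    return s, carries[-1]

# Usage: 4-bit add of 11 + 6 = 17; both adders produce sum [1,0,0,0], carry 1.
a, b = [1, 1, 0, 1], [0, 1, 1, 0]
print(ripple_carry_add(a, b), carry_lookahead_add(a, b))
```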
ISBN: (Print) 9783031562075; 9783031562082
The paper is devoted to an analysis and comparison of the development of new high-performance computers and of the improvement and development of new, more reliable versions of the Danish Eulerian model for the computer study of the transport of air pollutants over Europe and the surrounding areas, the study of some economic and agricultural problems, regional and global climate change, etc.
ISBN: (Print) 9781665469586
In recent years, IoT devices have become widespread, and energy-efficient coarse-grained reconfigurable architectures (CGRAs) have attracted attention. CGRAs comprise several processing units called processing elements (PEs) arranged in a two-dimensional array. The operations of the PEs and the interconnections between them are adaptively changed depending on the target application, which contributes to higher energy efficiency compared to general-purpose processors. The application kernel executed on a CGRA is represented as a data flow graph (DFG), and CGRA compilers are responsible for mapping the DFG onto the PE array. Thus, mapping algorithms significantly influence the performance and power efficiency of CGRAs as well as the compile time. This paper proposes POCOCO, a compiler framework for CGRAs that can use pre-optimized subgraph mappings, which reduces the compiler optimization task. To leverage the subgraph mappings, we extend an existing mapping method based on a genetic algorithm. Experiments on three architectures demonstrated that the proposed method reduces the optimization time by 48% on average for the best case of the three architectures.
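One way pre-optimized subgraph mappings could be fed into a genetic-algorithm mapper is to seed the initial population with library placements, as in the hypothetical sketch below; the library contents, matching rule, and array dimensions are illustrative assumptions, not the actual POCOCO implementation.

```python
# Hypothetical seeding of a GA population for DFG-to-PE-array mapping using a
# library of pre-optimized subgraph placements (illustration only).
import random

SUBGRAPH_LIBRARY = {
    # frozenset of DFG node labels -> pre-optimized relative PE placement
    frozenset({"load", "mul", "add"}): {"load": (0, 0), "mul": (0, 1), "add": (1, 1)},
}

def seed_population(dfg_nodes, rows, cols, size=20):
    """Build an initial GA population, reusing library placements when they match."""
    population = []
    for _ in range(size):
        individual = {}
        for subgraph, placement in SUBGRAPH_LIBRARY.items():
            if subgraph <= set(dfg_nodes):
                # Drop the pre-optimized placement at a random offset on the array.
                dr, dc = random.randrange(rows), random.randrange(cols)
                for node, (r, c) in placement.items():
                    individual[node] = ((r + dr) % rows, (c + dc) % cols)
        # Remaining nodes get random PEs; crossover/mutation would refine them.
        for node in dfg_nodes:
            individual.setdefault(node, (random.randrange(rows), random.randrange(cols)))
        population.append(individual)
    return population

# Usage: seed a small population for a four-node DFG on a 4x4 PE array.
print(seed_population(["load", "mul", "add", "store"], rows=4, cols=4, size=2))
```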
ISBN: (Print) 9798350337662
CPU-based inference can be deployed as an alternative to off-chip accelerators. In this context, emerging vector architectures are a promising option owing to their high efficiency. Yet the large design space of convolutional algorithms and hardware implementations makes the selection of design options challenging. In this paper, we present our ongoing research into co-designing future vector architectures for CPU-based Convolutional Neural Network (CNN) inference, focusing on the im2col+GEMM and Winograd kernels. Using the Gem5 simulator, we explore the impact of several hardware microarchitectural features, including (i) vector lanes, (ii) vector lengths, (iii) cache sizes, and (iv) options for integrating the vector unit into the CPU pipeline. In the context of im2col+GEMM, we study the impact of several BLIS-like algorithmic optimizations, such as (1) utilization of vector registers, (2) loop unrolling, (3) loop reordering, (4) manual vectorization, (5) prefetching, and (6) packing of matrices, on the RISC-V Vector Extension and ARM-SVE ISAs. We use the YOLOv3 and VGG16 network models for our evaluation. Our co-design study shows that BLIS-like optimizations are not beneficial to all types of vector microarchitectures. We additionally demonstrate that longer vector lengths (of at least 8192 bits) and larger caches (of 256 MB) can boost performance by 5x with our optimized CNN kernels, compared to a 512-bit vector length and 1 MB of L2 cache. In the context of Winograd, we present our novel approach of inter-tile parallelization across the input/output channels, using 8x8 tiles per channel to vectorize the algorithm on vector-length-agnostic (VLA) architectures. Our method exploits longer vector lengths and offers high memory reuse, resulting in performance improvements of up to 2.4x for non-strided convolutional layers with 3x3 kernel size, compared to our optimized im2col+GEMM approach on the Fujitsu A64FX processor. Our co-design study furthermore reveals that W
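To make the im2col+GEMM formulation referenced above concrete, the NumPy sketch below shows how a convolution is unfolded into a single matrix multiplication; shapes and parameter names are illustrative, and this is not the paper's vectorized kernel.

```python
# Minimal im2col + GEMM formulation of a 2-D convolution (single image, no padding).
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix of patches."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * out_w + j] = patch.reshape(-1)
    return cols, out_h, out_w

def conv2d_im2col(x, weights, stride=1):
    """Convolution as GEMM: (out_ch, C*kh*kw) @ (C*kh*kw, out_h*out_w)."""
    out_ch, c, kh, kw = weights.shape
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    return (weights.reshape(out_ch, -1) @ cols).reshape(out_ch, out_h, out_w)

# Usage: 3x3 kernels over a small 3-channel input, as in a YOLOv3/VGG16-style layer.
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(16, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)   # -> (16, 6, 6)
```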
This research work presents a decentralized parallel blockchain-based agricultural product traceability system, which aims to enhance information security and data model efficiency, and achieve real-time tracking of a...