ISBN: (print) 9781665427036
Lattice-based cryptography is a candidate for post-quantum cryptosystems, offering key exchange, digital signature, and encryption functionality. The Number Theoretic Transform (NTT) can improve the performance of these functions wherever polynomials must be multiplied. NTT reduces the cost of multiplication by transforming the polynomials into the spectral domain, multiplying them point-wise, and then inverting the result back to the original domain. Because this technique is used in a wide range of computing systems, it is important to optimize it. In this paper we study the feasibility of using OpenCL, a portable framework, to implement a parallelized version of NTT that can be deployed on heterogeneous platforms, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). We measure the performance of our implementation on a GPU and evaluate when and where such a deployment is beneficial. Our results show that the proposed parallel implementation is a viable acceleration approach for lattice-based cryptography solutions.
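The transform, point-wise multiply, inverse-transform flow described in this abstract can be illustrated with a small sketch. This is not the paper's OpenCL implementation: the modulus `Q = 17`, the length `N = 8`, the naive O(N^2) transform, and all function names are assumptions chosen for readability. The sketch also multiplies modulo x^N - 1 (cyclic convolution), whereas lattice schemes commonly work modulo x^N + 1.

```python
# Toy parameters (assumptions for illustration only).
Q = 17                          # small prime modulus
N = 8                           # transform length; N divides Q - 1
ROOT = pow(3, (Q - 1) // N, Q)  # 3 is a primitive root mod 17, so this has order N


def ntt(a, root):
    """Naive O(N^2) transform: A[k] = sum_j a[j] * root^(j*k) mod Q."""
    return [sum(a[j] * pow(root, j * k, Q) for j in range(N)) % Q
            for k in range(N)]


def intt(spectrum):
    """Inverse transform: use the inverse root, then scale by N^-1 mod Q."""
    inv_root = pow(ROOT, -1, Q)  # modular inverse (Python 3.8+)
    inv_n = pow(N, -1, Q)
    return [(x * inv_n) % Q for x in ntt(spectrum, inv_root)]


def poly_mul_ntt(a, b):
    """Multiply two length-N polynomials modulo (x^N - 1, Q)."""
    fa, fb = ntt(a, ROOT), ntt(b, ROOT)
    # Point-wise products are independent, which is what makes this step
    # a natural fit for parallel hardware.
    fc = [(x * y) % Q for x, y in zip(fa, fb)]
    return intt(fc)


def poly_mul_schoolbook(a, b):
    """Reference O(N^2) cyclic multiplication, used to check the result."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            c[(i + j) % N] = (c[(i + j) % N] + a[i] * b[j]) % Q
    return c
```

Calling `poly_mul_ntt(a, b)` on two coefficient lists of length 8 should agree with `poly_mul_schoolbook(a, b)`. A practical implementation would replace the naive transform with an O(N log N) butterfly network.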
Node sizes in multicore clusters are becoming larger, so applications should exploit the shared memory inside a node, to potentially reduce communication latencies compared to network communications. The Message Passi...
Virtual screening is an early stage of the drug discovery process that selects the most promising candidates. In the urgent computing scenario it is critical to find a solution in a short time frame. In this paper, we...
ISBN: (print) 9781665414746
This paper discusses the optimization problem of parallelizing program code in a heterogeneous system. The optimization problem and its constraints are defined. The authors present the main approaches to finding the best solution. The special aspects of the optimization problem in heterogeneous systems are discussed, and heuristics that address these aspects are proposed.
ISBN: (print) 9781665424394
Reductions are a common pattern in parallel programming, and every parallel programming language or framework has its own reduction abstraction with its own idiosyncrasies. These abstractions differ not only in their syntax, but also in their semantics and their ability to express certain types of reduction. Such differences may prevent specific combinations of abstraction and hardware platform from reaching high levels of performance, with consequences for portability and programmer productivity. In this paper, we present a set of representative reduction benchmarks to explore the capabilities of five contemporary programming languages and frameworks - OpenMP, Kokkos, RAJA, SYCL, and the oneAPI DPC++ Library (oneDPL) - across a variety of hardware platforms, including CPUs and GPUs from multiple vendors. We discuss the advantages and disadvantages of each reduction abstraction, and conclude with recommendations to improve their design and implementation.
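As a rough illustration of why the expressiveness of a reduction abstraction matters, the sketch below performs a "min with location" reduction, which needs a user-defined combiner over (value, index) pairs rather than a built-in sum or min. It is not taken from the paper's benchmarks: the function names and the chunking scheme are hypothetical, and the chunks are processed sequentially here to stand in for what parallel workers would do.

```python
from functools import reduce


def combine(x, y):
    """Associative, commutative combiner for (value, index) pairs.

    Keeps the smaller value; ties go to the lower index, so the result
    does not depend on how the input was split into chunks.
    """
    return min(x, y)


def chunked_minloc(data, num_chunks):
    """Reduce each chunk to a partial result, then reduce the partials.

    This mirrors the usual two-phase shape of a parallel reduction; the
    per-chunk loop is where a framework would run workers concurrently.
    Assumes data is non-empty and num_chunks >= 1.
    """
    size = -(-len(data) // num_chunks)  # ceiling division
    partials = []
    for start in range(0, len(data), size):
        pairs = [(v, i) for i, v in enumerate(data[start:start + size], start)]
        partials.append(reduce(combine, pairs))
    return reduce(combine, partials)
```

Because `combine` is associative and commutative, `chunked_minloc(data, k)` returns the same (value, index) pair for any chunk count `k`. An abstraction that only supports built-in operators on scalars could not express this pattern directly.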
ISBN: (print) 9781728192017
We propose a dense tensor accelerator called VectorMesh, a scalable, memory-efficient architecture that can support a wide variety of DNN and computer vision workloads. Its building block is a tile execution unit (TEU), which includes dozens of processing elements (PEs) and SRAM buffers connected through a butterfly network. A mesh of FIFOs between the TEUs facilitates data exchange between tiles and promotes local data to global visibility. According to the roofline model, our design performs better than state-of-the-art architectures for CNN, GEMM, and spatial matching algorithms. It can reduce global buffer and DRAM fetches by 2-22 times and up to 5 times, respectively.
ISBN: (print) 9781665495035
Pseudocode is a valuable resource used in programming education, software development, and scientific reports for designing algorithmic solutions, as it is easy to write, understand, and modify. Because pseudocode cannot readily be tested, it is difficult to determine whether a pseudocode solution is correct. Software tools are needed for this purpose, e.g., to help professors find race conditions, deadlocks, or starvation issues while grading students' concurrent pseudocode. Although there are various tools for working with sequential pseudocode, there is a lack of tools for concurrent pseudocode. This shortage motivated us to determine the state of the art in notations and tools for testing concurrent and distributed pseudocode. We conducted a systematic literature review and found only a few related publications, confirming that this topic is understudied. We found and report on five software tools capable of interpreting concurrent or distributed pseudocode, and two software tools capable of verifying its correctness. In addition, we found no other literature review on this topic, which confers novelty on the contributions of this work.
ISBN: (print) 9781665440899
The PVS search function, a mainstream and efficient algorithm, is widely used in many kinds of chess programs. We applied a parallel search function based on PVS and improved the running speed of the program. We also conducted research and experiments on the evaluation function for Amazon chess, which yielded a set of usable Amazon evaluation functions and parameter adjustment results for reference.
ISBN: (print) 9781665411400
Task-based programming models promise improved communication performance for irregular, fine-grained, and load-imbalanced applications. They do so by relaxing some of the messaging semantics of stricter models and taking advantage of those at the lower levels of the software stack. For example, while MPI's two-sided communication model guarantees in-order delivery, requires matching sends to receives, and has the user schedule communication, task-based models generally favor the runtime system scheduling all execution based on the dependencies and message deliveries as they happen. The messaging semantics are critical to enabling high performance. In this paper, we build on previous work that added zero-copy semantics to Converse/LRTS. We examine the messaging semantics of Charm++ as they relate to large message buffers, identify shortcomings, and define new communication APIs to address them. Our work enables in-place communication semantics in the context of point-to-point messaging, broadcasts, transmission of read-only variables at program startup, and migration of chares. We showcase the performance of our new communication APIs using benchmarks for Charm++ and Adaptive MPI, which show nearly 90% latency improvement and 2x lower peak memory usage.
ISBN: (print) 9781665416344
This research presents some of the critical information required to understand the concept of parallel programming and the implementation of OpenMP in parallel programming. Parallelism is the preferred tool for speeding up an algorithm, as demonstrated by the evolution of computing architectures (multi-core and many-core) towards a greater number of processing cores. The report focuses on the OpenMP parallel programming model and examines its implementation and features. The OpenMP model is increasingly preferred for its ability to deliver real-time processing, thereby meeting system performance requirements. Furthermore, we study the use of OpenMP to improve the efficiency of 3D discontinuous deformation analysis (3D-DDA) for expansive simulation using parallel block Jacobi (BJ) and preconditioned conjugate gradient (PCG) algorithms. The absence of data synchronization makes parallel programs more prone to errors, since the parallel environment is much more complicated than it appears. The studies reviewed highlight how synchronization is managed using the OpenMP model. In the field of biometrics, the most important issue faced in DNA sequencing and pattern discovery is locating the longest common subsequence (LCS) among sequences. To identify the LCS of DNA sequences, we look into solutions achieved using CPU-based OpenMP models, which offer major improvements in processing speed, capital, and ubiquity, and we discuss the results of the analysis.
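For readers unfamiliar with the LCS problem mentioned above, a minimal sequential sketch of the standard dynamic-programming solution follows. It is not the OpenMP implementation the abstract surveys; the function name is hypothetical. Each table cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, which is the property that parallel versions typically exploit.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of sequences s and t.

    dp[i][j] holds the LCS length of the prefixes s[:i] and t[:j].
    """
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if s[i - 1] == t[j - 1]:
                # Matching symbols extend the LCS of the shorter prefixes.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Otherwise drop one symbol from either sequence.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` evaluates to 4, corresponding to the common subsequence "GTAB".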