Transactional memory has been attracting increasing attention in recent years; it provides optimistic concurrency control schemes for shared-memory parallel programs. The rapid development and wide adoption of transactional memory make this programming paradigm promising for achieving breakthroughs in massively parallel computing. There have been many studies of transactional memory systems aimed at providing relatively simple and intuitive synchronization constructs for shared-memory parallel programs without sacrificing performance. Hardware transactional memory (HTM) has become commercially available in mainstream processors. However, several inherent architectural limitations abort hardware transactions, such as cache overflows, context switches, and hardware or software exceptions, so today's HTM systems are best-effort and require a software fallback path to ensure forward progress. In this paper, we survey state-of-the-art software-side optimizations for best-effort HTM systems, as well as several novel performance-tuning techniques. Research efforts on the joint use of HTM and non-volatile memory (NVM) are also discussed.
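As an illustration of the best-effort pattern described above, here is a minimal sketch of a hardware transaction with a software lock fallback using Intel RTM intrinsics; the retry bound, shared counter, and spin lock are illustrative assumptions, not taken from the paper.

// Minimal best-effort HTM sketch with a software fallback path.
// Compile with RTM intrinsics enabled (e.g. -mrtm) on supporting hardware.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> fallback_lock{false};   // lock taken by the software fallback path
long shared_counter = 0;                  // example shared data (illustrative)

void increment_counter(int max_retries = 5) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Reading the lock adds it to the transaction's read set, so a
            // concurrent fallback writer will abort this transaction.
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0xff);
            ++shared_counter;             // transactional update
            _xend();
            return;
        }
        // Transaction aborted (capacity overflow, conflict, interrupt, ...): retry.
    }
    // Software fallback: take the lock and update non-transactionally.
    while (fallback_lock.exchange(true, std::memory_order_acquire)) { /* spin */ }
    ++shared_counter;
    fallback_lock.store(false, std::memory_order_release);
}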
ISBN (digital): 9781665421614
ISBN (print): 9781665421621
This work presents the main activities and results of a learning process for parallel programming in CUDA, the language for Graphics Processing Units (GPUs), based on algorithms for processing and generating 2D and 3D images. The proposed learning activities focus on the key points of parallel programming, such as the optimal use of the different types of memory. The learning process is organized as a set of master classes on image theory and parallel programming, along with practical CUDA programming sessions for 2D and 3D image generation and processing. Results show student satisfaction with the proposed learning process and marks similar to those obtained before its application.
Usage of multiprocessor and multicore computers implies parallel programming. Tools for preparing parallel programs include parallel languages and libraries as well as parallelizing compilers and convertors that can p...
The fiducial-marks-based alignment process is one of the most critical steps in printed circuit board (PCB) manufacturing. In the alignment process, a machine vision technique is used to detect the fiducial marks and then adjust the position of the vision system so that it is aligned with the PCB. The present study proposed an embedded PCB alignment system in which a rotation, scale and translation (RST) template-matching algorithm was employed to locate the marks on the PCB surface. The coordinates and angles of the detected marks were then compared with user-defined reference values, and the difference between them was used to adjust the position of the vision system accordingly. To improve the positioning accuracy, the angle and location matching was performed in successive refinement stages. To reduce the matching time, the present study accelerated the rotation matching by eliminating weak features in the scanning process and converting the normalized cross correlation (NCC) formula to a sum of products. Moreover, the scanning time was reduced by implementing the entire RST process in parallel on the threads of a graphics processing unit (GPU) and by applying hash functions to find refined positions in the refinement matching process. The experimental results showed that the resulting matching time was around 32x faster than that achieved on a conventional central processing unit (CPU) for a test image size of 1280 x 960 pixels. Furthermore, the alignment process achieved a tolerance of 36.4 μm.
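For reference, a sketch of the standard zero-mean NCC and its expansion into sums of products; the exact variant used in the paper may differ:

\[
\mathrm{NCC}(u,v)=\frac{\sum_{x,y}\bigl[I(u+x,v+y)-\bar I_{u,v}\bigr]\bigl[T(x,y)-\bar T\bigr]}
{\sqrt{\sum_{x,y}\bigl[I(u+x,v+y)-\bar I_{u,v}\bigr]^{2}}\;\sqrt{\sum_{x,y}\bigl[T(x,y)-\bar T\bigr]^{2}}}
\]

where \(I\) is the search image, \(T\) the template, and \(\bar I_{u,v}\), \(\bar T\) the window and template means. The numerator expands to

\[
\sum_{x,y} I(u+x,v+y)\,T(x,y)\;-\;N\,\bar I_{u,v}\,\bar T ,
\]

with \(N\) the number of template pixels, so each candidate position reduces to running sums of products that GPU threads can accumulate independently.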
Parallel programming within the computer science degree is now mandatory. New hardware platforms, with multiple cores and the execution of concurrent threads, require it. Despite this, teaching parallelism with the usual methods and classical algorithms makes the topic hard for students to understand. On the other hand, teaching complex topics through gamification techniques has already been reliably shown to positively reinforce students when learning complex concepts. In this work we demonstrate a way to teach parallelism to undergraduate students using gamification in microworlds. The results obtained by the students who followed this model, compared to a control group that followed the standard model, show a statistically significant advantage in favor of teaching parallelism with a gamification-with-microworlds model.
ISBN (print): 9781665454452
Derivatives are key to numerous science, engineering, and machine learning applications. While existing tools generate derivatives of programs in a single language, modern parallel applications combine a set of frameworks and languages to leverage available performance and function in an evolving hardware landscape. We propose a scheme for differentiating arbitrary DAG-based parallelism that preserves scalability and efficiency, implemented in the LLVM-based Enzyme automatic differentiation framework. By integrating with a full-fledged compiler backend, Enzyme can differentiate numerous parallel frameworks and directly control code generation. Combined with its ability to differentiate any LLVM-based language, this flexibility permits Enzyme to leverage the compiler toolchain for parallel and differentiation-specific optimizations. We differentiate nine distinct versions of the LULESH and miniBUDE applications, written in different programming languages (C++, Julia) and parallel frameworks (OpenMP, MPI, RAJA, Julia tasks, ***), demonstrating similar scalability to the original program. On benchmarks with 64 threads or nodes, we find a differentiation overhead of 3.4–6.8× on C++ and 5.4–12.5× on Julia.
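As a hedged illustration of how Enzyme is typically invoked from C++ (following the pattern in Enzyme's public documentation; the function names square and dsquare are illustrative, and compilation requires loading the Enzyme LLVM plugin):

// Reverse-mode differentiation of a scalar function with Enzyme.
#include <cstdio>

// Declaration resolved by the Enzyme compiler plugin at compile time.
extern double __enzyme_autodiff(void*, double);

double square(double x) { return x * x; }

// Returns d(square)/dx evaluated at x.
double dsquare(double x) {
    return __enzyme_autodiff((void*)square, x);
}

int main() {
    std::printf("d/dx x^2 at x = 3: %f\n", dsquare(3.0));  // expected: 6.0
    return 0;
}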
Multiple signal classification algorithm (MUSICAL) provides a super-resolution microscopy method. In the previous research, MUSICAL has enabled data-parallelism well on a desktop computer or a Linux-based server. Howe...
This article presents the definition and implementation of a quantum computer architecture to enable creating a new computational device: a quantum computer as an accelerator. A key question addressed is what such a quantum computer is and how it relates to the classical processor that controls the entire execution process. In this article, we present explicitly the idea of a quantum accelerator that contains the full stack of accelerator layers. Such a stack starts at the highest level, describing the target application of the accelerator. The next layer abstracts the quantum logic, outlining the algorithm that is to be executed on the quantum accelerator. In our case, the logic is expressed in the universal quantum-classical hybrid computation language developed in the group, called OpenQL, which views the quantum processor as a computational accelerator. The OpenQL compiler translates the program to a common assembly language, called cQASM, which can be executed on a quantum simulator. The cQASM represents the instruction set that can be executed by the microarchitecture implemented in the quantum accelerator. In a subsequent step, the compiler can convert the cQASM to eQASM, which is executable on a particular experimental device incorporating the platform-specific parameters. This way, we are able to distinguish clearly between the experimental research toward better qubits and the industrial and societal applications that need to be developed and executed on a quantum device. The first case offers experimental physicists a full-stack experimental platform using realistic qubits with decoherence and error rates, whereas the second case offers perfect qubits to the quantum application developer, with neither decoherence nor error rates. We conclude the article by explicitly presenting three examples of full-stack quantum accelerators, for an experimental superconducting processor, for quantum accelerated genome sequencing and for
Persistent homology is perhaps the most popular and useful tool offered by topological data analysis - with point-cloud data being the most common setup. Its older cousin, the Euler characteristic curve (ECC) is less ...
The coarray programming model is an expression of the Single-Program-Multiple-Data (SPMD) programming model through the simple device of adding a codimension to the Fortran language. A data object declared with a codimension is a coarray object. Codimensions express the idea that some objects are located in local memory while others are located in remote memory. Coarray syntax obeys most of the same rules as normal array syntax. It is familiar to the Fortran programmer, so the use of coarray syntax is natural and intuitive. Although the basic idea is quite simple, inserting it into the language definition turned out to be difficult. In addition, the process was complicated by rapidly changing hardware and heated arguments over whether parallelism is best supported as an interface to language-independent libraries, as a set of directives superimposed on languages, or as a set of specific extensions to existing languages. In this paper, we review both the early history of coarrays and their development into a part of Fortran 2008 and eventually into a larger part of Fortran 2018. Coarrays have been used, for example, in weather forecasting and in neural networks and deep learning.