ISBN: 9798400714436 (print)
Speculative data-parallel algorithms for language recognition have been widely studied for various types of finite-state automata (FA), both deterministic (DFA) and nondeterministic (NFA), often derived from regular expressions (RE). Such an algorithm cuts the input string into chunks, recognizes each chunk independently and in parallel by means of identical FAs, and finally joins the chunk results and checks their overall consistency. In chunk recognition, the FAs must be started speculatively in any state, which causes an overhead that reduces the speedup over a serial algorithm. The existing data-parallel DFA-based recognizers suffer from an excessive number of starting states, and the NFA-based ones suffer from the number of nondeterministic transitions.
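The chunk-and-join scheme described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the toy DFA (even number of `a` over the alphabet {a, b}), the names `run_chunk`, `split`, and `recognize_parallel`, and the use of a thread pool are all assumptions made for illustration. Each chunk is run from every state (the speculation overhead the abstract mentions), and the join step composes the per-chunk state maps starting from the real initial state.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy DFA over {a, b}: accepts strings with an even number of 'a'.
STATES = (0, 1)
START = 0
ACCEPTING = {0}
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}


def run_chunk(chunk):
    """Speculatively run the DFA on one chunk from every possible start state.

    Returns a dict mapping each start state to the state reached at chunk end.
    """
    result = {}
    for start in STATES:
        state = start
        for symbol in chunk:
            state = DELTA[(state, symbol)]
        result[start] = state
    return result


def split(text, n_chunks):
    """Cut the input into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division, at least 1
    return [text[i:i + size] for i in range(0, len(text), size)]


def recognize_parallel(text, n_chunks=4):
    """Recognize text by running chunks independently, then joining results."""
    chunks = split(text, n_chunks)
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(run_chunk, chunks))
    # Join: thread the true start state through the per-chunk state maps.
    state = START
    for chunk_map in maps:
        state = chunk_map[state]
    return state in ACCEPTING


def recognize_serial(text):
    """Reference serial recognizer used to check the parallel version."""
    state = START
    for symbol in text:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING
```

Running every chunk from all states costs a factor of `len(STATES)` in extra work per chunk, which is the overhead that grows with the number of starting states. The thread pool here only shows the structure; in CPython it would not by itself yield a real speedup for this CPU-bound loop.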
ISBN: 9798400714436 (print)
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. This paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02× and 4.19×, respectively.
ISBN: 9798400714436 (print)
Molecular dynamics simulation has emerged as an important area in which HPC+AI helps investigate physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (ML) tools have been leveraged in MLIPs, but the two are not well matched, since ML tools miss many optimization opportunities in MLIPs. This inefficiency arises because HPC+AI applications involve far greater computational complexity than pure AI scenarios. This paper develops an MLIP, named TensorMD, independently of any ML tool. TensorMD has been evaluated on two supercomputers and scaled to 51.8 billion atoms, about 3× the state of the art.
ISBN: 9781605583976 (print)
The advent of new parallel architectures has increased the need for parallel optimizing compilers to assist developers in creating efficient code. OpenUH is a state-of-the-art optimizing compiler, but it performs only a limited set of optimizations for OpenMP programs because of its conservative assumptions about shared-memory programming. These limitations may prevent some OpenMP applications from being optimized as fully as their sequential counterparts. This paper describes our design and implementation of a parallel data flow framework, consisting of a Parallel Control Flow Graph (PCFG) and a Parallel SSA (PSSA) representation in OpenUH, to model data flow for OpenMP programs. This framework enables the OpenUH compiler to perform all classical scalar optimizations for OpenMP programs, in addition to OpenMP-specific optimizations.
ISBN: 9781605583976 (print)
Boosted transactions offer an attractive method that enables programmers to create larger transactions that scale well and offer deadlock-free guarantees. However, as boosted transactions get larger, they become more susceptible to conflicts and aborts. We describe a linear-time algorithm that detects transactions that cannot make progress and determines which transactions need to be aborted, and when. The algorithm guarantees zero false positives with minimal aborts. Our proposals, as implemented in DSTM2, increase the transactional throughput of the system, often by more than 30%.
ISBN: 9781605583976 (print)
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable: its memory requests to the same memory location can be combined into a single memory request, which mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
ISBN: 9781605583976 (print)
A stream processor executes an application that has been decomposed into a sequence of kernels that operate on streams of data elements. During the execution of a kernel, all streams accessed must be communicated through the SRF (Stream Register File), a non-bypassing software-managed on-chip memory. Therefore, optimizing utilization of the SRF is crucial for good performance. The key insight is that the interference graphs (IGs) formed by the streams in stream applications tend to be comparability graphs or decomposable into a set of comparability graphs. We present a compiler algorithm that can find optimal or near-optimal colorings in stream IGs, thereby improving SRF utilization over the First-Fit bin-packing algorithm, the best in the literature.