Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with smal...
详细信息
Speculative data-parallel algorithms for language recognition have been widely experimented for various types of finitestate automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived fromregular exp...
详细信息
ISBN:
(纸本)9798400714436
Speculative data-parallel algorithms for language recognition have been widely experimented for various types of finitestate automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived fromregular expressions (RE). Such an algorithm cuts the input string into chunks, independently recognizes each chunk in parallel by means of identical FAs, and at last joins the chunk results and checks the overall consistency. In chunk recognition, it is necessary to speculatively start the FAs in any state, thus causing an overhead that reduces the speedup over a serial algorithm. the existing data-parallel DFA-based recognizers suffer from an excessive number of starting states, and the NFA-based ones suffer from the number of nondeterministic transitions.
It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving devic...
详细信息
Hypergraph partitioning is used in many problem domains including VLSI design, linear algebra, Boolean satisfiability, and data mining. Most versions of this problem are NP-complete or NP-hard, so practical hypergraph...
详细信息
Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoidin...
详细信息
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to lim...
详细信息
ISBN:
(纸本)9798400714436
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. this paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02× and 4.19×, respectively.
Molecular dynamics simulation emerges as an important area that HPC+AI helps to investigate the physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (M...
详细信息
ISBN:
(纸本)9798400714436
Molecular dynamics simulation emerges as an important area that HPC+AI helps to investigate the physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (ML) tools have been leveraged in MLIPs, but they are not perfectly matched with each other, since many optimization opportunities in MLIPs have been missed by ML tools. this inefficiency arises from the fact that HPC+AI applications work with far more computational complexity compared with pure AI scenarios. this paper has developed an MLIP, named TensorMD, independently from any ML tool. TensorMD has been evaluated on two supercomputers and scaled to 51.8 billion atoms, i.e., ~ 3× compared with state-of-the-art.
Binary code analysis is widely used to help assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging infor...
详细信息
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates...
详细信息
ISBN:
(纸本)9798400700156
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. this provides a near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
Bayesian networks (BNs) are attractive, because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve the efficien...
详细信息
ISBN:
(纸本)9798400700156
Bayesian networks (BNs) are attractive, because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve the efficiency, we propose a fast BN exact inference solution named Fast-BNI on multi-core CPUs. Fast-BNI enhances the efficiency of exact inference through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. We also propose techniques to further simplify the bottleneck operations of BN exact inference. Fast-BNI source code is freely available at https://***/jjiantong/FastBN.
暂无评论