ISBN:
(Print) 9783030993726; 9783030993719
With an increasing number of shared-memory multicore processor architectures, automatic parallelizing compilers need to support multiple architectures. The OSCAR (Optimally Scheduled Advanced Multiprocessor) automatic parallelizing compiler can parallelize many different sequential programs, such as scientific applications, embedded real-time applications, multimedia applications, and more. The OSCAR compiler's features include coarse-grain task parallelization with earliest-execution-condition analysis, which analyzes both data and control dependencies; data locality optimizations across different loop nests with data dependencies; and the ability to generate parallelized code using the OSCAR API 2.1. The OSCAR API 2.1 is compatible with OpenMP for SMP multicores, with additional directives for power control and support for heterogeneous multicores. This allows any C or Fortran compiler with OpenMP support to generate parallel machine code for the target multicore. Additionally, the OSCAR API analyzer allows a sequential-only compiler without OpenMP support to generate machine code for each core separately, which is then linked into one parallel application. Overall, only small configuration changes to the OSCAR compiler are needed to run and optimize OSCAR-generated code on a specific platform. This paper evaluates the performance of OSCAR-generated code on different modern SMP multicore processors, including Intel and AMD x86 processors, an Arm processor, and a RISC-V processor, using scientific and multimedia benchmarks in C and Fortran. The results show promising speedups on all platforms, such as a speedup of 7.16 for the swim program of the SPEC2000 benchmarks on an 8-core Intel x86 processor, a speedup of 9.50 for the CG program of the NAS parallel benchmarks on 8 cores of an AMD x86 processor, a speedup of 3.70 for the BT program of the NAS parallel benchmarks on a 4-core RISC-V processor, and a speedup of 2.64 on the Arm processor.
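As a rough illustration of the flow this abstract describes, the sketch below shows the style of OpenMP-annotated code that an OpenMP-compatible backend compiler consumes: two independent coarse-grain macro-tasks mapped to parallel sections. The task bodies and names are hypothetical; actual OSCAR API 2.1 output includes additional directives (e.g., for power control) not shown here.

```cpp
// Hedged sketch of coarse-grain task parallelization in the OpenMP-compatible
// style described above; illustrative only, not actual OSCAR output.
#include <omp.h>
#include <cstdio>

static void macro_task_a(double *x, int n) {   // hypothetical macro-task
    for (int i = 0; i < n; ++i) x[i] *= 2.0;
}
static void macro_task_b(double *y, int n) {   // hypothetical macro-task
    for (int i = 0; i < n; ++i) y[i] += 1.0;
}

int main() {
    const int n = 1 << 20;
    static double x[1 << 20], y[1 << 20];

    // Two coarse-grain tasks with no mutual data dependence can run on
    // separate cores; a parallelizing compiler would first prove this
    // with data- and control-dependence analysis, then emit the sections.
    #pragma omp parallel sections
    {
        #pragma omp section
        macro_task_a(x, n);
        #pragma omp section
        macro_task_b(y, n);
    }
    std::printf("%f %f\n", x[0], y[0]);
    return 0;
}
```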
ISBN:
(Print) 9781467352857
The purpose of this article is to design and implement a performant compiler for parallelizing Java applications based on divide-and-conquer algorithms. The compiler is built around the Java ForkJoin framework, which is directly integrated into Java 1.7 and imported as an archive library in Java 1.6 and 1.5. The compiler aims to make the parallelization of recursive applications easier and less error-prone. While the Java ForkJoin framework exposes two user-level performance parameters, the number of threads and the threshold, our compiler introduces a third, MaxDepth: the maximum recursion depth beyond which only sequential execution is enforced. This allows balancing between fine-grain and coarse-grain parallelism. Experimental results are presented and discussed.
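The paper's compiler targets Java's ForkJoin framework; as an analogous sketch of the three tuning knobs it exposes (thread count, threshold, MaxDepth), here is a hypothetical divide-and-conquer sum written with OpenMP tasks in C++. Below the threshold or past MaxDepth, recursion falls back to sequential execution, trading fine-grain for coarse-grain parallelism.

```cpp
// Hypothetical C++ analogue of the ForkJoin-style cutoffs described above:
// 'threshold' limits task size, 'maxDepth' limits recursion depth.
#include <omp.h>
#include <numeric>
#include <vector>
#include <cstdio>

static long dc_sum(const long *a, int lo, int hi, int depth,
                   int threshold, int maxDepth) {
    // Past either cutoff, run sequentially (coarse-grain fallback).
    if (hi - lo <= threshold || depth >= maxDepth)
        return std::accumulate(a + lo, a + hi, 0L);

    int mid = lo + (hi - lo) / 2;
    long left = 0, right = 0;
    #pragma omp task shared(left)   // fork: left half as a child task
    left = dc_sum(a, lo, mid, depth + 1, threshold, maxDepth);
    right = dc_sum(a, mid, hi, depth + 1, threshold, maxDepth);
    #pragma omp taskwait            // join: wait for the forked child
    return left + right;
}

int main() {
    std::vector<long> a(1000000, 1);
    long total = 0;
    #pragma omp parallel            // worker pool (number-of-threads knob)
    #pragma omp single              // one thread spawns the recursion
    total = dc_sum(a.data(), 0, (int)a.size(), 0, 10000, 8);
    std::printf("%ld\n", total);
    return 0;
}
```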
ISBN:
(Digital) 9783030727895
ISBN:
(Print) 9783030727895; 9783030727888
Embedded multicore processors running hard real-time applications such as engine control programs require an appropriate scheduling routine to meet real-time deadline constraints. These applications typically consist of various conditional branches that change the flow of the program and the task executions based on sensor inputs and vehicle status information. Conventionally, dynamic on-line scheduling was the only option for such applications with unpredictable runtime behavior. However, compiler and scheduler techniques now allow static off-line scheduling to be applied to engine control programs by using execution-profile feedback to supply task execution time information to the compiler. This paper is the first to compare dynamic and static scheduling schemes, through the OSCAR multi-grain automatic parallelizing compiler, along with their overheads on an actual engine control program running on an embedded multicore processor implemented on an FPGA. Evaluations and analysis on the engine control program indicate promising results for static scheduling, recording a 2.53x speedup on 4 cores compared to single-core execution. In contrast, the speedup of dynamic scheduling with 4 cores was only 0.86x compared to sequential execution. The evaluation shows that static scheduling with execution-profile feedback is an effective tool for real hard real-time control applications whose task granularity is too fine for dynamic scheduling on embedded multicore processors.
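To make the static/dynamic distinction concrete, here is a hedged sketch (not OSCAR's actual output) of what compiler-generated static scheduling can look like: each core receives a fixed task list chosen off-line from profiled execution times, and a cross-core dependence is enforced with a cheap flag instead of a runtime scheduler. Task names and the schedule are hypothetical.

```cpp
// Hypothetical illustration of an off-line (static) schedule: tasks are
// assigned to cores at compile time using profiled execution times, and
// the cross-core data dependence task0 -> task2 is enforced with a flag
// rather than a dynamic runtime scheduler.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<bool> task0_done{false};

static void task0() { std::puts("task0"); }
static void task1() { std::puts("task1 (independent)"); }
static void task2() { std::puts("task2 (consumes task0's output)"); }

int main() {
    // Static schedule chosen off-line: core 0 runs {task0}, core 1 runs
    // {task1, task2}; task1 overlaps with task0, hiding its latency.
    std::thread core0([] {
        task0();
        task0_done.store(true, std::memory_order_release);
    });
    std::thread core1([] {
        task1();
        while (!task0_done.load(std::memory_order_acquire)) { /* spin */ }
        task2();
    });
    core0.join();
    core1.join();
    return 0;
}
```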
The advancement of multicore technology has made processors with hundreds or even thousands of cores on a single chip possible. However, at larger core counts, a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient, and it is built into the OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes coarse-grain tasks, analyzes stale data and line sharing in the program, and then solves those problems by simple program restructuring and data synchronization. Using our proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The compiled binaries were then run on the Renesas RP2, an 8-core SH-4A processor, and a custom 8-core Altera Nios II system on an Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is only available for up to 4 cores; it can also be turned off for a non-coherent cache mode. The Nios II multicore system has no hardware cache coherence mechanism at all, so running a parallel program on it is difficult without compiler support. The proposed method performed as well as or better than the hardware cache coherence scheme while still producing results as correct as those of the hardware coherence mechanism. This method allows a massive array of shared-memory CPU cores in an HPC setting, or a simple non-coherent multicore embedded CPU, to be easily programmed. For example, on the RP2 processor, the proposed software-controlled non-coherent-cache (NCC) method gave us a 2.6 times speedup for SPEC2000 "equake" on 4 cores against sequential execution, whereas 4-core MESI hardware coherence control achieved only a 2.5 times speedup. The software coherence control also gave us a 4.4 times speedup on 8 cores.
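The following hedged sketch illustrates the idea of compiler-directed coherence on a non-coherent-cache machine. The writeback/invalidate functions are hypothetical placeholders (declarations only) for platform-specific cache-control instructions; the real compiler inserts the equivalent operations automatically after its stale-data and line-sharing analysis.

```cpp
// Hypothetical sketch of software-controlled coherence on a machine with
// no hardware coherence. cache_writeback()/cache_invalidate() stand in
// for platform-specific cache-control instructions (line flush and
// self-invalidate); a compiler would insert them automatically.
#include <cstddef>

void cache_writeback(const void *addr, std::size_t bytes);   // hypothetical
void cache_invalidate(const void *addr, std::size_t bytes);  // hypothetical

double shared_buf[1024];
volatile int ready = 0;   // assumed to live in an uncached region

void producer() {
    for (int i = 0; i < 1024; ++i) shared_buf[i] = i * 0.5;
    // Push dirty lines to memory *before* publishing the flag, so the
    // consumer never observes the flag without the data behind it.
    cache_writeback(shared_buf, sizeof shared_buf);
    ready = 1;
}

void consumer(double *out) {
    while (ready == 0) { /* spin on the uncached flag */ }
    // Discard any stale copy of the buffer this core may have cached
    // before reading the freshly written values from memory.
    cache_invalidate(shared_buf, sizeof shared_buf);
    for (int i = 0; i < 1024; ++i) out[i] = shared_buf[i];
}
```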
Ease of programming and optimal parallel performance have historically been on opposite sides of a trade-off, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the trade-off. We observed that several clustering algorithms often share common traits; in particular, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observation on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. The compiler generates either MPI C++ code for distributed-memory parallel processing or MPI-OpenMP C++ code for hybrid-memory parallel processing, depending upon the target architecture. Our experiments on different state-of-the-art parallelization frameworks show that our system can achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community. Results are presented for both distributed- and hybrid-memory systems.
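As an illustration of the kind of domain pattern the abstract alludes to, here is a hedged sketch of the assignment/update steps shared by representative-based methods, with k-means as the concrete instance and the assign step parallelized in OpenMP C++ (one of the compiler's target forms). The DSL itself and its generated code are not reproduced here; all names below are ours.

```cpp
// Hypothetical sketch of the recurring representative-based clustering
// pattern: assign each point to its nearest representative, then update
// the representatives. k-means is the instance shown.
#include <omp.h>
#include <vector>
#include <limits>

struct Point { double x, y; };

void kmeans_step(const std::vector<Point> &pts, std::vector<Point> &reps,
                 std::vector<int> &label) {   // label.size() == pts.size()
    const int k = (int)reps.size();
    // Assign step: embarrassingly parallel over points -- the kind of
    // pattern a DSL compiler can recognize and map to OpenMP or MPI.
    #pragma omp parallel for
    for (int i = 0; i < (int)pts.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        for (int c = 0; c < k; ++c) {
            double dx = pts[i].x - reps[c].x, dy = pts[i].y - reps[c].y;
            double d = dx * dx + dy * dy;
            if (d < best) { best = d; label[i] = c; }
        }
    }
    // Update step: recompute each representative as its cluster mean.
    std::vector<Point> sum(k, {0, 0});
    std::vector<int> cnt(k, 0);
    for (int i = 0; i < (int)pts.size(); ++i) {
        sum[label[i]].x += pts[i].x;
        sum[label[i]].y += pts[i].y;
        ++cnt[label[i]];
    }
    for (int c = 0; c < k; ++c)
        if (cnt[c]) reps[c] = {sum[c].x / cnt[c], sum[c].y / cnt[c]};
}
```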
ISBN:
(Print) 9781728107462
Chip Multiprocessors (CMPs) are everywhere, from mobile systems to servers. Thread Level Parallelism (TLP) is the characteristic of a program that makes use of the parallel cores of a CMP to generate performance. Despite all efforts to create TLP, multiple cores remain underutilized even though we have been in the multicore era for more than a decade. Recently, a new approach called STATS has been proposed to generate additional TLP for complex and irregular nondeterministic programs. STATS allows a developer to describe application-specific information that its compiler uses to automatically generate a new source of TLP. This new source of TLP increases with the size of the input, and it has the potential to generate performance that scales with the number of cores. Although STATS realizes most of this potential, some of it remains unreached. This paper identifies and characterizes the sources of overhead that currently prevent STATS-parallelized programs from achieving their full potential. To this end, we characterized the workloads generated by the STATS compiler on a 28-core, dual-socket Intel-based machine. This paper shows that the performance loss is due to a combination of factors: some can be optimized through engineering effort, and some require a deeper evolution of STATS. We also highlight potential solutions to significantly reduce most of this overhead. Exploiting these insights will unblock scalable performance for the parallel binaries generated by STATS.
ISBN:
(Print) 9781728144931
Big Data has significantly increased the data analytics community's dependence on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized skills in parallelization and the use of platform-specific languages and mechanisms. We present a framework for quickly prototyping new and existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user is only required to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems, in particular typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for the DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach can achieve performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort, several orders of magnitude smaller than that required on other parallel platforms like MPI/Spark.
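To give a flavor of a parallelizable Kernel, here is a hedged sketch (ours, not the DSL or its generated code) of a DBSCAN-style core-point test distributed with MPI: each rank scans its block of points against the full dataset, the scale-out pattern such a compiler can emit. The dataset, eps, and min_pts values below are placeholders.

```cpp
// Hypothetical sketch of a density-based clustering Kernel: counting
// eps-neighbors to flag core points, with the point set block-distributed
// across MPI ranks. Illustrative only.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 10000, min_pts = 4;         // assumed parameters
    const double eps2 = 0.01;                 // eps squared
    std::vector<double> x(n);                 // every rank holds all points
    for (int i = 0; i < n; ++i) x[i] = (i * 37 % n) / (double)n;

    // Each rank tests its contiguous block of points for core-ness.
    int lo = rank * n / size, hi = (rank + 1) * n / size, local_core = 0;
    for (int i = lo; i < hi; ++i) {
        int neighbors = 0;
        for (int j = 0; j < n; ++j) {
            double d = x[i] - x[j];
            if (d * d <= eps2) ++neighbors;
        }
        if (neighbors >= min_pts) ++local_core;
    }
    int total_core = 0;                       // combine partial counts
    MPI_Reduce(&local_core, &total_core, 1, MPI_INT, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0) std::printf("core points: %d\n", total_core);
    MPI_Finalize();
    return 0;
}
```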
The power of a parallelizing compiler's data dependence testing techniques is essential to its ability to transform and optimize programs. Numerous techniques have been proposed over the years, yet evaluating their relative power, so as to better understand the data dependence testing problem, remains challenging. In the past, either empirical studies or experimental evaluation results were published to compare these data dependence testing techniques, but neither has been able to fully convince the research community. In this paper, we present a theoretical study of this issue, comparing the power of existing techniques to disprove dependences by proving theorems in a proposed formal system, K-DT. In addition, we present the upper bounds of these techniques and introduce their minimum complete sets. To the best of our knowledge, K-DT is the first formal system used to compare the power of data dependence testing techniques, and this paper is the first work to show the upper bounds and minimum complete sets of data dependence testing techniques.
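As a concrete instance of "disproving dependences", here is a worked sketch of the classic GCD test, one of the techniques such a comparison covers (the example is ours, not taken from the paper): a cross-iteration dependence a*i1 - b*i2 = c can have an integer solution only if gcd(a, b) divides c.

```cpp
// Classic GCD dependence test (illustrative example, not from the paper):
// for accesses A[a*i + c1] and A[b*i + c2] inside a loop, a dependence
// requires an integer solution of a*i1 - b*i2 = c2 - c1, which exists
// only if gcd(a, b) divides c2 - c1.
#include <numeric>   // std::gcd (C++17)
#include <cstdlib>   // std::labs
#include <cstdio>

bool gcd_test_may_depend(long a, long b, long c) {
    long g = std::gcd(std::labs(a), std::labs(b));
    if (g == 0) return c == 0;      // degenerate: both strides are zero
    return c % g == 0;              // divisible -> dependence possible
}

int main() {
    // for (i) A[2*i + 1] = A[2*i];  =>  2*i1 - 2*i2 = -1, gcd(2,2) = 2;
    // 2 does not divide -1, so the test *disproves* the dependence.
    std::printf("A[2i+1]=A[2i]: may depend = %d\n",
                gcd_test_may_depend(2, 2, -1));
    // for (i) A[2*i] = A[i];  =>  gcd(2,1) = 1 divides anything, so the
    // test cannot disprove a dependence here.
    std::printf("A[2i]=A[i]:   may depend = %d\n",
                gcd_test_may_depend(2, 1, 0));
    return 0;
}
```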
ISBN:
(Print) 9781538650905
Ease of programming and optimal parallel performance have historically been on opposite sides of a tradeoff, forcing the user to choose. With the advent of the Big Data era and the rapid evolution of sequential algorithms, the data analytics community can no longer afford the tradeoff. We observed that several clustering algorithms often share common traits; in particular, algorithms belonging to the same class of clustering exhibit significant overlap in processing steps. Here, we present our observation on domain patterns in representative-based clustering algorithms and how they manifest as clearly identifiable programming patterns when mapped to a Domain Specific Language (DSL). We have integrated the signatures of these patterns in the DSL compiler for parallelism identification and automatic parallel code generation. Our experiments on different state-of-the-art parallelization frameworks show that our system is able to achieve near-optimal speedup while requiring a fraction of the programming effort, making it an ideal choice for the data analytics community.
ISBN:
(Print) 9781450356442
We study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamic data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions: for example, the substitution of closed forms for induction variables remains applicable, removing the loop-carried data dependences induced by termination conditions. We propose an automatic compilation approach to parallelize and optimize dynamic counted loops. Our approach relies on affine relations only, as implemented in state-of-the-art polyhedral libraries. Revisiting a state-of-the-art framework to parallelize arbitrary while loops, we introduce additional control dependences on data-dependent predicates. Our method goes beyond the state of the art in fully automating the process, specializing the code generation algorithm to the case of dynamic counted loops and avoiding the introduction of spurious loop-carried dependences. We conduct experiments on representative irregular computations, from dynamic programming, computer vision and finite element methods to sparse matrix linear algebra. We validate that the method is applicable to general affine transformations for locality optimization, vectorization and parallelization.
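A hedged example of the loop class in question (ours, not taken from the paper): in CSR sparse matrix-vector multiplication, the inner loop is counted but its upper bound row_ptr[i+1] is data-dependent. Because the bound is loop-invariant within each outer iteration, induction variables keep their closed forms and the outer loop parallelizes directly, unlike a general while loop.

```cpp
// Dynamic counted loops: the inner trip count row_ptr[i+1] - row_ptr[i]
// is only known at run time (data-dependent), yet unlike a general while
// loop it induces no termination-condition dependence, so the outer loop
// can be parallelized as-is.
#include <omp.h>
#include <vector>

void spmv_csr(int n, const std::vector<int> &row_ptr,
              const std::vector<int> &col, const std::vector<double> &val,
              const std::vector<double> &x, std::vector<double> &y) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        // Counted loop with a dynamic, data-dependent upper bound.
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            acc += val[k] * x[col[k]];
        y[i] = acc;
    }
}
```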