Details
ISBN:
(Print) 9781450371964
Parallel programming methodologies are fundamentally dissimilar to those of conventional programming, and software developers without the requisite skillset often find it difficult to adapt to these new methods. This is particularly true for parallel programming in a distributed address space, which is necessary for any meaningful degree of scalability. As such, an approach that combines a more intuitive interface with excellent performance within the distributed address space model is desired. In this work, we present our initial API design and implementation, as well as the underlying algorithms, for a collective communication library built for the Extended Base Global Address Space (xBGAS) extension to the RISC-V instruction set architecture. Our runtime library is designed to implement the Partitioned Global Address Space (PGAS) model in an attempt to alleviate the difficulty associated with traditional distributed address space programming, while the underlying collective implementation is formulated to preserve, and even improve, performance relative to traditional solutions.
Details
ISBN:
(Print) 9781728114361
Popular language extensions for parallel programming such as OpenMP or CUDA require considerable compiler support and runtime libraries and are therefore only available for a few programming languages and/or targets. We present an approach to vectorizing kernels written in an existing general-purpose language that requires minimal changes to compiler front-ends. Programmers annotate parallel (SPMD) code regions with a few intrinsic functions, which then guide an ordinary automatic vectorization algorithm. This mechanism allows programming SIMD and vector processors effectively while avoiding much of the implementation complexity of more comprehensive and powerful approaches to parallel programming. Our prototype implementation, based on a custom vectorization pass in LLVM, is integrated into C, C++ and Rust compilers using only 29-37 lines of frontend-specific code each.
Details
ISBN:
(Print) 9781450371896
The success of Deep Learning (DL) algorithms in computer vision tasks has created an ongoing demand for dedicated hardware architectures that can keep up with their required computation and memory complexities. This task is particularly challenging when embedded smart camera platforms have constrained resources such as power consumption, Processing Elements (PEs), and communication. This article describes a heterogeneous system embedding an FPGA and a GPU for executing CNN inference for computer vision applications. The built system addresses some challenges of embedded CNN such as task and data partitioning, and workload balancing. The selected heterogeneous platform embeds an Nvidia (R) Jetson TX2 for the CPU-GPU side and an Intel Altera (R) Cyclone10GX for the FPGA side, interconnected by PCIe Gen2, with a MIPI-CSI camera for prototyping. This test environment will be used as a support for future work on a methodology for optimized model partitioning.
Details
ISBN:
(Print) 9781450362597
Parallel programming skills are becoming more popular due to the unprecedented boom in artificial intelligence and high-performance computing. Programming assignments are widely used in parallel programming courses to measure student performance and expose students to constraints in real projects. However, due to the difficulty level of these assignments, many students struggle to write fully functional and adequately documented programs. To improve student performance, we implemented a moderated two-stage format for five course projects in a graduate-level introductory parallel programming class. Each project is divided into two stages: students complete the assignment individually without any collaboration in the first stage, then work in pairs on the same project in the second stage, so they can review each other's work from the first stage and improve their programs collaboratively. For two of the five projects, a moderated meeting is conducted between the two stages, in which the instructor moderates a group discussion on general issues raised by students. We found that students' performance improved from stage one to stage two. In addition, the two projects with a moderated meeting show better performance gains. This paper also examines students' perceptions of and experiences with the moderated two-stage projects. Students favor working on two-stage projects because they have a chance to discuss challenging concepts, and the moderated discussion session tends to guide them to the correct path should they make mistakes in stage one.
Details
ISBN:
(Print) 9783030011741; 9783030011734
The primary purpose of parallel streams in the recent release of Java 8 is to help Java programs make better use of multi-core processors for improved performance. However, in some cases, parallel streams can actually perform considerably worse than ordinary sequential Java code. This paper presents a Map-Reduce parallel programming pattern for Java parallel streams that produces good speedup over sequential code. An important component of the Map-Reduce pattern is two optimizations: grouping and locality. Three parallel application programs are used to illustrate the Map-Reduce pattern and its optimizations: Histogram of an Image, Document Keyword Search, and Solution to a Differential Equation. A proposal is included for a new terminal stream operation for the Java language called MapReduce() that applies this pattern and its optimizations automatically.
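A minimal sketch of the kind of Map-Reduce pattern with concurrent grouping the abstract describes, applied to its histogram example (the class name, bin count, and use of `groupingByConcurrent` are illustrative assumptions, not the paper's actual code):

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapReduceHistogram {
    // Hypothetical sketch: map each 8-bit pixel value to a bin index, then
    // reduce by counting per bin. groupingByConcurrent lets parallel stream
    // workers merge partial counts into one concurrent map, avoiding the
    // merge overhead of per-thread intermediate maps.
    static Map<Integer, Long> histogram(int[] pixels, int bins) {
        return IntStream.of(pixels)
                .parallel()
                .boxed()
                .collect(Collectors.groupingByConcurrent(
                        p -> p * bins / 256,      // map: pixel -> bin index
                        Collectors.counting()));  // reduce: count per bin
    }

    public static void main(String[] args) {
        int[] pixels = {0, 10, 130, 200, 255, 255};
        System.out.println(histogram(pixels, 4));
    }
}
```

Locality, the paper's second optimization, would additionally require each worker to process a contiguous slice of the pixel array; the range-splitting done by `IntStream.parallel()` over an array already approximates this.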
Details
ISBN:
(Print) 9781728147895
Nowadays, the development of various distributed STMs, which aid parallel programming of distributed systems, attracts the interest of many researchers. In this paper, we develop a Python distributed STM based on data replication, which provides better performance as well as tolerance to replica faults. The solution supports both eventual and sequential data consistency. Experimental results show that reading t-variables from a local replica is up to 16 times faster than reading them from the base replica.
Details
ISBN:
(Print) 9781450366328
Many systems used in the HPC field have multiple accelerators on a single compute node. However, programming for multiple accelerators is more difficult than programming for a single accelerator. Therefore, in this paper, we propose an OpenMP extension that allows easy programming for multiple accelerators. We extend existing OpenMP syntax to create a Partitioned Global Address Space (PGAS) across the separate memories of several accelerators. This feature enables users to program multiple accelerators with ease. In the performance evaluation, we implement the STREAM Triad and HIMENO benchmarks using the proposed OpenMP extension. As a result of evaluating the performance on a compute node equipped with up to four GPUs, we confirm that the proposed OpenMP extension demonstrates sufficient performance.
Details
ISBN:
(Print) 9781728153049
A many-core implementation of the multilevel fast multipole algorithm (MLFMA) based on the Athread parallel programming model for computing electromagnetic scattering by a 3-D object on China's homegrown many-core SW26010 CPU is presented. In the proposed many-core implementation of the MLFMA, data access efficiency is improved by using data structures based on the Structure-of-Arrays (SoA) layout. Adaptive workload distribution strategies are adopted on different MLFMA tree levels to ensure full utilization of the computing capability and the scratchpad memory (SPM). A double-buffering scheme is specially designed to overlap communication with computation. The resulting Athread-based many-core implementation of the MLFMA is capable of solving real-life problems with over four hundred thousand unknowns with a remarkable speed-up. Numerical results show that with the proposed parallel scheme, a total speed-up larger than 7 times can be achieved compared with the CPU master core.
Details
ISBN:
(Print) 9781450376389
Structured parallel programming has been studied and applied in several programming languages. This approach has proven suitable for abstracting low-level and architecture-dependent parallelism implementations. Our goal is to provide a structured, high-level library for the Rust language, targeting parallel stream processing applications for multi-core servers. Rust is an emerging programming language developed by the Mozilla Research group, focusing on performance, memory safety, and thread safety. However, it lacks parallel programming abstractions, especially for stream processing applications. This paper contributes a new API based on the structured parallel programming approach to simplify parallel software development. Our experiments highlight that our solution provides higher-level parallel programming abstractions for stream processing applications in Rust. We also show that the throughput and speedup are comparable to the state of the art for certain workloads.
Details
The synthesis of electrically large, highly performing reflectarray antennas can be computationally very demanding, both from the analysis and from the optimization points of view. It therefore requires the combined usage of numerical and hardware strategies to control the computational complexity and provide the needed acceleration. Recently, we have set up a multi-stage approach in which the first stage employs global optimization with a rough, computationally convenient modeling of the radiation, while the subsequent stages employ local optimization on gradually refined radiation models. The purpose of this paper is to show how reflectarray antenna synthesis can benefit from parallel computing on Graphics Processing Units (GPUs) using the CUDA language. In particular, parallel computing is adopted along two lines. First, the presented approach accelerates a Particle Swarm Optimization procedure exploited for the first stage. Second, it accelerates the computation of the field radiated by the reflectarray using a GPU-implemented Non-Uniform FFT routine which is used by all the stages. The numerical results show how the first stage of the optimization process is crucial to achieve, at an acceptable computational cost, a good starting point.