Search Results - Inner Mongolia University Library

Fancier: A Unified Framework for Java, C, and OpenCL Integration

IEEE ACCESS, 2021, vol. 9, pp. 164570-164588

Authors: Afonso, Sergio; Almeida, Francisco (Univ La Laguna Dept Comp Engn & Syst San Cristobal De La Lagu 38200 Spain)

Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose highly-parallel workloads. Harnessing the performance that these accelerators provide requires the use of specialized native programming interfaces, such as CUDA or OpenCL, or higher-level programming models like OpenMP or OpenACC. However, on managed programming languages, offloading execution into GPUs is much harder and error-prone, mainly due to the need to call through a native API (Application Programming Interface), and because of mismatches between value and reference semantics. The Fancier framework provides a unified interface to Java, C/C++, and OpenCL C compute kernels, together with facilities to smooth the transitions between these programming languages. This combination of features makes GPU acceleration on Java much more approachable. In addition, Fancier Java code can be directly translated into equivalent C/C++ or OpenCL C code easily, which simplifies the implementation of higher-level abstractions targeting GPU or parallel execution on Java. Furthermore, it reduces the programming effort without adding significant overhead on top of the necessary OpenCL and Java Native Interface (JNI) API calls. We validate our approach on several image processing workloads running on different Android devices.

Keywords: Java Codes programming Standards Runtime Libraries parallel programming Application programming interfaces hardware acceleration heterogeneous systems image processing mobile computing parallel programming performance analysis

Performance Implications of Thread Count on OS Level Factors in Multithreaded Applications

6th International Conference on Computing, Communication, Control and Automation, ICCUBEA 2022

Authors: Malave, Sachin; Shinde, Subhash (Lokmanya Tilak College of Engineering Computer Department New Mumbai India)

ISBN: (print) 9781665484510

In high-performance computing, picking the right number of threads to gain a good speedup is important, as many OS-level parameters are influenced by even slight adjustments in thread count. These parameters are required by the operating system for process management and should not be ignored. They also contribute overhead to the running program, which can mount up quickly if not properly managed. Using too many threads in the system raises overheads, but using too few threads in the system significantly reduces performance. In this paper, the impact of page faults, CPU migrations, CPU utilisation, and context switching on execution time is investigated. The proposed work is simulated on a dual-socket Intel Xeon E5-2603 v3 using the well-known benchmark PARSEC 3.0. After studying performance parameters, simulation results reveal that running multithreaded programs with a correct number of threads can result in greater speedup and save overall system time. © 2022 IEEE.

Keywords: parallel programming

A Message Passing Interface Library for High-Level Synthesis on Multi-FPGA Systems

15th IEEE International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2022

Authors: Hironaka, Kazuei; Iizuka, Kensuke; Amano, Hideharu (Keio University Dept. of Information and Computer Science Yokohama Japan)

ISBN: (print) 9781665464994

One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is a lack of support for a programming interface. Implementing and debugging an application on multiple FPGA boards is difficult without a standard interface. Message Passing Interface (MPI) is a standard parallel programming interface commonly used in distributed memory systems. This paper presents a tool-independent MPI library called FiC-MPI that can be used in HLS for multi-FPGA systems in which each FPGA node is connected directly. By using FiC-MPI, various parallel software, including a general-purpose benchmark, can be easily implemented. FiC-MPI was implemented and evaluated on the M-KUBOS cluster consisting of Zynq MPSoC boards connected with a static time-division multiplexing network. By using the FiC-MPI simulator, parallel programs can be debugged before implementing on real machines. As a case study, the Himeno-BMT benchmark was implemented with FiC-MPI. It achieved 178.7 MFLOPS with a single node and scaled to 643.7 MFLOPS with four nodes, and 896.9 MFLOPS with six nodes of the M-KUBOS cluster. Through the implementation, the easiness of developing parallel programs with FiC-MPI on multi-FPGA systems was demonstrated. © 2022 IEEE.

Keywords: parallel programming
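The scaling figures quoted in the FiC-MPI abstract can be turned into speedup and parallel efficiency with a few lines of arithmetic. The sketch below is only an illustration: the three MFLOPS values come from the abstract, while the function name `scaling` and the choice of the single-node run as baseline are assumptions made here.

```python
# Throughput figures quoted in the abstract (MFLOPS), keyed by node count.
mflops = {1: 178.7, 4: 643.7, 6: 896.9}


def scaling(measured, baseline_nodes=1):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    `scaling` is a hypothetical helper for this note, not part of FiC-MPI.
    """
    base = measured[baseline_nodes]
    result = {}
    for nodes, rate in sorted(measured.items()):
        speedup = rate / base
        # Efficiency is the speedup divided by the growth in node count.
        efficiency = speedup / (nodes / baseline_nodes)
        result[nodes] = (speedup, efficiency)
    return result


table = scaling(mflops)
for nodes, (speedup, efficiency) in table.items():
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions the four-node run works out to roughly 3.60x speedup (about 90% efficiency) and the six-node run to roughly 5.02x (about 84% efficiency).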

The future of aliasing in parallel programming

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2013, vol. 7850, pp. 501-502

Author: Bocchino Jr., Robert L. (Carnegie Mellon University United States)

ISBN: (print) 9783642369452

In recent years, the research community has made great strides in alias annotations that support parallel programming [1]. Using these techniques, programmers no longer have to guess where aliased mutable state may cause unintended data races or nondeterminism; instead, such problems can simply be eliminated, either at compile time or at runtime. This represents a major advance in the safety and reliability of parallel code. © Springer-Verlag Berlin Heidelberg 2013.

Keywords: parallel programming

Parallel Computing of 3-D FEA Including Matrix Preconditioning for Analysis of Rotating Machines Coupled With Circuit Equations

IEEE TRANSACTIONS ON MAGNETICS, 2021, vol. 57, no. 6, pp. 1-4

Authors: Utsunomiya, Ryouma; Yamazaki, Katsumi (Chiba Inst Technol Dept Elect & Elect Engn Narashino Chiba 2750016 Japan)

In this article, we propose a parallel computing method of 3-D finite-element analysis coupled with circuit equations for characteristic calculation of rotating machines. In the proposed method, the preconditioning part of the matrix solver is parallelized along with the other parts, in order to obtain a stable solution within a short computational time. The proposed method is applied to the loss calculation of an interior permanent magnet synchronous motor fed by an inverter to clarify the advantages.

Keywords: Eddy currents finite-element methods parallel programming permanent magnet motors

Optimizing the Cray Graph Engine for performant analytics on cluster, SuperDome Flex, Shasta systems and cloud deployment

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2024, vol. 36, no. 10, pp. e7982-e7982

Authors: Rickett, Christopher D.; Maschhoff, Kristyn J.; Sukumar, Sreenivas R. (Hewlett Packard Enterprise Spring TX 77389 USA)

We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependences on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, Infiniband clusters and Slingshot interconnect-based Shasta systems.

Keywords: Cray Graph Engine graph analytics parallel programming pattern mining pattern search PGAS semantics

Towards Automatic Block Size Tuning for Image Processing Algorithms on CUDA

17th International Conference on Software Technologies (ICSOFT)

Authors: Guerfi, Imene; Kriaa, Lobna; Saidane, Leila Azouz (Univ Manouba Natl Sch Comp Sci ENSI CRISTAL Lab RAMSIS Pole Manouba Tunisia)

ISBN: (print) 9789897585883

With the growing amount of data, computational power has become highly required in all fields. To satisfy these requirements, the use of GPUs seems to be the appropriate solution. But one of their major setbacks is their varying architectures, which make writing efficient parallel code very challenging due to the necessity to master the GPU's low-level design. CUDA offers more flexibility for the programmer to exploit the GPU's power with ease. However, tuning the launch parameters of its kernels, such as block size, remains a daunting task. This parameter requires a deep understanding of the architecture and the execution model to be well-tuned. Particularly, in the Viola-Jones algorithm, the block size is an important factor that improves the execution time, but this optimization aspect is not well explored. This paper aims to offer the first steps toward automatically tuning the block size for any input without having a deep knowledge of the hardware architecture, which ensures the automatic portability of the performance over different GPU architectures. The main idea is to define techniques on how to get the optimum block size to achieve the best performance. We pointed out the impact of using static block size for all input sizes on the overall performance. In light of the findings, we presented two dynamic approaches to select the best block size suitable to the input size. The first one is based on an empirical search; this approach provides the optimal performance; however, it is tough for the programmer, and its deployment is time-consuming. In order to overcome this issue, we proposed a second approach, which is a model that automatically selects a block size. Experimental results show that this model can improve the execution time by up to 2.5x over the static approach.

Keywords: GPU Computing parallel programming Program Optimization Auto-tuning and Face Detection
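The "empirical search" approach mentioned in the block-size abstract can be pictured as trying every candidate block size and keeping the fastest one for a given input size. The following is a minimal sketch of that idea only, not the authors' method: `toy_kernel_time` is a hypothetical cost function standing in for a real timed CUDA kernel launch, and the candidate range (multiples of the 32-thread warp up to 1024 threads per block) is an assumption chosen for illustration.

```python
import math

WARP_SIZE = 32         # threads per warp on NVIDIA GPUs
MAX_BLOCK_SIZE = 1024  # assumed per-block thread limit for this sketch


def candidate_block_sizes():
    """Candidate block sizes: multiples of the warp size up to the limit."""
    return list(range(WARP_SIZE, MAX_BLOCK_SIZE + 1, WARP_SIZE))


def toy_kernel_time(block_size, n_items):
    """Hypothetical stand-in for a measured kernel run time.

    Charges a fixed cost per launched block plus a small cost per idle
    thread in the padded last block. A real search would time the kernel.
    """
    blocks = math.ceil(n_items / block_size)
    idle_threads = blocks * block_size - n_items
    return blocks * 1.0 + idle_threads * 0.05


def empirical_search(n_items, time_fn=toy_kernel_time):
    """Return the candidate block size with the lowest reported time."""
    return min(candidate_block_sizes(), key=lambda b: time_fn(b, n_items))


for n in (100, 1024):
    print(n, "items -> block size", empirical_search(n))
```

With this toy cost, an input of 100 items selects a block size of 128 and an input of 1024 items selects 1024, which illustrates why a single static block size may not suit every input size. Replacing `time_fn` with a function that actually launches and times a kernel would give the exhaustive search its real meaning, at the deployment cost the abstract describes.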

SwarmL: A Language for Programming Fully Distributed Intelligent Building Systems

BUILDINGS, 2023, vol. 13, no. 2, p. 499

Authors: Chen, Wenjie; Yang, Qiliang; Jiang, Ziyan; Xing, Jianchun; Zhao, Shuo; Zhou, Qizhen; Han, Deshuai; Feng, Bowei (Army Engn Univ PLA Coll Def Engn Nanjing 211101 Peoples R China; Tsinghua Univ Bldg Energy Res Ctr Beijing 100084 Peoples R China; China Xian Satellite Control Ctr Xian 710043 Peoples R China; Rocket Force Univ Engn Coll Combat Support Xian 710025 Peoples R China)

Fully distributed intelligent building systems can be used to effectively reduce the complexity of building automation systems and improve the efficiency of the operation and maintenance management because of its self-organization, flexibility, and robustness. However, the parallel computing mode, dynamic network topology, and complex node interaction logic make application development complex, time-consuming, and challenging. To address the development difficulties of fully distributed intelligent building system applications, this paper proposes a user-friendly programming language called SwarmL. Concretely, SwarmL (1) establishes a language model, an overall framework, and an abstract syntax that intuitively describes the static physical objects and dynamic execution mechanisms of a fully distributed intelligent building system, (2) proposes a physical field-oriented variable that adapts the programming model to the distributed architectures by employing a serial programming style in accordance with human thinking to program parallel applications of fully distributed intelligent building systems for reducing programming difficulty, (3) designs a computational scope-based communication mechanism that separates the computational logic from the node interaction logic, thus adapting to dynamically changing network topologies and supporting the generalized development of the fully distributed intelligent building system applications, and (4) implements an integrated development tool that supports program editing and object code generation. To validate SwarmL, an example application of a real scenario and a subject-based experiment are explored. The results demonstrate that SwarmL can effectively reduce the programming difficulty and improve the development efficiency of fully distributed intelligent building system applications. SwarmL enables building users to quickly understand and master the development methods of application tasks in fully distributed intelligent

Keywords: swarm intelligence fully distributed intelligent building system parallel programming domain-specific language

Towards Safer Parallel STL Usage

16th IEEE International Scientific Conference on Informatics, Informatics 2022

Authors: Barth, Benjamin; Szalay, Richard; Porkolab, Zoltan (Eötvös Loránd University Faculty of Informatics Budapest Hungary; Eötvös Loránd University Department of Programming Languages and Compilers Budapest Hungary)

ISBN: (print) 9798350310344

Effective and safe parallel programming is among the biggest challenges of today's software technology. The C++17 standard introduced parallel STL: a set of overloaded functions taking an additional 'execution policy' parameter in the Algorithms chapter of the Standard library. During the years since its introduction, a few shortcomings of parallel STL have been revealed. While the Standard defines the semantics of the individual algorithms, adherence to their abstract requirements (e.g., absolutely no data races or deadlocks during the evaluation of a predicate or other customisation point) is up to the developer. Experience shows that programmers frequently make mistakes and write erroneous code, which is hard to debug. In this paper, we investigate some of the critical issues of the parallel STL library and suggest improvements to increase its safety. While a fully automatic detection of erroneous constructs is computationally infeasible, we introduce a framework with which the user will be able to indicate (axiomatically, based on absolute trust) that an operation has 'safe' properties, e.g., commutativity of certain functors. We implemented a prototype of the proposed framework to demonstrate its usability and effectiveness. © 2022 IEEE.

Keywords: parallel programming
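The parallel STL abstract singles out properties such as commutativity of functors as something a developer must vouch for. The reason is that a parallel reduction may regroup and reorder its operands. The sketch below illustrates that effect sequentially in Python; it is not the C++ parallel STL and not the framework proposed in the paper, and the helper `chunked_reduce` is a hypothetical name introduced here.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, chunks, reverse_partials=False):
    """Reduce each chunk separately, then combine the partial results.

    Mimics how a parallel reduction may regroup work (chunking) and
    combine partial results in a different order (reverse_partials).
    Runs sequentially; `values` must be non-empty.
    """
    size = -(-len(values) // chunks)  # ceiling division
    partials = [reduce(op, values[i:i + size])
                for i in range(0, len(values), size)]
    if reverse_partials:
        partials.reverse()
    return reduce(op, partials)


numbers = [10, 3, 2, 1]
letters = ["a", "b", "c", "d"]

# Addition is associative and commutative: every grouping and order agrees.
print(reduce(operator.add, numbers), chunked_reduce(operator.add, numbers, 2))

# Subtraction is not associative: regrouping changes the result.
print(reduce(operator.sub, numbers), chunked_reduce(operator.sub, numbers, 2))

# String concatenation is associative but not commutative: regrouping is
# harmless, yet combining the partial results in another order is not.
print(chunked_reduce(operator.add, letters, 2),
      chunked_reduce(operator.add, letters, 2, reverse_partials=True))
```

Here addition yields 16 either way, subtraction yields 4 sequentially but 6 when chunked, and concatenation yields "abcd" or "cdab" depending on the order in which partial results are combined. A checker cannot in general prove such properties for arbitrary user code, which is consistent with the abstract's choice to let the user assert them.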

Flexible task-DAG management in PHAST library: Data-parallel tasks and orchestration support for heterogeneous systems

Authors: Peccerillo, Biagio; Bartolini, Sandro (Department of Information Engineering and Mathematical Sciences University of Siena Siena Italy)

Heterogeneous architectures proved successful in achieving unprecedented performance and energy-efficiency. However, taking advantage of these diverse processing elements is still hard. Programmers need to code through the different approaches suitable for each target architecture and need to decide the distribution of activities on the different resources. The majority of current frameworks focuses on either performance or productivity. The former mainly provides low-level target-specific programming interfaces, and the latter offers high-level tools that often fail in achieving high-performance. In both cases, the design is usually data-parallel, as task-parallelism is not supported. In this work, we propose a task-based solution within the data-parallel heterogeneous single-source PHAST library. Tasks can be coded in a target-agnostic fashion, can be compiled and parallelized on multi-core CPUs and NVIDIA GPUs automatically and support the choice of the execution platform at runtime. We evaluate the capabilities of the proposed task-directed acyclic graph support in case of an extensive set of randomly generated task-based applications with different sizes and characteristics. We compare it against a SYCL implementation in terms of performance and complexity metrics, highlighting that PHAST achieves about 1.56× and 2.60× speedup over SYCL for multi-core CPU and GPU, respectively, while improving also code complexity metrics. © 2020 John Wiley & Sons, Ltd.

Keywords: parallel programming
