ISBN: (Print) 9781728162515
Some graph analyses, such as those of social and biological networks, need large-scale graph construction and maintenance over distributed memory space. Distributed data-streaming tools, including MapReduce and Spark, restrict some computational freedom of incremental graph modification and run-time graph visualization. Instead, we take an agent-based approach. We construct a graph from a scientific dataset in CSV, tab, or XML format; dispatch many reactive agents onto it; and analyze the graph in the form of their collective group behavior: propagation, flocking, and collision. The key to success is how to automate the run-time construction and visualization of agent-navigable graphs mapped over distributed memory. We implemented this distributed graph-computing support in the multi-agent spatial simulation (MASS) library, coupled with the Cytoscape graph visualization software. This paper presents the MASS implementation techniques and demonstrates its execution performance in comparison to MapReduce and Spark, using two benchmark programs: (1) an incremental construction of a complete graph and (2) a KD-tree construction.
ISBN: (Print) 9781450368827
Read-copy update (RCU) can provide ideal scalability for read-mostly workloads, but some believe that it provides only poor performance for updates. This belief is due to the lack of RCU-centric update synchronization mechanisms. RCU instead works with a range of update-side mechanisms, such as locking. In fact, many developers embrace simplicity by using global locking. Logging, hardware transactional memory, or fine-grained locking can provide better scalability, but each of these approaches has limitations, such as imposing overhead on readers or poor scalability on nonuniform memory access (NUMA) systems, mainly due to their lack of NUMA-aware design principles. This paper introduces an RCU extension (RCX) that provides highly scalable RCU updates on NUMA systems while retaining RCU's read-side benefits. RCX is a software-based synchronization mechanism combining hardware transactional memory (HTM) and traditional locking based on our NUMA-aware design principles for RCU. Microbenchmarks on a NUMA system having 144 hardware threads show RCX has up to 22.6 times better performance and up to 145 times lower HTM abort rates compared to a state-of-the-art RCU/HTM combination. To demonstrate the effectiveness and applicability of RCX, we have applied RCX to parallelize parts of the Linux kernel memory management system and an in-memory database system. The optimized kernel and the database show up to 24 and 17 times better performance compared to the original versions, respectively.
ISBN: (Print) 9783030410056; 9783030410049
Performance Portability frameworks allow developers to write code for a familiar High-Performance Computing (HPC) architecture and minimize the development effort needed over time to port it to other HPC architectures with little to no loss of performance. In our research, we conducted experiments with the same codebase in Serial, OpenMP, and CUDA execution and memory spaces and compared it to the Kokkos Performance Portability framework. We assessed how well these approaches meet the goals of Performance Portability by solving a thermal conduction model on a 2D plate on multiple architectures (NVIDIA (K20, P100, V100, XAVIER), Intel Xeon, IBM Power 9, ARM64) and collected execution times (wall-clock) and performance counters with perf and nvprof for analysis. We used the Serial model to determine a baseline and to confirm that the model converges in both the native and Kokkos code. The OpenMP and CUDA models were used to analyze the parallelization strategy as compared to the Kokkos framework for the same execution and memory spaces.
ISBN: (Print) 9781728166957
The cloud parallel programming system CPPS, under development at the Institute of Informatics Systems, aims to be an interactive visual environment of functional and parallel programming to support computer science teaching and learning. The system will support the development, verification, and debugging of architecture-independent parallel Cloud Sisal programs and their correct conversion into efficient code for parallel computing systems for execution in clouds. In the paper, methods and tools of the CPPS system intended for formal verification of Cloud Sisal programs are described.
The development of autonomous Unmanned Aerial Vehicles (UAVs) is a priority to many civilian and military organizations. Real time optimal trajectory planning is an essential element for the autonomy of UAVs. The use ...
ISBN: (Print) 9781728146614
Recently, OpenCL has been emerging as a programming model for energy-efficient FPGA accelerators. However, the state-of-the-art OpenCL frameworks for FPGAs suffer from poor performance and usability. This paper proposes a high-level synthesis framework of OpenCL for FPGAs, called SOFF. It automatically synthesizes a datapath to execute many OpenCL kernel threads in a pipelined manner. It also synthesizes an efficient memory subsystem for the datapath based on the characteristics of OpenCL kernels. Unlike previous high-level synthesis techniques, we propose a formal way to handle variable-latency instructions, complex control flows, OpenCL barriers, and atomic operations that appear in real-world OpenCL kernels. SOFF is the first OpenCL framework that correctly compiles and executes all applications in the SPEC ACCEL benchmark suite, except three applications that require more FPGA resources than are available. In addition, SOFF achieves a speedup of 1.33 over the Intel FPGA SDK for OpenCL without any explicit user annotation or source code modification.
ISBN: (Print) 9781728192192
We present the implementation of two sparse linear algebra kernels on a migratory memory-side processing architecture. The first is the Sparse Matrix-Vector (SpMV) multiplication, and the second is the Symmetric Gauss-Seidel (SymGS) method. Both were chosen as they account for the largest share of the run time of the HPCG benchmark. We introduce the system used for the experiments, as well as its programming model and the key aspects of getting the most performance from it. We describe the data distribution used to allow an efficient parallelization of the algorithms, and their actual implementations. We then present hardware results and simulator traces to explain their behavior. We show almost linear strong scaling of the code, and discuss future work and improvements.
ISBN: (Print) 9781728146614
GPUs have emerged as a key computing platform for an ever-growing range of applications. Unlike traditional bulk-synchronous GPU programs, many emerging GPU-accelerated applications, such as graph processing, have irregular interaction among the concurrent threads. Consequently, they need complex synchronization. To enable both high performance and adequate synchronization, GPU vendors have introduced scoped synchronization operations that allow a programmer to synchronize within a subset of concurrent threads (a.k.a. a scope) that she deems adequate. Scoped synchronization avoids the performance overhead of synchronization across thousands of GPU threads while ensuring correctness when used appropriately. This flexibility, however, could be a new source of incorrect synchronization, where a race can occur due to insufficient scope of the synchronization operation, and not due to missing synchronization as in a typical race. We introduce ScoRD, a race detector that enables hardware support for efficiently detecting global memory races in a GPU program, including those that arise due to insufficient scopes of synchronization operations. We show that ScoRD can detect a variety of races with a modest performance overhead (on average, 35%). In the process of this study, we also created a benchmark suite consisting of seven applications and three categories of microbenchmarks that use scoped synchronization operations.
ISBN: (Print) 9781665422840
As HPC progresses toward exascale, writing applications that are highly efficient, portable, and supportive of programmer productivity is becoming more challenging than ever. The growing scale, diversity, and heterogeneity of compute platforms increase the burden on software to use available distributed parallel resources efficiently. This burden has fallen on developers who, increasingly, are experts in application domains rather than traditional computer scientists and engineers. We propose CASPER (Compiler Abstractions Supporting high Performance on Extreme-scale Resources), a novel domain-specific compiler and runtime framework to enable domain scientists to achieve high performance and scalability on complex HPC systems. CASPER extends domain-specific languages with machine learning to map software tasks to distributed, heterogeneous resources, and provides a runtime framework to support a variety of adaptive optimizations in dynamic environments. This paper presents an initial design and analysis of CASPER for the synthetic aperture radar and computational fluid dynamics domains.
ISBN: (Print) 9781728180991
AI-powered edge devices currently lack the ability to adapt their embedded inference models to the ever-changing environment. To tackle this issue, Continual Learning (CL) strategies aim at incrementally improving the decision capabilities based on newly acquired data. In this work, after quantifying the memory and computational requirements of CL algorithms, we define a novel HW/SW extreme-edge platform featuring a low-power RISC-V octa-core cluster tailored for on-demand incremental learning over locally sensed data. The presented multi-core HW/SW architecture achieves a peak performance of 2.21 and 1.70 MAC/cycle, respectively, when running the forward and backward steps of the gradient descent. We report the trade-off between memory footprint, latency, and accuracy for learning a new class with Latent Replay CL when targeting an image classification task on the CORe50 dataset. For a CL setting that retrains all the layers, taking 5 hours to learn a new class and achieving up to 77.3% precision, a more efficient solution retrains only part of the network, reaching an accuracy of 72.5% with a memory requirement of 300 MB and a computation latency of 1.5 hours. On the other hand, retraining only the last layer results in the fastest (867 ms) and least memory-hungry (20 MB) solution, but it scores only 58% on the CORe50 dataset. Thanks to the parallelism of the low-power cluster engine, our HW/SW platform is 25x faster than a typical MCU device, on which CL is still impractical, and demonstrates an 11x gain in energy consumption with respect to mobile-class solutions.