Search Results - Inner Mongolia University Library

International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)

Authors: Dorin-Marian Ionita, Filip-George Manole, Emil-Ioan Slusanschi (University Politehnica of Bucharest, Bucharest, Romania)

ISBN: (digital) 9781728176284

ISBN: (print) 9781728176291

A common pattern in high performance scientific computing is the structured grid pattern, in which one or more elements of a matrix are computed as a stencil operation over neighbouring matrix elements. Since there are multiple options to efficiently implement this pattern on modern computing architectures, we provide a comparison of the performance of a number of parallel implementations on a multi-core system with GPU capabilities and also on an FPGA embedded inside a SoC. The application used for this case study implements the propagation of wireless signals in a bi-dimensional environment, considering reflections and signal attenuation. The parallel programming paradigms examined in this paper include CUDA, TBB, Rust, OpenMP, and HLS as a hardware description paradigm, with CUDA proving to be the fastest implementation.

Keywords: Wireless communication; Scientific computing; Parallel programming; Graphics processing units; Numerical simulation; Reflection; Numerical models
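
The structured grid (stencil) pattern named in the abstract above can be illustrated with a minimal sequential sketch. This is not the paper's signal propagation model: the 4-neighbour averaging rule, the function name `stencil_step`, and the 3x3 test grid are all assumptions chosen for illustration only.

```python
def stencil_step(grid):
    """Apply one 4-neighbour averaging stencil sweep (hypothetical rule).

    Each interior cell of the returned grid is the mean of its north,
    south, west, and east neighbours in the input grid. Boundary cells
    are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy, so reads always use the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = (
                grid[i - 1][j] + grid[i + 1][j] + grid[i][j - 1] + grid[i][j + 1]
            ) / 4.0
    return out


# Illustrative 3x3 grid with a single non-zero boundary value.
grid = [[0.0] * 3 for _ in range(3)]
grid[0][1] = 4.0
result = stencil_step(grid)
print(result[1][1])
```

Because every output cell depends only on the previous grid, the iterations of the two loops are independent of one another, which is what makes the pattern a natural fit for parallel frameworks such as those compared in the paper.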

Approach for Accelerating Image Enhancement Processes: Optimized OpenCL Architecture

International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT)

Authors: Menduh Furkan Aslan, Tuna Göksu (Elektrik-Elektronik Mühendisliği Bölümü, Antalya Bilim Üniversitesi, Antalya, Türkiye)

ISBN: (digital) 9781728190907

ISBN: (print) 9781728190914

Computer technology, which continues to develop today, often has difficulty meeting the needs of signal and image processing software. As technology develops, software needs larger memory and faster processors. Parallel programming methods have been developed to solve the speed problems of processors. In this study, OpenCL-based image enhancement applications that can run in parallel on the graphics processing unit have been implemented. The OpenCL architecture has been optimized to maximize the amount of acceleration. Appropriate image enhancement applications have been tested to observe that the designed algorithm and architecture are successful in both simple and complex operations. To quantify the speed gain, the same applications were developed with a serial programming technique and the results were compared with those of the parallel applications. The comparison results support the claim that parallel programming is better in terms of performance. With parallel programming on the hardware used, calculation times were reduced by factors ranging from 1.58 to 561.

Keywords: Graphics; Parallel programming; Computer architecture; Software; Hardware; Acceleration; Image enhancement

Road Recognition System with Heuristic Method and Machine Learning

International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA)

Authors: Hagai Raja Sinulingga, Rinaldi Munir (School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Medan, Indonesia)

ISBN: (print) 9781728180397

Road recognition provides essential information for determining the movement of an autonomous vehicle. Recent research has shown that machine learning can be used to obtain this information from images. Nevertheless, such systems could be improved in effectiveness and efficiency. This research proposes finding better feature combinations and using an Artificial Neural Network algorithm to build a higher-accuracy road detection model for better effectiveness. A Region of Interest module using a heuristic method is also applied to reduce computation for better efficiency. These three new modules are implemented and combined with the road recognition module to form a road recognition system. The performance of the proposed method is then tested and compared with the latest research. The experimental results show that the Artificial Neural Network cannot increase the system's effectiveness. Nonetheless, with the right features and the region of interest module, the proposed system gives better performance. The prototype's accuracy increased from an F1-score of 0.94 to 0.95, and its speed increased from 99 to 112 frames processed per second.

Keywords: Machine learning algorithms; Parallel programming; Roads; Prototypes; Artificial neural networks; Streaming media; Real-time systems

DtCraft: A High-Performance Distributed Execution Engine at Scale

Authors: Huang, Tsung-Wei; Lin, Chun-Xun; Wong, Martin D. F. (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL 61801, United States)

Recent years have seen rapid growth in data-driven distributed systems, such as Hadoop MapReduce, Spark, and Dryad. However, the counterparts for high-performance or compute-intensive applications, including large-scale optimizations, modeling, and simulations, are still nascent. In this paper, we introduce DtCraft, a modern C++ based distributed execution engine to streamline the development of high-performance parallel applications. Users need no understanding of distributed computing and can focus on high-level development, leaving difficult details, such as concurrency controls, workload distribution, and fault tolerance, to be handled transparently by our system. We have evaluated DtCraft on both micro-benchmarks and large-scale optimization problems, and have shown promising performance from single multicore machines to clusters of computers. In a particular semiconductor design problem, we achieved a 30× speedup with 40 nodes and 15× less development effort than a hand-crafted implementation. © 1982-2012 IEEE.

Keywords: parallel programming
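
To put the reported 30× speedup on 40 nodes in context, the standard definition of parallel efficiency (speedup divided by the number of workers) can be worked through in a short sketch. The function name `parallel_efficiency` is hypothetical, and the calculation assumes the speedup is measured against a single-node baseline, which the abstract does not state.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Return parallel efficiency, defined as speedup / number of workers."""
    if workers <= 0:
        raise ValueError("workers must be a positive integer")
    return speedup / workers


# Figures taken from the abstract: 30x speedup with 40 nodes.
# Assumption: the baseline is one node; the abstract does not say so.
efficiency = parallel_efficiency(30.0, 40)
print(f"{efficiency:.2f}")
```

Under that assumption, each node would contribute 75% of the throughput it would under ideal linear scaling.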

Rigel: A framework for OpenMP performance tuning

21st IEEE International Conference on High Performance Computing and Communications, 17th IEEE International Conference on Smart City and 5th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2019

Authors: Amarasinghe Baragamage, Piyumi Rameshka; Senanayake, Pasindu; Kannangara, Thulana; Seneviratne, Praveen; Jayasena, Sanath; Patabandi, Tharindu Rusira; Hall, Mary (Dept of Computer Science and Engineering, University of Moratuwa, Katubedda, Sri Lanka; School of Computing, University of Utah, Salt Lake City, UT, United States)

ISBN: (print) 9781728120584

OpenMP allows developers to harness the power of shared memory multiprocessing in C and C++ applications, but the performance gained with OpenMP is highly sensitive to the underlying hardware, making performance portability across different hardware architectures fragile. For example, in mapping a parallel for loop to hardware, OpenMP 4 offers commands for exploiting vector instructions (simd directives) and automatic GPU offloading (target directives), as well as schedule directives for CPU load balancing. These benefits come at a cost. A developer has to be well aware of the architecture details and the application, and must iteratively tune to determine the best combination of pragma directives for delivering higher performance on the given target architecture. Hence, in this paper we introduce Rigel, a framework that automates these decisions to arrive at optimized OpenMP-annotated code. Given a code segment with inherent parallelism, our framework uses separate machine learning classification models to predict the anticipated benefit of each optimization. Both the Vector Classification and GPU Offloading Classification models perform with average accuracies of 83%. Following the classification process, code segments are optimized accordingly. Our results show that the GPU offloading optimization leads to an average speedup of 8x over default non-optimized CPU parallel execution (pragma omp parallel for), and the average Vector optimization speedup is 6x compared to LLVM Clang 4.0 auto-vectorization. Furthermore, the scheduling mechanism selection process achieves an overall accuracy of 90%. © 2019 IEEE.

Keywords: parallel programming

Language Constructs and Semantics for Runtime-independent Parallelism Expression on Heterogeneous Systems

5th IEEE International Conference on Computer and Communications, ICCC 2019

Authors: Wu, Shusen; Dong, Xiaoshe; Wang, Yufei; Chen, Weiduo (Xi'an Jiaotong University, School of Electronic and Information Engineering, Xi'an, China)

ISBN: (digital) 9781728147437

ISBN: (print) 9781728147437

The emergence of heterogeneous processors such as GPUs provides massively parallel computing power but also exacerbates the difficulties of parallel programming. Although low-level programming methods such as CUDA and OpenCL can yield good performance, programming productivity is poor and applications lack portability. In this paper, we present a core language, Ruler, which extends C with high-level parallel constructs. These constructs enable programmers to express parallelism in programs without concern for runtime details, thus easing programming. We present the operational semantics of the language and show how these constructs preserve the parallel patterns and parallelism degree of high-level applications. This information can inform the compiler to generate efficient code and maintain performance on different platforms. We have implemented a compiler and runtime system for Ruler on top of OpenCL. Multiple benchmarks are rebuilt with Ruler and evaluated on both an NVIDIA GPU and an Intel MIC platform to demonstrate the effectiveness of our techniques. The size of the Ruler code is only 13%-64% of that of the OpenCL code. The rebuilt benchmarks execute smoothly on both platforms after compilation, yielding performance competitive with that of handcrafted OpenCL benchmark code on both platforms. © 2019 IEEE.

Keywords: parallel programming

Fast scale and illumination invariant method for region labeling

Author: Savolainen, Tuomas (Aalto University)

Degree level: Master's

This work describes how to find 3D objects in 2D images. The images may contain various illumination conditions and backgrounds. Furthermore, the distance and the rotation of the camera with respect to the object can be arbitrary. The method described in this work provides a way to reduce the computation time of the 3D object localization problem by searching only those regions of the image that include a combination of the most common colors of the object. The accuracy and speed of the implementation are tested on images taken under various illuminations and backgrounds.

Keywords: GPU; parallel programming; machine vision; inertial measurement

Memory-Latency-Accuracy Trade-offs for Continual Learning on a RISC-V Extreme-Edge Node

arXiv, 2020

Authors: Ravaglia, Leonardo; Rusci, Manuele; Capotondi, Alessandro; Conti, Francesco; Pellegrini, Lorenzo; Lomonaco, Vincenzo; Maltoni, Davide; Benini, Luca (DISI, University of Bologna, Italy; DEI, University of Bologna, Italy; FIM, University of Modena and Reggio Emilia, Italy; IIS, ETH Zurich, Switzerland)

AI-powered edge devices currently lack the ability to adapt their embedded inference models to the ever-changing environment. To tackle this issue, Continual Learning (CL) strategies aim at incrementally improving the decision capabilities based on newly acquired data. In this work, after quantifying the memory and computational requirements of CL algorithms, we define a novel HW/SW extreme-edge platform featuring a low-power RISC-V octa-core cluster tailored for on-demand incremental learning over locally sensed data. The presented multi-core HW/SW architecture achieves a peak performance of 2.21 and 1.70 MAC/cycle, respectively, when running the forward and backward steps of gradient descent. We report the trade-off between memory footprint, latency, and accuracy for learning a new class with Latent Replay CL when targeting an image classification task on the CORe50 dataset. A CL setting that retrains all the layers takes 5 hours to learn a new class and achieves up to 77.3% precision; a more efficient solution retrains only part of the network, reaching an accuracy of 72.5% with a memory requirement of 300 MB and a computation latency of 1.5 hours. At the other extreme, retraining only the last layer results in the fastest (867 ms) and least memory-hungry (20 MB) solution, but scores 58% on the CORe50 dataset. Thanks to the parallelism of the low-power cluster engine, our HW/SW platform is 25× faster than a typical MCU device, on which CL is still impractical, and demonstrates an 11× gain in terms of energy consumption with respect to mobile-class solutions. Copyright © 2020, The Authors. All rights reserved.

Keywords: parallel programming

Retrofitting Parallelism onto OCaml

arXiv, 2020

Authors: Sivaramakrishnan, K.C.; Dolan, Stephen; White, Leo; Jaffer, Sadiq; Kelly, Tom; Sahoo, Anmol; Parimala, Sudha; Dhiman, Atul; Madhavapeddy, Anil (IIT Madras OCaml Labs Jane Street and University of Cambridge Computer Laboratory Opsian and OCaml Labs IIT Madras University of Cambridge Computer Laboratory OCaml Labs)

OCaml is an industrial-strength, multi-paradigm programming language, widely used in industry and academia. OCaml is also one of the few modern managed system programming languages to lack support for shared memory parallel programming. This paper describes the design, a full-fledged implementation, and an evaluation of a mostly-concurrent garbage collector (GC) for the multicore extension of the OCaml programming language. Given that we propose to add parallelism to a widely used programming language with millions of lines of existing code, we face the challenge of maintaining backwards compatibility, not just in terms of the language features but also in the performance of single-threaded code running with the new GC. To this end, the paper presents a series of novel techniques and demonstrates that the new GC strikes a balance between performance and feature backwards compatibility for sequential programs and scales admirably on modern multicore processors. Copyright © 2020, The Authors. All rights reserved.

Keywords: parallel programming

Hyaline: A Transparent Distributed Computing Framework for CUDA

5th IEEE International Conference for Convergence in Technology, I2CT 2019

Authors: Jain, Akshat; Kazi, Owais; Joshi, Raviraj; Basantwani, Shraddha; Bang, Yogita; Khengare, Rahul (PICT, Computer Engineering, Pune, India; IIT, Computer Science, Madras, India)

ISBN: (print) 9781538680759

The GPU usually handles homogeneous data-parallel work by taking advantage of its massive number of cores. In most applications, CUDA programming is used to harness the power of the GPU. In data-intensive, computationally demanding applications such as neural networks, using the GPU on a single machine is time consuming. If instead multiple GPUs are used across a network, the amount of time required will be significantly reduced. Traditionally, to enable a set of programs to run in a distributed environment, the programmer has to design components so that the system is dynamic and resilient to changes in the number of systems in the cluster, which is a daunting task. Such work distribution can be a poor solution, as it may underutilize the GPUs. In our approach, we developed a framework which transparently distributes data-parallel kernels across multiple GPUs in a distributed network. The programmer is responsible for developing a single data-parallel kernel in CUDA, while the framework automatically distributes the workload across an arbitrary set of CUDA-enabled GPUs. An optimal distribution is chosen depending on the current workload on the GPUs and the amount of data to be processed. The goal is to maximally utilize the available resources with minimal programming complexity. Systems not compatible with CUDA can also use our solution. We expect our framework to reduce processing time while simplifying the task of programmers. © 2019 IEEE.

Keywords: parallel programming
