ISBN (print): 9781728176505
Machine learning techniques have been employed in virtually all domains in the past few years. New applications demand the ability to cope with dynamic environments such as data streams with transient behavior. Such environments impose new requirements, such as incrementally processing incoming data instances in a single pass under both memory and time constraints. Furthermore, prediction models often need to adapt to concept drifts observed in non-stationary data streams. Ensemble learning comprises a class of stream mining algorithms that has achieved remarkable prediction performance in this scenario. Implemented as a set of several individual component classifiers whose predictions are combined to classify new incoming instances, ensembles are naturally amenable to task parallelism. Despite their relevance, an efficient implementation of ensemble algorithms remains challenging. For example, the dynamic data structures used to model non-stationary data behavior and detect concept drifts cause inefficient memory usage patterns and poor cache performance in multi-core environments. In this paper, we propose a mini-batching strategy that can significantly reduce cache misses and improve the performance of several ensemble algorithms for stream mining in multi-core environments. We assess our strategy on four state-of-the-art ensemble algorithms using four widely used machine learning benchmark datasets with varied characteristics. Results from two different hardware platforms show speedups of up to 5X on 8-core processors with ensembles of 100 and 150 learners. These benefits come at the cost of changes in predictive performance.
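As a rough sketch of the mini-batching idea (in Python, not the paper's implementation), the loop below buffers incoming instances and lets each ensemble member process a whole batch while its data structures are still cache-resident; the predict_one/learn_one learner interface is a hypothetical stand-in for a stream-learning API.

    def minibatch_stream(stream, ensemble, batch_size=50):
        """Buffer the stream into mini-batches so each learner touches
        many instances consecutively, improving the temporal locality
        of its (dynamic) model state."""
        batch = []
        for x, y in stream:                       # single pass, as required
            batch.append((x, y))
            if len(batch) == batch_size:
                for learner in ensemble:          # parallelizable across learners
                    for xi, yi in batch:
                        learner.predict_one(xi)   # test-then-train protocol
                        learner.learn_one(xi, yi)
                batch.clear()

Without batching, each arriving instance visits all learners in turn, so every learner's state is evicted from cache between consecutive accesses; batching amortizes those misses across batch_size instances.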
OpenMP allows developers to harness the power of shared memory multiprocessing in C and C++ applications, but the performance gained with OpenMP is highly sensitive to the underlying hardware, making performance porta...
Unknown motif finding in a set of DNA sequences is an important step of understanding the functionality of a group of genes and it requires accuracy and efficiency. We propose and present high-performance computation ...
Machine learning algorithms have become a major tool in various applications. The high-performance requirements on large-scale datasets pose a challenge for traditional von Neumann architectures. We present two machine learning implementations and evaluations on PRINS, a novel processing-in-storage system based on resistive content addressable memory (ReCAM). PRINS functions simultaneously as a storage and a massively parallel associative processor. PRINS processing-in-storage resolves the bandwidth wall faced by near-data von Neumann architectures, such as a three-dimensional DRAM-and-CPU stack or an SSD with an embedded CPU, by keeping the computation inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that a PRINS-based processing-in-storage architecture may outperform existing in-storage designs and accelerator-based designs. Multiple performance comparisons for the ReCAM processing-in-storage implementations of K-means and K-nearest neighbors are performed. Compared platforms include CPU, GPU, FPGA, and the Automata Processor. We show that PRINS may achieve an order-of-magnitude speedup and improved power efficiency relative to all compared platforms.
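To illustrate why K-nearest neighbors maps well to an associative, row-parallel substrate, the NumPy sketch below applies one distance computation to every stored row at once; this only mimics the access pattern and is in no way the ReCAM implementation.

    import numpy as np

    def knn_row_parallel(stored, query, k):
        """Compute the query's distance to every stored row at once
        (vectorized here; row-parallel in an associative processor),
        then select the k nearest. K-means assignment follows the same
        pattern with one distance computation per centroid."""
        dists = np.sum((stored - query) ** 2, axis=1)
        return np.argpartition(dists, k)[:k]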
ISBN (print): 9781450362955
On multi-core processors, contention on shared resources such as the last-level cache (LLC) and memory bandwidth may cause serious performance degradation, which makes efficient resource allocation a critical issue in data centers. Intel recently introduced Memory Bandwidth Allocation (MBA) technology on its Xeon Scalable processors, which makes it possible to allocate memory bandwidth in a real system. However, how to make the most of MBA to improve system performance remains an open question. In this work, (1) we formulate a quantitative relationship between a program's performance and its LLC occupancy and memory request rate on commodity processors; (2) guided by the performance formula, we propose a heuristic bound-aware throttling algorithm to improve system performance; (3) we further develop a hierarchical clustering method to improve the algorithm's efficiency; and (4) we implement these algorithms in EMBA, a low-overhead dynamic memory bandwidth scheduling system that improves performance on Intel commodity processors. The results show that, when multiple programs run simultaneously on a multi-core processor whose memory bandwidth is saturated, programs with high memory bandwidth demand usually use bandwidth inefficiently, compared with programs with medium memory bandwidth demand, from the perspective of CPU performance. By slightly throttling the former's bandwidth, we can significantly improve the performance of the latter. On average, we improve system performance by 36.9% at the expense of 8.6% of bandwidth utilization.
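A hypothetical sketch of the bound-aware throttling heuristic described above (the bw_demand and set_mba_limit members are assumptions for illustration, not EMBA's API): when measured bandwidth saturates, the highest-demand programs are throttled slightly until the saturation clears.

    def bound_aware_throttle(programs, saturation_bw, throttle_pct=70):
        """If aggregate demand saturates memory bandwidth, cap the
        highest-demand programs first; on Linux this could be applied
        by writing an 'MB:' line to a resctrl group's schemata file."""
        total = sum(p.bw_demand for p in programs)
        for p in sorted(programs, key=lambda q: q.bw_demand, reverse=True):
            if total <= saturation_bw:
                break                          # bandwidth no longer saturated
            p.set_mba_limit(throttle_pct)      # hypothetical MBA hook
            total -= p.bw_demand * (1 - throttle_pct / 100)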
The proceedings contain 88 papers. The topics discussed include: analysis of industrial Ethernet used in the active surface system of QTT; an interactive platform of gesture and music based on the Myo armband and Processing; a quantitative study on the color of city landmark landscape architectures; research on risk identification for foreign companies investing in Mongolia's infrastructure construction industry based on complex network technology; application of the multi-dimensional linear fitting method in the establishment of the semi-autogenous grinding mill model; research on the construction and application of a cloud computing experiment platform for computer science general education courses; revisiting the current state-of-the-art multipath routing in ad hoc networks; surface blemishes of aluminum material image recognition based on transfer learning; and the establishment and analysis of a gas yield prediction model.
the skyline query over uncertain data streams, as an important aspect of big data analysis, plays a significant role in various domains like financial data analysis, environmental monitoring, and wireless sensor netwo...
ISBN (digital): 9781728165820
ISBN (print): 9781728165837
Data movement has long been identified as the biggest challenge facing the designers of modern computer systems. To tackle this challenge, many novel data compression algorithms have been developed. Variable-rate compression algorithms are often favored over fixed-rate ones. However, variable-rate decompression is difficult to parallelize. Most existing algorithms adopt a single parallelization strategy suited to a particular hardware platform. Such an approach fails to harness the parallelism found in diverse modern hardware architectures. We propose a parallelization method for tiled variable-rate compression algorithms that consists of multiple strategies that can be applied interchangeably. This allows an algorithm to apply the strategy most suitable for a specific hardware platform. Our strategies are based on generating metadata during encoding, which is used to parallelize the decoding process. To demonstrate the effectiveness of our strategies, we implement them in a state-of-the-art compression algorithm called ZFP. We show that the strategies suited for multicore CPUs differ from the ones suited for GPUs. On a CPU, we achieve a near-optimal decoding speedup with a metadata overhead that is consistently less than 0.04% of the compressed data size. On a GPU, we achieve average decoding rates of up to 100 GiB/s. Our strategies allow the user to make a trade-off between decoding throughput and metadata size overhead.
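The metadata idea generalizes beyond ZFP: if the encoder records each tile's compressed size, a prefix sum over those sizes yields every tile's start offset, and all tiles can then be decoded independently. A minimal Python sketch of this general pattern (threads stand in for the CPU/GPU parallelism of a native implementation; encode_tile and decode_tile are assumed user-supplied codec functions):

    from itertools import accumulate
    from concurrent.futures import ThreadPoolExecutor

    def encode_with_metadata(tiles, encode_tile):
        """Concatenate variable-rate tiles; the per-tile sizes are the
        metadata that later enables parallel decoding."""
        payload, sizes = bytearray(), []
        for t in tiles:
            c = encode_tile(t)
            payload += c
            sizes.append(len(c))
        return bytes(payload), sizes

    def decode_parallel(payload, sizes, decode_tile, workers=8):
        """Prefix-sum the sizes to locate each tile's start offset,
        then decode all tiles independently."""
        offsets = list(accumulate(sizes, initial=0))
        chunks = [payload[offsets[i]:offsets[i + 1]] for i in range(len(sizes))]
        with ThreadPoolExecutor(workers) as pool:
            return list(pool.map(decode_tile, chunks))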
Serverless computing is an emerging cloud computing paradigm with the goal of freeing developers from resource management issues. As of today, serverless computing platforms are mainly used to process computations tri...
ISBN (digital): 9781728164250
ISBN (print): 9781728164267
In recent years, Hub Location Problems (HLPs) have been extended to handle uncertain data, giving rise to Robust HLPs. In a Robust HLP with discrete scenarios, the single set of requests is replaced by a set of discrete scenarios. For example, a scenario can be the collection of requests observed between nodes during a given period of the year. In a robust optimization approach, making appropriate decisions for all scenarios is time-consuming, especially for large HLP instances. The purpose of this study is to show that such problems can be solved in reasonable computing time and with high-quality solutions using the computing power of a GPU. We present a GPU-based approach for solving large Robust HLPs with discrete scenarios. The proposed parallel genetic algorithm returns a robust solution based on the min-max lexicographic criterion, which minimizes the worst cost over all scenarios. Thanks to the performance of our GPU implementation, we solve instances of up to 4000 nodes in a few seconds on an Nvidia Quadro P6000 (3840 cores).
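The min-max lexicographic criterion can be sketched as a fitness function that evaluates a candidate solution under every scenario and compares solutions by their sorted per-scenario costs, worst first (cost here is an assumed user-supplied evaluation function, not the paper's code):

    def robust_fitness(solution, scenarios, cost):
        """Score a candidate by its costs across all scenarios, sorted
        worst-first; Python's tuple ordering then gives the min-max
        lexicographic comparison for free."""
        return tuple(sorted((cost(solution, s) for s in scenarios), reverse=True))

    # The genetic algorithm minimizes this tuple, e.g.:
    # best = min(population, key=lambda s: robust_fitness(s, scenarios, cost))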