This paper presents the parallelization of the multi-frequency hybrid backward/forward sweeping (BFS) technique on a graphics processing unit (GPU). Primarily, the intrinsic layer structure of a radial network, the typical topology of distribution systems, and its multi-frequency behavior are exploited to parallelize the hybrid BFS method on the GPU. Less computationally demanding tasks, e.g., error computation and simple vectorized operations, are assigned to the CPU. The network solution is performed in the MATLAB environment using the Compute Unified Device Architecture (CUDA). The computational time required by the GPU/CPU BFS implementation is compared with a CPU-only program by solving four networks of different sizes. The multi-frequency BFS results are validated against a CPU implementation of a Newton-type solution scheme. The significant reduction in computational time of the parallelized GPU implementation of the hybrid BFS method, combined with its ability to include a wide range of frequencies and to handle nonlinear components, makes it suitable for real-time online applications. (C) 2014 Elsevier B.V. All rights reserved.
In current systems, while it is necessary to exploit the availability of multiple cores, it is also mandatory to consume less energy. To speed up the development process and make it as transparent as possible to the programmer, parallelism is exploited through Application Programming Interfaces (APIs). However, each of these APIs implements a different way of exchanging data through shared memory regions and, consequently, has a different level of energy consumption. In this paper, considering general-purpose and embedded systems, we show how each API influences performance, energy consumption, and the Energy-Delay Product (EDP). For example, Pthreads consumes 12% less energy on average than OpenMP and MPI across all benchmarks. We also demonstrate that the difference in EDP among the APIs can be up to 81%, while the level of efficiency (e.g., performance or energy consumption per core) changes as the number of threads increases, depending on whether the system is embedded or general purpose.
ISBN:
(print) 9781509053827
Support vector machine (SVM) is a popular algorithm for learning to rank, but its training speed becomes the bottleneck on large datasets. Recently, heterogeneous computing platforms, such as the graphics processing unit (GPU) and Many Integrated Core (MIC), have exhibited huge superiority in the High Performance Computing domain. Open Computing Language (OpenCL) and Open Multi-Processing (OpenMP) are two popular parallel programming interfaces for different heterogeneous platforms. To resolve the speed problem of Rank SVM (RSVM), it is important to compare the performance of different parallel programming models on different heterogeneous platforms. We designed an OpenMP-based parallel learning-to-rank SVM (PLRSVM) for multi-core CPU and MIC, and an OpenCL-based PLRSVM for multi-core CPU, GPU, and MIC. The experimental results show the performance differences between the OpenMP-based and OpenCL-based programs. The OpenCL-based program significantly speeds up SVM training and shows good portability across heterogeneous devices. The experiments also suggest that selecting a programming model suited to the hardware platform and the structure of the serial algorithm is an important step toward high parallel performance.
In this paper, we present our Concurrent Systems class, in which parallel programming and parallel and distributed computing (PDC) concepts have been taught for more than 20 years. Despite several rounds of hardware changes, the class maintains its goals: allowing students to learn parallel computer organizations, study parallel algorithms, and write code that runs on parallel and distributed platforms. We discuss the benefits of such a class and reveal the key elements in developing it and in securing funding to replace outdated hardware. We also share our activities for attracting more students to PDC and related topics.
ISBN:
(print) 9781509036837
For application programmers, reducing the effort of optimizing programs is an important issue. Our solution to this issue is an auto-tuning (AT) technique. We are developing an AT language named ppOpen-AT, and we have shown that this language is useful for multi- and many-core parallel programming. Today, OpenACC attracts attention as an easy and useful graphics processing unit (GPU) programming environment. While OpenACC is one possible parallel programming environment, users still have to spend time and energy optimizing OpenACC programs. In this study, we investigate the usability of ppOpen-AT for OpenACC programs and propose extensions to ppOpen-AT for further optimization of OpenACC.
Ontology matching is among the core techniques used for heterogeneity resolution by information and knowledge-based systems. However, due to the excess and ever-evolving nature of data, ontologies are becoming large-scale and complex; consequently, performance bottlenecks arise during ontology matching. In this paper, we present our performance-based ontology matching system. Today's desktop and cloud platforms are equipped with parallelism-enabled multicore processors. Our system benefits from this opportunity and provides effectiveness-independent, data-parallel ontology matching over parallelism-enabled platforms. It decomposes complex ontologies into smaller, simpler, and scalable subsets depending upon the needs of the matching algorithms. The matching process over these subsets is divided, from coarse to fine granularity, into independent matching requests, matching jobs, and matching tasks, running in parallel over parallelism-enabled platforms. The execution of matching algorithms is aligned to minimize the matching space during the matching process. We comprehensively evaluated our system over OAEI's dataset of fourteen real-world ontologies from diverse domains, having different sizes and complexities. We executed twenty different matching tasks over a parallelism-enabled desktop and the Microsoft Azure public cloud platform. In a single-node desktop environment, our system provides a performance speedup of 4.1, 5.0, and 4.9 times for medium, large, and very large-scale ontologies, respectively. In a single-node cloud environment, it provides a speedup of 5.9, 7.4, and 7.0 times. In a multi-node (3 nodes) environment, it provides a speedup of 15.16 and 21.51 times over the desktop and cloud platforms, respectively.
In this paper, an adjoint state-space dynamic neural network method for modeling nonlinear circuits and components is presented. The method is used to model the transient behavior of nonlinear electronic and photonic components. The proposed technique is an extension of the existing state-space dynamic neural network (SSDNN) technique. The new method adds derivative information to the training patterns of nonlinear components, allowing training with less data without sacrificing model accuracy and, consequently, making training faster and more efficient. In addition, the method has been formulated to be suitable for parallel computation. The use of derivative information and parallelization makes training with the proposed technique much faster than with SSDNN. Furthermore, models created using the proposed method are much faster to evaluate than the conventional models present in traditional circuit simulation tools. The validity of the proposed technique is demonstrated through transient modeling of a physics-based CMOS driver, NXP's commercial 74LVC04A inverting buffer, and nonlinear photonic components.
The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while at the same time aiming for high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of nonuniform communication costs. Today, about a dozen languages exist that adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions, and finally, how data is accessed across partitions. Our taxonomy reveals that today's PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.
ISBN:
(print) 9781479953424
Fast Fourier Transform (FFT) is an important part of many applications, such as wireless communication based on OFDM (Orthogonal Frequency Division Multiplexing). With Cloud Radio Access Networks, implementing FFTs on multiprocessor clusters is a challenging task. For instance, supporting the Long Term Evolution (LTE) protocol requires processing 100 independent FFTs (with sizes ranging from 128 to 2048 points) in 66.7 μs. In this work, seven native FFT candidate implementations are compared. The considered implementation environments are: OpenMP (Open Multi-Processing) on 1 core; MPI (Message Passing Interface) on 1, 2, and 3 cores; hybrid OpenMP+MPI on 1 core and on 3 cores; and MPI on a heterogeneous platform composed of a Xeon Phi and 3 cores. The reported experimental results show that the latter method meets the latency requirements of LTE. It is shown that the OpenMP and MPI paradigms running only on MICs (Many Integrated Cores) cannot fully benefit from the computing capability of many-core architectures; the heterogeneous combination of Xeon+MIC provides better performance.
ISBN:
(print) 9781509008070
In general, highly parallelized programs executed on heterogeneous multiprocessor platforms may achieve better performance than on homogeneous ones. OpenCL is one of the standards for parallel programming of heterogeneous multiprocessor platforms, and SPIR (Standard Portable Intermediate Representation) is a portable binary format for representing OpenCL kernel code. However, writing such programs is usually complex and error-prone for most programmers. Therefore, standards have been proposed to simplify programming on heterogeneous multiprocessor platforms, for example OpenACC (a directive-based parallel programming model). In this paper, we implement a framework on Clang, the C front-end of LLVM, to automatically translate OpenACC to LLVM IR with SPIR kernels. Afterwards, the IR code can optionally be optimized by the LLVM optimizer, and the host LLVM IR can be executed by the LLVM JIT compiler. According to the experimental results, our translated programs show significant performance enhancements for some programs compared with their corresponding sequential versions, and comparable performance compared with their manual OpenCL versions. Therefore, our design may reduce the difficulty of writing programs for heterogeneous multiprocessor platforms, while the translated OpenCL programs remain portable and perform as well as manual OpenCL programs written by experienced programmers.