ISBN (print): 9781450364942
The proceedings contain 33 papers. The topics discussed include: a performance evaluation of multi-FPGA architectures for computations of information transfer; massively parallel computation of linear recurrence equations with graphics processing units; a first-order approximation of microarchitecture energy-efficiency; delays and states in dataflow models of computation; communication-aware scheduling algorithms for synchronous dataflow graphs on multicore systems; towards power management verification of time-triggered systems using virtual platforms; architectural considerations for FPGA acceleration of machine learning applications in MapReduce; and fast parallel simulation of a manycore architecture with a flit-level on-chip network model.
Computer models have been used as a bridge between parallel algorithms and hardware architectures. The Bulk-Synchronous Parallel (BSP) model is a well-known computing model originally devised for distributed algorithms running on clusters of single-core processors. The Multi-BSP model, which extends the classic BSP model, was recently proposed for multi-core processors. However, this model, implemented through the MulticoreBSP-for-C library, presents some restrictions, such as the explicit synchronizations between cores, which raises the question of which hardware characteristics should be taken into account to properly model parallel algorithms. Therefore, we explore the suitability of these models for Dell multi-core architectures. The objectives of this contribution are twofold. First, we model two different multi-core Dell architectures. Second, we show that a simple model with few parameters can be easily adapted to each Dell platform, in contrast with complex models that tend to rely on tricky hardware parameters.
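As a rough illustration of the superstep structure these models assume, the C sketch below uses the BSPlib-style interface that MulticoreBSP-for-C exposes (bsp_begin, bsp_pid, bsp_put, bsp_sync). The header name, the fixed 64-slot buffer, and the toy reduction are assumptions made for brevity, not code from the paper.

/* Minimal BSP superstep sketch: each core computes a partial sum, exchanges
 * it with every other core, and reduces locally. Assumes at most 64 cores. */
#include <bsp.h>
#include <stdio.h>

static void spmd(void) {
    bsp_begin(bsp_nprocs());              /* start SPMD section on all cores */
    const unsigned p = bsp_nprocs();
    const unsigned s = bsp_pid();

    /* Superstep 1: every core computes a local partial sum. */
    unsigned long local = 0;
    for (unsigned i = s; i < 1000; i += p) local += i;

    unsigned long partial[64];            /* one slot per core (assumes p <= 64) */
    bsp_push_reg(partial, sizeof(partial));
    bsp_sync();                           /* barrier: registration takes effect */

    for (unsigned t = 0; t < p; ++t)      /* one-sided put to every core        */
        bsp_put(t, &local, partial, s * sizeof(unsigned long), sizeof(unsigned long));
    bsp_sync();                           /* all puts are visible after this    */

    unsigned long total = 0;
    for (unsigned t = 0; t < p; ++t) total += partial[t];
    if (s == 0) printf("sum = %lu\n", total);
    bsp_end();
}

int main(int argc, char **argv) {
    bsp_init(spmd, argc, argv);           /* required before bsp_begin          */
    spmd();
    return 0;
}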
In this paper, a software/hardware framework is proposed for generating uniform random numbers in parallel. Using the fast jump-ahead technique, the software produces initial states for each generator to guarantee the independence of the different sub-streams. With support from the software, the hardware structure can be constructed by simply replicating the single generator. We apply the framework to parallelize the MT19937 algorithm. Experimental results show that our framework is capable of generating an arbitrary number of independent parallel random sequences while obtaining a speedup roughly proportional to the number of parallel cores. Meanwhile, our framework is superior to the existing architectures reported in the literature in both throughput rate and scalability. Furthermore, we implement 149 parallel instances of the MT19937 generator on a Xilinx Virtex-5 FPGA device, achieving a throughput of 42.61M samples/s. Compared to CPU and GPU implementations, the throughput is 10.0 and 2.5 times higher, while the throughput-power efficiency achieves 167.3x and 18.1x improvements, respectively.
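The jump-ahead-then-replicate idea can be illustrated with a much simpler generator than MT19937, whose fast jump-ahead requires polynomial arithmetic over GF(2). The C sketch below uses a 64-bit LCG (MMIX constants) as a stand-in: the "software side" jumps the seed ahead to carve out disjoint sub-streams, and the "hardware side" is just replicated generators. All constants and the sub-stream length are illustrative assumptions.

/* Jump-ahead splitting with an LCG: state advances by n steps in O(log n)
 * by binary exponentiation of the affine map x -> a*x + c (mod 2^64).      */
#include <stdint.h>
#include <stdio.h>

#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

static uint64_t lcg_jump(uint64_t x, uint64_t n) {
    uint64_t ap = LCG_A, cp = LCG_C;   /* current map f^(2^k)                 */
    uint64_t aj = 1,     cj = 0;       /* accumulated map, starts at identity */
    while (n) {
        if (n & 1) { aj = ap * aj; cj = ap * cj + cp; }
        cp = (ap + 1) * cp;            /* square the current map              */
        ap = ap * ap;
        n >>= 1;
    }
    return aj * x + cj;                /* apply the composed map to x         */
}

static uint64_t lcg_next(uint64_t *x) { *x = LCG_A * (*x) + LCG_C; return *x; }

int main(void) {
    const int P = 4;                   /* number of parallel generators       */
    const uint64_t block = 1ULL << 40; /* sub-stream length per generator     */
    uint64_t seed = 0x9e3779b97f4a7c15ULL, state[4];

    /* Software side: one jumped-ahead initial state per generator, so the
     * P sub-streams are disjoint pieces of one long sequence.               */
    for (int i = 0; i < P; ++i) state[i] = lcg_jump(seed, (uint64_t)i * block);

    /* Hardware side (here: plain loops) just replicates the generator.      */
    for (int i = 0; i < P; ++i)
        printf("gen %d first output: %016llx\n", i,
               (unsigned long long)lcg_next(&state[i]));
    return 0;
}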
A high-speed frequency-domain bit synchronization algorithm for parallel implementation is proposed in this paper, aimed at integrating high-speed laser communication with distance measurement. The algorithm combines bit synchronization with frequency-domain parallel matched filtering so that the filter output contains only the data at the optimum sampling points; its implementation complexity is therefore noticeably lower than that of a parallel time-domain bit synchronization algorithm. Moreover, the algorithm adopts tracking timing adjustment to achieve better accuracy. This paper describes the algorithm, including the formula derivation, implementation structure, and simulation verification, and compares the time-domain and frequency-domain algorithms. According to the simulation results, the proposed algorithm performs better.
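The toy C sketch below illustrates the core idea as described, not the paper's implementation: an oversampled NRZ stream is matched-filtered by multiplying spectra, and only one sample per bit, at the phase with the largest filter output, is kept. A naive DFT stands in for the FFT a real parallel implementation would use, and the noiseless rectangular-pulse signal model is an assumption for brevity.

/* Frequency-domain matched filtering + bit synchronization on a toy signal. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define BITS 16
#define SPB  8                       /* samples per bit (oversampling)       */
#define N    (BITS * SPB)
#define PI   3.141592653589793

static void dft(const double complex *in, double complex *out, int inverse) {
    for (int k = 0; k < N; ++k) {
        double complex acc = 0;
        for (int n = 0; n < N; ++n)
            acc += in[n] * cexp((inverse ? I : -I) * 2.0 * PI * k * n / N);
        out[k] = inverse ? acc / N : acc;
    }
}

int main(void) {
    int bits[BITS] = {1,0,1,1,0,0,1,0, 1,1,1,0,0,1,0,0};
    double complex x[N], h[N], X[N], H[N], Y[N], y[N];

    for (int n = 0; n < N; ++n) x[n] = bits[n / SPB] ? 1.0 : -1.0;
    for (int n = 0; n < N; ++n) h[n] = (n < SPB) ? 1.0 : 0.0;  /* matched filter */

    dft(x, X, 0); dft(h, H, 0);
    for (int k = 0; k < N; ++k) Y[k] = X[k] * H[k];   /* convolution in frequency */
    dft(Y, y, 1);

    /* Bit synchronization: pick the sampling phase with the largest mean |y|. */
    int best = 0; double bestE = -1;
    for (int p = 0; p < SPB; ++p) {
        double e = 0;
        for (int b = 0; b < BITS; ++b) e += cabs(y[b * SPB + p]);
        if (e > bestE) { bestE = e; best = p; }
    }

    printf("optimum sampling phase: %d\n", best);
    for (int b = 0; b < BITS; ++b)        /* decimated output: one sample per bit */
        printf("bit %2d: sent %d, detected %d\n", b, bits[b],
               creal(y[b * SPB + best]) > 0 ? 1 : 0);
    return 0;
}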
ISBN (print): 9781538610428
In parallel computing, a valid graph coloring yields lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, coloring is not free and the overhead can be significant. In particular, for the bipartite-graph partial coloring (BGPC) and distance-2 graph coloring (D2GC) problems, which have various use cases within the scientific computing and numerical optimization domains, the coloring overhead can be on the order of minutes with a single thread for many real-life graphs. In this work, we propose parallel algorithms for bipartite-graph partial coloring on shared-memory architectures. Compared to the existing shared-memory BGPC algorithms, the proposed ones employ greedier and more optimistic techniques that yield better parallel coloring performance. In particular, on 16 cores, the proposed algorithms are more than 4x faster than their counterparts in the ColPack library, which is, to the best of our knowledge, the only publicly available coloring library for multicore architectures. In addition to BGPC, the proposed techniques are employed to devise parallel distance-2 graph coloring algorithms, and similar performance improvements have been observed. Finally, we propose two costless balancing heuristics for BGPC that can reduce the skewness and imbalance of the color-set cardinalities (almost) for free. The heuristics can also be used for the D2GC problem and, in general, they are likely to yield better color-based parallelization performance, especially on many-core architectures.
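For reference, a minimal sequential sketch of the greedy first-fit coloring underlying BGPC is shown below (in C, over a toy bipartite graph in CSR form). The parallel, optimistic algorithms proposed in the paper build on this pattern with speculation and conflict resolution, which the sketch does not attempt to reproduce.

/* Greedy BGPC: color left vertices so that two left vertices sharing a
 * right-side neighbor never receive the same color (distance-2 constraint). */
#include <stdio.h>
#include <string.h>

#define NL 4   /* left vertices  */
#define NR 3   /* right vertices */
#define NE 7   /* edges          */

int main(void) {
    /* CSR of the left side: adjacency from left vertices to right vertices. */
    int xadj_l[NL + 1] = {0, 2, 4, 6, 7};
    int adj_l[NE]      = {0, 1, 0, 2, 1, 2, 2};
    /* CSR of the right side (the transpose), needed to walk distance-2 paths. */
    int xadj_r[NR + 1] = {0, 2, 4, 7};
    int adj_r[NE]      = {0, 1, 0, 2, 1, 2, 3};

    int color[NL], forbidden[NL + 1];
    for (int v = 0; v < NL; ++v) color[v] = -1;
    memset(forbidden, 0xff, sizeof(forbidden));   /* all entries become -1    */

    for (int v = 0; v < NL; ++v) {
        /* Mark colors used by all distance-2 neighbors of v as forbidden.    */
        for (int i = xadj_l[v]; i < xadj_l[v + 1]; ++i) {
            int r = adj_l[i];
            for (int j = xadj_r[r]; j < xadj_r[r + 1]; ++j) {
                int w = adj_r[j];
                if (w != v && color[w] >= 0) forbidden[color[w]] = v;
            }
        }
        /* First-fit: smallest color not marked forbidden in this round.      */
        int c = 0;
        while (forbidden[c] == v) ++c;
        color[v] = c;
    }
    for (int v = 0; v < NL; ++v) printf("left vertex %d -> color %d\n", v, color[v]);
    return 0;
}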
ISBN (print): 9783319654829; 9783319654812
Processing of large scale-free graphs on parallel architectures offers high parallelization opportunities but comes with considerable overheads. Due to the skewed degree distribution, each thread receives a different amount of computational workload. In this paper we present a method that addresses this challenge by modifying the CSR data structure and redistributing work across threads. The method was implemented in breadth-first search and single-source shortest path algorithms for the GPU architecture.
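A CPU-side sketch of the baseline this method improves on may help: level-synchronous BFS over a CSR graph, here with OpenMP dynamic scheduling as a crude stand-in for the paper's GPU work redistribution (the actual method reshapes the CSR structure and splits the edges of high-degree vertices across threads). The toy graph and scheduling choice are assumptions, not taken from the paper.

/* Level-synchronous BFS over CSR with dynamic scheduling over the frontier. */
#include <limits.h>
#include <stdio.h>

#define NV 6
#define NE 8

int main(void) {
    /* Small directed graph in CSR form; vertex 0 is the BFS source.          */
    int xadj[NV + 1] = {0, 3, 4, 6, 7, 8, 8};
    int adj[NE]      = {1, 2, 3, 4, 4, 5, 5, 5};
    int dist[NV], frontier[NV], next[NV];

    for (int v = 0; v < NV; ++v) dist[v] = INT_MAX;
    dist[0] = 0;
    frontier[0] = 0;
    int fsize = 1, level = 0;

    while (fsize > 0) {
        int nsize = 0;
        /* Dynamic scheduling hands vertices out one at a time, so a thread
         * that drew a hub vertex does not stall the others for long.         */
        #pragma omp parallel for schedule(dynamic, 1)
        for (int i = 0; i < fsize; ++i) {
            int u = frontier[i];
            for (int e = xadj[u]; e < xadj[u + 1]; ++e) {
                int v = adj[e];
                int claimed = 0;
                /* Atomically claim unvisited neighbours for the next level.  */
                #pragma omp critical
                if (dist[v] == INT_MAX) { dist[v] = level + 1; claimed = 1; }
                if (claimed) {
                    int slot;
                    #pragma omp atomic capture
                    slot = nsize++;
                    next[slot] = v;
                }
            }
        }
        for (int i = 0; i < nsize; ++i) frontier[i] = next[i];
        fsize = nsize;
        ++level;
    }
    for (int v = 0; v < NV; ++v) printf("dist(0 -> %d) = %d\n", v, dist[v]);
    return 0;
}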
ISBN (print): 9781538661192
In order to satisfy the high mobility and reliability requirements of space optical networks, a parallel Wavelength Fault Tolerant Clos-network, PW-FTC, is proposed to resist faults induced by space effects. At the network level, the minimal number of wavelengths required to resist link failures, W_min, is obtained by solving a linear programming model. At the node level, the PW-FTC network consists of W_min Fault Tolerant Clos-network (FTC) planes, each of which accomplishes the switching of one wavelength. Both theoretical analysis and numerical results demonstrate that the PW-FTC outperforms the parallel Wavelength Clos-network (PW-Clos) and the parallel Spare Wavelength Clos-network (PSW-Clos) in reliability, at the cost of additional spare switching elements.
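The abstract does not reproduce the linear programming model, so the following C snippet only conveys the flavor of choosing a minimal number of parallel planes: assuming independent per-plane end-to-end failures with probability p_fail, it searches for the smallest W whose survival probability meets a target. Both parameters are made-up illustrative values, and brute-force search stands in for the LP.

/* Toy search for the minimal number of parallel wavelength planes W such
 * that at least one plane survives with probability >= target.             */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double p_fail = 0.05;      /* assumed per-plane failure probability */
    const double target = 0.999999;  /* assumed reliability requirement       */

    for (int W = 1; W <= 16; ++W) {
        double reliability = 1.0 - pow(p_fail, W);   /* all W planes must fail */
        if (reliability >= target) {
            printf("W_min = %d (reliability %.8f)\n", W, reliability);
            return 0;
        }
    }
    printf("no W <= 16 meets the target\n");
    return 0;
}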
The Discrete Periodic Radon Transform (DPRT) has many important applications in reconstructing images from their projections and has recently been used in fast and scalable architectures for computing 2D convolutions. Unfortunately, the direct computation of the DPRT involves O(N^3) additions and memory accesses, which can be very costly on single-core architectures. The current paper presents new and efficient algorithms for computing the DPRT and its inverse on multi-core CPUs and GPUs, and compares the results against specialized hardware implementations (FPGAs/ASICs). The results provide significant evidence of the success of the new algorithms. On an 8-core CPU (Intel Xeon) with support for two threads per core, FastDirDPRT and FastDirInvDPRT achieve a speedup of approximately 10x (up to 12.83x) over the single-core CPU implementation. On a 2048-core GPU (GTX 980), FastRayDPRT and FastRayInvDPRT achieve speedups in the range of 526x (for 127 x 127 images) to 873x (for 1021 x 1021 images), which approach the ideal achievable speedups. The DPRT can be computed exactly and in real time (30 frames per second) for 1471 x 1471 images using FastRayDPRT on the GPU. Furthermore, the GPU algorithms approximate the performance of an efficient FPGA implementation using 2N parallel cores at 100 MHz.
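A direct-computation sketch makes the O(N^3) cost concrete. The C code below uses one common DPRT convention for a prime-sized N x N image (N "diagonal" projections plus one extra row-sum projection, each an N x N accumulation) and parallelizes over projections with OpenMP; the convention, test image, and coarse-grained parallelization are assumptions, and the paper's FastDirDPRT/FastRayDPRT algorithms are substantially more refined.

/* Direct DPRT of a prime-sized test image: N+1 projections, O(N^3) additions. */
#include <stdio.h>

#define N 7   /* must be prime for this DPRT convention */

int main(void) {
    static int f[N][N], R[N + 1][N];

    for (int i = 0; i < N; ++i)            /* small synthetic test image       */
        for (int j = 0; j < N; ++j) f[i][j] = i + j;

    #pragma omp parallel for
    for (int m = 0; m <= N; ++m)
        for (int d = 0; d < N; ++d) {
            int sum = 0;
            if (m < N)                      /* "diagonal" projections           */
                for (int j = 0; j < N; ++j) sum += f[(d + m * j) % N][j];
            else                            /* the extra horizontal projection  */
                for (int j = 0; j < N; ++j) sum += f[d][j];
            R[m][d] = sum;
        }

    /* Sanity check: every projection redistributes the same total mass.      */
    long total = 0, proj0 = 0;
    for (int i = 0; i < N; ++i) for (int j = 0; j < N; ++j) total += f[i][j];
    for (int d = 0; d < N; ++d) proj0 += R[0][d];
    printf("image sum = %ld, projection 0 sum = %ld\n", total, proj0);
    return 0;
}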
ISBN (print): 9781728102139
Dataflow computing architectures exploit dynamic parallelism at the fine granularity of individual operations and provide a pathway to overcome the performance and energy limits of conventional von Neumann models. In this vein, we present DaCO (Dataflow Coprocessor FPGA Overlay), a high-performance compute organization for FPGAs that delivers up to 2.5x speedup over existing dataflow alternatives. Historically, dataflow-style execution has been viewed as an attractive parallel computing paradigm due to the self-timed, decentralized implementation of dataflow dependencies and the absence of sequential program counters. However, realising high-performance dataflow computers has remained elusive, largely due to the complexity of scheduling this parallelism and data communication bottlenecks. DaCO addresses these issues by (1) supporting large-scale (1000s of nodes) out-of-order scheduling using hierarchical lookup, (2) priority-aware routing of dataflow dependencies using the efficient Hoplite-Q NoC, and (3) clustering techniques to exploit data locality in the communication network organization. Each DaCO processing element is a programmable soft processor that communicates with the others over a packet-switched network-on-chip (PSNoC). We target the Arria 10 AX115S FPGA to take advantage of the hard floating-point DSP blocks, and maximize performance by multipumping the M20K Block RAMs. Overall, we can scale DaCO to 450 processors operating at an f_max of 250 MHz on the target platform. Each soft processor consumes 779 ALMs, 4 M20K BRAMs, and 3 hard floating-point DSP blocks for optimum balance, while the on-chip communication framework consumes less than 15% of the on-chip resources.
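The self-timed firing rule that motivates DaCO can be sketched in a few lines of C: a node fires as soon as all of its input tokens have arrived, and execution order comes from a ready queue rather than a program counter. The tiny three-node graph, which computes (a + b) * (a - b), and the plain FIFO queue are illustrative stand-ins for DaCO's hierarchical, priority-aware scheduler and NoC.

/* Software sketch of dataflow firing: tokens arrive, nodes become ready, fire. */
#include <stdio.h>

#define NODES 3
#define MAX_IN 2

typedef struct {
    char op;                    /* '+', '-', '*'                               */
    double in[MAX_IN];
    int have, need;             /* tokens arrived vs. tokens required          */
    int dest, dest_port;        /* where the result token goes (-1: output)    */
} Node;

static Node g[NODES] = {
    {'+', {0, 0}, 0, 2, 2, 0},  /* node 0: a + b  -> node 2, port 0            */
    {'-', {0, 0}, 0, 2, 2, 1},  /* node 1: a - b  -> node 2, port 1            */
    {'*', {0, 0}, 0, 2, -1, 0}, /* node 2: product -> program output            */
};

static int ready[NODES], rhead = 0, rtail = 0;

static void send_token(int node, int port, double v) {
    g[node].in[port] = v;
    if (++g[node].have == g[node].need) ready[rtail++] = node;  /* firing rule  */
}

int main(void) {
    double a = 3.0, b = 5.0;
    send_token(0, 0, a); send_token(0, 1, b);   /* inject input tokens          */
    send_token(1, 0, a); send_token(1, 1, b);

    while (rhead < rtail) {                     /* drain the ready queue        */
        Node *n = &g[ready[rhead++]];
        double r = n->op == '+' ? n->in[0] + n->in[1]
                 : n->op == '-' ? n->in[0] - n->in[1]
                 :                n->in[0] * n->in[1];
        if (n->dest >= 0) send_token(n->dest, n->dest_port, r);
        else printf("output token: %g\n", r);   /* (a+b)*(a-b) = -16            */
    }
    return 0;
}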
Dynamic programming techniques are well established and employed by various practical algorithms, including the edit-distance algorithm and the dynamic time warping algorithm. These algorithms usually operate in an iteration-based manner, where new values are computed from the values of the previous iteration. The data dependencies enforce synchronization, which limits the possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general-purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances of smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant improvements in performance and achieved speedups of two orders of magnitude when compared to the serial baseline.
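One standard way to expose the parallelism discussed here is the anti-diagonal (wavefront) sweep: all cells of one anti-diagonal of the edit-distance matrix depend only on the two previous diagonals, so they can be computed concurrently, while the dependency between consecutive diagonals is exactly the synchronization the abstract refers to. The short C/OpenMP sketch below illustrates this; the fixed-size matrix and example strings are assumptions for brevity.

/* Edit distance with anti-diagonal (wavefront) parallelization.             */
#include <stdio.h>
#include <string.h>

static int min3(int a, int b, int c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }

int main(void) {
    const char *s = "kitten", *t = "sitting";
    const int n = (int)strlen(s), m = (int)strlen(t);
    static int D[64][64];                        /* assumes short inputs       */

    for (int i = 0; i <= n; ++i) D[i][0] = i;
    for (int j = 0; j <= m; ++j) D[0][j] = j;

    /* Sweep anti-diagonals d = i + j; cells of one diagonal are independent. */
    for (int d = 2; d <= n + m; ++d) {
        int ilo = d - m > 1 ? d - m : 1;
        int ihi = d - 1 < n ? d - 1 : n;
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; ++i) {
            int j = d - i;
            int sub = D[i - 1][j - 1] + (s[i - 1] != t[j - 1]);
            D[i][j] = min3(sub, D[i - 1][j] + 1, D[i][j - 1] + 1);
        }
    }
    printf("edit distance(%s, %s) = %d\n", s, t, D[n][m]);   /* expected: 3    */
    return 0;
}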