ISBN:
(Print) 9783319322438; 9783319322421
Programmers need to combine different programming models and fully optimize their codes to take advantage of the various levels of parallelism available in heterogeneous clusters. To reduce the complexity of this process, we propose a task-based approach to crowd simulation using OmpSs, CUDA and MPI, which allows taking full advantage of the computational resources available in heterogeneous clusters. We also present a performance analysis of the algorithm under different workloads executed on a GPU cluster.
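As a rough illustration of the task-based approach described in the abstract, the sketch below combines an OmpSs-annotated CUDA kernel with an MPI communication task. It assumes OmpSs pragma syntax (as compiled by the Mercurium compiler) and hypothetical names (simulate_step, halo_exchange); it is not the paper's code.

#include <mpi.h>

/* CUDA kernel wrapped as an OmpSs task; copy_deps lets the runtime
 * manage host<->device transfers for the declared dependences. */
#pragma omp target device(cuda) copy_deps ndrange(1, n, 128)
#pragma omp task in([n]pos) out([n]vel)
__global__ void simulate_step(const float *pos, float *vel, int n);

/* MPI communication wrapped as a host task so it can overlap with
 * computation on other data blocks. */
#pragma omp task inout([n]pos)
void halo_exchange(float *pos, int n, int rank, int nranks)
{
    MPI_Sendrecv_replace(pos, n, MPI_FLOAT,
                         (rank + 1) % nranks, 0,
                         (rank + nranks - 1) % nranks, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

void step(float *pos, float *vel, int n, int rank, int nranks)
{
    simulate_step(pos, vel, n);          /* spawns a GPU task          */
    halo_exchange(pos, n, rank, nranks); /* spawns a host (MPI) task   */
    #pragma omp taskwait                 /* dependences order the two  */
}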
ISBN:
(Print) 9781509036820
Implementations of parallel programming models are provided either as language extensions, as completely new languages, or as libraries. The first two options often provide high productivity, but require porting of codes. In contrast, calls to new libraries can be added more easily; however, the abstractions in such programming model implementations can incur high runtime overhead. In both cases, these drawbacks often hinder the adoption of novel programming models in large existing codes. To combine the advantages of compiler analysis with the composability of pure libraries towards more efficient programming model implementations, in this paper we propose a low-level API for programmer-controlled binary rewriting at runtime. It can be used by programming models provided as libraries to integrate their abstractions efficiently with application code. It enables incremental adoption in existing codes and favors input-dependent optimization strategies, while providing performance similar to that of language-extension approaches. We report first promising experiences.
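The paper's actual API is not reproduced here, but the underlying mechanism (runtime binary rewriting) can be shown in a minimal, self-contained x86-64 Linux sketch: a code region is made writable with mprotect and a jump to a replacement function is patched in. All names are hypothetical; build without aggressive optimization, e.g. gcc -O0.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

__attribute__((noinline)) static int generic_add(int a, int b)
{
    puts("generic path");
    return a + b;
}

__attribute__((noinline)) static int specialized_add(int a, int b)
{
    puts("specialized path");
    return a + b;
}

/* Overwrite the first bytes of `from` with an absolute jump to `to`
 * (x86-64: movabs rax, imm64; jmp rax -- 12 bytes in total). */
static void patch_jump(void *from, void *to)
{
    uint8_t code[12] = { 0x48, 0xB8 };          /* movabs rax, imm64 */
    memcpy(code + 2, &to, 8);
    code[10] = 0xFF;                            /* jmp rax           */
    code[11] = 0xE0;

    long pg = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)from & ~(uintptr_t)(pg - 1));
    mprotect(page, 2 * pg, PROT_READ | PROT_WRITE | PROT_EXEC);
    memcpy(from, code, sizeof code);
    mprotect(page, 2 * pg, PROT_READ | PROT_EXEC);
}

int main(void)
{
    printf("%d\n", generic_add(2, 3));   /* "generic path", 5      */
    patch_jump((void *)generic_add, (void *)specialized_add);
    printf("%d\n", generic_add(2, 3));   /* now "specialized path" */
    return 0;
}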
We present TProf, an energy profiling tool for OpenMP-like task-parallel programs. To compute the energy consumed by each task in a parallel application, TProf dynamically traces the parallel execution and uses a novel technique to estimate the per-task energy consumption. To achieve this estimation, TProf apportions the total processor energy among cores, overcoming a limitation of current works that would otherwise make parallel accounting impossible. We demonstrate the value of TProf by characterizing a set of task-parallel programs, where we find that data locality, memory access patterns and task working sets are responsible for significant variance in energy consumption between seemingly homogeneous tasks. In addition, we identify opportunities for fine-grained energy optimization by applying per-task Dynamic Voltage and Frequency Scaling (DVFS). (C) 2014 Published by Elsevier Inc.
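The per-task accounting idea can be illustrated with a simplified sketch: sample a package-level energy counter around each task and split the consumed energy among the tasks that were active. The sketch uses the Linux RAPL powercap file as the counter, and the even split among active tasks is a stand-in for TProf's actual apportioning technique, not a reproduction of it.

#include <stdio.h>
#include <stdatomic.h>

static atomic_int active_tasks;   /* tasks currently running */

/* Package energy in microjoules, via the Linux RAPL powercap interface. */
static long long read_energy_uj(void)
{
    long long uj = 0;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    return uj;
}

/* Run a task and return its estimated energy share in microjoules. */
static long long run_task_measured(void (*task)(void *), void *arg)
{
    atomic_fetch_add(&active_tasks, 1);
    long long e0 = read_energy_uj();
    task(arg);
    long long e1 = read_energy_uj();
    int n = atomic_fetch_sub(&active_tasks, 1); /* tasks sharing the package */
    return (e1 - e0) / n;
}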
The current generation of computing platforms embraces multi-core and many-core processors to improve overall system performance while meeting the stringent energy budgets demanded by the market. Parallel programming languages are nowadays paramount to extracting the tremendous potential offered by these platforms: parallel computing is no longer a niche within high-performance computing (HPC), but an essential ingredient in all domains of computer science. The advent of next-generation many-core embedded platforms offers the chance to intercept a converging need for predictable high performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being demanded by markets that need huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real time, challenging the capabilities of current architectures. This converging demand raises the question of how to guarantee timing requirements in the presence of parallel execution. The paper presents how the time-criticality and parallelisation challenges are addressed by merging techniques coming from both the HPC and EC domains, and provides an overview of the proposed framework to achieve these objectives. (c) 2015 Elsevier B.V. All rights reserved.
ISBN:
(Print) 9781479986705
Current high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a general-purpose host processor is coupled to a programmable manycore accelerator (PMCA). Such PMCAs typically leverage a hierarchical interconnect and distributed memory with non-uniform access (NUMA). Nested parallelism is a convenient programming abstraction for large-scale cc-NUMA systems, as it allows multiple levels of fine-grained parallelism to be created hierarchically (and dynamically) wherever parallelism is available. Existing implementations for cc-NUMA systems introduce large overheads for nested parallelism management, which cannot be tolerated given the extremely fine-grained nature of embedded parallel workloads. In particular, creating a team of parallel threads has a cost that increases linearly with the number of threads, which is inherently non-scalable. This work presents a software cache of frequently-used parallel team configurations to reduce the cost of parallel thread creation on PMCA systems. When a configuration is found in the cache, the cost of parallel team creation is constant, providing a scalable mechanism. We evaluated our support on the STMicroelectronics STHORM many-core. Compared to the state of the art, our solution shows that: i) the cost of parallel team creation is reduced by up to 67%; ii) the tangible effect on real ultra-fine-grained parallel kernels is a speedup of up to 80%.
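The effect of the proposed cache can be sketched in a few lines of C: a team is keyed by its thread configuration (here, a bitmask of participating cores), so a cache hit replaces the linear-cost team setup with a constant-time lookup. Structures and names below are illustrative, not the actual runtime's.

#include <stdint.h>

#define CACHE_SLOTS 16

typedef struct {
    uint32_t thread_mask;   /* which PMCA cores join the team */
    void    *team_desc;     /* ready-to-run team descriptor   */
} team_entry_t;

static team_entry_t cache[CACHE_SLOTS];

extern void *build_team(uint32_t mask);  /* linear-cost slow path (assumed) */

void *open_team(uint32_t thread_mask)
{
    unsigned slot = (thread_mask * 2654435761u) >> 28;  /* 16-slot hash */
    if (cache[slot].thread_mask == thread_mask && cache[slot].team_desc)
        return cache[slot].team_desc;           /* hit: constant time */

    void *t = build_team(thread_mask);          /* miss: linear cost  */
    cache[slot].thread_mask = thread_mask;
    cache[slot].team_desc  = t;
    return t;
}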
ISBN:
(Print) 9781467373111
The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new software/hardware architectures for future Cyber-Physical Systems (CPSs). These systems are expected to react in real time, provide enough computational power for the assigned tasks, consume the least possible energy for each task (energy efficiency), scale up through modularity, allow easy programmability across performance scaling, and make the best use of existing standards at minimal cost. Current solutions for providing enough computational power are mainly based on multi- or many-core architectures. For example, some current research projects (such as ADEPT or P-SOCRATES) are already investigating how to join efforts from the High-Performance Computing (HPC) and Embedded Computing domains, which are both focused on high power efficiency, while GPUs and new dataflow platforms such as Maxeler, or FPGAs in general, are claimed to be the most energy-efficient. We present the project's initial approach, ideas and key concepts, and describe the preliminary AXIOM architecture. Our starting point uses power-efficient multi-core nodes, such as ARM cores and FPGA accelerators on the same die, as in the Xilinx Zynq. We will work to provide an integrated environment that supports programmability of the parallel, interconnected nodes that form a CPS, and evaluate our ideas using demanding test application scenarios.
ISBN:
(Print) 9781467391603
This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high-throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next-generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs satisfying the networking needs of many programming models, such as the Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms, and I/O-bound applications. To evaluate the design, we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental to implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. To the best of our knowledge, this is the highest publicly known bandwidth and message rate achieved by any network stack on this hardware.
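A minimal bootstrap against UCX's public UCP layer (ucp/api/ucp.h) looks as follows; it creates only the context and worker that any of the programming models listed above would start from, with error handling abbreviated.

#include <ucp/api/ucp.h>
#include <stdlib.h>

int main(void)
{
    ucp_config_t  *config;
    ucp_context_h  context;
    ucp_worker_h   worker;

    ucp_config_read(NULL, NULL, &config);

    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG,     /* tag matching, e.g. for MPI */
    };
    if (ucp_init(&params, config, &context) != UCS_OK)
        return EXIT_FAILURE;
    ucp_config_release(config);

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
        return EXIT_FAILURE;

    /* ... create endpoints and exchange messages here ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return EXIT_SUCCESS;
}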
This paper introduces the JStar parallel programming language, a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallel programming, and giving the compiler and runtime maximum freedom to try alternative parallelisation strategies. We describe the execution semantics and runtime support of the language, several optimisations and parallelism strategies, and some benchmark results. (C) 2013 Elsevier B.V. All rights reserved.
The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising the FMM may be expressed in X10 using a scalable pattern of activities. This paper demonstrates the use of X10 to implement the FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task-parallel model expresses parallelism through a pattern of activities that maps directly onto the tree. X10's work-stealing runtime handles load balancing of fine-grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single-node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high-accuracy calculations. A comparison of parallel and sequential codes shows that the overhead of activity management and work stealing in this application is low. Scalability is evaluated on 8k cores of a Blue Gene/Q system and 512 cores of a Nehalem/InfiniBand cluster. Copyright (c) 2013 John Wiley & Sons, Ltd.
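X10 code is not shown here, but the async/finish pattern the paper describes has a close analogue in C with OpenMP tasks: one task per subtree, joined before the parent's expansion is computed. The octree node type and compute_expansion are assumed placeholders, not the paper's implementation.

#include <omp.h>

typedef struct node {
    struct node *child[8];      /* octree children, NULL if absent */
    /* ... multipole/local expansions would live here ... */
} node_t;

extern void compute_expansion(node_t *n);   /* assumed per-node work */

static void traverse(node_t *n)
{
    if (!n) return;
    for (int i = 0; i < 8; i++) {
        #pragma omp task firstprivate(i)    /* one task per subtree, */
        traverse(n->child[i]);              /* like X10's async      */
    }
    #pragma omp taskwait                    /* like X10's finish     */
    compute_expansion(n);                   /* post-order (upward) pass */
}

void upward_pass(node_t *root)
{
    #pragma omp parallel
    #pragma omp single
    traverse(root);
}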
This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and for an implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system in which the different stages and the synchronization operations between them alternate between a distributed-memory address space and a shared-memory address space to improve performance and scalability. We compare HyMR to a reference implementation and find that HyMR improves performance by a factor of 1.71x over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-the-art implementation for systems with hardware-managed cache coherence, in terms of scalability and sustained-to-peak data processing bandwidth, where HyMR demonstrates improvements by factors of 3.1x and 3.2x, respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and find its performance to be superior to that of message passing. (C) 2014 Elsevier Inc. All rights reserved.
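As a toy illustration of the hybrid-address-space idea (not HyMR itself): worker processes execute the map stage in private memory, much like the non-coherent SCC cores, and publish partial results through an explicitly shared mmap region where the synchronization and reduce stage run. All names are illustrative.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

int main(void)
{
    const char data[] = "abbaabababbbaaab";
    size_t len = strlen(data), chunk = len / NWORKERS;

    /* Shared region: one partial count per worker. */
    long *partial = mmap(NULL, NWORKERS * sizeof(long),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (partial == MAP_FAILED) return 1;

    for (int w = 0; w < NWORKERS; w++) {
        if (fork() == 0) {
            long count = 0;                  /* map stage: private memory */
            for (size_t i = w * chunk; i < (w + 1) * chunk; i++)
                count += (data[i] == 'a');
            partial[w] = count;              /* publish via shared region */
            _exit(0);
        }
    }
    while (wait(NULL) > 0) ;                 /* barrier between stages */

    long total = 0;                          /* reduce stage: shared memory */
    for (int w = 0; w < NWORKERS; w++) total += partial[w];
    printf("a-count = %ld\n", total);
    return 0;
}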