In many cases, applications are not optimized for the hardware on which they run. Several reasons contribute to this unsatisfying situation, including legacy code, commercial code distributed in binary form, or deployment on compute farms. In fact, backward compatibility of the ISA guarantees only functionality, not the best exploitation of the hardware. In this work, we focus on maximizing CPU efficiency for the SIMD extensions and propose to convert automatically, and at runtime, loops vectorized for an older version of the SIMD extension to a newer one. We propose a lightweight mechanism that does not include a vectorizer but instead leverages what a static vectorizer previously did. We show that many loops compiled for x86 SSE can be dynamically converted to the more recent and more powerful AVX, and how correctness is maintained with regard to challenges such as data dependences and reductions. We obtain speedups in line with those of a native compiler targeting AVX. The re-vectorizer is implemented inside a dynamic optimization platform; it is completely transparent to the user, does not require rewriting binaries, and operates during program execution.
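The reduction issue mentioned in this abstract can be illustrated with a small model. The sketch below is not the paper's re-vectorizer; it only simulates, in plain Python, a vectorized sum with a configurable number of lanes (4 lanes standing in for 32-bit elements in a 128-bit SSE register, 8 lanes for a 256-bit AVX register). The function name `vector_sum` and its structure are assumptions made for illustration.

```python
def vector_sum(data, lanes):
    """Sum `data` the way a vectorized reduction loop with `lanes`-wide registers would."""
    acc = [0] * lanes                        # one partial sum per SIMD lane
    full = len(data) - len(data) % lanes     # elements covered by whole vectors
    for i in range(0, full, lanes):
        for lane in range(lanes):            # models a single packed add
            acc[lane] += data[i + lane]
    total = sum(acc)                         # horizontal reduction after the loop
    for x in data[full:]:                    # scalar epilogue for leftover elements
        total += x
    return total


if __name__ == "__main__":
    values = list(range(1, 20))
    # Widening from 4 to 8 lanes changes the accumulator layout and the
    # epilogue length, but the integer result must stay the same.
    print(vector_sum(values, 4), vector_sum(values, 8))
```

With integer data the two widths agree exactly. With floating-point data, widening the accumulator reorders the additions, so the rounded result may differ slightly; this is one reason reductions need care when the vector width changes.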
Summary form only given. Multicore and manycore processors are now ubiquitous, but parallel programming remains as difficult as it was 30-40 years ago. In this talk, I will argue that these problems arise largely from the computation-centric abstractions that we currently use to think about parallelism. In their place, I will propose a novel data-centric foundation for parallel programming called the operator formulation in which algorithms are described in terms of unitary actions on data structures. This data-centric view of parallel algorithms shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation and partitioning algorithms, graph analytics, and machine learning applications. Binding time considerations provide a unification of parallelization techniques ranging from static parallelization to speculative parallelization. We have built a system called Galois, based on these ideas, for exploiting amorphous data-parallelism on multicores and GPUs. I will present experimental results from our group as well as from other groups that are using the Galois system.
Image processing algorithms that operate only on a local neighbourhood are used in nearly every image processing application. Very often, several iterations are performed on a fixed neighbourhood, which leads to the description of stencil codes. A promising approach in embedded systems is to use the massively parallel computation power of an FPGA for this kind of algorithm. If the FPGA is placed directly inside the image acquisition unit to form a smart camera, this not only speeds up processing but also reduces or even eliminates the PC-based hardware, which saves space and power. However, most designers begin from scratch when they have to implement stencil computations in smart cameras. This leads to an underutilized FPGA, because efficient use of the available resources is secondary to functional correctness. Therefore, we present in this paper a framework for stencil code applications which immediately delivers the best architecture regarding prominent resource criteria. An analytical model is used to find an optimized parameter set (degree of parallelism, usage of buffers, etc.) for a highly flexible FPGA implementation. A graphical tool allows the designer to further evaluate the effects of certain parameters. Our results show that we are able to create an optimized hardware architecture for this application domain.
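For readers unfamiliar with the term, a stencil code as described in this abstract can be sketched in software as follows. This is a minimal Python illustration, assuming a 3x3 mean filter and unchanged border pixels; it is not the FPGA framework from the paper, and the names `stencil_step` and `iterate_stencil` are hypothetical.

```python
def stencil_step(img):
    """Apply one 3x3 mean-filter pass; border pixels are copied unchanged."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]            # work on a copy so reads see old values
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = sum(img[y + dy][x + dx]
                        for dy in (-1, 0, 1)
                        for dx in (-1, 0, 1))
            out[y][x] = total / 9.0
    return out


def iterate_stencil(img, iterations):
    """Repeat the same fixed-neighbourhood update several times."""
    for _ in range(iterations):
        img = stencil_step(img)
    return img


if __name__ == "__main__":
    image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    print(iterate_stencil(image, 1))
```

Every output pixel depends only on a fixed window of input pixels, which is what makes the inner loops independent and therefore a natural fit for parallel hardware.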
Summary form only given. The memory system is a fundamental performance and energy bottleneck in almost all computing systems. Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck. At the same time, DRAM and flash technologies are experiencing difficult technology scaling challenges that make the maintenance and enhancement of their capacity, energy-efficiency, and reliability significantly more costly with conventional techniques. In this talk, we examine some promising research and design directions to overcome challenges posed by memory scaling. Specifically, we discuss three key solution directions: 1) enabling new memory architectures, functions, interfaces, and better integration of the memory and the rest of the system, 2) designing a memory system that intelligently employs multiple memory technologies and coordinates memory and storage management using non-volatile memory technologies, 3) providing predictable performance and QoS to applications sharing the memory/storage system. If time permits, we may also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory.
ISBN: (print) 9783319231266; 9783319231259
The communication among molecular networks may be specifically realized by nanomechanical, acoustic, and electromagnetic fields and molecular transport. Here, experimental and theoretical studies of peptide and protein films and single molecules in static and radiofrequency electromagnetic fields are reported. Impedance (dielectric) electrochemical spectroscopy revealed nonlinear properties of glycine, alanine and albumen films in an external electromagnetic field in the frequency range 0.5-100 MHz. Computer "all atom" simulation allows one to calculate the nanoelectromagnetic field of molecular systems and to evaluate the self-assembled supramolecular architectures. Theoretical studies revealed the dipole moment dynamics of polyalanine peptides. Further, we combine both approaches, thus providing a prediction model of nanoelectromagnetic field generation and molecular transportation/communication.
ISBN: (print) 9783319278698; 9783319278681
Obtaining the set of trade-off architectures from a SysML model is an important objective for the system designer. To achieve this goal, we propose a methodology combining SysML with the variability concept and multi-objective optimization techniques. An initial SysML model is completed with variability information to expose the different alternatives for component redundancy and selection from a library. The constraints and objective functions are also added to the initial SysML model, with an optimization context. Then a representation of a constraint satisfaction problem (CSP) is generated with an algorithm from the optimization context and solved with an existing solver. The paper illustrates our methodology by designing an embedded Cognitive Safety System (ECSS). From a component repository and redundancy alternatives, the best design alternatives are generated in order to minimize the total cost and maximize the estimated system reliability.
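The cost-versus-reliability trade-off described in this abstract can be made concrete with a toy example. The sketch below replaces the paper's SysML-to-CSP flow and external solver with plain brute-force enumeration, and the component library, its numbers, and all function names are hypothetical. It assumes parallel redundancy, where a redundant group fails only if every copy fails.

```python
from itertools import product

# Hypothetical component library: name -> (unit cost, reliability).
# The values are illustrative only and do not come from the paper.
LIBRARY = {"sensor_a": (10.0, 0.90), "sensor_b": (25.0, 0.95)}
REDUNDANCY = (1, 2)  # allowed numbers of parallel copies


def evaluate(component, copies):
    """Return (total cost, reliability) for `copies` parallel instances."""
    cost, reliability = LIBRARY[component]
    return cost * copies, 1.0 - (1.0 - reliability) ** copies


def pareto_front(designs):
    """Keep designs that no other design beats on both cost and reliability."""
    front = []
    for d in designs:
        dominated = any(
            o["cost"] <= d["cost"] and o["reliability"] >= d["reliability"]
            and (o["cost"] < d["cost"] or o["reliability"] > d["reliability"])
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front


def trade_off_designs():
    """Enumerate every library/redundancy combination and filter the front."""
    designs = []
    for component, copies in product(LIBRARY, REDUNDANCY):
        cost, reliability = evaluate(component, copies)
        designs.append({"component": component, "copies": copies,
                        "cost": cost, "reliability": reliability})
    return pareto_front(designs)


if __name__ == "__main__":
    for design in trade_off_designs():
        print(design)
```

Enumeration only works for very small design spaces; a constraint solver, as used in the paper, is the realistic choice once the number of components and alternatives grows.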
As modern processors are becoming increasingly complex, fast and accurate performance prediction is crucial during the early phases of hardware and software co-development. Accurately and efficiently predicting the performance of a given software workload is, however, a challenging problem. Traditional cycle-accurate simulation is often too slow, while analytical models are not sufficiently accurate or still require target-specific execution statistics that may be slow or difficult to obtain. In this paper, we propose a novel learning-based approach for synthesizing analytical models that can accurately predict the performance of a workload on a target platform from various performance statistics obtained directly on a host platform using built-in hardware counters. Our learning approach relies on a one-time training phase using a cycle-accurate reference of the chosen target processor. We train our models on over 15,000 program instances from the ACM-ICPC programming contest database, and demonstrate the prediction accuracy on standard benchmark suites. Results show that our approach achieves on average more than 90% accuracy at 160× the speed compared to a cycle-accurate reference simulation.
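The general idea of training once against a reference and then predicting from host-side counters can be sketched with the simplest possible model. The abstract does not say which learning method the authors use; the example below substitutes a one-feature ordinary least-squares fit, and the counter values, cycle counts, and function names are all made up for illustration.

```python
def fit_linear(host_counts, target_cycles):
    """Least-squares fit of target_cycles ~ slope * host_counts + intercept."""
    n = len(host_counts)
    mean_x = sum(host_counts) / n
    mean_y = sum(target_cycles) / n
    sxx = sum((x - mean_x) ** 2 for x in host_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(host_counts, target_cycles))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, host_count):
    """Predict target cycles for a new workload from its host counter value."""
    slope, intercept = model
    return slope * host_count + intercept


if __name__ == "__main__":
    # Training phase: host counter readings paired with reference cycle counts
    # (hypothetical numbers standing in for a cycle-accurate simulation).
    model = fit_linear([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
    # Prediction phase: only the host measurement is needed.
    print(predict(model, 10.0))
```

The expensive reference is consulted only while fitting; afterwards each prediction costs a single multiply-add, which is where the speed advantage over simulation comes from.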
The FlexTiles Platform has been developed within a Seventh Framework Programme project that is co-funded by the European Union and has ten participants from five countries. It aims to create a self-adaptive heterogeneous many-core architecture which is able to dynamically manage load balancing, power consumption and faulty modules. Its focus is to make the architecture efficient and to keep programming effort low. Therefore, the concept contains a dedicated automated tool-flow for creating both the hardware and the software, a simulation platform that can execute the same binaries as the FPGA prototype, and a virtualization layer to manage the final heterogeneous many-core architecture for run-time adaptability. With this approach, software development productivity can be increased, and thus time-to-market and development costs can be decreased. In this paper, we present the FlexTiles Development Platform with a many-core architecture demonstration. The steps to implement, validate and integrate two use-cases are discussed.
The AEGLE project aims to build an innovative ICT solution addressing the whole data value chain for health, based on cloud computing enabling dynamic resource allocation, HPC infrastructures for computational acceleration, and advanced visualization techniques. In this paper, we provide an analysis of the addressed Big Data health scenarios and we describe the key enabling technologies, as well as the data privacy and regulatory issues to be integrated into AEGLE's ecosystem, enabling advanced health-care analytic services while also promoting related research activities.
ISBN: (print) 9783319162133
The proceedings contain 50 papers. The special focus in this conference is on Architecture, Modeling, Tools, Applications, Network-on-a-Chip, Cryptography Applications and Extended Abstracts. The topics include: reducing storage costs of reconfiguration contexts by sharing instruction memory cache blocks; a vector caching scheme for streaming FPGA SpMV accelerators; hardware synthesis from functional embedded domain-specific languages; operand-value-based modeling of dynamic energy consumption of soft processors in FPGA; a fully parallel particle filter architecture for FPGAs; teach advanced reconfigurable architectures and tools; dynamic memory management in Vivado-HLS for scalable many-accelerator architectures; place and route tools for the mitigation of single event transients on flash-based FPGAs; advanced SystemC tracing and analysis framework for extra-functional properties; run-time partial reconfiguration simulation framework based on dynamically loadable components; architecture virtualization for run-time hardware multithreading on field programmable gate arrays; survey on real-time network-on-chip architectures; hardware benchmarking of cryptographic algorithms using high-level synthesis tools; an efficient and flexible FPGA implementation of a face detection system; a dynamically reconfigurable mixed analog-digital filter bank; a timing driven cycle-accurate simulation for coarse-grained reconfigurable architectures; a novel concept for adaptive signal processing on reconfigurable hardware; modular acquisition and stimulation system for timestamp-driven neuroscience experiments; DRAM row activation energy optimization for stride memory access on FPGA-based systems; acceleration of data streaming classification using reconfigurable technology; partial reconfiguration for dynamic mapping of task graphs onto 2D mesh platform; and a challenge of portable and high-speed FPGA accelerator.