In this work, we study one of the major problems in exploring the power of GPUs to accelerate video processing applications: countless frames have to be transferred back and forth between the CPU and GPU. We evaluate ...
ISBN (print): 9783319098722
The proceedings contain 68 papers. The special focus in this conference is on support tools and environments, performance prediction and evaluation, scheduling and load balancing, high performance architectures and compilers, parallel and distributed data management, grid, cluster and cloud computing, green high performance computing, distributed systems and algorithms, parallel and distributed programming, parallel numerical algorithms, multicore and manycore programming, theory and algorithms for parallel computation, and high performance networks and communication. The topics include: MPI trace compression using event flow graphs; customized scalable tracing with in-situ data analysis; performance measurement and analysis of transactional memory and speculative execution on IBM Blue Gene/Q; an open-source management framework for cloud applications; modeling and simulation of a dynamic task-based runtime system for heterogeneous multi-core architectures; modeling the impact of reduced memory bandwidth on HPC applications; finding the important basic blocks in multithreaded programs; optimization and trade-off analysis for time, energy and resource usage; performance prediction and evaluation of parallel applications in KVM, Xen, and VMware; per-task DRAM energy metering in multicore systems; characterizing the performance-energy tradeoff of small ARM cores in HPC computation; finding efficient queue setup using high-resolution simulations; a progressively pessimistic scheduler for software transactional memory; a queueing theory approach to Pareto optimal bags-of-tasks scheduling on clouds; a scheduling/placement approach for task-graphs on heterogeneous architecture; the energy-aware multi-organization scheduling problem; energy efficient scheduling of MapReduce jobs; and switchable scheduling for runtime adaptation of optimization.
ISBN (print): 9781479965946
Several computing systems that use decimal number calculations suffer from the accumulation and propagation of errors. Decimal numbers are represented using fixed-length floating point formats, and hence there will always be a truncation of extra fraction bits causing errors. Several solutions have been proposed for this problem. Among the proposed accurate calculation systems is the use of vectors of floating point numbers to represent decimal values with very high accuracy, known as the Multi-Number System (MN). Unfortunately, MN calculations are time consuming and are not suitable for real-time applications. Several special-purpose architectures have been proposed to speed up these calculations. In this work, the Single Instruction Multiple Data (SIMD) paradigm found in modern CPUs is exploited to accelerate the MN calculations. The basic arithmetic operation algorithms were modified to utilize the SIMD architecture, and a new Square representation of operands is proposed; this representation was introduced because the MN operations are sequential and iterative, so the SIMD parallel instructions cannot be applied directly. The proposed architecture reduces the execution time of division, the most time consuming operation, to 35% of the original MN execution time while preserving the same accuracy.
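As a rough illustration of the kind of arithmetic involved (not the paper's exact MN format or its Square operand layout), the sketch below implements Knuth's TwoSum error-free transformation, a standard building block of multi-component number representations, first in scalar form and then with SSE2 intrinsics. Since one multi-number addition is a sequential chain of such steps, the SIMD lanes here carry independent additions, which is exactly the constraint the paper's Square representation is said to address. Compile without -ffast-math so the compensated arithmetic is not reassociated away.

    #include <emmintrin.h>  // SSE2
    #include <cstdio>

    // Knuth's TwoSum: s + err == a + b exactly. A multi-number value is a
    // vector of such components; chaining TwoSums extends the precision.
    static void two_sum(double a, double b, double& s, double& err) {
        s = a + b;
        double bb = s - a;
        err = (a - (s - bb)) + (b - bb);
    }

    // SIMD variant: one MN addition is inherently sequential, so the
    // parallelism comes from running independent additions, one per SSE
    // lane (the data layout here is only illustrative).
    static void two_sum_pd(__m128d a, __m128d b, __m128d& s, __m128d& err) {
        s = _mm_add_pd(a, b);
        __m128d bb = _mm_sub_pd(s, a);
        err = _mm_add_pd(_mm_sub_pd(a, _mm_sub_pd(s, bb)),
                         _mm_sub_pd(b, bb));
    }

    int main() {
        double s, e;
        two_sum(1.0, 1e-30, s, e);          // e recovers the bits lost in s
        printf("scalar: s=%g err=%g\n", s, e);

        __m128d a = _mm_set_pd(1.0, 3.0);   // two independent additions
        __m128d b = _mm_set_pd(1e-30, 1e-30);
        __m128d vs, ve;
        two_sum_pd(a, b, vs, ve);
        double out[2]; _mm_storeu_pd(out, ve);
        printf("simd errs: %g %g\n", out[1], out[0]);
        return 0;
    }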
The current computational infrastructure at LHCb is designed for sequential execution. It is possible to make use of modern multi-core machines by using multi-threaded algorithms and running multiple instances in parallel, but there is no way to make efficient use of specialized massively parallel hardware, such as graphics processing units and the Intel Xeon Phi. We extend the current infrastructure with an out-of-process computational server able to gather data from multiple instances and process them in large batches.
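A minimal sketch of the batching pattern this abstract describes, with hypothetical names (Batcher, kBatch) and plain integers standing in for event data: several producer threads play the role of the independent processing instances, and a single consumer drains their submissions in large batches, the way an out-of-process accelerator server would.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    constexpr std::size_t kBatch = 256;   // hypothetical batch size

    class Batcher {
        std::mutex m_;
        std::condition_variable cv_;
        std::vector<int> buf_;            // stand-in for event data
        bool done_ = false;
    public:
        void submit(int ev) {             // called by each instance
            std::lock_guard<std::mutex> lk(m_);
            buf_.push_back(ev);
            if (buf_.size() >= kBatch) cv_.notify_one();
        }
        void close() {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
            cv_.notify_one();
        }
        bool next_batch(std::vector<int>& out) {  // false when finished
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return buf_.size() >= kBatch || done_; });
            if (buf_.empty()) return false;
            out.swap(buf_);
            buf_.clear();
            return true;
        }
    };

    int main() {
        Batcher b;
        std::vector<std::thread> producers;
        for (int p = 0; p < 4; ++p)
            producers.emplace_back([&, p] {
                for (int i = 0; i < 1000; ++i) b.submit(p * 1000 + i);
            });

        std::thread consumer([&] {
            std::vector<int> batch;
            std::size_t total = 0;
            while (b.next_batch(batch)) total += batch.size();
            printf("processed %zu events in batches\n", total);
        });

        for (auto& t : producers) t.join();
        b.close();
        consumer.join();
        return 0;
    }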
ISBN (print): 9781479941209
Information processing is a very broad area in which many problems are computationally intensive and thus require parallelization and acceleration based on new technologies. The Xilinx Zynq-7000 All Programmable system-on-chip is a very suitable platform for coupling application-specific software and problem-targeted hardware on a single configurable microchip. The tutorial is dedicated to multi-level software/hardware co-design techniques and system architectures that combine general-purpose computers, multi-core application-specific processing, and accelerators in reconfigurable hardware, with emphasis on broad parallelism. Four projects from the scope of data processing, application informatics, parallel algorithms (mapped to hardware), and combinatorial search are briefly characterized and will be demonstrated as fully implemented, ready-to-test projects that include software and reconfigurable hardware linked with on-chip high-performance interfaces. Particular design examples, potential practical applications, experiments and comparisons will be demonstrated.
ISBN (print): 9781479942367
XML technology is extensively used for data exchange between applications on the web, and hence mining these documents becomes an important area of research. Since XML is so widely used on the web, efficient methods are required for knowledge discovery from the enormous collections of XML documents, along with advanced tools and technologies to handle this data at scale. A methodology is proposed for handling such large-scale XML data with the help of high-performance, low-cost computing: the GPU. This paper aims to parallelize the pre-processing stages of deserialization and sorting to make the dataset suitable for mining.
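The abstract gives no implementation details, so the following is only a plausible sketch of the sorting stage using the Thrust C++ library (compiled with nvcc): each XML record is reduced to an integer key, a hypothetical simplification of the deserialization step, and sort_by_key orders the record indices on the GPU.

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // keys[i] = sort key extracted from XML record i (hypothetical
        // scheme); idx[i] = index of the record in the collection.
        std::vector<unsigned> keys = {42, 7, 19, 3, 88, 7};
        std::vector<int> idx = {0, 1, 2, 3, 4, 5};

        thrust::device_vector<unsigned> d_keys(keys.begin(), keys.end());
        thrust::device_vector<int> d_idx(idx.begin(), idx.end());

        // Sort record indices by key on the GPU; Thrust selects a
        // parallel radix or merge sort under the hood.
        thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_idx.begin());

        // Copy the permutation back; records can now be mined in order.
        std::vector<int> sorted(idx.size());
        thrust::copy(d_idx.begin(), d_idx.end(), sorted.begin());
        for (int i : sorted) printf("%d ", i);  // e.g. 3 1 5 2 0 4
        printf("\n");
        return 0;
    }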
The Barnes-Hut algorithm is a widely used approximation method for the N-body simulation problem. The irregular nature of this tree-walking code presents interesting challenges for its computation on parallel systems. ...
High Energy Physics code has been known for making poor use of high performance computing architectures. Efforts to optimise HEP code on vector and RISC architectures have yielded limited results, and recent studies have shown that, on modern architectures, it achieves between 10% and 50% of peak performance. Although several successful attempts have been made to port selected codes to GPUs, no major HEP code suite has a "High Performance" implementation. With the LHC undergoing a major upgrade and a number of challenging experiments on the drawing board, HEP can no longer neglect the less-than-optimal performance of its code and has to try to make the best use of the hardware. This activity is one of the foci of the SFT group at CERN, which hosts, among others, the ROOT and Geant4 projects. The activity of the experiments is shared and coordinated via a Concurrency Forum, where experience in optimising HEP code is presented and discussed. Another activity is the Geant-V project, centred on the development of a high-performance prototype for particle transport. Achieving a good concurrency level on the emerging parallel architectures without a complete redesign of the framework can only be done by parallelizing at event level, or with a much larger effort at track level. Apart from the shareable data structures, this typically implies a multiplication factor in memory consumption compared to the single-threaded version, together with sub-optimal handling of event-processing tails. Besides this, the low-level instruction pipelining of modern processors cannot be used efficiently to speed up the program. We have implemented a framework that allows scheduling vectors of particles to an arbitrary number of computing resources in a fine-grain parallel approach. The talk will review the current optimisation activities within the SFT group with a particular emphasis on the development perspectives towards a simulation framework able to profit best from t...
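A minimal sketch of the basket idea described above, under the assumption that "scheduling vectors of particles" means grouping particles into contiguous baskets and handing whole baskets to workers so that the per-particle inner loop is straight-line and SIMD/pipeline friendly; the Particle fields and the stepping formula are placeholders, not Geant-V code.

    #include <array>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct Particle { float x, v; };    // placeholder state
    using Basket = std::vector<Particle>;

    // One vectorizable step over a full basket (stand-in for transport):
    // contiguous data, no branches, so the compiler can pipeline/vectorize.
    void transport(Basket& b, float dt) {
        for (auto& p : b) p.x += p.v * dt;
    }

    int main() {
        // Fill a few baskets (a real scheduler groups by geometry volume).
        std::array<Basket, 4> baskets;
        for (auto& b : baskets)
            for (int i = 0; i < 1024; ++i)
                b.push_back({0.0f, 1.0f + 0.001f * i});

        // Schedule whole baskets onto an arbitrary number of workers.
        std::vector<std::thread> workers;
        for (auto& b : baskets)
            workers.emplace_back([&b] { transport(b, 0.1f); });
        for (auto& w : workers) w.join();

        printf("first particle moved to x=%f\n", baskets[0][0].x);
        return 0;
    }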
Image registration is a key step in image processing, as it is the process of locating the most accurate relative orientation among two or more images captured at the same or different times by distinguishable or indistin...
ISBN (digital): 9783319098739
ISBN (print): 9783319098739;9783319098722
Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far build on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem shifts to choosing the task granularity and task graph structure and optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge: getting accurate measurements requires reserving the target system for the whole duration of the experiments, and observations are limited to the few systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. This approach makes it possible to obtain performance predictions accurate to within a few percent on classical dense linear algebra kernels in a matter of seconds, which allows both runtime and application designers to quickly decide which optimization to enable or whether it is worth investing in higher-end GPUs.
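For readers unfamiliar with StarPU, the sketch below submits a single vector-scaling task through the StarPU C API (usable from C++; field names follow the 1.x releases and may differ in other versions). The vector length N is the task-granularity knob whose tuning the article's simulation approach is meant to make cheap to explore.

    #include <starpu.h>
    #include <cstdio>
    #include <cstdlib>

    // CPU implementation of the codelet; adding a .cuda_funcs entry would
    // give the runtime a GPU variant to schedule opportunistically.
    static void scal_cpu(void* buffers[], void* cl_arg) {
        float factor = *(float*)cl_arg;
        float* v = (float*)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; ++i) v[i] *= factor;
    }

    static struct starpu_codelet cl;   // zero-initialized static

    int main() {
        starpu_init(NULL);
        cl.cpu_funcs[0] = scal_cpu;
        cl.nbuffers = 1;
        cl.modes[0] = STARPU_RW;

        const unsigned N = 1 << 20;    // task granularity: elements per task
        float* v = (float*)malloc(N * sizeof(float));
        for (unsigned i = 0; i < N; ++i) v[i] = 1.0f;

        starpu_data_handle_t h;
        starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                    (uintptr_t)v, N, sizeof(float));

        float factor = 3.0f;
        struct starpu_task* task = starpu_task_create();
        task->cl = &cl;
        task->handles[0] = h;
        task->cl_arg = &factor;
        task->cl_arg_size = sizeof(factor);
        starpu_task_submit(task);      // the runtime picks the resource
        starpu_task_wait_for_all();

        starpu_data_unregister(h);
        printf("v[0] = %f\n", v[0]);
        free(v);
        starpu_shutdown();
        return 0;
    }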