ISBN: 9781509028269 (print)
Computer scientists and programmers face the difficulty of improving the scalability of their applications while using conventional programming techniques only. As a baseline hypothesis of this paper, we assume that an advanced runtime system can be used to take full advantage of the available parallel resources of a machine in order to achieve the highest parallelism possible. In this paper we present the capabilities of HPX - a distributed runtime system for parallel applications of any scale - to achieve the best possible scalability through asynchronous task execution [1]. OP2 is an active library which provides a framework for the parallel execution of unstructured grid applications on different multi-core/many-core hardware architectures [2]. OP2 generates code which uses OpenMP for loop parallelization within an application code for both single-threaded and multi-threaded machines. In this work we modify the OP2 code generator to target HPX instead of OpenMP, i.e. we port the parallel simulation backend of OP2 to utilize HPX. We compare the performance results of the different parallelization methods using HPX and OpenMP for loop parallelization within the Airfoil application. The results of strong scaling and weak scaling tests for the Airfoil application on one node with up to 32 threads are presented. Using HPX for parallelization of OP2 gives an improvement in performance of 5%-21%. By modifying the OP2 code generator to use HPX's parallel algorithms, we observe scaling improvements of about 5% compared to OpenMP. To fully exploit the potential of HPX, we adapted the OP2 API to expose a future- and dataflow-based programming model and applied this technique to parallelize the same Airfoil application. We show that the dataflow-oriented programming model, which automatically creates an execution tree representing the algorithmic data dependencies of our application, improves the overall scaling results by about 21% compared to OpenMP. Our results show
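The difference between a loop-by-loop barrier and a future-based style can be illustrated with a minimal sketch. It uses Python's standard `concurrent.futures` module, not HPX or OP2, and the kernel names `compute_flux` and `update` as well as the data are hypothetical stand-ins that are not taken from the Airfoil application. Each chunk gets its own future, so the second kernel of a chunk starts as soon as that chunk's first kernel has finished, without waiting for all other chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_flux(cells):
    # Hypothetical first kernel: produces one value per cell.
    return [2.0 * c for c in cells]


def update(cells, flux):
    # Hypothetical second kernel: depends on the output of compute_flux.
    return [c + f for c, f in zip(cells, flux)]


def run_dataflow(chunks, workers=4):
    """Run two dependent kernels per chunk, one future per chunk."""

    def pipeline(chunk):
        # The dependency is local to the chunk, so no global barrier
        # separates the two kernels.
        return update(chunk, compute_flux(chunk))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(pipeline, chunk) for chunk in chunks]
        # Results are collected in submission order.
        return [f.result() for f in futures]
```

A call such as `run_dataflow([[1.0, 2.0], [3.0]])` processes the two chunks independently. A real dataflow runtime tracks dependencies between arbitrary tasks; this sketch only shows the simplest case of a per-chunk chain.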
ISBN: 9781509025688 (print)
The emergence of clusters of multi-core multiprocessors has created a challenge for software developers who use concurrency to gain performance. The challenge lies in the application's dependence on both the hardware and the deeply integrated communication infrastructure for performance improvements. This integration of the communication and parallelism in the user's application reduces flexibility by adding complexity when switching to different communication and parallel infrastructures. In this paper, we present a retargetable compiler framework for a subset of X10 that abstracts the hardware details, parallelism, and communication away from the application, allowing for portability and easier retargeting of the communication and parallelism. The retargetable compiler framework uses asynchronous computation and communication, as well as the concept of places, to abstract away hardware details and to provide scalability. The framework offers performance, functionality, and flexibility because of our separation of tasks into layers and because of source-code-level serialization. To illustrate the ease of retargeting the communication and the patterns of parallelism, our framework is implemented with two different communication APIs (DUP and MPI-2) and two different patterns of parallelism (thread pooling and thread spawning). Retargeting the communication infrastructure using our framework required fewer code changes than changing the pattern of parallelism. The minimal code change needed to retarget these components offers developers a reasonable way to retarget without recompiling their application or sacrificing performance.
ISBN: 9781479918768 (print)
Knowledge of parallel programming is an essential requirement in the multicore era. To meet this requirement, teaching parallel programming is important at the university level. Further, students should be exposed to different parallel architectures and programming models as well. To achieve this objective, it is appropriate to use an integrated system that offers different parallel architectures and supporting programming languages. Though it is difficult to find such a system, the Multi Core Students Experimental Processor (MCSEP), designed on the basis of the Students Experimental Processor, provides an opportunity to develop one. The MCSEP can be configured to one of five architectures: SISD, SIMD, MIMD, Multiple-SIMD, and Multiple-MIMD. Each architecture can further be configured to one of six instruction set architectures: Memory-Memory, Accumulator, Extended Accumulator, Stack, Register Memory, and Load Store. As there are no programming tools for the MCSEP, a compiler and a simplified programming language, SEPCom, have been developed for using all the features of the multicore processor MCSEP. SEPCom is a Java-like programming language with parallel programming features. The test results show that SEPCom performs well on all architectures available in the MCSEP. Therefore, SEPCom can be used for writing parallel programs for different parallel architectures. Consequently, students can develop appropriate programs for their experiments and can analyze and measure performance on different parallel architectures. Further, students can also use it as a case study for learning compiler design.
In this paper, a new algorithm is designed to solve the Multiple Traveling Salesman Problem (MTSP) while avoiding path intersections among the traveling salesmen. There are three objectives in this problem including the...
Conflictless task scheduling is intended for parallel task-processing environments with high contention for a limited number of resources. For tasks that each require a group of resources, the presented solution can pre...
ISBN: 9783319198002; 9783319197999 (print)
High-order finite-difference methods are commonly used in wave propagators for industrial subsurface imaging algorithms. Computational aspects of the reduced linear elastic vertical transversely isotropic propagator are considered. Thread-parallel algorithms suitable for implementing this propagator on multi-core and many-core processing devices are introduced. Portability is addressed through the use of the OCCA runtime programming interface. Finally, performance results are shown for various architectures on a representative synthetic test case.
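As a small illustration of what a high-order finite-difference stencil looks like, the sketch below applies the standard fourth-order central approximation of the second derivative on a uniform 1-D grid. It is a generic textbook stencil written in plain Python, not the vertical transversely isotropic propagator of the paper and not OCCA code; the function name and the choice of order are assumptions made for the example.

```python
def second_derivative_4th_order(u, h):
    """Approximate u'' at interior points with a 5-point central stencil.

    u is a list of samples on a uniform grid with spacing h. The two points
    at each end are skipped because the stencil does not fit there.
    """
    # Standard fourth-order coefficients for offsets -2, -1, 0, +1, +2.
    coeffs = (-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0)
    out = []
    for i in range(2, len(u) - 2):
        acc = sum(coeffs[k] * u[i - 2 + k] for k in range(5))
        out.append(acc / (h * h))
    return out
```

Each output point depends only on a fixed neighborhood of the input, which is why loops of this kind are natural candidates for thread-parallel execution.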
ISBN: 9781467375894 (print)
Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation section, the code that allows them to transform the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Also, many published scheduling algorithms are not tested against a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools used to prepare scheduling papers. We propose Schedeval, which integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud Computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications when running on manycore architectures.
ISBN: 9781509028245 (print)
Improving data transfer throughput over high-speed long-distance networks has become increasingly difficult. Numerous factors such as nondeterministic congestion, dynamics of the transfer protocol, and multiuser and multitask source and destination endpoints, as well as interactions among these factors, contribute to this difficulty. A promising approach to improving throughput is to use parallel streams at the application layer. We formulate and solve the problem of choosing the number of such streams from a mathematical optimization perspective. We propose the use of direct search methods, a class of easy-to-implement and lightweight mathematical optimization algorithms, to improve the performance of data transfers by dynamically adapting the number of parallel streams in a manner that does not require domain expertise, instrumentation, analytical models, or historic data. We apply our method to transfers performed with the GridFTP protocol, and illustrate the effectiveness of the proposed algorithm when used within Globus, a state-of-the-art data transfer tool, on production WAN links and servers. We show that, compared to user default settings, our direct search methods can achieve up to 10x performance improvement under certain conditions. We also show that our method can overcome performance degradation due to external compute and network load on source endpoints, a common scenario at high-performance computing facilities.
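To make the idea of a direct search over the number of streams concrete, here is a minimal sketch of a one-dimensional compass search over integers. The abstract does not say which direct search variant the authors use, so this particular scheme, the function name, the parameter defaults, and the callback `measure_throughput` (a stand-in for observing a real transfer) are all assumptions for illustration. The method needs only observed throughput values, with no model or gradient.

```python
def compass_search_streams(measure_throughput, start=4, step=4, lo=1, hi=64):
    """Maximize measure_throughput(n) over integers lo..hi by compass search.

    measure_throughput is a hypothetical callback returning the observed
    throughput for n parallel streams.
    """
    n = start
    best = measure_throughput(n)
    while step >= 1:
        improved = False
        # Probe one step up and one step down from the current point.
        for cand in (n + step, n - step):
            if lo <= cand <= hi:
                value = measure_throughput(cand)
                if value > best:
                    n, best, improved = cand, value, True
                    break
        if not improved:
            # No neighbor is better: shrink the step and try again.
            step //= 2
    return n, best
```

With a synthetic objective such as `lambda n: -(n - 11) ** 2`, the search moves from 4 to 8 to 12 and then settles on 11. Real throughput measurements are noisy and change over time, so a practical version would need repeated or smoothed measurements, which this sketch leaves out.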
ISBN: 9781509032488 (print)
Raw 3-D data collections contain noise and artifacts, so the data must be recovered from degradation by an automated filtering system before further machine analysis. To this end, five performance-efficient FPGA-prototyped processors are devised to realize a parallel 3-D filtering algorithm. These parallel processors tackle the major bottlenecks and limitations of existing multiprocessor systems in input volumetric data, processing word length, output boundary conditions, and inter-processor communication. Then, a greyscale 256×256×20 MRI case study is efficiently filtered and improved by a class of common convolution operators and their developed variants, respectively. The performance of the five implemented processors is evaluated analytically in terms of area, speed, dynamic power, and throughput. All five processors achieve high real-time throughput of up to 114 VPS and power consumption as low as 64 mW at the maximum operating frequency. The devised processors can be embedded in mobile MRI or fMRI scanners and used as a pre-filtering stage in any portable automated fMRI system.
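The kind of operation such processors accelerate can be sketched in software as a 3-D convolution over a volume. The abstract does not name the operators, so the example below assumes a simple 3×3×3 mean (box) kernel and copies border voxels unchanged, which is only one possible boundary treatment. It is plain Python for illustration and says nothing about the FPGA design itself.

```python
def box_filter_3d(vol):
    """Apply a 3x3x3 mean filter to interior voxels; borders are copied.

    vol is a nested list indexed as vol[z][y][x].
    """
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    # Start from a deep copy so that border voxels keep their input values.
    out = [[row[:] for row in plane] for plane in vol]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                total = 0.0
                # Sum the 27 voxels in the neighborhood of (z, y, x).
                for dz in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            total += vol[z + dz][y + dy][x + dx]
                out[z][y][x] = total / 27.0
    return out
```

Every output voxel is computed from the input volume alone, so the three outer loops can be split across processing elements without write conflicts.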