ISBN (Print): 9781479938445
In recent years, multi-core digital signal processors (DSPs) have been widely used to improve execution efficiency in a variety of applications. To fully exploit the parallel processing capacity of DSPs, a well-designed parallel programming model is essential. In this paper, a parallel programming model (PPMA) for a self-designed multi-core audio DSP (MAD) is proposed, based on both shared-memory and message-passing communication mechanisms. PPMA provides a set of application program interfaces (APIs) for efficient inter-core data transmission and synchronization control. To evaluate the performance improvement of audio applications using PPMA, a low bit-rate speech codec was ported to the MAD; with PPMA, task scheduling for the codec can be implemented conveniently. Experimental results also show that the inter-core communication overhead on the MAD is negligible compared to the parallel speedup achieved by PPMA.
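The abstract does not reproduce the PPMA API, so the following C sketch is purely illustrative: the names ppma_send, ppma_recv, and ppma_barrier are assumptions standing in for whatever the real interface provides. It shows how the two mechanisms the paper combines might cooperate, with message passing carrying small control words and shared memory carrying bulk audio frames.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical PPMA-style interface; every name below is an assumption.
 * Stub bodies stand in for what the DSP runtime would provide. */
static int  ppma_send(int dst_core, const void *msg, size_t len)
{ (void)dst_core; (void)msg; (void)len; return 0; }   /* runtime-provided */
static int  ppma_recv(int src_core, void *msg, size_t len)
{ (void)src_core; (void)msg; (void)len; return 0; }   /* runtime-provided */
static void ppma_barrier(void) { }                     /* runtime-provided */

/* Producer core: decode one frame into shared memory, then notify. */
void producer(int16_t *shared_pcm, int consumer_core) {
    (void)shared_pcm;            /* ... fill with one decoded frame ... */
    int frame_ready = 1;
    ppma_send(consumer_core, &frame_ready, sizeof frame_ready);
}

/* Consumer core: wait for the notification, then read shared memory. */
void consumer(const int16_t *shared_pcm, int producer_core) {
    (void)shared_pcm;
    int frame_ready = 0;
    ppma_recv(producer_core, &frame_ready, sizeof frame_ready);
    /* ... post-process the frame in shared_pcm ... */
    ppma_barrier();              /* keep codec pipeline stages in step */
}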
Coprocessors based on the Intel Many Integrated Core (MIC) architecture have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) The native MPI programming model on the MIC processors typically outperforms the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve performance for parallel applications on clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these processes onto as few MIC processors as possible to reduce cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed across both MIC cores and CPU cores, can outperform the native MPI programming model.
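Finding (2), multithreading inside each MPI process, corresponds to the standard hybrid MPI+OpenMP pattern. A minimal, runnable skeleton (illustrative, not the authors' code):

/* One MPI rank per coprocessor, an OpenMP team inside each rank.
 * Launch with e.g.: mpirun -np 2 ./hybrid  (OMP_NUM_THREADS set per rank) */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the main thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}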
A parallelization of the bi-conjugate gradient solver for the pressure equation of the CUPID (component unstructured program for interfacial dynamics) code, which was developed for analyzing the components of a pressurized water-cooled reactor, was studied on a symmetric multi-processing system. The parallel performance was investigated for three typical parallel programming models (MPI, OpenMP, and hybrid) by solving incompressible backward-facing step flow at various grid resolutions. It was confirmed that parallel performance was low when the problem size was small or when the memory requirement of each thread considerably exceeded the cache size. Furthermore, it was shown that MPI outperformed OpenMP regardless of problem size, and that the hybrid model was best when the number of threads was relatively small.
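The kernel that dominates each bi-conjugate gradient iteration is a sparse matrix-vector product. The OpenMP version below is illustrative rather than CUPID's actual code; it also shows where the cache effect noted above bites: once each thread's share of the matrix values exceeds the cache, speedup degrades.

#include <omp.h>

/* y = A*x with A in compressed sparse row (CSR) format */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];  /* irregular access to x */
        y[i] = sum;
    }
}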
Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputers often consist of clusters of SMP nodes, and programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. OpenMP programs, however, cannot scale beyond a single SMP node, whereas MPI programs can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of participating cores, and we present a communication model to approximate the overhead of OpenMP loops. Our results hold across a large variety of input data files. We have also developed our own load-balancing and cache-optimization techniques for the message-passing model. Experimental results show that these techniques yield optimal performance of our parallel algorithm for various input parameters, such as sequence size and tile size, on a wide variety of multicore architectures.
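A common way to expose loop-level parallelism in pairwise alignment is a tiled wavefront (a sketch of the general technique; the paper's exact tiling scheme is not given): tiles on the same anti-diagonal of the dynamic-programming matrix have no mutual dependences, so each wavefront becomes one OpenMP loop, with the tile size trading parallelism against cache reuse.

/* Tiled wavefront over an nti-by-ntj grid of DP tiles. */
void align_wavefront(int nti, int ntj) {
    for (int d = 0; d < nti + ntj - 1; d++) {   /* anti-diagonals */
        #pragma omp parallel for schedule(dynamic)
        for (int ti = 0; ti < nti; ti++) {
            int tj = d - ti;
            if (tj < 0 || tj >= ntj) continue;  /* off this diagonal */
            /* process_tile(ti, tj): fill this tile's DP cells from
             * the already-finished tiles above and to the left */
        }
    }
}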
ISBN (Print): 9781479941162
Coprocessors based on the Intel Many Integrated Core (MIC) architecture have been adopted in many high-performance computer clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) On the Beacon computer cluster, the native MPI programming model on the MIC processors typically outperforms the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve performance for parallel applications on clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule these processes onto as few MIC processors as possible to reduce cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed across both MIC cores and CPU cores, can outperform the native MPI programming model.
Thread-level speculation provides not only a simple parallel programming model but also an effective mechanism for exploiting thread-level parallelism. The performance of software speculative parallel models is limited by high global overheads caused by different types of loops, which have different dependence characteristics and require different optimization strategies. In this paper, we propose three complementary optimization techniques that reduce different components of the global overhead, each targeting the requirements of a different loop type: inter-thread fetching reduces the high mis-speculation rate of loops with frequent dependences; out-of-order committing reduces the control overhead of loops with infrequent dependences; and enhanced dynamic task-granularity resizing reduces the control overhead and optimizes the global overhead of loops whose dependence characteristics change over time. All three techniques have been implemented in HEUSPEC, a software TLS system. Experimental results indicate that they satisfy the demands of different groups of benchmarks, and that their combination improves the performance of all benchmarks and reaches a higher average speedup.
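As a concrete illustration of what a software TLS runtime must do (a minimal sketch, not the HEUSPEC API): iterations run speculatively in parallel against a stale snapshot, log their writes, and are committed in program order, with conflicting iterations squashed and re-executed. The mis-speculation rate the paper targets corresponds to how often the squash path is taken.

#include <stdio.h>
#include <stdlib.h>

#define N 1024

/* Loop to parallelize: a[idx[i]] += 1, where idx may repeat. */
int a[N], idx[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = 0; idx[i] = rand() % N; }

    int loc[N], val[N];              /* per-iteration write log */

    /* Speculative phase: each iteration reads the old value of its
     * target and computes the new value privately (no shared writes). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        loc[i] = idx[i];
        val[i] = a[idx[i]] + 1;      /* speculative read */
    }

    /* In-order commit with conflict detection. */
    char dirty[N] = {0};
    for (int i = 0; i < N; i++) {
        if (dirty[loc[i]])           /* an earlier iteration wrote our */
            val[i] = a[loc[i]] + 1;  /* input: squash and re-execute   */
        a[loc[i]] = val[i];
        dirty[loc[i]] = 1;
    }
    printf("a[0] = %d\n", a[0]);
    return 0;
}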
The advent of the Cloud and the popularization of mobile devices have led to a shift in computing access: users have an interactive display, while heavy computations run remotely on Cloud servers. COMPSs-Mobile is a framework that aims to ease the development of energy-efficient, high-performing applications for this kind of environment. The framework provides an infrastructure-unaware programming model that allows developers to code regular Android applications whose computation is transparently parallelized and partially offloaded to remote resources. This paper gives an overview of the programming model and describes the internal components of the toolkit that supports it, focusing on the offloading and checkpointing mechanisms. It also presents the results of tests conducted to evaluate the behavior of the solution and to measure its potential benefits in Android applications.
OpenACC is a directive-based programming model that allows programmers to write graphics processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices for expressing nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of the memory hierarchy. Parallel loops can be arbitrarily nested or placed inside functions that are (possibly recursively) called from other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory-allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism: thread blocks dynamically organize loop iterations into batches and execute the batches in depth-first order, while different thread blocks share iterations with one another through an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order, and the two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
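The kind of irregular nested parallel loop in question can be written in standard OpenACC as below (the abstract does not reproduce PFACC's own directive syntax, so this shows only the problem pattern): the inner trip count varies per row, which plain OpenACC maps poorly and which PFACC's batching and iteration-stealing runtime is designed to handle.

/* Scale the nonzeros of a CSR matrix; inner loop length is irregular. */
void csr_scale(int n, const int *row_ptr, float *val, float s) {
    #pragma acc parallel loop gang
    for (int i = 0; i < n; i++) {
        #pragma acc loop vector
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)  /* irregular */
            val[j] *= s;
    }
}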
The genesis of parallel programming models is considered. It is shown that the parallelism and the hardware support for synchronization inherent in an architecture determine its parallel programming model. Modern VLSI technology requires a shift to programming models based on shared memory.
With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability while preserving programming tractability on such hardware has prompted the HPC community to design new, higher-level paradigms that rely on runtime systems to maintain performance. However, the common weakness of these projects is that they deeply tie applications to specific, expert-only runtime system APIs. The OpenMP specification, which aims at providing common parallel programming means for shared-memory platforms, appears to be a good candidate to address this issue thanks to the task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, implementing the fast multipole method (FMM), which we have deeply redesigned around the most advanced features of OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors, and that extensions to the 4.0 standard further improve performance, bridging the gap with the very high performance so far reserved to expert-only runtime system APIs.
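The OpenMP 4.0 feature in question is the task construct with depend clauses, which lets the runtime order tasks by their data dependences instead of global barriers. A toy two-stage example (not ScalFMM's actual code):

#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)               /* stage 1, e.g. P2M */
        a = 1.0;
        #pragma omp task depend(in: a) depend(out: b) /* stage 2 consumes a */
        b = a + 1.0;
        #pragma omp taskwait
    }
    printf("b = %f\n", b);  /* always 2.0: the dependences enforce order */
    return 0;
}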