Bulk Synchronous Parallel (BSP) is a model for parallel computing with predictable scalability. BSP has a cost model: programs can be assigned a cost that describes their resource usage on any parallel machine. However, the programmer has to derive this cost manually. This paper describes an automatic method for deriving BSP program costs, based on classic cost analysis and approximation of polyhedral integer volumes. Our method requires and analyzes programs with textually aligned synchronization and textually aligned, polyhedral communication. We have implemented the analysis, and our prototype obtains cost formulas that are parametric in the input parameters of the program and the parameters of the BSP computer, and thus bound the cost of running the program with any input on any number of cores. We evaluate the cost formulas and find that they are indeed upper bounds, and tight for data-oblivious programs. Additionally, we evaluate their capacity to predict concrete run times in two parallel settings: a multi-core computer and a cluster. We find that when exact upper bounds can be found, they accurately predict run times. In networks with full bisection bandwidth, as the BSP model supposes, results are promising, with errors below 50%.
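The standard BSP cost model charges each superstep its local work w, plus g times the h-relation (the maximum number of words any core sends or receives), plus the synchronization latency l; a program's cost is the sum over its supersteps. A minimal sketch of that arithmetic (function names are illustrative, not taken from the paper's tool):

```python
def superstep_cost(w, h, g, l):
    """Cost of one BSP superstep: local work w, h-relation h,
    per-word communication cost g, synchronization latency l."""
    return w + g * h + l

def program_cost(supersteps, g, l):
    """Total BSP cost: the sum of the per-superstep costs.
    supersteps is a list of (w, h) pairs."""
    return sum(superstep_cost(w, h, g, l) for (w, h) in supersteps)

# Example: one core broadcasting n words to the other p-1 cores in a
# single superstep realises an h-relation of n*(p-1) at the sender.
n, p = 1000, 4
cost = program_cost([(n, n * (p - 1))], g=2, l=100)
```

The formulas the paper derives are exactly of this shape, but parametric: w and h become symbolic expressions in the program's input parameters, bounded via polyhedral volume approximation.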
This paper presents a number of optimisations for improving the performance of unstructured computational fluid dynamics codes on multicore and manycore architectures such as the Intel Sandy Bridge, Broadwell and Skylake CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors. We discuss and demonstrate their implementation in two distinct classes of computational kernels: face-based loops, represented by the computation of fluxes, and cell-based loops, representing updates to state vectors. We present the importance of making efficient use of the underlying vector units in both classes of computational kernels, with special emphasis on the changes required for vectorising face-based loops and their intrinsic indirect and irregular access patterns. We demonstrate the advantage of different data layouts for cell-centred as well as face data structures, and architecture-specific optimisations for improving the performance of the gather and scatter operations which are prevalent in unstructured mesh applications. The implementation of a software prefetching strategy based on auto-tuning is also shown, along with an empirical evaluation of the importance of multithreading for in-order architectures such as Knights Corner. We explore the various memory modes available on the Intel Xeon Phi Knights Landing architecture and present an approach whereby both the traditional DRAM and MCDRAM interfaces are exploited for maximum performance. We obtain significant full-application speed-ups of between 2.8X and 3X across the multicore CPUs in two-socket node configurations, 8.6X on the Intel Xeon Phi Knights Corner coprocessor and 5.6X on the Intel Xeon Phi Knights Landing processor in an unstructured finite volume CFD code representative in size and complexity of an industrial application. Program summary. Program Title: some_opt_for_unstructured_cfd. Program Files doi: http://***/10.17632/zyh2zkf3jw.1. Licensing provisions: GNU General Public License 3 (GPL).
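The face-based pattern that makes vectorisation hard can be outlined as an indirect gather/compute/scatter loop. A behavioural sketch in plain Python (the toy central-difference flux and all names are ours, not from the paper's code): the scatter writes are the crux, since two faces sharing a cell create write conflicts when the loop is vectorised across faces.

```python
def face_flux_update(faces, state, residual):
    """For each face (left_cell, right_cell): gather the two cell
    states through indirect indices, compute a flux, and scatter it
    back to both adjacent cells."""
    for (lc, rc) in faces:
        flux = 0.5 * (state[lc] - state[rc])  # toy flux for illustration
        residual[lc] -= flux  # scatter: two faces sharing a cell would
        residual[rc] += flux  # race if this loop ran in SIMD lanes

# A 1-D chain of 4 cells joined by 3 faces.
faces = [(0, 1), (1, 2), (2, 3)]
state = [4.0, 3.0, 2.0, 1.0]
residual = [0.0] * 4
face_flux_update(faces, state, residual)
```

Cell-based loops, by contrast, touch each cell exactly once with unit stride, which is why the two kernel classes need different vectorisation treatment.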
Due to the growth of biological databases and biomedical instruments, high-performance active (real-time) signal processing has become a challenge for medical scientists and engineers. Medical applications require a high-performance signal processor that can run scientific and engineering biomedical workloads and is easy to program. In this article, we propose a biomedical sensor interface, a biomedical application processing system (BAPS) based on a heterogeneous multi-core processing architecture, and a biomedical application toolkit. The biomedical sensor interface supports multiple regular and complex medical signals and provides digital data to the processing system. The BAPS uses a heterogeneous multi-core architecture that processes biomedical applications at up to 10 billion operations per second with a timing accuracy of 1 μs. The biomedical application toolkit provides programmability by supporting hardware-level, scientific, and artificial intelligence programming. The BAPS provides a single embedded platform for a wide range of biomedical signal and image processing applications. To demonstrate the proposed system, we developed the BAPS hardware architecture and tested it with different biomedical applications. Compared with the baseline system, BAPS improves active (real-time) application performance by up to 12.8 times, processes passive (non-real-time) applications 7.4 times faster, and improves artificial intelligence application performance by 4.84 times, while drawing 1.56 times less dynamic power and consuming 21.85 times less energy.
Many libraries in the HPC field use sophisticated algorithms with clear theoretical scalability expectations. However, hardware constraints or programming bugs may sometimes render these expectations inaccurate or even plainly wrong. While algorithm and performance engineers have already been advocating the systematic combination of analytical performance models with practical measurements for a very long time, we go one step further and show how this comparison can become part of automated testing procedures. The most important applications of our method include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. Advancing the concept of performance assertions, we verify asymptotic scaling trends rather than precise analytical expressions, relieving the developer from the burden of having to specify and maintain very fine-grained and potentially non-portable expectations. In this way, scalability validation can be continuously applied throughout the whole development cycle with very little effort. Using MPI and parallel sorting algorithms as examples, we show how our method can help uncover non-obvious limitations of both libraries and underlying platforms.
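A scalability assertion of the kind described above can be reduced to fitting the empirical scaling exponent from a handful of measurements and failing the test when it exceeds the expected asymptotic trend. A minimal sketch under our own assumptions (a log-log least-squares fit; the real framework is more sophisticated, e.g. about noise and model selection):

```python
import math

def fitted_exponent(sizes, times):
    """Least-squares slope of log(time) vs log(size): the empirical
    exponent k in time ~ size**k."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def assert_scaling(sizes, times, expected, tol=0.2):
    """Scalability assertion: fail when the measured exponent exceeds
    the expected asymptotic exponent by more than tol."""
    k = fitted_exponent(sizes, times)
    assert k <= expected + tol, f"scales as n^{k:.2f}, expected n^{expected}"

# Example: run times growing like n log n pass a near-linear check.
sizes = [1000, 2000, 4000, 8000]
times = [s * math.log(s) for s in sizes]
assert_scaling(sizes, times, expected=1.2)
```

Because the assertion only pins down a trend, not a precise analytical formula, it survives platform changes that would invalidate hand-maintained closed-form expectations.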
Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. By combining two types of cores (fast and slow), AMCs are able to provide high performance under the facility power budget. This paper performs the first extensive evaluation of how portable current HPC applications are to such supercomputing systems. Specifically, we evaluate several execution models on an ARM *** AMC using the PARSEC benchmark suite, which includes representative highly parallel applications. We compare schedulers at the user, OS, and runtime levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing load across AMCs. Our results demonstrate that scheduling is most effective when it takes place at the runtime system level, improving the baseline by 23%, while the heterogeneity-aware OS scheduling solution improves the baseline by 10%. (C) 2019 Published by Elsevier Inc.
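Why dynamic, runtime-level scheduling helps on fast/slow cores can be seen in a toy load-balancing model (our own illustration, not the paper's schedulers): assigning each ready task to whichever core would finish it earliest naturally gives the fast core proportionally more work, whereas a static even split leaves the slow core as the bottleneck.

```python
import heapq

def dynamic_schedule(task_costs, core_speeds):
    """Greedy dynamic scheduling on an asymmetric multi-core: each
    task goes to the core that would finish it earliest. core_speeds
    are work units per time unit; returns the makespan."""
    heap = [(0.0, i) for i in range(len(core_speeds))]  # (finish_time, core)
    heapq.heapify(heap)
    for cost in task_costs:
        t, i = heapq.heappop(heap)                # earliest-free core
        heapq.heappush(heap, (t + cost / core_speeds[i], i))
    return max(t for t, _ in heap)

# Six unit tasks on one fast (2x) and one slow core: dynamic assignment
# gives the fast core four tasks and finishes in 2.0 time units, while a
# static 3/3 split would take 3.0 (the slow core's share).
makespan = dynamic_schedule([1.0] * 6, [2.0, 1.0])
```

Real runtime systems add work stealing, task dependencies, and migration costs on top of this basic idea.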
Scalability is a key feature for big data analysis and machine learning frameworks and for applications that need to analyze very large and real-time data available from data repositories, social media, sensor networks, smartphones, and the Web. Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of high performance computing (HPC) systems and clouds, whereas in the near future Exascale systems will be used to implement extreme-scale data analysis. This article discusses how clouds currently support the development of scalable data mining solutions, and outlines and examines the main challenges to be addressed and solved for implementing innovative data analysis applications on Exascale systems.
In this work, novel circuits based on memristors for implementing electronic synapses and artificial neurons are designed. First, two simple synaptic circuits for implementing weighting calculations in voltage and current modes using twin memristors are proposed. The synaptic weighting operation is defined as a difference function between the twin memristors, which can be adjusted in opposite directions by applying programming signals, realizing positive, zero, and negative synaptic weights. Second, two neuron circuits using the proposed memristor synapses, in which parallel computing and programming can be achieved, are designed. Finally, the performance of the proposed memristor synapses and neuron circuits, such as weight programming, neuron computing, and parallel operation, is analyzed through PSpice simulations. (C) 2018 Elsevier B.V. All rights reserved.
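The twin-memristor idea can be summarized behaviourally: the signed synaptic weight is the difference of the two memristor conductances, and the neuron thresholds a weighted sum of its inputs. A numeric sketch of that behaviour (a stand-in for the analogue circuits, with our own function names, not a circuit model):

```python
def synaptic_weight(g_pos, g_neg):
    """Twin-memristor synapse: the effective weight is the difference
    of the two conductances; programming the pair in opposite
    directions yields positive, zero, or negative weights."""
    return g_pos - g_neg

def neuron_output(inputs, weights, threshold=0.0):
    """Behavioural neuron: weighted sum of inputs followed by a hard
    threshold (the analogue circuit computes the sum in parallel)."""
    s = sum(v * w for v, w in zip(inputs, weights))
    return 1 if s >= threshold else 0

w = synaptic_weight(0.8, 0.3)   # programmed to a positive weight of 0.5
```

In the actual circuits the weighted sum is obtained from Kirchhoff's current law, so all synapses contribute simultaneously, which is the source of the parallel computing the abstract refers to.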
The numerical nonreproducibility of parallel molecular dynamics (MD) simulations, which stems from the non-associative accumulation of floating-point data, poses great challenges for development, debugging, and validation. The most common solutions to this problem are using a high-precision data type or sorting operations, but these solutions carry significant computational overhead. This paper analyzes the sources of nonreproducibility in parallel MD simulations in detail. Two general solutions, namely sorting by force component value and using an 80-bit long double data type, are implemented and evaluated in LAMMPS. To reduce the computational cost, a full-list-based method with the operation order sorted by particle distance is proposed, inspired by the spatial characteristics of MD simulations. An experiment on a system with constant-energy dynamics shows that the new method ensures reproducibility at any degree of parallelism with an extra 50% computational overhead. (C) 2019 Published by Elsevier B.V.
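The core of the sorting-based fixes is to impose a decomposition-independent accumulation order, since floating-point addition is not associative. A minimal sketch of that idea (our own simplification: a single serial sum keyed the way the paper keys by particle distance, not the LAMMPS implementation):

```python
def reproducible_sum(values, keys):
    """Fixed-order accumulation: sort the addends by a key that does
    not depend on the parallel decomposition (e.g. particle distance),
    so the rounded floating-point result is bitwise identical however
    the values arrive."""
    order = sorted(range(len(values)), key=lambda i: (keys[i], values[i]))
    total = 0.0
    for i in order:
        total += values[i]  # same order => same rounding => same bits
    return total

# Two arrival orders of the same (value, key) pairs: a naive sum in
# arrival order may round differently, but the keyed sum cannot.
vals_a, keys_a = [1e16, 1.0, -1e16, 3.0], [2.0, 0.0, 3.0, 1.0]
vals_b, keys_b = [3.0, -1e16, 1.0, 1e16], [1.0, 3.0, 0.0, 2.0]
```

Sorting by force component value works the same way but keys on the summand itself; the paper's distance-keyed variant is cheaper because neighbour lists already carry spatial information.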
Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems r...
We present an OpenACC-based parallel implementation of stochastic algorithms for simulating biochemical reaction networks on modern GPUs (graphics processing units). To investigate the effectiveness of OpenACC for leveraging the massive hardware parallelism of the GPU architecture, we carefully apply OpenACC's language constructs and mechanisms to implement a parallel version of stochastic simulation algorithms on the GPU. Comparing our OpenACC implementation to both the NVIDIA CUDA and CPU-based implementations, we report our initial experience with OpenACC's performance and programming productivity in the context of GPU-accelerated scientific computing.
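The workhorse of such simulations is Gillespie's direct-method stochastic simulation algorithm (SSA). A sequential sketch of one trajectory (our own illustration in Python, assuming this is the SSA variant meant; GPU implementations typically run many such independent trajectories in parallel, which is the dimension OpenACC or CUDA exploits):

```python
import math
import random

def ssa_direct(propensities, apply_reaction, x0, t_end, rng):
    """Gillespie direct-method SSA for one trajectory.
    propensities(x) -> list of reaction propensities for state x;
    apply_reaction(x, j) -> state after firing reaction j."""
    x, t = x0, 0.0
    while t < t_end:
        a = propensities(x)
        a0 = sum(a)
        if a0 == 0.0:
            break                                   # no reaction can fire
        t += -math.log(1.0 - rng.random()) / a0     # exponential waiting time
        r, j, acc = rng.random() * a0, 0, a[0]
        while acc < r:                              # choose reaction j with
            j += 1                                  # probability a[j] / a0
            acc += a[j]
        x = apply_reaction(x, j)
    return x

# Example: pure decay A -> 0 with rate constant 0.5 per molecule;
# starting from 100 molecules, the population eventually reaches 0.
final = ssa_direct(lambda x: [0.5 * x], lambda x, j: x - 1,
                   x0=100, t_end=1000.0, rng=random.Random(1))
```

Since trajectories share no state, an OpenACC port can mark the loop over trajectories as a parallel region and keep each trajectory's inner loop sequential on its GPU thread.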