ISBN: 9781450335591 (print)
Many libraries in the HPC field encapsulate sophisticated algorithms with clear theoretical scalability expectations. However, hardware constraints or programming bugs may sometimes render these expectations inaccurate or even plainly wrong. While algorithm engineers have already been advocating the systematic combination of analytical performance models with practical measurements for a very long time, we go one step further and show how this comparison can become part of automated testing procedures. The most important applications of our method include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. Advancing the concept of performance assertions, we verify asymptotic scaling trends rather than precise analytical expressions, relieving the developer from the burden of having to specify and maintain very fine-grained and potentially non-portable expectations. In this way, scalability validation can be continuously applied throughout the whole development cycle with very little effort. Using MPI as an example, we show how our method can help uncover non-obvious limitations of both libraries and underlying platforms.
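The idea of asserting an asymptotic trend rather than an exact analytical expression can be illustrated with a small sketch. This is not the authors' method: a simple least-squares power-law fit on a log-log scale is swapped in here, and the names `fitted_exponent`, `assert_scaling`, the `tolerance` parameter, and the synthetic timings are all assumptions made for illustration.

```python
import math

def fitted_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size).

    For times that follow c * size**k, the slope is k.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def assert_scaling(sizes, times, max_exponent, tolerance=0.1):
    """Fail if the measured growth exceeds the expected power-law exponent.

    `tolerance` (hypothetical) absorbs measurement noise.
    """
    exponent = fitted_exponent(sizes, times)
    if exponent > max_exponent + tolerance:
        raise AssertionError(
            f"expected growth of at most p^{max_exponent}, "
            f"measured roughly p^{exponent:.2f}"
        )
    return exponent

# Synthetic timings (not measurements): one linear, one quadratic in p.
process_counts = [2, 4, 8, 16, 32, 64]
linear_times = [1e-6 * p for p in process_counts]
quadratic_times = [1e-6 * p * p for p in process_counts]
```

In a test suite, `assert_scaling(process_counts, measured_times, max_exponent=1.0)` would pass for the linear series and fail for the quadratic one. Logarithmic expectations would need a different model family than this power-law fit.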
ISBN: 9789812879363; 9789812879356 (print)
Latent Semantic Indexing (LSI) is a well-known search technique that matches queries to documents in information retrieval applications. LSI has been shown to improve retrieval performance; however, as the size of the documents grows, current implementations are not fast enough to compute the result on a standard personal computer. In this paper, we propose a new parallel LSI algorithm for standard personal computers with multi-core processors to improve the performance of retrieving relevant documents. The proposed parallel LSI is designed to automatically run the matrix computations of the LSI algorithm as parallel threads on multi-core processors. The Fork-Join technique is applied to execute the parallel programs. We used the Malay Translated Hadith of Shahih Bukhari from Jilid 1 to Jilid 4 as the test collection, a total of 2028 text files. The processing time of the document pre-processing phase is measured for the proposed parallel LSI and compared with that of the sequential LSI algorithm. Our results show that pre-processing with the proposed parallel LSI system is faster than with the sequential system. Thus, the proposed parallel LSI algorithm improves search time compared with the sequential LSI algorithm.
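The Fork-Join structure mentioned above can be sketched as follows. The source does not show its code, so this is only an assumed illustration: a matrix product is split into row blocks, each block is submitted as a task (fork), and the partial results are collected in order (join). The function names and the `workers` parameter are hypothetical, and Python threads are used purely to show the structure; pure-Python arithmetic will not speed up under the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_rows(rows, b):
    """Multiply a block of rows by matrix b (lists of lists)."""
    columns = list(zip(*b))
    return [
        [sum(x * y for x, y in zip(row, col)) for col in columns]
        for row in rows
    ]

def fork_join_matmul(a, b, workers=4):
    """Compute a @ b by forking one task per row block and joining in order."""
    chunk = max(1, (len(a) + workers - 1) // workers)
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: one task per row block.
        futures = [pool.submit(multiply_rows, block, b) for block in blocks]
        # Join: wait for each task and concatenate results in block order.
        result = []
        for future in futures:
            result.extend(future.result())
    return result
```

For example, `fork_join_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields the same rows as the sequential `multiply_rows` call on the full matrix.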
ISBN: 9781467397971 (print)
In this paper, we present an efficient parallel algorithm for calculating cumulative integration based on Simpson's rule. The proposed parallel algorithm exploits two Blelloch prefix sums (scans). The first scan computes the even-indexed cumulative integrals, while the second computes the odd-indexed ones. We implement the parallel algorithm on NVIDIA CUDA-based GPUs. The performance of the proposed parallel algorithm is measured by its speedup. We also report the accuracy of the proposed algorithm. Based on the performance measurements, we conclude that the proposed parallel algorithm is faster than optimized CPU code, achieving a 3x speedup.
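The two-scan structure can be sketched sequentially as below. This is a sketch under stated assumptions, not the authors' CUDA implementation: `itertools.accumulate` stands in for the parallel Blelloch scan, and the starting value for the odd-indexed chain (a quadratic-interpolation formula for the first interval) is an assumption, since the abstract does not say how that value is obtained. The function name and parameters are hypothetical.

```python
from itertools import accumulate

def cumulative_simpson(f, h):
    """Cumulative integral of samples f on a uniform grid with spacing h.

    Returns a list F where F[i] approximates the integral from x_0 to x_i.
    """
    n = len(f)
    if n < 3:
        raise ValueError("need at least three samples")

    def panel(i):
        # Simpson's rule over the two intervals [x_i, x_{i+2}].
        return h / 3.0 * (f[i] + 4.0 * f[i + 1] + f[i + 2])

    even_panels = [panel(i) for i in range(0, n - 2, 2)]
    odd_panels = [panel(i) for i in range(1, n - 2, 2)]

    # Assumed start for the odd chain: integral over [x_0, x_1] from the
    # quadratic through the first three samples.
    first = h / 12.0 * (5.0 * f[0] + 8.0 * f[1] - f[2])

    # Scan 1: even-indexed results F[0], F[2], ...
    even = list(accumulate(even_panels, initial=0.0))
    # Scan 2: odd-indexed results F[1], F[3], ...
    odd = list(accumulate(odd_panels, initial=first))

    out = [0.0] * n
    out[0::2] = even
    out[1::2] = odd
    return out
```

Because every panel term depends only on the input samples, the two lists of panel values can be computed independently before scanning, which is what makes the prefix-sum formulation suitable for a GPU.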
ISBN: 9781479969180 (print)
In this paper, we describe practical results from an algorithmic trading prototype and from performance optimization experiments on end-user code generation from customized UML models. Our prototype includes high-performance computing solutions for algorithmic trading systems. The performance prediction feature can help traders understand how powerful a machine they need for a very diverse portfolio, or help them define the maximum size of their portfolio for a given machine. Traders can use our Watch Monitor to supervise the PNL (Profit and Loss) of the portfolio and other information to date. A portfolio management module could be added later to aggregate the information from all strategies in order to maintain the risk level of the portfolio automatically. The prototype can be modified by end-users at the UML model level and then used with automatic Java code generation and execution within the Eclipse IDE. An advanced coding environment was developed to provide a visual and declarative approach to developing trading algorithms. We learned exact, quantitative conditions under which the system can adapt to varying data and hardware parameters.
ISBN: 9781467384889 (print)
Biological sequence comparison is a very common task in Bioinformatics applications. Many parallel solutions have been proposed for this problem, using different HPC platforms, usually programmed with platform-specific languages and frameworks. With this approach, it is difficult to port solutions among different platforms, such as CPUs and GPUs. To tackle this problem, this paper proposes and evaluates an OpenCL parallel solution for Biological Sequence Comparison, which was integrated into the CUDAlign Megabase Sequence Comparison tool. The evaluation of our solution shows that we were able to obtain a program for CPUs and GPUs (NVidia and AMD) with basically the same OpenCL code. In addition, in a comparison with the optimized CUDA codes of SW# and CUDAlign, we show that our OpenCL version achieves comparable and, in many cases, superior performance.
ISBN: 9780769557854 (print)
The computation core of many big data applications can be expressed as general matrix computations, including linear algebra operations and irregular matrix operations. However, existing parallel programming systems such as Spark lack a programming abstraction and an efficient implementation for general matrix computations. In this paper, we present MatrixMap, a unified and efficient data-parallel system for general matrix computations. MatrixMap provides a powerful yet simple abstraction, consisting of a distributed data structure called the bulk key matrix and a computation interface defined by matrix patterns. Users can easily load data into bulk key matrices and express algorithms as parallel matrix patterns. MatrixMap outperforms current state-of-the-art systems by employing three key techniques: matrix patterns with lambda functions for irregular and linear algebra matrix operations, an asynchronous computation pipeline with optimized data shuffling strategies for specific matrix patterns, and an in-memory data structure that reuses data across iterations. Moreover, it automatically handles the parallelization and distributed execution of programs on a large cluster. The experimental results show that MatrixMap is 12 times faster than Spark.
ISBN: 9781467386210 (print)
Software Transactional Memory (STM) is a synchronization method proposed as an alternative to lock-based synchronization. It provides a higher level of abstraction that is easier to program and that enables software composition. Transactions are defined by programmers, but the runtime system is responsible for detecting conflicts and avoiding race conditions. One of the design axes in STMs is how version management is implemented in order to ensure atomicity. There are two types of version management: Eager Versioning and Lazy Versioning. In this work, we evaluate the version management options implemented in TinySTM through an orthogonal analysis and performance evaluation.
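The difference between the two version management styles can be shown with a toy, single-threaded sketch. It is not TinySTM's API and omits conflict detection entirely; the class names `EagerTx` and `LazyTx` and the dictionary used as shared memory are assumptions for illustration. Eager versioning writes in place and keeps an undo log for aborts; lazy versioning buffers writes and applies them only at commit.

```python
class EagerTx:
    """Eager versioning: update memory in place, remember old values."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, previous value) pairs

    def read(self, addr):
        return self.memory[addr]

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def commit(self):
        # New values are already in memory; just drop the log.
        self.undo_log.clear()

    def abort(self):
        # Roll back in reverse order to restore the original values.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            self.memory[addr] = old


class LazyTx:
    """Lazy versioning: buffer writes, publish them at commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def read(self, addr):
        # A transaction must see its own pending writes.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        # Memory was never touched; discard the buffer.
        self.write_buffer.clear()
```

The trade-off visible even in this sketch is that eager versioning makes commit cheap and abort costly, while lazy versioning does the opposite.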
ISBN: 9781479984909 (print)
The concept of a task already exists in many parallel programming models. Programmers express parallelism by defining tasks in their applications, and runtime libraries schedule tasks on threads. However, in many task-based parallel programming models, choosing the right number of threads is still key to performance. Hence, the onus is on the programmer to decide not only on the number of tasks, but also on the optimal number of threads in order to get good performance. In this paper, we aim to show that desirable performance can be achieved by focusing only on tasks. For this purpose, we compare a purely task-centric parallel programming model called GPRM with three popular approaches (OpenMP, Intel Cilk Plus, and TBB) on two modern manycore systems, the Tilera TILEPro64 and the Intel Xeon Phi, which have 64 and 60 physical cores, respectively, integrated into a single chip. We have chosen three benchmarks with different characteristics to show that a task-centric approach such as GPRM can facilitate parallel programming while outperforming the other models in most cases. It does so by controlling only the number of tasks, rather than having to tune the number of threads.
ISBN: 9781467365765 (print)
The Fast Fourier Transform (FFT) is a key element of wireless applications based on OFDM (Orthogonal Frequency Division Multiplexing) and is challenging to implement on multi-core/many-core processors. As an example, the Long Term Evolution (LTE) protocol establishes a processing requirement whereby many independent FFTs must be calculated within a limited time slot. By using the Intel Math Kernel Library (MKL) in our approach on the Xeon Phi, we managed to reduce the maximum execution time of many independent FFTs. We propose an implementation for multi-core/many-core processors using OpenMP (Open Multi-Processing) that reduces the mean latency to 124 µs in native mode, compared with 1300 µs in offload mode. This is a challenge for shared-memory projects. This paper describes how this level of performance can be obtained with multi-core Intel i7 and Xeon processors and a many-core Xeon Phi. The best results were obtained with the Xeon Phi, which outperformed the Sandy Bridge Xeon.
ISBN: 9781479985470 (print)
In this paper, we discuss several capstone student projects conducted by students at the University of British Columbia, Okanagan campus (UBCO) and at Okanagan College in different years. The aim of the projects was to demonstrate how end-users could update code for an industrial application (an algorithmic trading system) without any programming skills or experience. Another goal was to improve the performance of the application's collection of stock information from public online sources by introducing parallel code execution on multi-core personal computers. Real algorithmic trading system requirements were used as a case study. The Eclipse Modelling Framework was used to generate Java code from a UML business model, which can be modified by inexperienced business users. Moreover, code execution can be scaled to a specific computer architecture and hardware for better performance and better utilization of computing resources, especially if a business user wants to collect and analyze a long list of stocks. The last section of the paper focuses on performance optimization and analysis.