ISBN: 9781728174457 (print)
Language stability is an important upcoming feature of the Chapel programming language. Chapel users have both requested big changes to the language and also requested that the language become stable. This talk will discuss recent efforts to complete the big changes to the Chapel language so that the language can stabilize.
ISBN: 9781728199245 (print)
The Inverse Discrete Cosine Transform (IDCT) is commonly used for image and video decoding. Due to the ubiquitous nature of this application area, very efficient implementations of the IDCT are of great importance and have led to the development of highly optimized libraries. The popular libjpeg-turbo library contains thousands of lines of handwritten assembly code utilizing SIMD instruction sets for a variety of architectures. We present an alternative approach: implementing the 8x8 2D IDCT in the image processing language Halide, a high-level, functional language that allows for concise, portable, parallel, and very efficient code. We show how fewer than 100 lines of Halide can replace over 1000 lines of code for each architecture in the libjpeg-turbo library to perform JPEG decoding. The Halide implementation is compared for ARMv8 and x86-64 SIMD extensions and shows a 5-25 percent performance improvement over the SIMD code in libjpeg-turbo while also being much easier to maintain and port to new architectures.
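For readers unfamiliar with the transform itself, the following is a minimal reference sketch of a separable 8x8 2D IDCT in plain Python. It is not the Halide implementation from the paper and not the libjpeg-turbo code; it assumes the orthonormal DCT-II/DCT-III convention, and the names `idct_1d`, `idct_2d`, and `BASIS` are illustrative only.

```python
import math

N = 8  # block size used by JPEG


def _scale(u):
    # Normalisation factor of the orthonormal DCT (assumed convention).
    return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)


# BASIS[u][x] = scale(u) * cos((2x + 1) * u * pi / (2N))
BASIS = [
    [_scale(u) * math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
    for u in range(N)
]


def idct_1d(coeffs):
    """Inverse DCT of one length-8 vector of frequency coefficients."""
    return [sum(BASIS[u][x] * coeffs[u] for u in range(N)) for x in range(N)]


def idct_2d(block):
    """Separable 2D IDCT: transform every row, then every column."""
    rows = [idct_1d(row) for row in block]
    # cols[c] holds the transformed column c, indexed by output row.
    cols = [idct_1d([rows[r][c] for r in range(N)]) for c in range(N)]
    # Transpose back so the result is indexed as out[row][col].
    return [[cols[c][r] for c in range(N)] for r in range(N)]
```

With this normalisation, a block whose only non-zero coefficient is a DC value of 8.0 decodes to a flat block of ones. Optimized implementations typically replace the direct sums with a fast factorization and fixed-point arithmetic.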
Over the last twenty years, the differential evolution (DE) algorithm has proved to be a powerful method for solving minimization problems for multidimensional functions. As a member of the family of evolutionary optimization algorithms, it is based on the concepts of natural selection and mutation. In this study, we test the potential of DE to find a proper set of parameters for the multimode Brownian oscillator model, which was then used to simulate absorption lineshapes of two carotenoid molecules in solution: spheroidene and spheroidenone. This theory assumes that the correlation function of a particular electronic state of the carotenoid is calculated using the semiclassical spectral density function. Building on our previous studies of photosynthetic pigments, we employed several DE strategies to fit the experimental carotenoid spectra. We found that simulated absorption spectra are very sensitive to several parameters that characterize carotenoid vibronic modes, namely the Huang-Rhys factors. Fine-tuning the DE crossover parameter (Cr) and the scaling factor (F) provided acceptable convergence of the algorithm. It appears that, to get good convergence of DE, a certain spectral range of carotenoid absorption, from 400 to 600 nm, must be chosen. This can be explained by the limitations of the applied theory, which does not properly predict the carotenoid absorption at higher frequencies.
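To illustrate the roles of the crossover parameter Cr and the scaling factor F, here is a minimal sketch of one common DE strategy (DE/rand/1/bin) in plain Python. The abstract does not say which strategies or settings the authors used, so the function name, default values, and bound handling below are assumptions, and the objective is a generic function rather than a spectral fit.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, F=0.8, Cr=0.9,
                           generations=200, seed=0):
    """Minimise `objective` over box `bounds` with DE/rand/1/bin (sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(x) for x in pop]

    for _ in range(generations):
        next_pop, next_cost = [], []
        for i in range(pop_size):
            # Pick three distinct members, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated component
            trial = []
            for j in range(dim):
                if rng.random() < Cr or j == j_rand:
                    # Mutation: base vector plus F-scaled difference vector.
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip to bounds (assumed policy)
                else:
                    v = pop[i][j]  # binomial crossover keeps the parent value
                trial.append(v)
            trial_cost = objective(trial)
            # Greedy selection: the trial replaces the parent if not worse.
            if trial_cost <= cost[i]:
                next_pop.append(trial)
                next_cost.append(trial_cost)
            else:
                next_pop.append(pop[i])
                next_cost.append(cost[i])
        pop, cost = next_pop, next_cost

    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]
```

In a fitting context, `objective` would be the misfit between a simulated and a measured spectrum over the chosen wavelength range, with one bounded entry per model parameter.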
In recent years, the research community has made great strides in alias annotations that support parallel programming [1]. Using these techniques, programmers no longer have to guess where aliased mutable state may ca...
ISBN: 9781728180502 (print)
In this paper, we examine two completely different levels of program design for a biomechanical program. The first and broadest level is the data level, where we show that we can use the whole world's data. This is covered by System of Systems engineering. The second and most specific level is the algorithm level. Our goal is to achieve the fastest program run we can. To this end, we survey the possibilities and show an example of how a parallel paradigm accelerates our program.
ISBN: 9781728199245 (print)
The efficient mapping of stream processing applications to parallel hardware architectures is a difficult problem. While parallelization is often highly desirable as it reduces the overall execution time, its advantages must be carefully weighed against the parallelization overhead of complexity and communication costs. This paper presents a novel profile-guided optimization for parallel stream processing based on the multi-paradigm system programming language Rust. Our approach's key idea is to systematically balance the performance gain that can be achieved from parallelization with the communication overhead. To achieve this, we 1) use profiling to gain tight estimates of task execution times, 2) evaluate the cost of the fundamental concurrency constructs in Rust with synthetic benchmarks, and exploit this information to estimate the communication overhead introduced by various degrees of parallelism, and 3) present a novel optimization algorithm that exploits both estimates to fine-tune the degree of parallelism and train processing in a given application. Overall, our approach enables us to map parallel stream processing applications to parallel hardware efficiently. The safety concepts anchored in Rust ensure the reliability of the resulting implementation. We demonstrate our approach's practical applicability with two case studies: the word count problem and aircraft telemetry decoding.
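The trade-off between parallel speedup and communication overhead can be illustrated with a deliberately simple cost model. This is not the optimization algorithm from the paper, which is not described in the abstract; the linear overhead term and the names `estimated_time` and `best_degree_of_parallelism` are assumptions chosen only to show why the best degree of parallelism is usually neither 1 nor the maximum.

```python
def estimated_time(work, per_worker_overhead, workers):
    """Toy model: perfectly divisible work plus overhead linear in workers.

    `work` stands in for a profiled task execution time and
    `per_worker_overhead` for a benchmarked communication cost
    (both hypothetical inputs, in the same time unit).
    """
    return work / workers + per_worker_overhead * workers


def best_degree_of_parallelism(work, per_worker_overhead, max_workers):
    """Return the worker count in 1..max_workers with the lowest estimate."""
    return min(
        range(1, max_workers + 1),
        key=lambda p: estimated_time(work, per_worker_overhead, p),
    )
```

For example, with `work=100` and `per_worker_overhead=1`, the estimate is lowest at 10 workers, even if 32 are available; with zero overhead the model always picks the maximum.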
Chapel's high level data-parallel constructs make parallel programming productive for general programmers. This talk introduces the 'Chapel on Accelerators' project, which proposes compiler enhancements to...
ISBN: 9780738143057 (print)
Peachy Parallel Assignments are high-quality assignments for teaching parallel and distributed computing. They are selected competitively for presentation at the Edu* workshops. All of the assignments have been successfully used in class, and they are selected based on their ease of adoption by other instructors and for being cool and inspirational to students. This paper presents a paper-and-pencil assignment asking students to analyze the performance of different system configurations and an assignment in which students parallelize a simulation of the evolution of simple living organisms.
Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose, highly parallel workloads. Harnessing the performance that these accelerators provide requires the use of specialized native programming interfaces, such as CUDA or OpenCL, or higher-level programming models like OpenMP or OpenACC. However, in managed programming languages, offloading execution to GPUs is much harder and more error-prone, mainly due to the need to call through a native API (Application Programming Interface) and because of mismatches between value and reference semantics. The Fancier framework provides a unified interface to Java, C/C++, and OpenCL C compute kernels, together with facilities to smooth the transitions between these programming languages. This combination of features makes GPU acceleration in Java much more approachable. In addition, Fancier Java code can be directly translated into equivalent C/C++ or OpenCL C code, which simplifies the implementation of higher-level abstractions targeting GPU or parallel execution in Java. Furthermore, it reduces the programming effort without adding significant overhead on top of the necessary OpenCL and Java Native Interface (JNI) API calls. We validate our approach on several image processing workloads running on different Android devices.
ISBN: 9780738110868 (print)
Python has been gaining traction for years in the world of scientific applications. However, the high-level abstraction it provides may not allow the developer to use machines at their peak performance. To address this, multiple strategies, sometimes complementary, have been developed to enrich the software ecosystem, either by relying on additional libraries dedicated to efficient computation (e.g., NumPy) or by providing a framework to better use HPC-scale infrastructures (e.g., PyCOMPSs). In this paper, we present a Python extension based on SharedArray that enables the support of system-provided shared memory, and we describe its integration into the PyCOMPSs programming model as an example of integration into a complex Python environment. We also evaluate the impact such a tool may have on performance in two types of distributed execution flows: one for linear algebra, with a blocked matrix multiplication application, and the other in the context of data clustering, with a k-means application. We show that with very little modification of the original decorator of the task-based application (3 lines of code to be modified), the gain in performance can rise above 40% for tasks relying heavily on data reuse in a distributed environment, especially when loading the data is prominent in the execution time.