Parallel programming is essential to utilize multi-core processors but remains challenging because it requires extensive knowledge of both software and hardware. Various automatic parallelization tools based on static...
Currently, processing large volumes of expanding data efficiently and consistently is a significant challenge. Traditional distributed-memory high-performance computers (HPC) based on the message-passing model struggle wi...
Stream processing applications deal with millions of data items continuously generated over time. Often, they must be processed in real time and scale performance, which requires the use of distributed parallel computing resources. In C/C++, the current state of the art for distributed architectures and High-Performance Computing is the Message Passing Interface (MPI). However, exploiting stream parallelism using MPI is complex and error-prone because it exposes many low-level details to the programmer. In this work, we introduce a new parallel programming abstraction for implementing distributed stream parallelism named DSParLib. Our abstraction over MPI simplifies parallel programming by providing pattern-based, building-block-oriented development to interconnect, model, and parallelize the data streams found in modern applications. Experiments conducted with five different stream processing applications and the representative PARSEC Ferret benchmark revealed that DSParLib is efficient and flexible. Compared with handwritten MPI programs, DSParLib also achieved similar or better performance, required less coding, and provided simpler abstractions for expressing parallelism.
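To make the low-level detail that such an abstraction hides more concrete, the sketch below hand-codes a minimal two-stage stream pipeline directly in MPI: rank 0 acts as the source and rank 1 as the sink. The ranks, tag, sentinel value, and per-item computation are assumptions chosen for this illustration; none of it is DSParLib's API.

```cpp
// Minimal hand-written MPI stream pipeline: rank 0 emits items, rank 1 consumes them.
// Ranks, tag, and the end-of-stream sentinel are assumptions made for this sketch.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int kDataTag = 0;
    const double kSentinel = -1.0;   // marks the end of the stream

    if (rank == 0) {                 // source stage: generate and send items
        for (int i = 0; i < 100; ++i) {
            double item = static_cast<double>(i);
            MPI_Send(&item, 1, MPI_DOUBLE, 1, kDataTag, MPI_COMM_WORLD);
        }
        MPI_Send(&kSentinel, 1, MPI_DOUBLE, 1, kDataTag, MPI_COMM_WORLD);
    } else if (rank == 1) {          // sink stage: receive until the sentinel arrives
        double sum = 0.0;
        while (true) {
            double item = 0.0;
            MPI_Recv(&item, 1, MPI_DOUBLE, 0, kDataTag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (item == kSentinel) break;
            sum += item;             // stand-in for the real per-item computation
        }
        std::printf("sink consumed the stream, sum = %f\n", sum);
    }

    MPI_Finalize();
    return 0;
}
```

Even this two-stage version forces the programmer to manage ranks, tags, and end-of-stream signaling by hand; adding more stages or replicated workers multiplies that bookkeeping, which is exactly the boilerplate a pattern-based, building-block abstraction is meant to hide.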
ISBN: 9798400717901 (print)
SYCL is a parallel programming language that enables heterogeneous computing on various devices. The SYCL CPU device [1] uses the CPU as a device to run SYCL kernels. While most SYCL concepts, such as the device memory model and the sub-group and work-group constructs, can be mapped onto GPU hardware, the CPU device lacks native support for them. These concepts therefore need to be emulated on the CPU device to ensure full hardware utilization and achieve the performance portability of SYCL programs. To facilitate task parallelism at the work-group level, the SYCL CPU device distributes the execution of SYCL work-groups to CPU threads, each of which has a restricted stack size. The SYCL device's memory model consists of three distinct memory regions: local memory is accessible by all the work-items in a single work-group, and private memory is accessible to a single work-item. The CPU device does not have dedicated hardware to support local and private memory, so they are emulated by allocating a block of memory for each of them on the stack. A stack overflow can occur when a kernel uses large private or local memory, as a thread's stack size cannot be changed after its creation. The probability of error is much higher on Windows, since the default stack size of the master thread is only 1 MB. To address this issue, the SYCL CPU device previously adopted a context-swapping approach that expands the stack size using low-level APIs provided by the operating system. The application master thread's stack size is 8 MB on Linux and 1 MB on Windows; the stack size of the other worker threads is set to 8 MB on a 64-bit system and 4 MB on a 32-bit system. When a work-group requires a stack larger than that of its executing thread, the SYCL CPU device runtime swaps the thread's context before execution. However, this method results in large-scale performance degradation on Windows because the swapping involves frequent and inefficient memory movement. Some SYCL workloads on Windows even hang with...
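As a concrete illustration of where this stack pressure comes from, below is a minimal SYCL 2020 sketch of a kernel that requests a sizeable per-work-group local-memory allocation; the device selection, sizes, and kernel body are assumptions made for this example, not code from the paper. On a CPU device that emulates local memory on the executing thread's stack, growing kLocalFloats toward the megabyte range is exactly the situation that would overflow a 1 MB default stack and trigger the fallback described above.

```cpp
// Minimal SYCL 2020 sketch: the per-work-group local memory requested here is what a
// CPU device has to emulate (e.g., on the executing thread's stack). Sizes are assumptions.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t kGroups      = 64;
    constexpr size_t kGroupSize   = 128;
    constexpr size_t kLocalFloats = 64 * 1024;   // 256 KB of local memory per work-group

    sycl::queue q{sycl::cpu_selector_v};         // target the SYCL CPU device

    std::vector<float> out(kGroups * kGroupSize, 0.0f);
    {
        sycl::buffer<float, 1> buf{out.data(), sycl::range<1>{out.size()}};

        q.submit([&](sycl::handler& h) {
            sycl::accessor acc{buf, h, sycl::write_only};
            // Local memory: shared by all work-items of one work-group.
            sycl::local_accessor<float, 1> scratch{sycl::range<1>{kLocalFloats}, h};

            h.parallel_for(
                sycl::nd_range<1>{sycl::range<1>{kGroups * kGroupSize},
                                  sycl::range<1>{kGroupSize}},
                [=](sycl::nd_item<1> it) {
                    const size_t lid = it.get_local_id(0);
                    scratch[lid] = static_cast<float>(lid);   // touch the local block
                    sycl::group_barrier(it.get_group());
                    acc[it.get_global_id(0)] = scratch[lid];
                });
        });
    } // buffer destructor waits for the kernel and copies results back to 'out'
    return 0;
}
```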
The performance bottleneck of math library functions has long been a common issue among major manufacturers. To overcome these bottlenecks, this paper proposes the Rlibm-OMP method, which combines the RLibm fast polyn...
We aim to classify acoustic events recorded by a fiber optic distributed acoustic sensor (DAS). We derived the information from probing the fiber with light pulses and analyzing the Rayleigh backscatter. We then processed these data with a pipeline of processing algorithms to form the input for our machine learning classification model. We put random matrix theory to the test to distinguish the acoustic event of interest from the noise. We conditioned the raw trace using moving-average and wavelet-based filtering algorithms to improve the signal-to-noise ratio. For the raw, low-pass, and wavelet-based filtered signals that we inject into a convolutional neural network, we rely on the magnitude of their complex coefficients to categorize the nature of the event. We also investigate Mel-Frequency Cepstral Coefficients (MFCCs) specific to the event as an input for the classifier and compare their performance to other signal representations. We run the experiments on the CNN for two-class and three-class classification using datasets from a DAS deployed for perimeter security and pipeline monitoring. We obtained the best results when using the MFCCs paired with wavelet denoising, achieving accuracies of 96.4% for the "event" class and 99.7% for the "no event" class in the two-class task. The three-class task yielded optimal accuracies of 83.3%, 81.3%, and 96.7% for the "digging," "walking," and "excavation" classes, respectively. Finally, training times are exceptionally long because the dataset is extensive and the model's architecture is complex. We therefore make efficient use of both the CPU and GPU with the Keras API's sequence data generator to maximize our machine's throughput. Compared with the serial implementation, we report an improvement of up to 4.87 times. (c) 2022 Society of Photo-Optical Instrumentation Engineers (SPIE)
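As a rough illustration of the trace-conditioning step mentioned above, the sketch below applies a simple centered moving-average filter to one raw trace; the window length and edge handling are assumptions for this example, and the wavelet-denoising, MFCC, and CNN stages of the actual pipeline are not shown.

```cpp
// Simple centered moving-average filter for one raw DAS trace (illustrative sketch).
// Window length and edge handling are assumptions, not the paper's exact parameters.
#include <vector>
#include <cstddef>
#include <algorithm>

std::vector<double> movingAverage(const std::vector<double>& trace, std::size_t window) {
    std::vector<double> smoothed(trace.size(), 0.0);
    const std::size_t half = window / 2;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const std::size_t lo = (i >= half) ? i - half : 0;               // clamp at the start
        const std::size_t hi = std::min(trace.size() - 1, i + half);     // clamp at the end
        double sum = 0.0;
        for (std::size_t j = lo; j <= hi; ++j) sum += trace[j];
        smoothed[i] = sum / static_cast<double>(hi - lo + 1);            // average over the window
    }
    return smoothed;
}
```

Smoothing of this kind raises the signal-to-noise ratio before the conditioned trace (or a representation derived from it, such as MFCCs) is handed to the classifier.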
Performance-portable programming frameworks provide abstractions for parallel execution to allow easily porting an application to multiple backend programming models, such as CUDA, HIP, and OpenMP. However, programs m...
This article examines popular classifiers, such as the Bagging Classifier, Nearest Neighbors Classifier, Boosting Classifier, and Support Vector Classifier, to determine which achieves the highest accuracy. The classifiers will be tested for a...
In this paper we evaluate multiple C++ shared memory programming models with respect to both ease of expression, and resulting performance. We do this by implementing the mathematical algorithm known as the 'power...
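The algorithm referenced here appears to be the power method (power iteration); under that assumption, below is a minimal sequential C++ sketch of it. The matrix representation, tolerance, and iteration cap are choices made for this illustration, not taken from the paper.

```cpp
// Sequential power method sketch: estimates the dominant eigenvalue magnitude and
// eigenvector of a dense matrix. Tolerance and iteration cap are assumptions.
#include <vector>
#include <cmath>
#include <cstddef>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major dense matrix

double powerMethod(const Mat& A, Vec& x, std::size_t maxIters = 1000, double tol = 1e-10) {
    // Normalize the (nonzero) start vector so |A x| approximates the dominant eigenvalue.
    double xnorm = 0.0;
    for (double v : x) xnorm += v * v;
    xnorm = std::sqrt(xnorm);
    for (double& v : x) v /= xnorm;

    double lambda = 0.0;
    for (std::size_t it = 0; it < maxIters; ++it) {
        // y = A * x
        Vec y(x.size(), 0.0);
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < x.size(); ++j)
                y[i] += A[i][j] * x[j];

        // Estimate the eigenvalue magnitude and renormalize.
        double norm = 0.0;
        for (double v : y) norm += v * v;
        norm = std::sqrt(norm);

        const double newLambda = norm;   // |A x| with |x| = 1
        for (std::size_t i = 0; i < y.size(); ++i) x[i] = y[i] / norm;

        if (std::fabs(newLambda - lambda) < tol) return newLambda;
        lambda = newLambda;
    }
    return lambda;
}
```

The matrix-vector product in the inner loops is the naturally parallel part that each shared-memory programming model would distribute across threads in its own way.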
Tuning parallel applications on multi-core architectures is an arduous task. Several studies have utilized auto-tuning for OpenMP applications via standardized user-facing features, namely number of threads, thread pl...