Parallel computers are everywhere. Over the last few years, a paradigm shift has occurred in the computer industry. Mainly due to power dissipation constraints and memory access time limitations, rather than increasin...
With the widespread use of multicore systems with smaller transistor sizes, soft errors have become an important issue for parallel program execution. Fault injection is a prevalent method to quantify the soft error rates of applications; however, it is very time consuming to perform detailed fault injection experiments. Therefore, prediction-based techniques have been proposed to evaluate soft error vulnerability in a faster way. In this work, we present a soft error vulnerability prediction approach for parallel applications using machine learning algorithms. We define a set of features covering thread communication, data sharing, parallel programming, and performance characteristics, and train our models based on three ML algorithms. This study uses the parallel programming features, as well as the combination of all features, for the first time in vulnerability prediction of parallel programs. We propose two models for soft error vulnerability prediction: (1) a regression model with rigorous feature selection analysis that estimates correct execution rates, and (2) a novel classification model that predicts the vulnerability level of the target programs. We achieve a maximum prediction accuracy of 73.2% for the regression-based model and an 89% F-score for our classification model.
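As a rough illustration of the feature-based approach only (not the paper's actual features, ML algorithms, or data), a regression model for correct execution rates and a classifier for vulnerability levels can be sketched as follows; the synthetic feature matrix and the random-forest estimators are stand-ins.

    # Sketch only: synthetic features and labels; the paper's feature set,
    # ML algorithms, and data are different.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 4))                       # 200 synthetic programs, 4 made-up features
    correct_rate = X @ [0.4, 0.3, 0.2, 0.1]        # synthetic correct-execution rate in [0, 1]
    vuln_level = (correct_rate < 0.5).astype(int)  # synthetic binary vulnerability label

    X_tr, X_te, c_tr, c_te, y_tr, y_te = train_test_split(
        X, correct_rate, vuln_level, random_state=0)

    reg = RandomForestRegressor(random_state=0).fit(X_tr, c_tr)   # (1) regression model
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # (2) classification model
    print("predicted rates:", reg.predict(X_te)[:3])
    print("F-score:", f1_score(y_te, clf.predict(X_te)))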
The Consultative Committee for Space Data Systems (CCSDS)-123 is a standard for lossless compression of multispectral and hyperspectral images with applications in on-board power-constrained systems, such as satellites and military drones. This letter explores the low-power heterogeneous architecture of the Nvidia Jetson TX2 by proposing a parallel implementation of the CCSDS-123 compressor on embedded systems, reducing development effort compared with the production of dedicated circuits while maintaining low energy consumption. This solution parallelizes the predictor on a low-power graphics processing unit (GPU), while the encoders exploit the heterogeneous multiple cores of the CPUs and the GPU concurrently. We report more than 16.6 Gb/s for the predictor and 1.4 Gb/s for the whole system, requiring less than 6.3 W and providing an efficiency of 245.6 Mb/s/W.
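The predictor/encoder split can be pictured as a two-stage pipeline over blocks of the image cube. The sketch below is only a conceptual illustration of that structure: predict_block and encode_block are placeholders rather than the actual CCSDS-123 prediction and entropy-coding steps, and in the letter the prediction stage runs as a GPU kernel rather than a host thread.

    # Conceptual pipeline only: placeholder stages, not the CCSDS-123 algorithms.
    import queue, threading
    import numpy as np

    def predict_block(block):        # stand-in for the (GPU) prediction kernel
        return block - np.roll(block, 1, axis=-1)

    def encode_block(residuals):     # stand-in for a CPU/GPU entropy encoder
        return residuals.astype(np.int16).tobytes()

    blocks = [np.random.randint(0, 4096, (16, 32, 32)) for _ in range(8)]
    q, out = queue.Queue(), []

    def predictor_stage():           # producer: pushes residual blocks
        for b in blocks:
            q.put(predict_block(b))
        q.put(None)                  # sentinel: no more blocks

    def encoder_stage():             # consumer: encodes blocks as they arrive
        while (r := q.get()) is not None:
            out.append(encode_block(r))

    t1 = threading.Thread(target=predictor_stage)
    t2 = threading.Thread(target=encoder_stage)
    t1.start(); t2.start(); t1.join(); t2.join()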
The RSA algorithm is an asymmetric encryption algorithm used to ensure the confidentiality and integrity of data as it travels across networks. Security has grown in importance over time, resulting in more data requ...
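For context only, a textbook RSA round with toy parameters is sketched below; the tiny primes and message are purely illustrative and unrelated to the paper's implementation, and real deployments use 2048-bit or larger moduli, padding schemes such as OAEP, and vetted crypto libraries.

    # Textbook RSA with deliberately tiny numbers (illustration only).
    p, q = 61, 53
    n = p * q                  # 3233, public modulus
    phi = (p - 1) * (q - 1)    # 3120
    e = 17                     # public exponent, coprime with phi
    d = pow(e, -1, phi)        # 2753, private exponent (modular inverse, Python 3.8+)

    m = 65                     # plaintext, must be < n
    c = pow(m, e, n)           # encryption:  c = m^e mod n
    assert pow(c, d, n) == m   # decryption:  m = c^d mod n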
This paper describes the GR1 algorithm, which provides feasible execution times for the subgraph isomorphism problem. It is a parallel algorithm that uses a variant of the producer–consumer pattern. It was desig...
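As an illustration of the general pattern only (not GR1's actual variant, candidate generation, or pruning), a producer-consumer skeleton for checking candidate vertex mappings might look as follows; the brute-force enumeration and the edge check are placeholders.

    # Generic producer-consumer skeleton; the enumeration and check below are
    # placeholders, not GR1's algorithm.
    import itertools, queue, threading

    edges_g = {(0, 1), (1, 2)}                          # small pattern graph (a path)
    edges_h = {("a", "b"), ("b", "c"), ("c", "d")}      # host graph (a longer path)
    nodes_h = sorted({x for e in edges_h for x in e})
    q, results, n_workers = queue.Queue(), [], 4

    def producer():                                     # enumerate candidate mappings
        for perm in itertools.permutations(nodes_h, 3):
            q.put(dict(zip([0, 1, 2], perm)))
        for _ in range(n_workers):
            q.put(None)                                 # one sentinel per consumer

    def consumer():                                     # verify that every edge is preserved
        while (m := q.get()) is not None:
            if all((m[u], m[v]) in edges_h or (m[v], m[u]) in edges_h for u, v in edges_g):
                results.append(m)

    threads = [threading.Thread(target=producer)] + \
              [threading.Thread(target=consumer) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(results), "embeddings found")             # 4 for this toy pair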
Since the first release in 2015, OpenTimer v1 has been used in many industrial and academic projects for analyzing the timing of custom designs. After four years of research and development, we have announced OpenTimer v2, a major release that efficiently supports: 1) a new task-based parallel incremental timing analysis engine to break through the performance bottleneck of existing loop-based methods; 2) a new application programming interface (API) concept to exploit high degrees of parallelism; and 3) enhanced support for industry-standard design formats to improve the user experience. Compared with OpenTimer v1, we rearchitected v2 in modern C++ with advanced parallel computing techniques to largely improve the tool's performance and usability. As one example, OpenTimer v2 achieved up to a 5.33x speedup over v1 in incremental timing and scaled higher with increasing cores. Our contributions include both technical innovations and engineering knowledge that are open and accessible to promote timing research in the community.
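To make the contrast with loop-based propagation concrete, a heavily simplified sketch of level-by-level parallel arrival-time updates on a timing DAG is shown below; OpenTimer v2 itself is a C++ engine built around a task-dependency graph, and the graph, delays, and thread pool here are made up for illustration.

    # Illustration only: levelized parallel arrival-time propagation on a toy DAG.
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    arcs = [("in", "u1", 1.0), ("in", "u2", 2.0), ("u1", "out", 3.0), ("u2", "out", 1.5)]
    fanin = defaultdict(list)
    for src, dst, delay in arcs:
        fanin[dst].append((src, delay))

    levels = [["in"], ["u1", "u2"], ["out"]]    # topological levels of the DAG
    arrival = {"in": 0.0}

    def update(node):                           # arrival time = max over fanin arcs
        arrival[node] = max(arrival[s] + d for s, d in fanin[node])

    with ThreadPoolExecutor() as pool:
        for level in levels[1:]:                # nodes within a level are independent
            list(pool.map(update, level))

    print(arrival["out"])                       # 4.0 via in -> u1 -> out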
Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose, highly parallel workloads. Harnessing the performance that these accelerators provide requires the use of specialized native programming interfaces, such as CUDA or OpenCL, or higher-level programming models like OpenMP or OpenACC. However, in managed programming languages, offloading execution onto GPUs is much harder and more error-prone, mainly due to the need to call through a native API (Application Programming Interface) and because of mismatches between value and reference semantics. The Fancier framework provides a unified interface to Java, C/C++, and OpenCL C compute kernels, together with facilities to smooth the transitions between these programming languages. This combination of features makes GPU acceleration in Java much more approachable. In addition, Fancier Java code can be translated directly into equivalent C/C++ or OpenCL C code, which simplifies the implementation of higher-level abstractions targeting GPU or parallel execution in Java. Furthermore, it reduces the programming effort without adding significant overhead on top of the necessary OpenCL and Java Native Interface (JNI) API calls. We validate our approach on several image processing workloads running on different Android devices.
In high-performance computing, picking the right number of threads to gain a good speedup is important, as many OS-level parameters are influenced by even slight adjustments in thread count. These parameters are requi...
Sparse matrix computations are at the heart of many scientific applications and data analytics codes. The performance and memory usage of these codes depend heavily on their use of specialized sparse matrix data structures that only store the nonzero entries. However, such compaction is done using index arrays that result in indirect array accesses such as A[B[i]], where A and B are both arrays. Numerical libraries can provide high-performance code for an individual sparse kernel; however, they must be manually tuned and optimized for different inputs and architectures. Alternatively, compilers are used to optimize codes while providing architecture portability. Due to these indirect array accesses, memory access information is unknown at compile time, and thus it is challenging to vectorize a sparse matrix method or run it on parallel cores. To automate the generation of code for efficient execution of sparse code, several compile-time and runtime techniques are required. Existing techniques are either not efficient or need manual effort to extend to different sparse matrix computations. Consequently, in this dissertation, I address the problem of automating the optimization of sparse matrix code on parallel processors, with a specific focus on sparse linear solvers and numerical optimization. This dissertation presents a set of code transformations and algorithms, all implemented in a novel code generator called Sympiler, that automates the optimization of sparse matrix codes on parallel processors. Sympiler takes a sparse method, arising from a sparse linear system or sparse numerical optimization, and decouples information related to the computation pattern of the method, i.e., symbolic information, and uses this information to transform the code into vectorizable and parallel code. Sympiler also enables the reuse of symbolic information when the computation pattern remains static for a period of time in the simulations or when it changes only modestly. Evaluation result
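The indirect-access problem described above is easy to see in a sparse lower-triangular solve L x = b in compressed sparse column (CSC) form, where the update x[rowidx[p]] is exactly the A[B[i]] pattern that defeats compile-time analysis. The tiny matrix below is made up, and Sympiler's actual symbolic transformations (e.g., extracting level sets from the nonzero structure for parallel execution) are not shown.

    # Column-oriented sparse triangular solve in CSC form (illustrative matrix).
    import numpy as np

    # L = [[2, 0, 0],
    #      [1, 3, 0],
    #      [4, 0, 5]]  stored with the diagonal first in each column
    colptr = np.array([0, 3, 4, 5])
    rowidx = np.array([0, 1, 2, 1, 2])
    val    = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    b      = np.array([2.0, 4.0, 14.0])

    x = b.copy()
    n = len(colptr) - 1
    for j in range(n):                       # forward substitution, column by column
        x[j] /= val[colptr[j]]               # divide by the diagonal entry
        for p in range(colptr[j] + 1, colptr[j + 1]):
            x[rowidx[p]] -= val[p] * x[j]    # indirect update: the A[B[i]] pattern

    print(x)                                 # [1. 1. 2.]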
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is a lack of support for a programming interface. Implementing and debugging an application on multiple FPGA boards is diff...