Abstract: This paper considers the automatic distribution of computations in the translation of NORMA language programs. The load balancing between the nodes of a multi-node computer system is out...
Parallel nonlinear models using radial kernels on local mesh support have been designed and implemented for application to real-world problems. Although this recently developed approach reduces the memory requirements compared with other methodologies suggested over the last few years, its computational cost makes parallelisation necessary, especially for big datasets with many instances or attributes. In this work, several strategies for the parallelisation of this methodology are proposed and compared. The MPI communication protocol and the OpenMP application programming interface are used to implement the algorithm. The performance of this methodology is compared with various machine learning methods, with particular consideration of techniques using radial basis functions (RBF). The different methods are applied to model the daily maximum air temperature from real meteorological data collected from the Agroclimatic Station Network of the Phytosanitary Alert and Information Network of Andalusia, an autonomous community of southern Spain. The obtained goodness-of-fit measures illustrate the effectiveness of this nonlinear methodology, and its training process is shown to be simpler than those of other powerful machine learning methods.
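The kind of radial-kernel evaluation at the heart of such models can be sketched as follows. The Gaussian kernel, one-dimensional points, and the OpenMP pragma here are illustrative assumptions, not the authors' exact formulation; the point is that the loop over evaluation points is independent per iteration, which is what makes OpenMP (intra-node) and MPI (inter-node) decompositions applicable.

```cpp
#include <cmath>
#include <vector>

// Gaussian radial basis function: phi(r) = exp(-(eps * r)^2).
double rbf(double r, double eps) { return std::exp(-(eps * r) * (eps * r)); }

// Evaluate an RBF model s(x) = sum_j w_j * phi(|x - c_j|) at many points.
// Each output value is independent, so the outer loop parallelises trivially.
std::vector<double> evaluate(const std::vector<double>& points,
                             const std::vector<double>& centres,
                             const std::vector<double>& weights,
                             double eps) {
    std::vector<double> out(points.size(), 0.0);
#pragma omp parallel for
    for (long i = 0; i < static_cast<long>(points.size()); ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < centres.size(); ++j)
            s += weights[j] * rbf(std::fabs(points[i] - centres[j]), eps);
        out[i] = s;
    }
    return out;
}
```

An MPI variant would distribute blocks of `points` across ranks instead of (or in addition to) splitting the loop across threads.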
ISBN:
(Digital) 9798350381993
ISBN:
(Print) 9798350382006
The main objective of this work is to bring supercomputing and parallel processing closer to non-specialized audiences by building a Raspberry Pi cluster, called Clupiter, which emulates the operation of a supercomputer. It consists of eight Raspberry Pi devices interconnected so that they can run jobs in parallel. To make it easier to show how the cluster works, a web application has been developed that allows launching parallel applications and accessing a monitoring system to observe resource usage while these applications are running. The NAS Parallel Benchmarks (NPB) are used as demonstration applications. A couple of educational videos, covering the concepts of supercomputing and parallel programming in a very accessible way, can also be accessed from this web application.
ISBN:
(Digital) 9781510647206
ISBN:
(Print) 9781510647206; 9781510647190
A fracture is a break in the continuity of bone tissue in any bone of the body. It occurs as a result of excessive stress that exceeds bone resistance, i.e. it is the consequence of a single or repeated overload, and it occurs in milliseconds. The development of magnetic resonance imaging and computerized tomography has made it possible to identify and evaluate the different pathologies of the human body more accurately. Edge detection is a fundamental tool in medical image processing, particularly in feature detection, which aims at identifying points in a digital image at which the image has discontinuities. To improve computing speed, parallel computing on NVIDIA GPUs was used. This work presents an improved methodology for processing bone fracture images before and after surgery using segmentation and graphics accelerator cards to help the medical specialist in the analysis and evaluation of the images.
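The edge-detection step described above is the classic candidate for GPU offload, since each output pixel is computed independently. A minimal sketch, assuming a Sobel operator (a common choice for gradient-based edge detection, though the abstract does not name the exact operator) on a row-major grayscale image; on a GPU the natural mapping is one CUDA thread per pixel:

```cpp
#include <cmath>
#include <vector>

// Sobel gradient magnitude on a grayscale image stored row-major.
// Each output pixel depends only on its 3x3 neighbourhood, so every
// pixel can be computed by an independent GPU thread or loop iteration.
std::vector<double> sobel(const std::vector<double>& img, int w, int h) {
    std::vector<double> out(img.size(), 0.0);
    auto at = [&](int x, int y) { return img[y * w + x]; };
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            // Horizontal and vertical Sobel kernels.
            double gx = -at(x-1,y-1) + at(x+1,y-1)
                        - 2*at(x-1,y) + 2*at(x+1,y)
                        - at(x-1,y+1) + at(x+1,y+1);
            double gy = -at(x-1,y-1) - 2*at(x,y-1) - at(x+1,y-1)
                        + at(x-1,y+1) + 2*at(x,y+1) + at(x+1,y+1);
            out[y * w + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}
```

Pixels on a vertical intensity step produce a strong horizontal gradient, which is exactly the discontinuity a fracture line introduces in a radiograph.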
The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, machine learning, and other areas necessitates efficient compiler and runtime support for a growing number of platforms. Existing SYCL implementations support various devices such as CPUs, GPUs, DSPs, and FPGAs, typically via OpenCL or CUDA backends. While accelerators have significantly increased the performance of user applications, employing CPU devices for further performance improvement is beneficial given the significant presence of CPUs in existing data centers. SYCL applications on CPUs currently go through an OpenCL backend. Though an OpenCL backend is valuable for supporting accelerators, it may introduce additional overhead on CPUs since the host and device are the same. Overheads such as run-time compilation of the kernel, transfer of input/output memory to/from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, doing so can compromise the SYCL application's ability to achieve performance portability on other devices. In this article, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, along with the integration of whole-function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, with an OpenCL backend for existing SYCL-based applications, with no code modification, on the BabelStream benchmark, Matmul from the ComputeCpp SDK, N-body simulation benchmarks, and SYCL-BLAS (Aliaga et al., Proceedings of the 5th International Workshop on OpenCL, 2017), on CPUs from different vendors and architectures. We report a performance improvement of
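The core idea of the CPU-directed flow can be illustrated without a SYCL toolchain: when the kernel lambda is compiled as an ordinary function in the same translation unit, there is no runtime kernel compilation and no host/device copy. The sketch below is a deliberately simplified stand-in (real SYCL uses `sycl::queue::submit`, accessors, and an nd-range; `parallel_for_host` is a hypothetical helper invented here), using the BabelStream-style triad mentioned in the abstract:

```cpp
#include <functional>
#include <vector>

// Sketch: a SYCL-like parallel_for lowered directly to a host loop in the
// same translation unit. No OpenCL runtime, no kernel JIT, no data transfer;
// a vectorising compiler can apply whole-function vectorisation to this loop.
void parallel_for_host(std::size_t n,
                       const std::function<void(std::size_t)>& kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// BabelStream-style triad: a[i] = b[i] + scalar * c[i].
std::vector<double> triad(const std::vector<double>& b,
                          const std::vector<double>& c, double scalar) {
    std::vector<double> a(b.size());
    parallel_for_host(b.size(),
                      [&](std::size_t i) { a[i] = b[i] + scalar * c[i]; });
    return a;
}
```

Because the "device" lambda and the host code are one compilation unit, the compiler sees both and can inline and vectorise across the boundary, which is the opportunity an OpenCL indirection hides.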
A major parallel programming challenge in scientific computing is to hide parallel computing details of data distribution and communication. Component-based approaches are often used in practice to encapsulate these c...
Attack graphs (AGs) are graphical tools for the security analysis of computer networks. They are especially useful in detecting the threats of multi-stage attacks on target networks. An AG is composed of nodes and directed edges. Nodes represent different network states, and directed edges represent the causal connections between these states. By reading AGs, we can acquire useful information such as whether multiple attack paths exist between any two nodes, the shortest or most likely paths, and the most valuable target in a network. Such information helps system administrators assess the relative importance of various elements in a network, allowing them to effectively allocate time and budget to patch vulnerabilities and proactively defend against possible attacks. This research addresses two primary concerns with AGs: efficient AG generation and effective AG analysis. First, AG generation faces the challenge of state-space explosion. As modern networks continue to grow and more vulnerabilities are discovered, the data to be processed during AG generation for target networks increases exponentially, which requires efficient AG generators. We design AG generators for the RAGE AG model based on parallel programming and high-performance computing (HPC). We optimize the performance of the parallel AG generators with respect to data structures, memory access patterns, and workload balance. We conduct a comprehensive performance evaluation on different HPC hardware. The testing dataset includes synthetic AGs and AGs converted from directed acyclic graphs. The results verify that the parallel strategy realized on HPC hardware can effectively handle the scalability issue of AG generation. Next, for effective AG analysis, we explore AG structures and apply probability theory to extract the underlying information from AGs. In structure analysis, we implement three centrality concepts from network science to study the importance of nodes and edges in AGs. Based on the centrality
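The "shortest attack path" query mentioned above reduces to breadth-first search over the AG's directed edges. A minimal sketch on a toy adjacency-list graph (the node layout here is hypothetical and unrelated to the RAGE model itself):

```cpp
#include <queue>
#include <vector>

// Shortest attack path, measured in number of exploited transitions,
// between two AG states. BFS over directed edges; returns -1 when the
// destination state is unreachable from the source.
int shortest_attack_path(const std::vector<std::vector<int>>& adj,
                         int src, int dst) {
    std::vector<int> dist(adj.size(), -1);  // -1 marks "not yet reached"
    std::queue<int> q;
    dist[src] = 0;
    q.push(src);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int v : adj[u])
            if (dist[v] == -1) { dist[v] = dist[u] + 1; q.push(v); }
    }
    return dist[dst];
}
```

On real AGs the same traversal, run from every node, also yields closeness-style centrality scores, which is one way the structural importance of nodes can be quantified.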
In this letter, we address the issue of automatic labeling of remote sensing datasets using a novel deep learning clustering algorithm. The proposed algorithm addresses the inherent susceptibility of the deep embedded clustering (DEC) algorithm to data imbalance using additional search and extraction steps. Furthermore, the proposed algorithm is highly parallelizable. A graphics processing unit (GPU) implementation is shown to achieve performance speedups of 40X to 2600X and improved clustering accuracy with respect to DEC and other clustering approaches.
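The parallelizable structure claimed here is typical of clustering workloads: the assignment of each sample to its nearest cluster representative is independent per sample. A simplified nearest-centroid sketch in one dimension (this is a generic assignment step, not the DEC objective or the authors' algorithm), where a GPU mapping of one thread per point is what yields the reported class of speedups:

```cpp
#include <cmath>
#include <vector>

// Assign each 1-D point to its nearest centroid by absolute distance.
// Every iteration of the outer loop is independent, so the step
// parallelises trivially across GPU threads or CPU cores.
std::vector<int> assign_clusters(const std::vector<double>& pts,
                                 const std::vector<double>& centroids) {
    std::vector<int> label(pts.size(), 0);
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double best = std::fabs(pts[i] - centroids[0]);
        for (std::size_t k = 1; k < centroids.size(); ++k) {
            double d = std::fabs(pts[i] - centroids[k]);
            if (d < best) { best = d; label[i] = static_cast<int>(k); }
        }
    }
    return label;
}
```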
ISBN:
(Print) 9781450394451
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve this, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
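The OpenMP task dependencies that OMPC builds on look like the following. This is a minimal standard-OpenMP sketch, not OMPC-specific code; in OMPC, the same `depend` clauses would drive inter-node data movement. Compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result:

```cpp
// Two dependent tasks: the consumer may not start before the producer
// has finished writing x. The runtime (or OMPC's distributed runtime)
// derives the ordering, and any needed data motion, from depend clauses.
int pipeline() {
    int x = 0, y = 0;
#pragma omp parallel
#pragma omp single
    {
#pragma omp task depend(out : x)
        x = 21;                       // producer task
#pragma omp task depend(in : x) depend(out : y)
        y = 2 * x;                    // consumer task, ordered after the producer
#pragma omp taskwait                  // wait for both tasks to complete
    }
    return y;
}
```

This is the sense in which OMPC keeps one programming model for intra- and inter-node parallelism: the application expresses only task dependencies, and the placement of tasks on cluster nodes stays the runtime's concern.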
Stream processing applications have seen increasing demand with the growing availability of sensors, IoT devices, and user data. Modern systems can generate millions of data items per day that must be processed in a timely manner. To deal with this demand, application programmers must consider parallelism to exploit the maximum performance of the underlying hardware resources. In this work, we introduce improvements to stream processing applications by exploiting fine-grained data parallelism (via Map and MapReduce) inside coarse-grained stream parallelism stages. The improvements include techniques for identifying data parallelism in sequential codes, a new language, semantic analysis, and a set of definition and transformation rules to perform source-to-source parallel code generation. Moreover, we investigate the feasibility of employing higher-level programming abstractions to support the proposed optimizations. To that end, we select the SPar programming model as a use case and extend it by adding two new attributes to its language and implementing our optimizations as a new algorithm in the SPar compiler. We conduct a set of experiments on representative stream processing and data-parallel applications. The results show that our new compiler algorithm is efficient and that performance improved by up to 108.4x in data-parallel applications. Furthermore, experiments evaluating stream processing applications towards the composition of stream and data parallelism revealed new insights. The results show that such composition may improve latencies by up to an order of magnitude. It also enables programmers to exploit different degrees of stream and data parallelism to strike a balance between throughput and latency according to their needs.
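The Map pattern inside a stream stage can be sketched as follows. The per-item squaring is a hypothetical stand-in for a real computation, and plain `std::transform` stands in for the parallel code SPar's compiler would generate from an annotated stage; the property being exploited is that every item of the incoming batch is transformed independently:

```cpp
#include <algorithm>
#include <vector>

// Fine-grained data parallelism inside one stream stage: each item of the
// batch is processed independently, so the transform can be split across
// threads without changing the stage's result (the Map pattern).
std::vector<int> map_stage(const std::vector<int>& batch) {
    std::vector<int> out(batch.size());
    std::transform(batch.begin(), batch.end(), out.begin(),
                   [](int v) { return v * v; });
    return out;
}
```

Replicating such a stage across threads raises throughput, while splitting each batch internally (as above) attacks per-item latency; composing the two is the throughput/latency trade-off the abstract describes.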