ISBN: (print) 9781450348430
Parallel programming platforms are heterogeneous and incompatible; a common approach is needed to free programmers from the platforms' technical intricacies and to allow flexible execution in which sequential and parallel executions produce identical results. The execution and programming model of an embedded flexible language (EFL), which implements this common approach, are presented. EFL allows deterministic parallel code blocks to be embedded into a sequential program written in any host language. The EFL programming model constructs are presented, along with an EFL implementation of the Reduce parallel design pattern. With EFL we aim to implement safe and efficient parallel execution in software, hardware, or both. Consequences of Rice's theorem regarding parallel computation are discussed; these consequences severely restrict what can be checked at compile time. An approach is proposed for circumventing these restrictions.
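The idea that a sequential and a parallel reduce should give identical results can be illustrated with a minimal Python sketch. This is not EFL code: the function names `sequential_reduce` and `parallel_reduce` and the `workers` parameter are hypothetical, and the sketch assumes an associative operator, since only then does regrouping the input into chunks leave the result unchanged.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


def sequential_reduce(op, items):
    """Plain left-to-right reduction, used as the reference result."""
    return reduce(op, items)


def parallel_reduce(op, items, workers=4):
    """Chunked reduction: reduce each chunk in a worker thread, then
    combine the partial results in their original order.

    Assumption: `op` is associative. Commutativity is not required,
    because chunk order is preserved when combining.
    """
    items = list(items)
    if not items:
        raise ValueError("parallel_reduce needs at least one item")
    size = -(-len(items) // workers)  # ceiling division
    chunks = [items[i:i + size] for i in range(0, len(items), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, which keeps the
        # combination step deterministic regardless of thread timing.
        partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))
    return reduce(op, partials)


data = list(range(1, 1001))
total_seq = sequential_reduce(operator.add, data)
total_par = parallel_reduce(operator.add, data)
```

Integer addition is used here because it is exactly associative; floating-point addition is not, so a chunked reduction over floats may differ from the sequential one in the last bits.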
ISBN: (print) 9781538634721
Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high-performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed-memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we discuss the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8x faster than ReFacTo when using cuSPARSE.
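For readers unfamiliar with the SpMV kernel mentioned above, the following is a minimal pure-Python sketch of a sparse matrix-vector product in compressed sparse row (CSR) form. It is an illustration only, not the ReFacTo, cuSPARSE, or MKL implementation; the function name `csr_spmv` and the example matrix are assumptions made for this sketch.

```python
def csr_spmv(values, col_indices, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values      -- nonzero entries, row by row
    col_indices -- column index of each entry in `values`
    row_ptr     -- row_ptr[r]:row_ptr[r + 1] is the slice of row r
    """
    y = []
    for row in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            total += values[k] * x[col_indices[k]]
        y.append(total)
    return y


# Hypothetical 3x3 example:
# [[2, 0, 1],
#  [0, 0, 0],
#  [0, 3, 4]]
values = [2.0, 1.0, 3.0, 4.0]
col_indices = [0, 2, 1, 2]
row_ptr = [0, 2, 2, 4]
y = csr_spmv(values, col_indices, row_ptr, [1.0, 2.0, 3.0])
```

Each output row depends only on its own slice of `values`, which is why the kernel lends itself to the thread-level and GPU-level parallelism the abstract describes.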
ISBN: (print) 9781538609415
Parallel and distributed computing have enabled the development of much more scalable software. However, developing concurrent software requires the programmer to be aware of non-determinism, data races, and deadlocks. MPI (Message Passing Interface) is a popular standard for writing message-oriented distributed applications. Some messages in MPI systems can be processed by any one of many machines and in many possible orders. This non-determinism can affect the result of an MPI application, and the alternate results may or may not be correct. To verify MPI applications, we need to check all these possible orderings and use an application-specific oracle to decide whether these orderings give correct output. MPJ Express is an open-source Java implementation of the MPI standard. Model checking of MPI Java programs is a challenging task due to their parallel nature. We developed a Java-based model of MPJ Express, in which processes are modeled as threads and which can run unmodified MPI Java programs on a single system. This model enabled us to adapt the Java PathFinder explicit-state software model checker (JPF), using a custom listener, to verify our model running real MPI Java programs. The evaluation of our approach shows that model checking reveals incorrect system behavior that results in very intricate message orderings.
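The verification idea described above, checking every possible message ordering against an application-specific oracle, can be sketched in a few lines of Python. This is a toy model, not MPJ Express or JPF: the receiver, the message format, and the names `run_receiver` and `failing_orderings` are all hypothetical, and the sketch assumes one message per sender so that every permutation is a legal delivery order.

```python
from itertools import permutations


def run_receiver(ordering, start=1):
    """Toy receiver: applies each incoming message to a running value.

    Each message is (sender_rank, (kind, operand)); the final value
    depends on the order in which messages happen to arrive.
    """
    value = start
    for _sender, (kind, operand) in ordering:
        if kind == "add":
            value += operand
        elif kind == "mul":
            value *= operand
    return value


def failing_orderings(messages, oracle):
    """Enumerate every delivery order and keep those the oracle rejects."""
    return [
        order
        for order in permutations(messages)
        if not oracle(run_receiver(order))
    ]


# Hypothetical scenario: rank 1 sends "add 5", rank 2 sends "mul 2".
messages = [(1, ("add", 5)), (2, ("mul", 2))]
# Application-specific oracle (an assumption): only 12 is correct.
bad = failing_orderings(messages, oracle=lambda result: result == 12)
```

Exhaustive enumeration grows factorially with the number of messages, which is one reason a model checker with state-space reduction is used in practice rather than a brute-force loop like this one.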
ISBN: (print) 9783319629322; 9783319629315
A fragmented approach to parallel programming and its implementation in the Aspect programming language are considered. An approach to defining the order of execution of computation fragments in the Aspect language is described and illustrated with the example of matrix LU decomposition.
ISBN: (print) 9781509055388
Parallel programming has been an active area of research in computer science and software engineering for many years. Parallel programming should ideally provide a linear speedup for computational problems. In reality, this is rarely the case. While some algorithms cannot be parallelized, many that can still fail to provide the ideal linear speedup. For algorithms that can benefit from parallelization, it is often much more difficult to develop the parallel code than it is to write a sequential, single-threaded program. This gap between ideal parallel computing and parallel computing on real hardware and software has led many developers to create new solutions in an attempt to move real parallel computing closer to its idealized model. While many of these solutions provide a great performance benefit on large-scale systems, they often lag behind when deployed on small-scale systems. In this paper, we introduce the design and implementation of DCM (Distributed Computing Middleware), a Python-based middleware for writing parallel processing applications for execution on clusters of small-scale devices. Evaluation results show the feasibility of DCM. Our middleware and its test cases are publicly available on GitHub.
ISBN: (print) 9783319558493; 9783319558486
Optimizing parallel programs is a complex task because of the interference among many different parameters. Work-stealing runtimes, used to dynamically balance load among different processor cores, are no exception. This work explores the automatic configuration of the following runtime parameters: dynamic granularity control algorithms, granularity control cache, work-stealing algorithm, lazy binary splitting parameter, the maximum queue size, and the unparking interval. The performance of the program is highly sensitive to the granularity control algorithm, which can be a combination of other granularity algorithms. In this work, we address two search-based problems: finding a globally efficient work-stealing configuration, and finding the best configuration for an individual program. For both problems, we propose the use of a Genetic Algorithm (GA). The genotype of the GA is able to represent combinations of up to three cut-off algorithms, as well as other work-stealing parameters. The proposed GA has been evaluated on its ability to obtain a more efficient solution across a set of programs, its ability to generalize the solution to a larger set of programs, and its ability to evolve single programs individually. The GA was able to improve the performance of the programs in the training set, but the obtained configurations did not generalize to a larger benchmark set. However, it was able to successfully improve the performance of each program individually.
ISBN: (print) 9781509034048
The art of computer programming has evolved symbiotically in the world of parallel processors. Thus, a computer program can no longer be considered merely a sequence of step-by-step instructions. Almost all serial processors have now been replaced by their parallel counterparts. The present paper discusses the application of CUDA, the most widely used platform for developing GPU software, to a typical GIS-based system. To exemplify this, a terrestrial image was processed by a CUDA-based system, and the methodology is presented with close attention to most of the nuances involved. The paper also compares GPU and CPU architectures with respect to GIS.
ISBN: (print) 9781450351812
We present a work-in-progress tool (VisPar) for visualising computations in the Par monad in Haskell. Our contribution is not a revolutionary new idea but rather a modest addition to the set of tools available for making sense of parallel programs. We hope to show that VisPar can be useful as a teaching tool by providing visualisations of a few examples from a course on parallel functional programming.
ISBN: (print) 9781538620854
Electronic voting systems are adopted in several countries to provide accuracy and efficiency for the electoral processes. However, e-voting systems employ complex cryptographic and verification techniques to satisfy security and privacy requirements. The verification and tallying processes take unacceptably long to execute in large elections. This paper addresses the problem by reducing this time through parallel processing. We have implemented the voting, verification, and tallying processes described in the Secure National Electronic Voting System (S-Vote). We have developed and evaluated three alternative parallel schemes for these processes: Task, Master/Slave, and Data. The Data scheme shows the best speedup and efficiency and scales well as the numbers of voters and processing cores are increased. In this scheme, a number of threads dynamically request and process ballot packages. This scheme processes 64,000 ballots using 32 cores in 0.71 hours, with a speedup of 27.5 and 86% efficiency. Therefore, a large national election of 2 million ballots can be processed in an acceptable time of 5.5 hours using 128 cores. Larger elections can be verified and tallied in acceptable time using more processing cores.
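The figures quoted above can be cross-checked with a short calculation. The sketch below uses the standard definition of parallel efficiency (speedup divided by core count); the projection to 2 million ballots additionally assumes, as a simplification not stated in the abstract, that run time grows linearly with the number of ballots and shrinks linearly with the number of cores. The function names are hypothetical.

```python
def efficiency(speedup, cores):
    """Parallel efficiency: achieved speedup as a fraction of the core count."""
    return speedup / cores


def projected_hours(base_hours, base_ballots, base_cores, ballots, cores):
    """Scale a measured run time to another workload and core count.

    Assumption: time is proportional to ballots and inversely
    proportional to cores (perfect scaling from the measured point).
    """
    return base_hours * (ballots / base_ballots) * (base_cores / cores)


# Figures reported in the abstract: 27.5 speedup on 32 cores,
# 64,000 ballots processed in 0.71 hours.
eff = efficiency(27.5, 32)
hours = projected_hours(0.71, 64_000, 32, 2_000_000, 128)
```

Under these assumptions `eff` is about 0.86 and `hours` is about 5.5, in line with the 86% efficiency and the 5.5-hour estimate reported for 128 cores.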
ISBN: (print) 9781538626849
The FORKJOIN framework is a widely used parallel programming framework upon which both core concurrency libraries and real-world applications are built. Beneath its simple and user-friendly APIs, FORKJOIN is a sophisticated managed parallel runtime unfamiliar to many application programmers: the framework core is a work-stealing scheduler that handles fine-grained tasks and sustains the pressure from automatic memory management. FORKJOIN poses a unique gap in the compute stack between high-level software engineering and low-level system optimization. Understanding and bridging this gap is crucial for the future of parallelism support in JVM-supported applications. This paper describes a comprehensive study of parallelism bottlenecks in FORKJOIN applications, with a unique focus on how they interact with underlying system-level features such as work stealing and memory management. We identify six bottlenecks and find that refactoring them can significantly improve performance and energy efficiency. Our field study includes an in-depth analysis of AKKA, a real-world actor framework, and 30 additional open-source FORKJOIN projects. We sent our patches to the developers of 15 projects, and 7 of the 9 projects that replied have accepted them.