The saturation strategy for symbolic state-space generation is very effective for globally-asynchronous locally-synchronous discrete-state systems. Its inherently sequential nature, however, makes it difficult to para...
ISBN (digital): 9781665451550
ISBN (print): 9781665451550
HPC systems have grown significantly in recent years, with modern machines having hundreds of thousands of nodes. Message Passing Interface (MPI) is the de facto standard for distributed computing on these architectures. On the MPI critical path, the message-matching process is one of the most time-consuming operations. In this process, searching for a specific request in a message queue accounts for a significant part of the communication latency. So far, no single algorithm performs well in all cases. This paper explores potential matching specializations enabled by hints introduced in the latest MPI 4.0 standard. We propose a hash-table-based algorithm that performs constant-time message matching for requests without wildcards. This approach is suitable for intensive point-to-point communication phases in many applications (more than 50% of CORAL benchmarks). We demonstrate that our approach can improve the overall execution time of real HPC applications by up to 25%. We also analyze the limitations of our method and propose a strategy for identifying the most suitable algorithm for a given application. To that end, we apply machine learning techniques to classify applications according to their message pattern characteristics.
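The idea of constant-time matching for wildcard-free requests can be illustrated with a minimal sketch. This is not the paper's implementation: the class `NoWildcardMatcher`, its method names, and the `(comm, source, tag)` key layout are assumptions made for illustration. It relies on the application promising never to use `MPI_ANY_SOURCE` or `MPI_ANY_TAG` (MPI 4.0 defines the info keys `mpi_assert_no_any_source` and `mpi_assert_no_any_tag` for such promises), so every receive maps to exactly one hash bucket, and a FIFO queue per bucket keeps messages between the same pair in order.

```python
from collections import defaultdict, deque


class NoWildcardMatcher:
    """Hypothetical matcher: one FIFO queue per exact (comm, source, tag) key."""

    def __init__(self):
        # Receives posted by the application that have not been matched yet.
        self.posted = defaultdict(deque)
        # Messages that arrived before a matching receive was posted.
        self.unexpected = defaultdict(deque)

    def post_receive(self, comm, source, tag, request):
        """Return a waiting payload if one matches, else enqueue the request."""
        key = (comm, source, tag)
        queue = self.unexpected.get(key)
        if queue:
            payload = queue.popleft()  # oldest unexpected message first
            if not queue:
                del self.unexpected[key]
            return payload
        self.posted[key].append(request)
        return None

    def on_message(self, comm, source, tag, payload):
        """Return the matching posted request if any, else store the payload."""
        key = (comm, source, tag)
        queue = self.posted.get(key)
        if queue:
            request = queue.popleft()  # oldest posted receive first
            if not queue:
                del self.posted[key]
            return request
        self.unexpected[key].append(payload)
        return None
```

Each operation is one hash lookup plus one queue push or pop, so its cost does not grow with the number of outstanding requests. A wildcard receive would have to scan every bucket, which is why this kind of specialization only applies when wildcards are ruled out.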
ISBN (print): 9798350303179
Vector processing has become commonplace in today's CPU microarchitectures. Vector instructions improve performance and energy efficiency, which is crucial for resource-constrained mobile devices. The research community currently lacks a comprehensive benchmark suite to study the benefits of vector processing for mobile devices. This paper presents Swan, an extensive vector processing benchmark suite for mobile applications. Swan consists of a diverse set of data-parallel workloads from four commonly used mobile applications: operating system, web browser, audio/video messaging application, and PDF rendering engine. Using the Swan benchmark suite, we conduct a detailed analysis of the performance, power, and energy consumption of vectorized workloads, and show that: (a) Vectorized kernels increase the pressure on the cache hierarchy due to the higher rate of memory requests. (b) Vector processing is more beneficial for workloads with lower-precision operations and higher cache hit rates. (c) Limited instruction-level parallelism and strided memory accesses to multi-dimensional data structures prevent vector processing benefits from scaling with more SIMD functional units and wider registers. (d) Despite lower computation throughput than domain-specific accelerators, such as GPUs, vector processing outperforms these accelerators for kernels with lower operation counts. Finally, we show five common computation patterns in mobile data-parallel workloads that dominate the execution time.
The increasing availability of multi-core and multi-processor architectures provides new opportunities for improving the performance of many computer simulations. Markov Chain Monte Carlo (MCMC) simulations are widely...
This paper discusses a VLSI based multiprocessor architecture for real-time processing of video coding applications. The architecture consists of multiple identical processing elements and is characterized as MIMD (Mu...
ISBN (print): 9781479956180
Workflows play an important role in expressing and executing scientific applications. In recent years, a variety of computational sites and resources have emerged, and users often have access to multiple resources that are geographically distributed. These computational sites are heterogeneous in nature, and the performance of different tasks in a workflow varies from one site to another. Additionally, users typically have a limited resource allocation at each site. In such cases, a judicious scheduling strategy is required to map tasks in the workflow to resources so that the workload is balanced among sites and the data-transfer overhead is minimized. Most existing systems either run the entire workflow at a single site, use naive approaches to distribute the tasks across sites, or leave it to the user to optimize the allocation of tasks to distributed resources. This results in a significant loss in productivity for a scientist. In this paper, we propose a multi-site workflow scheduling technique that uses performance models to predict the execution time on different resources and dynamic probes to identify the achievable network throughput between sites. We evaluate our approach using real-world applications in a distributed environment using the Swift distributed execution framework and show that our approach improves the execution time by up to 60% compared to the default schedule.
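How predicted execution times, probed throughput, and per-site allocations can combine into a placement decision is sketched below. This is a simple greedy heuristic written for illustration, not the scheduling technique proposed in the paper: the function `schedule`, its parameters, and the assumption that all input data starts at a single `home_site` are hypothetical.

```python
def schedule(tasks, allocations, predict_time, throughput, home_site):
    """Greedily map each task to the site with the earliest estimated finish.

    tasks        -- list of (task_name, input_bytes)
    allocations  -- dict: site -> remaining compute seconds the user may spend
    predict_time -- callable (task_name, site) -> predicted run time in seconds
                    (stands in for a performance model)
    throughput   -- dict: (src_site, dst_site) -> bytes per second
                    (stands in for a dynamic network probe)
    home_site    -- site where all input data initially resides (assumption)
    """
    remaining = dict(allocations)
    busy_until = {site: 0.0 for site in allocations}
    plan = {}
    # Place tasks with the largest inputs first; ties keep the given order.
    for name, input_bytes in sorted(tasks, key=lambda t: -t[1]):
        best = None
        for site in allocations:
            run = predict_time(name, site)
            if run > remaining[site]:
                continue  # would exceed the user's allocation at this site
            if site == home_site:
                transfer = 0.0
            else:
                transfer = input_bytes / throughput[(home_site, site)]
            finish = busy_until[site] + transfer + run
            if best is None or finish < best[0]:
                best = (finish, site, run)
        if best is None:
            raise ValueError(f"no site has enough allocation for {name}")
        finish, site, run = best
        plan[name] = site
        busy_until[site] = finish
        remaining[site] -= run
    return plan
```

With two equal tasks, a slow home site, and a fast remote site, the heuristic sends the first task to the remote site and keeps the second at home, because queueing behind the first task would cost more than running locally. A real scheduler would also have to account for task dependencies and for throughput that changes over time.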
ISBN (print): 0769519199
Since the advent of distributed computer systems, an active field of research has been the investigation of scheduling strategies for parallel applications. The common approach is to employ scheduling heuristics that approximate an optimal schedule. Unfortunately, it is often impossible to obtain analytical results to compare the efficacy of these heuristics. One possibility is to conduct large numbers of back-to-back experiments on real platforms. While this is possible on tightly-coupled platforms, it is infeasible on modern distributed platforms (i.e. Grids) as it is labor-intensive and does not enable repeatable results. The solution is to resort to simulations. Simulations not only enable repeatable results but also make it possible to explore wide ranges of platform and application scenarios. In this paper we present the SimGrid framework, which enables the simulation of distributed applications in distributed computing environments for the specific purpose of developing and evaluating scheduling algorithms. This paper focuses on SimGrid v2, which greatly improves on the first version of the software with more realistic network models and topologies. SimGrid v2 also enables the simulation of distributed scheduling agents, which has become critical for current scheduling research in large-scale platforms. After describing and validating these features, we present a case study that demonstrates the usefulness of SimGrid for conducting scheduling research.
ISBN (print): 0780379292
This paper presents a multi-DSP system for real-time SAR processing using the HiPAR-DSP 16. We developed this fully programmable processor at the Laboratorium für Informationstechnologie. With 16 parallel data paths and a two-dimensional memory, it is optimized for image processing algorithms such as the FFT. SAR image synthesis methods, like the investigated wk-algorithm, use the computationally intensive FFT. To meet the large processing-power demands of future real-time SAR image synthesis applications, several DSPs have to work in parallel. The presented compact SAR system can be easily adapted to match the demands of different SAR algorithms by scaling the number of processing nodes. Equipped with 6 HiPAR-DSP 16 processors, a 233 x 175 x 15 mm³ board provides the capability to process SAR applications with a PRF of 1200 Hz in real time. A rangeline length of 4096 8-bit complex samples and 4096 rangelines is assumed. Its small volume and its power consumption of less than 35 W make it suitable for on-board use in compact airborne or spaceborne systems.
This paper examines the effects of relaxed synchronization on both the numerical and parallel efficiency of parallel genetic algorithms (GAs). We describe a coarse-grain geographically structured parallel genetic algorithm. Our experiments provide preliminary evidence that asynchronous versions of these algorithms have a lower run time than synchronous GAs. Our analysis shows that this improvement is due to (1) decreased synchronization costs and (2) high numerical efficiency (e.g. fewer function evaluations) for the asynchronous GAs. This analysis includes a critique of the utility of traditional parallel performance measures for parallel GAs.
The aim of P2P computing is to build virtual computing systems dedicated to large-scale computational problems. JXTA proposes an underlying infrastructure on which JNGI, one of the first P2P decentralized computing f...