It is often a challenge to keep input tasks and output results in order for parallel computations over data streams, particularly when stateless task operators are replicated to increase parallelism and the tasks are irregular. Maintaining input/output order requires additional coding effort and may significantly impact the application's actual throughput. Thus, we propose a new implementation technique designed to be easily integrated with any of the existing C++ parallel programming frameworks that support stream parallelism. In this paper, it is first implemented and studied using SPar, our high-level domain-specific language for stream parallelism. We discuss the results of a set of experiments with real-world applications revealing how significant performance improvements may be achieved when our proposed solution is integrated within SPar, especially for data compression applications. Also, we show the results of experiments performed after integrating our solution within FastFlow and TBB, revealing no significant overheads.
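The ordering problem described above can be illustrated with a small sketch. It is not the technique proposed in the paper and does not use SPar, FastFlow, or TBB; it only shows the generic idea of tagging each task with a sequence number and holding early results in a reorder buffer. The names `ordered_results` and `irregular_square` are hypothetical, and for brevity all tasks are submitted up front rather than consumed as a true stream.

```python
import heapq
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def ordered_results(tasks, worker, replicas=4):
    """Yield worker(task) in input order even though replicas finish out of order."""
    pending = []   # min-heap of (sequence number, result) that arrived early
    next_seq = 0   # sequence number the output stage must emit next
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        # Tag every task with its position in the input stream.
        futures = {pool.submit(worker, t): seq for seq, t in enumerate(tasks)}
        for fut in as_completed(futures):
            heapq.heappush(pending, (futures[fut], fut.result()))
            # Release the longest in-order prefix that is now available.
            while pending and pending[0][0] == next_seq:
                yield heapq.heappop(pending)[1]
                next_seq += 1


def irregular_square(x):
    """Stateless operator with an irregular (random) processing time."""
    time.sleep(random.uniform(0, 0.01))
    return x * x


if __name__ == "__main__":
    print(list(ordered_results(range(10), irregular_square)))
```

The buffer is where the throughput cost mentioned in the abstract comes from: one slow task at the head of the stream delays every result queued behind it.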
Real-time data processing is one of the central processes of particle physics experiments which require large computing resources. The LHCb (Large Hadron Collider beauty) experiment will be upgraded to cope with a particle bunch collision rate of 30 million times per second, producing 10^9 particles/s. 40 Tbit/s need to be processed in real time to make filtering decisions to store data. This poses a computing challenge that requires exploration of modern hardware and software solutions. We present Compass, a particle tracking algorithm and a parallel raw input decoding optimized for GPUs. It is designed for highly parallel architectures, data-oriented, and optimized for fast and localized data access. Our algorithm is configurable, and we explore the trade-off in computing and physics performance of various configurations. A CPU implementation that delivers the same physics performance as our GPU implementation is presented. We discuss the achieved physics performance and validate it with Monte Carlo simulated data. We show a computing performance analysis comparing consumer and server-grade GPUs, and a CPU. We show the feasibility of using a full GPU decoding and particle tracking algorithm for high-throughput particle trajectory reconstruction, where our algorithm improves the throughput by up to 7.4x compared to the LHCb baseline.
Programming correct parallel software in a cost-effective way is a challenging task requiring a high degree of expertise. As an attempt to overcome the pitfalls undermining parallel programming, this paper proposes a pattern-based, formally grounded tool that eases writing parallel code by automatically generating platform-dependent programs from high-level, platform-independent specifications. The tool builds on three pillars: (1) a platform-agnostic parallel programming pattern, called PCR, (2) a formal translation of PCRs into a parallel execution model, namely Concurrent Collections (CnC), and (3) a program rewriting engine that generates code for a concrete runtime implementing CnC. The experimental evaluation gives evidence that code produced from PCRs can deliver performance comparable with handwritten code but with assured correctness. The technical contribution of this paper is threefold. First, it discusses a parallel programming pattern, called PCR, consisting of producers, consumers, and reducers which operate concurrently on data sets. To favor correctness, the semantics of PCRs is mathematically defined in terms of the formalism FXML. PCRs are shown to be composable and to seamlessly subsume other well-known parallel programming patterns, thus providing a framework for heterogeneous designs. Second, it formally shows how the PCR pattern can be correctly implemented in terms of a more concrete parallel execution model. Third, it proposes a platform-agnostic C++ template library to express PCRs. It presents a prototype source-to-source compilation tool, based on C++ template rewriting, which automatically generates parallel implementations relying on the Intel CnC C++ library.
As data becomes bigger and more complex, people tend to process it in distributed systems implemented on clusters. Due to power consumption, cost, and differentiated price-performance, clusters are evolving into systems with heterogeneous hardware, which leads to performance differences among the nodes. Even in a homogeneous cluster, the performance of the nodes differs due to resource competition and communication cost. Nodes with poor performance drag down the efficiency of the whole system. Existing parallel computing strategies such as the bulk synchronous parallel strategy and the stale synchronous parallel strategy are not well suited to this problem. To address it, we propose a free stale synchronous parallel (FSSP) strategy to free the system from the negative impact of those nodes. FSSP improves on the stale synchronous parallel (SSP) strategy: it effectively and accurately identifies the slow nodes and eliminates their negative effects. We validated the performance of the FSSP strategy using some classical machine learning algorithms and datasets. Our experimental results demonstrated that FSSP was 1.5-12x faster than the bulk synchronous parallel strategy and the stale synchronous parallel strategy, and that it used 4x fewer iterations than the asynchronous parallel strategy to converge.
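To make the baseline strategies concrete, the following minimal sketch shows the usual bounded-staleness rule behind SSP, with BSP as the special case of zero staleness. It does not implement FSSP, whose slow-node detection is not described in the abstract. The function name, the `clocks` mapping, and the worker names are illustrative assumptions.

```python
def may_start_next_iteration(clocks, worker, staleness):
    """Bounded-staleness check (simplified SSP rule).

    clocks maps a worker name to the number of iterations it has completed.
    A worker may run ahead of the slowest worker by at most `staleness`
    iterations; staleness == 0 reduces to a BSP-style barrier.
    """
    slowest = min(clocks.values())
    return clocks[worker] - slowest <= staleness


# Hypothetical cluster state: one straggler holds the others back.
clocks = {"fast": 5, "medium": 4, "slow": 2}

for s in (0, 2, 3):
    allowed = [w for w in clocks if may_start_next_iteration(clocks, w, s)]
    print(f"staleness={s}: may proceed -> {allowed}")
```

Under this rule a single straggler eventually stalls every other worker once the staleness bound is reached, which is the situation the abstract says FSSP is designed to avoid.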
This paper presents a study of the adaptation of a Non-Linear Iterative Partial Least Squares (NIPALS) algorithm applied to Hyperspectral Imaging to a Massively Parallel Processor Array manycore architecture, which assembles 256 cores distributed over 16 clusters. This work aims at optimizing the internal communications of the platform to achieve real-time processing of large data volumes with limited computational resources and memory bandwidth. As hyperspectral images are composed of extensive volumes of spectral information, real-time requirements, which are upper-bounded by the image capture rate of the hyperspectral sensor, are a challenging objective. To address this issue, the image size is usually reduced prior to the processing phase, which is itself a computationally intensive task. Consequently, this paper proposes an analysis of the intrinsic parallelism and the data dependency within the NIPALS algorithm and its subsequent implementation on a manycore architecture. Furthermore, this implementation has been validated against three hyperspectral images extracted from both remote sensing and medical datasets. As a result, an average speedup of 17x has been achieved when compared to the sequential version. Finally, this approach has been compared with other state-of-the-art implementations, outperforming them.
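For readers unfamiliar with NIPALS, the sketch below shows a plain sequential version of its inner iteration for the first principal component, in pure Python. It is a textbook formulation, not the manycore implementation from the paper; the function name and the toy data are assumptions. The comments mark the two steps whose per-element dot products are independent of each other, which is where data parallelism is available, while the iteration loop itself is sequential.

```python
def nipals_first_component(X, tol=1e-12, max_iter=500):
    """Return (scores t, loadings p) of the first principal component.

    X is a list of rows and is assumed to be mean-centered already.
    """
    n, m = len(X), len(X[0])
    # Start from the column with the largest energy to avoid a zero vector.
    start = max(range(m), key=lambda j: sum(X[i][j] ** 2 for i in range(n)))
    t = [X[i][start] for i in range(n)]
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings p = X^T t / (t^T t): one independent dot product per column.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = sum(v * v for v in p) ** 0.5
        p = [v / norm for v in p]
        # Scores t = X p: one independent dot product per row (pixel).
        t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:   # each iteration depends on the previous one
            break
    return t, p


if __name__ == "__main__":
    # Toy centered data lying on the direction (1, 1).
    X = [[2.0, 2.0], [-2.0, -2.0], [1.0, 1.0], [-1.0, -1.0]]
    scores, loadings = nipals_first_component(X)
    print(loadings)
```

Further components are obtained by deflating the data, that is, subtracting the outer product of the scores and loadings from X, and repeating the same iteration.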
This paper is devoted to the performance evaluation of a hybrid computer cluster built on IBM POWER8 CPUs and NVIDIA Tesla P100 GPUs. The architecture of the computing system and software used are described. Results of experiments carried out using the STREAM, NPB, Crossroads/NERSC-9 DGEMM, and HPL packages are discussed. The efficiency of the simultaneous multithreading (SMT) technology supported by POWER8 processors, as well as the performance of some compilers, parallel programming and mathematical libraries, on this architecture is analyzed.
In this paper, an enhanced visual place recognition system is proposed aiming to improve the localization performance of a mobile platform. Our technique takes full advantage of the continuous input image stream in order to provide additional knowledge to the matching functionality. The well-established Bag-of-Visual-Words model is adapted into a hierarchical design that derives the visual information from the full entity of a natural scene into the description, while it additionally preserves the geometric structure of the explored world. Our approach is evaluated as part of a state-of-the-art Simultaneous Localization and Mapping algorithm, and parallelization techniques are exploited utilizing every available hardware module in a low-power device. The implemented algorithm has been tested on several publicly available datasets offering consistently accurate localization results and preventing the majority of redundant computations that the additional geometrical verifications can induce. (C) 2019 Elsevier B.V. All rights reserved.
Process-network synthesis is the determination of the optimal network structure of a process system together with optimal configurations and capacities of the operating units incorporated into the system. The aim of developing more and more sophisticated solver algorithms is to find the optimum as fast as possible and to widen the circle of practically solvable process synthesis problems. The P-graph framework can effectively reduce the number of structures to be examined and accelerate the search for the optimum by exploiting the combinatorial characteristics of candidate solution structures. A cooperative parallel implementation of P-graph algorithms has been published recently to exploit the capabilities of multi-core and multiprocessor systems (Bartos and Bertok in De Gruyter Ser Logic Appl 1:303-313, 2015). The parallel implementation has increased performance significantly, but this can be further improved by fine-tuning the parameters of the parallel algorithm. Outcomes of experiments on parameter optimization are presented herein.
Minimal functional dependency is an important relationship in the relational database. It can describe some special relationships between complex and irregular attributes in the relational database. Extracting minimal functional dependencies (MFDs) from relational databases is an important database analysis technique. However, as the data grows larger and larger in size, even the most efficient stand-alone algorithms are exponential in the number of attributes of the relations. Discovering MFDs on a single computer is hard and slow, and it can only be applied to small centralized datasets. It is challenging to discover MFDs from big data, especially large-scale distributed data. Apache Spark is a unified analytics engine for big data processing; we present a new algorithm, FastMFDs, based on Spark for discovering all MFDs from large-scale distributed data in parallel. FastMFDs uses both the RDD framework and the DataFrame framework to store and process distributed data. FastMFDs deletes equivalent attributes. FastMFDs also provides a two-way search algorithm for searching and pruning. We evaluated our algorithm on real-life datasets, and it is more efficient and faster than the existing discovery methods.
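As a concrete reference for what a minimal functional dependency is, the following brute-force sketch checks whether a dependency holds in a small in-memory relation and enumerates the minimal ones with a single attribute on the right-hand side. It is not FastMFDs and does not use Spark; the function names and the sample rows are hypothetical, and dependencies with an empty left-hand side (constant columns) are ignored. Its exhaustive enumeration of attribute subsets also shows why the cost grows exponentially with the number of attributes.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """True if the attributes in lhs functionally determine rhs in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_fds(rows, attrs):
    """Enumerate minimal dependencies lhs -> rhs, smallest lhs first."""
    found = []
    for rhs in attrs:
        others = [a for a in attrs if a != rhs]
        minimal_lhs = []
        for size in range(1, len(others) + 1):
            for lhs in combinations(others, size):
                # Skip supersets of an lhs that already determines rhs.
                if any(set(prev) <= set(lhs) for prev in minimal_lhs):
                    continue
                if holds(rows, lhs, rhs):
                    minimal_lhs.append(lhs)
                    found.append((lhs, rhs))
    return found


if __name__ == "__main__":
    rows = [
        {"id": 1, "zip": "10001", "city": "NYC"},
        {"id": 2, "zip": "10001", "city": "NYC"},
        {"id": 3, "zip": "94105", "city": "SF"},
    ]
    for lhs, rhs in minimal_fds(rows, ["id", "zip", "city"]):
        print(", ".join(lhs), "->", rhs)
```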
Statistical fisheries models are frequently used by researchers and agencies to understand the behavior of marine ecosystems or to estimate the maximum acceptable catch of different species of commercial interest. The parameters of these models are usually adjusted through the use of optimization algorithms. Unfortunately, the choice of the best optimization method is far from trivial. This work proposes the use of population-based algorithms to improve the optimization process of the Globally applicable Area Disaggregated General Ecosystem Toolbox (Gadget), a flexible framework that allows the development of complex statistical marine ecosystem models. Specifically, parallel versions of the Differential Evolution (DE) and the Particle Swarm Optimization (PSO) methods are proposed. The proposals include an automatic selection of the internal parameters to reduce the complexity of their usage, and a restart mechanism to avoid local minima. The resulting optimization algorithms were called PMA (Parallel Multirestart Adaptive) DE and PMA PSO respectively. Experimental results prove that the new algorithms are faster and produce more accurate solutions than the other parallel optimization methods already included in Gadget. Although the new proposals have been evaluated on fisheries models, there is nothing specific to the tested models in them, and thus they can also be applied to other optimization problems. Moreover, the proposed PMA scheme can be seen as a template that can be easily applied to other population-based heuristics. (C) 2019 Elsevier B.V. All rights reserved.
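The sketch below shows a basic sequential Differential Evolution loop (the classic rand/1/bin variant) on a toy objective, to illustrate the kind of population-based method the abstract builds on. It is not PMA DE: it has fixed parameters, no restart mechanism, and no parallelism, and it is unrelated to Gadget's code. The function name, the parameter values, and the sphere objective are assumptions chosen for illustration.

```python
import random


def differential_evolution(f, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=300, seed=0):
    """Minimize f over box bounds with DE/rand/1/bin (fixed F and CR)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [f(x) for x in pop]   # these evaluations are independent of each other
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals, all different from i.
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(dim)   # at least one coordinate mutates
            trial = []
            for j in range(dim):
                if j == forced or rng.random() < CR:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)   # clip to the search box
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_cost = f(trial)
            if trial_cost <= cost[i]:   # greedy one-to-one selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    best_x, best_cost = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
    print(best_x, best_cost)
```

Because each objective evaluation is independent, the evaluations are the natural unit to distribute across workers when the model itself is expensive to run.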