The Data Acquisition (DAQ) system of LHCb is a complex real-time system. It will be upgraded to provide LHCb with an all-software, trigger-free readout starting from 2020. Consequently, more CPU power in the form of servers will be needed, and the DAQ network will grow to a capacity of 40 Tbps. A PC-based readout system would receive data incoming from the detector, which would then be scattered across builder nodes and further distributed to a computing farm for data filtering. The design bandwidth of such a DAQ system requires rates as high as 400 Gbps single-duplex per node. These builder nodes will be connected with cost-effective, high-bandwidth data-centre switches in order to minimize the system cost. The behaviour of such an Event Building network can of course be studied in simulation, but experience tells us that it is crucial to test in practice, in particular to find limitations in the switches themselves and to determine to what extent various Event Building protocols can mitigate these limitations. We present a protocol-, topology- and transport-independent emulation software named DAQ Protocol-Independent Performance Evaluator (DAQPIPE). It allows us to test different communication architectures, such as push or pull, with regard to the initiator of the communication, as well as different topologies and transport protocols. We present throughput and stress tests on an InfiniBand FDR multi-rail LAN setup, with a focus on the network performance. Large-scale tests on the current LHCb DAQ system demonstrate the scalability of DAQPIPE itself and its capability to be deployed on any kind of large, tightly interconnected network to test its suitability for Event Building applications.
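To make the push/pull distinction concrete, the following minimal C++ sketch contrasts the two communication architectures on a purely logical, in-memory model: in push mode the data sources initiate the transfer of event fragments to builder nodes, while in pull mode the builders request the fragments they need. All names and the event-to-builder mapping are illustrative assumptions, not the DAQPIPE implementation.

```cpp
// Minimal logical sketch of push- vs pull-style event building,
// using a hypothetical in-memory model (not the DAQPIPE code itself).
#include <cstdio>
#include <string>
#include <vector>

struct Fragment { int event_id; int source_id; std::string payload; };

// Each source holds one fragment per event.
std::vector<std::vector<Fragment>> make_sources(int n_sources, int n_events) {
    std::vector<std::vector<Fragment>> sources(n_sources);
    for (int s = 0; s < n_sources; ++s)
        for (int e = 0; e < n_events; ++e)
            sources[s].push_back({e, s, "data"});
    return sources;
}

int main() {
    const int n_sources = 4, n_events = 8, n_builders = 2;
    auto sources = make_sources(n_sources, n_events);

    // PUSH: each source decides where a fragment goes (builder = event_id % n_builders)
    // and sends it immediately; builders passively collect.
    std::vector<std::vector<Fragment>> pushed(n_builders);
    for (auto& src : sources)
        for (auto& f : src)
            pushed[f.event_id % n_builders].push_back(f);

    // PULL: each builder initiates the transfers, requesting the fragments
    // of the events it owns from every source, one event at a time.
    std::vector<std::vector<Fragment>> pulled(n_builders);
    for (int b = 0; b < n_builders; ++b)
        for (int e = b; e < n_events; e += n_builders)   // events owned by builder b
            for (auto& src : sources)
                pulled[b].push_back(src[e]);             // "request" fragment of event e

    std::printf("push: builder 0 holds %zu fragments\n", pushed[0].size());
    std::printf("pull: builder 0 holds %zu fragments\n", pulled[0].size());
    return 0;
}
```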
ISBN (print): 9780769539676
A new parallel programming framework for DNA sequence alignment in homogeneous multi-core processor architectures is proposed. In contrast to traditional coarse-grained parallel approaches, which divide the considered database into several smaller subsets of complete sequences to be aligned with the query sequence, the presented methodology is based on slicing both the query and the database sequence under consideration into several tiles/chunks that are concurrently processed by the several cores available in the multi-core processor. The experimental results show that significant accelerations of traditional biological sequence alignment algorithms can be obtained, reaching a speedup that is linear in the number of available processing cores and very close to the theoretical maximum.
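The fine-grained tiling idea can be illustrated with a hedged sketch: the dynamic-programming score matrix is cut into tiles along both the query and the database sequence, and tiles on the same anti-diagonal, which are mutually independent, are processed concurrently. The tile size, scoring values and the use of OpenMP below are illustrative choices, not the paper's framework.

```cpp
// Sketch of wavefront-tiled Smith-Waterman scoring; tile size and scores are
// illustrative, and the pragma is simply ignored when built without OpenMP.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string query = "ACACACTA", db = "AGCACACA";
    const int m = query.size(), n = db.size();
    const int match = 2, mismatch = -1, gap = -1, T = 4;     // T: tile edge length
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
    int best = 0;

    const int tiles_i = (m + T - 1) / T, tiles_j = (n + T - 1) / T;
    for (int d = 0; d < tiles_i + tiles_j - 1; ++d) {        // anti-diagonals of tiles
        // Tiles on the same anti-diagonal have no mutual dependencies.
        #pragma omp parallel for reduction(max:best)
        for (int ti = std::max(0, d - tiles_j + 1); ti <= std::min(d, tiles_i - 1); ++ti) {
            const int tj = d - ti;                           // tile coordinates
            for (int i = ti * T + 1; i <= std::min((ti + 1) * T, m); ++i)
                for (int j = tj * T + 1; j <= std::min((tj + 1) * T, n); ++j) {
                    const int s = (query[i - 1] == db[j - 1]) ? match : mismatch;
                    H[i][j] = std::max({0, H[i - 1][j - 1] + s,
                                        H[i - 1][j] + gap, H[i][j - 1] + gap});
                    best = std::max(best, H[i][j]);
                }
        }
    }
    std::printf("best local alignment score: %d\n", best);
    return 0;
}
```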
MapReduce was originally proposed as a suitable and efficient approach for analyzing and processing large amounts of data. Since then, many research efforts have contributed MapReduce implementations for distributed and shared memory architectures. Nevertheless, different architectural levels require different optimization strategies in order to achieve high-performance computing. Such strategies, in turn, have led to very different MapReduce programming interfaces across these implementations. This paper presents some research notes on coding productivity when developing MapReduce applications for distributed and shared memory architectures. As a case study, we introduce our current research on a unified MapReduce domain-specific language with code generation for Hadoop and Phoenix++, which has achieved a coding productivity increase ranging from 41.84% up to 94.71% without significant performance losses (below 3%) compared to those frameworks.
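As an illustration of what such a unified interface has to express, the sketch below shows the user-visible part of a word-count job, the map and reduce functions, in plain sequential C++; a DSL like the one described would generate the framework-specific glue (for Hadoop or Phoenix++) around this kind of kernel. The code is a hedged model, not the paper's DSL nor either framework's API.

```cpp
// Hedged sequential model of a MapReduce word count: only the map and reduce
// kernels plus a trivial in-memory shuffle, to show the user-level contract.
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// map: one input record -> list of (key, value) pairs
std::vector<std::pair<std::string, int>> map_fn(const std::string& line) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(line);
    for (std::string word; in >> word; ) out.push_back({word, 1});
    return out;
}

// reduce: key + all values for that key -> aggregated value
int reduce_fn(const std::string&, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    const std::vector<std::string> input = {"to be or not", "to be"};
    std::map<std::string, std::vector<int>> shuffled;            // shuffle phase
    for (const auto& line : input)
        for (const auto& kv : map_fn(line)) shuffled[kv.first].push_back(kv.second);
    for (const auto& kv : shuffled)                              // reduce phase
        std::printf("%s: %d\n", kv.first.c_str(), reduce_fn(kv.first, kv.second));
    return 0;
}
```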
ISBN (print): 9781509008070
The field of parallel computing has experienced an increase in the number of computing nodes, and parallel computing has widened its application to include computations with irregular features. Some parallel programming languages handle object data structures and offer marshaling/unmarshaling mechanisms to transport them. To manage data elements spread over computing nodes, research on distributed collections has been conducted. This study proposes a distributed collection library that can handle multiple collections of object elements and change their distributions while maintaining the associativity between their elements. The library is implemented on the object-oriented parallel programming language X10. Consider pairs of associative collections such as vehicles and streets in a traffic simulation: when many vehicles are concentrated on streets assigned to certain computing nodes, some of those streets should be moved to other nodes. Our library supports the programmer in easily distributing the associative collections over the computing nodes and re-allocating their elements while maintaining the data-sharing relationship among associative elements. The programmer can describe the associativity between objects using both declarative and procedural methods.
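Since the library itself is written in X10, the following hedged C++ sketch only models the core idea: two associated collections (streets and the vehicles currently on them) are partitioned over nodes, and re-distribution moves a street together with its vehicles so the association is never broken. All names and the balancing rule are hypothetical.

```cpp
// Hedged sketch (in C++, not X10): each street lives on one node and its
// vehicles always migrate with it, preserving the association on re-distribution.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

struct Node {
    std::map<int, std::vector<int>> streets;   // street id -> vehicle ids on it
};

// Move one street, together with all vehicles associated to it, to another node.
void migrate_street(Node& from, Node& to, int street_id) {
    auto it = from.streets.find(street_id);
    if (it == from.streets.end()) return;
    to.streets[street_id] = std::move(it->second);
    from.streets.erase(it);
}

int main() {
    std::vector<Node> nodes(2);
    nodes[0].streets[10] = {1, 2, 3, 4, 5};    // node 0 is overloaded
    nodes[0].streets[11] = {6};
    nodes[1].streets[20] = {7};

    // Naive balancing rule (illustrative): shed the busiest street from node 0.
    migrate_street(nodes[0], nodes[1], 10);

    for (std::size_t n = 0; n < nodes.size(); ++n) {
        std::size_t vehicles = 0;
        for (const auto& s : nodes[n].streets) vehicles += s.second.size();
        std::printf("node %zu: %zu streets, %zu vehicles\n",
                    n, nodes[n].streets.size(), vehicles);
    }
    return 0;
}
```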
Over the last two decades, researchers have developed many software, hardware, and hybrid Transactional Memories (TMs) with various APIs and semantics. However, reduced performance under high-contention loads is still the major downside of all TMs. Although many strategies and methods have been proposed, contention management and transaction scheduling remain an open area of research. An important piece of the unsolved contention management puzzle is plausible estimation of transaction execution times. In this paper we propose two methods for estimating transaction execution times: one based on the log-normal distribution and one based on the gamma distribution. The experimental results presented in this paper indicate that the log-normal method has better estimation accuracy than the gamma method. Even more importantly, the log-normal method uses sliding windows that are 10 times shorter and has much lower complexity than the gamma method, so it is faster and requires less electrical power.
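A hedged numerical sketch of the log-normal approach: keep a short sliding window of observed transaction durations, estimate mu and sigma from their logarithms, and predict the expected execution time as exp(mu + sigma^2/2), the mean of a log-normal distribution. The window length and the plain moment estimator are assumptions for illustration, not the authors' exact procedure.

```cpp
// Hedged log-normal execution-time estimator over a small sliding window.
#include <cmath>
#include <cstdio>
#include <deque>

class LogNormalEstimator {
    std::deque<double> window_;            // last observed durations (e.g. in us)
    const std::size_t capacity_;
public:
    explicit LogNormalEstimator(std::size_t capacity) : capacity_(capacity) {}

    void observe(double duration) {
        window_.push_back(duration);
        if (window_.size() > capacity_) window_.pop_front();
    }

    double expected_duration() const {
        if (window_.empty()) return 0.0;
        double mu = 0.0;
        for (double d : window_) mu += std::log(d);
        mu /= window_.size();
        double var = 0.0;
        for (double d : window_) var += (std::log(d) - mu) * (std::log(d) - mu);
        var /= window_.size();
        return std::exp(mu + 0.5 * var);   // mean of a log-normal distribution
    }
};

int main() {
    LogNormalEstimator est(8);             // short window, illustrative length
    for (double d : {12.0, 15.0, 11.0, 40.0, 13.0, 14.0}) est.observe(d);
    std::printf("estimated next transaction time: %.2f us\n", est.expected_duration());
    return 0;
}
```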
OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, enabling support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work at the source level, the parallel region formation retains the information on the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by later generic compiler passes for efficient parallelization. The proposed open-source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster or close ...
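The target-independent parallel region formation described above can be pictured as wrapping the per-work-item kernel body in explicit loops over the local work-item ids, producing a "work-group function" that later target-specific passes can vectorize, unroll or map to static multi-issue slots. The C++ sketch below only mimics that concept; it is not pocl's actual code generation.

```cpp
// Hedged illustration of the "work-item loop" idea: the body written for a
// single OpenCL work-item runs inside an explicit loop over the local id,
// forming one parallel region that a compiler can later vectorize or unroll.
#include <cstdio>
#include <vector>

// Per-work-item body of a vector-add kernel: c[gid] = a[gid] + b[gid].
inline void kernel_body(int gid, const float* a, const float* b, float* c) {
    c[gid] = a[gid] + b[gid];
}

// "Work-group function": the parallel region formed around the kernel body.
void run_work_group(int group_id, int local_size,
                    const float* a, const float* b, float* c) {
    for (int lid = 0; lid < local_size; ++lid) {      // explicit work-item loop
        int gid = group_id * local_size + lid;
        kernel_body(gid, a, b, c);                    // data-parallel, no cross-item deps
    }
}

int main() {
    const int local_size = 4, num_groups = 2, n = local_size * num_groups;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    for (int g = 0; g < num_groups; ++g)
        run_work_group(g, local_size, a.data(), b.data(), c.data());
    std::printf("c[0] = %.1f, c[%d] = %.1f\n", c[0], n - 1, c[n - 1]);
    return 0;
}
```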
Patterns provide a mechanism to express parallelism at a high level of abstraction and to ease the transformation of existing legacy applications to target parallel frameworks. This also opens a path for writing new parallel applications. In this paper we introduce the REPARA approach for expressing parallel patterns and transforming the source code to parallelism frameworks. We take advantage of C++11 attributes as a mechanism to introduce annotations and enrich semantic information on valid source code. We also present a methodology for performing source code transformations that can target multiple parallel programming models. Another contribution is a rule-based mechanism to transform annotated code to those specific programming models. The REPARA approach requires programmer intervention only to perform the initial code annotation while providing speedups comparable to those obtained by manual parallelization.
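The annotation mechanism can be illustrated with a hedged example: a C++11 attribute marks a loop as a parallel kernel while the code remains valid, sequential C++ for an ordinary compiler (unknown attributes are ignored, typically with a warning). The attribute name used here is a placeholder in the REPARA spirit, not necessarily the project's exact vocabulary.

```cpp
// Hedged sketch of attribute-based annotation for later source-to-source
// transformation; the rpr::kernel attribute is a placeholder name.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> in(1000, 1.5), out(1000, 0.0);

    // The annotation marks the loop as a data-parallel kernel; a REPARA-like
    // tool would rewrite it for a target framework (e.g. OpenMP, TBB, a GPU),
    // while a plain compiler just ignores the unknown attribute.
    [[rpr::kernel]]
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = 2.0 * in[i] + 1.0;

    std::printf("out[0] = %.2f\n", out[0]);
    return 0;
}
```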
We present SciPAL (scientific parallel algorithms library), a C++-based, hardware-independent open-source library. Its core is a domain-specific embedded language for numerical linear algebra. The main fields of appli...
ISBN (print): 9781424482641
Parallel programming is an important tool used in flash memories to achieve high write speed. In parallel programming, a common program voltage is applied to many cells for simultaneous charge injection. This property significantly simplifies the memory hardware, but it is also a constraint that limits the storage capacity of flash memories. Another important property is that cells differ in their hardness for charge injection, which makes the injected charge differ across cells even when the same program voltage is applied to them. In this paper, we study the parallel programming of flash memory cells, focusing on these two properties. We present algorithms for parallel programming when there is information on the cells' hardness for charge injection but no feedback information on cell levels during programming. We then proceed to the programming model with feedback information on cell levels, and study how well the information on the cells' hardness for charge injection can be obtained. The results are useful for understanding the storage capacity of flash memories with parallel programming.
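A hedged toy model of the no-feedback setting described above: in every round a single program voltage is shared by all selected cells, each cell gains charge in proportion to its (known) hardness, and the voltage is chosen greedily so that no cell overshoots its target level. The linear charge model and the greedy rule are illustrative assumptions, not the paper's algorithms.

```cpp
// Toy model: shared program voltage per round, charge gain = hardness * V,
// V chosen so no cell exceeds its target (deterministic, no feedback needed
// because the hardness values are assumed known).
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> hardness = {1.0, 0.6, 0.8};   // ease of charge injection per cell
    std::vector<double> target   = {3.0, 2.0, 2.5};   // desired charge levels
    std::vector<double> level(hardness.size(), 0.0);  // current levels (start erased)

    for (int round = 0; round < 10; ++round) {
        // Largest common voltage that keeps every unfinished cell at or below target.
        double v = 1e9; bool any = false;
        for (std::size_t i = 0; i < level.size(); ++i) {
            double remaining = target[i] - level[i];
            if (remaining > 1e-9) { v = std::min(v, remaining / hardness[i]); any = true; }
        }
        if (!any) break;                               // all cells programmed
        for (std::size_t i = 0; i < level.size(); ++i) // apply the shared pulse
            if (target[i] - level[i] > 1e-9) level[i] += hardness[i] * v;
    }
    for (std::size_t i = 0; i < level.size(); ++i)
        std::printf("cell %zu: level %.2f (target %.2f)\n", i, level[i], target[i]);
    return 0;
}
```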
ISBN (print): 9781905088416
This contribution presents a computational framework for simulation and gradient-based structural optimization of geometrically nonlinear, large-scale structural finite element models. CAGD-free optimization methods have been developed to integrate shape optimization in an early stage of design and to reduce the related modelling effort. To overcome the problem of increasing numerical cost due to the large design space, the design sensitivities for objectives and constraints are evaluated via adjoint formulations. A new parallel computation strategy for sensitivity evaluation is presented which takes advantage of a completely parallelized simulation and optimization environment. Two application examples illustrate the method and demonstrate its high parallel efficiency.
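For reference, the generic discrete adjoint sensitivity relations behind such a formulation are shown below; with one adjoint solve per objective or constraint, the gradient cost becomes essentially independent of the number of design variables. This is the standard form, not the paper's specific derivation.

```latex
% Generic discrete adjoint sensitivity relations (assumed standard form).
% J: objective, u: state vector, s: design variable,
% R(u, s) = 0: discretized (nonlinear) equilibrium residual.
\[
  \left(\frac{\partial R}{\partial u}\right)^{\!\top}\!\lambda
    = -\left(\frac{\partial J}{\partial u}\right)^{\!\top},
  \qquad
  \frac{\mathrm{d}J}{\mathrm{d}s}
    = \frac{\partial J}{\partial s} + \lambda^{\top}\frac{\partial R}{\partial s}.
\]
```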