ISBN: 9781479938445 (print)
Multiple threads running on a multi-core processor can improve the performance of a parallel application significantly. However, effective scaling of threads and cores plays a key role in achieving optimal performance, because performance does not necessarily improve with an increasing number of cores. Multi-threaded applications suffer from thread synchronization and from negative interference in shared memory, including the last-level cache and main memory. Memory bandwidth also often limits the performance of a multi-threaded workload. In this paper we propose a method to achieve optimal scalability on a multi-core platform and to predict the bandwidth requirement of parallel workloads for a given number of threads. We employ the proposed method to improve the performance of bandwidth-limited parallel applications. We find that DRAM access has various phases, and we use the highest bandwidth among all phases to predict the performance of a given workload in a multi-threaded environment. We evaluate our proposed method using the Gem5 multi-core simulator, and the experimental results show that the phase-based bandwidth utilization method can estimate the optimal number of threads for a given parallel workload with low prediction error.
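As a rough illustration of the phase-based idea, the sketch below takes hypothetical per-phase DRAM bandwidth samples for a single thread, treats the highest phase as the per-thread demand, and picks the largest thread count whose aggregate demand still fits the platform's memory bandwidth. The linear scaling of demand with thread count, the function name, and all numbers are assumptions made for illustration; they are not the paper's prediction model.

```python
def estimate_thread_count(phase_bandwidths_gbs, platform_bandwidth_gbs, max_threads):
    """Pick the largest thread count whose peak-phase demand fits the platform.

    Assumption (not from the paper): bandwidth demand scales linearly
    with the number of threads.
    """
    if not phase_bandwidths_gbs:
        raise ValueError("need at least one phase sample")
    peak = max(phase_bandwidths_gbs)  # highest bandwidth among all phases
    if peak <= 0:
        return max_threads  # no memory pressure, so cores are the only limit
    fit = int(platform_bandwidth_gbs // peak)
    return max(1, min(max_threads, fit))


# Made-up single-thread bandwidth per phase, in GB/s.
phases = [0.8, 2.5, 1.2]
print(estimate_thread_count(phases, platform_bandwidth_gbs=12.0, max_threads=16))
```

Using the peak phase rather than the average is the conservative choice here: a thread count sized for the average would still saturate memory during the most demanding phase.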
ISBN: 9781450332866 (print)
Processing complex queries on unbounded event streams in real time is a challenge for many data processing systems. These systems are expected to process data with reduced latency to generate real-time events, and at high throughput to minimize the required hardware. In this regard, Grand Challenge 2015 [6] focuses on evaluating two queries (frequent routes and profitable cells) in real time with low latency and high throughput. These queries involve processing windows of thousands of records. Firstly, such processing demands efficient data structures and algorithms to minimize the processing overhead. Secondly, the system should partition data to evaluate them in parallel to make it *** this paper, we present a set of data structures that we designed to evaluate the aforementioned queries with O(log n) time complexity and a data partitioning technique to evaluate them in parallel. We then evaluate our solution on a single machine as well as in a distributed setting in a commodity cluster of machines over a 1 Gbps LAN. We were able to process the frequent routes query with the 173 million trips dataset within 5 minutes with less than 4 milliseconds of latency, and the profitable cells query with the same dataset within 11 minutes with less than 5 milliseconds of latency.
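To make the windowed "frequent routes" query concrete, here is a minimal sketch of a sliding-window counter. It is a simplified stand-in, not the authors' data structure: the class name, the route representation, and the window length are hypothetical, timestamps are assumed to arrive in non-decreasing order, and `Counter.most_common` does not give the O(log n) bound the abstract reports.

```python
from collections import Counter, deque


class SlidingWindowRoutes:
    """Count routes over a time window and report the most frequent ones."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, route) in arrival order
        self.counts = Counter()  # route -> trips currently in the window

    def add(self, timestamp, route):
        self.events.append((timestamp, route))
        self.counts[route] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop trips that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_route = self.events.popleft()
            self.counts[old_route] -= 1
            if self.counts[old_route] == 0:
                del self.counts[old_route]

    def top(self, k):
        return self.counts.most_common(k)


window = SlidingWindowRoutes(window_seconds=60)
for ts, route in [(0, ("A", "B")), (10, ("A", "B")), (20, ("C", "D")), (65, ("C", "D"))]:
    window.add(ts, route)
print(window.top(1))
```

Partitioning for parallel evaluation could then amount to routing each trip to a worker by a hash of its route, so that every worker owns the counters for a disjoint set of routes; that is one possible scheme, not necessarily the one used in the paper.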
ISBN: 9781467367141 (print)
This paper explains the modeling and control of temperature dynamics in an induction furnace. The induction furnace is used for melting metal to process raw scrap material into steel with a ferrite rate of 0.22 wt% carbon. The dynamic response of the induction furnace temperature affects the resulting product. Therefore, a controller for the temperature dynamics is required to produce the desired process response. The induction furnace system consists of electrical and thermal dynamics. The electrical dynamics represent the induction furnace as an electric circuit, i.e. a current-fed inverter with a parallel resonant circuit. Meanwhile, the thermal dynamics represent the thermal energy transfer process, which is developed with the principle of energy balance, including generated heat and heat loss. The induction furnace dynamics are modelled as a second-order system in which the coil time constant is 1000 times faster than the temperature time constant. Thus, by ignoring the coil time constant, the induction furnace dynamic model can be reduced to a first-order nonlinear system. A linear system can then be obtained through a change of variable. Induction furnace temperature control has been implemented by adjusting the PWM to control the input power to the induction furnace. A PI controller is designed for three cases, namely a linear model with a linear PI controller, a saturated linear model with a linear PI controller, and a saturated linear model with an anti-windup PI controller. Each case is simulated in Simulink. To meet the required specifications, the temperature should be controlled at 912 °C. The best results achieved are a maximum overshoot of 7% and a rise time of 2.9 seconds.
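The third case, a saturated first-order model with an anti-windup PI controller, can be sketched in a few lines. Everything numeric below is hypothetical (plant gain, time constant, controller gains, temperature measured relative to ambient), and the anti-windup scheme shown is conditional integration (clamping), which may differ from the scheme used in the paper's Simulink model.

```python
def simulate_pi_antiwindup(setpoint=912.0, kp=0.005, ki=0.002,
                           gain=1500.0, tau=10.0, dt=0.01, t_end=100.0):
    """Euler simulation of a first-order plant under a saturated PI controller.

    Plant (hypothetical): tau * dT/dt = -T + gain * u, with duty cycle u in [0, 1].
    Anti-windup: the integrator is frozen while the output is saturated and
    the error would push it further into saturation.
    """
    temperature = 0.0
    integral = 0.0  # running integral of the error
    history = []
    for _ in range(int(t_end / dt)):
        error = setpoint - temperature
        u_unsat = kp * error + ki * integral
        u = min(1.0, max(0.0, u_unsat))  # PWM duty cycle limits
        pushing_further = (u_unsat > 1.0 and error > 0) or (u_unsat < 0.0 and error < 0)
        if not pushing_further:
            integral += error * dt
        temperature += dt * (-temperature + gain * u) / tau
        history.append((temperature, u))
    return history


history = simulate_pi_antiwindup()
final_temperature, final_duty = history[-1]
print(round(final_temperature, 1), round(final_duty, 3))
```

Without the `pushing_further` check, the integrator would keep accumulating error during the long initial period at full power, and the temperature would overshoot the setpoint far more before the controller could unwind.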
ISBN: 9788494284472 (print)
This paper presents a parallel remeshing algorithm for distributed-memory architectures. It is an iterative parallel algorithm that divides the areas to be remeshed into multiple pieces which can be distributed to as many processing elements as possible, in order for these pieces to be remeshed concurrently by a third-party sequential remesher. Then, remeshed pieces are reintegrated into the distributed mesh, and this process is iterated until all relevant areas of the mesh have been remeshed. Any sequential remesher can be used, provided it allows some of the mesh elements not to be modified, so as to preserve interfaces between pieces. Our method, which has been implemented in the PaMPA library, is validated by a set of experiments involving both isotropic and anisotropic meshes.
ISBN: 9783319111971 (digital); 9783319111971, 9783319111964 (print)
Halftoning is an important process that converts a gray scale image into a binary image with black and white pixels. Clipping-free DBS (Direct Binary Search)-based halftoning is one of the halftoning methods that can generate high quality binary images. However, because of its computing time, it is not realistic for most applications, such as printing. The main contribution of this paper is a new GPU implementation of clipping-free DBS-based halftoning. We have considered programming issues of the GPU architecture to implement the method on the GPU. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 780 Ti runs in 7.240 seconds for a 4096x3072 gray scale image, while the CPU implementation runs in 346.6 seconds. Thus, our GPU implementation attains a speed-up factor of 47.82.
A system architecture with high-density general purpose graphics processing units (GPGPUs) is emerging as a promising solution that can offer high compute performance and performance per watt for building cluster supercomputers. The raw compute power of these heterogeneous systems greatly exceeds that of the currently prevailing homogeneous systems, motivating their rapid adoption. These heterogeneous systems do, however, increase the complexity of developing parallel applications, and there is a need to investigate the compute performance and associated power consumption of common benchmarks and scientific computing applications. In this paper, we present performance and power studies using the Dell C4130 server, which integrates up to 4 GPGPU cards; NVIDIA K80 GPGPUs are used. High Performance Linpack (HPL) and molecular dynamics (MD) simulators including NAMD, LAMMPS and GROMACS are tested. By comparing 4-K80 and 2-Xeon E5-2690 v3 systems, we show that: (1) for HPL tests, the 4-GPU server delivers up to 7 TFLOPS, which is 9 times faster than the 2-CPU system, and its power efficiency is 4 GFLOPS per watt; (2) for MD tests, NAMD on the 4-GPU server achieves a 7.8 times speedup while using 2.3 times the power of the 2-CPU system, LAMMPS achieves a 16 times speedup while using 2.6 times the power, and GROMACS achieves a 3.3 times speedup while using 2.6 times the power. These preliminary results demonstrate that the novel high-density multi-GPGPU architecture offers high performance for compute-intensive applications and molecular simulators, with superior power efficiency in a space-efficient design. In the future, such a heterogeneous architecture could be a powerful alternative solution for next-generation supercomputer systems.
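The MD figures above can be turned into a relative performance-per-watt gain by dividing each speedup by the corresponding power ratio. The short calculation below does this with the numbers reported in the abstract; it assumes that "times the power" refers to average power draw during the run, which the abstract does not spell out.

```python
# Speedup and relative power draw of the 4-GPU server versus the 2-CPU
# system, as reported in the abstract.
reported = {
    "NAMD": (7.8, 2.3),
    "LAMMPS": (16.0, 2.6),
    "GROMACS": (3.3, 2.6),
}


def relative_efficiency(speedup, power_ratio):
    """Work per unit of energy relative to the baseline system."""
    return speedup / power_ratio


gains = {app: relative_efficiency(s, p) for app, (s, p) in reported.items()}
for app, gain in gains.items():
    print(f"{app}: {gain:.2f}x performance per watt")
```

Under that assumption the gain is roughly 3.4x for NAMD, 6.2x for LAMMPS, and 1.3x for GROMACS, so the efficiency advantage varies considerably between simulators.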
ISBN: 9781479934805 (print)
The efficient processing of large collections of patterns expressed as Boolean expressions over event streams plays a central role in major data-intensive applications ranging from user-centric processing and personalization to real-time data analysis. On the one hand, emerging user-centric applications, including computational advertising and selective information dissemination, demand determining and presenting relevant content to an end user as it is published. On the other hand, applications in real-time data analysis, including push-based multi-query optimization, computational finance and intrusion detection, demand meeting stringent subsecond processing requirements and providing high-frequency event processing. We meet these event processing requirements by exploiting the shift towards multi-core architectures: we propose a novel adaptive parallel compressed event matching algorithm (A-PCM) and an online event stream re-ordering technique (OSR) that unleash an unprecedented degree of parallelism amenable to highly parallel event processing. In our comprehensive evaluation, we demonstrate the efficiency of our proposed techniques. We show that the adaptive parallel compressed event matching algorithm can sustain an event rate of up to 233,863 events/second, while state-of-the-art sequential event matching algorithms sustain only 36 events/second when processing up to five million Boolean expressions.
ISBN: 9783319111971, 9783319111964 (print)
Hybrid parallel file systems (PFSs), which consist of both HDD and SSD servers, provide a promising solution for data-intensive applications. In this study, we propose a performance-aware data placement (PADP) strategy to enable efficient data layout in hybrid PFSs. The basic idea of PADP is to dispatch data to different file servers with adaptive, varied-size file stripes based on each server's storage performance. Using a data access cost model and a linear programming optimization method, the appropriate stripe size for each file server is determined. We have implemented PADP within OrangeFS, a widely used parallel file system in the HPC domain. Experimental results on a representative benchmark show that PADP can significantly improve the I/O performance of hybrid PFSs.
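The intuition behind varied-size stripes can be shown with a much simpler rule than the paper's cost model and linear program: give each server a stripe proportional to its bandwidth, so that all servers finish their share of a request at about the same time. The function name, the integer bandwidth figures, and the proportional rule itself are assumptions for illustration only.

```python
def proportional_stripes(total_bytes, server_bandwidths):
    """Split total_bytes so that faster servers receive larger stripes.

    server_bandwidths: positive integers (e.g. MB/s), one per file server.
    """
    total_bw = sum(server_bandwidths)
    stripes = [total_bytes * bw // total_bw for bw in server_bandwidths]
    # Give any rounding remainder to the (first) fastest server.
    fastest = max(range(len(server_bandwidths)), key=server_bandwidths.__getitem__)
    stripes[fastest] += total_bytes - sum(stripes)
    return stripes


# Two hypothetical HDD servers (100 MB/s) and two SSD servers (400 MB/s)
# sharing a 1 MiB request.
stripes = proportional_stripes(1024 * 1024, [100, 100, 400, 400])
print(stripes)
```

With equal-size stripes the request would complete only when the slowest HDD server finishes; sizing stripes by bandwidth removes that straggler effect, at least under this idealized model that ignores per-request latency.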