The embedded and high-performance computing (HPC) sectors, which in the past were completely separate, are now converging under the pressure of two driving forces: the release of less power-hungry server processors and the increased performance of the new low-power Systems-on-Chip (SoCs) developed to meet the requirements of the demanding mobile market. This convergence allows applications that were originally confined to traditional HPC systems to be ported to low-power embedded architectures. In this paper, we present our experience of porting the Filtered Back-projection Algorithm to a low-power, low-cost system-on-chip, the NVIDIA Tegra K1, which is based on a quad-core ARM CPU and an NVIDIA Kepler GPU. The Filtered Back-projection Algorithm is heavily used in 3D tomography reconstruction software. The port was done using several programming models (namely OpenMP and CUDA), and multiple versions of the application were developed to exploit both the SoC CPU and GPU. Performance was measured in terms of 2D slices (of a 3D volume) reconstructed per time unit and per energy unit. The results obtained with all the developed versions are reported and compared with those obtained on a typical x86 HPC node accelerated with a recent NVIDIA GPU. The best performance is achieved by combining the OpenMP version and the CUDA version of the algorithm. In particular, we found that only three Jetson TK1 boards, connected via Gigabit Ethernet, can reconstruct as many images per time unit as a traditional server while using one order of magnitude less energy. The results of this work can be applied, for instance, to the construction of an energy-efficient computing system for a portable tomographic apparatus.
ISBN: (print) 9781479989379
Tasking is a prominent parallel programming model. In this paper we conduct a first study into the feasibility of task-parallel execution at the CUDA grid level, rather than the stream/kernel level, for regular task graphs with fixed in-out dependencies, similar to those found in wavefront computational patterns, which makes the findings broadly applicable. We propose and evaluate three CUDA task progression algorithms, in which threadblocks cooperatively process the task graph, and discuss their performance in terms of tasking throughput, atomics and memory I/O overheads. Our initial results demonstrate a throughput of 38 million tasks/second on a Kepler K20X architecture.
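As a rough illustration of a fixed in-out dependency task graph of the wavefront kind, the following Python sketch runs a grid of tasks in which task (i, j) waits for its upper and left neighbours. It is a sequential model only, not one of the three CUDA algorithms from the paper; the names `run_wavefront`, `pending` and `ready` are hypothetical, and the counter decrement stands in for what would be an atomic operation shared between threadblocks.

```python
from collections import deque

def run_wavefront(rows, cols, task):
    """Run a rows x cols wavefront task graph in dependency order (sequential model)."""
    if rows == 0 or cols == 0:
        return []
    # Task (i, j) depends on (i-1, j) and (i, j-1) when those neighbours exist.
    pending = [[(i > 0) + (j > 0) for j in range(cols)] for i in range(rows)]
    ready = deque([(0, 0)])  # only the top-left task has no dependencies
    order = []
    while ready:
        i, j = ready.popleft()
        task(i, j)
        order.append((i, j))
        # Release the two successors; on a GPU this decrement would be atomic.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < rows and nj < cols:
                pending[ni][nj] -= 1
                if pending[ni][nj] == 0:
                    ready.append((ni, nj))
    return order
```

Calling `run_wavefront(3, 3, lambda i, j: None)` visits all nine tasks, and every task appears after both of its predecessors.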
ISBN: (print) 9783319271613; 9783319271606
Regular expression matching is widely used in content-aware applications, such as NIDS and protocol identification. However, wire-speed processing for large-scale pattern sets remains a great challenge in practice. Considering the low hit rates in NIDS, a compact and efficient pre-filter is proposed to filter out most normal traffic and leave only a small amount of suspicious traffic for further pattern matching. Experimental results show that the pre-filter achieves a large improvement in both space and time consumption thanks to its compact and efficient structure.
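The general idea of pre-filtering can be sketched as follows. The abstract does not describe the structure of the proposed pre-filter, so this example substitutes a simple literal-substring check; the rules in `RULES` and the function `match_payload` are hypothetical and serve only to show how a cheap test can spare most payloads the full regular expression match.

```python
import re

# Hypothetical rules: (literal that any match must contain, full regular expression).
RULES = [
    ("cmd.exe", re.compile(r"GET /.*cmd\.exe\?.*")),
    ("passwd", re.compile(r"\.\./(\.\./)*etc/passwd")),
]

def match_payload(payload):
    """Return indices of matching rules, running a regex only if its literal is present."""
    hits = []
    for idx, (literal, regex) in enumerate(RULES):
        if literal not in payload:  # cheap pre-filter: most normal traffic stops here
            continue
        if regex.search(payload):   # full match only for suspicious payloads
            hits.append(idx)
    return hits
```

With a low hit rate, most payloads fail the substring test for every rule and never reach `regex.search`.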
ISBN: (print) 9781467394734
Collaborative and peer-to-peer network-based models generate a large amount of data from students' learning tasks. We have proposed analyzing these data to tackle information security breaches in e-Learning, with trustworthiness models as a functional requirement. In this context, extracting and structuring students' activity data is computationally costly, as the amount of data tends to be very large and requires computational power beyond that of a single processor. For this reason, in this paper we propose a complete MapReduce and Hadoop application for processing learning management system log files.
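The map, shuffle and reduce phases behind such log processing can be modelled in a few lines of plain Python. This is a single-process sketch, not the Hadoop application from the paper: the log line layout (`timestamp student action`) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_line(line):
    """Map: emit (student, 1) for a line of the assumed form 'timestamp student action'."""
    parts = line.split()
    if len(parts) >= 3:  # silently skip malformed lines
        yield parts[1], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: sum the counts for one student."""
    return key, sum(values)

def count_events(lines):
    """Count log events per student using the three phases above."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

In a real deployment the map and reduce functions would run on different nodes and the framework would perform the shuffle; here they are chained in one process to show the data flow.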
ISBN: (print) 9781467394734
This paper presents an experimental evaluation of data mapping techniques in the shared memory of an embedded GPU. The evaluated technique, previously presented in the literature in other contexts, aims at partitioning an array across the physical banks of the shared memory so as to increase parallel accesses, resulting in appreciable gains in both performance and energy efficiency. The paper presents the experimental setup used to physically characterize the behaviour of the platform, allowing a validation and a closer understanding of the evaluated memory mapping technique.
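Why the mapping of an array onto banks matters can be seen with a small model. The sketch below assumes 32 banks of one word each, with word `w` stored in bank `w % 32`, and uses row padding, a well-known related trick, rather than the specific partitioning technique evaluated in the paper. All names are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, consecutive words in consecutive banks

def bank_of(word_index):
    """Bank holding a given word under the assumed interleaved layout."""
    return word_index % NUM_BANKS

def conflict_degree(word_indices):
    """Largest number of accesses hitting one bank (assumes all words are distinct)."""
    return max(Counter(bank_of(w) for w in word_indices).values())

def column_access(row_stride, column, num_threads=32):
    """Word indices touched when thread t reads element [t][column] of a 2D array."""
    return [t * row_stride + column for t in range(num_threads)]
```

Under these assumptions, `conflict_degree(column_access(32, 0))` is 32 because every access lands in the same bank, while padding each row to 33 words spreads the same column over all banks and gives a degree of 1.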
Authors: Xu, Zheng; Cao, Buyang
Tongji Univ, Sch Software Engn, Shanghai 201804, Peoples R China
Tongji Univ, China Intelligent Urbanizat Cocreat Ctr High Dens, Shanghai 200092, Peoples R China
ISBN: (print) 9783319271613; 9783319271606
Clustering analysis plays an important role in a wide range of fields, including data mining, pattern recognition and machine learning. In this paper, we present a parallel tabu search algorithm for clustering problems. A permanent tabu list is proposed to partition the solution space for parallelization. Moreover, this permanent tabu list also reduces the neighborhood space and constrains the selection of candidates. The proposed approach is evaluated by clustering a specific dataset, and the experimental results and speedups obtained show the efficiency of the parallel algorithm.
Authors: Ma, Kun; Yang, Bo
Univ Jinan, Shandong Prov Key Lab Network Based Intelligent C, Jinan 250022, Peoples R China
ISBN: (print) 9781467394734
Recent research focuses on data replication from relational tables to schema-free collections in a batch-processing manner. However, there are few publications on live data replication in real time. In this paper, we attempt to address this legacy issue with a new stream processing framework. The replication process consists of log-based change data capture and stream-based data replication. Data replication mappings are presented, and the proposed architecture of the stream processing framework, including column grouping, column merging and column versioning, is introduced to avoid data loss in case of failure. Finally, our experimental evaluation shows that the live data replication approach with the stream processing framework is more effective and efficient than current methods.
ISBN: (print) 9781479984909
Viewshed refers to the land area that is visible to an observer placed at a point on a terrain. Due to advances in remote sensing technologies, the volume of data is today beyond the capability of traditional GIS tools, and therefore new and fast algorithms become essential. In this paper we present an efficient implementation of the XDRAW algorithm [5] to quickly compute viewsheds on very large digital elevation models. We redesign the algorithm to make it I/O-efficient and compatible with modern SIMD architectures. Our implementation is able to compute viewsheds on digital elevation models at the rate of 10' points per second on an Intel quad-core CPU with AVX2 technology, which makes the algorithm suitable for real-time applications.
ISBN: (print) 9781467372183
Non-orthogonal multiple access (NOMA) is considered a promising multiple access scheme for 5G downlink transmission. In this paper, we first review the existing downlink NOMA with successive interference cancellation (SIC) and then highlight some of the critical performance-limiting factors related to SIC that can degrade the performance of NOMA. To alleviate the problems posed by SIC, we propose an alternative receiver structure for downlink NOMA based on parallel interference cancellation (PIC), along with some design considerations. The simulation results show that the proposed receiver outperforms the SIC-based receiver and hence is a promising receiver structure for future 5G downlink NOMA.
ISBN: (print) 9781450340212
In many application scenarios, content-based pub/sub systems are required to provide stringent service guarantees, such as reliable delivery and high performance in terms of throughput and low latency for event notification to interested subscribers. Matching algorithms play a critical role in content-based pub/sub systems. The aim of our work is the design and development of a parallel, scalable and high-performance content-based publish/subscribe system. We parallelize event processing using thread-based and multi-GPU approaches. We achieved low latency and high throughput when the pub/sub system was deployed on Apache Storm, a real-time event processing system. With the multi-GPGPU approach to event processing, the throughput gain and the reduction in matching time are nearly 48% and 40% respectively, compared to the earlier work in [1].