Networks are ubiquitous in modern communication systems. In signal processing, towers are modeled as nodes (vertices), and if two towers communicate, an arc (edge) is placed between them; more precisely, they are said to be adjacent. The ...
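As a minimal illustration of this modeling step, an undirected graph can be stored as an adjacency map; the tower names and links below are hypothetical examples, not taken from the paper:

```python
# Minimal sketch: towers as vertices, communication links as edges.
# Tower names and link pairs are hypothetical.
towers = ["A", "B", "C", "D"]
links = [("A", "B"), ("B", "C"), ("A", "C")]

# Undirected adjacency map: u and v are adjacent iff they communicate.
adjacency = {t: set() for t in towers}
for u, v in links:
    adjacency[u].add(v)
    adjacency[v].add(u)

print(sorted(adjacency["A"]))   # ['B', 'C'] -- towers adjacent to A
print("D" in adjacency["A"])    # False -- D has no edge to A
```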
ISBN (Print): 9781728133201
High Efficiency Video Coding (HEVC) creates the conditions for cost-effective video transmission and storage, but its inherent computational complexity calls for efficient parallelization techniques. This paper provides HEVC encoders with a holistic parallelization scheme that exploits parallelism at the data, thread, and process levels at the same time. The proposed scheme is implemented in the practical Kvazaar open-source HEVC encoder. It makes Kvazaar exploit parallelism at three levels: 1) Single Instruction Multiple Data (SIMD) optimized coding tools at the data level; 2) Wavefront Parallel Processing (WPP) and Overlapped Wavefront (OWF) parallelization strategies at the thread level; and 3) distributed slice encoding on multi-computer systems at the process level. Our results show that the proposed process-level parallelization scheme increases the coding speed of Kvazaar by 1.86x on two computers and up to 3.92x on five computers, with +0.19% and +0.81% coding losses, respectively. Exploiting all three parallelism levels on a five-computer setup gives almost a 25x speedup over a non-parallelized single-core implementation.
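As a sketch of the thread-level WPP dependency pattern mentioned above (illustrative only, not Kvazaar's actual implementation; grid sizes and helper names are assumptions), each coding tree unit (CTU) waits for its left and upper-right neighbors before being encoded, which lets consecutive rows run concurrently with a two-column stagger:

```python
# Illustrative WPP dependency sketch: CTU (x, y) may start once its left
# neighbor (x-1, y) and upper-right neighbor (x+1, y-1) are finished.
from concurrent.futures import ThreadPoolExecutor
import threading

COLS, ROWS = 8, 4          # hypothetical CTU grid dimensions
done = [[threading.Event() for _ in range(COLS)] for _ in range(ROWS)]

def encode_ctu(x: int, y: int) -> None:
    if x > 0:
        done[y][x - 1].wait()                      # left neighbor
    if y > 0:
        done[y - 1][min(x + 1, COLS - 1)].wait()   # upper-right neighbor
    # ... actual CTU encoding would happen here ...
    done[y][x].set()

def encode_row(y: int) -> None:
    for x in range(COLS):
        encode_ctu(x, y)

with ThreadPoolExecutor(max_workers=ROWS) as pool:
    for y in range(ROWS):
        pool.submit(encode_row, y)
```

The two-column stagger bounds WPP's parallel efficiency near frame boundaries, which is what the OWF extension addresses by letting finished rows begin work on the next frame.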
ISBN (Print): 9781665496407
A large number of reads generated by next generation sequencing platforms contain many repetitive subsequences. Effectively localizing and identifying genomic regions containing repetitive subsequences contributes to subsequent genomic data analysis. To accelerate the alignment between large-scale short reads and a reference genome with many repetitive subsequences, this paper develops a compact de Bruijn graph based short-read alignment algorithm on a distributed parallel computing platform. The algorithm uses Resilient Distributed Datasets (RDDs) to perform calculations in memory, executes the broadcast method to distribute the short reads and reference genome to the computing nodes to reduce data communication time on the cluster system, and sets the number of RDD partitions to optimize the performance of the parallel alignment algorithm. Experimental results on real datasets show that, compared with the compact de Bruijn graph based sequential short-read alignment algorithm, our distributed parallel alignment algorithm achieves good acceleration while obtaining the same overall alignment accuracy, and that, compared with existing distributed parallel alignment algorithms, it completes the alignment between large-scale short reads and a reference genome with highly repetitive subsequences more quickly.
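The broadcast-plus-partitioning pattern described above can be sketched in PySpark as follows; the toy k-mer index stands in for the paper's compact de Bruijn graph, and all data, sizes, and helper names are illustrative assumptions:

```python
# Illustrative PySpark sketch of broadcasting an index and tuning partitions.
from pyspark import SparkContext

K = 31  # hypothetical k-mer size

def build_index(reference: str, k: int) -> dict:
    # Toy stand-in for a compact de Bruijn graph: k-mer -> positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def align_read(read: str, index: dict, k: int):
    # Anchor the read at positions where its first k-mer occurs.
    return [(read, pos) for pos in index.get(read[:k], [])]

sc = SparkContext(appName="cdbg-align-sketch")
reference = "ACGT" * 64            # hypothetical reference sequence
reads = ["ACGTACGT" * 4] * 1000    # hypothetical short reads

# Broadcast the index once so every executor reuses it instead of
# re-shipping it per task; numSlices tunes the RDD partition count.
bindex = sc.broadcast(build_index(reference, K))
hits = (sc.parallelize(reads, numSlices=8)
          .flatMap(lambda r: align_read(r, bindex.value, K))
          .count())
print(hits)
sc.stop()
```

Broadcasting the index once per job avoids re-serializing it for every task, which is the communication saving the abstract describes.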
Summary form only given, as follows. The complete presentation was not made available for publication as part of the conference proceedings. Software systems must satisfy rapidly increasing demands imposed by emerging applications. For example, new AI applications, such as autonomous driving, require quick responses to an environment that is changing continuously. At the same time, software systems must be fault-tolerant in order to ensure a high degree of availability. As it stands, however, developing these new distributed software systems is extremely challenging even for expert software engineers due to the interplay of concurrency, asynchronicity, and failure of components. The objective of our research is to develop reusable solutions to the above challenges by means of novel programming models and frameworks that can be used to build a wide range of applications. This talk reports on our work on the design, implementation, and foundations of programming models and languages that enable the robust construction of large-scale concurrent and distributed software systems.
We present SIMULATeQCD, HotQCD's software for performing lattice QCD calculations on GPUs. Started in late 2017 and intended as a full replacement of the previous single GPU lattice QCD code used by the HotQCD col...
ISBN (Print): 9781665417396
Against the backdrop of rapid economic development and globalization, this article investigates the effectiveness of the decision tree model for banks' decision making in granting personal credit loans, in contrast to previous traditional bank credit analysis models. By collecting user data from lenders and analyzing it with a decision tree model, this paper finds that the decision tree model can effectively process and analyze user information, thus increasing prediction accuracy and reducing risk, and ultimately helping banks make decisions.
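A minimal sketch of such a workflow, using scikit-learn's DecisionTreeClassifier on synthetic stand-in data (the paper's actual features and dataset are not specified here, so everything below is an assumption for illustration):

```python
# Illustrative decision tree credit-scoring sketch on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Hypothetical features: income, debt ratio, credit history length.
X = rng.normal(size=(1000, 3))
# Hypothetical label: 1 = repaid, 0 = default (synthetic rule plus noise).
y = ((X[:, 0] - X[:, 1] + 0.5 * X[:, 2]
      + rng.normal(scale=0.5, size=1000)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
# A depth limit keeps the tree interpretable and curbs overfitting.
clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```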
ISBN (Print): 9781665433266
This paper proposes a real-time power system simulation framework that is capable of simulating steady state and electromechanical transients of power systems with sub-millisecond time resolution. The framework can integrate power system component models packaged as reusable Functional Mockup Units (FMUs) to flexibly create power system simulations without the need to recreate new models for different power systems. The integration of individual components is based on a novel model decomposition method, which enables FMU reuse in different system contexts, as well as parallel simulation execution on multi-core machines. Furthermore, the paper proposes methods to optimize the allocation of components to cores and shows that the framework can simulate a medium-voltage distributed electrical grid of about 20 components in real time on a commodity multi-core machine.
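One simple way to allocate component models to cores, sketched below for illustration, is greedy longest-processing-time (LPT) balancing; this is not necessarily the paper's own allocation method, and the component names and per-step costs are hypothetical:

```python
# Illustrative LPT sketch: assign each FMU to the least-loaded core.
import heapq

def allocate(components: dict[str, float], n_cores: int) -> list[list[str]]:
    """Greedy longest-processing-time balancing of components onto cores."""
    cores = [(0.0, i) for i in range(n_cores)]   # (load, core id) min-heap
    heapq.heapify(cores)
    assignment = [[] for _ in range(n_cores)]
    # Place the most expensive components first, each on the lightest core.
    for name, cost in sorted(components.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(cores)
        assignment[core].append(name)
        heapq.heappush(cores, (load + cost, core))
    return assignment

# Hypothetical per-step execution costs (ms) for ~20 grid components.
fmus = {f"feeder_{i}": 0.02 + 0.01 * (i % 5) for i in range(20)}
print(allocate(fmus, n_cores=4))
```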
Driven by the rapid advances in Artificial Intelligence of Things (AIoT), billions of mobile and IoT devices are connected to the internet, generating huge quantities of data at the network edge. Meanwhile, traditional analytics approaches such as cloud computing and centralized AI are unable to manage these massively distributed heterogeneous data, primarily because 1) moving a tremendous amount of data across the network poses severe challenges to network capacity; 2) cloud-based analytics can result in prohibitively high transmission delays; and 3) transporting data containing private information over the network raises serious privacy concerns and may not even be possible under regulations like the GDPR. Accelerated by the success of AI and IoT technologies, there is an urgent need to push AI to the network edge to tap the full potential of big data.
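One widely used approach for pushing AI to the edge, though the passage above does not commit to a specific algorithm, is federated learning: models are trained where the data lives, and only parameters cross the network. A minimal FedAvg-style sketch on a toy linear model, with all data and sizes invented for illustration:

```python
# Illustrative federated averaging sketch: raw data never leaves a device.
import numpy as np

def make_device(rng: np.random.Generator, w_true: np.ndarray, n: int = 50):
    """Create one device's private dataset; it stays on the 'device'."""
    X = rng.normal(size=(n, 2))
    return X, X @ w_true + rng.normal(scale=0.1, size=n)

def local_update(w: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, steps: int = 10) -> np.ndarray:
    """A few steps of local least-squares gradient descent."""
    for _ in range(steps):
        w = w - lr * (2 * X.T @ (X @ w - y) / len(y))
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])       # hypothetical ground-truth model
devices = [make_device(rng, w_true) for _ in range(5)]

w_global = np.zeros(2)
for _ in range(20):
    # Each round: devices train locally, then only weights are averaged.
    w_global = np.mean([local_update(w_global, X, y) for X, y in devices],
                       axis=0)
print(w_global)  # converges toward w_true without centralizing raw data
```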
ISBN (Digital): 9781665497473
ISBN (Print): 9781665497480
This paper describes the application of the code generated by the CAMPARY software to accelerate the solving of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision arithmetic. For the blocked Householder QR and the back substitution, of interest are those dimensions at which teraflop performance is attained. The other interesting question is the cost overhead factor that appears each time the precision is doubled. Experimental results are reported on five different NVIDIA GPUs, with a particular focus on the P100 and the V100, both capable of teraflop performance. Thanks to the high Compute to Global Memory Access (CGMA) ratios of multiple double arithmetic, teraflop performance is already attained running the double double QR on 1,024-by-1,024 matrices, both on the P100 and the V100. For the back substitution, the dimension of the upper triangular system must be as high as 17,920 to reach one teraflop on the V100 in quad double precision, and then taking only the times spent by the kernels into account. The lower performance of the back substitution in small dimensions does not prevent teraflop performance of the solver at dimension 1,024, as the time for the QR decomposition dominates. In doubling the precision from double double to quad double and from quad double to octo double, the observed cost overhead factors are lower than the factors predicted by the arithmetical operation counts. This observation correlates with the increased performance for increased precision, which can again be explained by the high CGMA ratios.
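For background (this is not CAMPARY's code), the core of double double arithmetic is the error-free transformation two_sum, which represents a + b exactly as a rounded sum plus a residual; chaining such transformations roughly doubles the working precision, at the cost-overhead factors the paragraph above measures:

```python
# Background sketch of double double addition built on error-free two_sum.
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's error-free addition: s + e equals a + b exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def dd_add(x: tuple[float, float],
           y: tuple[float, float]) -> tuple[float, float]:
    """Add two double double numbers (hi, lo); ~106 bits of significand."""
    s, e = two_sum(x[0], y[0])     # exact sum of the high parts
    e += x[1] + y[1]               # fold in the low parts
    return two_sum(s, e)           # renormalize into (hi, lo)

# 1e16 + 1 cannot be held in a single double, but a double double keeps it:
hi, lo = dd_add((1e16, 0.0), (1.0, 0.0))
print(hi, lo)  # 1e+16 1.0
```

CAMPARY generates analogous, heavily optimized CUDA code for such multiple double operations; the Python version only illustrates the arithmetic idea.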