ISBN: (print) 9783540680673
We investigate the practical merits of a parallel priority queue through its use in the development of a fast and work-efficient parallel shortest path algorithm, originally designed for an EREW PRAM. Our study reveals that an efficient implementation on a real supercomputer requires considerable effort to reduce the communication cost (communication is assumed in theory to take constant time). It turns out that the most crucial part of the implementation is the mapping of the logical processors to the physical processing nodes of the supercomputer. We achieve the required efficient mapping through a new graph-theoretic result of independent interest: an algorithm for computing a Hamiltonian cycle on a directed hypertorus. No such algorithm was previously known for directed hypertori. Our Hamiltonian cycle algorithm allows us to considerably reduce the communication cost and thus improve the overall performance of our implementation.
ISBN: (print) 9783540680673
This paper examines the memory performance of vector-parallel and scalar-parallel computing platforms across five applications from three scientific areas: electromagnetic analysis, CFD/heat analysis, and seismology. Our evaluation results show that the vector platforms achieve high computational efficiency and hence significantly outperform the scalar platforms in these application areas. We conducted exhaustive experiments and quantitatively evaluated representative scalar and vector platforms using real applications, from the viewpoint of system designers and developers. The results demonstrate that, on the vector platforms, the ratio of memory bandwidth to floating-point operation rate needs to reach 4 bytes/flop to preserve computational performance by hiding memory access latencies with pipelined vector operations. We also confirm that a sufficient number of memory banks to handle strided memory accesses increases execution efficiency. On the scalar platforms, the cache hit rate needs to be almost 100% to achieve high computational efficiency.
ISBN: (print) 9781665435741
Image processing is a promising domain for a wide range of applications, and it demands heavy computing power and memory bandwidth as image resolution increases. Graphics processing units (GPUs) are widely used for image processing algorithms, but their powerful programmability comes at the cost of high hardware overhead. Moreover, a GPU consumes much energy accessing data in high-capacity register files, making it hard to deploy on wearable devices. Achieving a low-power, efficient architecture with low hardware overhead remains challenging. In this paper, we propose a programmable image processing architecture (PIPArch) that exploits the spatial locality in images to save energy while achieving high performance. We also design an instruction set architecture (ISA) to control PIPArch. By supporting multiple parallel pipelines, we keep the hardware utilization of PIPArch high. We evaluate the proposed PIPArch by developing a cycle-accurate simulator and running typical image processing algorithms. Compared to an NVIDIA Tesla V100 GPU, PIPArch achieves a 23.63x speedup.
ISBN: (print) 0769524990
As the amount of available silicon resources on a chip increases, we have seen the advent of ever-increasing parallel resources integrated on-chip. Many architectures use these resources as individually controllable, parallel processing elements. While such architectures excel at parallel applications, they seldom support legacy single-threaded applications. In this work, we propose using parallel resources to execute legacy codes with acceptable performance on parallel architectures that have a drastically different instruction set, through an all-software parallel dynamic binary translation engine. This engine spatially implements different portions of a superscalar processor across distinct parallel elements, thus exploiting the pipeline parallelism inherent in a superscalar. This virtual microarchitecture allows the allocation of silicon resources between different superscalar units to be changed in software, which is not possible when special-purpose physical resources are built. We propose building dynamically reconfigurable architectures that inspect the current virtual machine configuration along with the dynamic instruction stream and change the configuration to best suit the program's needs at runtime. As an example of dynamic reconfiguration, we built an x86-to-Raw parallel translation engine in which tiles dedicated to translation can be traded for tiles dedicated to the memory system.
ISBN: (print) 9781538637906
The Internet of Things (IoT) is an emerging technology that promises a new era of the Internet by seamlessly merging the physical and digital worlds into a single intelligent ecosystem. This goal is achieved by connecting a large number of smart objects from the physical world, such as smartphones, sensors, robots, and connected cars, to the Internet. With the advent of the IoT, users and controllers need efficient mechanisms to remotely control smart actuators from smartphones and IoT devices. This need arises particularly in industrial Cyber-Physical Systems, where industrial processes must be supervised. However, the complexity of IoT environments, in particular the number of connected objects and their resource limitations, makes this task very difficult. In this paper, we tackle the problem of secure remote control of IoT actuators. We propose a distributed, lightweight, fine-grained access control scheme based on Attribute-Based Encryption and a one-way hash chain. We conducted a security analysis and formal verification using AVISPA. The results demonstrate that our scheme is secure against various attacks. Moreover, simulation results demonstrate the scalability and efficiency of our solution, which substantially reduces energy consumption and computation costs.
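The abstract names a one-way hash chain as one building block but does not describe its construction. The following is a minimal sketch of a generic one-way hash chain, not the authors' scheme: the choice of SHA-256, the seed value, and the names `build_hash_chain` and `verify_token` are all assumptions made for illustration.

```python
import hashlib


def build_hash_chain(seed: bytes, length: int) -> list[bytes]:
    """Return [h(seed), h(h(seed)), ...] with `length` elements.

    SHA-256 is an assumed hash function; the abstract does not name one.
    """
    chain = []
    value = seed
    for _ in range(length):
        value = hashlib.sha256(value).digest()
        chain.append(value)
    return chain


def verify_token(token: bytes, last_accepted: bytes) -> bool:
    """A token is valid if hashing it once yields the last accepted value."""
    return hashlib.sha256(token).digest() == last_accepted


# Hypothetical usage: the final element acts as a public anchor, and
# tokens are revealed in reverse order of generation.
chain = build_hash_chain(b"hypothetical-secret-seed", 5)
anchor = chain[-1]
first_token = chain[-2]
is_valid = verify_token(first_token, anchor)
```

Because the hash is one-way, a party that knows only the anchor can check each revealed token with a single hash evaluation but cannot derive the next token itself, which is why such chains are attractive on resource-constrained devices.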
ISBN: (print) 9781665435741
Counterfactual regret minimization (CFR) is one of the most widely used iterative optimization algorithms. It is used to solve complex imperfect-information games. This paper introduces the Global Counterfactual Regret Minimization Local Update (GCFR+) to solve task planning problems in a crowdsourcing environment. We designed a parallel mechanism to alleviate possible parallel conflicts in actual crowdsourcing scenarios and to increase personal rewards. First, we tested the performance of GCFR+ on data sets of different scales. Then we compared the result with that of the decision model with the parallel mechanism. The comparison shows that the parallel mechanism significantly improves the efficiency of the decision model. Finally, we proved that GCFR+, unlike general CFR, is applicable to decision tree pruning in imperfect-information games.
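For readers unfamiliar with CFR, the step at its core is regret matching: each action is played with probability proportional to its positive cumulative regret. The sketch below shows only that standard step from vanilla CFR; it is not GCFR+, whose update rule the abstract does not specify, and the function name `regret_matching` is an assumption for illustration.

```python
def regret_matching(cumulative_regrets: list[float]) -> list[float]:
    """Turn cumulative regrets into a strategy (probability per action).

    Actions with positive regret get probability proportional to that
    regret; if no action has positive regret, play uniformly at random.
    """
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    n = len(cumulative_regrets)
    return [1.0 / n] * n


# Hypothetical regrets for three actions at one information set.
strategy = regret_matching([2.0, -1.0, 2.0])
```

In full CFR this step is applied at every information set on every iteration, and the average of the resulting strategies over iterations is what converges toward an equilibrium in two-player zero-sum games.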
ISBN: (print) 9781538637906
Convolutional Neural Networks (CNNs) have become increasingly powerful in the computer vision domain, where they achieve state-of-the-art accuracy. Despite this, it is generally difficult to apply CNNs on mobile platforms. The client-server paradigm is a straightforward way to deploy CNNs on mobile phones, but studies have shown that it suffers from serious problems, such as privacy leaks. Recently, researchers have focused on using heterogeneous local processors (e.g., GPUs, CPUs) to accelerate CNN inference. Utilizing all available local processors can achieve the highest performance, but it may be energy-inefficient. Unlike previous work, this paper focuses on the energy efficiency of CNN-based mobile applications. We present an adaptive strategy that computes the energy efficiency of all local processors and then selects the energy-efficient combination of device processors to perform CNN inference in parallel. The strategy is implemented on the ODROID platform, where the evaluation results show that our approach provides 3.67x higher energy efficiency with only 9.7% performance degradation on average compared with the greedy strategy, which tries to use all available local processors.
ISBN: (print) 9781665435741
Alternating direction method of multipliers (ADMM) is an efficient algorithm for solving large-scale machine learning problems in a distributed environment. To make full use of the hierarchical memory model in modern high-performance computing systems, this paper implements a hybrid MPI/OpenMP parallelization of the asynchronous ADMM algorithm (AH-ADMM). The AH-ADMM algorithm updates local variables in parallel with OpenMP threads and exchanges information between MPI processes, which relieves memory and communication pressure by replacing multiprocessing with multithreading. Furthermore, for the SVM problem, the AH-ADMM algorithm speeds up the calculation of subproblems through an efficient parallel optimization strategy. This paper effectively combines features of both algorithm design and the programming model. Experiments on the Ziqiang4000 high-performance cluster demonstrate that the AH-ADMM algorithm scales better and runs faster than existing distributed ADMM algorithms implemented in pure MPI. AH-ADMM can reduce the communication overhead by up to 91.8% and increase the convergence rate by up to 36x. For large datasets, AH-ADMM scales well on the cluster with over 129 cores.
ISBN: (print) 9798350311990
Functional, memory-managed parallel languages (FMPLs) are a recent innovative approach to shared-memory parallel programming. Despite their rising prevalence in other areas, FMPLs have yet to gain traction in HPC. In this work, we explore the utility of FMPLs for HPC by re-implementing the NAS Parallel Benchmarks in an FMPL. For this study, we ported the benchmarks to the Parallel ML language. We discuss the advantages and disadvantages of using Parallel ML for HPC applications based on our development experience. We compare the performance of our Parallel ML implementation to the existing C/OpenMP version. The FMPL implementations are 1.02x-5.76x slower than the OpenMP versions. Our positive development experience, combined with some competitive performance results, suggests that FMPLs have the potential to become a viable choice for HPC applications. We conclude by describing our future work to automatically manage distributed memory within an FMPL, creating a compelling new programming model for HPC.
ISBN: (print) 9781509036820
FFT has been a classic computation engine for numerous applications. The bandwidth-intensive nature of FFT has capped its performance on off-the-shelf parallel machines, which are bandwidth-limited, and has forced application researchers to seek alternatives that are easier to speed up, even when they are inferior to FFT. But what if effective support of FFT is feasible? Using FFT as an example, we examine the impact that the adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster.