Moving object detection is important in traffic video analysis, and a growing number of algorithms are being applied to it. Most of these algorithms are time-consuming and cannot satisfy real-time...
To provide timely results for ‘Big Data Analytics’, it is crucial to satisfy deadline requirements for MapReduce jobs in production environments. In this paper, we propose a deadline-oriented task scheduling approac...
This paper proposes a classification approach for hyperspectral images (HSI) using a local-receptive-fields-based kernel extreme learning machine. The extreme learning machine (ELM) has drawn increasing attention in the pattern recognition field due to its simplicity, speed, and good generalization ability. A kernel method is often used to improve ELM's performance; the result is known as kernel ELM. The local receptive field concept originates from research in neuroscience. Considering the local correlations of spectral features, it is promising to improve the performance of HSI classification by combining local receptive fields with kernel ELM. Experimental results on the Pavia University dataset confirm the effectiveness of the proposed HSI classification method.
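For readers unfamiliar with kernel ELM, the following is a minimal sketch of the commonly cited closed-form solution, in which the output weights solve (K + I/C) beta = t for a kernel matrix K and regularization constant C. It is not the method of the paper above: the local receptive field stage is omitted, and the RBF kernel, the function names (`rbf`, `solve`, `train_kelm`, `predict_kelm`), the parameter values, and the toy XOR data are all assumptions chosen for illustration.

```python
import math


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def train_kelm(X, t, C=1000.0, gamma=1.0):
    """Return beta solving (K + I/C) beta = t, where K[i][j] = rbf(X[i], X[j])."""
    n = len(X)
    A = [[rbf(X[i], X[j], gamma) + (1.0 / C if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, t)


def predict_kelm(x, X, beta, gamma=1.0):
    """Decision value f(x) = sum_i beta_i * k(x, x_i)."""
    return sum(b * rbf(x, xi, gamma) for b, xi in zip(beta, X))


# Toy two-class problem (XOR), standing in for real spectral feature vectors.
X_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
t_train = [-1.0, 1.0, 1.0, -1.0]
beta = train_kelm(X_train, t_train)
predictions = [predict_kelm(x, X_train, beta) for x in X_train]
print(predictions)
```

The sign of each decision value gives the predicted class. A multi-class problem such as HSI classification would typically use one target column per class and solve the same system once per column.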
One of the most significant challenges for routing protocols in mobile networks is coping with the unpredictable motion and the unreliable behaviour of mobile nodes. In this paper, we present a hierarchical r...
Stragglers can delay jobs and seriously reduce cluster efficiency. Much research has addressed this problem, including Blacklist[8], speculative execution[1, 6], and Dolly[8]. In this paper, we put forward ...
A new parallel, fully pipelined accelerator implemented on a field-programmable gate array (FPGA) for the Schnorr–Euchner sphere decoding (SE–SD) algorithm is presented in this paper. We first transform the serial SE–...
With the rapid growth of open source software, choosing software from many alternatives has become a great challenge. Traditional ranking approaches mainly focus on the characteristics of the software itself, such...
ISBN: (Print) 9781509032068
As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low energy consumption becomes increasingly important. The GPU is a widely used accelerator in most recently deployed supercomputers. It uses massive multithreading to hide long latencies and has high energy efficiency. In contrast to their strong computing power, GPUs have few on-chip resources, with several MB of fast on-chip memory storage per SM (Streaming Multiprocessor). GPU caches exhibit poor efficiency because of the mismatch between the throughput-oriented execution model and the cache hierarchy design. Because of this severe shortage of on-chip memory, the benefit of the GPU's high computing capacity is dramatically reduced by poor cache performance, which limits system performance and energy efficiency. In this paper, we put forward a locality-protected scheme that makes full use of data locality within the fixed cache capacity. We present a Locality Protected method based on instruction PC (LPP) to improve GPU performance. First, we use a PC-based collector to gather the reuse information of each cache line. Given this dynamic reuse information, an intelligent cache allocation unit (ICAU) coordinates it with the LRU (Least Recently Used) replacement policy to find the cache line with the least locality for eviction. The results show that LPP provides a speedup of up to 17.8% and an average improvement of 5.5% over the baseline method.
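The eviction idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the LPP hardware design: the abstract does not specify how the collector scores reuse or how the ICAU combines it with LRU, so the saturating per-PC counter, the "lowest score first, LRU as tie-break" rule, and every name below (`CacheSet`, `pc_score`, `count_hits`) are hypothetical.

```python
from collections import defaultdict


class CacheSet:
    """One cache set. With use_pc=True the victim is the line whose inserting
    PC has shown the least reuse (LRU breaks ties); with use_pc=False the
    victim is chosen by plain LRU."""

    def __init__(self, ways, use_pc=True, max_score=3):
        self.ways = ways
        self.use_pc = use_pc
        self.max_score = max_score
        self.lines = {}                    # tag -> [inserting_pc, last_use_tick, reused]
        self.pc_score = defaultdict(int)   # hypothetical PC-indexed reuse table
        self.tick = 0

    def access(self, pc, tag):
        """Simulate one access; return True on a hit, False on a miss."""
        self.tick += 1
        line = self.lines.get(tag)
        if line is not None:
            owner = line[0]
            # A hit is evidence of locality for the PC that inserted the line.
            self.pc_score[owner] = min(self.max_score, self.pc_score[owner] + 1)
            line[1] = self.tick
            line[2] = True
            return True
        if len(self.lines) >= self.ways:
            self._evict()
        self.lines[tag] = [pc, self.tick, False]
        return False

    def _evict(self):
        def priority(tag):
            owner, last_use, _ = self.lines[tag]
            score = self.pc_score[owner] if self.use_pc else 0
            return (score, last_use)       # lowest reuse score first, then oldest

        victim = min(self.lines, key=priority)
        owner, _, reused = self.lines.pop(victim)
        if not reused:
            # The line died without reuse: lower the score of its inserting PC.
            self.pc_score[owner] = max(0, self.pc_score[owner] - 1)


def count_hits(cache, trace):
    return sum(cache.access(pc, tag) for pc, tag in trace)


# Toy trace: PC 0x10 keeps reusing line "A"; PC 0x20 streams through new lines.
trace = [(0x10, "A"), (0x10, "A")]
for i in range(4):
    trace += [(0x20, f"S{2 * i}"), (0x20, f"S{2 * i + 1}"), (0x10, "A")]

lpp_hits = count_hits(CacheSet(ways=2, use_pc=True), trace)
lru_hits = count_hits(CacheSet(ways=2, use_pc=False), trace)
print(lpp_hits, lru_hits)
```

On this toy trace the PC-aware policy keeps the reused line resident, while plain LRU lets the streaming accesses push it out on every round. The trace is constructed to show the effect and says nothing about the speedups reported in the abstract.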
ISBN: (Print) 9781509053827
We develop a simulator for 3D tissue of the human cardiac ventricle with a physiologically realistic cell model and deploy it on the supercomputer Tianhe-2. In order to attain the full performance of the heterogeneous CPU-Xeon Phi design, we use carefully optimized codes for both devices and combine them to obtain suitable load balancing. Using a large number of nodes, we are able to perform tissue-scale simulations of the electrical activity and calcium handling in millions of cells, at a level of detail that tracks the states of trillions of ryanodine receptors. We can thus simulate arrhythmogenic spiral waves and other complex arrhythmogenic patterns which arise from calcium handling deficiencies in human cardiac ventricle tissue. Due to extensive code tuning and parallelization via OpenMP, MPI, and SCIF/COI, large-scale simulations of 10 heartbeats can be performed in a matter of hours. Test results indicate excellent scalability, thus paving the way for detailed whole-heart simulations in future generations of leadership-class supercomputers.
OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may hurt rather than help performance, because local-memory arrays no longer match the hardware well and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that performs this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.