Recently, GPGPU has been widely adopted in the High Performance Computing (HPC) field. The limited global memory bandwidth poses a great challenge to many GPGPU programmers trying to exploit parallelism within the CPUGP...
The application of memristors in building hardware neural networks has attracted widespread interest, and may bring novel opportunities to neural computing. However, due to the limitation of programming precision, the c...
Proximity ranking according to end-to-end network distances (e.g., Round-Trip Time, RTT) can reveal detailed proximity information, which is important in network management and performance diagnosis in distributed sys...
OpenCL is an open heterogeneous programming framework. Although OpenCL programs are functionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. Typically, the use of OpenCL's local memory on multi-core/many-core CPUs may degrade rather than improve performance, because local-memory arrays no longer match well with the hardware and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that performs this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, accompanied by a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements are also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.
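The kind of rewrite described in this abstract can be illustrated with a small sketch. It is not the paper's tool chain: Python stands in for OpenCL C, the 3-point stencil is an arbitrary example, and the names `stencil_gpu_style` and `stencil_cpu_style` are hypothetical. The first function mimics a GPU-oriented kernel that stages a tile in local memory and synchronizes before computing; the second shows the coarsened form with the local-memory array and the barrier removed.

```python
def stencil_gpu_style(src, group_size):
    """Emulate a GPU-oriented kernel: stage a tile in 'local memory', then compute."""
    n = len(src)
    out = [0.0] * n
    for group_start in range(0, n, group_size):  # one iteration per work-group
        size = min(group_size, n - group_start)
        # Phase 1: work-items copy a tile plus one halo cell on each side,
        # clamping indices at the array boundaries.
        tile = [src[min(max(group_start + i - 1, 0), n - 1)]
                for i in range(size + 2)]
        # In OpenCL C, barrier(CLK_LOCAL_MEM_FENCE) would separate the phases.
        # Phase 2: each work-item reads its three neighbours from the tile.
        for lid in range(size):
            out[group_start + lid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0
    return out


def stencil_cpu_style(src):
    """Coarsened form: one loop over all elements, no tile and no barrier."""
    n = len(src)
    out = [0.0] * n
    for gid in range(n):
        left = src[max(gid - 1, 0)]
        right = src[min(gid + 1, n - 1)]
        out[gid] = (left + src[gid] + right) / 3.0
    return out


if __name__ == "__main__":
    data = [float(i * i % 7) for i in range(10)]
    print(stencil_gpu_style(data, 4) == stencil_cpu_style(data))
```

Both functions compute the same result; the second simply relies on the CPU cache hierarchy instead of an explicitly managed tile, which is the intuition behind dropping local-memory arrays on CPUs.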
This paper addresses the optimization of parallel simulators for large-scale parallel systems and applications. Such simulators are often based on parallel discrete event simulation with conservative or optimistic pro...
The Internet-based virtual computing environment (iVCE) is a novel network computing ***. The characteristics of growth, autonomy, and diversity of Internet resources present great challenges to resource sharing in iVCE. The DHT overlay (DHT for short) technique has various advantages such as high scalability, low latency, and desirable availability, and is thus an important approach to realizing efficient resource sharing. Topology construction is a key technique for structured overlays that realizes basic overlay functions including dynamic maintenance and message routing. In this paper, we first introduce the traditional techniques of DHT topology construction, focusing mainly on dynamic maintenance and message routing of typical DHTs, DHT indexing techniques for complex queries, and DHT grouping techniques for matching domain ***. We then present recent advances in DHT topology construction techniques in iVCE taking advantage of the characteristics of Internet resources. Finally, we discuss the future of DHT topology construction techniques.
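As a rough illustration of the two basic overlay functions named in this abstract, dynamic maintenance and message routing, here is a minimal ring-style DHT sketch. It is a generic consistent-hashing ring, not any iVCE-specific technique from the survey; the class `Ring`, the helper `ring_id`, and the node names are hypothetical, and lookups use a global sorted list rather than per-node routing tables.

```python
import bisect
import hashlib


def ring_id(key):
    """Map a string to a point on the identifier ring (full SHA-1 value)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class Ring:
    """Toy DHT: each key is owned by the first node clockwise from its id."""

    def __init__(self):
        self._ids = []     # sorted node identifiers
        self._names = {}   # node identifier -> node name

    def join(self, name):
        """Dynamic maintenance: add a node to the ring."""
        nid = ring_id(name)
        if nid not in self._names:
            bisect.insort(self._ids, nid)
        self._names[nid] = name

    def leave(self, name):
        """Dynamic maintenance: remove a node from the ring."""
        nid = ring_id(name)
        if self._names.get(nid) == name:
            self._ids.remove(nid)
            del self._names[nid]

    def lookup(self, key):
        """Routing target: the successor node of the key's identifier."""
        if not self._ids:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ids, ring_id(key))
        return self._names[self._ids[idx % len(self._ids)]]


if __name__ == "__main__":
    ring = Ring()
    for node in ("node-a", "node-b", "node-c"):
        ring.join(node)
    print(ring.lookup("some-resource"))
```

When a node leaves, only the keys it owned move to its successor; all other keys keep their owner, which is what keeps maintenance cost low as membership changes.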
Distributed social networks have emerged recently. Nevertheless, friend recommendation in distributed social networks has not been fully explored. We propose FDist, a distributed common-friend estimation scheme th...
Distributed online social networks (DOSN) have emerged recently. Nevertheless, friend recommendation in distributed social networks has not been fully explored. We propose BCE (Bloom Filter based Common-Friend Est...
Hierarchical text classification is an important task in many real-world applications. To build an accurate hierarchical classification system with many categories, usually a very large number of documents must be lab...
Data replication can be used to reduce bandwidth consumption and access latency in distributed systems where users require remote access to large data objects. In this paper, according to the intrinsic characterist...