As an essential mathematical operation, GEneral Matrix Multiplication (GEMM) plays a vital role in many applications, such as high-performance computing and machine learning. In practice, GEMM performance is limited by the matrix dimensions and the diversity of GPU hardware architectures; when dealing with batched, irregular, and small matrices, GEMM efficiency is usually poor. A common approach is to segment the matrix into multiple tiles and exploit parallelism between workgroups on the GPU to compute the results. However, previous works consider only tile size and inter-workgroup parallelism, ignoring the low computational efficiency and poor hardware resource utilization caused by workload differences between wavefronts. To address these issues, we propose a load-balanced batch GEMM acceleration method consisting of a multi-thread kernel design and an efficient tiling algorithm. The multi-thread kernel design addresses workload imbalance between wavefronts in different workgroups, and the efficient tiling algorithm chooses the optimal tiling scheme with a new thread-level parallelism calculation method to achieve load-balanced task allocation. Finally, comparative experiments were conducted on two GPU platforms, AMD and NVIDIA. Experimental results indicate that the proposed method outperforms previous methods.
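The tile-based decomposition the abstract describes can be sketched on the CPU: each output tile of each matrix in the batch is an independent unit of work, mirroring how a GPU kernel maps tiles to workgroups. This is a minimal NumPy illustration, not the paper's kernel; the tile size of 4 is a hypothetical choice.

```python
import numpy as np

def tiled_batch_gemm(As, Bs, tile=4):
    """CPU sketch of tile-based batched GEMM.

    Each (i0, j0) output tile of each batch entry is an independent
    unit of work; on a GPU these units would be distributed across
    workgroups. NumPy slicing clips tiles at matrix edges, which is
    how irregular (non-multiple-of-tile) sizes are handled here.
    """
    batch, m, k = As.shape
    _, _, n = Bs.shape
    Cs = np.zeros((batch, m, n), dtype=As.dtype)
    for b in range(batch):                    # one small GEMM per batch entry
        for i0 in range(0, m, tile):          # tile rows of C
            for j0 in range(0, n, tile):      # tile columns of C
                for k0 in range(0, k, tile):  # accumulate over K tiles
                    Cs[b, i0:i0+tile, j0:j0+tile] += (
                        As[b, i0:i0+tile, k0:k0+tile]
                        @ Bs[b, k0:k0+tile, j0:j0+tile]
                    )
    return Cs
```

The imbalance the paper targets is visible even in this sketch: edge tiles of irregular matrices carry less work than interior tiles, so a naive one-tile-per-workgroup mapping leaves some wavefronts underutilized.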
Due to the complex braiding process and long development cycle of the hexagonal three-dimensional braided stent, a MATLAB-based computer-aided braiding method for the stent is proposed to speed up the development process. First, an oblique coordinate system for the chassis and a polar coordinate system for the chassis unit are constructed to precisely coordinate the carrier's movements on the chassis. Subsequently, an iterative formula delineating the trajectory of the carrier is introduced; it translates the entire braiding process into the positional coordinates of the carrier on the chassis and the yarn heights on the mandrel at each stage. Based on the characteristics of the braiding process and the stent's structure, the stent is divided into pressing and twisting sections. The interwoven pattern for both sections is determined by establishing the basic tiling form, solving the yarn interweaving sequence, judging the interweaving type, and computing the number of kinks. Finally, considering the stent's dimensional parameters and the yarn's interwoven pattern, the spatial curve equations for the yarns in both the pressing and twisting sections are formulated. By concatenating these equations for each section, the three-dimensional trajectory equations and a complete solid model of the stent are derived. A comparative analysis of dimensions and braiding pattern between the three-dimensional solid model and the physical stent preform verifies the accuracy and fidelity of the model generated by the computer-aided braiding method.
This paper describes a new neural network for structures particularly useful for Quantitative Structure-Activity Relationship (QSAR) and Quantitative Structure-Property Relationship (QSPR) applications. The algorithm was conceived to improve the performance of ANN-based software when relatively small datasets have to be processed. Encouraging results were achieved in the analysis of a relatively small class of inhibitors of the angiotensin-converting enzyme (ACE), taken as a probe for our purposes. A huge amount of data is available for ACE inhibition, but only 45 molecules were found to be of interest for the design of triple ligands capable of simultaneously inhibiting ACE, neutral endopeptidase (NEP), and endothelin-converting enzyme (ECE), which may be of interest for therapeutic applications. The implementation of this algorithm proved to be a valuable solution to one of the major problems encountered in applying QSAR/QSPR methods to molecular design, and drug design in particular: datasets of known structures with the relevant biological properties often contain fewer than a hundred, or at most a few hundred, elements, whereas ANN-based approaches typically work well only when trained on datasets at least an order of magnitude larger. For comparison with the approach described here, other commonly used QSAR models were developed using different algorithms available within the WEKA package, some of which are based on neural networks. The comparison clearly shows a better performance of the models obtained with neural networks for structures in general, and with the algorithm proposed here in particular. (C) 2014 Published by Elsevier B.V.
Many practical applications include matrix operations as essential procedures, and recent studies of matrix operations rely on parallel processing to reduce calculation delays. Because these operations are highly data intensive, many studies have investigated work-distribution techniques and data access latency to accelerate algorithms. However, previous studies have not adequately considered hardware architectural features, although these greatly affect the performance of matrix operations. Thus, the present study considers the architectural characteristics that affect the performance of matrix operations on real multicore processors. We use matrix multiplication, LU decomposition, and Cholesky factorization as the test applications, which are well-known data-intensive mathematical algorithms in various fields. We argue that applications access matrices only in a particular direction, and we propose that the canonical data layout is the optimal matrix data layout compared with the block data layout. In addition, a tiling algorithm is utilized to increase temporal data locality in multilevel caches and to balance the workload as evenly as possible in multicore environments. Our experimental results show that applications using the canonical data layout with tiling achieve an 8.23% faster execution time and a 3.91% lower last-level cache miss rate compared with applications using the block data layout. (C) 2013 Elsevier B.V. All rights reserved.
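The layout distinction at the heart of this abstract can be made concrete. In the canonical (row-major) layout a matrix is stored row by row; in the block data layout each tile is stored contiguously. The following NumPy sketch converts between the two so the difference is visible; the helper name and tile size are illustrative, not from the paper.

```python
import numpy as np

def to_block_layout(A, tile=4):
    """Reorder a canonical (row-major) matrix into block data layout.

    The result has shape (m//tile, n//tile, tile, tile): indexing
    [i0, j0] yields one contiguous tile, which is exactly what the
    block data layout stores back to back in memory. Assumes `tile`
    divides both dimensions, as block layouts typically require.
    """
    m, n = A.shape
    return (A.reshape(m // tile, tile, n // tile, tile)
             .transpose(0, 2, 1, 3)   # gather the tile axes together
             .copy())                 # make each tile contiguous
```

The paper's finding is that this reordering is unnecessary overhead when the access direction is fixed: tiling the loops over the canonical layout already captures the cache reuse, without paying for the layout transformation.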
Network coding improves the communication rate and saves bandwidth by performing special coding at the sending or intermediate nodes. However, encoding/decoding at the nodes creates computational overhead on large input data, which causes coding delays. Previous works therefore proposed progressive methods that hide the decoding delay within the waiting time. However, network speeds have since increased greatly, and progressive schemes are no longer the most efficient decoding method. Thus, we present a non-progressive decoding algorithm that can be parallelized more aggressively than progressive network coding; by utilizing multi-core processors, it outweighs the hidden-decoding-time advantage of progressive methods. Moreover, the block algorithm used in the non-progressive decoder helps reduce cache misses. In our experiments, the scheme, which relies on matrix inversion and multiplication, shows a 46.0% improvement in execution time and an 89.2% reduction in last-level cache misses compared to the progressive method on multi-core systems. (C) 2012 Elsevier Ltd. All rights reserved.
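The non-progressive idea can be sketched as follows: instead of eliminating coefficients packet by packet as data arrives (progressive Gaussian elimination), wait for all coded packets, invert the coefficient matrix once, and recover the payload with one blocked matrix multiplication, whose blocks parallelize naturally. This is a hedged illustration over the reals; practical network coding works over a finite field such as GF(2^8), and the function name and block size are assumptions.

```python
import numpy as np

def nonprogressive_decode(G, Y, block=64):
    """Sketch of non-progressive network-coding decoding.

    Given coded packets Y = G @ X with a full-rank coefficient matrix
    G (all packets received), recover X by inverting G once and then
    applying a blocked multiplication. Real decoders operate over
    GF(2^8); floating point stands in here for illustration only.
    """
    Ginv = np.linalg.inv(G)        # one-shot inversion, no per-packet elimination
    n, m = Y.shape
    X = np.zeros_like(Y)
    for j0 in range(0, m, block):  # each column block is independent work,
        X[:, j0:j0+block] = Ginv @ Y[:, j0:j0+block]  # good for cores and caches
    return X
```

Each column block touches a cache-sized slice of Y and reuses Ginv, which is the cache-miss reduction the block algorithm is after; the blocks can also be handed to separate cores with no synchronization.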