Sparse triangular solve (SpTRSV) is a vital component in various scientific applications, and numerous GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSV is currently the mainstream algorithm ...
ISBN (print): 9798350342543
Graph processing is increasingly adopted to solve problems that span many application domains, including scientific computing, social networks, and big-data analytics. These applications present particular features (huge working sets and irregular scalability) that make the default Linux scheduler, which adopts a time-sharing policy to provide fairness, perform poorly when multiple graph applications are co-located on the same processor. This work focuses on maximizing processor utilization, which is a major concern of current data centers. To this end, we propose AFAIR, a flexible scheduling policy that allocates multiple graph applications on the same processor and assigns a fraction of the cores exclusively to each application instead of sharing them. Moreover, AFAIR dynamically adds cores to and removes cores from the running applications, adapting the number of threads used for parallel execution to balance memory load. This allows AFAIR to achieve almost perfect fairness, 95% on average.
ISBN (print): 9783031160929; 9783031160912
Modern neural networks require long training to reach decent performance on massive datasets. One common approach to speed up training is model parallelization, where large neural networks are split across multiple devices. However, different device placements of the same neural network lead to different training times. Most existing device placement solutions treat the problem as sequential decision-making, traversing neural network graphs and assigning their neurons to different devices. This work studies the impact of neural network graph traversal orders on device placement. In particular, we empirically study how different graph traversal orders of neural networks lead to different device placements, which in turn affect the training time of the neural network. Our experimental results show that the best graph traversal order depends on the type of neural network and the features of its computation graph. In this work, we also provide recommendations on choosing effective graph traversal orders in device placement for various neural network families to improve the training time in model parallelization.
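To make the idea of traversal order concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a hypothetical toy computation graph (`GRAPH`) and a deliberately naive stand-in placement rule (`place_in_chunks`, which splits the visiting order into contiguous chunks, one per device). It only illustrates that breadth-first, depth-first, and topological traversals visit the same nodes in different orders, so a sequential placer sees them differently.

```python
from collections import deque

# Hypothetical toy computation graph: node -> list of successors.
GRAPH = {
    "input": ["conv1", "conv2"],
    "conv1": ["add"],
    "conv2": ["relu"],
    "relu": ["add"],
    "add": ["output"],
    "output": [],
}


def bfs_order(graph, start):
    """Breadth-first visiting order from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def dfs_preorder(graph, start):
    """Depth-first (preorder) visiting order from `start`."""
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push successors reversed so they are visited in listed order.
        stack.extend(reversed(graph[node]))
    return order


def topological_order(graph):
    """Kahn's algorithm: a node is emitted only after all its predecessors."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for succ in successors:
            indegree[succ] += 1
    ready = deque(node for node in graph if indegree[node] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in graph[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return order


def place_in_chunks(order, num_devices):
    """Naive stand-in placer: contiguous chunks of the order go to each device."""
    chunk = -(-len(order) // num_devices)  # ceiling division
    return {node: index // chunk for index, node in enumerate(order)}


bfs = bfs_order(GRAPH, "input")
dfs = dfs_preorder(GRAPH, "input")
topo = topological_order(GRAPH)
bfs_placement = place_in_chunks(bfs, 2)
dfs_placement = place_in_chunks(dfs, 2)
```

With two devices, the breadth-first order puts `conv2` on device 0 and `add` on device 1, while the depth-first order does the opposite. A learned placer would be far more sophisticated, but it is exposed to the same ordering effect.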
With the rapid development of Internet services and the Internet of Things (IoT), many studies focus on operator allocation to enhance the DSPAs’ (data stream processing applications) performance and resource utiliza...
The rapid development of wireless networks makes it more convenient for people to enjoy high quality multimedia. However, video applications are throughput-demanding, and relatively, radio resource always seems insuff...
ISBN (digital): 9783319682105
ISBN (print): 9783319682105; 9783319682099
Nowadays, the rapid growth of data across the Internet has provided sufficient labeled data to train deep structured artificial neural networks. While deeper structured networks bring about significant precision gains in many applications, they also pose an urgent demand for higher computation capacity at the expense of power consumption. To this end, various FPGA-based deep neural network accelerators have been proposed for higher performance and lower energy consumption. However, the development cycle of an FPGA application is much longer than that of a CPU or GPU application. Although FPGA vendors such as Altera and Xilinx have released OpenCL frameworks to ease programming, tuning OpenCL code for desirable performance on FPGAs is still challenging. In this paper, we look into the OpenCL implementation of a Convolutional Neural Network (CNN) on FPGA. By analysing the execution behavior of a CPU/GPU-oriented version on FPGA, we identify the causes of the performance difference between FPGA and CPU/GPU and locate the performance bottlenecks. Based on our analysis, we put forward a corresponding optimization method focusing on external memory transfers. We implement a prototype system on an Altera Stratix V A7 FPGA, which achieves a considerable 4.76x speedup over the original version. To the best of our knowledge, this implementation outperforms most previous OpenCL implementations on FPGA by a large margin.
ISBN (digital): 9783319395777
ISBN (print): 9783319395777; 9783319395760
Window functions are a sub-class of analytical operators that allow data to be handled in a derived view of a given relation, while taking into account their neighboring tuples. Currently, systems miss parallelization opportunities, which become especially relevant for Big Data because the data is naturally partitioned. We present a shuffling technique to improve the parallel execution of window functions when data is naturally partitioned and the query holds a partitioning clause that does not match the natural partitioning of the relation. We evaluated this technique with a non-cumulative ranking function and were able to reduce data transfer among parallel workers by 85% compared to a naive approach.
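The mismatch described above can be illustrated with a small sketch. This is a generic hash-style shuffle, not the technique proposed in the paper, and every name in it is hypothetical: rows are `(store_id, product_id, sales)` tuples, workers hold rows by `store_id % 2` (the assumed natural partitioning), and the query ranks rows per `product_id`. Rows must first be rerouted so each window partition sits on a single worker; the sketch counts how many rows cross workers, which is the cost a smarter shuffle aims to reduce.

```python
from collections import defaultdict


def shuffle_by_key(workers, key_index):
    """Reroute each row to worker (key % n); return the new layout and rows moved."""
    n = len(workers)
    shuffled = [[] for _ in range(n)]
    moved = 0
    for src, rows in enumerate(workers):
        for row in rows:
            dst = row[key_index] % n
            shuffled[dst].append(row)
            if dst != src:
                moved += 1  # this row had to travel to another worker
    return shuffled, moved


def rank_within_partitions(rows, key_index, order_index):
    """Local equivalent of RANK() OVER (PARTITION BY key ORDER BY value)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    ranked = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda r: r[order_index])
        rank, previous = 0, None
        for position, row in enumerate(ordered, start=1):
            if row[order_index] != previous:  # ties share the same rank
                rank, previous = position, row[order_index]
            ranked.append(row + (rank,))
    return ranked


# Hypothetical data, naturally partitioned by store_id % 2.
workers = [
    [(0, 1, 50), (0, 2, 30), (2, 1, 70)],  # worker 0: even store ids
    [(1, 1, 50), (1, 2, 10), (3, 2, 30)],  # worker 1: odd store ids
]

# The query partitions by product_id (index 1), so rows must be reshuffled.
shuffled, rows_moved = shuffle_by_key(workers, key_index=1)
results = [rank_within_partitions(rows, key_index=1, order_index=2) for rows in shuffled]
```

In this toy data, four of the six rows change workers before any ranking can start, which shows why a partitioning clause that disagrees with the natural partitioning makes data transfer the dominant cost.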
ISBN (digital): 9783319395777
ISBN (print): 9783319395777; 9783319395760
A key component in large-scale distributed analytical processing is shuffling, the distribution of data to multiple nodes such that the computation can be done in parallel. In this paper we describe the design and implementation of a communication middleware to support data shuffling for executing multi-stage analytical processing operations in parallel. The middleware relies on RDMA (Remote Direct Memory Access) to provide basic operations to asynchronously exchange data among multiple machines. Experimental results show that the RDMA-based middleware can reduce the cost of communication operations in parallel analytical processing tasks by 75% compared with a sockets-based middleware.
Cloud computing is an Internet-based resource sharing system in which virtualized resources are provided over the Internet. Cloud computing refers to a class of systems and applications that employ distributed res...
computing Grid in this paper is a Grid computing environment that supplies applications which run in a local computing site only, without any modification or adaptation for running globally in the Grid computing envir...