In ultra-deep submicron technology, two of the paramount reliability concerns are soft errors and device aging. Although intensive studies have addressed these two challenges, most have so far treated them separately, th...
Current popular systems such as Hadoop and Spark cannot achieve satisfactory performance when running iterative big data applications, because they overlap computation and communication inefficiently. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design with better overlapping of computation and communication, implemented as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows that DataMPI-Iteration can achieve 9X-21X speedup over Apache Hadoop and 2X-3X speedup over Apache Spark for PageRank and K-means.
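For readers unfamiliar with the workload, the following is a minimal single-process sketch of the PageRank iteration that the abstract uses as its benchmark. It is not DataMPI-Iteration and does not model the shuffle or the event-driven pipeline; the function name `pagerank`, the damping factor 0.85, and the toy graph are assumptions chosen for illustration. The two commented phases mark where a distributed system would exchange data between iterations.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each node to a list of its out-neighbours.
    Every neighbour must also appear as a key (assumption of this sketch).
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Emit phase: each node sends an equal share of its rank along
        # every out-link. In a distributed run this is the data that
        # would be shuffled between workers.
        contribs = {v: 0.0 for v in nodes}
        dangling = 0.0
        for v in nodes:
            outs = links[v]
            if outs:
                share = rank[v] / len(outs)
                for u in outs:
                    contribs[u] += share
            else:
                # Nodes without out-links spread their rank uniformly.
                dangling += rank[v]
        # Combine phase: sum the received shares and apply damping.
        rank = {
            v: (1.0 - damping) / n + damping * (contribs[v] + dangling / n)
            for v in nodes
        }
    return rank


# Hypothetical three-page graph used only for illustration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

Each iteration depends on the complete output of the previous one, which is why the cost of moving the emitted shares between iterations dominates in systems that cannot overlap it with computation.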
Critical path selection is very important in delay testing. Critical paths found by conventional static timing analysis (STA) tools are inadequate to represent the real timing of the circuit, since neither the testabi...
With the increasing demand for and wide application of high-performance commodity multi-core processors, both the number and the scale of data centers are growing dramatically, bringing heavy energy consumption. Researchers and engineers have put much effort into reducing hardware energy consumption, but software is the true consumer of power and another key to making better use of energy. System software is critical to better energy utilization, because it is not only the manager of hardware but also the bridge and platform between applications and hardware. In this paper, we summarize some trends that can affect the efficiency of data centers and investigate the causes of software inefficiency. Based on these studies, we discuss the major technical challenges, and corresponding possible solutions, in attaining green system software with respect to programmability, scalability, efficiency, and software architecture. Finally, we briefly introduce some of our research progress on trusted, energy-efficient system software.
In DSM and nanometer technology, more and more new fault types will appear, which are difficult to predict and avoid. Applying fault-tolerant algorithms to achieve reliable on-chip communication is one of the m...
With the development of semiconductor technology, microprocessors become more and more susceptible to transient faults. Some proposed schemes support redundant execution of a program in a superscalar processor for fau...
With the growing scale of high-performance computing (HPC) systems, today and more so tomorrow, faults are a norm rather than an exception. HPC applications typically tolerate fail-stop failures under the stop-and-wai...
ISBN: 9781450382946 (print)
Convolution is the most time-consuming part of the computation of convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependency and the increase in the amount of model samples, convolution suffers from high overhead on data movement (i.e., memory access). This work provides comprehensive analysis and methodologies to minimize the communication for convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm that consists of several different sub-computations. Based on the proposed theory, we establish data movement lower bounds for the two main convolution algorithms in CNNs, namely direct convolution and the Winograd algorithm, which represent the direct and indirect implementations of a convolution, respectively. Next, derived from the I/O lower bound results, we design near I/O-optimal dataflow strategies for the two algorithms by fully exploiting data reuse. Furthermore, to push the performance of the near I/O-optimal dataflow strategies further, we propose an aggressive auto-tuning design based on I/O lower bounds that searches for an optimal parameter configuration for direct convolution and the Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experimental results on direct convolution and the Winograd algorithm show that our dataflow strategies with the auto-tuning approach can achieve about 3.32× performance speedup on average over cuDNN. In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, our auto-tuning method based on I/O lower bounds not only finds the optimal parameter configuration faster, but our solution also achieves higher performance than the optimal solution provided
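To make the distinction between the two algorithms concrete, here is a small sketch comparing direct convolution with the standard one-dimensional Winograd minimal-filtering form F(2,3), which produces two outputs of a three-tap filter using four multiplications instead of six. This is a textbook illustration, not the paper's GPU dataflow or auto-tuner; the function names `direct_conv_1d` and `winograd_f23` and the sample values are assumptions made for this example.

```python
def direct_conv_1d(d, g):
    """Direct (sliding-window) correlation of a signal d with a filter g."""
    r = len(g)
    return [
        sum(d[i + k] * g[k] for k in range(r))
        for i in range(len(d) - r + 1)
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): two outputs from four inputs and a 3-tap filter.

    Uses 4 multiplications (m1..m4) where the direct form needs 6.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform; in a CNN this is computed once per filter.
    u_sum = (g0 + g1 + g2) / 2
    u_diff = (g0 - g1 + g2) / 2
    # Element-wise products in the transformed domain.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * u_sum
    m3 = (d2 - d1) * u_diff
    m4 = (d1 - d3) * g2
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


# Illustrative values only.
signal = [1.0, 2.0, 3.0, 4.0]
taps = [1.0, 2.0, 3.0]
direct_out = direct_conv_1d(signal, taps)
winograd_out = winograd_f23(signal, taps)
```

The Winograd form trades multiplications for extra additions and intermediate values, which changes how much data must move through fast memory; that difference is why the two algorithms call for separate lower-bound analyses.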
General sparse matrix-matrix multiplication (SpGEMM) is an essential building block in a number of applications. In our work, we fully utilize GPU registers and shared memory to implement an efficient and load balance...
Ray tracing can produce high-quality images; however, its use has been limited by its high demands on computational power and memory bandwidth, especially in the case of satellite imagery. In this ...