ISBN: 9781450313162 (print)
In heterogeneous systems that include CPUs and GPUs, data transfers between these components play a critical role in determining application performance. Software pipelining is a common approach to mitigating the overhead of those transfers. In this paper we investigate advanced software-pipelining optimizations for the double-precision general matrix multiplication (DGEMM) algorithm running on a heterogeneous system that includes ATI GPUs. Our approach decomposes the DGEMM workload at a finer granularity and hides the latency of CPU-GPU data transfers to a higher degree than previous approaches in the literature. We implement our approach in a five-stage software-pipelined DGEMM and analyze its performance on a platform including x86 multi-core CPUs and an ATI Radeon™ HD5970 GPU that has two Cypress GPU chips on board. Our implementation delivers 758 GFLOPS (82% floating-point efficiency) when it uses only the GPU, and 844 GFLOPS (80% efficiency) when it distributes the workload across both CPU and GPU. We analyze the performance of our optimized DGEMM as the number of GPU chips employed grows from one to two, and the results show that resource contention on the PCIe bus and in host memory are limiting factors. Copyright 2012 ACM.
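The efficiency figures quoted above imply a theoretical peak for each configuration. The following is a minimal arithmetic sketch, assuming that "efficiency" means achieved throughput divided by theoretical peak throughput; the function name `implied_peak_gflops` is hypothetical, and because the percentages in the abstract are rounded, the results are approximate.

```python
def implied_peak_gflops(achieved_gflops: float, efficiency: float) -> float:
    """Return the peak throughput implied by an achieved rate and an efficiency.

    Assumes efficiency = achieved / peak, so peak = achieved / efficiency.
    """
    return achieved_gflops / efficiency


# Figures taken from the abstract; the percentages are rounded there.
gpu_only_peak = implied_peak_gflops(758.0, 0.82)  # GPU-only configuration
hybrid_peak = implied_peak_gflops(844.0, 0.80)    # CPU + GPU configuration

# The difference is a rough estimate of the peak contributed by the CPUs.
cpu_peak_estimate = hybrid_peak - gpu_only_peak

print(f"Implied GPU-only peak: {gpu_only_peak:.0f} GFLOPS")
print(f"Implied CPU+GPU peak:  {hybrid_peak:.0f} GFLOPS")
print(f"Estimated CPU share:   {cpu_peak_estimate:.0f} GFLOPS")
```

Under this assumption, the lower efficiency of the hybrid run still corresponds to a higher absolute throughput, because the combined peak is larger.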
In recent years, artificial intelligence and the automotive industry have developed rapidly, and autonomous driving has gradually become a focus of the industry. In road networks, the proximity detection problem refers to detecting in real time whether two moving objects are close to each other. However, the battery life and computing capability of mobile devices are limited in practice, which results in high latency and energy consumption. It is therefore difficult to determine the proximity relationship between mobile users with low latency and low energy consumption. In this article, we aim to find a tradeoff between latency and energy consumption. We formalize the computation offloading problem based on mobile edge computing (MEC) as a constrained multiobjective optimization problem (CMOP) and use NSGA-II to solve it. The simulation results demonstrate that NSGA-II can find the Pareto set, which effectively reduces latency and energy consumption. In addition, the large number of solutions in the Pareto set gives us more choices for the offloading decision according to the actual situation.
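To illustrate what a Pareto set means for the latency-energy tradeoff described above, the following is a minimal sketch of non-dominated filtering. It is not NSGA-II itself and not the authors' implementation; the candidate (latency, energy) pairs are made-up values, and the function names are hypothetical. Both objectives are assumed to be minimized.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latency, energy), both to be minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical offloading decisions, each evaluated as (latency, energy).
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0), (8.0, 9.0)]
front = pareto_front(candidates)
print(front)
```

Each point that survives the filter is a different compromise between latency and energy, which is why a larger Pareto set offers more offloading choices.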
Thread Level Speculation (TLS) is a technique that aims to boost the performance of sequential programs running on Chip Multiprocessors (CMPs) by automatically parallelizing them. It exempts programmers from the heavy t...
Sub-Doppler absorption laser spectroscopy of the 728 nm transition from the 5D5/2 state to the 6F state of cesium, with a linewidth near 10 MHz, is performed experimentally for the first time, using indirect pumping from the ground state 6S1/2 to the 7P3/2 state by a 455.5 nm diode laser. With the 455.5 nm diode laser as an indirect pump, several excited states are populated through spontaneous decay from the 7P state. We implement sub-Doppler absorption laser spectroscopy at 728 nm from the 5D5/2 state to the 6F state once Cs atoms in a thermal glass cell have decayed to the 5D5/2 state. Due to the velocity transfer effect, the hyperfine structure of 5D5/2 shows a mixed and complicated pattern, but a very clear structure, when the 455.5 nm pump laser is counter-propagating (or co-propagating) with the 728 nm probe laser.
As a powerful analysis tool of Petri nets, reachability trees are fundamental for systematically investigating many characteristics such as boundedness, liveness and reversibility. This work proposes a method to gener...
Many applications meet certain programming patterns like pipeline, fork-join, do-all etc. While tools such as OS threads and OpenMP allow programmers only to express task or data parallelism, special support for progr...
Matter-wave interferometers with spin quantum states are attractive in quantum manipulation and precision measurements. Here, five spatial interference patterns corresponding to the full spin states are observed in each run of the experiment, by the combination of the Majorana transition according to the exponential modulation of the magnetic field pulse decline curve and radio frequency coupling among multiple magnetic *** to the realization of two Majorana transitions, the interference fringe for the magnetic field insensitive state also has a higher contrast. After spatially overlapping the full magnetic sub-state interference patterns dozens of times in consecutive experimental measurements, clear fringes are still observed, indicating the great stability of the relative phases of different components. This indicates the potential to achieve an interferometer with multiple spin clocks.
ISBN: 1595934804; 9781595934802 (print)
In this paper, we present a method to improve structural modeling based on conserved domain clusters and structure-anchored alignments. We first construct a template library of structural clusters for all conserved sequence domains. Then, for each cluster, we build a profile using the structure and sequence information. Finally, we use the profile and structural alignments as anchors to increase the alignment accuracy between a query and its templates. Our preliminary results show that this method can be used for partial structure prediction, with better quality, for a majority of known protein sequences. Copyright 2007 ACM.
ISBN: 9783037851579 (print)
With the rapid development of network technology, network-based humanoid robot technology is also progressing gradually and steadily. The system described in this article is based on a C/S (client/server) architecture: the server is responsible for recording control messages and relaying information across the network, while the client performs real-time control of the real humanoid robot through remote interaction. To interact with one another, clients must first log in to the server. The server returns the information of all registered users to the client process, after which the client user can see which users are online and select one for remote robot interaction. When a user performs an operation, the client program displays the robot control results in real time through a virtual robot in the virtual robot laboratory.
Many NoSQL (Not Only SQL) databases have been proposed to store and query huge amounts of data. Some of them, such as BigTable, PNUTS, and HBase, can be modeled as distributed ordered tables (DOTs). Many additional indexing techniques have been presented to support queries on non-key columns for DOTs. However, there has been no comprehensive analysis or comparison of these techniques, which makes it difficult for users to select or propose a proper indexing technique for a given workload. This paper proposes a taxonomy based on six indexing issues to classify indexing techniques on DOTs and provides a comprehensive review of the state-of-the-art techniques. Based on the taxonomy, we propose a performance model named QSModel to estimate the query time and storage cost of these techniques, and we run experiments on a practical workload from Tencent to evaluate this model. The results show that the maximum error rates of the query time and storage cost are 24.2% and 9.8%, respectively. Furthermore, we propose IndexComparator, an open source project that implements representative indexing techniques. Users can therefore select the best-fit indexing technique based on both theoretical analysis and practical experiments.