The many-accelerator architecture, composed mostly of general-purpose cores and accelerator-like function units (FUs), has become a strong alternative to homogeneous chip multiprocessors (CMPs) because of its superior power efficiency. However, the emerging many-accelerator processor shows a much more complicated memory access pattern than general-purpose processors (GPPs), because the abundant on-chip FUs tend to generate highly concurrent memory streams with distinct locality and bandwidth demands. The disordered memory streams issued by diverse accelerators interfere with one another and cannot be handled efficiently by the conventional main memory interface, which provides an inflexible data fetching mode. Unlike traditional DRAM memory, our proposed Aggregation Memory System (AMS) adapts to the characteristics of the memory streams from different FUs: it provides the FUs with different data fetching sizes and protects their memory access locality by intelligently interleaving their data across memory devices through sub-rank binding. Moreover, AMS can batch requests that have no sub-rank conflict into a single read burst with our optimized memory scheduling policy. Experimental results from trace-based simulation show that AMS brings both a conspicuous performance boost and energy savings.
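The batching idea in the abstract above can be illustrated with a small sketch. This is only one plausible reading, not the AMS scheduling policy itself: the request representation (an id paired with a sub-rank index) and the function name `batch_conflict_free` are assumptions made for illustration, and the greedy arrival-order scan is a stand-in for whatever ordering the real scheduler uses.

```python
def batch_conflict_free(queue):
    """Greedily pick requests, in arrival order, that target distinct sub-ranks.

    `queue` is a list of (request_id, subrank) tuples (a hypothetical format).
    Returns (batch, remaining): the batch could be issued as one read burst,
    while the remaining requests wait for a later batch.
    """
    batch, remaining = [], []
    used_subranks = set()
    for request_id, subrank in queue:
        if subrank in used_subranks:
            # Another request in this batch already occupies the sub-rank.
            remaining.append((request_id, subrank))
        else:
            used_subranks.add(subrank)
            batch.append((request_id, subrank))
    return batch, remaining
```

Calling the function repeatedly on the `remaining` list drains the queue one conflict-free batch at a time.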
The high-density server is characterized by low power, low volume, and high computational density. With the rising use of high-density servers in data-intensive and large-scale web applications, a high-performance and cost-efficient intra-server interconnection network is required. Most state-of-the-art high-density servers adopt a fully-connected intra-server network to attain high network performance. Unfortunately, this solution costs too much because of the high node degree. In this paper, we exploit the theoretically optimal Moore graph to interconnect the chips within a server. Considering the suitable size of applications, a 50-vertex Moore graph, called the Hoffman-Singleton graph, is adopted. In practice, multiple chips should be integrated onto one processor board, which means that the original graph should be partitioned into homogeneous connected subgraphs. However, the existing partition scheme does not consider this problem and thus generates heterogeneous subgraphs. To address this problem, we propose two equivalent-partition schemes for the Hoffman-Singleton graph. In addition, a logic-based minimal routing mechanism, which is both time and area efficient, is proposed. Finally, we compare the proposed network architecture with its counterparts, namely the fully-connected, Kautz and Torus networks. The results show that our proposed network achieves performance competitive with the fully-connected network at a cost close to that of the Torus network.
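For readers unfamiliar with the Hoffman-Singleton graph, the sketch below builds it with the well-known pentagon-pentagram construction and checks the properties that make it a Moore graph: 50 vertices, degree 7, and diameter 2, which meets the Moore bound 1 + 7 + 7·6 = 50. It does not reproduce the paper's partition schemes or routing logic; the function names and the vertex labelling are choices made here for illustration.

```python
from collections import deque
from itertools import product

def hoffman_singleton():
    """Build the graph from five pentagons P_h and five pentagrams Q_i."""
    adj = {v: set() for v in product("PQ", range(5), range(5))}

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for h, j in product(range(5), repeat=2):
        link(("P", h, j), ("P", h, (j + 1) % 5))  # pentagon edges
        link(("Q", h, j), ("Q", h, (j + 2) % 5))  # pentagram edges
    for h, i, j in product(range(5), repeat=3):
        # Vertex j of pentagon h joins vertex h*i + j (mod 5) of pentagram i.
        link(("P", h, j), ("Q", i, (h * i + j) % 5))
    return adj

def diameter(adj):
    """Longest shortest-path length, found by a BFS from every vertex."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph is disconnected")
        worst = max(worst, max(dist.values()))
    return worst
```

A diameter of 2 with only seven links per chip is what lets such a network approach fully-connected performance at far lower node degree.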
In particle transport simulations, radiation effects are often described by the discrete ordinates (Sn) form of the Boltzmann equation. In each ordinate direction, the solution is computed by sweeping the radiation flux across the grid. A parallel Sn sweep on an unstructured grid can be explicitly modeled as a topological traversal of an equivalent directed acyclic graph (DAG), which is a data-driven algorithm. Its traditional design using the MPI model results in irregular communication of massive numbers of short messages, which cannot be handled efficiently by the MPI runtime. Meanwhile, in high-end HPC cluster systems, multicore has become the standard processor configuration of a single node. The traditional data-driven algorithm for Sn sweeps has not exploited the potential advantages of multi-threading on multicore shared memory. These advantages, however, as we shall demonstrate, could provide an elegant solution to the problems of the previous MPI-only design. In this paper, we give a new design of data-driven parallel Sn sweeps using hybrid MPI and Pthread programming, namely Sweep-H, to exploit the hierarchical parallelism of processes and threads. With special multi-threading techniques and a vertex scheduling policy, Sweep-H achieves more efficient communication and better load balance. We further present an analytical performance model for Sweep-H to reveal why and when it is advantageous over its former MPI counterpart. On a 64-node multicore cluster system with 12 cores per node, 768 cores in total, Sweep-H achieves nearly linear scalability for moderate problem sizes, and better absolute performance than the previous MPI algorithm on more than 16 nodes (by up to two times speedup on 64 nodes).
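The data-driven view of a sweep can be sketched as a plain topological traversal: a cell becomes ready once all of its upwind neighbours have been computed. The sketch below is a serial illustration using Kahn's algorithm, not Sweep-H itself; it omits MPI, threads, and the vertex scheduling policy, and the names `sweep_order` and `upwind_edges` are assumptions introduced here.

```python
from collections import deque

def sweep_order(num_cells, upwind_edges):
    """Return an order in which cells can be swept for one ordinate direction.

    `upwind_edges` is a list of (u, v) pairs meaning cell v needs the
    outgoing flux of cell u before it can be computed.
    """
    downstream = [[] for _ in range(num_cells)]
    pending = [0] * num_cells  # number of upwind cells not yet computed
    for u, v in upwind_edges:
        downstream[u].append(v)
        pending[v] += 1

    ready = deque(c for c in range(num_cells) if pending[c] == 0)
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)  # the flux of `cell` would be computed here
        for d in downstream[cell]:
            pending[d] -= 1
            if pending[d] == 0:  # all upwind data has arrived
                ready.append(d)

    if len(order) != num_cells:
        raise ValueError("dependency graph contains a cycle")
    return order
```

In a parallel setting, the `ready` queue is the natural point at which work could be handed to several threads, since every cell in it is independent of the others.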
3D point cloud data, which are produced by various 3D sensors such as LIDAR and stereo cameras, have been widely deployed by industry leaders such as Google, Uber, Tesla, and Mobileye for mobile robotic applications such as autonomous driving and humanoid robots. Point cloud data, which contain reliable depth information, can provide accurate location and shape characteristics for scene understanding tasks such as object recognition and semantic segmentation. However, deep neural networks (DNNs) that directly consume point cloud data are particularly computation-intensive, because they must not only perform multiplication-and-accumulation (MAC) operations but also search for neighbors in the irregular 3D point cloud data. Solving such a task in real time goes beyond the capabilities of general-purpose processors as the scales of both point cloud data and DNNs increase from application to application. We present the first accelerator architecture that dynamically configures the hardware on-the-fly to match the computation of both neighbor point search and MAC computation for point-based DNNs. To facilitate neighbor point search and reduce its computation cost, a grid-based algorithm is introduced to search for neighbor points within a local region of grids. Evaluation results based on scene recognition and segmentation tasks show that the proposed design achieves 16.4× higher performance and consumes 99.95% less energy than an NVIDIA Tesla K40 GPU baseline in point cloud scene understanding applications.
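The general idea of a grid-based neighbor search can be sketched as follows: points are bucketed into uniform grid cells, and a query inspects only the cells near its own cell instead of scanning the whole cloud. This is a generic uniform-grid search written for illustration, not the paper's algorithm or hardware mapping; the function names and the `cell` size parameter are assumptions.

```python
import math
from collections import defaultdict
from itertools import product

def build_grid(points, cell):
    """Bucket point indices by the integer grid cell that contains them."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / cell) for c in p)].append(idx)
    return grid

def neighbors_within(points, grid, cell, query, radius):
    """Return sorted indices of points within `radius` of `query`."""
    reach = math.ceil(radius / cell)  # how many cells the radius can span
    home = tuple(math.floor(c / cell) for c in query)
    found = []
    for offset in product(range(-reach, reach + 1), repeat=3):
        key = tuple(h + o for h, o in zip(home, offset))
        for idx in grid.get(key, ()):
            # Exact distance check on the few candidates in nearby cells.
            d2 = sum((a - b) ** 2 for a, b in zip(points[idx], query))
            if d2 <= radius * radius:
                found.append(idx)
    return sorted(found)
```

Choosing `cell` close to the search radius keeps the inspected region to a 3×3×3 block of cells, which is the usual trade-off between bucket size and the number of buckets visited.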
This article investigates an adaptive prescribed performance path-following control algorithm for rotor-assisted vehicles, incorporating reinforcement learning (RL) to execute energy-saving cruising missions. For obta...
Dynamic positron emission tomography (PET) parametric imaging typically requires a 60-minute acquisition period, causing patient discomfort and reducing clinical efficiency. This study explores the feasibility of gene...
Detecting traffic signs effectively under low-light conditions remains a significant challenge. To address this issue, we propose YOLO-LLTS, an end-to-end real-time traffic sign detection algorithm specifically design...
Service-oriented future internet architecture (SOFIA) is a clean-slate network architecture. In SOFIA, a service request is processed mainly through service resolution and network resource allocation. To realize network resource allocation, we draw on the idea of network virtualization and propose resource scheduling virtualization. In resource scheduling virtualization, a service request is abstracted as a virtual network (VN), and network resources are allocated by mapping the VN onto the physical network. Resource scheduling virtualization provides centralized resource scheduling control within an autonomous system (AS) and achieves better controllability than distributed schemes. It also supports multi-site selection. In addition, we propose a collection of resource scheduling algorithms based on the maximum resource tree (MRT) that adapt to different scenarios. According to the simulation results, the proposed algorithms perform well on key metrics such as acceptance ratio, revenue, cost and utilization. Moreover, the simulation results reveal that our algorithms are more efficient than the traditional ones.
Because of its potential applications in agriculture, environmental monitoring and other fields, the wireless underground sensor network (WUSN) has been studied increasingly extensively in recent years. The most important difference between WUSN and the terrestrial wireless sensor network (WSN) is the channel characteristics, which determine its design methodology. In this paper, the propagation characteristics of electromagnetic (EM) waves in the near-surface WUSN are analyzed, and the corresponding path loss model is given. In addition, the influence of the human ankle on the channel characteristics of near-surface WUSN is investigated through electromagnetic theory analysis, simulation and experiment. A novel path loss model for near-surface WUSN that takes the interference of the human ankle into consideration is proposed. It is verified that the presence of a human above the WUSN system may cause additional attenuation of the near-surface WUSN signal, which propagates as a lateral wave along the ground. Moreover, the relation between the attenuation and the operating frequency is deduced, which provides a reference for extending the frequency band used in WUSN.
Temperature is the key factor in determining the practical performance and energy production prediction of solar photovoltaic (PV) panels. Transparent solar PV panels, in contrast to traditional opaque solar PV panels...