OpenMP is a widely used parallel programming model for traditional multi-core processors. It is generally used to exploit fine-grained parallelism through a multithreaded model. The stream programming model is a newer parallel programming model for stream architectures. OpenMP resembles the stream programming model in some respects. Transformation between the two models has attracted much attention from the research community, since it is the foundation for porting programs between the two architectures. Most related research focuses on the efficiency of porting existing parallel programs to new architectures such as GPUs. Very few of these studies, however, address the portability problem systematically, namely, which kinds of parallel programs can or should be converted into stream programs and mapped onto stream processors. In this paper, we study how the parallel mechanisms in OpenMP map to the stream programming model, and point out those mechanisms that are infeasible or undesirable for stream programs. By analyzing two typical benchmarks, we conclude that a majority of scientific applications are suitable for mapping to the stream programming model. This conclusion supports the idea of accelerating scientific applications with stream processors.
In recent years, heterogeneous parallel systems have become a research focus in high-performance computing. In such a system, the CPU typically provides the basic computing environment while a special-purpose accelerator (a GPU in this paper) provides high computing performance. However, overall system performance is prone to being limited by data communication between the CPU and the GPU. Data communication is typically used to synchronize the array on the CPU with the stream (in AMD's terminology) on the GPU. In many cases, programmers simply add data synchronization for each GPU invocation independently. Programming this way is easy, but it can introduce a great deal of redundant communication, which dramatically degrades overall performance. To alleviate this problem, we propose a heuristic data communication scheduling approach based on the stream programming model. By analyzing the state transitions of each stream/array data pair, conditionally relaxing the synchronization strategy, and optimizing for branch and loop control structures, our approach significantly reduces redundant data communication in most cases.
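The idea of tracking the state of a stream/array pair can be illustrated with a minimal sketch. The abstract does not list the actual states or transition rules, so the three states, the class name `StreamArrayPair`, and its methods below are all assumptions made for illustration; the sketch only counts transfers and moves no real data.

```python
from enum import Enum


class State(Enum):
    # Hypothetical states; the paper's actual state set is not given here.
    SYNCED = "synced"              # CPU array and GPU stream hold the same data
    HOST_NEWER = "host_newer"      # CPU array was written since the last copy
    DEVICE_NEWER = "device_newer"  # GPU stream was written since the last copy


class StreamArrayPair:
    """Tracks one CPU array / GPU stream pair and counts the copies needed."""

    def __init__(self):
        self.state = State.SYNCED
        self.transfers = 0

    def host_write(self):
        # Assumes the host overwrites the whole array.
        self.state = State.HOST_NEWER

    def host_read(self):
        # Copy stream -> array only if the GPU holds newer data.
        if self.state is State.DEVICE_NEWER:
            self.transfers += 1
            self.state = State.SYNCED

    def kernel_read(self):
        # Copy array -> stream only if the CPU holds newer data.
        if self.state is State.HOST_NEWER:
            self.transfers += 1
            self.state = State.SYNCED

    def kernel_write(self):
        # Assumes the kernel overwrites the whole stream.
        self.state = State.DEVICE_NEWER


# Three kernel invocations read `a` and write `b`; the host reads `b` once.
a, b = StreamArrayPair(), StreamArrayPair()
a.host_write()
for _ in range(3):
    a.kernel_read()
    b.kernel_write()
b.host_read()
print(a.transfers, b.transfers)  # 1 1
```

Synchronizing around every invocation independently would copy `a` in three times and `b` out three times; deferring each copy until the other side actually needs the data leaves one transfer per pair in this example.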
A network emulation environment is of great importance to research on network protocols, applications, and security mechanisms. Large-scale network topology generation is one of the key technologies for constructing network emul...
ISBN: (print) 9781424472611; 9780769540597
The networked application environment has motivated the development of multitasking operating systems for sensor networks and other low-power electronic devices, but their multitasking capability is severely limited because traditional stack management techniques perform poorly on small-memory systems. In this paper, we show that combining binary translation and a new kernel runtime can lead to efficient OS designs on resource-constrained platforms. We introduce SenSmart, a multitasking OS for sensor networks, and present new OS design techniques for supporting preemptive multi-task scheduling, memory isolation, and versatile stack management. We have implemented SenSmart on MICA2/MICAz motes. Evaluation shows that SenSmart performs efficient binary translation and demonstrates a significantly better capability in managing concurrent tasks than other sensornet operating systems.
Event-driven programming has been a relatively hot topic in distributed systems development. Having worked on such systems for years, we now believe that it is not the best choice. Besides the well-known "stack ripping" problem, we argue that it greatly harms the composability of software modules. Preemptive threads also lack composability because of data races and locks. A lack of composability can result in systems with little vitality. Cooperative threading (or coroutines), by contrast, is almost free of this problem, so we advocate it as the primary concurrency model for most distributed systems.
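A minimal sketch of cooperative threading, using Python generators as coroutines, shows why shared state needs no locks in this model: a task can only be switched out at an explicit `yield`. The names `worker` and `run_round_robin` are illustrative and do not come from the paper.

```python
from collections import deque


def worker(name, steps, log, shared):
    """A cooperative task that runs until it chooses to yield."""
    for _ in range(steps):
        # This read-modify-write needs no lock: no other task can run
        # between the two statements, only at the yield below.
        value = shared["counter"]
        shared["counter"] = value + 1
        log.append(name)
        yield  # explicit switch point


def run_round_robin(tasks):
    """Resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished; do not requeue it
        ready.append(task)


log, shared = [], {"counter": 0}
run_round_robin([worker("A", 2, log, shared), worker("B", 3, log, shared)])
print(shared["counter"], log)  # 5 ['A', 'B', 'A', 'B', 'B']
```

Each task keeps its local state on its own stack across switch points, which avoids the manual state splitting that "stack ripping" refers to in event-driven code.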
ISBN: (print) 9781424472796
Insects build architecturally complex nests and search for remote food through collaborative work, despite their limited sensors, minimal individual intelligence, and lack of a central control system. Insects' collaboration emerges as the response of individual insects to Stigmergy. In this paper, we propose a sign-based model of Stigmergy for discussing collaboration, taking the "sign" as the key notion for understanding it: the sign is the link among all the components of a Stigmergic complex adaptive system. Based on this understanding, we propose a definition that reveals the nature of signs and explore the significations and relationships carried by the notion of sign. A sign-based model that captures the essentials of Stigmergy is then derived. A basic architecture of Stigmergy and its constituents are presented and discussed. Finally, some applications of the model are discussed.
Successive interference cancellation (SIC) is an effective multipacket reception technique for combating interference. As not all collisions are resolvable, careful transmission coordination is required. We study link scheduling in wireless networks with SIC at the physical layer. A new model, the simultaneity graph (SG), is proposed to characterize the link correlation introduced by SIC. Two new scheduling schemes are then presented: 1) a slot-oriented scheme, which assigns a maximal feasible link set to each time slot, and 2) a link-oriented scheme, which assigns each link a sufficient number of slots. Performance is evaluated by simulation, and the results demonstrate a throughput gain of 50% on average and up to 110% over IEEE 802.11. The complexity of SG is only slightly higher than that of widely used existing models (e.g., the conflict graph).
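Why some collisions are resolvable and others are not can be seen from a small sketch of textbook SIC decoding at a single receiver. It assumes strongest-first decoding and one SINR threshold shared by all signals; the function name `decodable_with_sic` and the numeric values are illustrative and are not taken from the paper.

```python
def decodable_with_sic(powers, target, noise, threshold):
    """Return True if signal `target` can be decoded by SIC.

    powers: dict mapping a transmitter label to its received power
            (linear units, not dB)
    """
    remaining = dict(powers)
    while remaining:
        strongest = max(remaining, key=remaining.get)
        interference = sum(remaining.values()) - remaining[strongest]
        sinr = remaining[strongest] / (noise + interference)
        if sinr < threshold:
            # Under strongest-first ordering, decoding stalls here.
            return False
        if strongest == target:
            return True
        del remaining[strongest]  # cancel the decoded signal and continue
    return False  # target was not among the received signals


# Weak signal "b" is recovered after strong signal "a" is cancelled.
print(decodable_with_sic({"a": 100.0, "b": 10.0}, "b", noise=1.0, threshold=2.0))
# Powers too close together: "a" cannot be decoded, so "b" is lost as well.
print(decodable_with_sic({"a": 15.0, "b": 10.0}, "b", noise=1.0, threshold=2.0))
```

In the first call, "b" alone would see an SINR of 10 / (1 + 100), far below the threshold, yet it becomes decodable once "a" is cancelled. In the second call the collision is unresolvable, which is the kind of case that transmission coordination has to avoid.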
Successive interference cancellation (SIC) is an effective multipacket reception technique for combating interference. We study link scheduling under the SINR (signal-to-interference-plus-noise ratio) model in ad hoc networks with SIC at the physical layer. Two facts pose the key technical challenges: interference accumulates, and the links decoded sequentially by SIC are correlated. We propose the conflict set graph (CSG) to characterize the interference and define an interference degree to measure the interference of a link. As scheduling over the CSG is NP-hard, we explore an independent-set-based greedy scheme to efficiently construct a maximal feasible schedule. Performance is evaluated by simulation. Compared with the simple greedy method, the throughput gain is 30% on average and up to 60%.
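The general shape of an independent-set-based greedy schedule can be sketched as follows. This is not the paper's CSG algorithm: the sketch assumes a plain pairwise conflict graph given as adjacency sets, whereas the paper's model accounts for accumulated interference, and the function name `greedy_schedule` and the lowest-degree-first ordering are assumptions.

```python
def greedy_schedule(conflicts):
    """Partition links into slots; each slot is a maximal independent set
    of the links that are still unscheduled.

    conflicts: dict mapping each link to the set of links it conflicts with
    """
    unscheduled = set(conflicts)
    slots = []
    while unscheduled:
        slot = []
        # Visit low-degree links first; ties are broken by name so the
        # result is deterministic.
        order = sorted(
            unscheduled,
            key=lambda k: (len(conflicts[k] & unscheduled), k),
        )
        for link in order:
            if not any(other in conflicts[link] for other in slot):
                slot.append(link)
        unscheduled -= set(slot)
        slots.append(slot)
    return slots


# Four links in a chain: l1-l2, l2-l3, and l3-l4 conflict pairwise.
conflicts = {
    "l1": {"l2"},
    "l2": {"l1", "l3"},
    "l3": {"l2", "l4"},
    "l4": {"l3"},
}
print(greedy_schedule(conflicts))  # [['l1', 'l4'], ['l2'], ['l3']]
```

The greedy result uses three slots, while an exact schedule of this chain needs only two (`l1` with `l3`, `l2` with `l4`). That gap is the price of avoiding the NP-hard exact problem.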
ISBN: (print) 9781424456789; 9780769539584
With the fast development of transistor technology, the graphics processing unit (GPU) is increasingly used in non-graphics applications, and major GPU hardware vendors have introduced software stacks for their own GPUs, such as Brook+ for AMD GPUs. Compared with traditional parallel systems, heterogeneous systems integrating stream-based multi-threaded GPUs provide higher parallel computing capability at lower cost. However, porting traditional applications to heterogeneous systems places new demands on application optimization for the GPU. Based on AMD's Brook+ platform, we explored application optimization on AMD GPUs by implementing and optimizing the LBM benchmark from SPEC2006. To improve program locality, we optimized the original data layout of LBM. Using the short vector data types provided by Brook+, we also improved the GPU's bandwidth utilization and the efficiency of its thread processors. Through branch elimination, we reduced the performance loss caused by branch divergence in the kernel, which results from the GPU's SIMD execution mode. The experimental results show that data layout, memory bandwidth, branch paths, and other factors strongly affect the performance of programs executing on the GPU. With all the optimizations applied, we obtained a speedup of 22x (single precision) and 19x (double precision) over the original serial benchmark code on a quad-core CPU, and a speedup of 4x (single precision) and 8.7x (double precision) over the original OMP benchmark code on an 8-core CPU.