Enforcing data dependence in parallel algorithms requires synchronization primitives. For simple data dependence, primitives such as the Full/Empty bit of the HEP machine can be quite effective. However, if data dependence cannot be determined at compile time, or if it is very complicated, more efficient synchronization schemes and algorithms are required. A synchronization scheme is proposed that is very useful in enforcing data dependence on a large multiprocessor system. A possible hardware implementation of this scheme is also suggested. Using very large-scale integration (VLSI) technology, the proposed scheme can be implemented efficiently by incorporating some simple processing capabilities into the shared memories.
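To make the baseline primitive concrete, the following is a minimal software sketch of a Full/Empty-bit cell, the simple mechanism the abstract mentions, not the scheme the paper proposes. The class name `FullEmptyCell`, its method names, and the helper `run_pipeline` are hypothetical, and a `threading.Condition` stands in for what is a hardware bit on each memory word.

```python
import threading


class FullEmptyCell:
    """One memory word guarded by a full/empty bit (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False  # the full/empty bit; the cell starts empty
        self._value = None

    def write_when_empty(self, value):
        # The writer blocks until the cell is empty, then fills it.
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        # The reader blocks until the cell is full, then empties it.
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def run_pipeline(n):
    """Pass n squares from one producer thread to one consumer thread."""
    cell = FullEmptyCell()
    results = []

    def producer():
        for i in range(n):
            cell.write_when_empty(i * i)

    def consumer():
        for _ in range(n):
            results.append(cell.read_when_full())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a single producer and a single consumer, the bit forces strict write/read alternation, so the consumer sees every value exactly once and in order. That covers a simple flow dependence only; the harder cases the abstract describes, such as dependences unknown at compile time, are outside what this sketch handles.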
Existing pattern-based compiler technology is unable to effectively exploit the full potential of SIMD architectures. We present a new program-synthesis-based technique for auto-vectorizing performance-critical innermost loops. Our synthesis technique is applicable to a wide range of loops, consistently produces performant SIMD code, and generates correctness proofs for the output code. The synthesis technique, which leverages existing work on relational verification methods, is a novel combination of deductive loop restructuring, synthesis condition generation, and a new inductive synthesis algorithm for producing loop-free code fragments. The inductive synthesis algorithm wraps an optimized depth-first exploration of code sequences inside a CEGIS loop. Our technique is able to quickly produce SIMD implementations (up to 9 instructions in 0.12 seconds) for a wide range of fundamental looping structures. The resulting SIMD implementations outperform the original loops by 2.0x to 3.7x.
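The general shape of a CEGIS (counterexample-guided inductive synthesis) loop can be sketched on a toy problem. This is not the paper's algorithm: the 8-bit domain, the target `spec`, the three-operation instruction set `OPS`, and all function names are assumptions made for illustration, and a plain shortest-first enumeration stands in for the optimized depth-first exploration the abstract describes.

```python
from itertools import product

MASK = 0xFF  # assumed 8-bit domain, small enough to verify exhaustively


def spec(x):
    # Hypothetical target behaviour: multiply by 5, modulo 256.
    return (x * 5) & MASK


# Hypothetical loop-free instruction set acting on an accumulator `a`;
# `x` is the original input.
OPS = {
    "shl1": lambda a, x: (a << 1) & MASK,
    "shl2": lambda a, x: (a << 2) & MASK,
    "addx": lambda a, x: (a + x) & MASK,
}


def run(program, x):
    # Execute a sequence of op names, starting with the accumulator at x.
    acc = x
    for name in program:
        acc = OPS[name](acc, x)
    return acc


def synthesize(examples, max_len):
    # Inductive step: return the first candidate (shortest first) that
    # agrees with the spec on the examples collected so far.
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, x) == spec(x) for x in examples):
                return program
    return None


def verify(program):
    # Verification step: return a counterexample input, or None if the
    # candidate matches the spec on the whole domain.
    for x in range(MASK + 1):
        if run(program, x) != spec(x):
            return x
    return None


def cegis(max_len=3):
    examples = [0]
    while True:
        program = synthesize(examples, max_len)
        if program is None:
            return None  # no program of this size fits the examples
        counterexample = verify(program)
        if counterexample is None:
            return program
        examples.append(counterexample)
```

Calling `cegis()` returns a tuple of operation names that matches `spec` on every 8-bit input. Each failed candidate contributes one new counterexample, so the example set grows until only correct programs survive. In a realistic setting the exhaustive `verify` would be replaced by a solver or proof-based check.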