ISBN (print): 9781457702518
Distributed shared memory (DSM) is an important technology that provides programmers with the underlying execution mechanism for shared-memory programs. To improve the performance of DSM, recent studies have introduced compiler assistance, in which the compiler generates code for dependency analysis and communication. This paper proposes a high-performance DSM, called Offloaded-DSM, in which dependency analysis and communication are offloaded to the cluster network. In Offloaded-DSM, the host machine can concentrate on the application's own computation while the network maintains coherency in parallel. Preliminary evaluation shows that Offloaded-DSM reduces execution time by up to 32% on eight nodes and exhibits good scalability.
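To make the offloading idea concrete, the following sketch models the division of labour the abstract describes: the host thread keeps computing while a background "offload engine" (standing in for the cluster network interface) processes compiler-generated dependency/communication descriptors. The OffloadEngine class, its submit/fence calls, and the descriptor format are hypothetical illustrations, not the paper's actual interface.

```python
# Conceptual sketch (not the paper's implementation): the host pushes
# compiler-generated dependency/communication descriptors to an "offload
# engine" that stands in for the cluster network, then keeps computing
# while coherence traffic is handled in the background.
import threading
import queue
import time

class OffloadEngine:
    """Models a network-side unit that runs dependency analysis and
    communication on behalf of the host (hypothetical interface)."""

    def __init__(self):
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, descriptor):
        """Compiler-inserted call: enqueue a dependency/communication task."""
        self._requests.put(descriptor)

    def fence(self):
        """Block only when the host actually needs remotely produced data."""
        self._requests.join()

    def _run(self):
        while True:
            desc = self._requests.get()
            # Stand-in for dependency analysis + remote update of shared data.
            time.sleep(0.01)
            print(f"offload engine handled: {desc}")
            self._requests.task_done()

engine = OffloadEngine()
engine.submit({"page": 42, "op": "diff-and-send", "dest": "node3"})
local_result = sum(i * i for i in range(100_000))   # host keeps computing
engine.fence()                                      # synchronize before reuse
print("host result:", local_result)
```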
Scalability of future wide-issue processor designs is severely hampered by the use of centralized resources such as register files, memories, and interconnect networks. While centralized resources ease both hardware design and compiler code generation, they can become performance bottlenecks as access latencies increase with larger designs. The natural solution to this problem is to adapt the architecture to use smaller, decentralized resources. Decentralized architectures use smaller, faster components and exploit distributed instruction-level parallelism across the resources. A multicluster architecture is an example of such a decentralized processor, where subsets of smaller register files, functional units, and memories are grouped together in a tightly coupled unit, forming a cluster. These clusters can then be replicated and connected together to form a scalable, high-performance architecture. The main difficulty with decentralized architectures resides in compiler code generation. In a centralized Very Long Instruction Word (VLIW) processor, the compiler must statically schedule each operation to both a functional unit and a time slot for execution. In contrast, for a decentralized multicluster VLIW, the compiler must also consider the effects of cluster assignment, recognizing that communication between clusters incurs a delay penalty. In addition, if the multicluster processor has partitioned data memories, the compiler has the further task of assigning data objects to their respective memories. The decisions of cluster, functional unit, memory, and time slot are highly interrelated, and each can have dramatic effects on the best choice for every other. This dissertation addresses the issues of extracting and exploiting inherent parallelism across decentralized resources through compiler analysis and code-generation techniques. First, a static analysis technique to partition data objects is presented, which maps data objects to the partitioned memories.
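The cluster-assignment problem described in this abstract can be illustrated with a small greedy heuristic: place each operation on the cluster that already holds most of its operands, and charge a fixed delay for every operand that must cross clusters. The heuristic, the two-cluster machine, and the one-cycle inter-cluster penalty below are assumptions for illustration only, not the dissertation's algorithm.

```python
# Minimal sketch of the cluster-assignment problem (illustrative heuristic):
# place each operation on the cluster holding most of its operands, charging
# a fixed delay for every operand that must cross clusters.
from collections import Counter

NUM_CLUSTERS = 2
INTERCLUSTER_DELAY = 1   # extra cycles per cross-cluster operand (assumed)

def assign_clusters(ops):
    """ops: list of (name, [operand names]); returns ({name: cluster}, penalty)."""
    placement = {}
    load = Counter()
    penalty = 0
    for name, operands in ops:
        votes = Counter(placement[o] for o in operands if o in placement)
        if votes:
            # Prefer the cluster where most operands live; break ties by load.
            cluster = min(range(NUM_CLUSTERS),
                          key=lambda c: (-votes[c], load[c]))
        else:
            cluster = min(range(NUM_CLUSTERS), key=lambda c: load[c])
        placement[name] = cluster
        load[cluster] += 1
        # Every operand living on another cluster costs a move.
        penalty += INTERCLUSTER_DELAY * sum(
            1 for o in operands if o in placement and placement[o] != cluster)
    return placement, penalty

dag = [("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c", "a"])]
print(assign_clusters(dag))
```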
The slowdown in technology scaling puts architectural features at the forefront of innovation in modern processors. This article presents a Metric-Guided Method (MGM) that extends Top-Down analysis with carefully selected, dynamically adapted metrics in a structured approach. Using MGM, we conduct two evaluations, at the microarchitecture and the Instruction Set Architecture (ISA) levels. Our results show that simple optimizations, such as improved representation of CISC instructions, broadly improve performance, while changes to the floating-point execution units had mixed impact. Overall, we report 10 architectural insights on the microarchitecture, ISA, and compiler fronts and quantify their impact on the SPEC CPU benchmarks.
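For readers unfamiliar with the Top-Down analysis that MGM extends, the sketch below computes the standard level-1 breakdown (Retiring, Bad Speculation, Frontend Bound, Backend Bound) from raw counter values, following the commonly published formulas. The 4-wide issue width, the Intel-style event names, and the sample counts are assumptions for illustration and are not taken from this article.

```python
# Hedged sketch of the level-1 Top-Down breakdown that MGM builds on.
# Counter names and the 4-wide pipeline are assumptions, not from the article.
PIPELINE_WIDTH = 4

def topdown_level1(counters):
    """counters: raw event counts keyed by (assumed) Intel-style event names."""
    slots = PIPELINE_WIDTH * counters["CPU_CLK_UNHALTED.THREAD"]
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    frontend_bound = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_speculation = (counters["UOPS_ISSUED.ANY"]
                       - counters["UOPS_RETIRED.RETIRE_SLOTS"]
                       + PIPELINE_WIDTH * counters["INT_MISC.RECOVERY_CYCLES"]) / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"Retiring": retiring, "Bad Speculation": bad_speculation,
            "Frontend Bound": frontend_bound, "Backend Bound": backend_bound}

sample = {  # made-up counts for illustration only
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "UOPS_RETIRED.RETIRE_SLOTS": 2_200_000,
    "UOPS_ISSUED.ANY": 2_500_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 600_000,
    "INT_MISC.RECOVERY_CYCLES": 50_000,
}
for category, share in topdown_level1(sample).items():
    print(f"{category:>16}: {share:.1%}")
```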