Search Results - Inner Mongolia University Library

COMPILE-TIME TECHNIQUES FOR DATA DISTRIBUTION IN DISTRIBUTED MEMORY MACHINES

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1991, Vol. 2, No. 4, pp. 472-482

Authors: RAMANUJAM, J; SADAYAPPAN, P. OHIO STATE UNIV, DEPT COMP & INFORMAT SCI, COLUMBUS, OH 43210

This paper addresses the problem of partitioning data for distributed memory machines (multicomputers). In current-day multicomputers, interprocessor communication is more time-consuming than instruction execution. If insufficient attention is paid to the data allocation problem, then the amount of time spent in interprocessor communication might be so high as to seriously undermine the benefits of parallelism. It is therefore worthwhile for a compiler to analyze patterns of data usage to determine allocation, in order to minimize interprocessor communication. We present a machine-independent analysis of communication-free partitions. We present a matrix notation to describe array accesses in fully parallel loops, which lets us derive sufficient conditions for communication-free partitioning (decomposition) of arrays. In the case of a commonly occurring class of accesses, we present a problem formulation to minimize communication costs when communication-free partitioning of arrays is not possible.

Keywords: ACCESS AND DEPENDENCE BASED ARRAY PARTITIONING; COMMUNICATION-FREE PARTITIONING; DATA DECOMPOSITION; DISTRIBUTED MEMORY MACHINES; parallelizing compilers
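
The idea of a communication-free partition in the abstract above can be illustrated with a small brute-force check. This is only a sketch under stated assumptions, not the paper's matrix-based derivation: the loop `A[i][j] = B[j][i]`, the function names, and the "owner computes" rule (each iteration runs on the processor owning the element it writes) are all hypothetical choices made for illustration.

```python
from itertools import product

def owner(index, q):
    """Processor owning an array element: the value of the hyperplane q . index."""
    return sum(a * b for a, b in zip(q, index))

def is_communication_free(n, write_access, read_access, q_write, q_read):
    """Brute-force check over an n x n fully parallel loop nest.

    Assumption: iteration (i, j) runs on the processor that owns the element
    it writes. The partition is communication-free if that same processor
    also owns the element the iteration reads.
    """
    for i, j in product(range(n), repeat=2):
        if owner(write_access(i, j), q_write) != owner(read_access(i, j), q_read):
            return False
    return True

# Hypothetical loop: for all i, j: A[i][j] = B[j][i]
write_a = lambda i, j: (i, j)
read_b = lambda i, j: (j, i)

# A by rows, B by rows: the element read usually lives on another processor.
rows_rows = is_communication_free(8, write_a, read_b, (1, 0), (1, 0))
# A by rows, B by columns: every read is local.
rows_cols = is_communication_free(8, write_a, read_b, (1, 0), (0, 1))
```

Here the partition vectors `(1, 0)` and `(0, 1)` select row-wise and column-wise distribution respectively; a compiler would derive such vectors analytically rather than by enumeration.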

COMPILE-TIME SCHEDULING AND ASSIGNMENT OF DATA-FLOW PROGRAM GRAPHS WITH DATA-DEPENDENT ITERATION

IEEE TRANSACTIONS ON COMPUTERS, 1991, Vol. 40, No. 11, pp. 1225-1238

Authors: HA, SH; LEE, EA. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA

Scheduling of data-flow graphs onto parallel processors consists in assigning actors to processors, ordering the execution of actors within each processor, and firing the actors at particular times. Many scheduling strategies do at least one of these operations at compile time to reduce run-time cost. In this paper, we classify four scheduling strategies: 1) fully dynamic, 2) static-assignment, 3) self-timed, and 4) fully static. These are ordered in decreasing run-time cost. Optimal or near-optimal compile-time decisions require deterministic, data-independent program behavior known to the compiler. Thus, moving from strategy 1) toward 4) either sacrifices optimality, decreases generality by excluding certain program constructs, or both. This paper proposes scheduling techniques valid for strategies 2), 3), and 4). In particular, we focus on data-flow graphs representing data-dependent iteration; for such graphs, although it is impossible to deterministically optimize the schedule at compile time, reasonable decisions can be made. For many applications, good compile-time decisions remove the need for dynamic scheduling or load balancing. We assume a known probability mass function for the number of cycles in the data-dependent iteration and show how a compile-time decision about assignment and/or ordering as well as timing can be made. The criterion we use is to minimize the expected total idle time caused by the iteration; in certain cases, this will also minimize the expected makespan of the schedule. We will also show how to determine the number of processors that should be assigned to the data-dependent iteration. The method is illustrated with a practical programming example, yielding preliminary results that are very promising.

Keywords: DATA FLOW; DATA-DEPENDENT ITERATION; PARALLEL PROCESSORS; parallelizing compilers; QUASI-STATIC SCHEDULING; SCHEDULING
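
The criterion described in the abstract above, minimizing expected idle time given a known probability mass function for the iteration count, can be sketched with a toy cost model. The model here is an assumption made for illustration and is not taken from the paper: the schedule reserves `t` cycles for the iteration; if the iteration finishes early, the `n_iter` processors running it sit idle, and if it runs long, the `n_other` remaining processors wait. All names and numbers are hypothetical.

```python
def expected_idle(t, pmf, n_iter, n_other):
    """Expected total idle time when the schedule assumes t iteration cycles.

    pmf maps an actual cycle count x to its probability.
    Assumed toy model: each cycle takes one time unit.
    """
    return sum(
        p * (n_iter * max(t - x, 0) + n_other * max(x - t, 0))
        for x, p in pmf.items()
    )

def best_assumed_cycles(pmf, n_iter, n_other):
    """Pick the compile-time cycle count with the smallest expected idle time."""
    candidates = range(min(pmf), max(pmf) + 1)
    return min(candidates, key=lambda t: expected_idle(t, pmf, n_iter, n_other))

# Hypothetical distribution of the number of cycles in the iteration.
pmf = {1: 0.25, 2: 0.5, 4: 0.25}
best = best_assumed_cycles(pmf, n_iter=1, n_other=2)
cost = expected_idle(best, pmf, n_iter=1, n_other=2)
```

Under this model the best assumed count shifts toward larger values as more processors would otherwise wait on the iteration, which is the kind of trade-off a quasi-static scheduler has to weigh.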

PARTITIONING AND MAPPING NESTED LOOPS ON MULTIPROCESSOR SYSTEMS

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1991, Vol. 2, No. 4, pp. 430-439

Authors: SHEU, JP; TAI, TH. Department of Electrical Engineering, National Central University, Chungli, Taiwan

Intensive scientific algorithms can usually be formulated as nested loops, which are the main source of parallelism. When a nested loop is executed in parallel, the total execution time is composed of two parts: the computation time and the communication time. For a message-passing multiprocessor system, performance declines rapidly when the communication overhead exceeds the corresponding computation. In this paper, a method is presented for executing nested loops with constant loop-carried dependencies in parallel on message-passing multiprocessor systems so as to reduce the communication overhead. First, we partition the nested loop into blocks that result in little communication, without concern for the topology of machines. For a given linear time transformation found by the hyperplane method, the iterations of a nested loop are partitioned into blocks such that the communication among the blocks is reduced while the execution ordering defined by the time transformation is not perturbed. Then, the partitioned blocks generated by the partitioning method are mapped onto multiprocessor systems according to the specific properties of various machines. We propose a heuristic mapping algorithm for the hypercube machines.

Keywords: HYPERCUBES; HYPERPLANE METHOD; MESSAGE-PASSING MULTIPROCESSOR SYSTEMS; parallelizing compilers; SYSTOLIC ARRAYS; WAVE-FRONT METHOD
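
The linear time transformation mentioned in the abstract above can be illustrated with a minimal sketch of the hyperplane (wavefront) idea. It covers only the time step assignment, not the paper's block partitioning or hypercube mapping; the dependence vectors, the time vector `pi`, and the function names are hypothetical examples.

```python
from collections import defaultdict
from itertools import product

def is_valid_time_vector(pi, deps):
    """A time vector is valid if every dependence moves forward in time (pi . d > 0)."""
    return all(sum(p * d for p, d in zip(pi, dep)) > 0 for dep in deps)

def wavefronts(pi, bounds):
    """Group the iterations of a rectangular loop nest by time step pi . point.

    Iterations in the same group have no dependence between them and may run
    in parallel; groups execute in increasing order of time step.
    """
    fronts = defaultdict(list)
    for point in product(*(range(b) for b in bounds)):
        fronts[sum(p * x for p, x in zip(pi, point))].append(point)
    return [fronts[t] for t in sorted(fronts)]

# Hypothetical loop body: A[i][j] = A[i-1][j] + A[i][j-1]
deps = [(1, 0), (0, 1)]   # constant loop-carried dependence vectors
pi = (1, 1)               # candidate linear time transformation
fronts = wavefronts(pi, (3, 3))
```

For the 3 x 3 iteration space this yields five wavefronts along the anti-diagonals; a partitioning step would then group iterations into blocks without reordering these time steps.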

A DISTRIBUTED IMPLEMENTATION OF SHARED VIRTUAL MEMORY WITH STRONG AND WEAK COHERENCE

2ND EUROPEAN CONF ON DISTRIBUTED MEMORY COMPUTING ( EDMCC2 )

Authors: GILOI, WK; HASTEDT, C; SCHOEN, F; SCHROEDERPREIKSCHAT, W. GMD Research Center for Innovative Computer Systems and Technology, Technical University of Berlin, Germany

ISBN: (print) 3540539514

A virtual shared memory architecture (VSMA) is a distributed memory architecture that looks to the application software as if it were a shared memory system. The major problem with such a system is to maintain the coherence of the distributed data entities. Shared virtual memory means that the shared data entities are pages of local virtual memories with demand paging. Memory coherence may be strong or weak. Strong coherence is a scheme where all the shared data entities look from the outside as if they were stored in one coherent memory. This simplifies programming of a distributed memory system at the cost of the high message traffic needed to maintain the strong coherence. The efficiency of the system can be increased by adding a weak coherence scheme that allows for multiple writes by different threads of control into the same page. The price of the weak coherence scheme is the need for explicit program synchronizations to reestablish the strong coherence of the result at the end. For the computer architect, the challenging question is how to implement a VSMA most efficiently and, specifically, by what architectural means to support the implementation. In the paper a new solution to this question is presented based upon an innovative distributed memory architecture in which communication is conducted by a dedicated communication processor in each node rather than by the node CPU. This will make the exchange of short, fixed-size messages, e.g., invalidation notices, very efficient. Therefore, it becomes more appropriate to minimize the overall administrative overhead, even at the cost of more message traffic. On that rationale, a novel, capability-based mechanism for both strong and weak coherence of shared virtual memory is presented. The weak coherence scheme is built on top of the strong coherence, utilizing its mechanisms. The proposed implementation is totally distributed and based on a strict need-to-know philosophy. Consequently, the e

Keywords: DISTRIBUTED MEMORY ARCHITECTURE; VIRTUAL SHARED MEMORY ARCHITECTURE; STRONG AND WEAK DATA COHERENCE; COMMUNICATION HARDWARE; parallelizing compilers

HIGH-SPEED SYNCHRONIZATION OF PROCESSORS USING FUZZY BARRIERS

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 1990, Vol. 19, No. 1, pp. 53-73

Authors: GUPTA, R; EPSTEIN, M. UNIV PITTSBURGH, DEPT COMP SCI, PITTSBURGH, PA 15260; N AMER PHILIPS LIGHTING CORP, PHILIPS LABS, BRIARCLIFF MANOR, NY 10510

Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. Firstly, the execution of the barrier may be slow since it requires execution of several instructions. Secondly, processors that are stalled waiting for other processors to reach the barrier cannot do any useful work. In this paper, the notion of the fuzzy barrier is presented, which avoids these drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that can be executed by a processor while it awaits synchronization. The barrier regions are constructed by a compiler and consist of instructions such that a processor is ready to synchronize upon reaching the first instruction and must synchronize before exiting the region. When synchronization does occur, the processors could be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Hardware fuzzy barriers have been implemented as part of a RISC-based multi-processor system. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the synchronization overhead can be greatly reduced using the mechanism.

Keywords: barrier synchronization; code reordering; Multiprocessor systems; parallelizing compilers; synchronization overhead
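
The barrier region described in the abstract above can be sketched in software as a split-phase barrier: a thread signals readiness when it enters the region, keeps working, and blocks only on leaving if some participant has not yet entered. This is an illustrative sketch only; the class `FuzzyBarrier`, its `enter`/`leave` methods, and the demo workload are hypothetical, and it does not model the hardware mechanism the paper proposes.

```python
import threading

class FuzzyBarrier:
    """Software sketch of a barrier with a region between 'ready' and 'must sync'."""

    def __init__(self, parties):
        self._parties = parties
        self._arrived = 0
        self._generation = 0
        self._cond = threading.Condition()

    def enter(self):
        """Signal readiness to synchronize; never blocks. Returns a ticket."""
        with self._cond:
            ticket = self._generation
            self._arrived += 1
            if self._arrived == self._parties:
                # Last arrival releases everyone waiting in leave().
                self._arrived = 0
                self._generation += 1
                self._cond.notify_all()
            return ticket

    def leave(self, ticket):
        """Exit the barrier region; blocks only if some party has not entered."""
        with self._cond:
            self._cond.wait_for(lambda: self._generation != ticket)

def run(n=4):
    barrier = FuzzyBarrier(n)
    phase1 = [None] * n
    seen = [None] * n

    def worker(k):
        phase1[k] = k * k               # work that must finish before the barrier
        ticket = barrier.enter()
        local = k + 1                   # barrier region: independent useful work
        barrier.leave(ticket)
        seen[k] = sum(phase1) + local   # safe: every phase-1 write has completed

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

The more independent work placed between `enter` and `leave`, the less likely `leave` is to block, which mirrors the abstract's observation about larger barrier regions.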

A fine-grained MIMD architecture based upon register channels

Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture

Author: Rajiv Gupta, Department of Computer Science, University of Pittsburgh, 220 Alumni Hall, Pittsburgh, PA

ISBN: (print) 9780897914130

This paper discusses the use of shared register channels as a data exchange mechanism among processors in a fine-grained MIMD system with a load/store architecture. A register channel is provided with a synchronization bit that is used to ensure that a processor succeeds in reading a channel only after a value has been written to the channel. The instructions supported by this load/store architecture allow both registers and register channels to be used as operand sources and result destinations. Conditional load, store, and move instructions are provided to allow processors to exchange values through channels in the presence of aliasing caused by array references. Compiler support required to take proper advantage of channels is briefly discussed. In contrast to a VLIW machine, a system with channels does not require strict lockstep operation of its processors. This reduces the delays caused by unpredictable events such as memory bank conflicts.

Keywords: parallelizing compilers; instruction scheduling; multiprocessor system; fine-grained parallelism; channels; aliasing
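
The synchronization bit described in the abstract above can be modeled in software as a one-slot channel whose read succeeds only after a write. This is a hedged sketch using threads in place of processors; the class `RegisterChannel` is hypothetical, and two behaviors are assumptions not stated in the abstract: a read clears the bit, and a write to a full channel simply overwrites the value.

```python
import threading

class RegisterChannel:
    """One-value channel guarded by a synchronization bit (software model)."""

    def __init__(self):
        self._value = None
        self._full = False                  # the synchronization bit
        self._cond = threading.Condition()

    def write(self, value):
        """Store a value and set the bit so that a pending read can succeed."""
        with self._cond:
            self._value = value             # assumption: overwrites if already full
            self._full = True
            self._cond.notify_all()

    def read(self):
        """Wait until a value has been written, then return it."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False              # assumption: a read clears the bit
            return self._value

def demo():
    channel = RegisterChannel()
    result = []
    consumer = threading.Thread(target=lambda: result.append(channel.read() + 1))
    consumer.start()                        # the reader may run first and must wait
    channel.write(41)
    consumer.join()
    return result[0]
```

Because the reader waits on the bit rather than on a global clock, producer and consumer need not run in lockstep, which is the contrast with VLIW operation that the abstract draws.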
