the distributed Consensus problem assumes that all processors in the system have some initial values;the goal is to make all non-faulty processors agree on one of these values. this paper investigates the time needed ...
详细信息
ISBN:
(纸本)0818626720
the distributed Consensus problem assumes that all processors in the system have some initial values;the goal is to make all non-faulty processors agree on one of these values. this paper investigates the time needed to reach consensus in a partially synchronous model with omission failures. In this model, the processors have no direct knowledge about time, but the time between consecutive steps of each processor is always between two known constants c1 and c2;the ratio C = c2/c1 measures the timing uncertainty in the system. Moreover, messages are delivered within time d. this paper provides an improved protocol for the above problem. When the majority of the processors is fault-free, the protocol achieves consensus in time 3(qq+1)d+Cd, where φ is the actual number of faults in a specific execution of the protocol. this will allow an increase in efficiency up to 25% over the existing protocol which requires time 4(qq+1)d+Cd.
A novel analog parallel approach to the implementation of the dynamic programming paradigm is described. the approach has been used to design a test chip which solves the single source, shortest path problem in a full...
详细信息
the design and implementation of an object-oriented topologically partitioned parallel automatic test pattern generation (ATPG) system, ES-TPS, on a multiple-instruction multiple-data (MIMD) distributed-memory multipr...
详细信息
Traditionally, register files have been the primary agent for inter-operation communication in load/store architectures. As processors start issuing multiple instructions per cycle, a centralized register file can eas...
详细信息
ISBN:
(纸本)0818631759
Traditionally, register files have been the primary agent for inter-operation communication in load/store architectures. As processors start issuing multiple instructions per cycle, a centralized register file can easily become a bottleneck. this paper analyzes the register file traffic in a load/store architecture with a view to motivate the development of alternate inter-operation communication mechanisms that reduce the bandwidth demanded of a centralized register file. We first provide metrics to characterize the register traffic. these metrics deal withthe degree and locality of use of the register instances created. We then present the results of a simulation study that uses the MIPS R2000 architecture and the SPEC benchmark programs. We have two major results. First, a large number of the register instances are used only once, and the average degree of use of register instances is about 2. Second, most of the register instances are used up soon after they are created (within about 30-40 instructions). this suggests that alternate inter-operation communication mechanisms that exploit the temporal locality of use of register instances are likely to be effective in reducing the traffic burden on the centralized register file. the second result was pivotal in the design of the distributed register file for the multiscalar processing paradigm.
In this paper a new task migration strategy is presented. the input program is supplied as a task graph and an initial static assignment of tasks to processor is given. the dynamic behaviour of the static allocation i...
详细信息
ISBN:
(纸本)0818626720
In this paper a new task migration strategy is presented. the input program is supplied as a task graph and an initial static assignment of tasks to processor is given. the dynamic behaviour of the static allocation is then profiled, and a set of migration destinations and migration thresholds are identified. During any subsequent execution, the profiling based algorithm makes migration decisions solely on the basis of the profiled data and the local processor load. No neighboring processor load information is required. Results are presented for the algorithm, and compared against a dynamic mapping scheme with global knowledge of the multiprocessor state. the migration scheme compared very well, yielding graph execution times within 5% of the global knowledge algorithm on the average, and 18% in the worst case. Furthermore, only two migration destinations per task were required for the test graphs investigated.
the authors describe the overall system design for ImageNet and present a system prototype developed on an Ethernet network in the Computer Engineering Research Laboratory at the University of Arizona. ImageNet is a g...
详细信息
Termination detection is a fundamental problem in distributed computing. Many algorithms have been proposed, but only the Chandrasekaran and Venkatesan (CV) algorithm is known to be optimal in worst-case message compl...
详细信息
ISBN:
(纸本)0818626720
Termination detection is a fundamental problem in distributed computing. Many algorithms have been proposed, but only the Chandrasekaran and Venkatesan (CV) algorithm is known to be optimal in worst-case message complexity. this optimal algorithm, however, has several undesirable properties. First, it always requires M′+2*|E|+n-1 control messages, whether it is worst case or best case, where M′ is the number of basic messages issued by the underlying computation after the algorithm starts, |E| is the number of channels in the system, and n is the number of processes. Second, its worst-case detection delay is O(M′). In a message-intensive computation, that might not be tolerable. third, the maximum amount of space needed by each process is O(M′), a quantity not known at compile time, making it necessary to use the more expensive dynamic memory allocation. Last, it works only for FIFO channels. the purpose of this paper is to remedy these drawbacks, while keeping its strength. We propose an algorithm that requires M′+2(n-1) control messages in the worst case, but much fewer on the average, and in the best case, it uses only 2(n-1) control messages, no matter how large M′ is. the worst-case detection delay is O(n), as compared to the CV-algorithm's O(M′);and the space complexity is Θ(n) for each process, a compile-time known quantity. the algorithm works equally well for both FIFO and Non-FIFO channels.
In practice, it is interesting to map n-dimensional algorithms, or algorithms with n nested loops, onto (k-1)-dimensional arrays where k<n. For example, many algorithms at bit level are at least 4-dimensional (matr...
详细信息
ISBN:
(纸本)0818626720
In practice, it is interesting to map n-dimensional algorithms, or algorithms with n nested loops, onto (k-1)-dimensional arrays where kthms at bit level are at least 4-dimensional (matrix multiplication, convolution, LU decomposition, etc.) and most existing bit level processor arrays are 2-dimensional. A computational conflict occurs if two or more computations of an algorithm are mapped into the same processor and the same execution time. In this paper, some open problems in the previous work are considered. A procedure is proposed to test if or not a given mapping has computational conflicts and a lower bound on the total execution time is provided. Based on the testing procedure and the lower bound, the complexity and the optimality of the optimization procedure in [17] is improved. the integer programming formulation is also discussed and used to find the optimal time mapping for the 5-dimensional bit level matrix multiplication algorithm into a 2-dimensional bit level processor array.
Recently, we have studied two important classes of algorithms requiring ±2b communications: ±2b-descend, and ±2b-ascend. Let N = 2n be the number of PEs in a SIMD hypercube which restricts all communica...
详细信息
ISBN:
(纸本)0818626720
Recently, we have studied two important classes of algorithms requiring ±2b communications: ±2b-descend, and ±2b-ascend. Let N = 2n be the number of PEs in a SIMD hypercube which restricts all communications to a single fixed dimension at a time. In [5], we developed an efficient O(n) algorithm for the descend class. In [6], we obtained a simple O(n2/log n) algorithm for the ascend class, requiring O(log n) words of local memory per PE. In this paper, we present two new algorithms for the ascend class on a SIMD hypercube. the first algorithm runs in O(n1.5) time and requires O(1) space per PE. the second algorithm, which is discussed only briefly here, runs in O(n√n/log n) time and requires O(log n) space per PE.
the authors present an efficient method for tolerating faults in a two-dimensional mesh architecture. the approach is based on adding spare components (nodes) and extra links (edges) such that the resulting architectu...
详细信息
the authors present an efficient method for tolerating faults in a two-dimensional mesh architecture. the approach is based on adding spare components (nodes) and extra links (edges) such that the resulting architecture can be reconfigured as a mesh in the presence of faults. the cost of the fault-tolerant mesh architecture is optimized by adding about one row of redundant nodes in addition to a set of k spare nodes (while tolerating up to k node faults) and minimizing the number of links per node. the results are surprisingly efficient and seem to be practical for small values of k. the degree of the fault-tolerant architecture is k+5 for odd k, and k+6 for even k. the results can be generalized to d-dimensional meshes such that the number of spare nodes is less than the length of the shortest axis plus k, and the degree of the fault-tolerant mesh is (d-1)k+d+3 when k is odd and (d-1)k+2d+2 when k is even.< >
暂无评论