We have compared various load balancing strategies for the program which was developed by our group to simulate nanoscale measurements of material properties. Among the strategies we have examined, static block distri...
详细信息
ISBN:
(纸本)076952138X
We have compared various load balancing strategies for the program which was developed by our group to simulate nanoscale measurements of material properties. Among the strategies we have examined, static block distributions, static cyclic distributions, self-adapting block distributions, dynamic load balancing, and hybrid load balancing, the hybrid approach has been proven to give the best performance for the program on Hitachi SR8000.
Given an array of positive and negative values, we consider the problem of K maximum sums. When an overlapping property needs to be observed, previous algorithms for the maximum sum are not directly applicable. We des...
详细信息
ISBN:
(纸本)0769521355
Given an array of positive and negative values, we consider the problem of K maximum sums. When an overlapping property needs to be observed, previous algorithms for the maximum sum are not directly applicable. We designed an O(K * n) algorithm for the K maximum subsequences problem. this was then modified to solve the K maximum subarrays problem in O(K * n/sup 3/) time. Finally, we present a VLSI K maximum subarrays algorithm with O(K * n) steps and a circuit size of O(n/sup 2/), which is cost-optimal in parallelisation of the sequential algorithm.
Technology trends present new challenges for processor architectures and their instruction schedulers. Growing transistor density will increase the number of execution units on a single chip, and decreasing wire trans...
详细信息
ISBN:
(纸本)0769522297
Technology trends present new challenges for processor architectures and their instruction schedulers. Growing transistor density will increase the number of execution units on a single chip, and decreasing wire transmission speeds will cause long and variable on-chip latencies. these trends will severely limit the two dominant conventional architectures: dynamic issue superscalars, and static placement and issue VLIWs. We present a new execution model in which the hardware and static scheduler instead work cooperatively, called Static Placement Dynamic Issue (SPDI). this paper focuses on the static instruction scheduler for SPDI. We identify and explore three issues SPDI schedulers must consider - locality, contention, and depth of speculation. We evaluate a range of SPDI scheduling algorithms executing on an Explicit Data Graph Execution (EDGE) architecture. We find that a surprisingly simple one achieves an average of 5.6 instructions-per-cycle (IPC) for SPEC2000 64-wide issue machine, and is within 80% of the performance without on-chip latencies. these results suggest that the compiler is effective at balancing on-chip latency and parallelism, and that the division of responsibilities between the compiler and the architecture is well suited to future systems.
For the parallel tasks represented by the directed acyclic graph (DAG), if it is linearly clustered, the ordering of the execution time of the tasks in each cluster is based on their arrows in the DAG. But for nonline...
详细信息
For the parallel tasks represented by the directed acyclic graph (DAG), if it is linearly clustered, the ordering of the execution time of the tasks in each cluster is based on their arrows in the DAG. But for nonlinearly clustering, the ordering of the independent tasks in each cluster is not easily decided. Improper ordering of these independent tasks will greatly increase the scheduling length of the DAG. We discuss the shortcomings of current scheduling algorithms and the reason behind poor performance, and then propose some new node information to be extracted which is used by a new independent tasks scheduling algorithm based on the maximized parallelism degree (MPD). Experimental results show that the MPD algorithm can yield better performance than the previous algorithms.
Integer mapping is critical for lossless source coding and the techniques have been used for image compression in the new international image compression standard, JPEG 2000. In this paper, from block factorizations f...
详细信息
ISBN:
(纸本)0769521355
Integer mapping is critical for lossless source coding and the techniques have been used for image compression in the new international image compression standard, JPEG 2000. In this paper, from block factorizations for any nonsingular transform matrix, we introduce two types of parallel elementary reversible matrix (PERM) factorizations which are helpful for the parallelization of perfectly reversible integer transforms. With improved degree of parallelism (DOP) and parallel performance, the cost of multiplication and addition can be respectively reduced to O(logN) and O(log2N) for an N-by-N transform matrix. these make PERM factorizations an effective means of developing parallel integer transforms for large matrices. Besides, we also present a scheme to block the matrix and allocate the load of processors for efficient transformation.
Due to advances in fiber-optics and VLSI technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. this paper presents the multiprocessor architecture of the Simultaneous...
详细信息
Due to advances in fiber-optics and VLSI technology, interconnection networks which allow multiple simultaneous broadcasts are becoming feasible. this paper presents the multiprocessor architecture of the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus), and examines the performance of representative algorithms for matrix operations, merging and sorting. using the message-passing and distributed-shared-memory paradigms. It shows that simple enhancements to the network interface and the cache and directory controllers can result in communication time of 0(l) for the matrix-vector multiplication algorithm using DSM. the SOME-Bus is a low-latency, high-bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention, and can efficiently interconnect over 100 nodes. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidththat scales directly withthe number of nodes in the system. Each of P nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or due to contention for shared switching logic. the entire P receiver array can be integrated on a single chip at a comparatively minor cost resulting in O(P) complexity. the SOME-Bus has much more functionality than a crossbar by supporting multiple simultaneous broadcasts of messages, allowing cache consistency protocols to complete much faster. (C) 2003 Elsevier B.V. All rights reserved.
As CLUMPS become the main stream of clusters and the number of nodes in a cluster increases, it requires enhancing the bandwidth performance and availability of the communication system used in clusters. parallel comm...
详细信息
ISBN:
(纸本)076952138X
As CLUMPS become the main stream of clusters and the number of nodes in a cluster increases, it requires enhancing the bandwidth performance and availability of the communication system used in clusters. parallel communication based on multiple system area networks (SANs) can fulfill the requirements. this paper introduces the parallel communication protocol used in BCL-4, which is a high efficient communication system used in DAWNING-4000A, a large-scale LINUX cluster. It dispatches small messages and sub-messages stripped from large messages into multiple SANs and maintains the communication semantics as before. the parallel communication process is transparent to both users and the control program on network interface card (NIC). It also provides an efficient load balance mechanism. Using the parallel communication protocol, BCL-4 provides many key features, such as multiple throughput, high availability, and backward compatibility. the experimental results show that the peak bandwidth of BCL-4 over two Myrinet is 494.7MB/s, which is almost twice of that over one, and that there is only 0.02us overhead of short message at the same time.
Summary form only given. We consider the problem of querying large scale multidimensional time series data to discover events of interest, test and validate hypotheses, or to associate temporal patterns with specific ...
详细信息
Summary form only given. We consider the problem of querying large scale multidimensional time series data to discover events of interest, test and validate hypotheses, or to associate temporal patterns with specific events. this type of data currently dominates most other types of available data, and will very likely become even more prevalent in the future given the current trends in collecting time series of business, scientific, demographic, and simulation data. the ability to explore such collections interactively, even at a coarse level, will be critical in discovering the information and knowledge embedded in such collections. We develop indexing techniques and search algorithms to efficiently handle temporal range value querying of multidimensional time series data. Our indexing uses linear space data structures that enable the handling of queries in I/O time that is essentially the same as that of handling a single time slice, assuming the availability of a logarithmic number of processors as a function of the temporal window. A data structure with provably almost optimal asymptotic bounds is also presented for the case when the number of multidimensional objects is relatively small. these techniques improve significantly over standard techniques for either serial or parallelprocessing, and are evaluated by extensive experimental results that confirm their superior performance.
In the last decade various soft computing techniques have been developed. they include neural networks, fuzzy systems, evolutionary algorithms, rough sets and others. In many applications it is desirable that soft com...
详细信息
We discuss the problem of finding a dominating sequence for sending the input data items from a low-end client to a server for computational intensive tasks under the realistic assumption of unpredictable communicatio...
详细信息
ISBN:
(纸本)0769521355
We discuss the problem of finding a dominating sequence for sending the input data items from a low-end client to a server for computational intensive tasks under the realistic assumption of unpredictable communication behaviors. Under the assumption, the client sends the input data items using a specified sequence to maximum the number of computations performed at the server at any moment. the sequence-finding problem is NP-hard for the general case. In this paper, we address two fundamental and useful applications: matrix multiplication and fast Fourier transform. We have shown that the sequence-finding problems of the applications can be solved optimally in linear time. However, we have also shown counter examples to rule out any possibility of finding a dominating sequence for the sparse cases. Finally, a simulation is conducted to show the correctness and usefulness of our results.
暂无评论