ISBN (print): 9780818675829
Current advances in processor technology and the rapid development of high-speed networking technology, such as ATM, have made high-performance network computing an attractive environment for large-scale high performance distributed computing (HPDC) applications. However, due to the communication overhead at the host-network interface, most HPDC applications do not get the full benefit of high-speed communication networks. This overhead can be attributed to the high cost of operating system calls, context switching, the use of inefficient communication protocols, and the coupling of data and control paths. We present an architecture and implementation for a low-latency, high-throughput message-passing tool, which we refer to as the NYNET (ATM wide area network testbed in New York State) Communication System (NCS), that can support a variety of HPDC applications with different Quality of Service (QoS) requirements. NCS uses multithreading to provide efficient techniques that overlap computation and communication. NCS uses read/write trap routines to bypass traditional operating system calls; this reduces latency and avoids inefficient communication protocols. By separating data and control paths, NCS eliminates unnecessary control transfers, optimizing the data path and improving performance. Benchmarking results show that the performance of NCS is at least a factor of two better than that of the corresponding p4 and PVM primitives.
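To make the overlap idea concrete, the following is a minimal Python sketch (not NCS code; all names are illustrative) of handing sends to a background thread so the main thread keeps computing while messages are drained asynchronously:

    # Illustrative sketch only: overlap computation with communication by
    # delegating sends to a background thread via a queue.
    import queue
    import threading

    def sender_loop(outbox: queue.Queue):
        """Drain the outbox and 'transmit' each message (stubbed with a print)."""
        while True:
            msg = outbox.get()
            if msg is None:          # sentinel: no more messages
                break
            # A real system would write to a network interface here.
            print(f"sent {len(msg)} bytes")

    outbox = queue.Queue()
    sender = threading.Thread(target=sender_loop, args=(outbox,), daemon=True)
    sender.start()

    partial_sums = []
    for chunk in range(4):                      # stand-in for real computation
        result = sum(i * i for i in range(10_000))
        partial_sums.append(result)
        outbox.put(str(result).encode())        # communication overlaps the next chunk

    outbox.put(None)                            # signal completion
    sender.join()
    print("total:", sum(partial_sums))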
The proceedings contain 27 papers. Topics discussed include parallel processing applications, distributed shared memory, recursive refinement algorithms for fast task mapping, scheduling in parallel network processing, software tools, adaptive shaping for bandwidth allocation, block decomposition in cluster computing, telecommunication networks, input/output systems, multimedia and recovery protocols, computer programming, and asynchronous transfer mode.
Several variants of parallel multipole-based algorithms have been implemented to further research in fields such as computational chemistry and astrophysics. We present a distributed parallel implementation of a multipole-based algorithm that is portable to a wide variety of applications and parallel platforms. Performance data are presented for loosely coupled networks of workstations as well as for more tightly coupled distributed multiprocessors, demonstrating the portability and scalability of the application to large numbers of processors.
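For readers unfamiliar with multipole methods, the following Python sketch shows only the simplest (monopole) flavor of the idea, not the authors' implementation: the far-field influence of a distant cluster is approximated by its total mass acting at its centre of mass instead of a direct sum over members.

    # Monopole approximation sketch (1-D, illustrative only).
    def direct_potential(x, cluster):
        """Exact 1/r potential at x from every (mass, position) pair."""
        return sum(m / abs(x - p) for m, p in cluster)

    def monopole_potential(x, cluster):
        """Approximate potential: total mass placed at the centre of mass."""
        total_mass = sum(m for m, _ in cluster)
        com = sum(m * p for m, p in cluster) / total_mass
        return total_mass / abs(x - com)

    cluster = [(1.0, 0.1), (2.0, -0.2), (1.5, 0.05)]   # (mass, position)
    x = 100.0                                          # far-away evaluation point
    print(direct_potential(x, cluster), monopole_potential(x, cluster))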
PVM, a message-passing software system for parallel processing, is used on a wide variety of processor platforms, but this portability restricts execution speed. The work here addresses this problem mainly in the context of Ethernet-based systems, proposing two PVM enhancements for such systems. The first enhancement exploits the fact that an Ethernet has broadcast capability. Since unenhanced PVM must avoid using broadcast to preserve portability, execution speed is sacrificed, and the larger the system, the larger the sacrifice in speed. A solution to this problem is presented. The second enhancement is intended for applications in which many concurrent tasks finish at the same time and thus simultaneously try to transmit to a master process. On an Ethernet, this produces excessively long random backoffs, reducing program speed. An enhancement, termed 'programmed backoff,' is proposed.
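The broadcast idea can be illustrated with ordinary UDP sockets; this sketch is not PVM's API, only an assumption-laden stand-in showing why one Ethernet broadcast can replace N per-host sends when the same data goes to every worker:

    # Sketch: unicast fan-out vs. a single Ethernet broadcast (UDP, illustrative).
    import socket

    def send_unicast(payload: bytes, hosts, port=9999):
        """Baseline: one send per host, as a fully portable library must do."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            for host in hosts:
                s.sendto(payload, (host, port))

    def send_broadcast(payload: bytes, port=9999):
        """Ethernet-aware path: one send reaches all listeners on the segment."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(payload, ("255.255.255.255", port))

    # Example (requires a local network): send_broadcast(b"work unit 7")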
We explore processor-cache affinity scheduling of parallel network protocol processing, in a setting in which protocol processing executes on a shared-memory multiprocessor concurrently with a general workload of non-protocol activity. We find that affinity-based scheduling can significantly reduce the communication delay associated with protocol processing, enabling the host to support a greater number of concurrent streams and to provide higher maximum throughput to individual streams. In addition, we compare the performance of two parallelization alternatives, Locking and Independent Protocol Stacks (IPS), which have very different caching behaviors. We find that IPS (which maximizes cache affinity) delivers much lower message latency and significantly higher message throughput capacity, yet exhibits a less robust response to intra-stream burstiness and limited intra-stream scalability.
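As a rough illustration of the affinity mechanism (not the paper's scheduler), the Linux-only sketch below pins the calling process to a single CPU so that cache-resident protocol state can be reused across invocations:

    # Processor-affinity sketch (Linux only; illustrative, not the paper's code).
    import os

    def pin_to_cpu(cpu: int):
        """Restrict the calling process to one CPU; 0 means 'this process'."""
        os.sched_setaffinity(0, {cpu})

    if hasattr(os, "sched_setaffinity"):      # guard for non-Linux platforms
        pin_to_cpu(0)
        print("now restricted to CPUs:", os.sched_getaffinity(0))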
This paper presents the results of performance analysis of a seismic analysis kernel code on the KSR multiprocessors. The purpose of such analysis is to understand the performance behavior of a class of applications on shared memory parallel machines. The g5 kernel code, commonly used in seismic analysis applications, is parallelized, and its computational and I/O performance is analyzed on a 32-node KSR-1 and a 64-node KSR-2.
distributed systems that consist of workstations connected by high performance interconnects offer computational power comparable to moderate size parallel machines. It is desirable that such workstation clusters can also be programmed the same way as shared memory machines. We develop a portable, user-level library, called Indigo, that can be used to program a variety of state sharing techniques. In particular, Indigo can be used to program DSM protocols as well as distributed shared abstractions where objects can be fragmented/replicated and consistency actions are customized according to application needs. We present an evaluation of Indigo by using its calls to implement a distributed shared memory system as well as shared abstractions for a number of applications.
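A hypothetical sketch of the "shared abstraction with customizable consistency" idea follows; the names are invented for illustration and do not reflect Indigo's actual API. The consistency action is supplied by the application as a hook:

    # Hypothetical shared abstraction: a counter whose consistency action
    # (e.g. broadcast now, or batch and defer) is chosen by the application.
    class SharedCounter:
        def __init__(self, publish):
            self._value = 0
            self._publish = publish          # application-supplied consistency hook

        def add(self, delta: int):
            self._value += delta
            self._publish("add", delta)      # a real replica would apply this update

        def value(self) -> int:
            return self._value

    updates = []
    counter = SharedCounter(publish=lambda op, arg: updates.append((op, arg)))
    counter.add(3)
    counter.add(4)
    print(counter.value(), updates)          # 7, plus the update log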
The Beowulf parallel workstation combines 16 PC-compatible processing subsystems and disk drives using dual Ethernet networks to provide a single-user environment with 1 Gops peak performance, half a Gbyte of disk storage, and up to 8 times the disk I/O bandwidth of conventional workstations. The Beowulf architecture establishes a new operating point in price-performance for single-user environments requiring high disk capacity and bandwidth. The Beowulf research project is investigating the feasibility of exploiting mass market commodity computing elements in support of Earth and space science requirements for large data-set browsing and visualization, simulation of natural physical processes, and assimilation of remote sensing data. This paper reports the findings from a series of experiments characterizing the Beowulf dual-channel communication overhead. It is shown that dual networks can sustain 70% greater throughput than a single network alone, but that the bandwidth achieved is more sensitive to message size than to the number of messages at peak demand. While overhead is shown to be high for global synchronization, its overall impact on the scalability of real-world applications for computational fluid dynamics and N-body gravitational simulation is shown to be modest.
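The dual-channel idea amounts to striping traffic across two networks; the sketch below is only an assumption-based illustration of round-robin striping, not the Beowulf channel-bonding implementation:

    # Striping sketch: alternate message chunks across two "channels" so both
    # networks carry traffic concurrently (illustrative only).
    def stripe(message: bytes, chunk_size: int = 1024):
        """Split a message into chunks and assign them round-robin to two channels."""
        channels = ([], [])
        for i in range(0, len(message), chunk_size):
            channels[(i // chunk_size) % 2].append(message[i:i + chunk_size])
        return channels

    ch0, ch1 = stripe(b"x" * 5000)
    print(len(ch0), len(ch1))                # 3 and 2 chunks respectively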
In recent years a great deal of research has been invested in parallel processing of numerical applications. However, parallel processing of symbolic and AI applications has received less attention. This paper presents a system for parallel symbolic computing, named ACE, based on the logic programming paradigm. ACE is a computational model for the full Prolog language, capable of exploiting Or-parallelism and Independent And-parallelism. In this paper we focus on the implementation of the and-parallel part of the ACE system (called &ACE) on a shared memory multiprocessor, describing its organization and some optimizations, and presenting performance figures that demonstrate the ability of &ACE to efficiently exploit parallelism.
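Independent and-parallelism means that goals sharing no variables can be solved concurrently and their answers combined. The Python stand-in below is only a hedged illustration of that scheduling idea (the "goals" are ordinary functions), not the &ACE Prolog engine:

    # Stand-in for independent and-parallelism: two goals with no shared
    # bindings are evaluated concurrently and their results combined.
    from concurrent.futures import ThreadPoolExecutor

    def goal_ancestors(person):               # hypothetical goal over a tiny database
        parents = {"ann": ["bob"], "bob": ["carol"]}
        out, frontier = [], [person]
        while frontier:
            p = frontier.pop()
            for parent in parents.get(p, []):
                out.append(parent)
                frontier.append(parent)
        return out

    def goal_square(n):                        # a second, independent goal
        return n * n

    with ThreadPoolExecutor() as pool:
        f1 = pool.submit(goal_ancestors, "ann")
        f2 = pool.submit(goal_square, 7)
        print(f1.result(), f2.result())        # ['bob', 'carol'] 49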
The increasing gap between the speed of microprocessors and memory subsystems makes it imperative to exploit locality of reference in sequential irregular applications, and the parallelization of such applications requires special considerations. Current run-time support (RTS) systems for irregular computations fail to exploit the fine-grain regularity present in these applications, producing unnecessary time and memory overheads. PILAR (Parallel Irregular Library with Application of Regularity) is a new RTS for irregular computations that provides a variety of internal representations of communication patterns based on their regularity, allowing for the efficient support of a wide spectrum of regularity under a common framework. Experimental results on the IBM SP-1 and Intel Paragon demonstrate the validity of our approach.
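To show what "choosing a representation based on regularity" might look like, here is a small hedged sketch (not PILAR's internals): an index pattern is classified and given a compact encoding when the pattern allows, falling back to an explicit list only for fully irregular accesses.

    # Regularity-driven representation sketch (illustrative only).
    def classify(indices):
        """Return ('block', start, length), ('strided', start, step, count),
        or ('list', indices) for a fully irregular pattern."""
        diffs = [b - a for a, b in zip(indices, indices[1:])]
        if diffs and all(d == 1 for d in diffs):
            return ("block", indices[0], len(indices))
        if diffs and all(d == diffs[0] for d in diffs):
            return ("strided", indices[0], diffs[0], len(indices))
        return ("list", list(indices))

    print(classify([4, 5, 6, 7]))              # ('block', 4, 4)
    print(classify([0, 3, 6, 9]))              # ('strided', 0, 3, 4)
    print(classify([2, 7, 1, 8]))              # ('list', [2, 7, 1, 8])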