Technology trends make it attractive to use workstations connected by a local area network as a multicomputing platform for parallel applications. Achieving acceptable application performance in such a workstation cluster using commodity components requires good support from system software and network hardware. This paper describes our experience with parallel programming in a workstation cluster and the implications for operating system and network adaptor design. Our cluster consists entirely of commercially available hardware: 24 Alpha workstations connected by a 155 Mbit/s AN2 ATM network. We present results from running several parallel applications on this cluster. The cluster demonstrates both excellent low-level communication performance and good overall application performance that compares favorably with dedicated multicomputers such as the IBM SP2.
Parallel input/output (I/O) workload characterization studies are necessary to better understand the factors that dominate performance. When translated into system design principles, this knowledge can lead to higher performance/cost systems. In this paper we present the experimental results of an I/O workload characterization study of NASA Earth and Space Sciences (ESS) applications. Measurements were collected using device driver instrumentation. Baseline measurements, with no workload, and measurements during regular application runs were collected, analyzed, and correlated. We show how the observed disk I/O can be identified as block transfers, page requests, and cache activity, and how the ESS applications are characterized by a high degree of spatial and temporal locality.
This paper addresses the issue of dynamic load imbalance in a class of synchronous iterative applications and develops a model to represent their workload dynamics. Such models of application load dynamics help in more accurate performance prediction and in the design of efficient load-balancing algorithms. Our model captures the workload dynamics across iterations and predicts the workload distribution at any given iteration as the cumulative effect of workload dynamics during the preceding iterations. The model parameters are derived using empirical data from initial runs of the application. The model development is illustrated using data from a parallel N-body simulation application.
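The modeling idea above — fit per-iteration workload changes from a few initial runs, then predict a future iteration's distribution as their cumulative effect — can be sketched as follows. The function names, the two-processor trace, and the linear-delta assumption are all illustrative, not taken from the paper.

```python
# Hypothetical sketch of the kind of model the abstract describes: predict the
# per-processor workload at iteration t as the initial workload plus the
# cumulative effect of per-iteration deltas estimated from early iterations.

def fit_deltas(observed):
    """Estimate the mean per-iteration workload change for each processor
    from a short trace of initial iterations (list of per-processor loads)."""
    n_iters = len(observed) - 1
    n_procs = len(observed[0])
    return [
        sum(observed[t + 1][p] - observed[t][p] for t in range(n_iters)) / n_iters
        for p in range(n_procs)
    ]

def predict(initial, deltas, iteration):
    """Predict the workload distribution at a future iteration as the
    cumulative effect of the estimated deltas."""
    return [w0 + d * iteration for w0, d in zip(initial, deltas)]

# Made-up trace from three initial iterations on two processors: processor 0
# gains work each iteration while processor 1 loses it.
trace = [[100.0, 100.0], [110.0, 95.0], [120.0, 90.0]]
deltas = fit_deltas(trace)           # [10.0, -5.0]
print(predict(trace[0], deltas, 5))  # [150.0, 75.0]
```

A real model of this kind would track richer dynamics than a constant drift, but the structure — parameters from initial runs, prediction by accumulation — is the same.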
Existing techniques for sharing the processing resources in multiprogrammed shared-memory multiprocessors, such as time-sharing, space-sharing, and gang-scheduling, typically sacrifice the performance of individual parallel applications to improve overall system utilization. We present a new processor allocation technique that dynamically adjusts the number of processors an application is allowed to use for the execution of each parallel section of code, based on the current system load. This approach exploits the maximum parallelism possible for each application without overloading the system. We implement our scheme on a Silicon Graphics Challenge multiprocessor system and evaluate its performance using applications from the Perfect Club benchmark suite and synthetic benchmarks. Our approach shows significant improvements over traditional time-sharing and gang scheduling. It has performance comparable to, or slightly better than, static space-sharing, but our strategy is more robust since, unlike static space-sharing, it does not require a priori knowledge of the applications' parallelism characteristics.
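The allocation rule described above — grant each parallel section as many processors as it can use without overloading the system — can be sketched roughly as below. This is an illustration of the policy, not the paper's actual SGI Challenge implementation; the function name and parameters are assumptions.

```python
# Illustrative per-section allocation rule: exploit the section's full
# parallelism when the system is idle, shrink the grant as load rises,
# and never drop below one processor so the job still makes progress.

def allocate(section_parallelism, total_procs, busy_procs):
    """Return the number of processors to grant a parallel section,
    given the machine size and the current system load."""
    free = max(total_procs - busy_procs, 1)
    return min(section_parallelism, free)

print(allocate(16, 8, 0))  # idle system: capped by machine size -> 8
print(allocate(16, 8, 6))  # loaded system: only 2 processors free -> 2
print(allocate(16, 8, 8))  # saturated system: still make progress -> 1
```

Because the decision is re-made at every parallel section, the grant tracks load changes automatically, which is what removes the need for a priori knowledge of each application's parallelism.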
Scalability has never been more a part of System/390 than with Parallel Sysplex. The Parallel Sysplex environment permits a mainframe or Parallel Enterprise Server to grow from a single system to a configuration of 32 systems (initially) and appear as a single image to the end user and applications. The IBM S/390 Parallel Sysplex provides capacity for today's largest commercial workloads by enabling a workload to be spread transparently across a collection of S/390 systems with shared access to data. By way of its parallel architecture and MVS operating system support, the S/390 Parallel Sysplex offers near-linear scalability and continuous availability for customers' mission-critical applications. S/390 Parallel Sysplex optimizes responsiveness and reliability by distributing workloads across all of the processors in the Sysplex. Should one or more processors fail, the workload is redistributed across the remaining processors. Because all of the processors have access to all of the data, the Parallel Sysplex provides a computing environment with near-continuous availability.
We address the problem of maximizing application speedup through runtime, self-selection of an appropriate number of processors on which to run. Automatic, runtime selection of processor allocations is important because many parallel applications exhibit peak speedups at allocations that are data or time dependent. We propose the use of a runtime system that: (a) dynamically measures job efficiencies at different allocations, (b) uses these measurements to calculate speedups, and (c) automatically adjusts a job's processor allocation to maximize its speedup. Using a set of 10 applications that includes both hand-coded parallel programs and compiler-parallelized sequential programs, we show that our runtime system can reliably determine dynamic allocations that match the best possible static allocation, and that it has the potential to find dynamic allocations that outperform any static allocation.
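The core of the measure-and-adjust loop in steps (a)–(c) can be sketched as below: efficiency is measured at candidate allocations, speedup is derived as efficiency times processor count, and the allocation with the best measured speedup is selected. The efficiency figures are made up for illustration, and the paper's actual runtime system adjusts allocations incrementally rather than from a complete table.

```python
# Sketch of selecting the allocation that maximizes measured speedup.
# speedup(p) = efficiency(p) * p, so a job that is 45% efficient on 8
# processors still beats one that is 80% efficient on 4 (3.6x vs 3.2x).

def best_allocation(efficiency_at):
    """Pick the processor count maximizing speedup = efficiency * p from a
    dict of measured efficiencies keyed by allocation size."""
    return max(efficiency_at, key=lambda p: efficiency_at[p] * p)

measured = {1: 1.00, 2: 0.95, 4: 0.80, 8: 0.45, 16: 0.20}
p = best_allocation(measured)
print(p, measured[p] * p)  # 8 3.6 -- speedup peaks at 8, then falls at 16
```

Note the non-monotonic speedup curve: 16 processors yield only 3.2x, which is exactly the data-dependent peak behavior that makes static allocation unreliable.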
Based on the reconfigurable array of processors with wider bus networks [8], we propose a series of algorithms for image processing. Conventionally, only one bus connects each pair of processors, but this machine provides a set of buses between them; this characteristic greatly increases the machine's computational power. Based on the base-m number system, we first introduce some basic operation algorithms. Three related applications are then derived in constant time: the histogram of an image, image segmentation, and image labeling.
Programming distributed-memory machines requires careful placement of data to balance the computational load among the nodes and minimize excess data movement between the nodes. Most current approaches to data placement require the programmer or compiler to place data initially and then possibly to move it explicitly during a computation. This paper describes a new, adaptive approach. It is implemented in the Adapt system, which takes an initial data placement, efficiently monitors how well it performs, and changes the placement whenever the monitoring indicates that a different placement would perform better. Adapt frees the programmer from having to specify data placements, and it can use run-time information to find better placements than compilers. Moreover, Adapt automatically supports a 'variable block' placement, which is especially useful for applications with nearest-neighbor communication but an imbalanced workload. For applications in which the best data placement varies dynamically, using Adapt can lead to better performance than using any statically determined data placement.
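A 'variable block' placement of the kind mentioned above can be illustrated with a small sketch: given per-row costs observed by monitoring, split the rows into contiguous blocks so each node receives roughly equal work. The greedy splitting rule and all names here are assumptions for illustration, not the Adapt system's actual algorithm.

```python
# Illustrative variable-block partitioner: contiguous row blocks (which
# preserve nearest-neighbor communication) with boundaries placed so that
# each node's block carries roughly its fair share of the measured cost.

def variable_blocks(row_costs, n_nodes):
    """Greedily assign contiguous rows to nodes, starting a new block once a
    node has accumulated its fair share of the total cost."""
    target = sum(row_costs) / n_nodes
    blocks, current, acc = [], [], 0.0
    for i, cost in enumerate(row_costs):
        current.append(i)
        acc += cost
        if acc >= target and len(blocks) < n_nodes - 1:
            blocks.append(current)
            current, acc = [], 0.0
    blocks.append(current)
    return blocks

# Imbalanced workload: later rows are four times as expensive.
costs = [1, 1, 1, 1, 4, 4, 4, 4]
print(variable_blocks(costs, 2))  # [[0, 1, 2, 3, 4, 5], [6, 7]]
```

A fixed equal-size block split ([0..3] vs [4..7]) would load the nodes 4 versus 16; the variable split above yields 12 versus 8, and re-running the partitioner as monitored costs drift is what makes the placement adaptive.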
This paper presents experimental results that characterize the performance of the integrated synchronization and consistency protocol used in the implementation of Jade, an implicitly parallel language for coarse-grain parallel computation. The consistency protocol tags each replica of shared data with a version number. The synchronization algorithm computes the correct version numbers of the replicas of shared data that the computation will access. Because the protocol piggybacks the version number information on the synchronization messages, it generates fewer messages than standard update and invalidate protocols. This paper characterizes the performance impact of the consistency protocol by presenting experimental results for several Jade applications running on the iPSC/860 under several different Jade implementations.
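The version-number idea can be sketched minimally as follows: each replica carries a version, the synchronization message carries the version a task must observe, and the replica is refreshed only when stale. This is an assumed illustration of the mechanism, not Jade's actual implementation; the class and method names are invented.

```python
# Minimal sketch of piggybacked version-number consistency: because the
# required version arrives on the synchronization message itself, no
# separate update or invalidate messages are needed -- a replica fetches
# fresh data only when its local version is behind.

class Replica:
    def __init__(self):
        self.version = 0
        self.fetches = 0  # data transfers that would cross the network

    def ensure(self, required_version, fetch):
        """Refresh the local copy only if the version piggybacked on the
        synchronization message is newer than the version we hold."""
        if self.version < required_version:
            self.data = fetch()
            self.version = required_version
            self.fetches += 1
        return self.data

r = Replica()
r.ensure(1, lambda: "v1")  # stale: one fetch
r.ensure(1, lambda: "v1")  # already current: no network traffic
r.ensure(2, lambda: "v2")  # stale again: second fetch
print(r.fetches)  # 2
```

The second `ensure` call is the message the protocol saves: an invalidate-based protocol would have to communicate on every potential access, while here the piggybacked version proves the local copy is still valid.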
We introduce dag consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented dag consistency in software for the Cilk multithreaded runtime system running on a Connection Machine CM5. Our implementation includes a dag-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of dag consistency for applications that include blocked matrix multiplication, Strassen's matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performance is competitive with statically scheduled implementations in the literature. We also prove that the number F_P of page faults incurred by a user program running on P processors is related to the number F_1 of page faults incurred running serially by the formula F_P ≤ F_1 + 2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.