Performance debugging and prediction for parallel systems is a difficult problem. The difficulties in identifying performance bottlenecks stem from the need for an intimate understanding of the underlying architecture. It has been recognized that portability is an important requirement for parallel program development. However, this makes the task of performance debugging even more difficult. In this paper, we present a simulation-based approach for performance prediction of portable parallel programs. We demonstrate it using Charm, a message-driven programming environment that provides program portability across a variety of shared and distributed memory MIMD parallel systems. The proposed approach makes it possible to use a single debugging environment for the development of portable parallel software. This environment can provide correctness and performance debugging support that gives the developer valuable feedback for improving program performance.
This paper considers the current state of software engineering for parallel systems. A review of existing approaches and techniques identifies inadequacies. Recent work on design, verification and automated support is outlined. The next generation of embedded and distributed technologies will compound the problems through increased demand and diversity. This paper discusses the implications for the progression of current techniques into new methods for future software engineering of parallel systems.
A large number of variations of distributed simulation protocols have been proposed in the literature. Their performances, however, could not be compared directly, due to different implementation strategies, different...
Performance prediction should be included in the compilation process of second-generation supercompilers to create information feedback for improving parallel programs. However, application programs might consist of a number of procedures calling one another and creating a complex call structure. Every procedure consists of several parallel processes with more or less intensive interprocess communication. Such a parallel system represents a very time-consuming workload for the performance analysis tool. We have designed a hierarchical strategy for analysis of the program call structure, which significantly reduces the total time necessary for the performance prediction of the whole parallel program.
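One way to picture why a hierarchical treatment of the call structure saves analysis time is the sketch below. It is not the authors' tool: it assumes an acyclic call graph, a hypothetical per-procedure local cost, and hypothetical call counts, and it simply analyses each procedure once and reuses the result at every call site.

```python
from functools import lru_cache

# Hypothetical call structure (not from the paper):
# procedure -> (local cost estimate, [(callee, number of calls), ...])
CALL_GRAPH = {
    "main": (5.0, [("solve", 2), ("report", 1)]),
    "solve": (10.0, [("kernel", 4)]),
    "report": (1.0, [("kernel", 1)]),
    "kernel": (3.0, []),
}

# Records which procedures were actually analysed, to show the reuse.
analysis_runs = []


@lru_cache(maxsize=None)
def predicted_time(proc):
    """Predict the time of `proc` bottom-up; each procedure is analysed once.

    Assumes the call graph has no recursion (no cycles).
    """
    analysis_runs.append(proc)
    local_cost, calls = CALL_GRAPH[proc]
    return local_cost + sum(count * predicted_time(callee) for callee, count in calls)


total = predicted_time("main")
print(total, analysis_runs)
```

Here `kernel` is called from two procedures but analysed only once, which is the kind of saving a hierarchical analysis aims for.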
Recent developments have greatly reduced network latencies in multiprocessor networks. Thus, software overhead is becoming the primary cost of multiprocessor communication. This paper proposes data streaming, a technique that places explicit send and receive instructions in the user code, as a means to cut software overhead to a minimum. Data streaming has the added benefit that it can tighten the coupling between processors by reducing the message size to that of a single data item. This paper presents experimental results that indicate data streaming can cut software overhead to less than one instruction per byte of data transmitted.
This paper considers the coupled design problems of processor specification and task allocation for embedded multicomputer systems. A packing-based representation is proposed that allows the problems to be solved concurrently. An algorithm based on this representation is described that utilizes a new heuristic packing technique coupled with an incremental design advisor. This algorithm, named IDAT, was benchmarked against three baseline algorithms on a combination of real and synthetic test cases with respect to two figures of merit: hardware cost and run-time. The real test cases are based on commercially developed automotive electronic applications and the baseline algorithms represent a mixture of search, heuristic and simulated annealing approaches. For all test cases, the IDAT algorithm was found to generate near-optimal solutions with up to three orders of magnitude improvement in run-time compared to the baseline algorithms.
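The abstract does not describe IDAT's heuristic in detail, so the sketch below swaps in a standard first-fit-decreasing bin packing purely to illustrate what a packing view of task allocation looks like. The task loads, the single fixed processor capacity, and the function name are all assumptions; the processor specification half of the problem is not modelled.

```python
def first_fit_decreasing(task_loads, capacity):
    """Pack task loads onto identical processors using first-fit decreasing.

    This is a textbook baseline heuristic, not the IDAT algorithm.
    Returns a list of processors, each a list of the loads assigned to it.
    """
    processors = []
    for load in sorted(task_loads, reverse=True):
        if load > capacity:
            raise ValueError(f"task load {load} exceeds processor capacity {capacity}")
        for assigned in processors:
            if sum(assigned) + load <= capacity:
                assigned.append(load)
                break
        else:
            # No existing processor has room, so add another one.
            processors.append([load])
    return processors


# Hypothetical task loads as a percentage of one processor's capacity.
allocation = first_fit_decreasing([50, 70, 20, 40, 10, 30], capacity=100)
print(allocation)
```

The number of processors opened stands in for hardware cost here; a real design flow would also choose among processor types.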
In single-system UNIX, successful completion of a write system call implies a guarantee of adequate disk space for any new pages created by the system call. To support such a guarantee in a distributed file system, designers need to solve the problems of accurately estimating the space needed, limiting communication overhead, and providing fault tolerance. In the Calypso file system, which is a cluster-optimized, distributed UNIX file system, we solve these problems using an advance-reservation scheme. Measurements show that the overhead of this scheme for typical UNIX usage patterns is 1% to 3%.
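The general idea of advance reservation can be sketched as follows. This is not Calypso's protocol: the class names, the batch size, and the block-counting model are hypothetical, and fault tolerance is left out entirely. The point is only that a client reserves space ahead of time, so most writes can be guaranteed locally without contacting the server.

```python
class SpaceServer:
    """Hypothetical server that owns the pool of free disk blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks

    def reserve(self, blocks):
        # Grant as much of the request as the free pool allows.
        granted = min(blocks, self.free_blocks)
        self.free_blocks -= granted
        return granted


class Client:
    """Hypothetical client that reserves blocks in batches ahead of its writes."""

    def __init__(self, server, batch=8):
        self.server = server
        self.batch = batch
        self.reserved = 0   # blocks reserved but not yet consumed
        self.requests = 0   # number of round trips to the server

    def write(self, new_blocks):
        if self.reserved < new_blocks:
            # Reservation too small: ask for a batch, or the shortfall if larger.
            self.requests += 1
            shortfall = new_blocks - self.reserved
            self.reserved += self.server.reserve(max(self.batch, shortfall))
        if self.reserved < new_blocks:
            return False  # would correspond to an out-of-space error
        self.reserved -= new_blocks
        return True


server = SpaceServer(free_blocks=10)
client = Client(server, batch=8)
results = [client.write(3) for _ in range(4)]
print(results, client.requests)
```

In this toy run the second write succeeds without any server round trip, because the first reservation already covered it.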
Most studies of resource management in multiprogrammed parallel systems have ignored the I/O performed by applications. Recent studies have demonstrated that significant I/O operations are performed by a number of different classes of parallel applications. This paper focuses on some basic issues that underlie I/O management and system performance in multiprogrammed parallel environments that run applications with I/O. Characterization of the I/O behavior of parallel applications is discussed first, followed by an investigation of three different I/O management strategies. Based on simulation models, this research demonstrates a strong relationship among I/O characteristics of applications, I/O management strategies, and system performance. For example, using CPU-I/O overlap in applications and I/O management strategies that incorporate data replication are found to be beneficial for a variety of different multiprogrammed parallel environments.
The software crisis is defined as the inability to meet the demands for new software systems, due to the slow rate at which systems can be developed. To address the crisis, object-based design and implementation techniques and domain models have been developed. However, object-based techniques do not address an additional problem that plagues systems engineers: the effective utilization of distributed and parallel hardware platforms. This problem is partly addressed by program partitioning languages that allow engineers to specify how software components should be partitioned and assigned to the nodes of concurrent computers. However, very little has been done to automate the tasks of partitioning and assignment at the task and object level of granularity. Thus, this paper describes automated techniques for distributed/parallel configuration of object-based applications, and demonstrates the technique on Ada programs. The granularity of partitioning is at the level of the Ada program unit (a program unit is an object, a class, a task, a package (possibly a generic template) or a subprogram). The partitioning is performed by constructing a call-rendezvous graph (CRG) for an application program. The nodes of the graph represent the program units, and the edges denote call and task interaction/rendezvous relationships. The CRG is augmented with edge weights depicting inter-program-unit communication relationships and concurrency relationships, resulting in a weighted CRG (WCRG). The partitioning algorithm repeatedly "cuts" edges of the WCRG with the goal of producing a set of partitions among which (1) there is a small amount of communication and (2) there is a large degree of potential for concurrent execution. Following the partitioning of the WCRG into tightly coupled clusters, a random neural network is employed to assign clusters to physical processors.
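The edge-cutting step can be illustrated with a minimal sketch. The abstract does not say how edges are chosen or how communication and concurrency are combined into a weight, so this assumes a single hypothetical weight per edge (higher means the two units should stay together) and a greedy rule that cuts the weakest edge until the requested number of clusters appears. The program unit names are made up, and the neural network assignment step is not modelled.

```python
def components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            node = stack.pop()
            if node in part:
                continue
            part.add(node)
            stack.extend(adjacency[node] - part)
        seen |= part
        parts.append(part)
    return parts


def partition(nodes, weighted_edges, k):
    """Greedily cut the lowest-weight edge until at least k clusters exist.

    `weighted_edges` maps (u, v) pairs to a hypothetical coupling weight.
    """
    remaining = dict(weighted_edges)
    parts = components(nodes, remaining)
    while len(parts) < k and remaining:
        weakest = min(remaining, key=remaining.get)
        del remaining[weakest]
        parts = components(nodes, remaining)
    return parts


# Hypothetical weighted graph over four program units.
units = ["A", "B", "C", "D"]
wcrg = {("A", "B"): 9, ("B", "C"): 1, ("C", "D"): 8, ("A", "C"): 2}
clusters = partition(units, wcrg, k=2)
print(clusters)
```

The two weakly weighted edges are cut, leaving the tightly coupled pairs `A`, `B` and `C`, `D` as separate clusters.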