ISBN: 0769517722 (print)
In this work we investigate the feasibility of using a cluster of PCs built with mass-market networks to meet the needs of the CFD community, in particular for unstructured implicit CFD solvers that require very irregular communication patterns. This paper reports the initial findings from a series of experiments with some well-known benchmarks to determine the sensitivity of CFD applications to machine communication parameters. This is done by running these benchmarks on a cluster in which the communication network has been modified to allow an increase in bandwidth by adding multiple channels and a reduction in latency by using a lightweight protocol like the MV, a.
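As a rough illustration of why both parameters matter, the sketch below uses the common linear cost model (transfer time = latency + message size / bandwidth). It is not taken from the paper: the function name, the example numbers, and the assumption that extra channels scale bandwidth ideally are all hypothetical.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s, channels=1):
    """Estimate the time to send one message under a linear cost model.

    Assumption (not from the paper): adding channels multiplies the
    usable bandwidth ideally and leaves latency unchanged.
    """
    return latency_s + message_bytes / (bandwidth_bytes_per_s * channels)


# Hypothetical network: 100 microseconds latency, 12.5 MB/s per channel.
for size in (1250, 1_250_000):
    one = transfer_time(size, 100e-6, 12.5e6)
    two = transfer_time(size, 100e-6, 12.5e6, channels=2)
    print(size, one, two)
```

Under this model, small messages are dominated by the latency term, so extra channels help them little, while large messages benefit almost proportionally from the added bandwidth.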
The list of applications requiring high performance computing resources is constantly growing. The cost of inter-processor communication is critical in determining the performance of massively parallel computing syste...
ISBN: 0769517722 (print)
Trace-driven simulation is a commonly used tool to evaluate memory-hierarchy designs. Unfortunately, trace collection is very expensive, and storage requirements for traces are very large. In this paper, we introduce HACS (Hardware Accelerated Cache Simulator), and describe the validation methods we used to demonstrate functionality. We also present some initial cache simulation results from SPECint 2000. We then propose future directions for research with HACS.
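For readers unfamiliar with the technique, a software trace-driven simulation can be sketched in a few lines. The example below replays a list of addresses through a direct-mapped cache and counts hits and misses. It does not represent HACS: the function name, cache geometry, and trace are illustrative assumptions.

```python
def simulate_direct_mapped(trace, num_lines=4, line_size=16):
    """Replay byte addresses through a direct-mapped cache.

    Returns (hits, misses). The cache geometry is an illustrative
    assumption, not a configuration taken from the paper.
    """
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in trace:
        block = addr // line_size      # memory block holding this address
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # distinguishes blocks sharing a line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses


example_trace = [0, 4, 16, 0, 64, 0]   # hypothetical address trace
print(simulate_direct_mapped(example_trace))  # prints (2, 4)
```

Addresses 0 and 64 map to the same line here, so the final access to 0 misses again. Real traces contain a very large number of references, which is the collection and storage cost the abstract refers to.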
With increasing uniprocessor and SMP computation power, workstation clusters are becoming viable alternatives to high performance computing systems. Communication overhead affects the performance of parallel computers...
ISBN: 0769517722 (print)
The limited amount of instruction-level parallelism inherent in applications is a limiting factor for improving the performance of most conventional microprocessors. A promising solution to overcome this problem is to exploit coarser granularities of parallelism. In this paper we propose exploiting loop-level parallelism in a multithreaded fashion. We use the Shift architecture [9] as a baseline architecture, with improved compiler support and register file. The compiler converts iterations of a loop into threads, to be executed by multiple processing elements. The hardware provides a selective register shifting mechanism in order to allow the execution of loops containing loop-carried data dependences, which are very difficult to execute using conventional architectures. In this paper we simulate and discuss the parameters of major importance for the implementation of this architectural approach. Our initial results show that, on two simple numerical benchmarks, a considerable amount of iteration overlapping can potentially be achieved by an implementation of the Shift architecture, in comparison with a multiprocessor machine.
ISBN: 0769516866 (print)
With the advent of Grid computing, scheduling strategies for distributed heterogeneous systems have either become irrelevant or have to be extended significantly to support Grid dynamics. In this paper, we describe a metascheduling architecture for a Grid system that takes into account both the application and system level considerations. Results are presented to demonstrate the usefulness of the metascheduler.
ISBN: 0769517722 (print)
This paper presents T&D-Bench, an integrated suite of tools for modeling and simulating state-of-the-art processors, which is composed of two main parts. SimPL is an object-oriented methodology for modeling the behavior of an instruction set, with precise information on the timing of basic instruction steps. The methodology is general and allows easy modeling of various architecture types. The second part of the suite is CSPSim, an open set of visualization tools that communicate with any number of SimPL models based on a client-server architecture. T&D-Bench gathers the main advantages of teaching environments, with a rich user interface, and design environments, with resources for modeling any complex processor architectures.
This paper describes a new parallel architectural system which we called Hybrid System. As the name implies, Hybrid System is a combination of both SIMD and MIMD systems working concurrently. This new parallel archite...
ISBN: 0769517722 (print)
The gap between memory and processor speeds is responsible for the substantial amount of idle time of current processors. To reduce the impact provoked by the so-called "memory gap problem," many software techniques (e.g., the code layout reorganization) together with hardware mechanisms (cache memory, translation look-aside buffer, branch prediction, speculative execution, trace cache, instruction reuse, and so on) have been successfully implemented. In this paper we present some experiments that explain why these mechanisms and techniques are so efficient. We found that only a small fraction of the object code is actually executed: our experiments disclosed that more than 50% of the instructions remain untouched during the whole execution, and the percentages of basic blocks which remain unused are slightly greater. In addition to the usage of instructions and blocks, the paper provides further insights regarding the behavior of application programs, and gives some suggestions for extra performance gains.
ISBN: 0769516866 (print)
This study presents a technique that can significantly improve the performance of a distributed application by allowing the application to locally adapt to architectural characteristics of distinct resources in a distributed system. Application performance is sensitive to system architecture-application parameter pairings. In a distributed or Grid-enabled application, a single parameter configuration for the whole application will not always be optimal for every participating resource. In particular, some configurations can significantly degrade performance. Furthermore, the behavior of a system may change during the course of the run. The technique described here provides an automated mechanism for run-time adaptation of application parameters to the local system architecture. Using a scaled-down simulation of a Monte Carlo physics code, we demonstrate that this technique can conservatively achieve speedups up to 65% on individual resources and may even provide order-of-magnitude speedup in the extreme case.
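The idea of per-resource adaptation can be sketched generically: each resource evaluates a few candidate parameter values against its own measured cost and keeps the best one. The abstract does not describe the actual mechanism, so everything below (the function name, the candidate block sizes, and the two synthetic cost models standing in for timing measurements) is a hypothetical illustration.

```python
def adapt_parameter(candidates, measure):
    """Pick the candidate with the lowest measured cost on this resource.

    `measure` is a hypothetical callback; in a real run it would time a
    short slice of the application using the given parameter value.
    """
    best, best_cost = None, float("inf")
    for value in candidates:
        cost = measure(value)
        if cost < best_cost:
            best, best_cost = value, cost
    return best, best_cost


# Synthetic cost models for two resources with different architectures
# (assumed shapes, standing in for real timing measurements).
resource_costs = {
    "resource_a": lambda block: abs(block - 64) + 10,
    "resource_b": lambda block: abs(block - 256) + 10,
}

block_sizes = [32, 64, 128, 256]  # hypothetical application parameter
for name, measure in resource_costs.items():
    print(name, adapt_parameter(block_sizes, measure))
```

Each resource settles on a different value, which mirrors the abstract's point that a single global configuration is not always optimal. Re-running the selection periodically would be one way to follow behavior that changes during the run.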