In this paper, we describe experiences with our 127-node/161-processor Alpha cluster testbed, Ed. Ed is unique for two distinct reasons. First, we have replaced the standard BIOS on the cluster nodes with the Li...
With the advent of Grid computing, scheduling strategies for distributed heterogeneous systems have either become irrelevant or have to be extended significantly to support Grid dynamics. In this paper, we describe a metascheduling architecture for a Grid system that takes into account both application-level and system-level considerations. Results are presented to demonstrate the usefulness of the metascheduler.
This paper addresses the parallelization of loops with irregular assignment computations on cc-NUMA multiprocessors. This loop pattern is distinguished by the existence of loop-carried output data dependences that can...
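As an illustration of the loop pattern only (not of the paper's parallelization technique, which the truncated abstract does not describe), the sketch below shows an irregular assignment `a[f[i]] = rhs[i]`. When the index array `f` contains repeated entries, later iterations overwrite earlier ones, which is a loop-carried output dependence. The function names and the "last writer" reformulation are assumptions made for this example.

```python
def irregular_assign(a, f, rhs):
    """Sequential reference: iteration i writes rhs[i] into location f[i]."""
    out = list(a)
    for i in range(len(f)):
        out[f[i]] = rhs[i]  # repeated f[i] values create output dependences
    return out


def irregular_assign_last_writer(a, f, rhs):
    """Same result, computed by first finding the last iteration that writes each location."""
    last = {}
    for i, idx in enumerate(f):
        last[idx] = i  # a later iteration replaces an earlier one
    out = list(a)
    # Each location is now written exactly once, so these writes are independent
    # and could in principle be performed in any order.
    for idx, i in last.items():
        out[idx] = rhs[i]
    return out


a = [0, 0, 0]
f = [1, 1, 2, 1]
rhs = [10, 20, 30, 40]
result = irregular_assign(a, f, rhs)
```

Only the last write to each location survives, so location 1 ends up holding the value from iteration 3.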
Trace-driven simulation is a commonly used tool to evaluate memory-hierarchy designs. Unfortunately, trace collection is very expensive, and storage requirements for traces are very large. In this paper, we introduce HACS (Hardware Accelerated Cache Simulator), and describe the validation methods we used to demonstrate functionality. We also present some initial cache simulation results from SPECint 2000. We then propose future directions for research with HACS.
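To make the idea of trace-driven simulation concrete, here is a minimal software sketch of a direct-mapped cache driven by an address trace. It is not HACS, which the abstract describes as hardware accelerated; the function name, the cache geometry, and the sample trace are assumptions chosen for illustration.

```python
def simulate_direct_mapped(trace, num_lines=4, line_size=16):
    """Count hits and misses for a direct-mapped cache fed by an address trace."""
    tags = [None] * num_lines  # one stored tag per cache line; None means empty
    hits = misses = 0
    for addr in trace:
        block = addr // line_size   # memory block containing this address
        index = block % num_lines   # cache line the block maps to
        tag = block // num_lines    # distinguishes blocks sharing that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag       # fill or replace the line on a miss
    return hits, misses


# Hypothetical trace of byte addresses.
trace = [0, 4, 16, 64, 0, 4]
hits, misses = simulate_direct_mapped(trace)
```

Even this toy version shows why traces are costly: the simulator must consume every memory reference, so realistic workloads yield very long traces.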
Grid or mesh techniques are frequently used to approximate continuous entities that behave in a wave-like or fluid-like fashion. Partial differential equations (PDEs) are usually involved in describing such entities or processes. Distributed parallel computation was used in various computer cluster configurations to calculate PDE solutions for an electrostatic field. The study was intended to assess the efficacy of the selected architecture using mesh techniques. The match between the algorithm and the architecture in achieving maximum computational performance was also investigated. The developed architectures, algorithms, and findings are presented in the paper.
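A minimal serial sketch of a mesh-based solver is given below. It assumes the electrostatic potential obeys Laplace's equation on a 2D grid with fixed boundary values and uses Jacobi iteration. The abstract does not name the numerical scheme, so both the method and the function name are assumptions.

```python
def jacobi_laplace(grid, iterations):
    """Apply Jacobi sweeps to the interior of a 2D grid; boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    for _ in range(iterations):
        new = [row[:] for row in grid]  # copy so each sweep reads only old values
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                # Each interior point becomes the average of its four neighbours.
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
        grid = new
    return grid


# Top edge held at potential 1.0, all other boundaries at 0.0.
potential = jacobi_laplace([[1.0, 1.0, 1.0],
                            [0.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0]], iterations=1)
```

Because every update reads only the previous sweep, the grid can be split into blocks that are updated independently and exchange only their edge values between sweeps, which is what makes this kind of scheme a natural fit for clusters.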
Modeling for synthesis and modeling for simulation seem to be two competing goals in the context of C++-based modeling frameworks. One reason is that, while most hardware systems have some inherent parallelism, expressing it efficiently depends on whether the target usage is synthesis or simulation. For synthesis, designs are usually described with synthesis tools in mind and are therefore partitioned according to the targeted hardware units. For simulation, runtime efficiency is critical, but our previous work has shown that a synthesis-oriented description is not necessarily the most efficient, especially when using multiprocessor simulators. Multiprocessor simulation requires preemptive multithreading, but most current C++-based high-level system description languages use cooperative multithreading to exploit parallelism with reduced overhead. We have seen that, for synthesis-oriented models, in addition to adding preemptive threading we need to transform the threading structure to obtain good simulation performance. In this paper we present an algorithm for automatically applying such transformations to C++-based hardware models, ongoing work aimed at proving the equivalence between the original and transformed models, and a 62% to 76% simulation time improvement on a dual-processor simulator.
In this work we investigate the feasibility of using a cluster of PCs built with mass-market networks to meet the needs of the CFD community, in particular for unstructured implicit CFD solvers that require a very irregular pattern of communications. We report the initial findings from a series of experiments with some well-known benchmarks to determine CFD application sensitivity to machine communication parameters. This is done by running these benchmarks on a cluster in which the communication network has been modified to increase the bandwidth by adding multiple channels and to reduce the latency by using a lightweight protocol such as M-VIA.
Contemporary computing systems, especially large-scale systems such as Grids, promise ultra-fast ubiquitous utility computing, always available at the flip of a switch. A major unresolved issue is the organization and efficient usage of such infrastructure in a commercial context where several entities compete for shared resources. This has long been resolved for conventional utility resources such as gas and electricity through commoditization, a variety of market designs, customization, and decision support for the resulting portfolios of assets and commitments. The paper reviews the state of Grid commercialization and compares it to the commercialization of conventional resources. We draw specific lessons for commercialized Grids and detail them as architecture requirements at each level of the architecture stack. We provide an example to illustrate the benefits of commercialized resources in terms of the financial clarity they bring to decisions for different user groups, namely application users and IT managers.
The limited amount of instruction-level parallelism inherent in applications is a limiting factor for improving the performance of most conventional microprocessors. A promising solution to this problem is to exploit coarser granularities of parallelism. In this paper, we propose exploiting loop-level parallelism in a multithreaded fashion. We use the Shift architecture as a baseline architecture, with improved compiler support and register file. The compiler converts iterations of a loop into threads, to be executed by multiple processing elements. The hardware provides a selective register shifting mechanism to allow the execution of loops containing loop-carried data dependences, which are very difficult to execute on conventional architectures. We simulate and discuss the parameters of major importance for the implementation of this architectural approach. Our initial results show that, on two simple numerical benchmarks, an implementation of the Shift architecture can potentially achieve a considerable amount of iteration overlapping in comparison with a multiprocessor machine.
In systems consisting of multiple clusters of processors such as our Distributed ASCI Supercomputer (DAS), jobs may request co-allocation, i.e., the simultaneous allocation of processors in different clusters. We simulate such systems ignoring communication among the tasks of jobs, and determine the response times for different types and sizes of job requests, and for different numbers and sizes of clusters. In many cases we also compute or approximate the maximum utilization. We find that the numbers and sizes of the clusters and of the job components have a strong impact on performance, and that in many cases co-allocation is a viable choice.
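The co-allocation idea can be sketched as a simple feasibility check: given the number of idle processors in each cluster and the sizes of a job's components, decide whether every component can be placed in a different cluster at the same time. The placement rule (any component may go to any cluster, one component per cluster) and the function name are assumptions for this example. The abstract does not specify the scheduling policies that were simulated.

```python
def fits_co_allocation(idle, components):
    """Return True if each job component fits in a distinct cluster simultaneously.

    idle: idle processor count per cluster.
    components: processors requested by each job component.
    """
    if len(components) > len(idle):
        return False  # more components than clusters
    idle_sorted = sorted(idle, reverse=True)
    comp_sorted = sorted(components, reverse=True)
    # Pair the k-th largest component with the k-th largest idle pool;
    # a placement into distinct clusters exists exactly when every pair fits.
    return all(c <= free for c, free in zip(comp_sorted, idle_sorted))


can_start = fits_co_allocation(idle=[8, 4, 2], components=[4, 4])
```

With the same total number of idle processors, a request may or may not fit depending on how the idle processors are spread over the clusters, which is one way to see why the numbers and sizes of clusters and job components affect performance.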