We have achieved a sustained performance of 55 TFLOPS for molecular dynamics simulations of the amyloid fibril formation of peptides from the yeast Sup35 in an aqueous solution. For performing the calculations, we use...
I/O intensive applications have posed great challenges to computational scientists. A major problem of these applications is that users have to sacrifice performance requirements in order to satisfy storage capacity requirements in a conventional computing environment. Further performance improvement is impeded by the physical nature of these storage media even when state-of-the-art I/O optimizations are employed. In this paper, we present a distributed multi-storage resource architecture, which can satisfy both performance and capacity requirements by employing multiple storage resources. Compared to a traditional single storage resource architecture, our architecture provides a more flexible and reliable computing environment. This architecture can bring new opportunities for high performance computing as well as inherit state-of-the-art I/O optimization approaches that have already been developed. It provides application users with high-performance storage access even when they do not have a single large local storage archive at their disposal. We also develop an Application Programming Interface (API) that provides transparent management of and access to the various storage resources in our computing environment. Since I/O usually dominates the performance of I/O intensive applications, we establish an I/O performance prediction mechanism, which consists of a performance database and a prediction algorithm, to help users better evaluate and schedule their applications. A tool is also developed to help users automatically generate the performance data stored in the database. The experiments show that our multi-storage resource architecture is a promising platform for high performance distributed computing.
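The abstract does not say how the performance database or the prediction algorithm is organized, so the following is only a minimal sketch of one plausible reading: a table of measured transfer times per storage resource, queried with linear interpolation over request size. The resource names, the table layout, the timing values (made-up placeholders, not measurements from the paper), and the functions `predict_seconds` and `predict_run` are all assumptions introduced for illustration.

```python
from bisect import bisect_left

# Hypothetical performance database: resource -> sorted (request_bytes, seconds).
# The numbers are made-up placeholders, not measurements from the paper.
PERF_DB = {
    "local_disk": [(1_000_000, 0.05), (10_000_000, 0.40)],
    "remote_tape": [(1_000_000, 2.00), (10_000_000, 5.00)],
}


def predict_seconds(resource, nbytes, db=PERF_DB):
    """Estimate the time of one request by linear interpolation.

    Requests outside the measured range are extrapolated from the
    nearest two measurements. Each resource needs at least two points.
    """
    points = db[resource]
    sizes = [size for size, _ in points]
    k = bisect_left(sizes, nbytes)
    if k == 0:
        lo, hi = points[0], points[1]
    elif k == len(points):
        lo, hi = points[-2], points[-1]
    else:
        lo, hi = points[k - 1], points[k]
    (x0, y0), (x1, y1) = lo, hi
    return y0 + (y1 - y0) * (nbytes - x0) / (x1 - x0)


def predict_run(requests, db=PERF_DB):
    """Sum the predicted times of a list of (resource, nbytes) requests."""
    return sum(predict_seconds(resource, nbytes, db) for resource, nbytes in requests)
```

A scheduler could compare `predict_run` for the same request list mapped onto different resources and pick the cheaper placement; a real predictor would also need to separate reads from writes and account for contention.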
One of the challenges brought by large-scale scientific applications is how to avoid remote storage access by collectively using enough local storage resources to hold the huge amounts of data generated by the simulation while providing high performance I/O. DPFS, a Distributed Parallel File System, is designed and implemented to address this problem. DPFS collects locally distributed unused storage resources as a supplement to the internal storage of parallel computing systems to satisfy the storage capacity requirements of large-scale applications. In addition, like other parallel file systems, DPFS provides striping mechanisms that divide a file into small pieces and distribute them across multiple storage devices for parallel data access. The unique feature of DPFS is that it provides three file levels, with each file level corresponding to a file striping method. In addition to the traditional linear striping method, DPFS also provides a novel multidimensional striping method that can solve the performance problems of linear striping for many popular access patterns. Other issues such as load balancing and the user interface are also addressed in DPFS.
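The difference between linear and multidimensional striping can be illustrated with a small model. This is a generic sketch, not DPFS's actual layout: it assumes a row-major 2-D array, round-robin placement of stripe units, and square tiles as the multidimensional stripe unit. All function names and parameters are hypothetical.

```python
def linear_stripe(offset, stripe_size, num_devices):
    """Map a file offset to (device, offset on that device), round-robin."""
    stripe_index = offset // stripe_size
    device = stripe_index % num_devices
    local_offset = (stripe_index // num_devices) * stripe_size + offset % stripe_size
    return device, local_offset


def linear_units_for_column(col, n_rows, n_cols, stripe_size):
    """Stripe units touched when reading one column of a row-major array."""
    return {(row * n_cols + col) // stripe_size for row in range(n_rows)}


def tile_units_for_column(col, n_rows, n_cols, tile):
    """Stripe units touched for the same column when each unit is a tile x tile block."""
    tiles_per_row = n_cols // tile
    return {(row // tile) * tiles_per_row + col // tile for row in range(n_rows)}


# A 16 x 16 array with 16 elements per stripe unit in both layouts.
linear_count = len(linear_units_for_column(3, 16, 16, stripe_size=16))
tiled_count = len(tile_units_for_column(3, 16, 16, tile=4))
```

In this model a column read touches every one of the 16 linear stripe units but only 4 tiles, which is the kind of access pattern where linear striping produces many small requests. A row read shows the opposite trade-off (1 linear unit versus 4 tiles), so the tiled layout balances the two directions rather than favoring one.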
Patterns of faults that are catastrophic for regular architectures, particularly the systolic arrays, have been studied. For a given link configuration, there are many fault patterns which are catastrophic. Among thos...
The performance of applications on large shared-memory multiprocessors with coherent caches depends on the interaction between the granularity of data sharing, the size of the coherence unit, and the spatial locality exhibited by the applications, in addition to the amount of parallelism in the applications. Large coherence units are helpful in exploiting spatial locality, but worsen the effects of false sharing. We present a mathematical framework that allows a clean description of the relationship between spatial locality and false sharing. We first show how to identify a severe form of multiple-writer false sharing and then demonstrate the importance of the interaction between optimization techniques aimed at enhancing locality and techniques oriented toward reducing false sharing. Given the conflicting requirements, a compiler-based approach to this problem holds promise. We investigate the use of data transformations in addressing spatial locality and false sharing, and derive an approach that balances the impact of the two. Experimental results demonstrate that such a balanced approach outperforms approaches that consider only one of these two issues. On an eight-processor SGI Origin 2000 system, our approach brings an additional 9% improvement over a powerful locality optimization technique that uses both loop and data transformations. Also, our approach obtains an additional 19% improvement over an optimization technique that is oriented specifically toward reducing false sharing.
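Multiple-writer false sharing and the effect of a data transformation can be shown with a simple counting model. This is not the paper's mathematical framework, only a sketch under stated assumptions: processor `i % n_procs` writes element `i`, elements are 8 bytes, the coherence unit is 64 bytes, and the transformation regroups each processor's elements into a contiguous block. The names `shared_lines`, `identity_layout`, and `grouped_layout` are hypothetical.

```python
def shared_lines(n_elems, n_procs, elem_bytes, line_bytes, layout):
    """Count coherence units (cache lines) written by more than one processor.

    Element i is always written by processor i % n_procs; `layout` decides
    where element i is stored in memory.
    """
    writers = {}
    for i in range(n_elems):
        line = layout(i, n_elems, n_procs) * elem_bytes // line_bytes
        writers.setdefault(line, set()).add(i % n_procs)
    return sum(1 for procs in writers.values() if len(procs) > 1)


def identity_layout(i, n_elems, n_procs):
    """Original layout: element i is stored at position i."""
    return i


def grouped_layout(i, n_elems, n_procs):
    """Transformed layout: each processor's elements are stored contiguously."""
    return (i % n_procs) * (n_elems // n_procs) + i // n_procs


before = shared_lines(64, 8, 8, 64, identity_layout)
after = shared_lines(64, 8, 8, 64, grouped_layout)
```

With these parameters every one of the 8 lines has multiple writers in the original layout and none does after regrouping. The same regrouping can hurt spatial locality for code that walks the array in its original order, which is the tension the abstract describes.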
Computer science is a practical discipline. Evaluating students' computer practice with computer-aided tools is always a great challenge when the number of students is large. We always need to address problems such as su...
As Grid computing is becoming a reality, there is a need for managing and monitoring the available resources worldwide, as well as the need for conveying these resources to the everyday user. This paper describes a re...
In High Performance Fortran (HPF), array redistribution can be described explicitly using directives (REDISTRIBUTE or REALIGN) which specify where new distributions become active or implicitly by calling functions whi...
ISBN: 0769511406 (print)
Modern large-scale scientific applications commonly employ multiple compute and storage resources in a heterogeneous distributed environment. Working effectively and efficiently in such an environment is one of the major concerns in designing meta-data management systems. The authors present an integrated graphical user interface (GUI) that turns the entire environment into an easy-to-use control platform for managing complex programs and their large datasets. To hide the I/O latency when the user carries out interactive visualization, aggressive prefetching and caching techniques are employed in our GUI. The performance numbers show that the design of our Java GUI has achieved the goals of both high performance and ease of use.
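The abstract does not describe the prefetching or caching policy, so the sketch below only illustrates the general idea with assumed choices: an LRU cache of data blocks combined with sequential read-ahead. The class `PrefetchingCache`, its parameters, and the `fetch` callback are hypothetical, and the read-ahead runs synchronously here, whereas an interactive GUI would issue it in the background.

```python
from collections import OrderedDict


class PrefetchingCache:
    """LRU block cache that reads ahead after every request (illustrative only)."""

    def __init__(self, fetch, capacity=8, lookahead=2):
        self.fetch = fetch          # slow loader: block index -> data
        self.capacity = capacity    # maximum number of cached blocks
        self.lookahead = lookahead  # blocks to prefetch after each request
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _load(self, index):
        """Ensure a block is cached and mark it most recently used."""
        if index in self.blocks:
            self.blocks.move_to_end(index)
            return
        self.blocks[index] = self.fetch(index)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block

    def get(self, index):
        """Return a block, counting whether it was already cached."""
        if index in self.blocks:
            self.hits += 1
        else:
            self.misses += 1
        self._load(index)
        data = self.blocks[index]
        for ahead in range(1, self.lookahead + 1):
            self._load(index + ahead)
        return data
```

For a viewer that steps through consecutive frames, only the first request waits on storage in this model; every later frame is already cached by the time it is requested.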
Several techniques currently exist for estimating the power dissipation of combinational and sequential circuits using exhaustive simulation, Monte Carlo sampling, and probabilistic estimation. Exhaustive simulation and Monte Carlo sampling techniques can be highly reliable but often require long runtimes. This paper presents a comprehensive study of pattern-partitioning and circuit-partitioning parallelization schemes for those two methodologies in the context of distributed-memory multiprocessing systems. Issues in pipelined event-driven simulation and dynamic load balancing are addressed. Experimental results are presented for an IBM SP-2 system and a network of HP-9000 workstations. For instance, runtimes have been reduced from over 3 hours to under 20 minutes in one case.
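Pattern partitioning can be sketched on a toy example. The code below is an illustration under stated assumptions, not the paper's simulator: it uses a made-up three-gate combinational circuit, treats the number of gate toggles between two input vectors as a proxy for dynamic power, and splits the sampled vector pairs across workers in round-robin fashion. The partitions are processed in a plain loop here; on a distributed-memory system each one would run on its own node, with only the partial sums communicated.

```python
import random


def gate_values(bits):
    """Evaluate a toy circuit: g1 = a AND b, g2 = b XOR c, out = g1 OR g2."""
    a, b, c = bits
    g1 = a & b
    g2 = b ^ c
    return (g1, g2, g1 | g2)


def toggles(pair):
    """Count gates whose output changes between two input vectors."""
    before, after = gate_values(pair[0]), gate_values(pair[1])
    return sum(x != y for x, y in zip(before, after))


def partition(items, n_parts):
    """Round-robin split of the sampled patterns into n_parts chunks."""
    return [items[k::n_parts] for k in range(n_parts)]


def estimate(pairs, n_workers):
    """Average toggles per vector pair, computed from per-worker partial sums."""
    partial_sums = [sum(toggles(p) for p in chunk) for chunk in partition(pairs, n_workers)]
    return sum(partial_sums) / len(pairs)


def random_pairs(n, seed=0):
    """Draw n random pairs of 3-bit input vectors (Monte Carlo samples)."""
    rng = random.Random(seed)
    vector = lambda: tuple(rng.randint(0, 1) for _ in range(3))
    return [(vector(), vector()) for _ in range(n)]
```

Because each vector pair is simulated independently, the estimate does not depend on the number of workers. Circuit partitioning, by contrast, splits the gates rather than the patterns and requires communication of signal values between partitions.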