"MegaProto" is a proof-of-concept prototype for our project "mega-scale computing based on low-power technology and workload modeling". It implements our key idea that a million-scale parallel system should be built from densely mounted low-power commodity processors. It also serves as a platform on which to implement and evaluate our new technologies, such as power-conscious compilation, highly reliable high-performance networking, highly dependable cluster management, and multi-level scalable parallel programming. The building block of MegaProto is a 1U-high, 19-inch rack-mountable motherboard unit carrying 16 low-power commodity PC-architecture daughterboards, each about the size of a one-dollar note, connected by a high-bandwidth network based on gigabit Ethernet that provides 2 Gbps per processor. The peak performance of each unit is 14.4 GFlops in the first version and will improve to 38.4 GFlops in the second version through a processor/daughterboard upgrade. The intra- and inter-unit network bandwidths are 32 Gbps and 16 Gbps respectively. As for power consumption, the entire unit idles at less than 150 W and consumes at most 300-330 W under extreme computational stress. This is comparable to or better than conventional 1U servers with dual high-performance, power-hungry processors, while benchmarks show up to 279% better performance for some NPB programs. This demonstrates that higher performance can be achieved with low-power, densely populated architectures built from commodity components.
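The per-unit figures quoted above can be broken down with a little arithmetic. The following is a minimal sketch, not part of the original work: the constants are taken from the abstract, while the derived per-processor and per-watt values (and all function names) are illustrative assumptions. In particular, dividing peak GFlops by maximum power gives only a rough upper bound on efficiency.

```python
# Figures quoted in the abstract; everything derived from them is illustrative.
PROCESSORS_PER_UNIT = 16
PEAK_GFLOPS_V1 = 14.4           # first version, per unit
PEAK_GFLOPS_V2 = 38.4           # second version, per unit
LINK_GBPS_PER_PROCESSOR = 2     # network bandwidth per processor
MAX_POWER_W = 330               # upper end of the quoted 300-330 W range


def per_processor_gflops(unit_peak, processors=PROCESSORS_PER_UNIT):
    """Peak GFlops contributed by a single processor in the unit."""
    return unit_peak / processors


def aggregate_link_gbps(processors=PROCESSORS_PER_UNIT,
                        link=LINK_GBPS_PER_PROCESSOR):
    """Sum of the per-processor link bandwidths inside one unit."""
    return processors * link


def peak_gflops_per_watt(unit_peak, watts=MAX_POWER_W):
    """Rough upper bound: peak GFlops divided by maximum power draw."""
    return unit_peak / watts


if __name__ == "__main__":
    print(per_processor_gflops(PEAK_GFLOPS_V1))   # per-processor peak, version 1
    print(per_processor_gflops(PEAK_GFLOPS_V2))   # per-processor peak, version 2
    print(aggregate_link_gbps())                  # matches the quoted 32 Gbps
    print(peak_gflops_per_watt(PEAK_GFLOPS_V1))
```

The aggregate of sixteen 2 Gbps links is consistent with the quoted 32 Gbps intra-unit bandwidth, although the abstract does not state that the figure was obtained this way.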
To make shared memory programming possible on distributed architectures, we use an abstraction called distributed shared memory. The behavior of a distributed shared memory system is dictated by its memory consistency model. To provide a better understanding of the semantics of memory consistency models, many researchers have proposed formalisms to define them. Even with formal definitions, it is still difficult to say what kinds of execution histories can be produced under a particular memory model. In this paper, we propose a tool that shows which operation orderings could lead to user-defined execution histories under different memory models. We also present a prototype of our tool that analyses execution histories for four memory consistency models: sequential consistency, pipelined RAM (PRAM) consistency, release consistency and scope consistency.
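To illustrate what "analysing an execution history" means for the simplest of the four models, here is a minimal brute-force sketch of a sequential consistency check. It is not the tool described in the abstract: the history encoding, the function name and the assumption that every variable starts at 0 are all chosen here for illustration, and the search is exponential in the number of operations.

```python
def is_sequentially_consistent(history, initial=0):
    """Return True if some interleaving explains the history.

    history: one list per process, in program order; each operation is a
    tuple (kind, variable, value) with kind "w" (write) or "r" (read).
    An interleaving is valid if every read returns the latest value
    written to its variable (or `initial` if nothing was written yet).
    """
    n = len(history)

    def search(pos, memory):
        # pos[p] is the index of the next unexecuted operation of process p.
        if all(pos[p] == len(history[p]) for p in range(n)):
            return True
        for p in range(n):
            if pos[p] == len(history[p]):
                continue
            kind, var, value = history[p][pos[p]]
            nxt = pos[:p] + (pos[p] + 1,) + pos[p + 1:]
            if kind == "w":
                updated = dict(memory)
                updated[var] = value
                if search(nxt, updated):
                    return True
            elif memory.get(var, initial) == value:
                # A read can be scheduled only if it sees the current value.
                if search(nxt, memory):
                    return True
        return False

    return search(tuple(0 for _ in history), {})


# Each process writes its own flag, then reads the other's flag as 0.
# No single global order can explain both reads.
dekker = [
    [("w", "x", 1), ("r", "y", 0)],
    [("w", "y", 1), ("r", "x", 0)],
]

# Same writes, but the second process observes x == 1: explainable.
ordered = [
    [("w", "x", 1), ("r", "y", 0)],
    [("w", "y", 1), ("r", "x", 1)],
]

if __name__ == "__main__":
    print(is_sequentially_consistent(dekker))
    print(is_sequentially_consistent(ordered))
```

Weaker models such as PRAM, release or scope consistency would accept more histories; checking them needs per-process views or synchronization operations, which this sketch does not model.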
The distributed shared memory (DSM) model is designed to combine the ease of programming of the shared memory paradigm with the high performance obtained by expressing locality, as in the message-passing model. Experience, however, has shown that DSM programming languages such as UPC may be unable to deliver the expected level of performance. Initial investigations have shown that one of the major reasons is the overhead of translating from the UPC memory model to the virtual address space of the target architecture, which can be very costly. Experimental measurements have shown this overhead increasing execution time by up to three orders of magnitude. Previous work has also shown that some of this overhead can be avoided by hand-tuning, which, however, can significantly reduce UPC's ease of use. In addition, such tuning can only improve the performance of local shared accesses, not remote shared accesses. Therefore, a new technique that resembles translation lookaside buffers (TLBs) is proposed here. This technique, called the memory model translation buffer (MMTB), has been implemented in the GCC-UPC compiler using two alternative strategies, full-table (FT) and reduced-table (RT). We show that the MMTB strategies can lead to a performance boost of up to 700%, preserving ease of programming while achieving performance similar to that of hand-tuned UPC and MPI codes.
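To give a sense of the translation work involved, the sketch below maps a global index of a UPC blocked shared array to its owning thread, phase and local offset, and then resolves an address through a small table of per-thread base addresses. The block-cyclic layout follows the UPC language definition; the table class, its name and the base addresses are hypothetical stand-ins chosen for illustration and do not reproduce the MMTB, FT or RT designs.

```python
def upc_layout(index, block_size, threads):
    """Locate element `index` of a UPC array declared `shared [block_size]`.

    Returns (thread, phase, local_offset), where local_offset counts
    elements from the start of the owning thread's local portion.
    """
    block = index // block_size
    thread = block % threads          # blocks are dealt out round-robin
    phase = index % block_size        # position inside the block
    local_block = block // threads    # earlier blocks owned by this thread
    return thread, phase, local_block * block_size + phase


class BaseAddressTable:
    """Hypothetical lookup table: one cached base address per thread."""

    def __init__(self, base_addresses, element_size):
        self.base_addresses = list(base_addresses)
        self.element_size = element_size

    def translate(self, index, block_size):
        threads = len(self.base_addresses)
        thread, _phase, offset = upc_layout(index, block_size, threads)
        # One table lookup plus an add replaces a full runtime translation.
        return thread, self.base_addresses[thread] + offset * self.element_size


if __name__ == "__main__":
    # Made-up base addresses for three threads, 8-byte elements.
    table = BaseAddressTable([0x1000, 0x2000, 0x3000], element_size=8)
    for i in range(8):
        print(i, upc_layout(i, 2, 3), table.translate(i, 2))
```

The point of the sketch is only that the index arithmetic is repeated on every shared access, so caching the per-thread part of the translation removes work from the critical path.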
With the advent of grid technologies, much interest has arisen in applying these computational techniques to multiple fields. However, grid computing technologies have a steep learning curve that tends to discourage scientists from using grid facilities in their research, which prevents widespread adoption of grid computing. In this paper, we describe a Java-based middleware, built on top of the Java commodity grid, that offers an object-oriented, user-friendly view of the grid and hides much of the underlying complexity of using the grid computing services provided by the Globus Toolkit. The middleware focuses on remote execution of tasks, providing automatic file staging services, parallel execution on multiprocessor machines and fault-tolerant scheduling capabilities through a simple and intuitive application programming interface.
A new high-performance computation technique involving multiple processors on a single silicon die is quickly gaining popularity. This design approach provides very high performance, excellent power efficiency and a high level of programmability compared with other existing solutions. It also moves the design effort away from hardware design and toward software, which results in a faster time to market as well as a lower up-front design cost. This paper discusses the configurable multiprocessor design environment from Cmpware, Inc. This toolkit is used to design ASIC, FPGA and SoC multiprocessor solutions.
In this paper, we take the idea of application-level processing on disks one level further and focus on an architecture, called cluster of active disks (CAD), in which the storage system contains a network of parallel "active disks". Each individual active disk (which includes an embedded processor, disk(s), caches, memory, and interconnect) can perform some application-level processing. More importantly, the active disks can collectively perform parallel input/output (I/O) and processing, thereby reducing not just the communication latency but also I/O latency and computation time. The CAD architecture poses many challenges for next-generation software systems at all levels, including programming models, operating and runtime systems, application mapping, compilation, parallelization, and performance modeling and evaluation. In this paper, we focus exclusively on the code scheduling support required for clusters of active disks. More specifically, we address the problem of code scheduling with the goal of minimizing the power consumption of the disk system. Our experiments indicate that the proposed scheduling approach is very successful in reducing power and generates better results than the three alternative scheduling schemes tested.
This paper focuses on load balancing issues associated with hybrid programming models for the parallelization of fully permutable nested loops onto SMP clusters. Hybrid parallel programming models usually suffer from intrinsic load imbalance between threads, mainly because most existing message passing libraries provide limited multi-threading support, allowing only the master thread to perform internode message passing communication. To mitigate this effect, the authors propose a generic method for applying static load balancing to the coarse-grain hybrid model, so that the computational load is distributed appropriately among the working threads. The efficiency of the proposed scheme was evaluated experimentally against a micro-kernel benchmark, and the results demonstrate the potential of such load balancing schemes for extracting maximum performance from hybrid parallel programs.
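The idea of compensating for the master thread's communication duty can be sketched with a simple cost model. This is an illustrative assumption, not the method of the paper: it supposes that every work unit costs the same time on every thread and that the master's communication time per step is known in advance. All names below are hypothetical.

```python
def static_shares(total_work, threads, comm_time, time_per_unit=1.0):
    """Split `total_work` units so that all threads finish together.

    The master thread (index 0) also spends `comm_time` on internode
    communication, so it receives fewer work units than the others.
    Returns a list of (possibly fractional) work shares, master first.
    """
    if threads < 1:
        raise ValueError("need at least one thread")
    if threads == 1:
        return [float(total_work)]

    comm_units = comm_time / time_per_unit   # communication cost in work units
    # Solve: master + comm_units == worker, master + (threads-1)*worker == total
    worker = (total_work + comm_units) / threads
    master = worker - comm_units
    if master < 0:
        # Communication dominates: the master computes nothing at all.
        master = 0.0
        worker = total_work / (threads - 1)
    return [master] + [worker] * (threads - 1)


if __name__ == "__main__":
    shares = static_shares(total_work=100, threads=4, comm_time=20)
    print(shares)
    # Per-thread finishing time under the assumed cost model.
    print([shares[0] + 20] + shares[1:])
```

With a uniform split every thread would get 25 units and the master would finish last; shifting work away from the master equalizes the finishing times under this model.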
Recently, networked and cluster computation have become very popular. This paper introduces aCe C, a new C-based parallel language for architecture-adaptive programming. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by assuring them that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should also be an ideal tool for teaching parallel programming. In this paper, the authors focus on some fundamental features of aCe C.
Typically, only technical arguments such as performance, cost or scalability are discussed when programming models and languages for high performance computing facilities are under consideration. In this paper, we investigate the impact of human factors, such as personal preferences, perceptions and experience, on technical decisions. We surveyed a large HPC community, that of the Sharcnet project in Ontario, about general preferences and about detailed usage of language features and programming style. The main result of our study is that, as often claimed in the past but never proven, shared-memory programming models and architectures appear to be the ideal for the majority of users, even though the main architecture of the project is a distributed-memory cluster. However, experience appears to overcome the initial difficulties of using message passing quickly.