ISBN (print): 9783642243219
We present the design of the experimental single-chip cloud computer (SCC) by Intel Labs. The SCC is a research microprocessor containing the most Intel architecture cores ever integrated on a single silicon chip: 48 cores. We envision SCC as a concept vehicle for research in parallel computing, including system software, compilers, and applications. It incorporates technologies intended to scale multi-core processors to 100 cores and beyond, including an on-chip network, advanced power management technologies, new data-sharing options using software-managed memory coherency or hardware-accelerated message passing, and intelligent resource management. SCC is implemented in a 45-nm process integrating 1.3 billion transistors. It is based on a tiled architecture in which each tile contains two Pentium-class cores, private L1 and L2 caches, and one mesh router. All 24 tiles have access to four DDR3 memory channels, which can provide up to 64 GB of main memory to the system. The on-die communication is organized as a regular 6×4 mesh of tiles using 16-byte-wide data links. The SCC contains one frequency domain for each tile and eight voltage domains: two for on- and off-chip I/O and six for the cores. Each tile contains sensors to monitor the thermal state. SCC has a NUMA architecture including local caches and on-die distributed memory for low-latency, hardware-assisted message passing or scratchpad use, as well as abundant external DRAM bandwidth and capacity. Thus, the processor can be used as a proxy for future manycore platforms by running several independent applications and operating systems concurrently on dedicated resources while applying fine-grained voltage and frequency scaling for best energy efficiency. In this talk we review the chip's architecture and highlight different system configurations that enable the exploration of compute-, memory-, or communication-limited workloads. We show the emulation-based design flow that enabled us to build the SCC w
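The tile arithmetic in this abstract (24 tiles in a 6×4 mesh, two cores per tile, 48 cores in total) can be made concrete with a small Python sketch. Only the mesh dimensions and the cores-per-tile figure come from the abstract. The row-major tile numbering, the mapping of cores `2t` and `2t+1` to tile `t`, and the use of dimension-ordered (X-then-Y) routing to count hops are illustrative assumptions, not details stated in the source.

```python
# Sketch of the tile layout described in the abstract: a 6x4 mesh with two
# cores per tile. ASSUMPTIONS (not from the source): tiles are numbered in
# row-major order, cores 2t and 2t+1 live on tile t, and messages follow
# dimension-ordered (X-then-Y) routing, so hops equal the Manhattan distance.

MESH_COLS = 6
MESH_ROWS = 4
CORES_PER_TILE = 2
NUM_TILES = MESH_COLS * MESH_ROWS          # 24 tiles
NUM_CORES = NUM_TILES * CORES_PER_TILE     # 48 cores


def tile_of_core(core_id):
    """Return the (hypothetical) tile index that hosts the given core."""
    if not 0 <= core_id < NUM_CORES:
        raise ValueError(f"core_id must be in [0, {NUM_CORES})")
    return core_id // CORES_PER_TILE


def tile_coords(tile_id):
    """Return (x, y) mesh coordinates of a tile under row-major numbering."""
    return tile_id % MESH_COLS, tile_id // MESH_COLS


def hop_count(src_core, dst_core):
    """Router-to-router hops between two cores under X-then-Y routing."""
    sx, sy = tile_coords(tile_of_core(src_core))
    dx, dy = tile_coords(tile_of_core(dst_core))
    return abs(sx - dx) + abs(sy - dy)
```

Under these assumptions, two cores on the same tile are zero mesh hops apart, and the longest path in the mesh runs corner to corner.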
Esterel is a synchronous language suited for describing reactive embedded systems. It combines fine-grained parallelism with precise timing control over the execution of threads. For this reason, software implementations have typically compiled Esterel programs into sequential code, because tight synchronization between a large number of threads cannot be managed efficiently by an operating system (OS). This has enabled concurrent Esterel programs to be executed directly on single-core processors. Recently, however, multi-core processors have been increasingly used to achieve better performance in embedded applications, and the conventional approach of generating sequential code from Esterel programs cannot take advantage of them. We overcome this limitation by compiling Esterel into a limited number of thread partitions (up to the number of available cores), which avoids the large overhead of implementing each Esterel thread separately within a conventional multithreading scheme. These partitions are then distributed onto separate cores using a static load balancing heuristic. The Esterel threads within a partition may then be dynamically scheduled with or without an OS. To evaluate the viability of this approach, we present experimental results comparing the execution of a set of benchmarks using one to four cores on the Intel Core 2 Quad with Linux, and one to two cores on the Xilinx MicroBlaze without any OS. Extensive benchmarking over large Esterel programs shows that the throughput gained from parallel execution of Esterel is benchmark dependent.
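The abstract does not say which static load balancing heuristic is used to distribute thread partitions onto cores. As a rough illustration of what such a step can look like, the Python sketch below applies the well-known longest-processing-time-first (LPT) greedy rule: threads are sorted by estimated cost and each is assigned to the currently least-loaded core. The function name `partition_threads`, the per-thread cost estimates, and the choice of LPT itself are assumptions made for illustration, not the authors' method.

```python
import heapq


def partition_threads(costs, num_cores):
    """Greedy LPT partitioning (an assumed stand-in for the paper's heuristic).

    costs: dict mapping a thread name to an estimated execution cost.
    Returns a list with one (total_load, [thread names]) pair per core.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    # Min-heap keyed on current load; the core index breaks ties so the
    # member lists are never compared.
    heap = [(0, core, []) for core in range(num_cores)]
    heapq.heapify(heap)
    # Place the most expensive threads first, each on the least-loaded core.
    for name, cost in sorted(costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, core, members = heapq.heappop(heap)
        members.append(name)
        heapq.heappush(heap, (load + cost, core, members))
    # Report partitions in core order.
    return [(load, members) for load, _, members in sorted(heap, key=lambda e: e[1])]
```

For example, `partition_threads({"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}, 2)` yields two partitions with equal total load. Because the assignment is computed once before execution, it fits the static scheme described in the abstract, and any dynamic scheduling then happens only within each partition.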
ISBN (print): 9780769541105
The domain decomposition method is a popular algorithm for the parallel finite element method (FEM). The formulation for solving sparse linear systems of equations is presented. A large-scale mechanical calculation of a dam was carried out with the parallel FEM program on the Dawning 5000A parallel computer at the Henan Technical University Supercomputer Center. The TAU performance analysis software is used to analyze and understand the execution behavior of the parallel algorithm, including communication patterns, processor load balance, computation-to-communication ratios, timing characteristics, and processor idle time. All of this is done through displays of post-mortem trace files, so performance bottlenecks can easily be identified at the appropriate level of detail. Statistics show that the formulation is efficient in parallel computing environments, and that it is significantly faster and consumes less memory.
ISBN (print): 9780769542515
This paper describes a novel approach to the parallel simulation of complex multi-agent systems, based on actors and the Java middleware Terracotta. The approach aims to exploit the computing power of modern multi-core machines. Terracotta was chosen because it allows JVMs to be clustered transparently. The paper discusses design and implementation aspects of the approach and demonstrates the achievable execution performance through the parallel simulation of a scalable multi-agent system based on the predator/prey model.
ISBN (print): 9780769541105
Two kinds of parallel genetic algorithm (PGA) are implemented in this paper based on the MATLAB® Parallel Computing Toolbox™ and Distributed Computing Server™ software. Parallel for-loops, SPMD (Single Program Multiple Data) blocks, and co-distributed arrays, the three basic parallel programming modes in MATLAB, are employed to implement the global and coarse-grained PGAs. To validate and compare our implementations, both PGAs are applied to the problem of range image registration. A set of experiments shows that MATLAB is a convenient and effective means of parallelizing existing algorithms, and that a higher speed-up and improved performance can be obtained.
Rejuvenation is a technique expected to mitigate failures in HPC systems by replacing, repairing, or resetting system components. Because of the small overhead required by software rejuvenation, we primarily focus on ...
ISBN (print): 9780769541105
Nowadays, more and more computer malware and viruses have evolved into a new form that depends on the Internet, called the downloader. In this article, we demonstrate the downloader's destructive power and describe several available methods for bypassing the heuristic scanning of Kaspersky's and Eset's newest antivirus software, whose heuristic scanning technology is the most advanced on Windows platforms. Even though heuristic scanning technology is the key component of protection software, more and more new methods are being devised to bypass it. We then offer our conjecture on how to detect and intercept downloader-like programs. Note that we intend no harm to Kaspersky's or Eset's products; the aim is only to learn.
ISBN (print): 9780769541105
Along with the rapid development of parallel computing technology and the popularity of Beowulf cluster systems, the scalability of parallel algorithm-machine combinations, which measures the capacity of a parallel algorithm to effectively utilize an increasing number of processors, is becoming more and more important. This paper reviews the ratio of parallel overhead to computation as a scalability metric and points out its merits and deficiencies. The metric is then improved for distributed parallel computing environments based on Beowulf clusters, yielding a new scalability function that reflects the scalability of distributed parallel systems more directly and precisely as the machine size and the problem scale grow. Finally, the new metric is used to analyze and prove the scalability of parallel algorithms and Beowulf clusters.
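The abstract names the ratio of parallel overhead to computation but does not define it, and the improved metric it proposes is not given here. As a hedged illustration only, the Python sketch below uses the common textbook definition of total parallel overhead, `p * T_p - T_1`, where `T_1` is the serial run time, `T_p` the parallel run time, and `p` the processor count. The function names and this particular definition are assumptions, not the paper's formulation.

```python
def overhead_ratio(t_serial, t_parallel, processors):
    """Ratio of total parallel overhead to useful computation.

    ASSUMPTION: overhead is defined as p * T_p - T_1 (the textbook form);
    the source abstract does not state its exact definition.
    """
    if t_serial <= 0 or t_parallel <= 0 or processors < 1:
        raise ValueError("times must be positive and processors >= 1")
    overhead = processors * t_parallel - t_serial
    return overhead / t_serial


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency T_1 / (p * T_p), equal to 1 / (1 + overhead ratio)."""
    if t_serial <= 0 or t_parallel <= 0 or processors < 1:
        raise ValueError("times must be positive and processors >= 1")
    return t_serial / (processors * t_parallel)
```

With a serial time of 100 s and a parallel time of 30 s on 4 processors, the overhead is 20 processor-seconds, so the ratio is 0.2 and the efficiency is 1 / 1.2. Under this definition, a combination scales well when the ratio stays bounded as the processor count and the problem size grow together.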
To address the deficiencies of prior methods for modeling systems of boilers operating in parallel through a common manifold, this article uses simplified approximate distributed modeling based on the compartmentalization and combination of the manifold. First, a principle is established for dividing the manifold into basic pipe sections between each pair of source/sink points. Second, a simplified approximate distributed decoupled transfer function matrix model without steady-state errors is built for each pipe section. Finally, a smooth joint arithmetic is adopted to combine all the adjoining subsections of the pipe section, with boilers/turbines at its ends, into a whole-system model. This method can decrease the size of the equipment and, therefore, makes it easier to model larger systems. The model exhibits more distributed characteristics and is well suited to control system simulation. Experiments with favorable results demonstrate the validity of the method.
ISBN (print): 9780769541105
Software transactional memory (STM) is one of the promising models for parallel programming on multi-core processor systems and has been studied by many researchers. Memory management is an important aspect of STM system design that directly affects the performance of the whole STM system. This paper presents the design and implementation of an effective memory manager. It uses a private heap per thread to manage the memory space of that thread's transactions and a global heap to manage the whole memory space in the STM system. The algorithms ensure that memory access is non-blocking. Tests show that the performance of the memory manager is satisfactory.