Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent learning model inspired by the structural and functional properties of the mammalian neocortex. Our cortical network, like the brain, exhibits massive processing parallelism, making today's GPGPUs a highly attractive and readily available hardware accelerator for such a model. Furthermore, we consider two inefficiencies inherent in our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose optimizations such as a software work-queue structure and pipelining of the hierarchical layers of the cortical network to mitigate these problems. Our analysis provides important insight into GPU architectural details, including the number of cores, the memory system, and the global thread scheduler. Additionally, we create a runtime profiling tool for our parallel learning algorithm that proportionally distributes the cortical network across the host CPU and any GPUs, homogeneous or heterogeneous, available to the system. Using the profiling tool with these optimizations on Nvidia's CUDA framework, we achieve up to a 60× speedup over a single-threaded CPU implementation of the model.
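The profiling-driven proportional distribution described here can be illustrated with a small sketch. The Python below is a minimal, hypothetical reconstruction (the paper's tool is CUDA-based); the device names and profiled throughput numbers are invented for illustration.

```python
# Sketch: partition a cortical network across heterogeneous devices in
# proportion to profiled throughput, in the spirit of the runtime profiling
# tool described above. All names and numbers here are hypothetical.

def partition_columns(num_columns: int, throughputs: dict) -> dict:
    """Split `num_columns` cortical columns across devices in proportion
    to each device's measured throughput (columns processed per second)."""
    total = sum(throughputs.values())
    shares = {dev: int(num_columns * t / total) for dev, t in throughputs.items()}
    # Hand any remainder from integer truncation to the fastest device.
    remainder = num_columns - sum(shares.values())
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += remainder
    return shares

# Hypothetical profiling results: columns/second measured in a warm-up pass.
profiled = {"cpu": 120.0, "gpu0": 2400.0, "gpu1": 1800.0}
print(partition_columns(4096, profiled))
# -> {'cpu': 113, 'gpu0': 2277, 'gpu1': 1706}
```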
This study considers a heterogeneous computing system and corresponding workload being investigated by the Extreme Scale Systems Center (ESSC) at Oak Ridge National Laboratory (ORNL). The ESSC is part of a collaborative effort between the Department of Energy (DOE) and the Department of Defense (DoD) to deliver research, tools, software, and technologies that can be integrated, deployed, and used in both DOE and DoD environments. The heterogeneous system and workload described here are representative of a prototypical computing environment being studied as part of this collaboration. Each task can exhibit a time-varying importance, or utility, to the overall enterprise. In this system, an arriving task has an associated priority and precedence: the priority describes the importance of a task, and the precedence describes how soon the task must be executed. These two metrics are combined to create a utility function curve that indicates how valuable it is for the system to complete the task at any given moment. This research focuses on using time-utility functions to generate a metric for comparing the performance of different resource schedulers in a heterogeneous computing system. The contributions of this paper are: (a) a mathematical model of a heterogeneous computing system where tasks arrive dynamically and must be assigned based on their priority, precedence, utility characteristic class, and task execution type; (b) the use of priority and precedence to generate time-utility functions that describe the value a task has at any given time; (c) the derivation of a metric, based on the total utility gained from completing tasks, to measure the performance of the computing environment; and (d) a comparison of the performance of resource allocation heuristics in this environment.
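As a rough illustration of contributions (b) and (c), the sketch below combines a priority and a precedence into one plausible time-utility function and sums accrued utilities into the comparison metric. The exponential decay shape and all parameter names are assumptions made for illustration; the abstract does not specify the exact functional form.

```python
# Sketch: a hypothetical time-utility function (TUF) built from a task's
# priority (importance) and precedence (urgency), plus the total-utility
# metric used to compare schedulers. The decay shape is an assumption.

import math

def make_tuf(priority: float, precedence: float):
    """Return u(t): full utility `priority` is earned at t = 0 and decays
    exponentially at a rate set by `precedence` (higher = more urgent)."""
    def utility(completion_time: float) -> float:
        return priority * math.exp(-precedence * completion_time)
    return utility

def total_utility(completed_tasks) -> float:
    """Performance metric: total utility accrued over all completed tasks,
    where each entry is a (tuf, completion_time) pair."""
    return sum(tuf(t) for tuf, t in completed_tasks)

# Two hypothetical tasks: equal priority, different urgency. Finishing both
# at t = 2.0 earns far less utility from the urgent one.
u_urgent = make_tuf(priority=10.0, precedence=0.5)
u_lax = make_tuf(priority=10.0, precedence=0.05)
print(total_utility([(u_urgent, 2.0), (u_lax, 2.0)]))  # ~3.68 + ~9.05
```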
ISBN: (print) 9780769541105
The domain decomposition method is a popular algorithm that is widely adopted in the parallel finite element method (FEM). A formulation for solving the resulting sparse linear systems of equations is presented. The TAU performance analysis software is used to analyze and understand the execution behavior of the parallel algorithm, including communication patterns, processor load balance, computation-versus-communication ratios, timing characteristics, and processor idle time, all through displays of post-mortem trace files; performance bottlenecks can easily be identified at the appropriate level of detail. A large-scale mechanical calculation of a dam using the parallel FEM program was carried out on the Dawning 5000A parallel computer at the Henan Technical University Supercomputer Center. Statistics show that the formulation is efficient in parallel computing environments, is significantly faster, and consumes less memory.
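To make the domain-decomposition idea concrete, here is a minimal alternating (multiplicative) Schwarz iteration on a 1D Poisson model problem. This is an illustrative toy under stated assumptions, not the paper's formulation: the dam calculation runs as parallel FEM on the Dawning 5000A, whereas this sketch is serial NumPy, and the grid size and overlap width are arbitrary. The additive variant, which computes all subdomain corrections from the same residual, is what parallelizes across processors.

```python
# Sketch: overlapping Schwarz iteration for -u'' = f on (0, 1), u(0) = u(1) = 0.
# Each step solves the problem restricted to one subdomain and patches the
# correction back in, which is the essence of domain decomposition.

import numpy as np

n = 100                                # interior grid points
h = 1.0 / (n + 1)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
b = np.ones(n)                         # right-hand side f(x) = 1

overlap = 5                            # grid points shared by the subdomains
domains = [np.arange(0, n // 2 + overlap), np.arange(n // 2 - overlap, n)]

x = np.zeros(n)
for sweep in range(500):
    for idx in domains:
        r = b - A @ x                  # multiplicative Schwarz: refresh the
        A_local = A[np.ix_(idx, idx)]  # residual before each local solve
        x[idx] += np.linalg.solve(A_local, r[idx])
    res = np.linalg.norm(b - A @ x)
    if res < 1e-8 * np.linalg.norm(b):
        break

print(f"converged in {sweep + 1} sweeps, residual {res:.2e}")
```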
Rejuvenation is a technique expected to mitigate failures in HPC systems by replacing, repairing, or resetting system components. Because of the small overhead required by software rejuvenation, we primarily focus on ...
Performance prediction is valuable in different areas of computing. The popularity of lease-based access to high performance computing resources particularly benefits from accurate performance prediction. Most contemp...
We present a novel architecture of a communication engine for non-coherent distributed shared memory systems. The shared memory is composed by a set of nodes exporting their memory. Remote memory access is possible by...
In this paper, we propose a software testing environment, called D-Cloud, using cloud computing technology and virtual machines with fault injection facility. Nevertheless, the importance of high dependability in a so...
This paper presents a new network management system with ontology-supported multi-agent techniques. The technique derives from an ontology combined with the related free software Ethereal and Cacti, which stored the o...
Solving dense linear systems of equations is a fundamental problem in scientific computing. Numerical simulations involving complex systems represented in terms of unknown variables and relations between them often le...
ISBN: (print) 9781424464432
Ensuring the correctness of complex implementations of software transactional memory (STM) is a daunting task. Attempts have been made to formally verify STMs, but these are limited in the scale of systems they can handle [1], [2], [3] and generally verify only a model of the system, not the actual system. In this paper we present an alternate attack on checking the correctness of an STM implementation by verifying the execution runs of an STM using a checker that runs in parallel with the transactional memory system. With future many-core systems predicted to have hundreds and even thousands of cores [4], it is reasonable to utilize some of these cores to ensure the correctness of the rest of the system. This will be needed anyway given the increasing likelihood of dynamic errors due to particle hits (soft errors) and the increasing fragility of nanoscale devices; such errors can only be detected at runtime. An important correctness criterion that is the subject of verification is the serializability of transactions. While checking transaction serializability is NP-complete, practically useful subclasses such as interchange-serializability (DSR) are efficiently computable [5]. Checking DSR reduces to checking for cycles in a transaction ordering graph, which captures the access order of objects shared between transaction instances. Doing this concurrently with the main transaction execution requires minimizing the overhead of capturing object accesses and managing the size of the graph, which can grow as large as the total number of dynamic transactions and object accesses. We discuss techniques for minimizing the overhead of access logging, including time-stamping, and present techniques for on-the-fly graph compaction that drastically reduce the size of the graph that needs to be maintained, to no larger than the number of threads. We have implemented concurrent serializability checking in the Rochester Software Transactional Memory (RSTM) system [6]. We pres
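The cycle-checking idea reduces to a small amount of code. The sketch below builds a transaction ordering graph from time-stamped object accesses and tests it for cycles; the log format, field names, and the read/write conflict rule are illustrative assumptions, and, unlike the paper's on-the-fly compaction (which bounds the graph by the thread count), this sketch keeps the full graph for clarity.

```python
# Sketch: serializability checking via cycle detection in a transaction
# ordering graph. Log entries are hypothetical; RSTM's instrumentation differs.

from collections import defaultdict

def ordering_graph(log):
    """Each entry is (timestamp, txn_id, object_id, kind), kind in {"r", "w"}.
    Two accesses to the same object conflict if at least one is a write; the
    earlier access orders its transaction before the later one."""
    edges = defaultdict(set)
    by_object = defaultdict(list)
    for ts, txn, obj, kind in sorted(log):       # replay in timestamp order
        for prev_txn, prev_kind in by_object[obj]:
            if prev_txn != txn and "w" in (kind, prev_kind):
                edges[prev_txn].add(txn)         # prev must serialize first
        by_object[obj].append((txn, kind))
    return edges

def has_cycle(edges):
    """Depth-first search with the standard white/gray/black coloring;
    a gray-to-gray edge means a cycle, i.e. a serializability violation."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(u):
        color[u] = GRAY
        for v in edges[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and visit(u) for u in list(edges))

# T1 and T2 each read the other's write: cycle T1 <-> T2, not serializable.
log = [(1, "T1", "x", "w"), (2, "T2", "y", "w"),
       (3, "T1", "y", "r"), (4, "T2", "x", "r")]
print(has_cycle(ordering_graph(log)))  # True
```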