The authors present an overview of the functionality and the architecture of KIWIS, a federated knowledge-base environment based on deduction in an object-oriented framework. A KIWIS environment consists of a number of participating KIWIS systems, each of which may be connected to one or more external databases. The architecture of each component system is composed of a number of layers that incrementally add power to this system. The kernel of a system consists of layers 1 through 4 and creates the 'personal knowledge machine' environment. The other layers (5 and 6) act as a 'window on the world,' enriching the local system knowledge with external knowledge.
The Third Workshop on Using Emerging Parallel Architectures (WEPA), held in conjunction with ICCS 2011, provides a forum for exploring the capabilities of emerging parallel architectures such as GPUs, FPGAs, the Cell B.E., and multi-cores to accelerate computational science applications.
ISBN (print): 9781450328098
Web 2.0 applications written in JavaScript are increasingly popular because they are easy to use, easy to update and maintain, and portable across a wide variety of computing platforms. Web applications receive frequent input from a rich array of sensors, network sources, and user input modalities. To handle the resulting asynchrony, web applications are developed using an event-driven programming model. These event-driven web applications have dramatically different characteristics, which provide an opportunity to create a customized processor core that improves the responsiveness of web applications. In this paper, we take one step toward creating a core customized to event-driven applications. We observe that the instruction cache miss rates of web applications are substantially higher than those of conventional server and desktop workloads due to large working sets caused by distant reuse. To mitigate this bottleneck, we propose an instruction prefetcher (EFetch) that is tuned to exploit the characteristics of web applications. We find that an event signature, which captures the current event and the function calling context, is a good predictor of the control flow inside a function of an event-driven program. It allows us to accurately predict a function's callees and their function bodies, and to prefetch them in a timely manner. For a set of real-world web applications, we show that the proposed prefetcher outperforms the commonly implemented next-2-line prefetcher by 17%. It also consumes 5.2 times less area than a recently proposed prefetcher while outperforming it.
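The event-signature idea in this abstract can be illustrated with a small simulation. This is a hypothetical sketch, not the paper's hardware design: a signature formed from the current event and the function calling context indexes a table of previously observed callees, which would be prefetched the next time the same signature recurs. All class and method names are illustrative.

```python
# Hedged sketch of an event-signature callee predictor. The signature
# (current event + calling context) keys a table of callees seen under
# that signature; on a repeat of the signature, those callees are the
# prefetch candidates.

class EventSignaturePrefetcher:
    def __init__(self):
        self.table = {}          # signature -> set of callee names
        self.call_stack = []     # current function calling context
        self.current_event = None
        self.prefetched = set()  # candidates issued for this event

    def _signature(self):
        return hash((self.current_event, tuple(self.call_stack)))

    def on_event(self, event):
        # a new event resets the context and issues prefetch candidates
        self.current_event = event
        self.call_stack = []
        self.prefetched = set(self.table.get(self._signature(), ()))

    def on_call(self, fn):
        sig = self._signature()  # context *before* entering fn
        self.table.setdefault(sig, set()).add(fn)
        self.call_stack.append(fn)
        # True means fn's body would already have been prefetched
        return fn in self.prefetched

    def on_return(self):
        self.call_stack.pop()
```

On the second occurrence of the same event, the callee recorded during the first occurrence is predicted, mirroring the distant-reuse pattern the abstract describes.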
ISBN (print): 9781450341219
Dataflow computing has proved promising in high-performance computing. However, traditional dataflow architectures are general-purpose and not efficient enough for typical scientific applications due to low utilization of function units. In this paper, we propose an optimization of dataflow architectures for scientific applications. The optimization introduces a request-for-operands mechanism and a topology-based instruction mapping (TBIM) algorithm to improve the efficiency of dataflow architectures. Experimental results show that the request-for-operands optimization achieves a 4.6% average performance improvement over traditional dataflow architectures, and that the TBIM algorithm achieves 2.28x and 1.98x average performance improvements over the SPDI and SPS algorithms, respectively.
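The abstract does not spell out TBIM, but the general shape of topology-aware instruction mapping can be sketched. The following is an illustrative greedy mapper, not the paper's algorithm: each dataflow instruction is placed on the free processing element of a 2-D mesh that minimizes the total Manhattan distance to its producers, so operands travel fewer hops. All names and the mesh model are assumptions.

```python
# Hedged sketch: greedy topology-aware placement of dataflow
# instructions onto a mesh of processing elements (PEs).
from itertools import product

def map_instructions(deps, mesh_w, mesh_h):
    """deps: {instr: [producer instrs]}, given in topological order."""
    free = set(product(range(mesh_w), range(mesh_h)))
    placement = {}
    for instr, producers in deps.items():
        srcs = [placement[p] for p in producers if p in placement]

        def cost(pe):
            # total Manhattan hop distance from producers to candidate PE
            return sum(abs(pe[0] - s[0]) + abs(pe[1] - s[1]) for s in srcs)

        best = min(free, key=cost)
        placement[instr] = best
        free.remove(best)
    return placement
```

On a 2x2 mesh, a consumer ends up adjacent to its sole producer, which is the locality property a topology-based mapper is after.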
Image processing operations are defined which permit image processing algorithms to be written as algebraic expressions whose variables are whole images. Highly parallel computing architectures for evaluating these ex...
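The notion of algorithms written as algebraic expressions over whole images can be illustrated with operator overloading. This is a minimal sketch under stated assumptions: the `Image` class and its flat pixel list are illustrative, not from the paper.

```python
# Hedged sketch of a whole-image algebra: arithmetic operators act on
# entire images, so an algorithm reads as one algebraic expression.

class Image:
    def __init__(self, pixels):
        self.pixels = list(pixels)

    def _zip(self, other, op):
        # pixel-wise combination of two images of equal size
        return Image(op(a, b) for a, b in zip(self.pixels, other.pixels))

    def __add__(self, other):
        return self._zip(other, lambda a, b: a + b)

    def __sub__(self, other):
        return self._zip(other, lambda a, b: a - b)

    def __mul__(self, k):
        # scalar scaling of a whole image
        return Image(p * k for p in self.pixels)
```

With these operators, blending two images is the single expression `a * 0.5 + b * 0.5`, with no explicit pixel loop visible to the author of the algorithm.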
ISBN (print): 9781450359863
With an increasing number of cores and memory controllers in multiprocessor platforms, co-location of parallel applications is gaining in importance. Key to achieving good performance is allocating the proper number of threads to co-located applications. This paper presents NuPoCo, a framework for automatically managing the parallelism of co-located parallel applications on NUMA multi-socket multi-core systems. NuPoCo maximizes the utilization of CPU cores and memory controllers by dynamically adjusting the number of threads for co-located parallel applications. Evaluated with various scenarios of co-located OpenMP applications on a 64-core AMD and a 72-core Intel machine, NuPoCo reduces total turnaround time by 10-20% compared to the default Linux scheduler and to an existing parallelism management policy that focuses on CPU utilization only.
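The core-partitioning decision such a framework makes can be sketched in miniature. This is an illustrative allocator, not NuPoCo's actual policy: cores are split among co-located applications in proportion to each one's measured scalability, so poorly scaling applications are not over-provisioned. The efficiency metric and all names are assumptions.

```python
# Hedged sketch: proportional core partitioning among co-located
# parallel applications, weighted by per-app scalability.

def partition_cores(total_cores, efficiency):
    """efficiency: {app: relative speedup weight per extra core}."""
    weight_sum = sum(efficiency.values())
    alloc = {app: max(1, int(total_cores * w / weight_sum))
             for app, w in efficiency.items()}
    # hand cores lost to rounding to the most efficient apps first
    leftover = total_cores - sum(alloc.values())
    for app in sorted(efficiency, key=efficiency.get, reverse=True):
        if leftover <= 0:
            break
        alloc[app] += 1
        leftover -= 1
    return alloc
```

A real manager would re-run this decision periodically from measured utilization of cores and memory controllers, rather than from fixed weights.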
ISBN (print): 9781450311823
State-of-the-art chip multiprocessor (CMP) proposals emphasize optimization to deliver computing power across many types of applications. This approach misses potentially significant performance improvements that leverage application-specific characteristics such as data access behavior. In this paper, we demonstrate that using fairly simple and inexpensive static analysis, data can be classified into private and shared. In addition, we develop a novel compiler-based approach to speculatively detect a third classification: practically private. We demonstrate that practically private data is ubiquitous in parallel applications and that leveraging this classification provides opportunities to improve performance. While the proposed data classification scheme can be applied to many micro-architectural constructs, including the TLB, coherence directory, and interconnect, we demonstrate its potential through an efficient cache coherence design. Specifically, we show that the compiler-assisted mechanism reduces coherence traffic by an average of 46% and achieves up to 13%, 9%, and 5% performance improvements over shared, private, and state-of-the-art NUCA-based caching, respectively, depending on the scenario.
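The three-way classification can be illustrated with a small dynamic analogue. This sketch substitutes a hypothetical access log for the paper's compiler static analysis: data touched by one thread is private, data touched by several threads but almost always by the same one is practically private, and the rest is shared. The 90% threshold is an illustrative assumption, not the paper's.

```python
# Hedged sketch of private / practically-private / shared data
# classification from a per-variable thread-access log.
from collections import Counter

def classify(accesses, threshold=0.9):
    """accesses: {var: [thread ids that accessed it]}."""
    result = {}
    for var, threads in accesses.items():
        counts = Counter(threads)
        dominant_share = counts.most_common(1)[0][1] / len(threads)
        if len(counts) == 1:
            result[var] = "private"
        elif dominant_share >= threshold:
            result[var] = "practically private"
        else:
            result[var] = "shared"
    return result
```

A coherence design would then exempt the first two classes from most directory traffic, speculatively in the practically-private case, which is where the abstract's 46% traffic reduction comes from.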
ISBN (print): 9781450365239
For reasons of both performance and energy efficiency, high performance computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL framework supports portable programming across a wide range of computing devices and is gaining influence in programming next-generation accelerators. Characterizing the performance of these devices across a range of applications requires a diverse, portable, and configurable benchmark suite, and OpenCL is an attractive programming model for this purpose. We present an extended and enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus on the robustness of applications and the curation of additional benchmarks, with an increased emphasis on correctness of results and choice of problem size. Preliminary results and analysis are reported for eight benchmark codes on a diverse set of architectures: three Intel CPUs, five Nvidia GPUs, six AMD GPUs, and a Xeon Phi.
ISBN (print): 9781450301787
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we have enabled automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also present how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe performance increases of up to 560% over the language-defined layout, and a 7% gain in the worst case, in which the language-defined layout and access pattern are already well-vectorizable by the underlying hardware.
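The kind of index remapping involved can be sketched abstractly. This is an illustrative layout transformation, not the paper's: a row-major 2-D grid index is remapped into a tiled, column-of-tiles order so that a row's elements are spread across different regions of memory (standing in for parallel memory partitions). The tile width and the partition model are assumptions.

```python
# Hedged sketch: two address-calculation functions for the same 2-D
# grid. tiled_layout spreads each row's tiles apart in memory, the
# kind of remapping that improves memory-level parallelism.

def row_major(x, y, width):
    # the language-defined C-style layout
    return y * width + x

def tiled_layout(x, y, width, height, tile=4):
    tx, ox = divmod(x, tile)  # (tile column, offset within tile)
    # tiles are stored column-major: consecutive rows of one tile
    # column are contiguous, so a single row's tiles land far apart
    return (tx * height + y) * tile + ox
```

Both functions are bijections over the grid, so the transformation changes only where elements live, not which elements exist; the compiler then rewrites every access to use the new address calculation.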
The Second Workshop on Using Emerging Parallel Architectures (WEPA), held in conjunction with ICCS 2010, provides a forum for exploring the capabilities of emerging parallel architectures such as GPUs, FPGAs, the Cell B.E., and multi-cores to accelerate computational science applications. (C) 2010 Published by Elsevier Ltd.