ISBN (print): 0780377079
A 1 Gb multilevel flash memory is fabricated in a 0.13 μm CMOS process. The chip area of 95 mm² is achieved using AG-AND-type cells with a multilevel program cell technique and a compact write buffer. By use of constant-charge-injection programming and multi-bank operation, a high-speed programming throughput of 10 MB/s is achieved.
Adaptive mesh refinement (AMR) is a technique used in numerical simulations to automatically refine (or de-refine) certain regions of the physical domain in a finite difference calculation. AMR data consists of nested hierarchies of data grids. As AMR visualization is still a relatively unexplored topic, our work is motivated by the need to perform efficient visualization of large AMR data sets. We present a software algorithm for parallel direct volume rendering of AMR data using a cell-projection technique on several different parallel platforms. Our algorithm can use one of several different distribution methods, and we present performance results for each of these alternative approaches. By partitioning an AMR data set into blocks of constant resolution and estimating the rendering costs of individual blocks using an application-specific benchmark, it is possible to achieve even load balancing.
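The load-balancing idea in the last sentence can be illustrated with a minimal sketch. It assumes per-block cost estimates are already available and uses a generic greedy "largest block first" heuristic; the abstract does not say which distribution methods the authors use, so the function names, block names, and costs below are all hypothetical.

```python
import heapq


def assign_blocks(costs, num_procs):
    """Greedily assign blocks to processors, most expensive block first.

    costs: dict mapping a (hypothetical) block name to its estimated cost.
    Returns a dict mapping processor index to the list of blocks it renders.
    """
    # Min-heap of (current load, processor index): the least-loaded
    # processor is always popped first.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    for block, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append(block)
        heapq.heappush(heap, (load + cost, p))
    return assignment


def loads(costs, assignment):
    """Total estimated cost assigned to each processor."""
    return {p: sum(costs[b] for b in blocks) for p, blocks in assignment.items()}


# Illustrative cost estimates only; not measurements from the paper.
example_costs = {"b0": 4, "b1": 3, "b2": 3, "b3": 2}
example_assignment = assign_blocks(example_costs, 2)
print(loads(example_costs, example_assignment))
```

This heuristic does not always produce a perfectly even split, but it shows why accurate per-block cost estimates matter: the assignment is only as balanced as the estimates are correct.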
In this paper, we propose an implementation of an OpenMP compiler for distributed memory environments. While OpenMP provides a notion of a shared address space, a distributed memory environment does not have physical shared memory. One approach to implementing OpenMP in a distributed memory environment is communication code generation, in which a producer sends the appropriate data to the consumer. Our compiler finds accesses to shared data and represents them using the quad, our proposed array section descriptor. To identify the data to be sent, an intersection operation is performed between the quads representing written and read data. Since a quad can concisely represent strided accesses to an array section, our compiler can generate efficient code in cases where an OpenMP directive divides a for-loop in a block-cyclic manner. As a preliminary evaluation, we parallelized a matrix-multiply program by inserting an OpenMP directive and executed it on a PC cluster. As a result, we achieved a speedup of 7.82 with 8 processors.
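To make the intersection step concrete, here is a small sketch of intersecting two strided index sets. It does not reproduce the quad descriptor, whose exact fields the abstract does not give; instead it assumes a one-dimensional section described by a Python `range` (start, stop, positive stride), and the function name `intersect` is hypothetical.

```python
from math import gcd


def intersect(a: range, b: range) -> range:
    """Return the indices common to two ascending strided sections."""
    if a.step <= 0 or b.step <= 0:
        raise ValueError("only positive strides are handled in this sketch")
    g = gcd(a.step, b.step)
    # The two index lattices meet only if their starts agree modulo gcd.
    if (b.start - a.start) % g != 0:
        return range(0)
    lcm = a.step // g * b.step
    # Walk along a's lattice until we also land on b's lattice; a match
    # is guaranteed within b.step // g steps.
    x = a.start
    for _ in range(b.step // g):
        if (x - b.start) % b.step == 0:
            break
        x += a.step
    # Shift up to the first common index that lies inside both sections.
    lo = max(a.start, b.start)
    if x < lo:
        x += -(-(lo - x) // lcm) * lcm  # ceiling division
    return range(x, min(a.stop, b.stop), lcm)


# Hypothetical example: indices written by one thread vs. read by another.
written = range(0, 100, 4)
read = range(6, 60, 6)
print(list(intersect(written, read)))
```

The result is again a strided section, so the data to communicate can be described compactly rather than enumerated element by element. The speedup reported in the abstract also implies a parallel efficiency of 7.82 / 8, or about 0.98.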
ISBN (print): 0780381858
Exploiting thread-level parallelism is a promising way to improve the performance of multimedia applications running on multithreading general-purpose processors. This paper describes our work in developing the first multithreading implementation of the H.264 encoder. We parallelize the encoder using the OpenMP programming model, which allows us to leverage the advanced compiler technology in the Intel® C++ compiler for Intel hyper-threading architectures. We present our design considerations in the parallelization process. We describe an efficient multi-level data partitioning scheme that increases the performance of a multithreaded H.264 encoder. Our experiments show parallel speedups ranging from 4.31x to 4.69x on a 4-CPU Intel Xeon™ system with hyper-threading technology.
ISBN (print): 9781581135886
Programmable network interfaces provide the potential to extend the functionality of network services but lead to instruction processing overheads when compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32% to 107% for real network services. This parallelization results in performance within 10% to 20% of a modern ASIC-based network interface for real network services.
Current technologies have made it possible to execute parallel applications across heterogeneous platforms. However, the available performance models do not provide adequate methods to calculate, compare, and predict application performance on these platforms. In this paper, we discuss an enhanced performance evaluation model for parallel applications on heterogeneous systems. In our analysis, we include machines of different architectures, specifications, and operating environments. We also discuss the enabling technologies that facilitate such heterogeneous applications. The model is then validated through experimental measurements using an agent-based parallel Java system, which facilitates simultaneous utilization of heterogeneous systems for parallel applications. The model provides good evaluation metrics that allow developers to assess and compare the performance of parallel heterogeneous applications.
The need for performance in embedded applications will lead to the use of dynamic execution on embedded processors in the next few years. However, complete out-of-order superscalar cores are still expensive in terms of silicon area and power dissipation. In this paper, we study the adequacy of a more limited form of dynamic execution, namely decoupled architecture, for embedded applications. Decoupled architecture is known to work very efficiently whenever the execution does not suffer from inter-processor dependencies that cause some loss of decoupling, called LOD events. In this study, we address the regularity of codes in terms of the LOD events that may occur. We address three aspects of regularity: control regularity, control/memory dependency, and patterns of referencing memory data. Most of the kernels in MiBench will be amenable to efficient performance on a decoupled architecture.
Data race detection is essential for debugging multithreaded programs and assuring their correctness. Nevertheless, there is no single universal technique capable of handling the task efficiently, since the data race detection problem is computationally hard in the general case. Thus, all currently available tools, when applied to a general-case program, usually produce excessive false alarms or leave a large number of races undetected. Another major drawback of currently available tools is that they are restricted, for performance reasons, to detection units of fixed size. Thus, they all suffer from the same problem: choosing a small unit might result in missing some of the data races, while choosing a large one might lead to false detection. We present a novel testing tool, called MultiRace, which combines improved versions of Djit and Lockset, two very powerful on-the-fly algorithms for dynamic detection of apparent data races. Both extended algorithms detect races in multithreaded programs that may execute on weak consistency systems, and may use two-way as well as global synchronization primitives. By employing novel technologies, MultiRace adjusts its detection to the native granularity of objects and variables in the program under examination. In order to monitor all accesses to each of the shared locations, MultiRace instruments the C++ source code of the program. It lets the user fine-tune the detection process, but otherwise is completely automatic and transparent. This paper describes the algorithms employed in MultiRace, gives highlights of its implementation issues, and suggests some optimizations. It shows that the overheads imposed by MultiRace are often much smaller (by orders of magnitude) than those obtained by other existing tools.
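For readers unfamiliar with Lockset, the following minimal sketch shows the textbook form of the idea: each shared variable keeps a candidate set of locks that were held on every access so far, and a warning is raised once that set becomes empty. This is not MultiRace's improved algorithm, and it ignores refinements such as initialization and read-sharing states; the class and method names are hypothetical.

```python
class LocksetChecker:
    """Textbook lockset refinement over a trace of lock and access events."""

    def __init__(self):
        self.held = {}        # thread id -> set of locks currently held
        self.candidates = {}  # variable -> locks held on every access so far
        self.warnings = []    # variables with no consistently held lock

    def acquire(self, thread, lock):
        self.held.setdefault(thread, set()).add(lock)

    def release(self, thread, lock):
        self.held.setdefault(thread, set()).discard(lock)

    def access(self, thread, var):
        held = self.held.get(thread, set())
        if var not in self.candidates:
            # First access: every lock held now is a candidate protector.
            self.candidates[var] = set(held)
        else:
            # Later accesses: keep only locks held this time as well.
            self.candidates[var] &= held
        # In this simplest form, an empty candidate set is reported even
        # if only one thread ever touched the variable.
        if not self.candidates[var] and var not in self.warnings:
            self.warnings.append(var)


# Hypothetical trace: "x" is guarded by different locks in each thread,
# while "y" is always guarded by lock "m".
checker = LocksetChecker()
checker.acquire("t1", "m")
checker.access("t1", "x")
checker.access("t1", "y")
checker.release("t1", "m")
checker.acquire("t2", "n")
checker.access("t2", "x")
checker.release("t2", "n")
checker.acquire("t2", "m")
checker.access("t2", "y")
checker.release("t2", "m")
print(checker.warnings)
```

Because the check depends only on which locks are held, not on the observed interleaving, this style of detector can flag accesses that never actually raced, which is one source of the false alarms the abstract mentions.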
Authors: L.F.W. Goes, C.A.P.S. Martins
Electrical Engineering, Computational and Digital Systems Laboratory, Pontifical Catholic University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Computer Science Department, Electrical Engineering, Computational and Digital Systems Laboratory, Pontifical Catholic University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Parallelism is essential in any programming environment used to produce simulations and interactive games. These are just the sorts of programs most children would like to produce. Unfortunately, parallel programming is hard. Icicle, a programming-by-demonstration environment, supports the production of the most common forms of parallelism in a straightforward way and allows complete control over parallel behaviour using more advanced features.