The Pajé approach to performance debugging of parallel and distributed applications is to give programmers behavioral visualizations of their program executions. This article describes how Pajé wa...
The P-GRADE job execution mode will be demonstrated on a small Grid containing 3 clusters from Budapest and London. The first demonstration illustrates the Grid execution of a parallel meteorology application. The par...
We study the problem of mapping tree-structured data to an ensemble of parallel memory *** are given a "conflict tolerance" c, and we seek the smallest ensemble that will allow us to store any n-vertex rooted...
The Streaming SIMD extension (SSE) is a special feature embedded in the Intel Pentium III and IV classes of microprocessors. It enables the execution of SIMD-type operations to exploit data parallelism. This article p...
ISBN: 354040788X (print)
Today's parallel computers with SMP nodes provide both multithreading and message passing as their modes of parallel execution. As a consequence, performance analysis and optimization become more difficult and create a need for advanced performance tools that are custom-made for this class of computing environments. Current state-of-the-art tools provide valuable assistance in analyzing the performance of MPI and OpenMP programs by visualizing the run-time behavior and calculating statistics over the performance data. However, the developer of parallel programs is still required to filter out relevant parts from a huge amount of low-level information shown in numerous displays and to map that information onto program abstractions without tool support. The KOJAK project (Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks) aims to develop a generic automatic performance analysis environment for parallel programs. Performance problems are specified in terms of execution patterns that represent situations of inefficient behavior. These patterns are input for an analysis process that recognizes and quantifies the inefficient behavior in event traces. Mechanisms that hide the complex relationships within event pattern specifications allow a simple description of complex inefficient behavior at a high level of abstraction. The analysis process transforms the event traces into a three-dimensional representation of performance behavior. The first dimension is the kind of behavior. The second dimension describes the behavior's source-code location and the execution phase during which it occurs. Finally, the third dimension gives information on the distribution of performance losses across different processes or threads. The hierarchical organization of each dimension enables the investigation of performance behavior at varying levels of granularity. Each point of the representation is uniformly mapped onto the corresponding fra
Performance data usually must be archived for various performance analysis and optimization tasks such as multi-experiment analysis, performance comparison, and automated performance diagnosis. However, little effort has ...
Future processors having sliced memory pipelines will rely on bank prediction to schedule memory instructions to a first-level cache split into banks. In a deeply pipelined processor, even a small bank misprediction r...
ISBN: 0780381149 (print)
In designing high-speed communications, the smallest functional units, such as the arithmetic operation $B(x)^{-1} \bmod F(x)$, should be carefully designed and optimized to improve overall performance. To do this, we study two variations, square-first and multiply-first operations, of the repeated computation of $AB^2$. From these two variations, we propose m-bit parallel semi-systolic architectures for $GF(2^m)$ inversion. Compared with inversion architectures based on a normal power-sum operation, on a small-grain special power-sum operation, and on the Euclidean algorithm, the proposed architecture, which is based on the small-grain special power-sum operation, performs best for high-speed applications. When we implement a simplified 8-bit parallel semi-systolic architecture for the square-first inversion circuit over $GF(2^m)$ using a 0.25 μm CMOS library, it has 2495 equivalent logic gates and 1848 1-bit latches, with a latency of 56 and a clock rate of up to 580 MHz at 100% throughput.
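The idea of building inversion from a repeated $AB^2$ step can be illustrated in software. The following is a minimal sketch, not the paper's semi-systolic architecture: it assumes the field $GF(2^8)$ with the reduction polynomial $x^8+x^4+x^3+x+1$ (0x11B, chosen here only for illustration), and the function names `gf_mul`, `ab2`, and `gf_inv` are hypothetical. It relies on the identity $B^{-1} = B^{2^m-2}$, which holds for every nonzero element of $GF(2^m)$.

```python
def gf_mul(a, b, m=8, poly=0x11B):
    """Multiply two elements of GF(2^m); poly is an assumed reduction polynomial."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1
        if a & (1 << m):    # degree reached m: reduce modulo poly
            a ^= poly
    return r


def ab2(a, b, m=8, poly=0x11B):
    """The A * B^2 building block, computed square-first in this sketch."""
    return gf_mul(a, gf_mul(b, b, m, poly), m, poly)


def gf_inv(b, m=8, poly=0x11B):
    """Invert b using only A*B^2 steps: b^(2^m - 2) == b^(-1) for b != 0."""
    if b == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    r = b                        # r = b^(2^1 - 1)
    for _ in range(m - 2):
        r = ab2(b, r, m, poly)   # r = b * r^2 = b^(2^(k+1) - 1)
    return ab2(1, r, m, poly)    # final squaring gives b^(2^m - 2)
```

Each loop iteration is one $AB^2$ operation, so the whole inversion is a chain of $m-1$ identical steps. A regular chain of this kind is what makes such a computation a natural fit for a pipelined array.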
In this paper, an FPGA implementation of a novel and highly scalable hardware architecture for fast inversion of triangular matrices is presented. An integral part of modern signal processing and communications applications involves manipulation of large matrices. Therefore, scalable and flexible hardware architectures are increasingly sought after. In this paper, the traditional triangular-shaped array architecture with n(n+1)/2 communicating processors, with n being the number of inputs, is mapped to a linear structure with only n processors. The linear and the triangular-shaped architectures are compared in terms of area consumption, latency, and maximum clock speed. This paper also shows that the linear array structure avoids drawbacks such as non-scalability, large area, and large power consumption. The implementation is based on a numerically stable recurrence algorithm, which has excellent properties for hardware implementation.
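The abstract does not spell out the recurrence it uses. As a rough illustration of what a recurrence for triangular inversion looks like, the sketch below uses the standard back-substitution recurrence for an upper triangular matrix $R$ with inverse $W$: $W_{jj} = 1/R_{jj}$ and, for $i < j$, $W_{ij} = -\frac{1}{R_{ii}} \sum_{k=i+1}^{j} R_{ik} W_{kj}$. This is a textbook method swapped in for illustration, and the function name `invert_upper_triangular` is hypothetical.

```python
def invert_upper_triangular(R):
    """Invert an upper triangular matrix (list of rows) by back substitution.

    Assumes every diagonal entry of R is nonzero.
    """
    n = len(R)
    W = [[0.0] * n for _ in range(n)]
    for j in range(n):                      # each column of W is independent
        W[j][j] = 1.0 / R[j][j]
        for i in range(j - 1, -1, -1):      # move up the column from the diagonal
            acc = sum(R[i][k] * W[k][j] for k in range(i + 1, j + 1))
            W[i][j] = -acc / R[i][i]
    return W
```

Every column of the result depends only on $R$ and on entries lower in the same column, so the columns can be computed independently of one another.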
ISBN: 9781581137422 (print)
This paper presents a software implementation of a very fast parallel Reed-Solomon decoder on the second generation of the MorphoSys reconfigurable computation platform, which targets streamed applications such as multimedia and DSP. Numerous modifications to the first-generation architecture have produced a scalable, computation- and communication-intensive architecture capable of extracting fine-grained instruction-level parallelism. Many algorithms, as well as a complete digital video broadcasting baseband receiver, have been mapped onto the second-generation architecture with impressive performance. The mapping of a Reed-Solomon decoder proposed in the paper highly parallelizes all of its sub-algorithms, including syndrome computation, the Berlekamp algorithm, Chien search, and error value computation, in a SIMD fashion. The mapping is tested on a cycle-accurate simulator, "Mulate", and the performance is encouragingly better than that of other architectures. The decoding speed of the RS (255,239,16) decoder using two different methods of GF multiplication can be 1.319 Gbps and 2.534 Gbps, respectively. Furthermore, since no functionality is specifically tailored to the Reed-Solomon decoder, the result demonstrates the capability of the MorphoSys architecture to extract instruction-level parallelism from streamed applications.
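To make the first decoding stage concrete, here is a minimal sketch of syndrome computation for a code with 16 parity symbols (255 − 239). It is not the MorphoSys mapping. It assumes $GF(2^8)$ with the primitive polynomial 0x11D, the primitive element $\alpha = 2$, and generator roots $\alpha^0, \dots, \alpha^{15}$. None of these parameters are given in the abstract, and the function names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed primitive polynomial for GF(2^8)


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo PRIM_POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
    return r


def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r


def syndromes(received, n_parity=16, alpha=2, first_root=0):
    """Evaluate the received polynomial at each assumed generator root.

    `received` lists symbols with the highest-degree coefficient first.
    All syndromes are zero when the word is a valid codeword.
    """
    out = []
    for j in range(n_parity):
        x = gf_pow(alpha, first_root + j)
        s = 0
        for sym in received:        # Horner's rule
            s = gf_mul(s, x) ^ sym
        out.append(s)
    return out
```

The 16 evaluations share the same input symbols but are otherwise independent of one another, which is the kind of data parallelism that a SIMD mapping can exploit.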