PLX is a concise instruction set architecture (ISA) that combines the most useful features from previous generations of multimedia instruction sets with newer ISA features for high-performance, low-cost multimedia information processing. Unlike previous multimedia instruction sets, PLX is not added onto a base processor ISA, but designed from the beginning as a standalone processor architecture optimized for media processing. Its design goals are high-performance multimedia processing, general-purpose programmability to support an ever-growing range of applications, simplicity for constrained environments where low power and low cost are paramount, and scalability for higher performance in less constrained multimedia systems. Another design goal of PLX is to facilitate exploration and evaluation of novel techniques in instruction set architecture, microarchitecture, arithmetic, VLSI implementations, compiler optimizations, and parallel algorithm design for new computing paradigms. Key characteristics of PLX are a fully subword-parallel architecture with novel features like wordsize scalability from 32-bit to 128-bit words, a new definition of predication, and an innovative set of subword permutation instructions. We demonstrate the use and high performance of PLX on some frequently used code kernels selected from image, video, and graphics processing applications: discrete cosine transform, pixel padding, clip test, and median filter. Our results show that a 64-bit PLX processor achieves significant speedups over a basic 64-bit RISC processor and over IA-32 processors with MMX and SSE multimedia extensions. Using PLX's wordsize scalability feature, PLX-128 often provides an additional 2x speedup over PLX-64 in a cost-effective way. Superscalar or VLIW (Very Long Instruction Word) PLX implementations can also add further performance through inter-instruction, rather than intra-instruction, parallelism. We also describe the PLX testbed and its soft
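The subword parallelism and wordsize scalability mentioned in the PLX abstract can be illustrated with a small sketch. It is not PLX's instruction definition: the abstract does not give mnemonics or semantics, so the function names `pack`, `unpack`, and `padd`, the 16-bit lane width, and the wrap-around (modular) overflow behaviour are all assumptions chosen for illustration. The sketch packs several subwords into one word and adds all lanes with a single integer addition while keeping carries from crossing lane boundaries.

```python
def pack(values, subword_bits=16):
    """Pack a list of unsigned subwords into one integer, lane 0 in the low bits."""
    mask = (1 << subword_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * subword_bits)
    return word


def unpack(word, lanes, subword_bits=16):
    """Split a packed word back into its subwords."""
    mask = (1 << subword_bits) - 1
    return [(word >> (i * subword_bits)) & mask for i in range(lanes)]


def padd(a, b, subword_bits=16, word_bits=64):
    """Hypothetical packed add: every lane is added modulo 2**subword_bits."""
    assert word_bits % subword_bits == 0
    lanes = word_bits // subword_bits
    # Mask with the top bit of every lane set, e.g. 0x8000800080008000.
    high = 0
    for i in range(lanes):
        high |= 1 << (i * subword_bits + subword_bits - 1)
    low = ((1 << word_bits) - 1) & ~high
    # Add the lower bits of each lane; the carry lands in the lane's own top
    # bit, which was cleared, so it cannot spill into the next lane.
    partial = (a & low) + (b & low)
    # Fold the original top bits back in with XOR (a carry-less add of one bit).
    return partial ^ ((a ^ b) & high)


# Four 16-bit lanes in a 64-bit word.
x64 = pack([1, 0xFFFF, 300, 40000])
y64 = pack([2, 1, 500, 40000])
sum64 = unpack(padd(x64, y64), lanes=4)

# The same routine on a 128-bit word handles eight lanes per operation.
x128 = pack([1, 2, 3, 4, 5, 6, 7, 0xFFFF])
y128 = pack([10, 20, 30, 40, 50, 60, 70, 2])
sum128 = unpack(padd(x128, y128, word_bits=128), lanes=8)
```

Doubling `word_bits` doubles the number of lanes processed per operation, which is the intuition behind the additional speedup the abstract reports for PLX-128 over PLX-64.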
ISBN: (print) 0769522645
In this paper, a novel medical image processing system is discussed. The core of the system is developed using a 16-bit fixed-point parallel architecture B-spline signal processing system. The statistical measure of the finite word length effect is analytically developed. A modified algorithm for the reduced-hardware reprogrammable interpolator has been designed. Finally, suitable modifications are made to the hardware to reduce power consumption.
With the advent of hardware technologies, high-performance parallel computers and commodity clusters are becoming affordable. However, complexity of parallel application development remains one of the major obstacles ...
The Data Grid enables the sharing, selection, and connection of a wide variety of geographically distributed computational and storage resources for solving large-scale data intensive scientific applications. Such tec...
A common computing-core representation of the discrete cosine transform and discrete sine transform is derived, and a reduced-complexity algorithm is developed for computation of the proposed common computing-core. A parallel architecture based on the principle of distributed arithmetic is then designed for computing these transforms using the common-core algorithm. The proposed scheme not only leads to systolic-like, fully pipelined, regular and modular hardware for computing these transforms, but also offers significant hardware savings over existing structures with nearly the same computational throughput. The proposed structure is devoid of complicated input/output mapping and does not involve any complex control structure. Moreover, it places no restriction on the transform length, and can be used as a reusable core for cost-effective, high-throughput implementation of either of these transforms.
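The principle of distributed arithmetic that this abstract relies on can be sketched in a few lines. This is only the generic technique, not the paper's common-core algorithm, which the abstract does not spell out. The function names `build_lut` and `da_inner_product`, the 8-bit two's-complement input width, and the integer coefficients (standing in for quantized cosine or sine values) are assumptions made for illustration. The idea is to replace the multipliers of an inner product with one lookup table indexed by a bit slice of the inputs, followed by shift-and-accumulate.

```python
def build_lut(coeffs):
    """LUT[addr] holds the sum of coeffs[k] for every bit k that is set in addr."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, xs, bits=8):
    """Compute sum(coeffs[k] * xs[k]) bit-serially, without any multiplication.

    Each x must fit in `bits`-bit two's complement, i.e. lie in the range
    [-2**(bits - 1), 2**(bits - 1) - 1].
    """
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one table address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc


coeffs = [3, -5, 7, 2]        # hypothetical integer-scaled transform coefficients
xs = [10, -20, 127, -128]     # 8-bit signed input samples
da_result = da_inner_product(coeffs, xs)
direct_result = sum(c * x for c, x in zip(coeffs, xs))
```

The table grows as 2^n in the number of inputs, so hardware designs typically split long inner products into several smaller tables; how the paper partitions its computing core is not stated in the abstract.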
Reconfigurable architectures have become very relevant in recent years. In this paper we propose a methodology for analyzing interactive applications in order to execute them on a SIMD reconfigurable architecture, taking into account power/performance trade-offs. The methodology starts from a kernel description of the interactive application. Kernels are conditionally executed depending on dynamic conditions such as the user's manipulation of input data. The volume of data involved in this kind of application, combined with user actions occurring at unexpected times, strongly affects performance. We define an execution model to deal with conditional branches, accompanied by a data prefetch scheme, in order to avoid stalls of the reconfigurable processing unit due to operand unavailability. Experimental results satisfy the time constraints of interactive applications and show a power-effective solution for them.
Program performance optimization often involves choosing the right parameters to minimize the program's runtime. Selecting optimization parameters by means of execution-driven search is guaranteed to find excellent re...
Hardware implementation aspects of the MD5 hash algorithm are discussed in this paper. A general architecture for MD5 is proposed and several implementations are presented. An extensive study of the effects of pipelining on delay, area requirements and throughput is performed, and finally certain architectures are recommended and compared to other published MD5 designs. The designs were implemented on a Xilinx Virtex-II XC2V4000-6 FPGA, and a throughput of 586 Mbps was achieved with logic requirements of only 647 slices and 2 BlockRAMs. Methods to increase the throughput to gigabit level were also studied, and an implementation of parallel MD5 blocks achieving a throughput of over 5.8 Gbps was introduced. To the authors' knowledge, the MD5 designs presented in this paper are the fastest published FPGA-based architectures at the time of writing.
We describe a generic programming model to design collective communications on SMP clusters. The programming model utilizes shared memory for collective communications and overlapping inter-node/intra-node communications, both of which are normally platform-specific approaches. Several collective communications are designed based on this model and tested on three SMP clusters of different configurations. The results show that the developed collective communications can, with proper tuning, provide significant performance improvements over existing generic implementations. For example, when broadcasting an 8 MB message, our implementations outperform the vendor's MPI_Bcast by 35% on an IBM SP system, 51% on a G4 cluster, and 63% on an Intel cluster, the latter two using MPICH's MPI_Bcast. With all-gather operations using 8 MB messages, our implementation outperforms the vendor's MPI_Allgather by 75% on the IBM SP, 60% on the Intel cluster, and 48% on the G4 cluster.