The UCSC Kestrel parallel processor is part of an evolution from application-specific to specialized to application-unspecific processing. Kestrel combines an ALU, multiplier, and local memory with systolic shared registers for seamless merging of communication and computation, and an innovative condition stack for rapid conditionals. The result has been a readily programmable and efficient co-processor for many applications. Experience with Kestrel indicates that programmable systolic processing, and its natural combination with the single instruction-multiple data (SIMD) parallel architecture, will be an effective design choice for years to come.
For some time now, configurable computing has been hailed as the future for application-specific architectures. The purported advantages are well known: the increasing NRE cost of chip fabrication is avoided, the same platform can be used for a variety of applications, and implementations can be fixed or upgraded in the field. But in spite of many attempts to move configurable computing platforms into the mainstream, they have yet to achieve their full promise. This talk will explore the barriers that have kept configurable computing on the sidelines thus far and suggest steps we might take to realize its full potential.
In the process of mapping compute-intensive algorithms onto arrays of processing elements (PEs), efficient use of the channels between PEs and the registers within PEs is crucial for achieving significant algorithm acceleration. In this paper, this problem is solved for algorithms represented as systems of uniform recurrence equations. We address an optimization problem in order to realize the algorithmic data dependencies within the processor array (PA) with minimum cost for channels and registers. There, we use a new mapping approach that allows a direct mapping of the algorithm onto the PA by a partitioning method. In contrast to existing approaches, we consider the issue of avoiding redundant usage of channels and registers, which can appear if one instance of a variable has to be transferred from a source PE to several sink PEs. Further, a solution of the optimization problem determines the schedule for the transfer of the variable instances in the channels and their storage in registers, as well as the inner schedule for the operations in the PEs. We illustrate our method on the edge detection algorithm.
This paper proposes a novel parallel approach for pipelining nested multiplexer loops to design high-speed decision feedback equalizers (DFEs) based on look-ahead techniques. It is well known that the DFE is an efficient scheme for suppressing intersymbol interference (ISI) in various communication and magnetic recording systems. However, the feedback loop within a DFE places an upper bound on the achievable speed of a hardware implementation. A straightforward parallel implementation requires considerable hardware complexity. The proposed technique reduces hardware complexity by 56% and 80% compared with conventional parallel six-tap DFE architectures at 10 Gbps and 20 Gbps throughput, respectively.
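The serial feedback bottleneck the abstract refers to can be seen in a minimal software model of a DFE: each symbol decision must be fed back before the next symbol can be equalized. The BPSK alphabet (+1/-1), the single post-cursor tap, and the values below are illustrative assumptions, not parameters from the paper.

```python
# Minimal serial model of a decision feedback equalizer (DFE).
# Channel, tap values, and BPSK (+1/-1) symbols are hypothetical.

def dfe(received, fb_taps):
    """Equalize serially: each hard decision is fed back into later symbols."""
    decisions = []
    for x in received:
        # ISI estimate from the most recent past decisions (newest first).
        past = decisions[-len(fb_taps):][::-1]
        isi = sum(b * d for b, d in zip(fb_taps, past))
        y = x - isi                                 # cancel post-cursor ISI
        decisions.append(1.0 if y >= 0 else -1.0)   # hard decision
    return decisions

# Channel adding one post-cursor ISI term: r[n] = s[n] + 0.5 * s[n-1]
symbols = [1, -1, -1, 1, 1, -1]
received = [s + (0.5 * symbols[i - 1] if i else 0.0)
            for i, s in enumerate(symbols)]
```

The data dependence of `decisions[n]` on `decisions[n-1]` is exactly the feedback loop that bounds the clock rate in hardware; look-ahead transformations unroll this recursion so several symbols can be decided per cycle, at the cost of the extra hardware the paper's technique aims to reduce.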
ISBN (print): 142440164X
Materialized view (MV) maintenance is an important research area in databases and data warehousing. Currently, most MV maintenance methods are based on the C/S or B/S model, and MV tasks are recomputed sequentially, which can overload and even crash the system. To address this problem, decomposition of the maintenance task is explored based on a task-balancing collaborative strategy, which is also applicable to CSCW systems for managing, searching, and processing data.
Grids have emerged as paradigms for next-generation parallel and distributed computing. A computational grid can be defined as a large-scale, high-performance distributed computing environment that provides access to high-end computational resources. Grid scheduling is the process of scheduling jobs over grid resources. Improving overall system performance with a lower turnaround time is an important objective of grid scheduling. In this paper, a priority-based scheduling algorithm is proposed. The algorithm introduces a new parameter, "priority", and classifies jobs into high, medium, and low categories based on it. Priority assignment considers the computational power required by the job and its level of parallelism. The value for level of parallelism is assigned based on the amount of parallelism exhibited by the job and the amount of parallelism offered by the available resources. Generally, a job that needs high computational power and exhibits low parallelism is given high priority. Prioritizing jobs in this way can improve the performance of computational grids. The effectiveness of the algorithm is evaluated through simulation, and its superiority over other known algorithms is demonstrated.
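As a rough sketch of the classification rule described above: the thresholds and the `comp_power`/`parallelism` job attributes are assumptions for illustration only, since the abstract does not give a concrete formula.

```python
# Hedged sketch of priority-based grid scheduling: jobs that demand
# high computational power but exhibit low parallelism go first.
# Field names and thresholds are hypothetical, not from the paper.

def classify(job, resource_parallelism):
    """Return 'high', 'medium', or 'low' priority for a job dict."""
    # Level of parallelism: the job cannot use more than the resources offer.
    lop = min(job["parallelism"], resource_parallelism)
    if job["comp_power"] > 0.7 and lop < 0.3 * resource_parallelism:
        return "high"       # heavy, mostly serial job: schedule first
    if job["comp_power"] > 0.4:
        return "medium"
    return "low"

def schedule(jobs, resource_parallelism):
    """Order jobs by priority class (stable within a class)."""
    order = {"high": 0, "medium": 1, "low": 2}
    return sorted(jobs, key=lambda j: order[classify(j, resource_parallelism)])

jobs = [
    {"name": "a", "comp_power": 0.2, "parallelism": 8},
    {"name": "b", "comp_power": 0.9, "parallelism": 1},
    {"name": "c", "comp_power": 0.5, "parallelism": 4},
]
ordered = schedule(jobs, 8)
```

Because Python's `sorted` is stable, jobs within the same priority class keep their arrival order, which is a common tie-breaking choice for batch schedulers.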
ISBN (print): 3540292357
The proceedings contain 52 papers. The topics discussed include: improving concurrent write scheme in file server group; a practical comparison of cluster operating systems implementing sequential and transactional consistency; a recursive-adjustment co-allocation scheme in data grid environments; reducing the bandwidth requirements of P2P keyword indexing; localization techniques for cluster-based data grid; an efficient dynamic load-balancing algorithm in a large scale cluster; job scheduling policy for high throughput grid computing; and data distribution strategies for domain decomposition applications in grid environments.
This work proposes a new architecture and execution model called 2D-VLIW. The architecture adopts an execution model based on large pieces of computation running over a matrix of functional units connected by a set of local registers spread across the matrix. Experiments using the Mediabench and SPECint00 programs and the Trimaran compiler show performance gains ranging from 5% to 63% when comparing the proposal to an EPIC architecture with the same number of registers and functional units. The results also show that the g72-enc program running on a 3x3 2D-VLIW matrix achieved a speedup of 1.37 over a 2x2 matrix, while the same program on an EPIC processor with 9 functional units achieved a speedup of 1.12 over an EPIC processor with 4 functional units. For some internal procedures from the Mediabench and SPECint programs, the average 2D-VLIW OPC (operations per cycle) was up to 10 times greater than for the equivalent EPIC processor.
A master equation characterizes the time evolution of trajectories, i.e., the transition of states in protein folding kinetics. Numerical solution of the master equation requires calculating eigenvalues for the corresponding large-scale eigenvalue problem. In this paper, we present a parallel computing technique to compute the eigenvalues of the matrix acting on an N-dimensional vector of the instantaneous probabilities of the N conformations. Parallelization of the implicitly restarted Arnoldi method is successfully implemented on a PC-based Linux cluster. The parallelization scheme used in this work mainly partitions the operations of the matrix. For the Arnoldi factorization, we replicate the upper Hessenberg matrix H_m on each processor and distribute the set of Arnoldi vectors V_m among processors. Each processor performs its own operations. The algorithm is implemented on a PC-based Linux cluster with Message Passing Interface (MPI) libraries. A preliminary numerical experiment on a 32-node PC-based Linux cluster shows that the maximum difference among CPUs is within 10%. A 23x speedup and 72% parallel efficiency are also attained for the tested cases. This approach enables us to explore large-scale dynamics of protein folding.
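For reference, a serial sketch of the Arnoldi factorization that the paper parallelizes: it builds orthonormal vectors V and an upper Hessenberg matrix H such that A V_m = V_{m+1} H. In the paper's scheme the rows of V would be distributed across MPI ranks while H is replicated; the matrix size below is illustrative, not the N-conformation master-equation matrix.

```python
import numpy as np

def arnoldi(A, v0, m):
    """m-step Arnoldi factorization: A @ V[:, :m] == V @ H,
    with V of shape (n, m+1) orthonormal and H of shape (m+1, m)."""
    n = A.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]                  # expand the Krylov subspace
        for i in range(j + 1):           # modified Gram-Schmidt step
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))        # illustrative size only
V, H = arnoldi(A, np.ones(50), 10)
```

In a row-block distribution, each rank holds a slice of V and a full copy of H; the inner products in the Gram-Schmidt step and the matrix-vector product then require one all-reduce per iteration, which matches the replicate-H / distribute-V layout the abstract describes.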
High-performance protocol processing plays an increasingly important role in current high-speed network security. Recent studies show that advances in computer architecture and CPU performance have had limited impact on network protocol processing performance. Some researchers find that on a real SMT processor, such as an Intel Xeon with Hyper-Threading, contention between threads for shared resources (such as the cache) can hurt the performance of network processing applications such as servers or intrusion detection systems. In this paper, we focus on the processing performance of the TCP automaton phases, using execution-based simulation to model the relation between each phase's cache behavior and cache size, and then measuring the cache contention between threads.