To achieve a high performance/cost ratio, combining the advantages of DSP and RISC in a single architecture is a solution for the booming and versatile embedded application systems. The existing methods in the literature all focus on building a mixed architecture on a RISC base. In contrast, this paper proposes building the architecture on a DSP base and supplementing it with some RISC features. It exploits the advantages of the DSP architecture through instruction-level parallelism and powerful memory access capability, and obtains the good high-level language support of RISC by means of a local-homogeneous register set and a RISC-like pipeline. To accelerate the DSP design and optimize the system, an integrated design system (IDS) is presented to complete hardware/software co-design and co-validation. Application programs can also be developed rapidly in the integrated development environment (IDF). The behavior-level simulation of the DSP system has been completed. Performance is given as 150 MIPS with a 0.18 µm technology library. FPGA validation has also been accomplished.
Summary form only given. Network-on-a-chip (NoC) has received attention as a high-performance interconnect, because traditional buses, which cannot transfer more than one data stream simultaneously, are more likely to become a bottleneck. Since some NoC concepts have been proposed by simply borrowing the networking structure of parallel computers or system area networks (SANs), they tend to require complicated network interface logic in all the nodes. We propose a novel data-transfer method called Black-Bus as a NoC. In Black-Bus, a local identifier (ID) is attached to each raw data item as routing information. Unlike traditional packet transfer, the local ID is transferred on dedicated wires attached to the data lines, which removes the complicated packet generation procedure in a node. Only a small local ID is required to specify routing tags to the destination, and intermediate routers change it to resolve local ID conflicts between paths on a physical channel. The required local ID and routing table sizes for the Black-Bus router are evaluated with access trace data from the NAS parallel benchmarks for on-chip multiprocessors, and from a JPEG codec as stream processing. Evaluation results show that most of the applications require at most 3 bits for the local ID in a 16-node system. Black-Bus data transfer also reduces routing tags by up to 75% compared with the global addressing scheme used in traditional packet networks.
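The per-hop rewriting of local IDs described above can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the class name `LocalIdRouter`, the table layout, and the first-free allocation policy are assumptions, not the Black-Bus router design from the abstract.

```python
class LocalIdRouter:
    """Toy router that rewrites a small local ID at each hop (illustrative only)."""

    def __init__(self, id_bits=3):
        # With id_bits bits, each physical output channel can carry 2**id_bits paths.
        self.capacity = 1 << id_bits
        self.table = {}  # (in_port, in_id) -> (out_port, out_id)
        self.used = {}   # out_port -> set of local IDs already taken

    def add_path(self, in_port, in_id, out_port):
        """Register a path and pick the first free local ID on the output channel."""
        in_use = self.used.setdefault(out_port, set())
        for candidate in range(self.capacity):
            if candidate not in in_use:
                in_use.add(candidate)
                self.table[(in_port, in_id)] = (out_port, candidate)
                return candidate
        raise RuntimeError("no free local ID on this output channel")

    def forward(self, in_port, in_id, data):
        """Look up the routing table and emit the data with its rewritten ID."""
        out_port, out_id = self.table[(in_port, in_id)]
        return out_port, out_id, data
```

Two paths that arrive with the same local ID on different input ports and share an output channel receive distinct IDs on that channel, which is the conflict the abstract says intermediate routers resolve.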
ISBN: 0769520464 (print)
This paper presents an implementation of a very fast parallel complex FFT on M2, the second generation of the MorphoSys reconfigurable computation platform, which targets streamed applications such as multimedia and DSP. The proposed mapping comprises fast presorting, cascaded radix-2 stages, and post-reordering. Data and twiddle factors are 16-bit real and 16-bit imaginary in 2's complement format, and scaling is performed to avoid overflow. The mapping is tested on our cycle-accurate simulator, "Mulate", and the performance is encouragingly better than that of other architectures such as Imagine and VIRAM. Moreover, the performance scales with FFT size. Since there is no functionality specifically tailored to FFT, the results demonstrate the capability of the MorphoSys architecture to extract parallelism from streamed applications. Further rationales are given based on the concepts of scalar operand networks and memory hierarchy.
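As a reference point for the structure named above (presorting followed by cascaded radix-2 stages), here is a minimal sequential Python sketch of an iterative radix-2 decimation-in-time FFT. It assumes a bit-reversal presort and floating-point arithmetic; the 16-bit fixed-point scaling, the post-reordering step, and the parallel mapping onto M2 are not modeled, and the function names are illustrative.

```python
import cmath


def reverse_bits(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 DIT FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1

    # Presort: place input element i at its bit-reversed index.
    a = [0j] * n
    for i in range(n):
        a[reverse_bits(i, bits)] = complex(x[i])

    # Cascaded radix-2 stages: butterflies over blocks of size 2, 4, ..., n.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)  # twiddle factor increment
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        size *= 2
    return a
```

The butterflies inside one stage are independent of each other, which is the kind of parallelism a mapping onto an array of processing elements can exploit.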
ISBN: 0769520464 (print)
The X4CP32 is a parallel/reconfigurable microprocessor with two programming levels. Although it is a general-purpose microprocessor, it has the reliable performance of a reconfigurable architecture. This paper presents its architecture and programming levels, and discusses the powerful interaction between parallel programming and reconfiguration. It shows two performance-optimized implementations of matrix multiplication using both the parallel and reconfigurable paradigms, and a parallel implementation of miner intelligent agents.
ISBN: 0769520464 (print)
A bipartite graph G = (V, W, E) is convex if there exists an ordering of the vertices of W such that, for each v ∈ V, the neighbors of v are consecutive in W. In this work we describe a BSP/CGM algorithm for finding a maximum matching in a convex bipartite graph. For p processors, the algorithm runs in time O((|V|/p) lg(|V|/p) lg p) and uses O(lg p) communication rounds.
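For orientation, the problem itself can be solved sequentially with the classical greedy rule usually attributed to Glover: scan W in its convex order and match each w to the available v whose interval ends earliest. The Python sketch below shows that sequential baseline, not the BSP/CGM algorithm of the paper. It assumes each v is given as a non-empty inclusive interval (lo, hi) of positions in W; the function name is illustrative.

```python
import heapq


def convex_matching(intervals, m):
    """Greedy maximum matching in a convex bipartite graph.

    intervals[v] = (lo, hi): v is adjacent to W positions lo..hi inclusive.
    m is the number of vertices in W (positions 0..m-1).
    Returns a dict mapping each matched v to its W position.
    """
    # Visit the V vertices in order of interval start.
    order = sorted(range(len(intervals)), key=lambda v: intervals[v][0])
    heap = []      # (hi, v) for every v whose interval has already started
    matching = {}
    nxt = 0
    for w in range(m):
        # Activate every v whose interval starts at or before w.
        while nxt < len(order) and intervals[order[nxt]][0] <= w:
            v = order[nxt]
            heapq.heappush(heap, (intervals[v][1], v))
            nxt += 1
        # Discard v whose interval ended before w; they stay unmatched.
        while heap and heap[0][0] < w:
            heapq.heappop(heap)
        # Match w to the active v with the earliest interval end.
        if heap:
            _, v = heapq.heappop(heap)
            matching[v] = w
    return matching
```

Choosing the earliest-ending interval keeps the more flexible vertices available for later positions in W.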
ISBN: 0769520464 (print)
Commodity-built clusters, a low-cost alternative for distributed parallel processing, have brought high-performance computing to a wide range of users. However, the existing widespread tools for distributed parallel programming, such as message-passing libraries, do not meet the new software engineering requirements that have emerged as applications have grown more complex. Haskell(#) is a parallel programming language that aims to reconcile higher abstraction and modularity with scalable performance. This paper demonstrates the use of Haskell(#) in programming three SPMD benchmark programs for which lower-level MPI implementations are available.
ISBN: 0769520464 (print)
Modular exponentiation is the cornerstone computation performed in public-key cryptography systems such as the RSA cryptosystem. The operation is time-consuming for large operands. This paper describes the characteristics of three architectures designed to implement modular exponentiation using the fast binary method: the first FPGA prototype has a sequential architecture, the second has a parallel architecture, and the third has a systolic array-based architecture. The paper compares the three prototypes using the classic time × area factor. All three prototypes implement modular multiplication using the popular Montgomery algorithm.
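The combination named above, binary exponentiation built on Montgomery multiplication, can be sketched in software as follows. This is a minimal Python illustration, not a model of any of the three hardware prototypes. It assumes an odd modulus, a left-to-right bit scan (the abstract does not say which scan direction the prototypes use), and Python 3.8 or later for the modular inverse via `pow(n, -1, r)`; all function names are illustrative.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n with R = 2**k > n."""
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime = -1 (mod R)
    r2 = (r * r) % n                # R^2 mod n, used to enter Montgomery form
    return r, n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Return a * b * R^(-1) mod n for 0 <= a, b < n, where R = 2**k."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k            # exact division: t + m*n is a multiple of R
    return u - n if u >= n else u


def mod_exp(base, exponent, n):
    """Compute base**exponent mod n with the binary method in Montgomery form."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    k = n.bit_length()
    r, n_prime, r2 = montgomery_setup(n, k)
    x = mont_mul(base % n, r2, n, k, n_prime)  # base in Montgomery form
    acc = r % n                                # the value 1 in Montgomery form
    for bit in bin(exponent)[2:]:              # most significant bit first
        acc = mont_mul(acc, acc, n, k, n_prime)    # square on every bit
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)  # multiply when the bit is set
    return mont_mul(acc, 1, n, k, n_prime)     # convert back out of Montgomery form
```

For example, `mod_exp(4, 13, 497)` returns 445, which matches Python's built-in `pow(4, 13, 497)`. Montgomery multiplication replaces the division by the modulus with shifts and masks, which is what makes it attractive for hardware.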
ISBN: 0769520464 (print)
One of the main challenges to the wide use of the Internet is the scalability of servers, that is, their ability to handle increasing demand. Scalability in stateful servers, which include e-commerce and other transaction-oriented servers, is even more difficult, since transaction data must be kept across requests from the same user. One common strategy for achieving scalability is to employ clustered servers, where the load is distributed among the various servers. However, as a consequence of the workload characteristics and the need to keep data coherent among the servers that compose the cluster, load imbalances arise among servers, reducing the efficiency of the cluster as a whole. In this paper we propose and evaluate a strategy for load balancing in stateful clustered servers. Our strategy is based on control theory and achieved significant gains over configurations that do not employ it, reducing the response time by up to 50% and increasing the throughput by up to 16%.
ISBN: 076951894X (print)
A high-radix composite algorithm for the computation of the powering function (X^Y) is presented in this paper. The algorithm consists of a sequence of overlapped operations: (i) digit-recurrence logarithm, (ii) left-to-right carry-free (LRCF) multiplications, and (iii) on-line exponential. A redundant number system is used, and the selection in (i) and (iii) is done by rounding except in the first iteration, when selection by table look-up is necessary to guarantee the convergence of the recurrences. A sequential implementation of the algorithm is proposed, and the execution times and hardware requirements are estimated for single- and double-precision floating-point computations, for radix r = 128, showing that powering can be computed with performance similar to that of high-radix CORDIC algorithms.
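The three operations listed above follow the standard identity X^Y = exp(Y · log X) for X > 0. The short Python sketch below shows only that decomposition, using library floating-point routines. It assumes the natural logarithm and does not model the digit recurrences, the redundant number system, or the overlapping of stages; the function name is illustrative.

```python
import math


def power_via_log_exp(x, y):
    """Compute x**y for x > 0 as exp(y * log(x))."""
    if x <= 0:
        raise ValueError("x must be positive")
    log_x = math.log(x)        # step (i): logarithm
    product = y * log_x        # step (ii): multiplication
    return math.exp(product)   # step (iii): exponential
```

In floating point, any error in the logarithm is amplified by the size of the product before exponentiation, which is one reason the intermediate result typically has to be carried at higher precision than the final result.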