ISBN:
(print) 3540364102
This paper proposes a rescheduling of the SHA-1 hash function operations for hardware implementations. The proposal is mapped onto Xilinx Virtex-II Pro technology. The proposed rescheduling allows manipulation of the critical path of the SHA-1 computation, facilitating a more parallelized structure without an increase in the required hardware resources. Two cores have been developed: one that uses a constant initialization vector, and a second that allows different Initialization Vectors (IV), so it can be used in HMAC and in the processing of fragmented messages. A hybrid software/hardware implementation is also proposed. Experimental results indicate a throughput of 1.4 Gbit/s, requiring only 533 slices for a constant IV and 596 for a loadable IV. Comparisons to related SHA-1 art suggest improvements in the throughput/slice metric of 29% over the most recent commercial cores and 59% over current academic proposals.
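For background, the critical path the abstract refers to can be seen in the standard SHA-1 round itself. The sketch below (pure Python, following the FIPS 180 round definition, not the paper's hardware design) shows that each round's new working value `a` depends on the previous `a` through a rotation plus a chain of four additions; shortening that addition chain by precomputing the `a`-independent terms is the kind of manipulation a rescheduling enables.

```python
# One SHA-1 compression round, illustrating the loop-carried
# dependency that forms the critical path in hardware.

MASK = 0xFFFFFFFF

def rotl(x, n):
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK

def sha1_round(a, b, c, d, e, f_t, k_t, w_t):
    """Produce the next (a, b, c, d, e) working variables.

    The new `a` (named `temp`) depends on the just-produced `a`
    through rotl(a, 5) plus four additions; e, k_t, w_t and f_t can
    be summed ahead of time, since they do not depend on `a`.
    """
    temp = (rotl(a, 5) + f_t + e + k_t + w_t) & MASK
    return temp, a, rotl(b, 30), c, d
```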
This paper presents a modeling approach based on deterministic and stochastic Petri nets (DSPN) for analyzing the performance of node architectures for MIMD multiprocessor systems with distributed memory. DSPN are a numerically solvable modeling formalism with a graphical representation. The modeling approach supports design decisions for node architectures by providing quantitative results on processor and memory utilization for several design alternatives. To illustrate the proposed approach, DSPN of two node architectures are presented and employed in a comparative performance study.
ISBN:
(print) 354026969X
The Stream model is a high-level Intermediate Representation that can be mapped to a range of parallel architectures. The Stream model has a limited scope because it is aimed at architectures that reduce the control overhead of programmable hardware to improve overall computing efficiency. Despite these limitations, the performance-critical parts of embedded and media applications can often be compiled to this model. Automatic compilation of Stream programs from C code is demonstrated.
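The core idea of stream-style IRs can be sketched as follows (a minimal illustration, not the paper's actual representation): a kernel is a pure function applied element-wise over input streams, which is what lets a compiler map it onto parallel hardware without per-element control overhead.

```python
# Minimal sketch of the stream abstraction: apply a kernel across
# corresponding elements of one or more input streams.

def stream_map(kernel, *streams):
    """Apply `kernel` element-wise over the input streams."""
    return [kernel(*elems) for elems in zip(*streams)]

# Example kernel: a saturating 8-bit add, common in media code.
def sat_add8(x, y):
    return min(x + y, 255)

out = stream_map(sat_add8, [100, 200, 50], [100, 100, 10])
```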
ISBN:
(print) 3540364102
The main goal of an overtaking monitor system is the segmentation and tracking of the overtaking vehicle. This application can be addressed through an optic-flow-driven scheme. We can focus on the rear-view mirror's visual field by placing a camera on top of it. When driving, the ego-motion optic flow pattern is more or less unidirectional, i.e. all static objects and landmarks move backwards while overtaking cars move forward towards our vehicle. This well-structured motion scenario facilitates the segmentation of regular motion patterns that correspond to the overtaking vehicle. Our approach is based on two main processing stages: first, the computation of optical flow using a novel superpipelined and fully parallelized architecture capable of extracting the motion information at frame rates of up to 148 frames per second at VGA resolution (640x480 pixels); second, a tracking stage based on motion pattern analysis that provides an estimated position of the overtaking car. We analyze the system's performance and resource usage and show some promising results on a bank of overtaking-car sequences.
ISBN:
(print) 3540364102
We present a highly efficient automated clock gating platform for rapidly developing power-efficient hardware architectures. Our language, called CoDeL, allows hardware description at the algorithm level and thus dramatically reduces design time. We have extended CoDeL to automatically insert clock gating at the behavioral level to reduce dynamic power dissipation in the resulting architecture. This is, to our knowledge, the first hardware design environment that allows an algorithmic description of a component and yet produces a power-aware design. To estimate the power savings, we have developed an estimation framework, which is shown to be consistent with the savings obtained using statistical power analysis with Synopsys tools. To evaluate our platform, we use the CoDeL implementation of a counter and of various integer transforms used in the realm of DSP (Digital Signal Processing): the discrete wavelet transform, the discrete cosine transform, and an integer transform used in the H.264 (MPEG-4 Part 10) video compression standard. These designs are then clock gated using CoDeL and Synopsys. A simulation-based power analysis of the designed circuits shows that CoDeL's clock gating performs better than Synopsys' automated clock gating: CoDeL reduces power dissipation by 83% on average, while Synopsys gives 81% savings.
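The intuition behind such savings can be sketched with a first-order estimate (invented numbers, not the paper's estimation framework): if a register bank's clock only needs to toggle on cycles where its enable is asserted, gating removes roughly the idle-cycle fraction of that register's clock-tree dynamic power.

```python
# First-order clock-gating estimate: the saved fraction of a gated
# register's clock power equals the fraction of cycles it is idle.

def gated_power_saving(enable_trace):
    """Fraction of cycles in which the gated clock is held off."""
    idle = sum(1 for en in enable_trace if not en)
    return idle / len(enable_trace)

# Hypothetical example: a counter stage that updates on 1 of every
# 8 cycles idles 87.5% of the time.
trace = [(cycle % 8 == 0) for cycle in range(800)]
saving = gated_power_saving(trace)
```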
Modern multimedia applications usually have real-time constraints, and they are implemented using application-domain-specific embedded processors. Dimensioning a system requires accurate estimates of the resources needed by the applications; overestimation leads to over-dimensioning. For a good resource estimation, all the cases in which an application can run must be considered. To avoid an explosion in the number of different cases, those that are similar with respect to required resources are combined into so-called application scenarios. This paper presents a methodology and a tool that can automatically detect the most important variables of an application and use them to select and dynamically predict scenarios, with respect to the necessary time budget, for soft real-time multimedia applications. The tool was tested on three multimedia applications. Using a proactive scenario-based dynamic voltage scheduler built on the scenarios and the runtime predictor generated by our tool, energy consumption decreases by up to 19%, while guaranteeing a frame-deadline miss ratio close to zero.
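A scenario-based voltage/frequency choice can be sketched like this (all scenario names, cycle budgets, and frequency levels below are invented for illustration; the paper's variables and scenarios are detected automatically per application): a control variable maps each frame to a scenario with a known worst-case cycle budget, and the scheduler picks the lowest frequency that still meets the frame deadline.

```python
# Hypothetical scenario table: control-variable value -> cycle budget.
SCENARIOS = {
    "I_frame": 9_000_000,
    "P_frame": 5_000_000,
    "B_frame": 3_000_000,
}
DEADLINE_S = 1 / 30                 # one frame period at 30 fps
FREQS_HZ = [100e6, 200e6, 300e6]    # available DVS levels

def pick_frequency(frame_type):
    """Lowest frequency whose per-frame cycle capacity covers the budget."""
    budget = SCENARIOS[frame_type]
    for f in FREQS_HZ:
        if f * DEADLINE_S >= budget:
            return f
    return FREQS_HZ[-1]             # fall back to the highest level
```

Running cheap scenarios at a lower frequency is where the energy saving comes from, since dynamic power drops superlinearly with voltage and frequency.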
ISBN:
(print) 3540364102
Many telecommunication applications, especially baseband processing, and digital signal processing (DSP) applications call for high-performance implementations due to the complexity of the algorithms and high throughput requirements. In general, the required performance is obtained with the aid of parallel computational resources. In these application domains, software implementations are often preferred over fixed-function ASICs due to their flexibility and ease of development. Application-specific instruction-set processor (ASIP) architectures can be used to exploit the inherent parallelism of the algorithms efficiently while still maintaining flexibility. Using high-level languages to program processor architectures with parallel resources can lead to inefficient resource utilization; on the other hand, parallel assembly programming is error-prone and tedious. In this paper, the inherent problems of parallel programming and software pipelining are mitigated with a parallel language syntax and automatic generation of software-pipelined code for the iteration kernels. With the aid of the developed tool support, the underlying performance of a processor architecture with parallel resources can be exploited, and full utilization of the main processing resources is obtained for pipelined loop kernels. The given examples show that this efficiency can be obtained without sacrificing performance.
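The software-pipelining transformation itself can be illustrated with a toy kernel (the loop body below is invented; the paper's tool targets real ASIP kernels): a 3-stage body of load, compute, and store is split so the steady-state kernel overlaps the compute of one iteration with the load of the next, preceded by a prologue that fills the pipeline and followed by an epilogue that drains it.

```python
# Reference loop: each iteration does load -> compute -> store.
def sequential(xs):
    out = [0] * len(xs)
    for i in range(len(xs)):
        v = xs[i]          # stage 1: load
        v = v * v + 1      # stage 2: compute
        out[i] = v         # stage 3: store
    return out

# Software-pipelined version with explicit prologue and epilogue.
def pipelined(xs):
    n = len(xs)
    out = [0] * n
    if n == 0:
        return out
    loaded = xs[0]                                     # prologue: fill
    for i in range(1, n):
        computed, loaded = loaded * loaded + 1, xs[i]  # overlap stages
        out[i - 1] = computed                          # store iter i-1
    out[n - 1] = loaded * loaded + 1                   # epilogue: drain
    return out
```

On hardware with parallel functional units, the two halves of the kernel's tuple assignment would issue in the same cycle; in Python the transformation only demonstrates the schedule, not the speedup.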
ISBN:
(print) 1424401550
In embedded multiprocessors, cache partitioning is a known technique to eliminate inter-task cache conflicts and thus increase predictability. On such systems, the partitioning ratio is a parameter that should be tuned to optimize performance. In this paper we propose a Simulated Annealing (SA) based heuristic to determine the cache partitioning ratio that maximizes an application's throughput. At its core, the SA method iterates many times over many partitioning ratios, checking the resulting throughput. Hence the throughput of the system has to be estimated very fast, so we use a light simulation strategy. The light simulation derives the throughput from task timings gathered off-line. This is possible because in an environment where tasks don't interfere with each other, their performance figures can be used in any possible combination. An application of industrial relevance (an H.264 decoder) running on a parallel homogeneous platform is used to demonstrate the proposed method. For the H.264 application, a 9% throughput improvement is achieved compared to the throughput obtained when partitioning for the least number of misses. This is a significant improvement, as it represents 45% of the theoretical throughput improvement achievable when assuming an infinite cache.
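The shape of such an SA loop can be sketched as follows. Everything application-specific is invented here: the real method evaluates each candidate ratio with the light simulation, which the stand-in `throughput()` function below only mimics with an arbitrary concave shape over an 8-way split between two tasks.

```python
import math
import random

WAYS = 8                    # hypothetical: cache ways split between 2 tasks

def throughput(ways_a):
    """Stand-in for the light simulation (invented, concave shape)."""
    ways_b = WAYS - ways_a
    return min(10 + 3 * ways_a, 14 + 2 * ways_b)

def anneal(steps=500, t0=5.0, seed=0):
    """Simulated annealing over partition ratios, tracking the best seen."""
    rng = random.Random(seed)
    cur = rng.randrange(1, WAYS)
    best = cur
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9          # cooling schedule
        cand = min(WAYS - 1, max(1, cur + rng.choice((-1, 1))))
        delta = throughput(cand) - throughput(cur)
        # Accept improvements always; worsening moves with prob e^(d/t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            cur = cand
        if throughput(cur) > throughput(best):
            best = cur
    return best
```

Because each candidate evaluation is just a table/model lookup rather than a full simulation, the loop can afford hundreds of iterations, which is the point the abstract makes about needing a fast throughput estimate.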
ISBN:
(print) 1424401550
The cache memory plays a crucial role in the performance of any processor. The cache memory (SRAM), especially the on-chip cache, is 3-4 times faster than the main memory (DRAM), so it can vastly improve processor performance and speed. The cache also consumes much less energy than the main memory, which leads to large power savings, very important for embedded applications. In today's processors, although the cache reduces the energy consumption of the processor overall, the on-chip cache accounts for almost 40% of the processor's total energy consumption. In this paper, we propose a cache architecture for the instruction cache that is a modification of the hotspot architecture. Our proposed architecture consists of a small filter cache in parallel with the hotspot cache, between the L1 cache and the main memory. The small filter cache holds the code that was not captured by the hotspot cache. We also propose a prediction mechanism to steer each memory access to either the hotspot cache, the filter cache, or the L1 cache. Our design has both a faster access time and lower energy consumption than both the filter cache and the hotspot cache architectures. We use the MiBench and MediaBench benchmarks together with the SimpleScalar simulator to evaluate the performance of our proposed architecture and compare it with the filter cache and hotspot cache architectures. The simulation results show that our design outperforms both the filter cache and the hotspot cache in both average memory access time and energy consumption.
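The average-memory-access-time comparison the abstract reports follows the standard AMAT decomposition. A back-of-the-envelope sketch (all latencies and miss rates below are invented, not the paper's measurements): a small structure is probed first, and its misses fall through to the L1 path.

```python
def amat(t_small, miss_small, t_l1, miss_l1, t_mem):
    """Average memory access time with a small cache probed before L1.

    AMAT = t_small + miss_small * (t_l1 + miss_l1 * t_mem)
    """
    return t_small + miss_small * (t_l1 + miss_l1 * t_mem)

# Hypothetical numbers: better capture in the small structure (via
# steering) lowers the miss rate seen there and thus the AMAT.
plain_filter = amat(1, 0.30, 3, 0.05, 100)   # unsteered filter cache
steered      = amat(1, 0.20, 3, 0.05, 100)   # hotspot + filter steering
```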
ISBN:
(print) 9780769528410
Advances in the design, modeling, and simulation of parallel processing systems provide significant research opportunities which lead to improvements in the speed, performance, fault tolerance, flexibility, and cost-effectiveness of distributed systems. Several parameters determine the suitability of a system architecture for a given application; the Average Routing Distance (ARD), however, is perhaps one of the most important parameters in the performance evaluation of parallel processing systems. To this end, mathematical modeling and simulation of the ARD and visit ratio for a class of parallel processing systems are presented.
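As a worked example of what ARD measures (a standard textbook result, not taken from the paper): for a d-dimensional hypercube, the routing distance between two nodes is the Hamming distance of their labels, and the average over all distinct node pairs has the closed form ARD = d * 2^(d-1) / (2^d - 1). The sketch below computes it by brute force so it can be checked against that formula.

```python
def ard_hypercube(d):
    """Average routing distance of a d-dimensional hypercube.

    Node labels are 0..2^d - 1; the hop count between two nodes is
    the Hamming distance of their labels (one hop per differing bit).
    """
    n = 2 ** d
    total = sum(bin(a ^ b).count("1")
                for a in range(n) for b in range(n))
    return total / (n * (n - 1))    # average over ordered pairs, a != b
```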