ISBN (print): 9781450312448
For high-throughput applications, efficient parallel architectures must avoid access collisions, i.e., concurrent read/write accesses to the same memory bank. This consideration applies, for example, to the two main classes of turbo-like codes: Low-Density Parity-Check (LDPC) codes and Turbo Codes. These error-correcting codes, which scramble data according to an interleaving law, are used in most recent communication standards and storage systems, such as wireless access, digital video broadcasting, and magnetic storage in hard disk drives. To optimize the architectural cost and reduce the control complexity of such integrated circuits, designers usually place standard interconnection networks with low-complexity topologies between the processing elements and the memory banks. However, the design constraints, i.e., the interleaving law, the parallelism, and the interconnection network, often prevent mapping the data into the memory banks without any conflict. In this paper we propose a methodology that always finds a collision-free memory mapping for a given set of design constraints. The approach uses additional registers whenever the design constraints forbid conflict-free use of the memory banks. Our approach is compared to state-of-the-art methods, and its interest is shown through the design of parallel interleavers for industrial applications: Multi-Band Orthogonal Frequency-Division Multiplexing Ultra-WideBand (MB-OFDM UWB) and non-binary LDPC decoders. Copyright 2012 ACM.
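The conflict-detection step that such a mapping methodology must perform can be sketched as follows. This is a minimal illustration only, not the authors' algorithm: the toy interleaver, the bank maps, and the function name are all hypothetical. With parallelism P, the P data items consumed at each time step must lie in P distinct memory banks; any step where they do not is a collision that would force either a different mapping or an extra register.

```python
def conflict_steps(pi, P, bank):
    """Return the time steps where >= 2 of the P parallel accesses
    hit the same memory bank.
    pi   : interleaving law, given as a permutation (list of data indices)
    P    : parallelism (number of processing elements)
    bank : mapping from data index to memory bank
    """
    conflicts = []
    for t in range(len(pi) // P):
        accessed = [pi[t * P + p] for p in range(P)]  # data read in parallel at step t
        banks = [bank[d] for d in accessed]
        if len(set(banks)) < P:  # two accesses collide on one bank
            conflicts.append(t)
    return conflicts

# Toy example: 8 data items, P = 2, and a toy interleaver.
pi = [0, 4, 1, 5, 2, 6, 3, 7]

# A naive modulo banking collides at every step (both accessed
# items always share the same parity):
print(conflict_steps(pi, 2, [d % 2 for d in range(8)]))   # [0, 1, 2, 3]

# A hand-crafted bank map satisfying this interleaver is conflict-free:
print(conflict_steps(pi, 2, [0, 1, 0, 1, 1, 0, 1, 0]))    # []
```

The second bank map shows why the problem is non-trivial: the mapping must be chosen jointly with the interleaving law, and when no such map exists for the given network, the paper's methodology falls back on registers.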
ISBN (print): 9781509060238
Sub-round implementations of AES have been explored as an area- and energy-efficient solution to encrypt data in resource-constrained applications such as the Internet of Things. Symmetry in AES operations across bytes and words allows the datapath to be scaled down to 8 bits, resulting in very compact designs. However, such designs incur an area penalty to store intermediate results, or an energy penalty to shift data through registers without performing useful computation. We propose a smart clocking scheme and rename registers to minimize data movement and clock loading, and also avoid storing a duplicate copy of the system state. In comparison to the most efficient 8-bit implementation from the literature, we save 45% energy per encryption and reduce clock energy by 70% at a reasonable area cost.
ISBN (print): 9781450344937
In the many-core era, the performance of MPI collectives depends increasingly on the intra-node communication component. However, the communication algorithms are generally inherited from the inter-node versions and ignore cache complexity. We propose cache-oblivious algorithms for MPI all-to-all operations, in which data blocks are copied into the receive buffers in Morton order to exploit data locality. Experimental results on different many-core architectures show that our cache-oblivious implementations significantly outperform both naive implementations based on a shared heap and highly optimized MPI libraries.
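The core idea can be sketched as follows. This is a hedged illustration under my own assumptions, not the paper's implementation: the function names are hypothetical, and real message blocks are replaced by labels. In an intra-node all-to-all among P ranks, block j of rank i must land in slot i of rank j; visiting the (i, j) pairs along the Morton (Z-order) curve instead of row-major keeps nearby source and destination blocks cache-resident.

```python
def morton_decode(z):
    """De-interleave the bits of z into coordinates (i, j):
    even bit positions of z form j, odd positions form i."""
    i = j = 0
    bit = 0
    while z:
        j |= (z & 1) << bit         # even bits -> column j
        i |= ((z >> 1) & 1) << bit  # odd bits  -> row i
        z >>= 2
        bit += 1
    return i, j

def alltoall_morton(send, P):
    """send[i][j] is the block rank i sends to rank j.
    Copies all P*P blocks by walking the copy grid in Morton order."""
    recv = [[None] * P for _ in range(P)]
    for z in range(P * P):          # Z-curve traversal of the P x P grid
        i, j = morton_decode(z)
        recv[j][i] = send[i][j]     # rank j receives rank i's block in slot i
    return recv

P = 4
send = [[f"blk{i}->{j}" for j in range(P)] for i in range(P)]
recv = alltoall_morton(send, P)
print(recv[2][3])  # the block rank 3 sent to rank 2: 'blk3->2'
```

Since the Morton code is a bijection between [0, P*P) and the (i, j) grid (for P a power of two), every block is copied exactly once; only the traversal order, and hence the cache behavior, differs from the naive double loop.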
Communication and synchronization are the two main latency issues in computing the FFT on parallel architectures. Both latencies have to be either hidden or tolerated to achieve high performance. One approach is multithreading. Another approach to tolerating latency is to map data efficiently onto the processors' local memories and exploit data locality. Indirect swap networks, an idea proposed for VLSI circuits, can be used efficiently to compute the butterfly operations in the FFT. Data mapping in the swap-network topology halves the communication overhead at each iteration. The Cell Broadband Engine (Cell/B.E.) processor is a heterogeneous multicore processor for stream-data applications and high-performance computing. Its eight SIMD processing elements, the synergistic processor elements (SPEs), provide multi-fold parallelism. In this paper, we investigate an improved Cooley-Tukey FFT algorithm based on an indirect swap network and design the parallel algorithm taking into consideration all the features of the Cell/B.E. architecture. The performance results show that, at the processor level, the new algorithm on the Cell/B.E. is 3.7 times faster than the cluster for a 4K input size and 6.4 times faster for a 16K input size.
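The butterfly structure being distributed across the SPEs is that of the standard radix-2 Cooley-Tukey FFT, sketched below. This is a plain textbook reference version under my own assumptions, not the paper's swap-network variant: it only shows that at each stage every element is paired with a partner at a distance that doubles per iteration, which is the exchange pattern a swap-network mapping reduces.

```python
import cmath

def bit_reverse(x):
    """Permute x into bit-reversed index order (n must be a power of two)."""
    n = len(x)
    y = list(x)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:       # add 1 to j in reversed-bit arithmetic
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            y[i], y[j] = y[j], y[i]
    return y

def fft_iterative(x):
    """In-place radix-2 Cooley-Tukey FFT of a power-of-two-length sequence."""
    n = len(x)
    a = bit_reverse(x)
    m = 2
    while m <= n:                      # stages: partner distance m//2 doubles each pass
        w_m = cmath.exp(-2j * cmath.pi / m)
        for k in range(0, n, m):       # each length-m butterfly group
            w = 1
            for t in range(m // 2):
                u = a[k + t]
                v = a[k + t + m // 2] * w   # partner element, twiddled
                a[k + t] = u + v            # butterfly: sum ...
                a[k + t + m // 2] = u - v   # ... and difference
                w *= w_m
        m *= 2
    return a

print(fft_iterative([1, 0, 0, 0]))  # impulse -> flat spectrum (all ones)
```

In each stage the two halves of a butterfly group must exchange data; mapping those partner pairs onto a swap network is what lets the paper cut the per-iteration communication in half.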
Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit exploited GPUs to accelerate the training process. This has been primarily achieved by aggressive improvements in parallel har...