Achieving high performance in cryptographic processing is important due to the increasing connectivity among today's computers. Despite steady improvements in microprocessor and system performance, private-key cipher implementations continue to be slow. Irrespective of the cipher used, the main reason for the low performance is lack of parallelism, which fundamentally comes from encryption modes such as the Cipher Block Chaining (CBC) mode. In CBC, each plaintext block is XOR'ed with the previous ciphertext block and then encrypted, essentially inducing a tight recurrence through the ciphertext blocks. To deliver high performance while maintaining a high level of security assurance in real systems, the cryptography community has proposed Interleaved Cipher Block Chaining (ICBC) mode. In four-way interleaved chaining, the first, fifth, and every fourth block thereafter are encrypted in CBC mode; the second, sixth, and every fourth block thereafter are encrypted as another stream, and so on. Thus, interleaved chaining loosens the recurrence imposed by CBC, enabling the multiple encryption streams to be overlapped. The number of interleaved chains can be chosen to balance performance against enough chaining for good data diffusion. While ICBC was originally proposed to improve hardware encryption rates by employing multiple encryption chips in parallel, this is the first paper to evaluate ICBC via multithreading commonly used ciphers on a symmetric multiprocessor (SMP). ICBC allows exploiting the full processing power of SMPs, which spend many cycles in cryptographic processing as medium-scale servers today, and will do so as chip-multiprocessor clients in the future. Using the Wisconsin Wind Tunnel II, we show that our multithreaded ciphers achieve encryption rates of 92 Mbytes/s on a 16-processor SMP at 1 GHz, reaching a factor of almost 10 improvement over a uniprocessor, which achieves 9 Mbytes/s.
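The interleaving described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: `toy_encrypt` and `toy_decrypt` are hypothetical, insecure stand-ins for a real block cipher, the 8-byte block size is arbitrary, and the use of one initialization vector per chain is an assumption the abstract does not state. The point is only that chain `j` touches blocks `j, j + ways, j + 2*ways, ...` and nothing else, so the chains could be handed to separate threads.

```python
from typing import List


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(block: bytes, key: bytes) -> bytes:
    # Hypothetical stand-in for a real block cipher. NOT secure.
    rotated = block[1:] + block[:1]  # rotate left by one byte
    return xor(rotated, key)


def toy_decrypt(block: bytes, key: bytes) -> bytes:
    rotated = xor(block, key)
    return rotated[-1:] + rotated[:-1]  # rotate right by one byte


def cbc_encrypt(blocks: List[bytes], key: bytes, iv: bytes) -> List[bytes]:
    """Plain CBC: each block depends on the previous ciphertext block."""
    out, prev = [], iv
    for plain in blocks:
        prev = toy_encrypt(xor(plain, prev), key)
        out.append(prev)
    return out


def cbc_decrypt(blocks: List[bytes], key: bytes, iv: bytes) -> List[bytes]:
    out, prev = [], iv
    for cipher in blocks:
        out.append(xor(toy_decrypt(cipher, key), prev))
        prev = cipher
    return out


def icbc_encrypt(blocks: List[bytes], key: bytes, ivs: List[bytes]) -> List[bytes]:
    """Interleaved CBC with len(ivs) independent chains."""
    ways = len(ivs)
    out = [b""] * len(blocks)
    for j in range(ways):
        # Iterations of this loop share no data, so each chain
        # could be encrypted by its own thread or processor.
        out[j::ways] = cbc_encrypt(blocks[j::ways], key, ivs[j])
    return out


def icbc_decrypt(blocks: List[bytes], key: bytes, ivs: List[bytes]) -> List[bytes]:
    ways = len(ivs)
    out = [b""] * len(blocks)
    for j in range(ways):
        out[j::ways] = cbc_decrypt(blocks[j::ways], key, ivs[j])
    return out
```

With a single IV the sketch degenerates to ordinary CBC, which is one way to see the trade-off the abstract mentions: more chains give more overlap but shorter dependency chains between neighbouring blocks.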
ISBN: 9780769520438 (print)
Single-event upsets from particle strikes have become a key challenge in microprocessor design. Techniques to deal with these transient faults exist, but come at a cost. Designers clearly require accurate estimates of processor error rates to make appropriate cost/reliability tradeoffs. This paper describes a method for generating these estimates. A key aspect of this analysis is that some single-bit faults (such as those occurring in the branch predictor) do not produce an error in a program's output. We define a structure's architectural vulnerability factor (AVF) as the probability that a fault in that particular structure will result in an error. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and the AVF. Unfortunately, computing AVFs of complex structures, such as the instruction queue, can be quite involved. We identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault will not affect correct execution. We instrument a detailed IA64 processor simulator to map bit-level microarchitectural state to these cases, generating per-structure AVF estimates. This analysis shows AVFs of 28% and 9% for the instruction queue and execution units, respectively, averaged across dynamic sections of the entire CPU2000 benchmark suite.
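The relation stated in this abstract (structure error rate = raw error rate × AVF) can be written out as a small Python sketch. The AVF values of 0.28 and 0.09 come from the abstract; the raw error rate of 100 is a made-up placeholder, since the abstract gives no raw rates, and the function name is hypothetical.

```python
def structure_error_rate(raw_error_rate: float, avf: float) -> float:
    """Effective error rate of a structure, in the same unit as raw_error_rate."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return raw_error_rate * avf


# Hypothetical raw rate (arbitrary unit), used for both structures
# purely for illustration.
RAW_RATE = 100.0

instruction_queue_rate = structure_error_rate(RAW_RATE, 0.28)
execution_units_rate = structure_error_rate(RAW_RATE, 0.09)
```

Under this assumed equal raw rate, the instruction queue would contribute roughly three times the error rate of the execution units, which is the kind of comparison the AVF estimates are meant to support.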
ISBN: 0769517722 (print)
The proceedings contain 26 papers. The topics discussed include: instruction set extension for long integer modulo arithmetic on RISC-based smart cards; architecture of oscillatory neural network for image segmentation; parallel boundary elements using Lapack and ScaLapack; efficient cyclic weighted reference counting; a parallel approximation hitting set algorithm for gene expression analysis; implementing declarative parallel bottom-avoiding choice; minimally-skewed-associative caches; a framework for exploiting adaptation in highly heterogeneous distributed processing; cluster-based static scheduling: theory and practice; the virtual cluster: a dynamic environment for exploitation of idle network resources; and design and evaluation of data access prediction strategies in SDSM systems.
ISBN: 0769516262 (print)
This work examines the facility of using a large distributed memory system for rasterization of computer graphics using the OpenGL and GLUT libraries. Issues examined include the performance increases achieved through parallel processing and the effects of different methods for dividing the framebuffer over multiple processors.
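The abstract does not say which framebuffer divisions were compared. As a hedged illustration only, the Python sketch below shows two common ways to assign scanlines to processors: contiguous horizontal strips and round-robin interleaved rows. Both function names and both schemes are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple


def contiguous_strips(height: int, workers: int) -> List[Tuple[int, int]]:
    """Split rows [0, height) into `workers` contiguous half-open ranges."""
    base, extra = divmod(height, workers)
    strips, start = [], 0
    for w in range(workers):
        rows = base + (1 if w < extra else 0)  # spread the remainder
        strips.append((start, start + rows))
        start += rows
    return strips


def interleaved_rows(height: int, workers: int) -> List[List[int]]:
    """Assign row r to worker r % workers (round-robin scanlines)."""
    return [list(range(w, height, workers)) for w in range(workers)]
```

Contiguous strips keep each processor's pixels together in memory, while interleaving tends to even out the load when scene detail is concentrated in one region of the image.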
Summary form only given. This paper focuses on theoretical and practical aspects of the high-performance multikey sorting problem on computer clusters, with particular emphasis on the Alpha Maci Cluster, a world-class...
ISBN: 0769516262 (print)
We propose a hybrid parallelism-independent scheduling method, predominantly performed at compile time, which generates machine code that can be executed efficiently on any number of workstations or PCs in a cluster computing environment. Our new scheduling algorithm, called the Dynamical Level Parallelism-Independent Scheduling algorithm (DLPIS), is applicable to distributed computer systems because, in addition to task scheduling, we perform message communication scheduling. It provides an explicit task synchronization mechanism that guides task allocation and data dependency resolution at run time with reduced overhead. Furthermore, we provide a mechanism that allows the machine code to adapt itself to the degree of parallelism of the system at run time. Therefore, our scheduling method supports a variable number of processors in the users' computing systems as well as the adaptive parallelism that may occur in distributed computing systems due to computer or link failure.
ISBN: 0769516262 (print)
Grid or mesh techniques are frequently used to approximate continuous entities that behave in a wave or fluid-like fashion. Partial Differential Equations (PDEs) are usually involved in the description of such entities or processes. Distributed parallel computation was used in various computer cluster configurations to calculate PDE solutions for an electrostatic field. The aim was to study the efficacy of the selected architecture using mesh techniques. The match between the algorithm and the architecture in achieving maximum computational performance was also investigated. The developed architectures, algorithms, and findings are presented in the paper.
ISBN: 0769516262 (print)
Adequate occupation of the computing resources can decisively influence the global performance of the system. Therefore, in order to achieve high performance, it is mandatory to know all the computing resources involved and their respective occupation levels at a given moment. With the objective of improving system performance, this paper presents the OpenTella model, which updates the information related to the occupation of resources and analyzes this occupation so that the migration of processes among computers of the same cluster can be completed. With the objective of increasing the scale of the system and decreasing the number of messages among the computers, this peer-to-peer protocol defines sub-nets, which are clusters that make up a more comprehensive cluster. Thus, groups are defined to interchange information and update the occupation of resources, in order to minimize communication and to compute a load balance that meets the system needs, resulting in the migration of processes.
ISBN: 0769516262 (print)
This paper shows the power of asynchronism for iterative algorithms in the context of global computing, that is to say, with machines scattered all around the world. The question is whether or not asynchronism helps to reduce the communication penalty and the overall computation time of a given parallel algorithm. The asynchronous programming model is applied to a given problem implemented with a multi-threaded environment and tested over two kinds of clusters of workstations: a homogeneous local cluster and a heterogeneous non-local one. The main features of this programming model are exhibited, and the high efficiency and interest of such algorithms are pointed out.