SHA-256 plays an important role in widely used applications such as data security, data integrity, digital signatures, and cryptocurrencies. However, most current optimized implementations of SHA-256 are based on CPUs or dedicated hardware, such as ASICs and FPGAs. Consequently, there is a need to explore whether a new heterogeneous parallel framework can improve the computational performance of the hash function. To address this issue, we conducted a study on the MT-3000 platform, a special-architecture processor for the next-generation exascale prototype supercomputer. We propose MT-SHA256, a heterogeneous multistage parallel implementation for hashing multiple messages on the MT-3000. Exploiting the architectural features of this processor, we developed an effective solution that significantly improves the computational performance of SHA-256. MT-SHA256 achieved a maximum throughput of 1045.68 MB/s on a single acceleration core of the MT-3000, which is 9.84x higher than the C implementation on one CPU core of the MT-3000. In a scalability test, MT-SHA256 achieved a throughput of 98.04 GB/s on one computing node and scaled well to 512 nodes (2048 acceleration clusters) on this system.
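The idea of hashing multiple independent messages in parallel and reporting throughput can be illustrated with a minimal sketch. This is not MT-SHA256 and does not target the MT-3000; it assumes an ordinary host, uses Python's standard `hashlib` with a thread pool, and the helper names `sha256_hex`, `hash_messages`, and `throughput_mb_per_s` are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def sha256_hex(message: bytes) -> str:
    """Hash one message with the standard-library SHA-256."""
    return hashlib.sha256(message).hexdigest()


def hash_messages(messages, workers=4):
    """Hash independent messages concurrently, preserving input order.

    Assumption: messages are independent, so each one can be hashed
    by a separate worker without any coordination.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sha256_hex, messages))


def throughput_mb_per_s(total_bytes: int, seconds: float) -> float:
    """Throughput in MB/s, taking 1 MB = 10**6 bytes (an assumption)."""
    return total_bytes / 1e6 / seconds
```

Because each message is hashed independently, the parallel result must match a serial loop over the same messages, which gives a simple correctness check before any throughput measurement.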
In recent years, a large body of research has focused on problems related to multi-core processors. With many cores available, the opportunities in High Performance Computing (HPC) have grown considerably. In fact, new fields related to HPC and processor architecture increase the future possibilities of a Grid-on-Chip (GoC). The goal of this paper is to present a high-throughput MCNoC (Multi-Cluster Network-on-Chip) as an alternative architecture to support clusters of cores and Grid features. In this new scenario, data throughput, flexibility, and scalability are very important. The results verify that MCNoC occupies a similar area and delivers better data throughput than a traditional Network-on-Chip.
For the next processor generation, many cores and parallel programming will provide high-throughput and high-performance processing. As a consequence, research has studied on-chip interconnection architectures to identify alternatives capable of decreasing communication latencies. The objective of this paper is to present the evaluation of three well-known architectures (bus, crossbar switch, and a conventional network-on-chip) in order to propose a multi-cluster network-on-chip architecture for parallel processing. The results show that a NoC composed of programmable routers and crossbar switches to interconnect clusters of cores performs better than conventional NoCs.
This paper explains a new procedure for the optimum design of optoelectronic devices, in which an automatic search is made for a structure that satisfies various demands simultaneously. The distinguishing feature of this procedure is the introduction of a cost, which makes quantitative evaluation of a structure possible and allows a global search for the required structure to be carried out by simulated annealing. First, the definition of cost and the details of the optimization procedure are clarified. In addition to convergence to the minimum point of the cost function (the optimum configuration from a theoretical viewpoint), the optimization can also converge to structures with great tolerance to fabrication errors (the neighborhood cost (NC) and finite temperature annealing (FTA) methods). Next, these three proposed methods are used in the design of a pnpn differential optical switch, and their effectiveness is verified. The method of cost expression, the relation between annealing parameters, and convergence are investigated. It is shown that a cost expression with a large degree of freedom improves the search for high-performance structures, and that the initial temperature of annealing, or the fixed temperature of the FTA method, is the important parameter that sets the probability of acceptance. Further, it is shown that the convergence cost is inversely proportional to the time spent in annealing. These results are useful guidelines for the optimum design of arbitrary optoelectronic devices.
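The role of temperature in setting the probability of acceptance can be sketched with a generic simulated annealing loop. This is a textbook Metropolis-style sketch, not the paper's NC or FTA method; the cost function, the geometric cooling schedule, and all parameter names (`t_initial`, `cooling`, `step_size`) are assumptions chosen for illustration.

```python
import math
import random


def acceptance_probability(delta: float, temperature: float) -> float:
    """Metropolis rule: always accept improvements; accept a cost
    increase of `delta` with probability exp(-delta / temperature)."""
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temperature)


def anneal(cost, start, t_initial=1.0, cooling=0.999,
           steps=5000, step_size=0.5, seed=0):
    """Minimize a one-dimensional cost by simulated annealing.

    Assumptions: uniform random perturbations and a geometric cooling
    schedule; the best state seen so far is tracked and returned.
    """
    rng = random.Random(seed)
    x, fx = start, cost(start)
    best_x, best_fx = x, fx
    t = t_initial
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        fc = cost(candidate)
        if rng.random() < acceptance_probability(fc - fx, t):
            x, fx = candidate, fc
            if fx < best_fx:
                best_x, best_fx = x, fx
        t *= cooling  # lower temperature makes uphill moves rarer
    return best_x, best_fx
```

Setting `cooling=1.0` keeps the temperature fixed throughout the run, which loosely mirrors the idea of annealing at a fixed temperature, though the FTA method's actual details are not given in the abstract.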
The simple serial synchronized (SSS) multistage interconnection network (MIN) is a processor-memory connection network with a high performance/cost ratio, in which packets are input and switched synchronously; it has a high pass-through ratio and is composed of simple elements. This paper evaluates the effect of hot spot contention and the effect of synchronous bit-serial (SBS) message combining in SSS-MIN, through theoretical analysis based on probability and through simulation. In contrast to a conventional MIN, complete tree saturation does not arise in SSS-MIN; instead, an area that is difficult to access is produced, depending on its position relative to the hot spot contention. From this viewpoint, an analysis method for the pass-through ratio is presented that considers the position of the switching element relative to the hot spot. The evaluation verifies that the proposed analysis method gives results close to those of simulation, so long as the access to the hot spot and the connection network architecture stay within a practical range. It is also seen that hot spot contention degrades the pass-through ratio less in SSS-MIN than in a conventional MIN, and that the effect can be almost completely eliminated by SBS message combining. When a multiprocessor system is actually constructed, the performance deterioration due to hot spot contention is greater than when only the pass-through ratio is considered. This can also be eliminated almost completely by SBS message combining.
An efficient emulation/simulation system for evaluating architectures and scheduling strategies for reduction systems is described. Execution traces of example programs are generated by the emulator. The execution met...
Architectural simulation of complex systems is usually constrained by available computational resources. Recently, several commercial parallel processing systems have appeared with price-performance levels that make very intense simulations affordable. In this paper, we briefly review architectural simulation technology, then describe the approach used to develop a parallel architectural simulator. Performance of the parallel simulator is then experimentally characterized and analyzed. This study is one of the earliest to report measured performance of a widely used commercial simulator running non-trivial designs on a popular parallel computing system.
Traditionally, the bulk of computer system functionality is implemented in the software medium, as a sequence of instructions for a general-purpose processor. Historically, this has provided the best balance of flexibility, cost, and performance. The new economics of VLSI and continuing advances in VLSI CAD capability open the possibility of application-specific functionality embedded in silicon as a matter of routine. This paper presents several case studies of silicon solutions used in typical software areas, including regular language recognition, Ada program unit replacement, dictionary machines, and string pattern matching. Either software or hardware designers may benefit from a study of such architectures, and Organick's notion of heterosystems designers proficient in both domains is supported.