Current shared-memory multi-core and multiprocessor systems are nondeterministic. When these systems execute a multithreaded application, they may produce a different output on each run, even when supplied with the same input. This frustrates debugging, limits the ability to properly test multithreaded code, and is becoming a major stumbling block to the much-needed widespread adoption of parallel programming. Support for deterministic replay of multithreaded execution greatly helps in finding concurrency bugs. A memory race recording scheme, named Rainbow, is proposed. Its core idea is to make inter-thread communication fully deterministic. The unique feature of Rainbow is that it precisely sets up happens-before relationships between conflicting memory operations in different threads. By using an effective Bloom-filter-based coherence history queue, Rainbow removes redundant happens-before relations that are already implied by the generated log, which keeps the log compact. Rainbow adds modest hardware to the base multi-core processor and leaves the coherence protocol unmodified. The analysis results show that Rainbow reduces the log size by 17% compared with a state-of-the-art scheme, that recording runs at a speed similar to release consistency (RC) execution, and that replay runs at about 93% of that speed. Determinism can thus be provided at little performance cost using our architecture proposals on state-of-the-art hardware, and the software-only approaches can be used on existing systems without problems.
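The idea of dropping happens-before edges that the log already implies can be illustrated in software. The sketch below is not Rainbow's hardware mechanism (it uses neither a Bloom filter nor a coherence history queue); it assumes a plain vector-clock recorder, and the names `RaceRecorder`, `access`, and `_order_after` are hypothetical.

```python
class RaceRecorder:
    """Logs only happens-before edges not already implied by earlier ones.

    Assumption: a software vector-clock model, used here purely to
    illustrate redundancy elimination in a race log.
    """

    def __init__(self, n_threads):
        # vc[t][u] = latest operation count of thread u that t is ordered after
        self.vc = [[0] * n_threads for _ in range(n_threads)]
        self.last_write = {}  # addr -> (writer tid, writer's clock snapshot)
        self.last_reads = {}  # addr -> {reader tid: reader's clock snapshot}
        self.log = []         # entries: ((src tid, src count), (dst tid, dst count))

    def _order_after(self, t, src_tid, src_vc):
        src_count = src_vc[src_tid]
        # Same thread, or an earlier logged edge already implies this order.
        if src_tid == t or self.vc[t][src_tid] >= src_count:
            return
        self.log.append(((src_tid, src_count), (t, self.vc[t][t])))
        # Inherit everything the source operation was ordered after.
        self.vc[t] = [max(a, b) for a, b in zip(self.vc[t], src_vc)]

    def access(self, t, addr, is_write):
        self.vc[t][t] += 1
        # Any access conflicts with the previous write to the same address.
        if addr in self.last_write:
            self._order_after(t, *self.last_write[addr])
        if is_write:
            # A write also conflicts with reads since the previous write.
            for r_tid, r_vc in self.last_reads.pop(addr, {}).items():
                self._order_after(t, r_tid, r_vc)
            self.last_write[addr] = (t, list(self.vc[t]))
        else:
            self.last_reads.setdefault(addr, {})[t] = list(self.vc[t])


rec = RaceRecorder(2)
rec.access(0, "x", True)   # thread 0 writes x
rec.access(0, "y", True)   # thread 0 writes y
rec.access(1, "y", False)  # thread 1 reads y: edge is logged
rec.access(1, "x", False)  # thread 1 reads x: implied by the edge above
print(rec.log)
```

In this run, only the dependence through `y` is logged; the later read of `x` is ordered after thread 0's write by transitivity, so no second entry is needed.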
With the continual scaling of semiconductor process technology, circuit timing is increasingly affected by process variations. It is thus important to categorize high-speed digital circuits into multiple bins of different performance. However, the speed-binning process typically requires a very long test application time. In this paper, we propose a unified architecture that accomplishes performance grading with high confidence and a short test application time. Moreover, the proposed architecture can be used for on-line circuit failure prediction and detection. Experimental results are presented to validate the proposed architecture.
NAND Flash memories have rapidly emerged as storage-class memory in products such as SSDs (Solid State Disks), CF (Compact Flash) cards, and SD (Secure Digital Memory) cards. Due to its distinct operating mechanisms, NAND Flash memory suffers from limited erase/program endurance, data retention, and program/read disturbance problems. Specifically, erase and program operations keep producing bad blocks during the lifetime of a memory chip. Bad blocks are blocks that contain faulty bits that the ECC (Error Correction Code) algorithm cannot correct. Although wear leveling tries to balance the erase/program operations across blocks so that all blocks wear out at a similar pace, new bad blocks still inevitably appear. We propose an in-field testing technique that takes some pages in a block as predictors. Because they wear out faster than the other pages, the predictors become bad before the other pages in the block do. The further questions are (1) how to detect the fast-wearing pages so as to use them as predictors, (2) how many predictors are needed to achieve satisfactory prediction accuracy, and (3) what negative impact a misprediction has on performance and endurance.
Due to the huge size of the pattern sets to be searched, multiple pattern searching remains a challenge for several newly arising applications such as network intrusion detection. In this paper, we present an attempt to design efficient multiple pattern searching algorithms on multi-core systems. We observe an important feature: the multiple pattern matching time depends mainly on the number and the minimal length of the patterns. The multi-core algorithm proposed in this paper leverages this feature to decompose the pattern set so that the parallel execution time is minimized. We formulate the problem as an optimal decomposition and scheduling of a pattern set, then propose a heuristic algorithm, which takes advantage of dynamic programming and greedy algorithmic techniques, to solve the optimization problem. Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.
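To make the decomposition idea concrete, here is a minimal sketch that partitions a pattern set across cores so that the slowest core finishes as early as possible. It is not the authors' heuristic: it uses dynamic programming only, and the cost model `count / min_len` is a hypothetical stand-in for the observation that matching time grows with the number of patterns and shrinks as the minimal pattern length grows. The names `cost` and `decompose` are illustrative.

```python
from functools import lru_cache


def cost(count, min_len):
    # Hypothetical cost model (an assumption, not taken from the paper):
    # more patterns cost more, a longer minimal length costs less.
    return count / min_len


def decompose(patterns, cores):
    """Split non-empty patterns into at most `cores` groups, minimizing the
    largest group cost. Groups are contiguous runs of the length-sorted list."""
    pats = sorted(patterns, key=len)
    n = len(pats)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Returns (smallest achievable max cost for pats[i:] with at most
        # k groups, end index of the first group in that solution).
        if i == n:
            return (0.0, n)
        if k == 1:
            return (cost(n - i, len(pats[i])), n)
        result = (float("inf"), n)
        for j in range(i + 1, n + 1):
            # pats[i] is the shortest pattern of the group pats[i:j].
            c = max(cost(j - i, len(pats[i])), best(j, k - 1)[0])
            if c < result[0]:
                result = (c, j)
        return result

    groups, i, k = [], 0, cores
    while i < n:
        _, j = best(i, k)
        groups.append(pats[i:j])
        i, k = j, k - 1
    return groups


patterns = ["ab", "cd", "ef", "abcdefgh", "ijklmnop"]
groups = decompose(patterns, cores=2)
worst = max(cost(len(g), min(len(p) for p in g)) for g in groups)
print(groups, worst)
```

Each resulting group would then be compiled into its own matcher and assigned to one core; under the assumed cost model, the single-group cost for this example is 2.5, while the two-group split brings the worst core down to 1.5.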
In this paper, a novel concept of multilayer synthesis and a general framework for texture synthesis are presented. Within this framework, we first decompose the texture into a presumed pattern layer and material layer in the frequency domain using an E-texton extraction algorithm, then manipulate and extend each layer according to its own characteristics, and finally merge the newly synthesized pattern layer and material layer to generate the final output. Experimental results show that our method not only greatly improves the synthesis quality in cases that single-layer synthesis cannot handle well, but also makes it possible to achieve various special synthesis effects.
Chosen-ciphertext security has been well-accepted as a standard security notion for public key encryption. But in a multi-user surrounding, it may not be sufficient, since the adversary may corrupt some users to get t...
ISBN: 9783981080186 (print)
Concurrent trace is an emerging challenge when debugging multicore systems. In concurrent trace, the trace buffer becomes a bottleneck because all trace sources try to access it simultaneously. In addition, the on-chip interconnection fabric needed for the distributed trace signals incurs an extremely high hardware cost. In this paper, we propose a clustering-based scheme that implements concurrent trace for debugging Network-on-Chip (NoC) based multicore systems. In the proposed scheme, a unified communication framework eliminates the need for an interconnection fabric that is used only during debugging. With the clustering scheme, multiple concurrent trace sources can access distributed trace buffers via the NoC under a bandwidth constraint. We evaluate the proposed scheme using Booksim, and the results show its effectiveness.
Because the structure and function of a high-rise building are complex, the density of occupants is high, and rescue from outside is very difficult, safe and timely evacuation is an important issue under high-ri...
Group communication is essential for multi-user applications. However, due to unpredictable node departures and non-deterministic network partitions, providing reliable and scalable group communication services is challenging when the applications are used on a large scale by users with heterogeneous capacities. To address this challenge, we propose a novel replication scheme that achieves high reliability and low-cost scalability in group communication, with the following three features. First, it introduces a new concept of replication based on topological similarity, which gives each node the ability to measure the similarity between nodes in the topology. By eliminating topological similarity between the replicas, it mitigates service interruptions caused by node failures and network partitions. Second, instead of specifying the number of replicas, it provides a technique for nodes to dynamically adapt the replica placement scheme by exploiting the functional importance of the nodes in the group communication session. This eliminates the bottleneck problem and improves network resource utilization. Third, the scheme is self-converging and can stabilize within a few adaptations even under a high churn rate. Extensive simulations show that it significantly reduces replication overhead and service interruption compared with existing approaches.
Moore's law will continue to grant computer architects ever more transistors for the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. In this paper, the achievements of our research project on parallel architecture, which is supported by the National Basic Research 973 Program of China, are systematically presented. The innovative approaches and techniques for solving the significant problems in parallel architecture design are summarized, including architecture-level optimization, compiler and language-supported technologies, reliability, power-performance efficient design, test and verification challenges, and platform building. Two prototype chips, a multi-heavy-core Godson-3 and a many-light-core Godson-T, are described to demonstrate the highly scalable and reconfigurable parallel architecture designs. We also present some of our achievements that have appeared in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, etc.