As large-scale computer systems grow, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To run to completion, these applications must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all such protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of the failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. To ease FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool that automates it. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of the NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that FTPA performs better than the traditional checkpointing approach.
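The recovery idea in this abstract (survivors share out the lost workload instead of one processor redoing it) can be illustrated with a small Python sketch. This is not the FTPA or GiFT implementation: threads stand in for processes, the failure is simulated rather than detected, and the names `work`, `split`, and `run_with_parallel_recovery` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def work(chunk):
    """Stand-in for one process's workload (assumption): square each element."""
    return [x * x for x in chunk]


def split(items, parts):
    """Split items into `parts` contiguous, nearly equal chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_with_parallel_recovery(data, n_procs, failed):
    """Compute work() over data while simulating the loss of the ranks in `failed`.

    Each lost chunk is re-split across all surviving workers and recomputed
    in parallel, rather than being redone by a single worker.
    """
    chunks = split(data, n_procs)
    results = [None] * n_procs
    survivors = [r for r in range(n_procs) if r not in failed]
    if not survivors:
        raise RuntimeError("no surviving workers to recover with")

    with ThreadPoolExecutor(max_workers=len(survivors)) as pool:
        # Normal phase: each survivor finishes its own chunk.
        own = pool.map(work, [chunks[r] for r in survivors])
        for rank, out in zip(survivors, own):
            results[rank] = out

        # Recovery phase: every lost chunk is divided among all survivors.
        for rank in sorted(failed):
            pieces = split(chunks[rank], len(survivors))
            results[rank] = [y for part in pool.map(work, pieces) for y in part]

    # Reassemble the output in the original rank order.
    return [y for part in results for y in part]
```

For example, `run_with_parallel_recovery(list(range(10)), 4, {2})` yields the same output as a failure-free run, with rank 2's chunk recomputed by the three surviving workers.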
In recent years, parallel digital terrain analysis has become a research hot spot, and using parallel computing to solve data-intensive problems has become a new trend in digital terrain analysis (DTA). At the same time, with the development of hardware technology and new applications, ensuring the reliability of parallel computing results is one of the key problems. Properly adopted fault-tolerance technology improves a system's ability to deliver correct service. This paper presents a parallel error-detecting approach based on parallel recomputing, implemented by combining redundant processes with multithreading. By comparing, in parallel, the results of a process with those of its copy process on the same data block, the approach improves the efficiency of fault-tolerant parallel recomputing. Furthermore, considering the relationship between error detection on the results and recomputing, a modified scheme is proposed that executes the two concurrently. An error-detecting analysis of the slope algorithm from DTA demonstrates that the approach based on fault-tolerant recomputing is effective and incurs only minor overhead.
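The duplicate-and-compare scheme described in this abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's implementation: threads stand in for the redundant processes, `slope` is a simplified one-dimensional stand-in for the DTA slope algorithm, and `detect_and_recompute` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def slope(block, cell=1.0):
    """Stand-in kernel (assumption): forward-difference slope along a
    1-D elevation profile with cell size `cell`."""
    return [(b - a) / cell for a, b in zip(block, block[1:])]


def detect_and_recompute(blocks, primary=slope, replica=slope, workers=4):
    """Run a primary and a replica kernel on every data block and compare.

    Blocks are checked concurrently. When the two results for a block
    differ, an error is flagged and the block is recomputed.
    Returns (results, indices_of_flagged_blocks).
    """
    def check(block):
        first, second = primary(block), replica(block)
        if first == second:
            return first, False
        # Mismatch detected: recompute the block with the replica kernel.
        return replica(block), True

    with ThreadPoolExecutor(max_workers=workers) as pool:
        checked = list(pool.map(check, blocks))

    results = [res for res, _ in checked]
    flagged = [i for i, (_, bad) in enumerate(checked) if bad]
    return results, flagged
```

Passing a deliberately corrupted function as `primary` is one way to exercise the detection path; with two identical fault-free kernels no block is flagged.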
This paper addresses the issue of fault recovery in transactional memory and proposes a fault recovery method based on parallel recomputing in transactional memory ***. The method uses the data-versioning mechanism of the transactional memory system to avoid the extra cost of state saving, rolls back a single transaction to avoid wasting the computing time of fault-free transactions, and adopts parallel recomputing to reduce the cost of fault recovery. This paper applies the method to Open TM programs and proposes an implementation of parallel recomputing in Open TM. Finally, this paper tests the performance of the method through a test ***. The experimental results show that, compared with fault recovery by rolling back a single transaction, the parallel recomputing method in a transactional memory system performs fault recovery quickly and accurately and scales well.