This paper introduces a new fault-tolerance method for parallel programs based on parallel failure recovery. When a process fails, the surviving processes compute the failed process's task in parallel, which reduces the overhead of fault tolerance. The paper presents the design and implementation of a parallel FFT using the new approach and investigates the optimal number of processes to participate in parallel failure recovery. Finally, experiments show that parallel failure recovery outperforms checkpointing and that our solution for the best number of participating processes is effective.
Peer-to-peer media streaming has been an important service on the Internet in recent years. Most working systems adopt the data-driven (or mesh-based) structure, in which data scheduling is one of the important ***, those frequently used scheduling algorithms often face the following case: a neighbor peer uses up its bandwidth delivering packets that other neighbors can also supply, while some packets held only by it are not *** packets cannot be delivered in the current scheduling cycle, even though the other neighbors have surplus bandwidth. This wastes bandwidth and decreases transmission throughput. In this paper we propose a new scheduling algorithm aiming at optimal throughput: the Bipartite-matching based Block Scheduling algorithm (BBS). We convert the original data scheduling problem into the problem of finding a maximum matching on the corresponding bipartite graph, then assign data packets to neighbors according to the maximum matching. We evaluate the performance of BBS with extensive experiments, and the results show that BBS improves throughput and provides better streaming quality than those frequently used scheduling algorithms.
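The conversion to bipartite matching can be illustrated with a minimal Python sketch. It is not the BBS implementation from the paper: the function name `max_matching_schedule`, the `holders` and `capacity` inputs, and the modelling of bandwidth as a per-cycle packet count are all assumptions made for illustration. Each neighbor is expanded into unit "slots" and a standard augmenting-path search finds a maximum matching between packets and slots.

```python
def max_matching_schedule(holders, capacity):
    """Assign packets to neighbors via maximum bipartite matching.

    holders:  dict mapping packet -> list of neighbors that hold it (hypothetical input)
    capacity: dict mapping neighbor -> packets it can send this cycle (hypothetical input)
    Returns a dict mapping each scheduled packet to the neighbor that will send it.
    """
    # Expand each neighbor into unit-capacity slots so the problem
    # becomes plain bipartite matching between packets and slots.
    slots_of = {n: [(n, i) for i in range(c)] for n, c in capacity.items()}
    slot_owner = {}  # slot -> packet currently assigned to it

    def try_assign(packet, visited):
        # Augmenting-path search: take a free slot, or evict the current
        # owner of a slot if that owner can be moved somewhere else.
        for neighbor in holders[packet]:
            for slot in slots_of.get(neighbor, []):
                if slot in visited:
                    continue
                visited.add(slot)
                if slot not in slot_owner or try_assign(slot_owner[slot], visited):
                    slot_owner[slot] = packet
                    return True
        return False

    for packet in holders:
        try_assign(packet, set())
    return {packet: slot[0] for slot, packet in slot_owner.items()}


# p1 is held by both neighbors; p2 and p3 are held only by A.
holders = {"p1": ["A", "B"], "p2": ["A"], "p3": ["A"]}
capacity = {"A": 2, "B": 1}
schedule = max_matching_schedule(holders, capacity)
print(schedule)
```

In this toy case, a scheduler that lets A spend a slot on p1 can deliver only two packets, whereas the matching moves p1 to B so that all three packets are delivered in the cycle.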
Databases hold a great deal of important and sensitive data that must be protected from attacks. Cryptographic support is an effective mechanism for securing the data. However, a tradeoff must be made between the p...
This paper presents our experience with exploiting fine-grained pipeline parallelism for wavefront computations on a multicore platform. Wavefront computations are widely applied in areas such as scientific computing and dynamic programming. To exploit fine-grained parallelism on multicore platforms, programmers must consider synchronization, scheduling strategies and data locality. This paper shows the impact of fine-grained synchronization methods, scheduling strategies and data tile sizes on performance. We propose a low-cost, lock-free, lightweight synchronization method that can fully exploit pipeline parallelism. Our evaluation shows that RNAfold, an application for RNA secondary structure prediction, achieves a best speedup of 3.88 on four cores under our framework.
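To make the wavefront dependency pattern concrete, here is a minimal sequential Python sketch on a different, simpler dynamic programming problem (longest common subsequence), chosen as an assumption for illustration; it is not RNAfold and does not implement the paper's synchronization method. Each cell depends only on its upper, left and upper-left neighbors, so all cells on one anti-diagonal are independent and could be computed in parallel.

```python
def lcs_wavefront(a, b):
    """Length of the longest common subsequence, computed in wavefront order."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    # Sweep anti-diagonals i + j = d. Cells on the same diagonal read only
    # from diagonals d - 1 and d - 2, so the inner loop has no internal
    # dependencies and is the part a parallel version would distribute.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


print(lcs_wavefront("ABCBDAB", "BDCABA"))  # prints 4
```

A tiled variant would apply the same diagonal order to blocks of cells rather than single cells, which is where tile size and data locality enter the picture.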
Previous works have projected that the peak performance of FPGAs can exceed that of general-purpose processors. However, no work has actually compared the performance of FPGAs and CPUs using standard benchmarks such as the LINPACK benchmark. We propose and implement an FPGA-based hardware design of the LINPACK benchmark, the key step of which is LU decomposition with pivoting. We introduce a fine-grained pipelined LU decomposition algorithm that enables optimum performance by exploiting fine-grained pipeline parallelism. A scalable linear array of processing elements (PEs), which is the core component of our hardware design, is proposed to implement this algorithm. To the best of our knowledge, this is the first reported FPGA-based pipelined implementation of LU decomposition with pivoting. A total of 19 PEs can be integrated into an Altera Stratix II EP2S130F1020C5 on our self-designed development board. Experimental results show that a speedup of up to 6.14 can be achieved relative to a Pentium 4 processor for the LINPACK benchmark.
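For readers unfamiliar with the key step, the following is a plain software sketch of LU decomposition with partial (row) pivoting, followed by the triangular solves that LINPACK-style solvers use. It is a textbook sequential version in Python, not the pipelined hardware algorithm described above; the function names `lu_decompose` and `lu_solve` are hypothetical.

```python
def lu_decompose(a):
    """Return (lu, perm): L and U packed in one matrix, plus the row permutation."""
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    perm = list(range(n))       # perm[i] = original index of the row now at position i
    for k in range(n):
        # Partial pivoting: pick the row with the largest magnitude in column k.
        pivot = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[pivot][k] == 0:
            raise ValueError("matrix is singular")
        if pivot != k:
            lu[k], lu[pivot] = lu[pivot], lu[k]
            perm[k], perm[pivot] = perm[pivot], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b using the factors produced by lu_decompose."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):  # forward substitution with unit-diagonal L
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution with U
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


a = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
b = [7, 19, 49]  # equals a multiplied by [1, 2, 3]
lu, perm = lu_decompose(a)
x = lu_solve(lu, perm, b)
print(x)
```

The innermost update loop over `j` is the regular, data-parallel part of the computation; the pivot search in each step is what makes a pipelined implementation harder than LU without pivoting.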
With the advancement of peer-to-peer technology, media streaming applications have become increasingly popular on the Internet. However, traditional development methods for such applications require developers not only to consider the application logic but also to manage the dynamics of Internet resources, which increases the difficulty of development and limits the deployment of personal video distribution applications. In this paper, we design and implement a peer-to-peer streaming system in a much easier way that lets us concentrate on the application itself without distraction from the dynamics of Internet resources. This simplification is owed to the Internet-based Virtual Computing Environment (iVCE), which provides programming abstractions and runtime utilities that encapsulate the complexity of managing transient resources in the platform, thus facilitating the construction of Internet applications. To build our streaming application on the iVCE, we only need to define the interaction protocols among distributed nodes in the Owlet programming language. We also implement a JavaBean, usable from the Owlet program, to assist in transferring and rendering the content. Our implementation shows that peer-to-peer applications such as media streaming can be elegantly built using the iVCE platform, and it can serve as a reference implementation for developing similar applications.
Optimizing the design of the on-chip memory system is of great significance for improving the performance of multi-core processors. This paper describes the organization of the on-chip memory hierarchy in the UltraSPARC T2 and analyzes in detail how memory access requests from the multi-core processor are handled. To obtain basic timing and area parameters, the primary elements of the on-chip memory system are simulated with ModelSim and synthesized with Synopsys Design Compiler using a 90 nm standard cell library. Finally, targeting the potential factors that limit the performance of the memory hierarchy, corresponding optimization techniques for the critical elements of the memory system are explored.
Matrix inversion arises in a number of problems of practical importance. This paper introduces and evaluates a block algorithm for high-performance matrix inversion on the Cell Broadband Engine (Cell/B.E.) processor. The Cell/B.E. is a heterogeneous single-chip multi-core processor jointly developed by Sony, Toshiba and IBM, with very fast single-precision floating-point arithmetic. The matrix inversion algorithm discussed combines the block algebraic path problem algorithm with the well-known block matrix inversion algorithm based on LU decomposition. For relatively large matrices, this combined block algorithm spends most of its time computing matrix-matrix multiplications of blocks and achieves 149.4 Gflop/s on the Cell/B.E. when the PPE and six SPEs of a PlayStation3 are used, or 93.4% of the aggregate double (PPE) and single (SPEs) precision peak performance of 160.0 Gflop/s.
Efficient mapping of logical processes to physical processes is one of the key technologies for accelerating parallel performance simulation. Aiming to minimize communication between SMP nodes and between host physical processes, this paper presents a novel method named TPsmp-LP 3 M. It automatically extracts the communication pattern of logical processes from traces and then generates a two-phase mapping from logical processes to SMP clusters. Experimental results show that TPsmp-LP 3 M can outperform the regular mapping method by up to 20.2%.
Proximity ranking according to end-to-end network distances (e.g., Round-Trip Time, RTT) can reveal detailed proximity information, which is important for network management and performance diagnosis in distributed systems. However, to the best of our knowledge, there has been no similar work on this subject in the P2P computing field. We present iRank, a distributed rating method that enables proximity ranking by providing discrete ratings in a distributed manner. It formulates proximity ranking as a rating problem that faithfully captures proximity from noisy distance measurements in a scalable and practical way. The primary challenge in inferring proximity rankings is enforcing distributed ratings with complex rating policies. Our solution reconstructs ratings by decomposing a centralized rating method, Maximum Margin Matrix Factorization (MMMF), into independent sub-problems that can be solved efficiently in a decentralized manner. By relaxing the dependence on infrastructure nodes, which are single points of failure and limit scalability, iRank can gracefully handle network churn. Using real network latency data sets, we demonstrate that iRank can predict ratings with low distortion, less than 20 percent worse than the centralized method, in the context of synthetic complex rating policies.