Multicast communication is a significant operation in multicomputer systems and can be used to support several other collective communication operations. Hardware implementation is a viable way to develop a low-latency multicast algorithm. In this paper, we present a new multicast routing algorithm for two-dimensional meshes. The algorithm uses wormhole routing and can send messages to any number of destinations within two start-up communication phases; hence the name, two-phase multicast (TPM) algorithm. The algorithm uses the base routing scheme used for unicast communication, thus minimizing the additional hardware support. The algorithm allows some intermediate nodes that are not in the destination set to perform multicast functions. This feature allows flexibility in multicast path selection and therefore improves performance. A simulation study has been conducted to investigate the performance of the proposed TPM algorithm. The simulation results show that the proposed TPM algorithm performs significantly better than previously proposed multicast algorithms under both contention-free and mixed-traffic conditions.
ISBN: 9781424416936 (print)
About ten years after the Java Grande effort, this paper provides a snapshot comparing the performance of Fortran MPI with that of Java for parallel computing, using the ProActive library. We first analyze some performance details of ProActive's behavior, and then compare its overall performance with that of the MPI library. This comparison is based on the five kernels of the NAS Parallel Benchmarks. From these experiments we identify benchmarks where parallel Java performs as well as Fortran MPI and others where it falls short, together with clues for improvement.
ISBN: 9781467325073; 9781467325066 (print)
In phonotactic spoken language recognition systems, acoustic model adaptation prior to phone lattice decoding has been adopted to deal with the mismatch between training and test conditions. Moreover, combining diversified phonotactic features is commonly used. These observations motivate an in-depth investigation of combining diversified phonotactic features from diversely adapted acoustic models. Our experiments show that our approach achieves an equal error rate (EER) of 1.94% in the 30-second closed-set trials of the 2007 NIST Language Recognition Evaluation (LRE). This represents a 14.9% relative improvement in EER over a sophisticated system that uses parallel phone recognizers, speaker adaptive training (SAT) in acoustic models, and CMLLR adaptation. Moreover, our approach provides consistent and substantial improvements in three different phonotactic systems, each of which uses a single phone recognizer.
This paper describes a technique that allows an MPI code to be encapsulated into a component. Our technique is based on an extension to the Common Object Request Broker Architecture (CORBA) from the OMG (Object Management Group). The proposed extensions do not modify the CORBA core infrastructure (the Object Request Broker), so they can fully coexist with existing CORBA applications. An MPI code is seen as a new kind of CORBA object that hides most of the cumbersome problems of dealing with parallelism. Such a technique can be used to connect MPI codes to existing CORBA software infrastructures, which are now being developed in the framework of several research and development projects such as JACO3, JULIUS, or TENT from DLR. To illustrate the concept of a parallel CORBA object, we present a virtual reality application that couples a light simulation application (radiosity) with a visualization tool using VRML and Java.
ISBN: 076950728X (print)
In this work we propose a fault-tolerant mechanism for parallel programs based on task replication. We use a sequential discrete-event simulator of a distributed system subject to failures to compare a semi-active and a passive variant of the protocol. In our model, each time a task of a given parallel program is allocated, a copy of it is stored on a second processor, called the buddy processor. If the original processor fails, the copies of the tasks at the buddy processor are processed, providing fault tolerance. Some performance measures, such as program execution times and processor utilization factors, are given for the different versions of the mechanism. The performance has been studied as a function of processor degradation and of program and system sizes.
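The buddy-processor idea described above can be illustrated with a minimal sketch. It is not the authors' simulator or protocol: the ring-style choice of buddy, the `Processor` class, and the function names are assumptions made only for illustration, and the sketch ignores the semi-active versus passive distinction.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    pid: int
    alive: bool = True
    tasks: list = field(default_factory=list)    # tasks this processor runs
    backups: list = field(default_factory=list)  # (owner_id, task) copies held as a buddy


def buddy_of(owner_id, n):
    # Assumption: the buddy is simply the next processor in a ring.
    return (owner_id + 1) % n


def allocate(task, processors, owner_id):
    """Place a task on its owner and store a copy on the buddy processor."""
    buddy_id = buddy_of(owner_id, len(processors))
    processors[owner_id].tasks.append(task)
    processors[buddy_id].backups.append((owner_id, task))
    return buddy_id


def fail(processors, failed_id):
    """Mark a processor as failed; its buddy takes over the copied tasks."""
    failed = processors[failed_id]
    failed.alive = False
    failed.tasks.clear()
    buddy = processors[buddy_of(failed_id, len(processors))]
    recovered = [t for owner, t in buddy.backups if owner == failed_id]
    buddy.backups = [(o, t) for o, t in buddy.backups if o != failed_id]
    buddy.tasks.extend(recovered)
    return recovered


procs = [Processor(i) for i in range(4)]
allocate("t0", procs, 0)
allocate("t1", procs, 0)
allocate("t2", procs, 3)
recovered = fail(procs, 0)  # processor 1 now runs the copies of t0 and t1
```

The sketch handles a single failure only; copies held by the failed processor itself (here the copy of `t2`) are lost, which is why a real scheme would need to re-replicate after recovery.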
We present a novel declustering scheme (ACATS) for reliability stripes in an orthogonal disk array. Our scheme is deterministic and run-time efficient, and it frequently provides the best possible, and always a nearly best possible, distribution of failure-induced incremental rebuild workloads. Our scheme provides protection against single disk as well as single string failures within the disk array. Our approach presents a framework in which the Level 5 RAID organization logically appears as a Level 4 RAID; it facilitates the provision of distributed sparing in exactly the same manner. ACATS does not require the existence of a suitably configured block design or of a run-time efficient pseudo-random number generator; it is applicable to arbitrarily configured orthogonal disk arrays. Our scheme is faster than declustering schemes that use pseudo-random permutations and achieves better uniformity of disk loads during a disk rebuild; it is simpler than schemes based on block designs. ACATS provides a rich spectrum of declustering schemes.
ISBN: 9781665440660 (print)
Empirical performance modeling is a proven instrument to analyze the scaling behavior of HPC applications. Using a set of smaller-scale experiments, it can provide important insights into application behavior at larger scales. Extra-P is an empirical modeling tool that applies linear regression to automatically generate human-readable performance models. As with other regression-based modeling techniques, the accuracy of the models created by Extra-P decreases as the amount of noise in the underlying data increases. This is why the performance variability observed in many contemporary systems can become a serious challenge. In this paper, we introduce a novel adaptive modeling approach that makes Extra-P more noise resilient, exploiting the ability of deep neural networks to discover the effects of numerical parameters, such as the number of processes or the problem size, on performance when dealing with noisy measurements. Using synthetic analysis and data from three different case studies, we demonstrate that our solution improves the model accuracy at high noise levels by up to 25% while increasing the models' predictive power by about 15%.
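The regression-based idea can be sketched as follows: fit a few candidate scaling terms to small-scale measurements by least squares, keep the best-fitting one, and extrapolate. This is only an illustration and not Extra-P's implementation or API; the candidate set, the function names, and the noise-free synthetic measurements are assumptions, and the neural-network part of the paper is not covered.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x.

    Returns (a, b, sum of squared residuals).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, ssr


# Hypothetical candidate terms f(p) for a model of the form t(p) = a + b * f(p).
CANDIDATES = {
    "log2(p)": lambda p: math.log2(p),
    "p": lambda p: float(p),
    "p*log2(p)": lambda p: p * math.log2(p),
    "p^2": lambda p: float(p) ** 2,
}


def best_model(procs, times):
    """Fit every candidate term and return the one with the smallest residual."""
    fits = {}
    for name, term in CANDIDATES.items():
        xs = [term(p) for p in procs]
        fits[name] = fit_line(xs, times)
    name = min(fits, key=lambda k: fits[k][2])
    a, b, _ = fits[name]
    return name, a, b


# Synthetic, noise-free "measurements" at small process counts.
procs = [2, 4, 8, 16, 32]
times = [5.0 + 0.5 * p * math.log2(p) for p in procs]

name, a, b = best_model(procs, times)
predicted = a + b * CANDIDATES[name](1024)  # extrapolate to a larger scale
```

With noisy measurements the residuals of different candidates move closer together, so the wrong term can win the comparison; that is the sensitivity to noise the abstract refers to.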
Physically distributed memory multiprocessors are becoming popular, and data distribution and loop parallelization are aspects that a parallelizing compiler has to consider in order to use the system efficiently. The costs of accessing local and remote data can differ by one or several orders of magnitude, and this can dramatically affect the performance of the system. It would be desirable to free the programmer from having to consider the low-level details of the target architecture, program explicit processes, or specify interprocess communication. In this paper, we present an approach to automatically derive static or dynamic data distribution strategies for the arrays used in a program. All the information required about data movement and parallelism is contained in a single data structure, called the Communication-Parallelism Graph (CPG). The problem is modeled and solved using a general-purpose linear 0-1 integer programming solver. This allows us to find the optimal solution to the problem for one-dimensional array distributions. We also show the feasibility of this approach in terms of compilation time and quality of the solutions generated.
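To give a feel for a 0-1 formulation of distribution selection, here is a toy sketch. It is not the paper's CPG model: the arrays, the two candidate distributions, and all cost values are hypothetical, and the model is solved by plain enumeration rather than by an integer programming solver. The objective multiplies two binary variables for brevity; a strictly linear formulation would replace each product with an auxiliary 0-1 variable.

```python
from itertools import product

ARRAYS = ["A", "B", "C"]
DISTS = ["row", "col"]  # candidate one-dimensional distributions

# Hypothetical access cost of each array under each distribution.
LOCAL_COST = {
    ("A", "row"): 1, ("A", "col"): 4,
    ("B", "row"): 3, ("B", "col"): 1,
    ("C", "row"): 2, ("C", "col"): 2,
}

# Hypothetical communication cost paid when two arrays that are used
# together end up with different distributions.
MISMATCH_COST = {("A", "B"): 5, ("B", "C"): 1}


def cost(x):
    """Total cost of a 0-1 assignment x[(array, dist)]."""
    total = sum(LOCAL_COST[k] * x[k] for k in x)
    for (a, b), c in MISMATCH_COST.items():
        # same == 1 exactly when a and b share a distribution.
        same = sum(x[(a, d)] * x[(b, d)] for d in DISTS)
        total += c * (1 - same)
    return total


def solve():
    """Enumerate all 0-1 vectors, keep feasible ones, return the cheapest."""
    keys = [(a, d) for a in ARRAYS for d in DISTS]
    best = None
    for bits in product((0, 1), repeat=len(keys)):
        x = dict(zip(keys, bits))
        # Constraint: each array gets exactly one distribution.
        if any(sum(x[(a, d)] for d in DISTS) != 1 for a in ARRAYS):
            continue
        c = cost(x)
        if best is None or c < best[0]:
            best = (c, {a: d for (a, d), v in x.items() if v})
    return best


best_cost, assignment = solve()
```

Enumeration is exponential in the number of variables, so it only serves to show the shape of the model; a real instance would be handed to a solver, as the abstract describes.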
ISBN: 0769511538 (print)
With the increasing growth in mobile computing devices and wireless networks, users are able to access information from anywhere and at any time. In such situations, the issues of location management for mobile hosts are becoming increasingly significant. Different location management schemes, such as Columbia University's mobile IP scheme and IETF mobile IP, have been proposed. In this paper, we propose a new distributed location management scheme and discuss the advantages of the proposed scheme over the others. The paper then considers the issues of multicasting in the proposed architecture.
Two- and three-dimensional k-tori are among the most used topologies in the designs of new parallel computers. Traditionally (with the exception of the Tera parallel computer), these networks have been used as fully populated networks, in the sense that every routing node in the topology is subjected to message injection. However, fully populated tori and meshes exhibit a theoretical throughput that degrades as the network size increases. In contrast, multistage networks (which are partially populated) scale well with the network size. Introducing slackness in fully populated tori, i.e., reducing the number of processors, and studying optimal routing strategies for the resulting interconnections are the central subjects of this paper. The key concept that we study is the placement of the processors in a network together with a routing algorithm between them, where a placement is the subset of the nodes in the interconnection network that are attached to processors. Our main contribution is the construction of optimal placements for d-dimensional k-tori networks, of sizes k and k^2, and the corresponding routing algorithms for the cases d = 2 and d = 3, respectively.