ISBN: 0769515258 (print)
With speculative thread-level parallelization, codes that cannot be fully compiler-analyzed are aggressively executed in parallel. If the hardware detects a cross-thread dependence violation, it squashes offending threads and resumes execution. Unfortunately, frequent squashing cripples performance. This paper proposes a new framework of hardware mechanisms to eliminate most squashes due to data dependences in multiprocessors. The framework works by learning and predicting violations, and applying delayed disambiguation, value prediction, and stall and release. The framework is suited for directory-based multiprocessors that track memory accesses at the system level with the coarse granularity of memory lines. Simulations of a 16-processor machine show that the framework is very effective. By adding our framework to a speculative CC-NUMA with 64-byte memory lines, we speed up applications by an average of 4.3 times. Moreover, the resulting system is even 23% faster than a machine that tracks memory accesses at the fine granularity of words - a sophisticated system that is not compatible with mainstream cache coherence protocols.
ISBN: 0769515258 (print)
Speculative multithreading has recently been proposed to boost performance by exploiting thread-level parallelism in applications that are difficult to parallelize. The performance of these processors depends heavily on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor, and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and an 8-cycle thread initialization penalty are considered, the performance difference between them is maintained. The speedup over single-thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.
ISBN: 0769515258 (print)
This paper proposes to transform the branch outcome history from the time domain to the frequency domain. With our proposed Fourier Analysis Branch (FAB) predictor, we can represent long periodic branch history patterns - as long as 2^13 bits - with a realistic number of bits (52 bits). We evaluate the potential gains of the FAB predictor by considering a hybrid branch predictor in which each branch is predicted using a static scheme, the 2-bit dynamic scheme, the PAp and GAp schemes, and our FAB predictor. By including our FAB predictor in the hybrid predictor, it is possible to cut the misprediction rate of integer applications in the SPEC95 suite by between 5% and 50%, with an average of 20%. Besides evaluating its performance, this paper shows some key properties of our FAB predictor and presents some possible implementation approaches.
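As a rough illustration of the time-to-frequency idea, the sketch below takes the discrete Fourier transform of a periodic taken/not-taken history, keeps only the largest-magnitude coefficients, and reconstructs later outcomes from them. It is not the FAB predictor's actual design and does not reproduce its 52-bit encoding; the function names (`dft`, `compress`, `predict`) and the example loop-branch pattern are assumptions made for illustration.

```python
import cmath

def dft(history):
    """Discrete Fourier transform of a 0/1 branch-outcome sequence."""
    n = len(history)
    return [
        sum(history[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def compress(history, keep):
    """Keep only the `keep` largest-magnitude frequency coefficients."""
    coeffs = dft(history)
    top = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    return len(history), {k: coeffs[k] for k in top}

def predict(compressed, t):
    """Evaluate the truncated inverse DFT at time t and threshold it at 0.5.

    Because the retained terms are periodic, t may lie beyond the sampled window.
    """
    n, coeffs = compressed
    value = sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in coeffs.items()) / n
    return 1 if value.real >= 0.5 else 0

# Hypothetical loop branch: taken seven times, then not taken once.
pattern = [1, 1, 1, 1, 1, 1, 1, 0]
history = pattern * 8                 # 64 observed outcomes
compressed = compress(history, keep=8)
future = [predict(compressed, t) for t in range(64, 128)]
```

A history with period 8 has only 8 nonzero coefficients in a 64-point transform, so keeping those 8 is enough to reproduce the pattern; keeping fewer trades storage for accuracy.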
ISBN: 0769515258 (print)
Several studies of speculative execution based on values have reported promising performance potential. However, virtually all microarchitectures in these studies were described in an ambiguous manner, mainly due to the lack of a formalization that defines the effects of value-speculation on a microarchitecture. In particular, the manifestations of value-speculation on the latency of microarchitectural operations, such as releasing resources and reissuing, were at best partially addressed. This may be problematic, since the results obtained in these studies can be difficult to reproduce and their contribution difficult to appreciate. This paper introduces a model for a methodical description of dynamically-scheduled microarchitectures that use value-speculation. The model isolates the parts of a microarchitecture that may be influenced by value-speculation in terms of various variables and latency events. This provides a systematic means for describing, evaluating, and comparing the performance of value-speculative microarchitectures. The model parameters are integrated into a simulator to investigate the performance of several value-speculation related events. Among other things, the results show that value-speculation performance has non-uniform sensitivity to changes in the latency of these events. For example, fast verification latency is found to be essential, but when mis-speculation is infrequent, slow invalidation may be acceptable.
In this paper, we develop a multithreaded algorithm for pricing simple options and implement it on an 8-node SMP machine using MIT's supercomputer programming language Cilk. The algorithm dynamically creates lots of threads to exploit parallelism and relies on the Cilk runtime system to distribute the computation load. We present both analytical and experimental results, and our results explain how Cilk could be used effectively to exploit parallelism in the given problem. The analytical results show that our algorithm has a very high average parallelism, and hence Cilk is the target paradigm to implement the algorithm. We conclude from our implementation results that the size of the threads, the number of threads created, the load balancer, and the cost of spawning a thread are parameters that must be considered while designing the algorithm on the Cilk platform.
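The abstract does not say which pricing model or decomposition the authors used, so the following is only a sketch under stated assumptions: a European call priced on a Cox-Ross-Rubinstein binomial tree, with the sum over terminal nodes split recursively in the fork-join style that Cilk's `spawn` and `sync` express. It runs sequentially in Python; the comments mark where a Cilk program would spawn. All names (`payoff_term`, `range_sum`, `european_call`, `grain`) are hypothetical.

```python
import math

def payoff_term(k, n, s0, strike, u, d, p):
    """Risk-neutral probability of k up-moves out of n, times the call payoff."""
    price = s0 * (u ** k) * (d ** (n - k))
    prob = math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
    return prob * max(price - strike, 0.0)

def range_sum(lo, hi, grain, *args):
    """Fork-join sum of payoff terms for k in [lo, hi).

    `grain` is the smallest range handled without splitting, a stand-in for
    thread size: smaller grains mean more (and cheaper) tasks.
    """
    if hi - lo <= grain:
        return sum(payoff_term(k, *args) for k in range(lo, hi))
    mid = (lo + hi) // 2
    left = range_sum(lo, mid, grain, *args)    # a Cilk version would `spawn` this call
    right = range_sum(mid, hi, grain, *args)
    return left + right                        # and `sync` before combining

def european_call(s0, strike, rate, sigma, years, steps, grain=8):
    """Discounted expected payoff on an assumed CRR binomial tree."""
    dt = years / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(rate * dt) - d) / (u - d)
    total = range_sum(0, steps + 1, grain, steps, s0, strike, u, d, p)
    return math.exp(-rate * years) * total

fine = european_call(100.0, 100.0, 0.05, 0.2, 1.0, 64, grain=1)
coarse = european_call(100.0, 100.0, 0.05, 0.2, 1.0, 64, grain=1000)
```

Varying `grain` changes how many tasks are created without changing the result, which is the kind of thread-size and thread-count trade-off the abstract lists as a design parameter.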
ISBN: 0769511538 (print)
The application of parallel and distributed systems to multi-agent environments has attracted recent attention. Multi-agent systems are a particular type of distributed artificial intelligence system. This paper presents an approach to learning in parallel and distributed systems. A variant of the job assignment problem is chosen as an evaluation task. This is an NP-hard problem, which is relevant to many industrial application domains. Experimental results show the effectiveness of the proposed approach.
ISBN: 0769511538 (print)
In this paper we propose an efficient parallel implementation of Edmonds' algorithm for finding optimum branchings on a model of the SIMD type with vertical data processing (the STAR-machine). To this end, for a directed graph given as a list of triples (edge vertices and the weight), we construct a new associative version of Edmonds' algorithm. This version is represented as the corresponding STAR procedure, whose correctness is proved. We show that on vertical processing systems Edmonds' algorithm takes O(n log n) time, where n is the number of graph vertices.
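For readers unfamiliar with the underlying algorithm, here is a plain sequential sketch of the classic contraction form of Edmonds' algorithm, taking the graph as a list of (u, v, weight) triples as in the abstract. It is not the associative STAR procedure and does not achieve its time bound. Two assumptions are made for illustration: the objective is a minimum-weight spanning arborescence from a given root, and only the total weight is returned. The name `min_arborescence` is hypothetical.

```python
def min_arborescence(n, root, edges):
    """Total weight of a minimum spanning arborescence rooted at `root`.

    `edges` is a list of (u, v, w) triples over vertices 0..n-1.
    Returns None if some vertex cannot be reached from the root.
    """
    INF = float("inf")
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge for every non-root vertex.
        in_w = [INF] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v] = w
                pre[v] = u
        in_w[root] = 0
        if any(w == INF for w in in_w):
            return None

        # Step 2: add the chosen weights and look for cycles among chosen edges.
        comp = [-1] * n   # component id after contraction
        mark = [-1] * n   # which walk last visited each vertex
        count = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while x != root and mark[x] != v and comp[x] == -1:
                mark[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                # The walk returned to a vertex it marked itself: x lies on a new cycle.
                y = pre[x]
                while y != x:
                    comp[y] = count
                    y = pre[y]
                comp[x] = count
                count += 1
        if count == 0:
            return total  # chosen edges already form an arborescence

        # Step 3: contract each cycle to one vertex and reduce edge weights.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = count
                count += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        n = count
        root = comp[root]

# Hypothetical example: vertices 1, 2, 3 form a cheap cycle that must be broken.
example = [(0, 1, 10), (1, 2, 1), (2, 3, 1), (3, 1, 1), (0, 3, 5)]
best = min_arborescence(4, 0, example)
```

In the example the cheapest incoming edges form the cycle 1, 2, 3; after contraction the algorithm enters it through the edge (0, 3, 5) and drops the edge (2, 3, 1).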
ISBN: 0769511538 (print)
Techniques for scheduling parallel I/O are presented for both uniprogrammed systems that run single jobs in isolation and multiprogrammed environments that execute multiple parallel jobs simultaneously. The performance of the scheduling algorithms is evaluated on a network of workstations. A new scheduling algorithm proposed in this paper is observed to perform very well for systems running single jobs in isolation. The algorithms that use knowledge of job characteristics are observed to produce superior performance in multiprogrammed parallel environments.
ISBN: 0769511538 (print)
In this study, a new algorithm with distributed systems is proposed in order to optimise the structure of classifiers, which are of great importance in pattern recognition. The algorithm is applied to a multilayer neural network classifier that uses the back-propagation learning rule. The long processing period is shortened, and the expected high operation speed in pattern recognition is achieved by minimizing the hardware realization of the classifier.
ISBN: 0769511538 (print)
With the increasing growth in mobile computing devices and wireless networks, users are able to access information from anywhere and at any time. In such situations, the issues of location management for mobile hosts are becoming increasingly significant. Different location management schemes, such as Columbia University's mobile IP scheme and IETF mobile IP, have been proposed. In this paper, we propose a new distributed location management scheme and discuss the advantages of the proposed scheme over the others. The paper then considers the issues of multicasting in the proposed architecture.