Next generation multimedia mobile phones that use the high bandwidth 3G cellular radio network consume more power. Multimedia algorithms such as speech, video transcodecs have very large instruction foot prints and co...
详细信息
ISBN:
(纸本)9781604233926
Next generation multimedia mobile phones that use the high bandwidth 3G cellular radio network consume more power. Multimedia algorithms such as speech, video transcodecs have very large instruction foot prints and consequently stalled due to instruction cache misses. the conflicts in on-chip caches contribute a large fraction of the CPU cycle penalty and hence increase in power consumption. Many commercial tools are developed to minimize such cache misses by adequately placing the frequently called procedures in an application. Followed by profile extraction, these tools use cache line coloring and post compilation techniques for cache hit optimization. the major impediment in the optimal performance of such tools is their static layout profile, which does not consider the dynamic behavior of the application. We propose a methodology called DCP (dynamic code placement) for positioning code at run time for good instruction cache performance and have implemented in high end processors. the dynamic application profile is completely transparent to the developer’s code. this technique optimizes the code footprint in memory layout of a program. It improves i-cache mapping to increase the number of cache hits and eventually reduce the CPU stalls. Our optimization is powered with static as well as detail run time profile information that extracts the relevant, temporal behavior of the applications. Moreover, while mapping code in instruction cache, the effect of inter-procedural code positioning is also considered. Improvement over the Pettis and Hansen approach (PH) is also shown in results. though majority of multimedia applications can be optimized by our framework, application dominated withthe function pointers do not work correctly. the technique incurs low overheads and enhances the cache hits architecture correlation. For a range of applications we show that instruction miss rates can be reduced by 19-68%. Using a simple model this corresponds to execution time re
the proceedings contain 45 papers. the topics discussed include: on the number of spanning trees in a mesh structure;polylogarihmic gap between meshes with dynamically separable buses and meshes with statically partit...
ISBN:
(纸本)9781604236446
the proceedings contain 45 papers. the topics discussed include: on the number of spanning trees in a mesh structure;polylogarihmic gap between meshes with dynamically separable buses and meshes with statically partitioned buses;a bus-based recursive mesh for high-performance on-chip systems;designing parallel algorithms on the star graph using orthogonal decomposition;a virtual-channel free mapping for application-specific on-chip torus networks;using abstract state machines to specify a protocol that implements the GWO consistency model;optimizing timing and code size using maximum direct loop fusion;a constraint-based framework for concurrent and distributed programming;and topology and power management routing for reliability in ad-hoc networks.
the co-operation of parallel simulated annealing processes to solve the vehicle routing problem with time windows (VRPTW) is considered. the objective is to investigate how the number of parallel processes and the fre...
详细信息
ISBN:
(纸本)9780889866386
the co-operation of parallel simulated annealing processes to solve the vehicle routing problem with time windows (VRPTW) is considered. the objective is to investigate how the number of parallel processes and the frequency of processes co-operation influence the accuracy of solutions to the VRPTW. the accuracy of solutions is measured by their proximity to the best known solution.
In this paper, we present a new method of static load balancing for parallel mining of all frequent itemsets on a distributed-memory (DM) parallel machine. the method partitions the space of all frequent itemsets into...
详细信息
ISBN:
(纸本)9780889866386
In this paper, we present a new method of static load balancing for parallel mining of all frequent itemsets on a distributed-memory (DM) parallel machine. the method partitions the space of all frequent itemsets into subspaces of approximately the same size. Hence, it allows to balance the computational load for an arbitrary frequent itemset mining algorithm.
To keep up withthe pace of fast development of Internet, cluster architecture has been proposed for next generation core routers. In a cluster router, parallel computation is expected. computing shortest path tree (S...
详细信息
ISBN:
(纸本)9780889866386
To keep up withthe pace of fast development of Internet, cluster architecture has been proposed for next generation core routers. In a cluster router, parallel computation is expected. computing shortest path tree (SPT) is a fundamental problem implementing OSPF, which is one of the most popular routing protocols. this paper presents a parallel algorithm BPA (Branching parallel Algorithm) for computing SPT, analyzes the performance of BPA, and finally validates the BPA performance by experiments
In this paper, we discuss parallelization of a high-level computer vision application in medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images of epileptic patients usin...
详细信息
ISBN:
(纸本)9780889866386
In this paper, we discuss parallelization of a high-level computer vision application in medical imaging, namely, multi-scale active shape description of MR (magnetic resonance) brain images of epileptic patients using active contour models, on a cluster of workstations. the paper gives a comparative study and analysis of three different approaches of parallel implementation using corresponding parallelcomputing patterns such as Temporal Multiplexing, Pipeline, and Composite Pipeline. the outcome of the cluster-based parallel implementations has shown encouraging results.
distributed testing is often hard to implement. this is due to difficulties in handling heterogeneous environments, complex configurations, synchronization, error probing, result maintenance and automation in distribu...
详细信息
ISBN:
(纸本)9780889866386
distributed testing is often hard to implement. this is due to difficulties in handling heterogeneous environments, complex configurations, synchronization, error probing, result maintenance and automation in distributed testing. this paper describes a practical testing framework that permits automated distributed testing of distributedsystems and applications. the framework extends the capability of JUnit to support test execution over heterogeneous environments and complex configurations.
Memory consistency model is crucial to the performance of shared-memory multiprocessors, and in current architectures several different models are adopted. In this paper, using graph algorithms for illustrative purpos...
详细信息
ISBN:
(纸本)9780889866386
Memory consistency model is crucial to the performance of shared-memory multiprocessors, and in current architectures several different models are adopted. In this paper, using graph algorithms for illustrative purposes, we consider the impact of memory model on the implementation and performance of parallel algorithms on shared-memory multiprocessors. We show that the implementation of PRAM algorithm's is largely "oblivious" of the underlying memory model, and has good performance on relaxed models. More importantly, we show that different memory models can favor drastically different algorithm designs.
this paper studies Strassen's matrix multiplication algorithm by implementing it in a variety of methods: sequential, workflow, and in parallel. All the methods show better performance than the well-known scientif...
详细信息
ISBN:
(纸本)9780889866386
this paper studies Strassen's matrix multiplication algorithm by implementing it in a variety of methods: sequential, workflow, and in parallel. All the methods show better performance than the well-known scientific libraries for medium to large size matrices. the sequential recursive program is implemented and compared with ATLAS's DGEMM subroutine. A workflow program in the NetSolve system and two parallel programs based on MPI and Scal-APACK are also implemented. By analyzing the time complexity and memory requirement of each method, we provide insight into how to utilize Strassen's Algorithm to speedup matrix multiplication based on existing high performance tools or libraries.
We have developed a distributed asynchronous Web based training system. In order to improve the scalability and robustness of this system, all contents and a function that scores user's answers are realized on mob...
详细信息
ISBN:
(纸本)9780889866386
We have developed a distributed asynchronous Web based training system. In order to improve the scalability and robustness of this system, all contents and a function that scores user's answers are realized on mobile agents. these agents are distributed to computers, and they can obtain using a P2P network that modified Content-Addressable Network. In this system, although entire services do not become impossible even if some computers break down, the problem that contents disappear occurs with an agent's disappearance. In this study, as a solution for this problem, backups of agents are distributed to computers. If a failure of a computer is detected, other computer will continue service using backups of the agents belonged to the computer. the developed algorithm is examined by experiments.
暂无评论