this work presents the outline of an algorithm for merging two programs (and hence more) into a single program in source level. the approach is a constrained software equivalent of simultaneous multithreading (SMT). T...
详细信息
ISBN:
(纸本)9780769529448
this work presents the outline of an algorithm for merging two programs (and hence more) into a single program in source level. the approach is a constrained software equivalent of simultaneous multithreading (SMT). this work goes beyond previous works by considering how to merge the remainder of a loop into a recursively merged tail. the contribution of this work is the technique of handling the tail of non-equivalent loops in the process of recursively merging subcomponents. the proposed scheme makes extensive use of the ability to forward "remaining iterations" from the merging of two sub-components to be used in following mergings of other sub-components. Forwarding remaining iterations has several modes and in particular the ability to use repeated execution of inner loops to complete the iterations of larger loops. this is (to the best of our knowledge) the first complete tool for source-level merging in C. the effectiveness of the proposed scheme for embedded systems has been studied via a sequence of experiments showing expected improvement of 10-20%. We tested merging programs from DSP related benchmarks using several compilers on different architectures.
We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application's performance is determined by the b...
详细信息
ISBN:
(纸本)9780769529448
We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application's performance is determined by the behaviour of its co-runners, i.e., other applications simultaneously running with it. this performance dependency is caused by unfair, co- runner-dependent cache allocation on CMPs. Poor performance isolation interferes withthe operating system 's control over priority enforcement and hinders QoS provisioning. Previous solutions required modifications to the hardware. We present a new software solution. Our cache-fair algorithm ensures that the application runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated. If the thread executes fewer instructions per cycle than it would under fair cache allocation, the scheduler increases that thread's CPU time slice. this way, the thread's overall performance does not suffer because it is allowed to use the CPU longer. We describe our implementation of the algorithm in Solaristrade 10, and show that it significantly improves performance isolation for SPEC CPU, SPEC JBB and TPC-C.
Withthe growing acceptance of multi-core architectures by the industry, devising novel techniques to extract thread-level parallelism from sequential programs has become a fundamental need. the role of compiler along...
详细信息
Withthe growing acceptance of multi-core architectures by the industry, devising novel techniques to extract thread-level parallelism from sequential programs has become a fundamental need. the role of compiler along with programming model and architectural innovation is of utmost importance to fully realize the potential performance benefits of the multi-core architectures. this paper evaluates the capabilities and limitations of parallelizing compilers to extract parallelism automatically from the loops present in sequential programs. the applications from embedded benchmark suites EEMBC 1.1 and MiBench are analyzed using the Intel C++ 9.1 Compiler for Linux. the contributions of the paper are manifold: Firstly, the paper shows that on average 10% of the loops can be parallelized automatically by the Intel Compiler. Secondly, we have shown that the auto- parallelizable loops cover only about 12.5% of the total program execution-time. thirdly, we have explored the reasons behind the inability of the compiler to auto-parallelize the majority of the loops. We have found that on average 37.5% and 8% of the loops can't be auto-parallelized because of statically unknown loop trip count and probable data dependence, respectively. Finally, this study identifies the set of loops which comprises the most of the execution time of the programs and shows that compiler, on average, can automatically parallelize about 22% of such loops.
Massively parallel processor array architectures can be used as hardware accelerators for a plenty of dataflow dominant applications. Bilateral filtering is an example of a state-of-the-art algorithm in medical imagin...
详细信息
ISBN:
(纸本)0769526829
Massively parallel processor array architectures can be used as hardware accelerators for a plenty of dataflow dominant applications. Bilateral filtering is an example of a state-of-the-art algorithm in medical imaging, which falls in the class of 2D adaptive filter algorithms. In this paper we propose a semi-automatic mapping methodology for the generation of hardware accelerators for such a generic class of adaptive filtering applications in image processing. the final architecture deliver similar synthesis results as a hand-tuned design.
Embedded computing architectures can be designed to meet a variety of application specific requirements. However, optimized hardware can require compiler support to realize the potential of the hardware. this is espec...
详细信息
ISBN:
(纸本)0769526373
Embedded computing architectures can be designed to meet a variety of application specific requirements. However, optimized hardware can require compiler support to realize the potential of the hardware. this is especially true for embedded image processing systems where significant architectural variation is possible, and targeted software can change drastically based on architectural variation. this paper presents methods to compile a single high-level source given a fundamental variation in data-parallel target architectures processor granularity ranging from a single processor to a massively parallel processor array. the approach uses single PPE virtualization, which supports pixel-level data-parallel expressions that operate on a virtual one pixel per processing element (PPE) network and applies pixel-locating transformations to retarget the code into a given target PPE. Unlike mainstream parallel computing techniques, this technique can be applied to lightweight SIMD targets that do not provide global communication hardware or shared memory.
this paper describes an architecture dedicated to the real-time processing of census correlation in the context of the realization of passive stereovision sensors. Although DSP circuits have dramatically increased the...
详细信息
ISBN:
(纸本)9781424403127
this paper describes an architecture dedicated to the real-time processing of census correlation in the context of the realization of passive stereovision sensors. Although DSP circuits have dramatically increased their performances in terms of frequency (about 600 MHz today), DSP cores (several Multipliers Accumulators) and pipelines (Super Harvard architectures for example), FPGA circuits remain the best way to design massive parallelarchitectures when ultra fast algorithms computation are needed like it is the case in real time vision systems for collision avoidance.
We present the first parallel algorithm for building a Hausdorff Voronoi diagram (HVD). Our algorithm is targeted towards cluster computing architectures and computes the Hausdorff Voronoi diagram for non-crossing obj...
详细信息
ISBN:
(纸本)0769526365
We present the first parallel algorithm for building a Hausdorff Voronoi diagram (HVD). Our algorithm is targeted towards cluster computing architectures and computes the Hausdorff Voronoi diagram for non-crossing objects in time O(nlog(4)n/p)for input size n and p processors. In addition, our parallel algorithm also implies a new sequential HVD algorithm that constructs HVDs for noncrossing objects in time O(n log(4) n). this improves on previous sequential results and solves an open problem posed by Papadopoulou and Lee [18].
this paper discusses fast parallelalgorithms for evaluating several centrality indices frequently used in complex network analysis. these algorithms have been optimized to exploit properties typically observed in rea...
详细信息
ISBN:
(纸本)0769526365
this paper discusses fast parallelalgorithms for evaluating several centrality indices frequently used in complex network analysis. these algorithms have been optimized to exploit properties typically observed in real-world large scale networks, such as the low average distance, high local density, and heavy-tailed power law degree distributions. We test our implementations on real datasets such as the web graph, protein-interaction networks, movie-actor and citation networks, and report impressive parallel performance for evaluation of the computationally intensive centrality metrics (betweenness and closeness centrality) on high-end shared memory symmetric multiprocessor and multithreaded architectures. To our knowledge, these are the first parallel implementations of these widely-used social network analysis metrics. We demonstrate that it is possible to rigorously analyze networks three orders of magnitude larger than instances that can be handled by existing network analysis (SNA) software packages. For instance, we compute the exact betweenness centrality value for each vertex in a large US patent citation network (3 million patents, 16 million citations) in 42 minutes on 16 processors, utilizing 20GB RAM of the IBM p5 570. Current SNA packages on the other hand cannot handle graphs with more than hundred thousand edges.
A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of...
详细信息
ISBN:
(纸本)9781424403431
A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this performance problem from different perspectives, none of which have been entirely satisfactory. In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing. parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements. this paper presents our design and implementation of parallel XML parsing. Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse. the results of the preparsing phase are used to help partition the XML document for data parallelprocessing. Our parallel parsing phase is a modification of the libxml2 [1] XML parser, which shows that our approach applies to real-world, production quality parsers. Our empirical study shows our parallel XML parsing algorithm can improved the XML parsing performance significantly and scales well.
暂无评论