ISBN (print): 9781595934529
Hydra PPS is a collection of annotations, classes, a runtime, and a compiler designed to provide Java programmers with a fairly simple method of producing programs for Symmetric Multiprocessing (SMP) architectures. This paper introduces the basics of this new system, including the basic constructs of this new programming language and the relationship between the Java VM, the compiler, the runtime, and the parallel program. Hydra will exploit parallelism when the underlying architecture supports it and will run as a normal sequential Java program when the architecture has no support for parallelism. Parallelism is expressed through events in Hydra; it is easy to use, and programs run efficiently on parallel architectures.
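The abstract does not show Hydra's actual constructs, so the sketch below is purely illustrative: it uses only plain java.util.concurrent (not Hydra's annotations or event API, which are not reproduced here) to convey the general idea of a program that exploits whatever parallelism the machine offers and degrades to ordinary sequential execution on a single core.

```java
import java.util.List;
import java.util.concurrent.*;

// Illustrative only: this is NOT Hydra's API, just a plain-Java sketch of the
// idea that a program can exploit parallelism when the hardware offers it and
// run sequentially when it does not.
public class EventSketch {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        // On a single-core machine this pool has one worker, so the "events"
        // below simply run one after another, as in a sequential Java program.
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        List<Callable<Long>> events = List.of(
                () -> sum(0, 500_000),
                () -> sum(500_000, 1_000_000));

        long total = 0;
        for (Future<Long> f : pool.invokeAll(events)) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("total = " + total);
    }

    private static long sum(int lo, int hi) {
        long s = 0;
        for (int i = lo; i < hi; i++) s += i;
        return s;
    }
}
```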
ISBN (print): 9781595934529
Suppose we have a parallel or distributed system whose nodes have limited capacities, such as processing speed, bandwidth, memory, or disk space. How does the performance of the system depend on the amount of heterogeneity of its capacity distribution? We propose a general framework to quantify the worst-case effect of increasing heterogeneity in models of parallel systems. Given a cost function g(C,W) representing the system's performance as a function of its nodes' capacities C and workload W (such as the completion time of an optimum schedule of jobs W on machines C), we say that g has price of heterogeneity α when for any workload, cost cannot increase by more than a factor α if node capacities become arbitrarily more heterogeneous. We give constant bounds on the price of heterogeneity of several well-known job scheduling and graph degree/diameter problems, indicating that increasing heterogeneity can never be much of a disadvantage. On the other hand, with the introduction of timing constraints such as release times or precedence constraints on the jobs, the dependence on node capacities becomes more complex, so that increasing heterogeneity may be quite detrimental.
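For concreteness, one way to read the definition above is as a worst-case ratio. The formalization below is only a sketch; in particular, the choice of majorization as the "more heterogeneous" relation is an assumption, not something stated in the abstract.

```latex
% Sketch of the definition; the relation C' \succeq C (same total capacity,
% distributed more unevenly, i.e. majorization) is an assumed formalization
% of "more heterogeneous".
\[
  \mathrm{PoH}(g) \;=\; \sup_{W}\ \sup_{C' \succeq C}\ \frac{g(C',W)}{g(C,W)},
\]
% g has price of heterogeneity \alpha when \mathrm{PoH}(g) \le \alpha:
% making the node capacities arbitrarily more heterogeneous can never raise
% the cost by more than a factor \alpha, for any workload W.
```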
ISBN (print): 9781595934529
While there are strong beliefs within the community about whether one particular parallel programming model is easier to use than another, there has been little research to analyze these claims empirically. Currently, the most popular paradigm is message-passing, as implemented by the MPI library [1]. However, MPI is considered to be difficult for developing programs, because it forces the programmer to work at a very low level of abstraction. One alternative parallel programming model is the PRAM model, which supports fine-grained parallelism and has a substantial history of algorithmic theory [2]. It is not possible to program current parallel machines using the PRAM model because modern architectures are not designed to support such a model efficiently. However, current trends towards multicore chips suggest that large-scale, fine-grained, uniform-memory-access parallel machines may soon be feasible. XMT-C is an extension of the C language that supports parallel directives to provide a PRAM-like model to the programmer. A prototype compiler exists that generates code which runs on a simulator for an XMT architecture [3]. To better understand how much benefit a PRAM-like model could provide over a message-passing model, we conducted a feasibility study in an academic setting to compare the effort required to solve a particular problem. The questions under study were: can we measure the effort in developing a program using these two programming models, and can we differentiate the amount of effort for each model? The subjects participating in the study were divided into two groups. One group solved a problem using the MPI library in either C, C++, or Fortran, and the other group solved the problem using XMT-C. The task was to write a function to multiply a sparse matrix with a dense vector. To obtain subjects, we leveraged existing graduate-level parallel programming courses at two different universities: University of California, Santa Barbara (UCSB), and Universit
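The task itself is small; the following sequential sketch of a sparse matrix times dense vector product in CSR form is mine, not the study's reference code (the subjects worked in C, C++, or Fortran with MPI, or in XMT-C), and is included only to show what had to be parallelized.

```java
// Minimal sequential sketch of the study's task: sparse matrix times dense
// vector, with the matrix stored in compressed sparse row (CSR) form.
public class SpMV {
    // CSR: values[k] lies in row r where rowPtr[r] <= k < rowPtr[r+1],
    // at column colIdx[k].
    static double[] multiply(double[] values, int[] colIdx, int[] rowPtr, double[] x) {
        int rows = rowPtr.length - 1;
        double[] y = new double[rows];
        for (int r = 0; r < rows; r++) {                      // rows are independent,
            double dot = 0.0;                                 // so they are the natural
            for (int k = rowPtr[r]; k < rowPtr[r + 1]; k++) { // unit of parallel work
                dot += values[k] * x[colIdx[k]];
            }
            y[r] = dot;
        }
        return y;
    }

    public static void main(String[] args) {
        // 2x3 matrix [[1,0,2],[0,3,0]] times vector [1,1,1] gives [3,3]
        double[] values = {1, 2, 3};
        int[] colIdx = {0, 2, 1};
        int[] rowPtr = {0, 2, 3};
        double[] y = multiply(values, colIdx, rowPtr, new double[]{1, 1, 1});
        System.out.println(y[0] + " " + y[1]);  // prints 3.0 3.0
    }
}
```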
ISBN (print): 9781595934529
Since the discovery of Gröbner bases, the algorithmic advances in Commutative Algebra have made it possible to tackle many classical problems in Algebraic Geometry that were previously out of reach. However, algorithmic progress is still desirable, for instance when solving symbolically a large system of algebraic non-linear equations. For such a system, in particular if its solution set consists of geometric components of different dimension (points, curves, surfaces, etc.), it is necessary to combine Gröbner bases with decomposition techniques, such as triangular decompositions. Ideally, one would like each of the different components to be produced by an independent processor, or set of processors. In practice, the input polynomial system, which is hiding those components, requires some transformations in order to split the computations into sub-systems and, then, lead to the desired components. The efficiency of this approach depends on its ability to detect and exploit geometrical information during the solving process. This work addresses two questions: How to discover geometrical information, at an early stage of the solving process, that would be favorable to parallel execution? How to ensure load balancing among the processors? We answer these questions in the context of triangular decompositions [2], which are a popular way of solving polynomial systems symbolically. These methods tend to split the input polynomial system into subsystems and, therefore, are natural candidates for parallel implementation. However, the only such method which has been parallelized so far is the Characteristic Set Method of Wu [5], as reported in [1, 6]. This approach suffers from several limitations. For instance, the solving of the second component cannot start before that of the first one is completed; this is a limitation in view of coarse-grain parallelism. In [4] an algorithm, called Triade, for TRIAngular DEcompositions, provides a good management of the intermediate computat
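As a toy illustration of components of different dimension (my example, not one from the paper):

```latex
% Toy example: the system below splits into two triangular systems,
% one per geometric component.
\[
  \{\, x y = 0,\; x z = 0 \,\} \subset k[x,y,z]
  \quad\Longrightarrow\quad
  T_1 = \{\, x \,\}, \qquad T_2 = \{\, y,\; z \,\}.
\]
% T_1 describes the plane x = 0 (dimension 2) and T_2 the line y = z = 0
% (dimension 1). In a parallel triangular decomposition each such component
% could, ideally, be produced by an independent processor.
```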
ISBN (print): 9781595934529
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach. Overview of schedulers: In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets. CMP configurations studied: We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies decrease from 90nm to [...]. Summary of findings: We studied a variety of benchmark programs to show the following. For several application classes, PDF enable
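For readers unfamiliar with work stealing, the sketch below uses Java's ForkJoinPool, which is a work-stealing scheduler in the same spirit as WS (it does not implement the PDF policy and is unrelated to the simulator used in the study), to show the kind of fine-grained fork/join program both schedulers target.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustration only: ForkJoinPool is a work-stealing scheduler in the spirit
// of WS; it does not implement the PDF policy discussed above.
class SumTask extends RecursiveTask<Long> {
    private final long lo, hi;                 // sum the integers in [lo, hi)
    SumTask(long lo, long hi) { this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= 10_000) {               // small enough: run sequentially
            long s = 0;
            for (long i = lo; i < hi; i++) s += i;
            return s;
        }
        long mid = (lo + hi) / 2;
        SumTask left = new SumTask(lo, mid);
        SumTask right = new SumTask(mid, hi);
        left.fork();                           // pushed on this worker's deque; an
        long r = right.compute();              // idle core may steal it from the other end
        return r + left.join();
    }

    public static void main(String[] args) {
        long total = new ForkJoinPool().invoke(new SumTask(0, 1_000_000));
        System.out.println(total);             // 499999500000
    }
}
```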
The proceedings contain 39 papers. The topics discussed include: randomized queue management for DiffServ; randomization does not reduce the average delay in parallel packet switches; dynamic circular work-stealing deque; coloring unstructured radio networks; name independent routing for growth bounded networks; windows scheduling of arbitrary length jobs on parallel machines; parallel scheduling of complex dags under uncertainty; on distributed smooth scheduling; scheduling malleable tasks with precedence constraints; an adaptive power conservation scheme for heterogeneous wireless sensor networks with node redeployment; irrigating ad hoc networks in constant time; and constant density spanners for wireless ad-hoc networks.
ISBN (print): 9781581139860
We consider the natural extension of the single-disk caching problem to the parallel disk I/O model. We close the existing gap between lower and upper bounds and achieve the optimal competitive ratio of O(√D) when the lookahead is more than the memory size M. When the lookahead is smaller, we derive various upper and lower bounds on the competitive ratio under various adversarial models.
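As background, the standard notion of competitive ratio used in statements like the one above is sketched below; the lookahead and adversarial refinements studied in the paper are not captured by this basic definition.

```latex
% Standard definition of competitive ratio (background only; the paper's
% lookahead and adversarial variants refine this notion).
\[
  \mathrm{cost}_A(\sigma) \;\le\; c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) + b
  \qquad \text{for every request sequence } \sigma,
\]
% where OPT is the optimal offline algorithm and b is a constant independent
% of \sigma; the smallest such c is the competitive ratio of the online
% algorithm A. The result above says that, with lookahead exceeding the
% memory size M, the best achievable ratio for D parallel disks is O(\sqrt{D}),
% matching the known lower bound.
```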
The proceedings contain 40 papers from the conference SPAA 2004 - Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures. The topics discussed include: on delivery times in packet networks under adversarial traffic; balanced graph partitioning; online hierarchical cooperative caching; scheduling against an adversarial network; effectively sharing a cache among threads; online algorithms for network design; and dynamic analysis of the arrow distributed protocol.
The proceedings contain 46 papers from the conference SPAA 2003 - Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures. The topics discussed include: optimal sharing of bags of tasks in heterogeneous clusters; minimizing total flow time and total completion time with immediate dispatching; a practical algorithm for constructing oblivious routing schemes; a polynomial-time tree decomposition to minimize congestion; and online oblivious routing.
The proceedings contain 7 papers from SPAA 2003, the 15th Annual Symposium on Parallelism in Algorithms and Architectures. The topics discussed include: a practical algorithm for constructing oblivious routing schemes; novel architectures for P2P applications: the continuous-discrete approach; quantifying instruction criticality for shared memory multiprocessors; relaxing the problem-size bound for out-of-core columnsort; the complexity of verifying memory coherence; a near-optimal scheduler for switch-memory-switch routers; and on local algorithms for topology control and routing in ad hoc networks.