ISBN: 9781595934529 (print)
Since the discovery of Gröbner bases, algorithmic advances in Commutative Algebra have made it possible to tackle many classical problems in Algebraic Geometry that were previously out of reach. However, algorithmic progress is still desirable, for instance when solving symbolically a large system of non-linear algebraic equations. For such a system, in particular if its solution set consists of geometric components of different dimensions (points, curves, surfaces, etc.), it is necessary to combine Gröbner bases with decomposition techniques, such as triangular decompositions. Ideally, one would like each of the different components to be produced by an independent processor, or set of processors. In practice, the input polynomial system, which is hiding those components, requires some transformations in order to split the computations into sub-systems and, then, lead to the desired components. The efficiency of this approach depends on its ability to detect and exploit geometrical information during the solving process.

This work addresses two questions: How can geometrical information favorable to parallel execution be discovered at an early stage of the solving process? How can load balancing among the processors be ensured? We answer these questions in the context of triangular decompositions [2], which are a popular way of solving polynomial systems symbolically. These methods tend to split the input polynomial system into subsystems and are therefore natural candidates for parallel implementation. However, the only such method that has been parallelized so far is the Characteristic Set Method of Wu [5], as reported in [1, 6]. This approach suffers from several limitations. For instance, the solving of the second component cannot start before that of the first one is completed, which is a limitation in view of coarse-grain parallelism.

In [4], an algorithm called Triade (for TRIAngular DEcompositions) provides good management of the intermediate computations.
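As a toy sketch of the parallelization idea described above (not the Triade algorithm itself; the subsystem list below is a hypothetical output of a triangular decomposition of the system {x*z = 0, y*z = 0}), each component can be solved by an independent worker process:

```python
# Toy sketch: solve each component of a decomposed polynomial system
# on its own processor. Subsystems are assumed, not computed here.
from multiprocessing import Pool
import sympy as sp

x, y, z = sp.symbols('x y z')

subsystems = [
    [z],      # 2-dimensional component: the plane z = 0
    [x, y],   # 1-dimensional component: the line x = y = 0
]

def solve_component(polys):
    # Solve one (already triangular) subsystem independently.
    return sp.solve(polys, [x, y, z], dict=True)

if __name__ == "__main__":
    with Pool(processes=len(subsystems)) as pool:
        for sols in pool.map(solve_component, subsystems):
            print(sols)
```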
The proceedings contain 39 papers. The topics discussed include: randomized queue management for DiffServ; randomization does not reduce the average delay in parallel packet switches; dynamic circular work-stealing deque; coloring unstructured radio networks; name independent routing for growth bounded networks; windows scheduling of arbitrary length jobs on parallel machines; parallel scheduling of complex dags under uncertainty; on distributed smooth scheduling; scheduling malleable tasks with precedence constraints; an adaptive power conservation scheme for heterogeneous wireless sensor networks with node redeployment; irrigating ad hoc networks in constant time; and constant density spanners for wireless ad-hoc networks.
ISBN: 9781595934529 (print)
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.

Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because, when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.

CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs as process technologies decrease from 90nm.

Summary of findings. We studied a variety of benchmark programs to show the following findings. For several application classes, PDF enables ...
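A minimal, sequentially simulated sketch of the two scheduling policies contrasted above (the task model and names are illustrative assumptions, not the authors' code): a WS core pops from its own deque and otherwise steals from the opposite end of another core's deque, while PDF always picks the ready task with the earliest sequential-execution index.

```python
# Illustrative sketch of the two policies; not the simulator used in the study.
from collections import deque

def ws_next_task(queues, core):
    """Work Stealing: local LIFO pop, otherwise steal FIFO from another core."""
    if queues[core]:
        return queues[core].pop()
    for other, q in enumerate(queues):
        if other != core and q:
            return q.popleft()
    return None

def pdf_next_task(ready, seq_index):
    """Parallel Depth First: run the ready task the sequential program
    would have executed earliest (lowest seq_index)."""
    if not ready:
        return None
    task = min(ready, key=seq_index)
    ready.remove(task)
    return task
```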
ISBN: 9781581139860 (print)
In this paper we consider the problem of finding perfect matchings in parallel. We present an RNC algorithm with optimal work with respect to sequential algorithms, i.e., it uses O(n^ω) processors. Our algorithm is based on an RNC algorithm for computing the determinant of a degree-one polynomial matrix, which is of independent interest.
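The determinant connection can be illustrated with the classical randomized matching test (a sketch of the underlying idea, not the paper's processor-efficient algorithm): a graph has a perfect matching iff its Tutte matrix is non-singular as a polynomial matrix, which a random substitution exposes with high probability by Schwartz-Zippel.

```python
# Randomized Tutte-matrix test for the existence of a perfect matching.
# Illustrative only; one-sided error: "False" can be wrong with small probability.
import random
from sympy import zeros

def has_perfect_matching_whp(n, edges, trials=5, r=10**6):
    """n vertices labelled 0..n-1, undirected edge list."""
    for _ in range(trials):
        T = zeros(n, n)
        for (u, v) in edges:
            val = random.randrange(1, r)
            T[u, v], T[v, u] = val, -val
        if T.det() != 0:
            return True     # non-singular Tutte matrix => perfect matching exists
    return False            # Tutte determinant is (probably) identically zero

# Example: a 4-cycle has a perfect matching, a triangle does not.
# has_perfect_matching_whp(4, [(0, 1), (1, 2), (2, 3), (3, 0)])  -> True
# has_perfect_matching_whp(3, [(0, 1), (1, 2), (2, 0)])          -> False
```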
ISBN: 9781581139860 (print)
I present a VLSI circuit for segmented parallel prefix with gate delay O(log S) and wire delay O(√S) for segment size S, and total area O(N). Thus, for example, for the problem of adding random numbers, in most cases the addition completes with only O(log log N) gate delay and O(√(log N)) wire delay.
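For reference, segmented prefix is a running combination that restarts at every segment boundary; a plain-software sketch of the function the circuit computes (assuming boolean start-of-segment flags and an associative operator) is below.

```python
# What a segmented prefix computes, in plain software; the paper is about a
# VLSI circuit realizing this with O(log S) gate delay, not about this code.
from operator import add

def segmented_prefix(values, seg_start, op=add):
    """seg_start[i] is True where a new segment begins; the running prefix
    restarts at each segment boundary."""
    out, acc = [], None
    for v, start in zip(values, seg_start):
        acc = v if (start or acc is None) else op(acc, v)
        out.append(acc)
    return out

# segmented_prefix([1, 2, 3, 4, 5], [True, False, True, False, False])
#   -> [1, 3, 3, 7, 12]
```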
We consider the natural extension of the single-disk caching problem to the parallel disk I/O model. We close the existing gap between lower and upper bounds and achieve an optimal competitive ratio of O(√D) when the lookahead is larger than the memory size M. When the lookahead is smaller, we derive various upper and lower bounds on the competitive ratio under various adversarial models.
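As a rough illustration of how lookahead can drive eviction in this setting (a hypothetical rule for intuition, not the algorithm analyzed in the paper):

```python
# Hypothetical lookahead-driven eviction for a shared cache in front of D
# parallel disks. On a miss, evict the cached block whose next reference lies
# furthest in the known lookahead window, preferring blocks not referenced at all.
def choose_victim(cache, lookahead):
    """cache: iterable of (disk, block) pairs currently in memory;
    lookahead: upcoming requests as a list of (disk, block) pairs."""
    next_use = {}
    for pos, req in enumerate(lookahead):
        next_use.setdefault(req, pos)   # keep only the first (i.e. next) occurrence
    return max(cache, key=lambda blk: next_use.get(blk, float('inf')))
```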
ISBN: 9781581139860 (print)
Switching cells in parallel is a common approach to building switches with a very high external line rate and a large number of ports. A prime example is the parallel packet switch (PPS, for short), in which a demultiplexing algorithm sends cells, arriving at rate R on N input ports, through one of K intermediate slower switches operating at rate τ < R. This paper presents lower bounds on the average queuing delay introduced by the PPS relative to an optimal work-conserving FCFS switch, for demultiplexing algorithms that do not have full and immediate information about the switch status. The bounds hold even if the algorithm is randomized. These lower bounds are shown to be asymptotically optimal through a new methodology for analyzing the maximal relative queuing delay, which clearly upper-bounds the average relative queuing delay. The methodology is used to devise a new algorithm that relies on slightly outdated global information on the switch status. It is also used to provide, for the first time, a complete proof of the maximum relative queuing delay achieved by the fractional traffic dispatch algorithm [19, 22].
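A toy demultiplexer that keeps per-flow order by spreading cells round-robin over the K intermediate switches, using only local per-flow state rather than global switch status (purely illustrative; not one of the algorithms analyzed in the paper):

```python
# Toy PPS demultiplexer sketch: per-flow round-robin over K slower planes.
class RoundRobinDemux:
    def __init__(self, k):
        self.k = k
        self.next_plane = {}              # per-flow pointer into the K planes

    def dispatch(self, flow_id):
        plane = self.next_plane.get(flow_id, 0)
        self.next_plane[flow_id] = (plane + 1) % self.k
        return plane                      # index of the chosen intermediate switch

# demux = RoundRobinDemux(k=4)
# demux.dispatch("flow-A") -> 0, then 1, 2, 3, 0, ...
```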
ISBN: 9781581139860 (print)
High-end shared storage systems serving multiple independent workloads must assure that concurrently executing clients receive a fair or agreed-upon share of system I/O resources. In a parallel I/O system, an application makes requests for specific disks at different steps of its computation, depending on the data layout and its computational state. Different applications contend for disk access, making the problem of maintaining fair allocation challenging. We propose a model for differentiated disk bandwidth allocation based on lexicographic minimization, and provide new efficient scheduling algorithms to allocate the I/O bandwidth fairly among contending applications. A major contribution of our model is its ability to handle multiple parallel disks and contention for disks among the concurrent applications. Analysis and simulation-based evaluation show that our algorithms provide performance isolation and weighted allocation of resources, and are work conserving. The solutions are also applicable to other shared-resource environments dealing with non-uniform heterogeneous servers.
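The flavour of fairness involved can be sketched for a single resource with classical weighted max-min (progressive filling); the paper's contribution is extending this kind of lexicographic optimization to multiple parallel disks with contention, which this single-server sketch does not capture.

```python
# Weighted max-min (lexicographically fair) sharing of one resource
# via progressive filling -- a single-resource sketch of the fairness notion.
def weighted_maxmin(capacity, demands, weights):
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = float(capacity)
    while active and remaining > 1e-12:
        fair = remaining / sum(weights[i] for i in active)
        sated = {i for i in active if demands[i] - alloc[i] <= fair * weights[i]}
        if not sated:
            for i in active:                      # split the rest by weight
                alloc[i] += fair * weights[i]
            break
        for i in sated:                           # cap satisfied clients at demand
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active -= sated
    return alloc

# weighted_maxmin(100, [80, 30, 10], [1, 1, 1]) -> [60.0, 30.0, 10.0]
```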
ISBN: 9781581139860 (print)
This paper introduces a parallel scheduling problem where a directed acyclic graph modeling t tasks and their dependencies needs to be executed on n unreliable workers. Worker i executes task j correctly with probability p_{i,j}. The goal is to find a regimen Σ that dictates how workers get assigned to tasks (possibly in parallel and redundantly) throughout execution, so as to minimize the expected completion time. This fundamental parallel scheduling problem arises in grid computing and project management, and has several practical applications. We show a polynomial-time algorithm for the problem restricted to the case when the dag width is at most a constant and the number of workers is also at most a constant. These two restrictions may appear to be too severe; however, they are fundamentally required. Specifically, we demonstrate that the problem is NP-hard with a constant number of workers when the dag width can grow, and is also NP-hard with constant dag width when the number of workers can grow. When both the dag width and the number of workers are unconstrained, the problem is inapproximable within a factor less than 5/4, unless P=NP.
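To make the model concrete, here is a Monte Carlo estimate of the expected completion time of one simple greedy regimen (every step, all workers are assigned, possibly redundantly, across the tasks whose predecessors are done); this is an illustration of the model only, not the paper's exact polynomial-time algorithm for bounded width and workers.

```python
# Monte Carlo sketch of the unreliable-workers model under a greedy regimen.
import random

def estimate_completion_time(succ_prob, deps, runs=10_000):
    """succ_prob[i][j]: probability worker i completes task j in one step.
    deps[j]: set of tasks that must finish before task j can start."""
    n_workers, n_tasks = len(succ_prob), len(deps)
    total_steps = 0
    for _ in range(runs):
        done, steps = set(), 0
        while len(done) < n_tasks:
            steps += 1
            eligible = [j for j in range(n_tasks)
                        if j not in done and deps[j] <= done]
            for i in range(n_workers):            # round-robin workers over eligible tasks
                j = eligible[i % len(eligible)]
                if random.random() < succ_prob[i][j]:
                    done.add(j)
        total_steps += steps
    return total_steps / runs

# Two tasks in a chain, two workers:
# estimate_completion_time([[0.5, 0.9], [0.5, 0.9]], [set(), {0}])
```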