ISBN (print): 1595936734
Adapting to the network is the key to achieving high performance for communication-intensive applications, including scientific computing, data-intensive computing, and multicast, especially in Grid environments. This paper investigates an approach that represents the network as a tree of participating hosts and switches matching or approximating their physical topology, and describes a fast, non-intrusive, and portable algorithm for inferring such a topology. This representation and the proposed inference algorithm serve as a key to building network-aware applications in a portable manner. The algorithm is based solely on RTTs of small packets between end hosts; it does not rely on popular but not universally available protocols such as traceroute and SNMP. Another benefit is that it can handle all layers of the network uniformly without any a priori knowledge of cluster configurations. The required number of measurements is O(Nd) under certain idealizing assumptions made for the purpose of analysis, where N is the number of participating processes and d is the diameter of the network, which is usually small in real networks. In our experimental environment, the inference algorithm built a topology of 64 hosts in a single cluster in 4 seconds and that of 256 hosts across 4 clusters in 15 seconds. It is able not only to identify clusters within a Grid, but also to partially identify the Layer 2 topology within a cluster. This is important for optimizing bandwidth-limited operations such as broadcast. We built several network-aware applications upon the inference system, including efficient bandwidth measurements and long message broadcasts. The topology is used to schedule as many measurements as possible in parallel without competing on shared links. We were able to build a bandwidth map of 256 hosts across 4 clusters in 27 seconds. Copyright 2007 ACM.
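As a rough illustration of how pairwise RTTs alone can separate hosts into clusters, the following Python sketch groups hosts whose measured RTT falls below a cutoff. It is not the inference algorithm of the paper above: it assumes a complete RTT matrix (N^2 measurements rather than O(Nd)), produces flat groups rather than a tree, and the function name `cluster_by_rtt` and the `threshold` parameter are hypothetical.

```python
def cluster_by_rtt(rtt, threshold):
    """Group hosts whose pairwise RTT is below `threshold` (single linkage).

    `rtt` is a symmetric N x N matrix of round-trip times; the function
    name and the fixed threshold are assumptions made for illustration.
    """
    n = len(rtt)
    parent = list(range(n))  # union-find forest, one node per host

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rtt[i][j] < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for host in range(n):
        groups.setdefault(find(host), []).append(host)
    return sorted(groups.values())


# Hosts 0 and 1 share one cluster, hosts 2 and 3 another (RTTs in ms).
example_rtt = [
    [0.0, 0.1, 5.0, 5.2],
    [0.1, 0.0, 5.1, 5.0],
    [5.0, 5.1, 0.0, 0.2],
    [5.2, 5.0, 0.2, 0.0],
]
clusters = cluster_by_rtt(example_rtt, threshold=1.0)
```

A fixed cutoff is the weakest part of this sketch; a tree-building method would instead compare RTTs relative to one another.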
ISBN (print): 159593717X
Very large scientific datasets are increasingly becoming available in XML formats. At the same time, multi-core processing is increasingly becoming available on desktop- and laptop-class computing machines. Unfortunately, most XML parsers still use algorithms that are inherently serial and therefore show little improvement on newer computing hardware. The current XML implementation landscape does not adequately meet the performance requirements of large-scale applications. Thus far, applications using Web services (in the grid community, for example) have largely focused on XML protocol standardization and tool-building efforts, and not on addressing the performance bottlenecks that arise when dealing with large volumes of XML data. Generic parallel parsing has been studied in depth over the past thirty years. However, these results have not yet been applied to the problem of XML parsing. XML documents have some structural properties that make them more amenable to parallelized parsing than general context-free languages. As has been previously shown, XML parsers spend a large percentage of time tokenizing the input in an inherently serial process, typically running a deterministic finite automaton on the input. Our initial approach, described here, separates the process of parsing the XML from the process of reading the input. We take a well-known high-performance parser, Piccolo, apply two different strategies, Runahead and Piped, and examine the file read time and hence the overall time to parse large scientific XML files. Under the conditions tested here, performance decreases. Copyright 2007 ACM.
ISBN (print): 0769523129
The proceedings contain 436 papers. The topics discussed include: exploiting barriers to optimize power consumption of CMPs; PDM sorting algorithms that take a small number of passes; a highly parallel algorithm for the numerical simulation of unsteady diffusion processes; functionality distribution for parallel rendering; effective instruction prefetching via fetch prestaging; enhanced parallel processing in wide registers; asynchronous complete distributed garbage collection; scheduling algorithms for effective thread pairing on hybrid multiprocessors; practical divisible load scheduling on grid platforms with APST-DV; parallelizing a defect detection and categorization application; data redistribution and remote method invocation in parallel component architectures; and runtime empirical selection of loop schedulers on hyperthreaded SMPs.
ISBN (print): 9780780393899
An efficient architecture for an FPGA symmetric FIR filter is proposed that employs M-bit parallel-distributed arithmetic (M-bit PDA). The partial products are pre-calculated and saved in the distributed RAM. This eliminates the large amount of logic needed to compute multiplication results. The proposed architecture consumes less area and offers higher-speed operation because the multiplier is omitted. An Altera APEX20KE is used as the target device. Thus, the proposed architecture has high processing speed and small area.
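The idea of replacing multipliers with a table of pre-calculated partial products can be sketched in software. The Python model below is a simplified, bit-serial (one bit per step) form of distributed arithmetic for unsigned samples; it is not the M-bit parallel architecture of the paper above, and the names `build_lut` and `da_dot` are hypothetical.

```python
def build_lut(coeffs):
    """Pre-calculate the sum of every subset of the filter coefficients.

    Bit k of the table address selects whether coeffs[k] is included.
    """
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(lut, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) with lookups, shifts and adds.

    `samples` are unsigned integers that fit in `bits` bits (an assumption
    of this sketch; signed inputs need a correction for the sign bit).
    """
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into one table address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        # The table entry is the partial product for this bit plane.
        acc += lut[addr] << b
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, 200, 17, 255]
lut = build_lut(coeffs)
result = da_dot(lut, samples, bits=8)
```

The table has 2^K entries for K taps, which is why hardware designs usually split long filters into several smaller tables.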
High-performance clusters are rapidly becoming an important computing platform for both scientific and business applications. To fulfill the new demands and challenges, cluster system software is inevitably complex. E...
Distributed Hash Table (DHT) algorithms obtain good lookup performance bounds by using deterministic rules to organize peer nodes into an overlay network. To preserve the invariants of the overlay network, DHTs use s...
Reputation in P2P networks is an important tool to encourage cooperation among peers. It is based on ranking of peers according to their past behaviour. In large-scale real world networks, a global centralised knowled...
ISBN (print): 9780780393899
A hybrid vision chip is presented for real-time object-based processing for tasks such as positioning and sizing of enclosed objects. This system presents the first artificial silicon retina capable of position and size determination of multiple objects in true parallel fashion. Based on a novel distributed algorithm, this approach uses the input image to enclose a feedback loop to realise a data-driven pulsating action. The fabricated device is shown to achieve a computation efficiency of at least 725 million instructions per second per milliwatt and to be capable of processing up to 2000 frames per second.
CDD weather derivatives are widely used to hedge weather risks and their fast and accurate pricing is an important problem in financial engineering. In this paper, we propose an efficient parallelization strategy of a...
ISBN (print): 9780780393899
Considering the wide range of applicability of particle filters, their VLSI implementation is of great importance. Resampling is the sequential part of the otherwise fully parallel particle filter. Therefore, parallel VLSI architectures for resampling are of particular interest. In this paper, we develop a parallel implementation of resampling. The novel feature of the proposed architecture is that the execution time of resampling becomes independent of the distribution of the weights. Unlike the alternatives in the literature, our scheme achieves a very small execution time by pipelining the resampling and sampling steps. Moreover, it is scalable to high levels of parallelism, has lower memory usage and fixed routing time, and achieves close to ideal performance. Furthermore, it eliminates the need for a point-to-point network between processing elements and results in a simple central unit.
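For context on why resampling is the sequential step, the following Python sketch shows conventional systematic resampling, a standard textbook method. It is not the parallel architecture proposed in the paper above; the function name `systematic_resample` and the optional `u0` parameter are hypothetical. Each output index depends on a running cumulative sum, and the number of inner-loop steps taken for a given output varies with the weights.

```python
import random


def systematic_resample(weights, u0=None):
    """Return len(weights) particle indices drawn by systematic resampling.

    `weights` need not be normalised. `u0` in [0, 1) fixes the single
    random offset so the result is reproducible; by default it is random.
    """
    n = len(weights)
    total = sum(weights)
    if u0 is None:
        u0 = random.random()
    step = total / n

    indices = []
    i = 0
    cumulative = weights[0]
    for j in range(n):
        threshold = (u0 + j) * step  # evenly spaced sampling points
        # Advance through the cumulative sum; this walk is data dependent.
        while cumulative < threshold and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices


picked = systematic_resample([1, 2, 3, 4], u0=0.5)
```

Particles with larger weights are selected more often, and a particle holding all of the weight is selected every time.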