Running programs across multiple nodes in a cluster of networked computers, such as in a supercomputer or commodity datacenter system, is increasingly important across multiple domains, including data science, machine learning, and scientific computing. This is brought on by a combination of increasing data sizes, which push beyond the memory capacity of a single node, and increasing computational demands from new, more elaborate simulations, models, and applications. However, writing parallel programs for clusters of computers remains a difficult task, particularly for programs that are irregular in terms of data distribution or access pattern. Many parallel programs today are still written using communication libraries like MPI or OpenSHMEM, which require users to explicitly manage low-level details. While high-level parallel programming languages and libraries do exist, and these can make implementing certain types of programs much easier, developers often have to expend significant effort building custom infrastructure and data structures for their applications. This thesis argues that a large part of the reason why parallel programming remains difficult is a lack of high-level distributed data structures analogous to the data structures that have become ubiquitous in sequential programming environments like C++ and Python. These especially include irregular data structures like hash tables and queues that may require fine-grained memory accesses along with synchronization. This thesis examines techniques for building high-level, cross-platform distributed data structures using one-sided remote memory operations like remote put, remote get, and remote atomics. These memory access primitives allow for a high degree of asynchrony, enabling better performance by removing synchronization bottlenecks and allowing a high degree of overlap between communication and computation. They can also be executed efficiently and directly by the network hardware in modern supercomputers.
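As a rough illustration of the one-sided primitives this abstract describes, the following toy sketch models remote put, remote get, and a remote atomic as operations on per-node memory arrays addressed by (rank, offset). All names (Cluster, put, get, fetch_add) are hypothetical; a real implementation would issue these through a communication layer such as MPI RMA or OpenSHMEM.

```python
# Toy model of one-sided remote memory operations (not a real RMA
# library): each "node" owns a flat memory array, and any rank may
# read, write, or atomically update any other rank's memory without
# involving the target's CPU in application logic.

class Cluster:
    def __init__(self, nodes, words_per_node):
        self.mem = [[0] * words_per_node for _ in range(nodes)]

    def put(self, rank, offset, value):
        """One-sided write into the target node's memory."""
        self.mem[rank][offset] = value

    def get(self, rank, offset):
        """One-sided read from the target node's memory."""
        return self.mem[rank][offset]

    def fetch_add(self, rank, offset, delta):
        """Remote atomic: read-modify-write executed 'at' the target."""
        old = self.mem[rank][offset]
        self.mem[rank][offset] = old + delta
        return old

c = Cluster(nodes=4, words_per_node=8)
c.put(2, 0, 41)
old = c.fetch_add(2, 0, 1)
assert old == 41 and c.get(2, 0) == 42
```

In a real system each of these calls maps to a network operation the NIC can execute without interrupting the remote process, which is the source of the asynchrony the abstract emphasizes.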
ISBN:
(print) 9781450362955
One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters, and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.
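To make the style of data structure concrete, here is a hedged sketch (not the BCL API; the class and method names are invented) of a distributed hash table in which each rank owns a contiguous slab of buckets and an insert from any rank becomes a short sequence of remote operations on the owning rank's slab:

```python
# Sketch of a BCL-style distributed open-addressing hash table.
# Each rank owns `buckets_per_rank` slots; a key hashes to an owning
# rank and local slot, and probing stays within that rank's slab
# (a simplification). The slot update stands in for a remote
# compare-and-swap followed by a remote put.

EMPTY = None

class DistHashTable:
    def __init__(self, ranks, buckets_per_rank):
        self.slabs = [[(EMPTY, EMPTY)] * buckets_per_rank
                      for _ in range(ranks)]
        self.ranks = ranks
        self.per = buckets_per_rank

    def _home(self, key):
        h = hash(key) % (self.ranks * self.per)
        return h // self.per, h % self.per   # (owning rank, local slot)

    def insert(self, key, value):
        rank, slot = self._home(key)
        slab = self.slabs[rank]
        for probe in range(self.per):        # linear probing
            i = (slot + probe) % self.per
            if slab[i][0] in (EMPTY, key):   # stand-in for remote CAS
                slab[i] = (key, value)
                return True
        return False                          # owning slab is full

    def find(self, key):
        rank, slot = self._home(key)
        slab = self.slabs[rank]
        for probe in range(self.per):
            k, v = slab[(slot + probe) % self.per]
            if k == key:
                return v
            if k is EMPTY:
                return None
        return None

t = DistHashTable(ranks=4, buckets_per_rank=16)
t.insert("alpha", 1)
assert t.find("alpha") == 1
```

The point of the sketch is the access pattern: every operation touches one rank's memory with a handful of fine-grained reads and writes, which is exactly the workload one-sided primitives serve well.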
ISBN:
(print) 9781728159874
Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures.
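The flavor of the trade-off can be sketched with a toy cost model. The functional form and every parameter below are invented for illustration (the paper's actual model is more detailed): an RDMA operation pays only network round trips, while an RPC pays one round trip plus software handler time and possible inattentiveness at the target, but a single RPC can replace several dependent RDMA accesses.

```python
# Hypothetical LogP-style cost model. An RDMA op is one hardware
# round trip; an RPC adds handler and attentiveness overheads but may
# need fewer ops because the remote side can traverse the structure.

def rdma_cost(latency_us, ops):
    return ops * (2 * latency_us)              # request + response in NIC

def rpc_cost(latency_us, handler_us, attentiveness_us, ops):
    return ops * (2 * latency_us + handler_us + attentiveness_us)

# A lookup needing 2 dependent RDMA reads vs 1 RPC:
rdma = rdma_cost(latency_us=1.5, ops=2)        # 6.0 us
rpc = rpc_cost(latency_us=1.5, handler_us=0.5,
               attentiveness_us=2.0, ops=1)    # 5.5 us
best = "RDMA" if rdma < rpc else "RPC"
```

Even this crude model reproduces the paper's qualitative conclusion: which style wins depends on how many dependent remote accesses an operation needs and how attentive the remote process is.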
ISBN:
(print) 9783662451748
Linearizability is a powerful consistency condition but can be expensive to implement. Recently, researchers have suggested gaining performance by relaxing the sequential specification of objects' data types. We consider, for the first time, linearizable message-passing implementations of relaxed Queues and prove upper and lower bounds on the elapsed time for Dequeue operations, both in the worst case and on average. Our results imply that worst-case time complexity does not indicate benefit from relaxation. In contrast, we present implementations of relaxed Queues for which the average time complexity of Dequeue is significantly smaller than both the worst-case lower bound for unrelaxed Queues and a newly proved lower bound on the average time for unrelaxed Queues. We also prove lower bounds on the average time complexity of Dequeue for relaxed Queues that show our algorithms are asymptotically optimal and that there is an inherent complexity gap between different levels of relaxation.
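For readers unfamiliar with relaxed specifications, the following sequential toy shows one common relaxation (the k-out-of-order style; the specific relaxation studied in the paper may differ): Dequeue may return any of the k oldest elements rather than strictly the oldest, which weakens the ordering guarantee in exchange for implementations that wait less.

```python
# Toy sequential model of a k-relaxed FIFO queue: dequeue picks
# uniformly among the k oldest elements instead of always the head.
import random

class RelaxedQueue:
    def __init__(self, k, seed=0):
        self.k = k
        self.items = []
        self.rng = random.Random(seed)

    def enqueue(self, x):
        self.items.append(x)

    def dequeue(self):
        if not self.items:
            return None
        window = min(self.k, len(self.items))
        return self.items.pop(self.rng.randrange(window))

q = RelaxedQueue(k=3)
for x in range(10):
    q.enqueue(x)
assert q.dequeue() in (0, 1, 2)   # any of the 3 oldest is legal
```

The paper's contribution is not such an implementation but bounds on how much time distributed implementations of relaxed semantics like this can actually save.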
Remote Direct Memory Access (RDMA) has become a standard networking technology and is prominently used in high-performance applications. While RDMA can provide both excellent performance and novel capabilities, it can be more difficult than traditional kernel-supported networking to use effectively. RDMA adoption can be simplified through the use of abstractions that target broad classes of applications. In this thesis, we propose flow structures as an abstraction targeting distributed applications in which data need to flow directly between producers and consumers. Flow structures are asymmetric distributed producer/consumer data structures in which some servers only produce and others only consume. We present a design space for RDMA implementations of flow structures and illustrate the space using several concrete implementations of a simple flow structure: a FIFO queue. Finally, we present an empirical study that illustrates the design tradeoffs along the dimensions of the design space, and that shows that flow structure abstractions can capture the performance advantages of RDMA while making RDMA much easier to use.
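A minimal sketch of the asymmetry that defines a flow structure (names hypothetical, and a single-process stand-in for what would be RDMA writes and reads): producers only ever advance the tail of a ring buffer, and consumers only ever advance the head, so data flows one way and neither side touches the other's state.

```python
# Toy flow-structure FIFO: roles are fixed, producers write only the
# tail side, consumers read only the head side.

class FlowQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0       # advanced only by consumers
        self.tail = 0       # advanced only by producers

    def produce(self, x):
        if self.tail - self.head == len(self.buf):
            return False    # full; producer never modifies head
        self.buf[self.tail % len(self.buf)] = x
        self.tail += 1
        return True

    def consume(self):
        if self.head == self.tail:
            return None     # empty; consumer never modifies tail
        x = self.buf[self.head % len(self.buf)]
        self.head += 1
        return x

q = FlowQueue(capacity=4)
assert all(q.produce(i) for i in range(4))
assert not q.produce(99)                 # full
assert [q.consume() for _ in range(4)] == [0, 1, 2, 3]
```

In an RDMA setting this one-way discipline is what lets producers push data directly into consumer-side buffers, which is the design space the thesis explores.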
ISBN:
(digital) 9789819723409
ISBN:
(print) 9789819723393; 9789819723409
This paper revisits the study of (minimum) broadcast graphs, i.e., graphs enabling fast information dissemination from every source node to all the other nodes (and having the minimum number of edges for this property). This study is performed in the framework of compact distributed data structures, that is, when the broadcast protocols are bounded to be encoded at each node as an ordered list of neighbors specifying, upon reception of a message, in which order this message must be passed to these neighbors. We show that this constraint does not limit the power of broadcast protocols, as far as the design of (minimum) broadcast graphs is concerned. Specifically, we show that, for every n, there are n-node graphs for which it is possible to design protocols encoded by lists yet enabling broadcast in ⌈log₂ n⌉ rounds from every source, which is optimal even for general (i.e., non-space-constrained) broadcast protocols. Moreover, we show that, for every n, there exist such graphs with the additional property that they are asymptotically as sparse as the sparsest graphs for which ⌈log₂ n⌉-round broadcast protocols exist, up to a constant multiplicative factor. Concretely, these graphs have O(n · L(n)) edges, where L(n) is the number of leading 1s in the binary representation of n − 1, and general minimum broadcast graphs are known to have Ω(n · L(n)) edges.
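The list-encoded protocol model can be simulated in a few lines. The sketch below (a toy, not the paper's construction) models synchronous rounds in which a node informed in round r forwards to entry k of its fixed neighbor list in round r + k + 1, and checks a tiny instance: on the 4-cycle with lists [v+1, v−1] (mod 4), broadcast completes in ⌈log₂ 4⌉ = 2 rounds from every source.

```python
# Round-by-round simulator for broadcast protocols encoded as ordered
# neighbor lists: each informed node calls its list entries in order,
# one per round, starting the round after it learns the message.

def broadcast_rounds(lists, source):
    n = len(lists)
    informed = {source: 0}                      # node -> round learned
    rnd = 0
    while len(informed) < n and rnd < 2 * n:    # guard for bad protocols
        rnd += 1
        for u, r0 in list(informed.items()):    # snapshot: sync rounds
            k = rnd - r0 - 1                    # list index called now
            if 0 <= k < len(lists[u]):
                informed.setdefault(lists[u][k], rnd)
    return rnd

cycle4 = [[1, 3], [2, 0], [3, 1], [0, 2]]       # node v: [v+1, v-1] mod 4
assert all(broadcast_rounds(cycle4, s) == 2 for s in range(4))
```

Note that a naive fixed ordering generally wastes calls on already-informed neighbors; designing lists that avoid this waste for every source simultaneously is exactly what makes the paper's constructions nontrivial.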
ISBN:
(print) 9798350387117; 9798350387124
Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic challenges, particularly when managing sparse data, a pivotal component in progressing data science. The spatial computer model quantifies communication locality by weighting processor communication costs by distance, introducing a term named energy. Moreover, it integrates depth, a widely-utilized metric, to promote high parallelism. We propose and analyze a framework for efficient spatial tree algorithms within the spatial computer model. Our primary method constructs a spatial tree layout that optimizes the locality of the neighbors in the compute grid. This approach thereby enables locality-optimized messaging within the tree. Our layout achieves a polynomial factor improvement in energy compared to utilizing a PRAM approach. Using this layout, we develop energy-efficient treefix sum and lowest common ancestor algorithms, which are both fundamental building blocks for other graph algorithms. With high probability, our algorithms exhibit near-linear energy and poly-logarithmic depth. Our contributions augment a growing body of work demonstrating that computations can have both high spatial locality and low depth. Moreover, our work constitutes an advancement in the spatial layout of irregular and sparse computations.
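As background for one of the building blocks named above, here is a plain sequential version of a treefix (rooted prefix) sum: every node receives the sum of the values on the path from the root to itself. This shows only the semantics; the paper's contribution is computing it with low energy and depth under the spatial computer model.

```python
# Sequential treefix sum over a rooted tree given as a parent array.
# parent[v] is v's parent (-1 for the root); value[v] is v's input.

def treefix_sum(parent, value):
    n = len(parent)
    children = [[] for _ in range(n)]
    root = 0
    for v, p in enumerate(parent):
        if p == -1:
            root = v
        else:
            children[p].append(v)
    out = [0] * n
    stack = [(root, 0)]
    while stack:                 # iterative DFS carrying the path sum
        v, acc = stack.pop()
        out[v] = acc + value[v]
        for c in children[v]:
            stack.append((c, out[v]))
    return out

#        0(1)
#       /    \
#     1(2)   2(3)
#      |
#     3(4)
assert treefix_sum([-1, 0, 0, 1], [1, 2, 3, 4]) == [1, 3, 4, 7]
```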
ISBN:
(print) 9798400704161
Ensuring data consistency under remote direct memory access (RDMA) is challenging due to the combined effects of various hardware components. This brief announcement introduces remote object memory (ROMe), the first technique to guarantee wait-free consistent reads of arbitrarily sized objects over RDMA without the use of specialized hardware, while allowing the concurrent execution of conflicting local updates. We integrated ROMe into ROMe-KV, an RDMA-enabled key-value store whose underlying B-link tree nodes are ROMe objects, enabling support for wait-free linearizable range queries.
ISBN:
(print) 9798400701214
A multiplicity queue is a concurrently-defined data type which relaxes the conditions of a linearizable FIFO queue by allowing concurrent Dequeue instances to return the same value. It would seem that this should allow faster message-passing implementations, as processes should not need to wait as long to learn about concurrent operations, and previous work has shown that multiplicity queues are computationally less complex than the unrelaxed version. Intriguingly, recent work has shown that there is, in fact, little possible speedup versus an unrelaxed queue. Seeking to understand this difference between intuition and real behavior, we increase the lower bound for uniform algorithms. Further, we outline a path toward building proofs of even higher lower bounds, hypothesizing that the worst-case time to Dequeue approaches the maximum message delay, which is similar to the time required for an unrelaxed Dequeue. We also give an upper bound for a special case to show that our bounds are tight at that point. To achieve our lower bounds, we use extended shifting arguments, which have rarely been used but allow larger lower bounds than traditional shifting arguments. We use these in a series of inductive indistinguishability proofs which allow us to extend our proofs beyond the usual limitations of shifting arguments. This proof structure is an interesting contribution independent of the main result, as developing new lower bound proof techniques may have many uses in future work.
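To illustrate just the specification being relaxed (this is a sequential toy, not a message-passing implementation, and all names are invented): a multiplicity queue lets Dequeues that overlap in time return the same element. Here "concurrent" is modeled explicitly by tagging each Dequeue with a group id; a real implementation would infer overlap from timing.

```python
# Toy model of multiplicity-queue semantics: Dequeues in the same
# "concurrency group" may repeat the most recently removed value.

class MultiplicityQueue:
    def __init__(self):
        self.items = []
        self.last = None        # (group_id, value) of latest removal

    def enqueue(self, x):
        self.items.append(x)

    def dequeue(self, group):
        if self.last is not None and self.last[0] == group:
            return self.last[1]          # overlapping: same value is legal
        if not self.items:
            return None
        v = self.items.pop(0)
        self.last = (group, v)
        return v

q = MultiplicityQueue()
q.enqueue("a"); q.enqueue("b")
assert q.dequeue(group=1) == "a"
assert q.dequeue(group=1) == "a"         # concurrent Dequeue repeats "a"
assert q.dequeue(group=2) == "b"
```

The paper's surprising result is that even this permissive specification admits little speedup in worst-case Dequeue time over the unrelaxed queue.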
ISBN:
(digital) 9783031099939
ISBN:
(print) 9783031099939; 9783031099922
A multiplicity-relaxed queue or stack data type allows multiple Dequeue or Pop operations to return the same value if they are concurrent. We consider the possible efficiency of message-passing implementations of such data types. We show that both the worst-case and amortized time costs for Dequeues and Pops are nearly as high as the upper bounds on their worst-case time in unrelaxed queues and stacks. Relaxed data types are of interest since they can, in some cases, trade off some of a data type's ordering guarantees for increased performance or easier implementation. The multiplicity relaxation, in particular, is interesting as it has been shown to be less computationally complex than unrelaxed queues and stacks. Our results explore a different aspect of these data types, considering communication time complexity in a message-passing system and showing limits on possible improvements in time performance.