ISBN:
(Print) 9781450362955
One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters, and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.
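To make the ObjectContainer idea more concrete, the sketch below shows one way such an abstraction could dispatch on the element type: trivially copyable values are moved as raw bytes, while complex types fall back to explicit serialization. The trait name, layout, and string specialization are illustrative assumptions, not BCL's actual implementation.

```cpp
// Sketch of an ObjectContainer-style serialization trait (illustrative only;
// names and layout are assumptions, not BCL's real API or implementation).
#include <cstring>
#include <iostream>
#include <string>
#include <type_traits>
#include <vector>

// Trivially copyable types are stored as raw bytes with no extra work;
// complex types fall back to an explicit serialize/deserialize pair.
template <typename T, typename Enable = void>
struct object_container;

template <typename T>
struct object_container<T, std::enable_if_t<std::is_trivially_copyable_v<T>>> {
  static std::vector<char> serialize(const T& value) {
    std::vector<char> buf(sizeof(T));
    std::memcpy(buf.data(), &value, sizeof(T));
    return buf;
  }
  static T deserialize(const std::vector<char>& buf) {
    T value;
    std::memcpy(&value, buf.data(), sizeof(T));
    return value;
  }
};

template <>
struct object_container<std::string> {
  static std::vector<char> serialize(const std::string& s) {
    return std::vector<char>(s.begin(), s.end());
  }
  static std::string deserialize(const std::vector<char>& buf) {
    return std::string(buf.begin(), buf.end());
  }
};

int main() {
  auto ints = object_container<int>::serialize(42);            // raw bytes
  auto text = object_container<std::string>::serialize("hi");  // explicit copy
  std::cout << object_container<int>::deserialize(ints) << ' '
            << object_container<std::string>::deserialize(text) << '\n';
}
```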
ISBN:
(Print) 9781728159874
Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures.
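The abstract does not reproduce the performance model itself, so the following is a toy cost model in the same spirit: it charges an RDMA-based operation one network round trip per dependent remote access, and charges an RPC one round trip plus handler time plus an attentiveness delay. The constants and formulas are assumptions for illustration only.

```cpp
// Toy cost model contrasting RDMA and RPC styles for a single remote
// data-structure operation. The formulas and constants are illustrative
// assumptions, not the model from the paper.
#include <iostream>

struct Params {
  double latency_us;        // one-way network latency
  double rdma_overhead_us;  // per-RDMA-operation injection overhead
  double rpc_handler_us;    // time for the remote CPU to run the handler
  double attentive_us;      // expected wait until the remote side polls
};

// RDMA: each dependent access (e.g., a pointer chase into a bucket chain)
// is a separate network round trip, but needs no remote CPU involvement.
double rdma_cost(const Params& p, int round_trips) {
  return round_trips * (2 * p.latency_us + p.rdma_overhead_us);
}

// RPC: one round trip, but the remote CPU must notice and run the handler.
double rpc_cost(const Params& p) {
  return 2 * p.latency_us + p.attentive_us + p.rpc_handler_us;
}

int main() {
  Params p{1.5, 0.2, 0.5, 3.0};
  for (int hops = 1; hops <= 4; ++hops) {
    std::cout << hops << " RDMA round trips: " << rdma_cost(p, hops)
              << " us, RPC: " << rpc_cost(p) << " us -> "
              << (rdma_cost(p, hops) < rpc_cost(p) ? "RDMA" : "RPC") << '\n';
  }
}
```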
ISBN:
(Digital) 9783662451748
ISBN:
(Print) 9783662451748
Linearizability is a powerful consistency condition but can be expensive to implement. Recently, researchers have suggested gaining performance by relaxing the sequential specification of objects' data types. We consider, for the first time, linearizable message-passing implementations of relaxed Queues and prove upper and lower bounds on the elapsed time for Dequeue operations both in the worst case and on average. Our results imply that worst-case time complexity does not indicate benefit from relaxation. In contrast, we present implementations of relaxed Queues for which the average time complexity of Dequeue is significantly smaller than both the worst-case lower bound for unrelaxed Queues and a newly-proved lower bound on the average time for unrelaxed Queues. We also prove lower bounds on the average time complexity of Dequeue for relaxed Queues that show our algorithms are asymptotically optimal and that there is an inherent complexity gap between different levels of relaxation.
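As a concrete reminder of what relaxing a sequential specification means, the sketch below gives one common relaxation of a FIFO queue, in which Dequeue may return any of the k oldest elements. The exact relaxations analyzed in the paper may differ, and the class, method, and parameter names here are illustrative.

```cpp
// Sequential specification of a simple out-of-order relaxed queue: Dequeue
// may return any of the k oldest elements. One illustrative relaxation only.
#include <algorithm>
#include <cstdlib>
#include <deque>
#include <iostream>
#include <optional>

template <typename T>
class RelaxedQueue {
  std::deque<T> items_;
  std::size_t k_;
 public:
  explicit RelaxedQueue(std::size_t k) : k_(k) {}
  void Enqueue(const T& x) { items_.push_back(x); }
  // Any of the first min(k, size) elements is a legal return value.
  std::optional<T> Dequeue() {
    if (items_.empty()) return std::nullopt;
    std::size_t window = std::min(k_, items_.size());
    std::size_t i = std::rand() % window;   // nondeterministic legal choice
    T out = items_[i];
    items_.erase(items_.begin() + i);
    return out;
  }
};

int main() {
  RelaxedQueue<int> q(2);
  for (int i = 1; i <= 4; ++i) q.Enqueue(i);
  while (auto v = q.Dequeue()) std::cout << *v << ' ';  // e.g. 2 1 3 4
  std::cout << '\n';
}
```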
Remote Direct Memory Access (RDMA) has become a standard networking technology and is prominently used in high-performance applications. While RDMA can provide both excellent performance and novel capabilities, it can be more difficult than traditional kernel-supported networking to use effectively. RDMA adoption can be simplified through the use of abstractions that target broad classes of applications. In this thesis, we propose flow structures as an abstraction targeting distributed applications in which data need to flow directly between producers and consumers. Flow structures are asymmetric distributed producer/consumer data structures in which some servers only produce and others only consume. We present a design space for RDMA implementations of flow structures and illustrate the space using several concrete implementations of a simple flow structure: a FIFO queue. Finally, we present an empirical study that illustrates the design tradeoffs along the dimensions of the design space, and that shows that flow structure abstractions can capture the performance advantages of RDMA while making it much easier to use.
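The sketch below illustrates, within a single process, one point in such a design space: a fixed-capacity ring buffer in which the producer's stores stand in for RDMA puts into the consumer's memory, and the consumer only reads locally. The layout and protocol are assumptions for illustration, not one of the thesis's concrete implementations.

```cpp
// Single-process sketch of a flow-structure FIFO: the producer's writes
// stand in for RDMA puts into the consumer's exposed memory region, and the
// consumer polls only its local copy. Illustrative assumptions throughout.
#include <array>
#include <atomic>
#include <cstddef>
#include <iostream>
#include <optional>

constexpr std::size_t kCapacity = 8;

struct ConsumerRegion {                 // memory the consumer exposes for RDMA
  std::array<int, kCapacity> slots{};
  std::atomic<std::size_t> tail{0};     // written remotely by the producer
  std::size_t head = 0;                 // private to the consumer
};

bool produce(ConsumerRegion& r, std::size_t& local_tail, int value) {
  // The producer tracks the tail locally; a real design would also use a
  // cached copy of head rather than reading it remotely on every push.
  std::size_t next = local_tail + 1;
  if (next - r.head > kCapacity) return false;      // queue is full
  r.slots[local_tail % kCapacity] = value;           // "RDMA put" of payload
  r.tail.store(next, std::memory_order_release);     // "RDMA put" of tail
  local_tail = next;
  return true;
}

std::optional<int> consume(ConsumerRegion& r) {
  if (r.head == r.tail.load(std::memory_order_acquire)) return std::nullopt;
  return r.slots[r.head++ % kCapacity];              // purely local access
}

int main() {
  ConsumerRegion region;
  std::size_t producer_tail = 0;
  for (int i = 0; i < 5; ++i) produce(region, producer_tail, i * 10);
  while (auto v = consume(region)) std::cout << *v << ' ';
  std::cout << '\n';
}
```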
ISBN:
(Print) 9798400704161
Ensuring data consistency under remote direct memory access (RDMA) is challenging due to the combined effects of various hardware components. This brief announcement introduces remote object memory (ROMe), the first technique to guarantee wait-free consistent reads of arbitrarily sized objects over RDMA without the use of specialized hardware, and while allowing the concurrent execution of conflicting local updates. We integrated ROMe into ROMe-KV, an RDMA-enabled key-value store whose underlying B-link tree nodes are ROMe objects, enabling support for wait-free linearizable range queries.
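The announcement does not describe ROMe's algorithm, so for contrast the sketch below shows the familiar seqlock-style baseline for consistent multi-word reads: a reader retries whenever a version counter changed during the read, which makes it consistent but not wait-free, and that is precisely the limitation ROMe claims to remove. All names and details are illustrative assumptions, not ROMe's mechanism.

```cpp
// Seqlock-style baseline for consistent reads of a multi-word object; a
// reader can retry indefinitely under continuous updates, so it is not
// wait-free. Illustrative only; nothing here is ROMe's technique.
#include <array>
#include <atomic>
#include <cstdint>
#include <iostream>

struct Object {
  std::atomic<uint64_t> version{0};     // even = stable, odd = update in flight
  std::array<int, 4> words{};
};

void write_object(Object& o, const std::array<int, 4>& value) {
  o.version.fetch_add(1, std::memory_order_acq_rel);   // mark unstable (odd)
  o.words = value;
  o.version.fetch_add(1, std::memory_order_release);   // mark stable (even)
}

std::array<int, 4> read_object(const Object& o) {
  while (true) {
    uint64_t v1 = o.version.load(std::memory_order_acquire);
    std::array<int, 4> snapshot = o.words;              // would be an RDMA read
    uint64_t v2 = o.version.load(std::memory_order_acquire);
    if (v1 == v2 && (v1 % 2) == 0) return snapshot;      // consistent snapshot
  }                                                      // otherwise retry
}

int main() {
  Object o;
  write_object(o, {1, 2, 3, 4});
  for (int x : read_object(o)) std::cout << x << ' ';
  std::cout << '\n';
}
```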
ISBN:
(Digital) 9789819723409
ISBN:
(Print) 9789819723393; 9789819723409
This paper revisits the study of (minimum) broadcast graphs, i.e., graphs enabling fast information dissemination from every source node to all the other nodes (and having the minimum number of edges for this property). This study is performed in the framework of compact distributed data structures, that is, when the broadcast protocols are constrained to be encoded at each node as an ordered list of neighbors specifying, upon reception of a message, in which order this message must be passed to these neighbors. We show that this constraint does not limit the power of broadcast protocols, as far as the design of (minimum) broadcast graphs is concerned. Specifically, we show that, for every n, there are n-node graphs for which it is possible to design protocols encoded by lists yet enabling broadcast in ⌈log₂ n⌉ rounds from every source, which is optimal even for general (i.e., non-space-constrained) broadcast protocols. Moreover, we show that, for every n, there exist such graphs with the additional property that they are asymptotically as sparse as the sparsest graphs for which ⌈log₂ n⌉-round broadcast protocols exist, up to a constant multiplicative factor. Concretely, these graphs have O(n · L(n)) edges, where L(n) is the number of leading 1s in the binary representation of n - 1, and general minimum broadcast graphs are known to have Ω(n · L(n)) edges.
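To see why an ordered neighbor list per node can suffice, the sketch below simulates only the easy case n = 2^d with a fixed source on a hypercube: each node's list holds the dimensions above its own highest set bit, and relaying one list entry per round informs every node within log₂ n rounds. The paper's constructions, which handle every source and general n with near-minimum edge counts, are more involved; this is purely an illustration.

```cpp
// List-encoded broadcast on an n = 2^d hypercube from source 0: a node
// informed in round r forwards to its list entries in rounds r+1, r+2, ...
// All nodes are informed by round d = log2(n). Illustration only.
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  const int d = 4, n = 1 << d;                      // 16-node hypercube
  // Build each node's ordered neighbor list: dimensions above its highest
  // set bit, in increasing order (node 0 lists all dimensions).
  std::vector<std::vector<int>> list(n);
  for (int v = 0; v < n; ++v) {
    int high = -1;
    for (int bit = 0; bit < d; ++bit) if (v & (1 << bit)) high = bit;
    for (int bit = high + 1; bit < d; ++bit) list[v].push_back(v ^ (1 << bit));
  }

  // Simulate rounds of the protocol.
  std::vector<int> informed(n, -1);
  informed[0] = 0;                                   // source
  for (int round = 1; round <= d; ++round) {
    for (int v = 0; v < n; ++v) {
      if (informed[v] < 0 || informed[v] >= round) continue;
      int idx = round - informed[v] - 1;             // which list entry fires
      if (idx < (int)list[v].size() && informed[list[v][idx]] < 0)
        informed[list[v][idx]] = round;
    }
  }
  int last = *std::max_element(informed.begin(), informed.end());
  std::cout << "all " << n << " nodes informed by round " << last
            << " (log2 n = " << d << ")\n";
}
```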
ISBN:
(Print) 9798350387117; 9798350387124
Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic challenges, particularly when managing sparse data, a pivotal component in progressing data science. The spatial computer model quantifies communication locality by weighting processor communication costs by distance, introducing a term named energy. Moreover, it integrates depth, a widely utilized metric, to promote high parallelism. We propose and analyze a framework for efficient spatial tree algorithms within the spatial computer model. Our primary method constructs a spatial tree layout that optimizes the locality of the neighbors in the compute grid. This approach thereby enables locality-optimized messaging within the tree. Our layout achieves a polynomial factor improvement in energy compared to utilizing a PRAM approach. Using this layout, we develop energy-efficient treefix sum and lowest common ancestor algorithms, which are both fundamental building blocks for other graph algorithms. With high probability, our algorithms exhibit near-linear energy and poly-logarithmic depth. Our contributions augment a growing body of work demonstrating that computations can have both high spatial locality and low depth. Moreover, our work constitutes an advancement in the spatial layout of irregular and sparse computations.
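For reference, the snippet below computes a treefix (rootfix) sum sequentially, i.e., for every node the sum of the values on its root-to-node path. It is only the plain computation that the paper's spatial algorithms parallelize with near-linear energy and poly-logarithmic depth; nothing here models the spatial layout itself.

```cpp
// Sequential reference for a rootfix (treefix) sum: out[v] is the sum of
// value[] along the path from the root to v. Illustrative baseline only.
#include <iostream>
#include <vector>

void rootfix(int v, long long prefix,
             const std::vector<std::vector<int>>& child,
             const std::vector<int>& value, std::vector<long long>& out) {
  out[v] = prefix + value[v];
  for (int c : child[v]) rootfix(c, out[v], child, value, out);
}

int main() {
  // A small tree: node 0 is the root with children 1 and 2; node 1 has child 3.
  std::vector<std::vector<int>> child = {{1, 2}, {3}, {}, {}};
  std::vector<int> value = {1, 2, 3, 4};
  std::vector<long long> out(value.size());
  rootfix(0, 0, child, value, out);
  for (auto s : out) std::cout << s << ' ';   // prints: 1 3 4 7
  std::cout << '\n';
}
```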
ISBN:
(Print) 9798400701214
A multiplicity queue is a concurrently-defined data type which relaxes the conditions of a linearizable FIFO queue by allowing concurrent Dequeue instances to return the same value. It would seem that this should allow faster message-passing implementations, as processes should not need to wait as long to learn about concurrent operations, and previous work has shown that multiplicity queues are computationally less complex than the unrelaxed version. Intriguingly, recent work has shown that there is, in fact, little possible speedup versus an unrelaxed queue. Seeking to understand this difference between intuition and real behavior, we increase the lower bound for uniform algorithms. Further, we outline a path toward building proofs for even higher lower bounds, hypothesizing that the worst-case time to Dequeue approaches the maximum message delay, which is similar to the time required for an unrelaxed Dequeue. We also give an upper bound for a special case to show that our bounds are tight at that point. To achieve our lower bounds, we use extended shifting arguments, which have rarely been used but allow larger lower bounds than traditional shifting arguments. We use these in a series of inductive indistinguishability proofs which allow us to extend our proofs beyond the usual limitations of shifting arguments. This proof structure is an interesting contribution independently of the main result, as developing new lower bound proof techniques may have many uses in future work.
ISBN:
(Digital) 9783031099939
ISBN:
(Print) 9783031099939; 9783031099922
A multiplicity-relaxed queue or stack data type allows multiple Dequeue or Pop operations to return the same value if they are concurrent. We consider the possible efficiency of message-passing implementations of such data types. We show that both the worst-case and amortized time costs for Dequeues and Pops are nearly as high as the upper bounds for their worst-case time in unrelaxed queues and stacks. Relaxed data types are of interest since they can in some cases trade off some of the data types' ordering guarantees for increased performance or easier implementation. The multiplicity relaxation, in particular, is interesting as it has been shown to be less computationally complex than unrelaxed queues and stacks. Our results explore a different aspect of these data types, considering communication time complexity in a message-passing system and showing limits on possible time-performance improvements.
In this work, we propose D-3-Tree, a dynamic distributed deterministic structure for data management in decentralized networks, by engineering and extending an existing decentralized structure. Conducting an extensive experimental study, we verify that the implemented structure outperforms other well-known hierarchical tree-based structures since it provides better complexities regarding load-balancing operations. More specifically, the structure achieves an O(log N) amortized bound (N is the number of nodes present in the network), using an efficient deterministic load-balancing mechanism, which is general enough to be applied to other hierarchical tree-based structures. Moreover, our structure achieves O(log N) worst-case search performance. Last but not least, we investigate the structure's fault tolerance, which has not been sufficiently tackled in previous work, both theoretically and through rigorous experimentation. We prove that D-3-Tree is highly fault-tolerant and achieves O(log N) amortized search cost under massive node failures, accompanied by a significant success rate. Afterwards, by incorporating this novel balancing scheme into the ART (Autonomous Range Tree) structure, we go one step further to achieve sub-logarithmic complexity and propose the ART+ structure. ART+ achieves an O(log_b^2 log N) communication cost for query and update operations (where b is a double-exponential power of 2 and N is the total number of nodes). Moreover, ART+ is a fully dynamic and fault-tolerant structure, which supports the join/leave node operations in O(log log N) expected number of hops with high probability (WHP) and performs load-balancing in O(log log N) amortized cost.