ISBN (print): 9781665481069
In the pooled data problem the goal is to efficiently reconstruct a binary signal from additive measurements. Given a signal σ ∈ {0, 1}^n, we can query multiple entries at once and obtain the total number of non-zero entries in the query as a result. We assume that queries are time-consuming and therefore focus on the setting where all queries are executed in parallel. For the regime where the signal is sparse, such that ‖σ‖_1 = o(n), our results are twofold. First, we propose and analyze a simple and efficient greedy reconstruction algorithm. Second, we derive a sharp information-theoretic threshold for the minimum number of queries required to reconstruct σ with high probability. Our first result matches the performance guarantees of much more involved constructions (Karimi et al. 2019). Our second result extends a result of Alaoui et al. (2014) and Scarlett & Cevher (2017), who studied the pooled data problem for dense signals. Finally, our theoretical findings are complemented with empirical simulations. Our data not only confirm the information-theoretic thresholds but also hint at the practical applicability of our pooling scheme and the simple greedy reconstruction algorithm.
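A minimal sketch of the additive (pooled) query model described above, using brute-force information-theoretic decoding rather than the paper's greedy algorithm; all sizes and pool designs here are illustrative toy choices:

```python
from itertools import combinations
import random

def pooled_query(sigma, pool):
    # Additive measurement: number of non-zero entries of sigma in the pool.
    return sum(sigma[i] for i in pool)

random.seed(0)
n, k = 8, 2                          # toy sizes; the paper's regime is ||sigma||_1 = o(n)
support = set(random.sample(range(n), k))
sigma = [1 if i in support else 0 for i in range(n)]

# Non-adaptive design: all pools are fixed up front and queried in parallel.
pools = [random.sample(range(n), 4) for _ in range(6)]
results = [pooled_query(sigma, p) for p in pools]

# Brute-force decoding: every k-sparse candidate consistent with all measurements.
consistent = [set(c) for c in combinations(range(n), k)
              if all(sum(1 for i in p if i in set(c)) == r
                     for p, r in zip(pools, results))]
assert support in consistent         # the true signal is always consistent
```

When the number of pools exceeds the information-theoretic threshold, the consistent set shrinks to the true signal alone with high probability.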
ISBN (print): 9781665483322
Sorting is one of the most fundamental operations for many applications. For efficient sorting, data locality can be exploited by processing subdivided data in parallel. This work presents a high-performance and area-efficient near-memory radix sort accelerator where end-to-end sorting is performed locally. With a parallel 1-bit radix sorter, it achieves high throughput by processing multiple keys per cycle. Tested on a Xilinx Zynq UltraScale+ ZCU104 FPGA, it achieves up to 10x speedup over a CPU. It is highly area-efficient and can be integrated into each processing node of a distributed computing system with low area cost.
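The 1-bit radix primitive the accelerator parallelizes in hardware can be sketched in software as an LSD radix sort built from stable 1-bit split passes (a software analogy only, not the accelerator's design):

```python
def radix_sort_1bit(keys, bits=8):
    # LSD radix sort: one stable 1-bit partition per pass, least
    # significant bit first. Stability of each pass is what makes the
    # composition of passes a correct sort.
    for b in range(bits):
        zeros = [k for k in keys if not (k >> b) & 1]
        ones = [k for k in keys if (k >> b) & 1]
        keys = zeros + ones          # stable: relative order preserved
    return keys

data = [170, 45, 75, 90, 2, 24, 66]
assert radix_sort_1bit(data) == sorted(data)
```

The hardware version evaluates many such 1-bit splits per cycle instead of one pass at a time.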
ISBN (print): 9781665497473
We show that the problem of computing the minimum spanning tree can be formulated as a special case of detecting a Lattice Linear Predicate (LLP). In general, formulating problems as LLP presents three main advantages: 1) different problems are formulated under a single, general framework, which defines the problem in terms of simple local predicates that must hold for all the elements of a lattice, making the problem (and the solution) compact and easy to understand; 2) improvements on one set of problems can be transferable to other sets of problems; 3) since the problems are stated as a set of local predicates, which can often be tested with little or no synchronization, new opportunities for parallelism often present themselves. In this paper we introduce two parallel algorithms, LLP-Prim and LLP-Boruvka, that improve on their non-LLP counterparts in several ways. LLP-Prim reduces the number of heap operations required by Prim's algorithm by allowing edges to be selected without entering the heap, thus enabling parallelism. LLP-Boruvka improves on Boruvka's algorithm by reducing synchronization, again increasing opportunities for parallelism. Our experimental evaluation shows that LLP-Prim is faster than Prim's algorithm in both single-threaded and multithreaded scenarios and that it provides a good tradeoff between parallelism and efficiency at low core counts. For higher core counts we show how LLP-Boruvka improves on an efficient implementation of a parallel version of Boruvka.
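For reference, the classic sequential Boruvka algorithm that LLP-Boruvka builds on can be sketched as follows (this is the textbook algorithm, not the paper's LLP variant):

```python
def boruvka_mst(n, edges):
    # Classic Boruvka: each round, every component selects its minimum
    # outgoing edge; all selected edges are added and components merge.
    # Rounds are what parallel variants execute concurrently.
    parent = list(range(n))

    def find(x):                     # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, weight, components = [], 0, n
    while components > 1:
        cheapest = {}                # component root -> min outgoing edge
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or (w, u, v) < cheapest[r]:
                    cheapest[r] = (w, u, v)
        if not cheapest:
            break                    # graph is disconnected
        for w, u, v in set(cheapest.values()):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((u, v, w))
                weight += w
                components -= 1
    return weight, mst

# Square graph with a diagonal: MST is the three lightest edges.
w, tree = boruvka_mst(4, [(0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 0, 4), (0, 2, 5)])
assert w == 6 and len(tree) == 3
```

Each round's per-component minimum-edge selection is a local test, which is the kind of predicate the LLP formulation exploits to expose parallelism.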
Modeling the performance of real-world applications at scale is essential for designing next-generation platforms and shaping the development of future algorithms. However, accurately capturing the complexity of appli...
The reverse-time migration (RTM) imaging algorithm, which is used for complex underground structure analysis, is known as one of the current mainstream methods of high-precision seismic imaging. Nowadays, as the fast d...
Software development of high-performance graph algorithms is difficult on modern parallel computers. To simplify this task, we have designed and implemented a collection of C++ graph primitives, basic building blocks,...
ISBN (print): 9781665481069
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNN). SGD iterates over the input data set in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing data set sizes this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have not received a lot of attention in the literature from an empirical standpoint. In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that in practice the validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
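The idea of a partial distributed exchange can be sketched as follows; the function name, parameters, and exchange policy here are illustrative assumptions, not the paper's PyTorch API:

```python
import random

def partial_exchange(partitions, fraction, rng):
    # Each worker sends a random `fraction` of its local partition to a
    # randomly chosen worker, then shuffles what it holds locally.
    # fraction=1.0 approaches a full redistribution; fraction=0.0 is
    # purely local shuffling.
    n_workers = len(partitions)
    outgoing = [[] for _ in range(n_workers)]
    for w in range(n_workers):
        part = partitions[w]
        rng.shuffle(part)
        k = int(len(part) * fraction)
        sent, partitions[w] = part[:k], part[k:]
        for s in sent:
            outgoing[rng.randrange(n_workers)].append(s)
    for w in range(n_workers):
        partitions[w].extend(outgoing[w])
        rng.shuffle(partitions[w])   # local shuffle after the exchange
    return partitions

rng = random.Random(0)
partitions = [list(range(0, 100)), list(range(100, 200))]
partitions = partial_exchange(partitions, 0.25, rng)
# No samples are lost or duplicated by the exchange.
assert sorted(x for p in partitions for x in p) == list(range(200))
```

Tuning `fraction` trades inter-node communication against how quickly samples mix across workers over epochs.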
ISBN (print): 9798350367607; 9798350367591
The paper proposes an approach to building a scalable service for spatiotemporal data storage, intended for applications that require searching for localized data and the possibility of scaling up storage resources. The motivation for this approach was to enable localized data to be overlaid on a map while transferring only the required data and minimizing latency. The paper describes an architecture combining an R*-tree index with Log-Structured Merge (LSM) tree methods. An implementation framework based on Erlang/OTP is proposed to provide a basis for sustainability and resiliency.
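The LSM-tree side of the design can be sketched minimally as a memtable flushed to sorted, immutable runs (illustrative only; the paper's service pairs this with an R*-tree spatial index and Erlang/OTP supervision):

```python
import bisect

class TinyLSM:
    # Minimal LSM sketch: writes land in an in-memory memtable; when it
    # fills, it is flushed as a sorted immutable run. Reads check the
    # memtable first, then runs from newest to oldest, so newer values
    # shadow older ones without in-place updates.
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 3)                   # newer value shadows the flushed one
assert db.get("a") == 3 and db.get("b") == 2 and db.get("z") is None
```

Real LSM stores add compaction of runs and, as in the paper, can layer a secondary spatial index over the keys for bounding-box queries.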
HPC is a widely used term, often referring to the applications, architectures, programming models and tools targeting highly parallel machines such as those of the *** lists. Recent advances in computing hardware re...
ISBN (print): 9781665481069
Large shared-memory many-core nodes have become the norm in scientific computing, and therefore the sparse linear solver stack must adapt to the multilevel structure that exists on these nodes. One adaptation is the development of hybrid solvers at the node level. We present HTS as a hybrid threaded solver that aims to provide a finer-grain algorithm to keep an increased number of threads actively working in these larger shared-memory environments without the overheads of message-passing implementations. Additionally, HTS aims to utilize the additional shared memory that may be available to improve performance, i.e., reducing iteration counts when used as a preconditioner and speeding up calculations. HTS is built around the Schur complement framework that many other hybrid solver packages already use. However, HTS uses a multilevel structure in dealing with the Schur complement and permits fill-in in certain off-diagonal submatrices to enable a faster and more accurate solve phase. These modifications allow a tasking thread library, namely Cilk, to be used to speed up performance while still reducing peak memory by more than 20% on average compared to an optimized direct factorization method. We show that HTS can outperform the MPI-based hybrid solver ShyLU on a suite of sparse matrices by as much as 2x, and show that HTS can scale well on three-dimensional finite difference problems.
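The algebraic core of the Schur complement framework mentioned above can be illustrated on a 2x2 block system with scalar blocks (a toy numeric sketch; HTS operates on large sparse blocks with multilevel structure and fill-in control):

```python
# Block system M = [[A, B], [C, D]], M x = b. Eliminating the first
# block reduces the problem to a solve with A and a solve with the
# Schur complement S = D - C A^{-1} B.
A, B, C, D = 4.0, 2.0, 1.0, 3.0
b1, b2 = 10.0, 7.0

S = D - C * (1.0 / A) * B        # Schur complement of A in M
x2 = (b2 - C * (b1 / A)) / S     # second unknown via the reduced system
x1 = (b1 - B * x2) / A           # back-substitute for the first unknown

# Verify both block equations of M x = b.
assert abs(A * x1 + B * x2 - b1) < 1e-12
assert abs(C * x1 + D * x2 - b2) < 1e-12
```

In a hybrid solver, A corresponds to independent interior subdomains factored in parallel, and S couples the interfaces, which is where HTS's multilevel treatment and controlled fill-in apply.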