Sparse data structures like hash tables, trees, or compressed tensors are ubiquitous, but operations on these structures are expensive and inefficient on current systems. Prior work has proposed hardware acceleration ...
This paper presents Snapgram, an innovative assignment for teaching matrix multiplication in the context of a Parallel Computing course. Designed for fourth-year undergraduates at the University of California, Davis, ...
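The kernel such an assignment builds on is ordinary dense matrix multiplication. As a point of reference (this is just the textbook triple loop that a parallel-computing assignment would parallelize, not the Snapgram assignment code itself), a minimal Python version is:

```python
def matmul(a, b):
    """Naive dense matrix multiply: c[i][j] = sum_k a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    assert len(a[0]) == m, "inner dimensions must agree"
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):          # i-k-j loop order is cache-friendlier than i-j-k
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

# multiplying by the 2x2 identity leaves a matrix unchanged
i2 = [[1, 0], [0, 1]]
print(matmul(i2, [[2, 3], [4, 5]]))  # [[2, 3], [4, 5]]
```

The independent (i, j) output cells are what make the kernel a natural first target for parallelization exercises.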
ISBN: (Print) 9781450394338
This special session will report on the updated NSF/IEEE-TCPP Curriculum on Parallel and Distributed Computing released in November 2020 by the Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER). The purpose of the special session is to obtain SIGCSE community feedback on this curriculum in a highly interactive manner, employing the hybrid modality and supported by a full-time CDER booth for the duration of SIGCSE. In this era of big data, cloud, and multi- and many-core systems, it is essential that computer science (CS) and computer engineering (CE) graduates have basic skills in parallel and distributed computing (PDC). The topics are primarily organized into the areas of architecture, programming, and algorithms. A set of pervasive concepts that percolate across area boundaries is also identified. Version 1 of this curriculum was released in December 2012. That curriculum guideline has over 140 early-adopter institutions worldwide and has been incorporated into the 2013 ACM/IEEE Computer Science curricula. Version II represents a major revision. The updates have focused on enhancing coverage related to the topical aspects of Big Data, Energy, and Distributed Computing. The session will also report on related CDER activities, including a workshop series on a PDC institute conceptualization, developing a CE-oriented version of the curriculum, and identifying a minimal set of PDC topics aligned with ABET's exposure-level PDC requirements. The interested SIGCSE audience includes educators, authors, publishers, curriculum committee members, department chairs and administrators, professional societies, and the computing industry.
ISBN: (Print) 9781450391467
The classical paging problem can be described as follows: given a cache that can hold up to k pages (or blocks) and a sequence of requests to pages, how should we manage the cache so as to maximize performance, or, in other words, complete the sequence as quickly as possible? Whereas this sequential paging problem has been well understood for decades, the parallel version, where the cache is shared among p processors each issuing its own sequence of page requests, has been much more resistant. In this problem we are given p request sequences R_1, R_2, ..., R_p, each of which accesses a disjoint set of pages, and we ask the question: how should the paging algorithm manage the cache to optimize the completion time of all sequences (i.e., the makespan)? As for the classical sequential problem, the goal is to design an online paging algorithm that achieves an optimal competitive ratio, using O(1) resource augmentation. In a recent breakthrough, Agrawal et al. [SODA '21] showed that the optimal (deterministic) competitive ratio C for this problem is in the range Ω(log p) ≤ C ≤ O(log² p). This paper closes that gap, showing how to achieve a competitive ratio C = O(log p). Our techniques reveal surprising combinatorial differences between the problem of optimizing makespan and that of optimizing the closely related metric of mean completion time; and yet our algorithm manages to be simultaneously asymptotically optimal for both tasks.
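For readers unfamiliar with the baseline, the sequential problem that the parallel version generalizes can be made concrete in a few lines. The sketch below simulates a k-page cache under the classical LRU eviction policy and counts page faults; it only illustrates the problem statement, not the paper's parallel algorithm:

```python
from collections import OrderedDict

def lru_faults(requests, k):
    """Simulate a k-page cache under LRU; return the number of page faults.
    Each fault models the unit-cost stall that completion time sums over."""
    cache = OrderedDict()   # keys = resident pages, insertion order = recency
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)        # hit: refresh recency
        else:
            faults += 1                    # miss: must fetch the page
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return faults

print(lru_faults([1, 2, 3, 1, 2, 4, 1, 2], k=3))  # 4
```

The cyclic sequence 1, 2, 3, 4, 1, 2, 3, 4 with k = 3 faults on every request, the classic worst case that motivates resource augmentation.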
ISBN: (Print) 9781611977073
Clustering of data in metric spaces is a fundamental problem with many applications in data mining, and it is often used as an unsupervised learning tool inside other machine learning systems. In many scenarios where we are concerned with the privacy implications of clustering users, clusters are required to satisfy a minimum-size constraint. A canonical example of min-size clustering is in enforcing anonymization and the protection of the privacy of user data. Our work is motivated by real-world applications (such as the Federated Learning of Cohorts project, FLoC) where a min-size clustering algorithm needs to handle very large amounts of data, and the data may also change over time. Thus, efficient parallel or dynamic algorithms are desired. In this paper, we study the r-gather problem, a natural formulation of minimum-size clustering in metric spaces. The goal of r-gather is to partition n points into clusters such that each cluster has size at least r, and the maximum radius of the clusters is minimized. This additional constraint completely changes the algorithmic nature of the problem, and many clustering techniques fail. Also, previous dynamic and parallel algorithms do not achieve desirable complexity. We propose algorithms both in the Massively Parallel Computation (MPC) model and in the dynamic setting. Our MPC algorithm handles input points from the Euclidean space R^d. It computes an O(1)-approximate solution of r-gather in O(log_ε n) rounds using total space O(n^(1+γ) · d) for arbitrarily small constants ε, γ > 0. In addition, our algorithm is fully scalable, i.e., there is no lower bound on the memory per machine. Our dynamic algorithm maintains an O(1)-approximate r-gather solution under insertions/deletions of points in a metric space with doubling dimension d. The update time is r · 2^O(d) · log^O(1) Λ and the query time is 2^O(d) · log^O(1) Λ, where Λ is the ratio between the largest and the smallest distance. To obtain our r...
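The r-gather objective can be made concrete with a small checker. The sketch below is only an illustration of the problem definition, not the paper's MPC or dynamic algorithm, and the helper names are invented: it validates a candidate partition against the min-size constraint and reports the maximum cluster radius that r-gather minimizes.

```python
import math

def cluster_radius(members, center):
    """Radius of one cluster: max Euclidean distance from center to a member."""
    return max(math.dist(center, p) for p in members)

def r_gather_cost(clusters, r):
    """Max radius over a candidate partition, given as (center, members) pairs;
    raises if any cluster violates the minimum-size constraint."""
    for center, members in clusters:
        if len(members) < r:
            raise ValueError(f"cluster of size {len(members)} violates r={r}")
    return max(cluster_radius(members, center) for center, members in clusters)

clusters = [
    ((0.0, 0.0), [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]),
    ((5.0, 5.0), [(5.0, 5.0), (5.0, 7.0), (6.0, 5.0)]),
]
print(r_gather_cost(clusters, r=3))  # 2.0, from point (5, 7) in the second cluster
```

Shrinking the radius while keeping every cluster at size at least r is exactly the tension that breaks standard clustering heuristics.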
We present a parallel k-clique listing algorithm with improved work bounds (for the same depth) in sparse graphs with low degeneracy or arboricity. We achieve this by introducing and analyzing a new pruning criterion ...
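The excerpt does not show the paper's new pruning criterion, but the standard baseline such results build on can be sketched: compute a degeneracy (smallest-last) ordering, orient every edge from earlier to later in that ordering so each vertex has a small out-neighborhood, and enumerate cliques inside those out-neighborhoods. A minimal sequential k = 3 (triangle-listing) version, an illustration of the general technique rather than the paper's parallel algorithm:

```python
from itertools import combinations

def degeneracy_order(adj):
    """Smallest-last ordering: repeatedly remove a minimum-degree vertex.
    (O(n^2) for clarity; bucket queues make this linear time.)"""
    deg = {v: len(ns) for v, ns in adj.items()}
    order = []
    while deg:
        v = min(deg, key=deg.get)
        order.append(v)
        del deg[v]
        for u in adj[v]:
            if u in deg:
                deg[u] -= 1
    return order

def list_triangles(adj):
    """List 3-cliques by orienting edges along a degeneracy order and
    intersecting the (small) out-neighborhoods."""
    rank = {v: i for i, v in enumerate(degeneracy_order(adj))}
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    tris = set()
    for v in adj:
        for a, b in combinations(sorted(out[v], key=rank.get), 2):
            if b in out[a]:            # a precedes b, so the edge is oriented a -> b
                tris.add(tuple(sorted((v, a, b))))
    return sorted(tris)

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(list_triangles(adj))  # [(1, 2, 3), (1, 3, 4)]
```

In a graph of degeneracy d, every out-neighborhood has size at most d, which is the structural fact that work bounds of this kind exploit.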
Author:
Kepner, Jeremy (MIT Lincoln Laboratory Supercomputing Center, Lexington, MA 02421, USA)
ISBN: (Print) 9781450391467
Groundbreaking work analyzing early Internet data revealed novel phenomena that became the basis of a new endeavor: Network Science. This exciting new field has revealed fundamental properties of communication, social, and biological networks. Simultaneously, the Internet has expanded enormously and is now a domain of activity as important to civilization as land, sea, air, and space. The initial Internet observations that nurtured network science have ballooned into the largest dynamic streaming data sets available, creating fresh opportunities to examine the foundations of network science in previously unimagined detail. The analysis of streaming networks with trillions of events has stimulated the development of novel mathematics (e.g., associative array algebra), algorithms (e.g., hypersparse neural networks), software (e.g., ***), and hardware. All of these capabilities are critically dependent on parallel processing. Application of these developments to the world's largest publicly available streaming event datasets has revealed a variety of new phenomena.
ISBN: (Print) 9781450387132
Increasing performance requirements in the embedded systems domain have encouraged a drift from single-core to multicore processors. Cars are an example of complex embedded systems in which the use of multicores continues to grow. The requirements of software components running in modern cars are diverse. On the one hand, there are safety-critical tasks like airbag control; on the other hand, there are tasks which do not have any safety-related requirements at all, for example those controlling the infotainment system. Trends like autonomous driving lead to tasks which are simultaneously safety-critical and computationally complex. To satisfy the requirements of modern embedded applications, we developed a dataflow-based runtime environment (RTE) for clustered manycore architectures. The RTE is able to execute dataflow graphs in various redundancy configurations and with different schedulers. We implemented our RTE design on the Kalray Bostan Massively Parallel Processor Array and evaluated all possible configurations for three common computation tasks. To classify the performance of our RTE, we compared the non-redundant graph executions with OpenCL versions of the three applications. The results show that our RTE can come close to or even surpass Kalray's OpenCL framework, although maximum performance was not the primary goal of our design.
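The redundancy idea can be illustrated abstractly. The toy sketch below (all names are invented; this is not the authors' Kalray RTE) executes a dataflow graph in topological order and, when a redundancy factor greater than one is configured, runs each actor multiple times and compares the results, the duplicate-and-compare scheme typically used for safety-critical tasks:

```python
def run_dataflow(graph, funcs, inputs, redundancy=1):
    """Execute a DAG of actors. graph: node -> list of predecessors;
    funcs: node -> callable; inputs: externally supplied values.
    With redundancy > 1, each actor runs several times and must agree."""
    order, done, values = [], set(), dict(inputs)
    def visit(n):                      # depth-first topological sort
        if n in done:
            return
        for p in graph.get(n, []):
            visit(p)
        done.add(n)
        order.append(n)
    for n in funcs:
        visit(n)
    for n in order:
        if n in values:                # graph input, already provided
            continue
        args = [values[p] for p in graph[n]]
        results = [funcs[n](*args) for _ in range(redundancy)]
        if any(r != results[0] for r in results[1:]):
            raise RuntimeError(f"redundant executions of {n!r} disagree")
        values[n] = results[0]
    return values

funcs = {"scale": lambda x: 2 * x, "add": lambda a, b: a + b}
graph = {"scale": ["in"], "add": ["scale", "in"]}
print(run_dataflow(graph, funcs, {"in": 5}, redundancy=2)["add"])  # 15
```

A real RTE would additionally map the duplicated executions onto distinct cores or clusters so that a hardware fault in one does not corrupt both copies.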
ISBN: (Print) 9798400700958
The proceedings contain 84 papers. The topics discussed include: Astrea: accurate quantum error-decoding via practical minimum-weight perfect-matching; TaskFusion: an efficient transfer learning architecture with dual delta sparsity for multi-task natural language processing; scaling qubit readout with hardware efficient machine learning architectures; enabling high performance debugging for variational quantum algorithms using compressed sensing; HAAC: a hardware-software co-design to accelerate garbled circuits; inter-layer scheduling space definition and exploration for tiled accelerators; ArchGym: an open-source gymnasium for machine learning assisted architecture design; SHARP: a short-word hierarchical accelerator for robust and practical fully homomorphic encryption; V10: hardware-assisted NPU multi-tenancy for improved resource utilization and fairness; GenDP: a framework of dynamic programming acceleration for genome sequencing analysis; CamJ: enabling system-level energy modeling and architectural exploration for in-sensor visual computing; and an algorithm and architecture co-design for accelerating smart contracts in blockchain.
ISBN: (Print) 9781450386104
Nature has inspired many problem-solving techniques over the decades. More recently, researchers have increasingly turned to harnessing nature to solve problems directly. Ising machines are a good example, and there are numerous research prototypes as well as many design concepts. They can map a family of NP-complete problems and derive competitive solutions at speeds much greater than conventional algorithms, and in some cases at a fraction of the energy cost of a von Neumann computer. However, physical Ising machines are often fixed in their problem-solving capacity. Without any support, a bigger problem cannot be solved at all, and with a simple divide-and-conquer strategy, it turns out, the advantage of using an Ising machine quickly diminishes. It is therefore desirable for Ising machines to have a scalable architecture where multiple instances can collaborate to solve a bigger problem. We discuss scalable architecture design issues which lead to a multiprocessor Ising machine architecture. Experimental analyses show that our proposed architectures allow an Ising machine to scale in capacity and maintain its significant performance advantage (about 2200x speedup over a state-of-the-art computational substrate). In the case of communication bandwidth-limited systems, our proposed optimizations in supporting batch mode operation can cut down communication demand by about 4-5x without a significant impact on solution quality.
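The problem an Ising machine solves can be stated compactly: given symmetric couplings J_ij and spins s_i in {-1, +1}, find s minimizing H(s) = -Σ_{i<j} J_ij s_i s_j. The sketch below solves a tiny instance in software with Metropolis simulated annealing; it only illustrates the formulation (a physical Ising machine replaces this loop with analog dynamics, and the instance here is a made-up 4-spin ferromagnet):

```python
import math
import random

def ising_energy(J, s):
    """H(s) = -sum_{i<j} J[i][j] * s[i] * s[j], with spins s[i] in {-1, +1}."""
    n = len(s)
    return -sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))

def anneal(J, steps=2000, t0=2.0, seed=0):
    """Metropolis single-spin-flip annealing over a symmetric coupling matrix J;
    returns the best spin vector found and its energy."""
    rng = random.Random(seed)
    n = len(J)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    e = ising_energy(J, s)
    best, best_e = s[:], e
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9          # linear cooling schedule
        i = rng.randrange(n)
        # energy change from flipping spin i: 2 * s_i * (local field at i)
        delta = 2 * s[i] * sum(J[i][j] * s[j] for j in range(n) if j != i)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            s[i] = -s[i]
            e += delta
            if e < best_e:
                best, best_e = s[:], e
    return best, best_e

# 4-spin ferromagnet (all couplings +1): the ground state aligns all spins, H = -6
J = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]
best, best_e = anneal(J)
print(best_e)
```

When an instance exceeds a machine's spin capacity, the naive fix is to cut J into blocks and solve them independently; the couplings severed by the cut are exactly what makes that divide-and-conquer approach lose the machine's advantage.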