Search Results - Inner Mongolia University Library

37th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Satake, Moto; Takahashi, Keichi; Shimomura, Yoichi; Takizawa, Hiroyuki. Affiliations: Tohoku Univ Grad Sch Informat Sci Sendai Miyagi Japan; Tohoku Univ Cybersci Ctr 6-3 Aramaki Aza Aoba Sendai Miyagi 9808578 Japan

ISBN: (print) 9798350311990

Parameter surveys with an MPI application are extensively used, where the optimality of each parameter setting is evaluated by executing the application; each such execution is called a trial. Bayesian optimization can find a suboptimal parameter setting with fewer trials. Parallel Bayesian optimization (PBO) executes multiple trials in parallel to reduce the execution time. However, evaluating too many trials at once makes the search likely to be trapped in a local minimum, which degrades the final solution. In addition, under a computing resource constraint, the more trials are executed in parallel, the fewer computing resources each MPI application run can use. As a result, the execution time might become longer. Thus, there is a trade-off in the number of parallel trials, and the parameter search generally becomes less explorative and more exploitative as the number of parallel trials decreases. This paper proposes a method to dynamically adjust the number of parallel trials according to the degree of progress in optimization. The evaluation results demonstrate that the proposed method can adapt to the optimization problem and hence is robust across problems. As a result, the proposed method can achieve both a faster improvement speed and a better final solution in many cases.

Keywords: Auto tuning; Bayesian optimization; parallel computing

Scheduling with Many Shared Resources

37th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Deppert, Max A.; Jansen, Klaus; Maack, Marten; Pukrop, Simon; Rau, Malin. Affiliations: Univ Kiel Dept Comp Sci Kiel Germany; Paderborn Univ Paderborn Heinz Nixdorf Inst Paderborn Germany; Paderborn Univ Dept Comp Sci Paderborn Germany

ISBN: (print) 9798350337662

Consider the many shared resources scheduling problem where jobs have to be scheduled on identical parallel machines with the goal of minimizing the makespan. However, each job needs exactly one additional shared resource in order to be executed and hence prevents the execution of jobs that need the same resource while being processed. Previously, an approximation ratio of asymptotically 2 was the best known result for this problem. Furthermore, a 6/5-approximation for the case with only two machines was known as well as a PTAS for the case with a constant number of machines. We present a simple and fast 5/3-approximation and a much more involved but still reasonable 1.5-approximation. Furthermore, we provide a PTAS for the case with only a constant number of machines, which is arguably simpler and faster than the previously known one, as well as a PTAS with resource augmentation for the general case. The approximation schemes make use of the N-fold integer programming machinery, which has found more and more applications in the field of scheduling recently. It is plausible that the latter results can be improved and extended to more general cases. Lastly, we give an inapproximability result for the natural problem extension where each job may need up to a constant number of different resources, namely 3, ruling out better than 5/4 approximations for that case.

Keywords: Scheduling; Approximation; Parallel Identical Machines; Resource Constraints; Conflicts
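To make the problem statement in the abstract above concrete, the following is a minimal sketch of the model only. It is not the paper's 5/3- or 1.5-approximation. It computes two elementary lower bounds on the makespan (average machine load, and the total load of any single resource, since jobs sharing a resource cannot overlap) and runs a simple greedy list scheduler that respects the resource conflicts. The function names, the `(processing_time, resource_id)` job encoding, and the "largest remaining resource load first" rule are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def makespan_lower_bound(jobs, m):
    """jobs: list of (processing_time, resource_id); m: number of machines."""
    total = sum(p for p, _ in jobs)
    class_load = defaultdict(int)
    for p, r in jobs:
        class_load[r] += p
    # Jobs sharing a resource run one after another, so each resource's
    # total load is a lower bound, as is the average machine load.
    return max(total / m, max(class_load.values()))


def greedy_schedule(jobs, m):
    """Greedy heuristic (illustrative assumption, no approximation guarantee claimed).

    Returns (makespan, schedule) with schedule entries (start, machine, job_index).
    """
    pending = defaultdict(list)
    for idx, (p, r) in enumerate(jobs):
        pending[r].append((p, idx))
    for r in pending:
        pending[r].sort()  # pop() then yields the longest remaining job
    remaining = {r: sum(p for p, _ in js) for r, js in pending.items()}

    idle = list(range(m))
    running = []   # min-heap of (finish_time, machine, resource)
    busy = set()   # resources currently held by a running job
    now = 0
    schedule = []
    while any(pending.values()) or running:
        # Start jobs on idle machines, preferring resources with most work left.
        free = sorted(
            (r for r in pending if pending[r] and r not in busy),
            key=lambda r: -remaining[r],
        )
        for r in free:
            if not idle:
                break
            machine = idle.pop()
            p, idx = pending[r].pop()
            remaining[r] -= p
            busy.add(r)
            heapq.heappush(running, (now + p, machine, r))
            schedule.append((now, machine, idx))
        # Advance time to the next job completion.
        now, machine, r = heapq.heappop(running)
        idle.append(machine)
        busy.discard(r)
    return now, schedule


if __name__ == "__main__":
    jobs = [(3, "a"), (2, "a"), (4, "b"), (1, "c")]
    print(makespan_lower_bound(jobs, 2), greedy_schedule(jobs, 2)[0])
```

Comparing the greedy makespan against the lower bound gives a quick sanity check on small instances; the gap between the two is what the approximation ratios in the abstract bound in the worst case.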

An Efficient 2D Method for Training Super-Large Deep Learning Models

37th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Xu, Qifan; You, Yang. Affiliations: Univ Calif Los Angeles Los Angeles CA 90095 USA; Natl Univ Singapore Singapore Singapore

ISBN: (print) 9798350337662

Since the rise of Transformer [22] and BERT [6], large language models [7, 12] have been proposed and shown unprecedented performance in tasks like translation, classification, and text generation. However, due to the memory constraint, model parallelism must be used to split the model across multiple processors. Interlayer partition, intra-layer partition, and sparse activation are the major approaches to achieve model parallelism. Among them, interlayer partition [10, 11] often requires the model to be explicitly expressed as a stack of sub-modules, the number of which equals the number of processors, and would introduce either gradient staleness or bubble overhead; while the sparse activation [12] is primarily designed for Google TPU cluster and hard to deploy on GPU servers, intra-layer partition [17], especially Megatron-LM [18], can be easily deployed on GPU servers and has been adopted in subsequent works like Turing-NLG and M6. Though pioneers of intra-layer parallelism, they still show memory redundancy and sub-optimal communication efficiency, which reveals the space for further improvements. In this work, we leverage SUMMA [21] and propose Optimus, a highly efficient and scalable paradigm for training super-large language models. In Optimus, activations and gradients are partitioned and distributed along processors all the way through forward and backward propagations, with hardly any memory redundancy. The isoefficiency of communication in pure model parallelism improves from $W \sim p^3$ for Megatron-LM to $W \sim (\sqrt{p}\,\log p)^3$ for our Optimus. This framework is implemented with the open-source deep learning framework PyTorch, and consolidates existing techniques such as mixed precision training [13], activation checkpointing [5], and data parallelism. In experiments on TACC Frontera supercomputers, Optimus shows 1.48x the speed for training, 1.78x speed for inference, and 8x the maximum batch size over Megatron-LM on 64 GPUs in pure mod

Keywords: matrix-matrix multiplication; distributed training; neural networks; natural language processing
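The abstract above builds on SUMMA, a 2D-partitioned matrix-multiplication algorithm. As a rough illustration of that building block only (not of Optimus itself), the sketch below simulates SUMMA on a single process with NumPy: the matrices are split into a q x q grid of blocks, and in step k the block `A[i][k]` is "broadcast" along grid row i and `B[k][j]` along grid column j, with each grid position accumulating its own block of C. The function name `summa`, the parameter `q`, and the use of plain loops in place of real inter-processor broadcasts are assumptions made for illustration.

```python
import numpy as np


def summa(A, B, q):
    """Single-process simulation of SUMMA on a q x q logical processor grid."""
    n, inner = A.shape
    inner_b, m = B.shape
    assert inner == inner_b, "inner dimensions must match"
    assert n % q == 0 and inner % q == 0 and m % q == 0, "sizes must divide by q"
    bn, bk, bm = n // q, inner // q, m // q

    # Block (i, k) of A and block (k, j) of B, as each grid position would own them.
    A_blk = [[A[i * bn:(i + 1) * bn, k * bk:(k + 1) * bk] for k in range(q)]
             for i in range(q)]
    B_blk = [[B[k * bk:(k + 1) * bk, j * bm:(j + 1) * bm] for j in range(q)]
             for k in range(q)]
    C_blk = [[np.zeros((bn, bm)) for _ in range(q)] for _ in range(q)]

    for k in range(q):
        for i in range(q):
            a_panel = A_blk[i][k]          # stands in for a broadcast along row i
            for j in range(q):
                b_panel = B_blk[k][j]      # stands in for a broadcast along column j
                C_blk[i][j] += a_panel @ b_panel
    return np.block(C_blk)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))
    B = rng.standard_normal((6, 8))
    print(np.allclose(summa(A, B, 2), A @ B))
```

In a real distributed setting each grid position holds only its own blocks of A, B, and C, which is the property the abstract relies on when it says activations and gradients stay partitioned throughout.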

NWHy: A Framework for Hypergraph Analytics: Representations, Data Structures, and Algorithms

36th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS)

Authors: Liu, Xu T.; Firoz, Jesun; Gebremedhin, Assefaw H.; Lumsdaine, Andrew. Affiliations: Univ Washington Seattle WA 98195 USA; Pacific Northwest Natl Lab Richland WA 99352 USA; Washington State Univ Pullman WA 99164 USA; TileDB Inc Cambridge MA USA

ISBN: (print) 9781665497473

This paper presents NWHypergraph (NWHy), a parallel high-performance C++ framework for both exact and approximate hypergraph analytics. NWHy provides data structures for various representations of hypergraphs and their associated graph projections (lower-order approximations), including a new technique for hypergraph representation called adjoin graphs. We present a set of hypergraph algorithms for exact and approximate hypergraph analytics implemented in NWHy and demonstrate scalability and performance, operating on a variety of hypergraph representations, that is competitive with the state of the art. In addition, we propose two new queue-based algorithms for s-line graph construction, a lower-order approximation of hypergraphs, to demonstrate the effectiveness and versatility of work queue-based algorithm design. Our queue-based algorithms demonstrate similar performance to the non-queue-based algorithms for bipartite graphs.

Keywords: Hypergraph analytics; hypergraph representation; parallel hypergraph algorithms

Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data

36th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS)

Authors: Chaudhary, Narendra; Misra, Sanchit; Kalamkar, Dhiraj; Heinecke, Alexander; Georganas, Evangelos; Ziv, Barukh; Adelman, Menachem; Kaul, Bharat. Affiliations: Intel Parallel Comp Lab Bangalore Karnataka India; Intel Corp Parallel Comp Lab Santa Clara CA USA; Intel Corp Haifa Israel

ISBN: (print) 9781665497473

Identifying accessible chromatin regions is a fundamental problem in epigenomics with ATAC-seq being a commonly used assay. Exponential rise in ATAC-seq experiments has made it critical to accelerate processing of ATAC-seq data that can have a low signal-to-noise ratio for various reasons including low coverage or low cell count. To denoise and identify accessible chromatin regions from noisy ATAC-seq data, use of deep learning on 1D data - using large filter sizes, long tensor widths, and/or dilation - has recently been proposed. Convolutions over 1D data consume a majority of the runtime in these methods. However, existing implementations of the 1D convolution layer for CPUs and GPUs fail to efficiently use the underlying architecture, especially in the case of large filter sizes, long tensor widths, and dilation. Here, we present ways to accelerate the end-to-end training performance of these deep learning based methods. We evaluate our approach on the recently released AtacWorks toolkit using modern CPUs. Compared to AtacWorks running on an Nvidia DGX-1 box with 8 V100 GPUs, we get up to 2.27x speedup using just 16 CPU sockets. To achieve this, we build an efficient 1D dilated convolution layer and demonstrate reduced precision (BFloat16) training and nearly linear scaling from 1 to 16 sockets.

Keywords: chromatin accessibility; ATAC-seq; single cell analysis; deep learning; 1D convolutions; architecture-aware optimizations; SIMD; BF16
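For readers unfamiliar with the operation the abstract above centers on, here is a minimal NumPy sketch of a 1D dilated convolution (computed as cross-correlation, as deep learning frameworks typically do, with no padding). It loops over the filter taps and performs one small matrix multiply per tap. This is only an illustration of the arithmetic; it makes no claim to match the paper's optimized CPU kernel, and the function name `dilated_conv1d` and the array layouts are assumptions.

```python
import numpy as np


def dilated_conv1d(x, w, dilation=1):
    """Valid (unpadded) 1D dilated cross-correlation.

    x: input of shape (C_in, W)
    w: filters of shape (C_out, C_in, S)
    Returns an array of shape (C_out, W - dilation * (S - 1)).
    """
    c_out, c_in, taps = w.shape
    assert x.shape[0] == c_in, "channel mismatch"
    out_width = x.shape[1] - dilation * (taps - 1)
    assert out_width > 0, "input too short for this filter and dilation"

    out = np.zeros((c_out, out_width), dtype=np.result_type(x, w))
    for s in range(taps):
        start = s * dilation
        # Tap s sees the input shifted by s * dilation positions.
        out += w[:, :, s] @ x[:, start:start + out_width]
    return out


if __name__ == "__main__":
    x = np.arange(10, dtype=float).reshape(1, 10)
    w = np.ones((1, 1, 3))
    print(dilated_conv1d(x, w, dilation=2))
```

With large filter sizes and long tensor widths, the per-tap slices become long contiguous rows, which hints at why the layout and blocking of this loop matter for performance on real hardware.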

Collaborative Offloading with Temporal Tolerance in Cybertwin-enabled 6G

35th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, PIMRC 2024

Authors: Luo, Kaiyue; Wang, Yumei; Liu, Yu. Affiliations: Beijing University of Posts and Telecommunications School of Artificial Intelligence Beijing China

ISBN: (print) 9798350362244

Cybertwin-enabled 6G, by mapping physical entities to Cybertwins endowed with distributed data-centered autonomy, is envisioned to cater to intelligent and flexible services. It plays a critical role in resolving potential conflicts in service integration arising from the proliferation of Internet of Everything devices (IoEDs). However, Cybertwins may encounter temporal dependencies when decoupling data and functionality during the offloading process. To address this issue, we initially decouple the offloading into three segments, adaptively distributed among IoEDs, Cybertwins, and the edge cloud. Subsequently, the Potential Minimum Point with Temporal Lag Tolerance (PMP-TLT) algorithm is introduced. The proposed PMP-TLT utilizes the Temporal Lag Tolerance (TLT) between Cybertwins and IoEDs, breaking the Cybertwins into collaborative active and dormant states. This separation facilitates parallel processing, boosting computational speed. Finally, we highlight the merits of our algorithm in terms of efficiency and complexity, especially for asymmetric solving scales. Additionally, our analysis underscores the performance benefits of the Cybertwin-enabled 6G design, particularly regarding power savings and reduced latency. © 2024 IEEE.

Keywords: Spatio-temporal data

High-order Line Graphs of Non-uniform Hypergraphs: Algorithms, Applications, and Experimental Analysis

36th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS)

Authors: Liu, Xu T.; Firoz, Jesun; Aksoy, Sinan; Amburg, Ilya; Lumsdaine, Andrew; Joslyn, Cliff; Praggastis, Brenda; Gebremedhin, Assefaw H. Affiliations: Univ Washington Seattle WA 98195 USA; Washington State Univ Pullman WA 99164 USA; Pacific Northwest Natl Lab Richland WA 99352 USA

ISBN: (print) 9781665481069

Hypergraphs offer flexible and robust data representations for many applications, but methods that work directly on hypergraphs are not readily available and tend to be prohibitively expensive. Much of the current analysis of hypergraphs relies on first performing a graph expansion - either based on the nodes (clique expansion), or on the hyperedges (line graph) - and then running standard graph analytics on the resulting representative graph. However, this approach suffers from massive space complexity and high computational cost with increasing hypergraph size. Here, we present efficient, parallel algorithms to accelerate and reduce the memory footprint of higher-order graph expansions of hypergraphs. Our results focus on the hyperedge-based s-line graph expansion, but the methods we develop work for higher-order clique expansions as well. To the best of our knowledge, ours is the first framework to enable hypergraph spectral analysis of a large dataset on a single shared-memory machine. Our methods enable the analysis of datasets from many domains that previous graph-expansion-based models are unable to provide. The proposed s-line graph computation algorithms are orders of magnitude faster than state-of-the-art sparse general matrix-matrix multiplication methods, and obtain approximately 2-31x speedup over a prior state-of-the-art heuristic-based algorithm for s-line graph computation.

Keywords: Hypergraphs; parallel hypergraph algorithms; line graphs; intersection graphs; clique expansion
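The s-line graph mentioned in the abstract above has one vertex per hyperedge and connects two hyperedges when they share at least s nodes. The following is a small sequential sketch of that definition, counting overlaps through a node-to-hyperedge incidence map. It is not the parallel algorithm the paper proposes; the function name `s_line_graph` and the list-of-sets input format are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def s_line_graph(hyperedges, s):
    """Return the edge list of the s-line graph.

    hyperedges: sequence of iterables of node ids; hyperedge i becomes vertex i.
    An edge (a, b) with a < b is emitted when hyperedges a and b share >= s nodes.
    """
    # Map each node to the hyperedges that contain it.
    incident = defaultdict(list)
    for e, nodes in enumerate(hyperedges):
        for v in set(nodes):
            incident[v].append(e)

    # Every node shared by a pair of hyperedges contributes one to their overlap.
    overlap = defaultdict(int)
    for edges in incident.values():
        for a, b in combinations(edges, 2):
            overlap[(a, b)] += 1

    return sorted(pair for pair, count in overlap.items() if count >= s)


if __name__ == "__main__":
    H = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 2, 3, 4}]
    for s in (1, 2, 3):
        print(s, s_line_graph(H, s))
```

The pair-counting step is where the cost concentrates: a node contained in d hyperedges generates on the order of d^2 pairs, which gives a feel for the space and time blow-up the abstract describes for large hypergraphs.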

As easy as ABC: Optimal (A)ccountable (B)yzantine (C)onsensus is easy!

36th IEEE International Parallel and Distributed Processing Symposium (IEEE IPDPS)

Authors: Civit, Pierre; Gilbert, Seth; Gramoli, Vincent; Guerraoui, Rachid; Komatovic, Jovan. Affiliations: Sorbonne Univ LIP6 CNRS Paris France; NUS Singapore Singapore Singapore; Univ Sydney EPFL Sydney NSW Australia; Ecole Polytech Fed Lausanne EPFL Lausanne Switzerland

ISBN: (print) 9781665481069

It is known that the agreement property of the Byzantine consensus problem among $n$ processes can be violated in a non-synchronous system if the number of faulty processes exceeds $t_0 = \lceil n/3 \rceil - 1$ [10], [19]. In this paper, we investigate the accountable Byzantine consensus problem in non-synchronous systems: the problem of solving Byzantine consensus whenever possible (e.g., when the number of faulty processes does not exceed $t_0$) and allowing correct processes to obtain proof of culpability of (at least) $t_0 + 1$ faulty processes whenever correct processes disagree. We present four complementary contributions: 1) We introduce ABC: a simple yet efficient transformation of any Byzantine consensus protocol to an accountable one. ABC introduces an overhead of only two all-to-all communication rounds and $O(n^2)$ additional bits in executions with up to $t_0$ faults (i.e., in the common case). 2) We define the accountability complexity, a complexity metric representing the number of accountability-specific messages that correct processes must send. Furthermore, we prove a tight lower bound. In particular, we show that any accountable Byzantine consensus protocol incurs cubic accountability complexity. Moreover, we illustrate that the bound is tight by applying the ABC transformation to any Byzantine consensus protocol. 3) We demonstrate that, when applied to an optimal Byzantine consensus protocol, ABC constructs an accountable Byzantine consensus protocol that is (1) optimal with respect to the communication complexity in solving consensus whenever consensus is solvable, and (2) optimal with respect to the accountability complexity in obtaining accountability whenever disagreement occurs. 4) We generalize ABC to other distributed computing problems besides the classic consensus problem. We characterize a class of agreement tasks, including reliable and consistent broadcast [5], that ABC renders accountable.

Keywords: Measurement; Distributed processing; Lattices; Complexity theory; Consensus protocol; Reliability; Task analysis
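As a quick worked example of the fault threshold in the abstract above, reading it as t_0 = ceil(n/3) - 1, the snippet below evaluates the bound for a few system sizes using integer arithmetic only. The function name `max_tolerated_faults` is an assumption for illustration; the snippet illustrates only the arithmetic, not the ABC transformation.

```python
def max_tolerated_faults(n: int) -> int:
    """t_0 = ceil(n / 3) - 1, computed without floating point."""
    if n < 1:
        raise ValueError("n must be a positive number of processes")
    return -(-n // 3) - 1  # -(-n // 3) is the integer ceiling of n / 3


if __name__ == "__main__":
    for n in (4, 7, 10):
        t0 = max_tolerated_faults(n)
        # On disagreement, at least t0 + 1 faulty processes must be exposed.
        print(f"n={n}: t0={t0}, culprits to expose on disagreement >= {t0 + 1}")
```

For n = 3f + 1 this gives t_0 = f, the familiar "fewer than one third faulty" condition.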

Euler Meets GPU: Practical Graph Algorithms with Theoretical Guarantees

35th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Polak, Adam; Siwiec, Adrian; Stobierski, Michal. Affiliations: Jagiellonian Univ Fac Math & Comp Sci Krakow Poland

ISBN: (print) 9781665440660

The Euler tour technique is a classical tool for designing parallel graph algorithms, originally proposed for the PRAM model. We ask whether it can be adapted to run efficiently on GPU. We focus on two established applications of the technique: (1) the problem of finding lowest common ancestors (LCA) of pairs of nodes in trees, and (2) the problem of finding bridges in undirected graphs. In our experiments, we compare theoretically optimal algorithms using the Euler tour technique against simpler heuristics supposed to perform particularly well on typical instances. We show that the Euler tour-based algorithms not only fulfill their theoretical promises and outperform practical heuristics on hard instances, but also perform on par with them on easy instances.

Keywords: graph algorithms; parallel algorithms
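To show what the Euler tour technique from the abstract above looks like for the LCA problem, here is a sequential Python sketch: it records the tour of a rooted tree together with node depths, then answers an LCA query as the shallowest node between the first occurrences of the two query nodes. This is the textbook reduction in its simplest form, not the paper's GPU implementation; the names `euler_tour` and `lca`, the `children` dictionary format, and the linear scan per query (instead of a range-minimum structure) are simplifying assumptions.

```python
def euler_tour(children, root):
    """Return (tour, depth, first) for a rooted tree.

    children: dict mapping a node to a list of its children.
    tour[i] is the node visited at step i, depth[i] its depth,
    first[v] the index of the first visit of node v.
    """
    tour, depth, first = [root], [0], {root: 0}
    done = object()  # sentinel marking an exhausted child iterator
    stack = [(root, 0, iter(children.get(root, [])))]
    while stack:
        node, d, it = stack[-1]
        child = next(it, done)
        if child is done:
            stack.pop()
            if stack:
                # Returning to the parent adds it to the tour again.
                parent, parent_depth, _ = stack[-1]
                tour.append(parent)
                depth.append(parent_depth)
        else:
            first[child] = len(tour)
            tour.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(children.get(child, []))))
    return tour, depth, first


def lca(u, v, tour, depth, first):
    """Lowest common ancestor: shallowest node between the first visits of u and v."""
    lo, hi = sorted((first[u], first[v]))
    best = min(range(lo, hi + 1), key=depth.__getitem__)
    return tour[best]


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3, 4], 2: [5]}
    tour, depth, first = euler_tour(tree, 0)
    print(tour)
    print(lca(3, 4, tour, depth, first), lca(3, 5, tour, depth, first))
```

Once the tree is flattened into the `tour` and `depth` arrays, each query becomes a range-minimum problem over an array, which is the form that lends itself to data-parallel execution.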

6th IEEE Workshop on Parallel and Distributed Processing for Computational Social Systems (ParSocial 2022)

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022, 2022, pp. 1127-1128

Authors: Korah, John; Santos, Eunice E. Affiliations: California State Polytechnic University Pomona United States; University of Illinois Urbana-Champaign United States
