ISBN: (Print) 9798400714436
The proceedings contain 49 papers. The topics discussed include: Semi-StructMG: a fast and scalable semi-structured algebraic multigrid; LibRTS: a spatial indexing library by ray tracing; high-performance visual semantics compression for AI-driven science; COMPSO: optimizing gradient compression for distributed training with second-order optimizers; TurboFFT: co-designed high-performance and fault-tolerant fast Fourier transform on GPUs; Helios: efficient distributed dynamic graph sampling for online GNN inference; triangle counting on tensor cores; AC-Cache: a memory-efficient caching system for small objects via exploiting access correlations; Magneto: accelerating parallel structures in DNNs via co-optimization of operators; and FlashSparse: minimizing computation redundancy for fast sparse matrix multiplications on tensor cores.
ISBN: (Print) 9781450362252
The proceedings contain 58 papers. The topics discussed include: beyond human-level accuracy: computational challenges in deep learning; throughput-oriented GPU memory allocation; SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU; incremental flattening for nested data parallelism; modular transactions: bounding mixed races in space and time; processing transactions in a predefined order; data-flow/dependence profiling for structured transformations; lightweight hardware transactional memory profiling; provably and practically efficient granularity control; semantics-aware scheduling policies for synchronization determinism; and a round-efficient distributed betweenness centrality algorithm.
ISBN: (Print) 9798400714436
Sequence alignment is a fundamental and often time-consuming step in genomic data analysis. It typically adheres to the seed-and-extension paradigm, and numerous accelerator-based approaches have been proposed to optimize one kernel or the other. However, these approaches often increase costs while contributing little to the overall alignment process. To address this, we have designed an optimized full pipeline, FastBWA, which seeks to enhance performance while keeping costs low and explores the potential of CPU computing resources. Our implementation demonstrates that FastBWA achieves up to 2.5x and 1.8x higher end-to-end alignment throughput compared to BWA-MEM and its newer version, BWA-MEM2, respectively.
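For readers unfamiliar with the paradigm, the following is a minimal Python sketch of seed-and-extension alignment. It is purely illustrative: production aligners such as BWA-MEM use FM-index seeding and banded Smith-Waterman extension, and none of these function names come from FastBWA.

```python
# Minimal seed-and-extend sketch (illustrative; not FastBWA's actual code).
from collections import defaultdict

K = 11  # seed (k-mer) length; real aligners use FM-indexes, not hash tables

def build_index(reference: str) -> dict:
    """Map every k-mer in the reference to its start positions (the seed index)."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def extend(reference: str, read: str, ref_pos: int, read_pos: int) -> int:
    """Score a gapless extension around a seed hit (real tools use banded DP)."""
    score = 0
    i, j = ref_pos + K, read_pos + K          # extend right of the seed
    while i < len(reference) and j < len(read):
        score += 1 if reference[i] == read[j] else -1
        i, j = i + 1, j + 1
    i, j = ref_pos - 1, read_pos - 1          # extend left of the seed
    while i >= 0 and j >= 0:
        score += 1 if reference[i] == read[j] else -1
        i, j = i - 1, j - 1
    return score + K                          # seed bases match exactly

def align(reference: str, index: dict, read: str):
    """Seed: look up exact k-mer hits. Extend: score each hit, keep the best."""
    best = None
    for read_pos in range(len(read) - K + 1):
        for ref_pos in index.get(read[read_pos:read_pos + K], ()):
            score = extend(reference, read, ref_pos, read_pos)
            if best is None or score > best[0]:
                best = (score, ref_pos - read_pos)
    return best  # (score, inferred mapping position) or None

ref = "ACGTACGTGGTCACGTTGACCTGAAGCTTACGTACGT"
idx = build_index(ref)
print(align(ref, idx, "GGTCACGTTGACCTGA"))
```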
ISBN: (Print) 9798400714436
Scaling blockchain performance through parallel smart contract execution has gained significant attention, as traditional methods remain constrained by the performance of a single virtual machine (VM), even in multi-chain or Layer-2 systems. Parallel VMs offer a compelling solution by enabling concurrent transaction execution within a single smart contract, using multiple CPU cores. However, Ethereum's sequential, shared-everything model limits the efficiency of existing parallel mechanisms, resulting in frequent rollbacks with optimistic methods and high overhead with pessimistic methods due to state dependency analysis and locking. This paper introduces Crystality, a programming model for smart contracts on parallel Ethereum Virtual Machines (EVMs) that enables developers to express and leverage the parallelism inherent in smart contracts. Crystality introduces Programmable Contract Scopes to partition contract state into non-overlapping, parallelizable segments and to decompose a smart contract function into finer-grained components. Crystality also features Asynchronous Functional Relay to manage execution flow across EVMs. These features simplify the expression of parallelism and enable asynchronous execution for commutative contract operations. Crystality extends Solidity with directives, transpiling Crystality code into standard Solidity code for EVM compatibility. The system supports two execution modes: an asynchronous mode for transactions involving commutative operations and an optimistic fallback to preserve the block-defined transaction order. Our experiments demonstrate Crystality's superior performance compared to Ethereum, Aptos, and Sui on a 64-core machine.
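As a rough analogy for why scoped, commutative operations avoid rollbacks, here is a Python sketch (Crystality itself extends Solidity; `Scope` and `relay` are hypothetical names, not its syntax): state is partitioned into non-overlapping scopes, and commutative operations such as credits can be relayed to each scope asynchronously because their order does not affect the result.

```python
# Python analogy of scoped, commutative contract state (not Crystality's syntax).
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class Scope:
    """A non-overlapping partition of contract state; each scope could run on its own VM."""
    def __init__(self, name):
        self.name, self.balance, self._lock = name, 0, Lock()

    def credit(self, amount):
        # Commutative: credits apply in any order, so no global ordering
        # (and no optimistic rollback) is needed across scopes.
        with self._lock:
            self.balance += amount

def relay(pool, scope, amount):
    """'Asynchronous functional relay' stand-in: hand the op to the scope's executor."""
    return pool.submit(scope.credit, amount)

scopes = {name: Scope(name) for name in ("alice", "bob")}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [relay(pool, scopes[n], amt)
               for n, amt in [("alice", 5), ("bob", 3), ("alice", 2)]]
    for f in futures:
        f.result()
print({n: s.balance for n, s in scopes.items()})  # {'alice': 7, 'bob': 3}
```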
ISBN: (Print) 9798400714436
Deep neural networks (DNNs) increasingly rely on parallel structures to enhance performance and efficiency. However, existing machine learning compilers (MLCs) face challenges in optimizing these structures due to limited parallel fusion scopes and insufficient consideration of intra-operator information. This paper introduces Magneto, a novel framework designed to accelerate parallel structures in DNNs through the co-optimization of parallel operators. By expanding the scope of parallel operator fusion and introducing a dedicated co-tuning algorithm, Magneto unlocks new opportunities for co-optimization. Experimental results demonstrate that Magneto outperforms NVIDIA TensorRT and AMD MIGraphX, achieving speedups of 3.02x and 4.19x, respectively.
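To make the notion of a "parallel structure" concrete, here is a NumPy sketch (not Magneto's implementation) of the classic fusion opportunity such compilers generalize: two independent operators that share an input can be combined into a single batched operation, improving hardware utilization over launching each small operator separately.

```python
# Illustrative parallel-operator fusion sketch (not Magneto's code).
import numpy as np

x = np.random.rand(32, 64)                                  # shared input
w1, w2 = np.random.rand(64, 128), np.random.rand(64, 128)   # parallel branches

# Unfused: two separate operator launches.
y1, y2 = x @ w1, x @ w2

# Fused: stack the branch weights and issue a single batched matmul.
w = np.stack([w1, w2])   # shape (2, 64, 128)
y = x @ w                # broadcasts to (2, 32, 128), one launch

assert np.allclose(y[0], y1) and np.allclose(y[1], y2)
```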
ISBN: (Print) 9798400714436
There are several strategies to parallelize graph neural network (GNN) training over multiple GPUs. We observe that there is no consistent winner (i.e., with the shortest running time): the optimal strategy depends on the graph dataset, GNN model, training algorithm, and hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs among the strategies and design simple yet effective cost models to compare their execution time and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run different strategies. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and that training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://***/kaihaoma/APT.
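A minimal sketch of cost-model-driven strategy selection in this spirit follows; the strategy names and cost formulas below are hypothetical toys, not APT's actual models.

```python
# Toy cost-model-based strategy selection (illustrative; not APT's models).
def pick_strategy(costs: dict) -> str:
    """Given an estimated per-epoch time for each strategy, pick the cheapest."""
    return min(costs, key=costs.get)

def estimate_costs(num_nodes, feat_dim, num_gpus, pcie_gbps=16e9, flops=1e13):
    """Toy models: split compute time plus strategy-specific communication time."""
    compute = num_nodes * feat_dim * 2 / flops / num_gpus
    feature_bytes = num_nodes * feat_dim * 4          # fp32 features
    return {
        # replicate the graph, all-reduce gradients each step
        "data_parallel": compute + feature_bytes / pcie_gbps,
        # partition features, exchange partial aggregations
        "feature_parallel": compute + feature_bytes / num_gpus / pcie_gbps * 2,
    }

costs = estimate_costs(num_nodes=10_000_000, feat_dim=256, num_gpus=4)
print(costs, "->", pick_strategy(costs))
```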
ISBN: (Print) 9798400714436
Deterministic parallelism is a key building block for distributed and fault-tolerant systems, offering substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel systems (DPS), we identify design pitfalls, such as batched execution and inefficient runtime synchronization, that preclude them from meeting the demands of μs-scale, high-throughput distributed systems deployed in modern datacenters. We present DORADD, a deterministically parallel runtime with low latency and high throughput, designed for modern datacenter services. DORADD introduces a hybrid scheduling scheme that effectively decouples request dispatching from execution. It employs a single dispatcher to deterministically construct a dynamic dependency graph of incoming requests, and worker pools that can independently execute requests in a work-conserving and synchronization-free manner. Furthermore, DORADD overcomes the single-dispatcher throughput bottleneck via core pipelining. We use DORADD to build an in-memory database and compare it with Caracal, the current state-of-the-art deterministic database, on the YCSB and TPC-C benchmarks. Our evaluation shows up to 2.5x better throughput and more than 150x and 300x better tail latency in non-contended and contended cases, respectively. We also compare DORADD with Caladan, the state-of-the-art non-deterministic remote procedure call (RPC) scheduler, and demonstrate that determinism in DORADD does not incur any performance overhead.
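The dispatch/execute split can be illustrated with a short Python sketch (not DORADD's code): a single dispatcher deterministically records, for each request, the last earlier request that touched any of its keys; workers then wait only on those true predecessors, so execution order on independent keys is free while the outcome stays deterministic.

```python
# Sketch of deterministic dispatch plus synchronization-light execution.
from concurrent.futures import ThreadPoolExecutor

def dispatch(requests):
    """Deterministic dispatch: request i depends on the last earlier request
    that touched any key it touches (a dynamic dependency graph)."""
    last_writer, deps = {}, []
    for i, (_, keys) in enumerate(requests):
        deps.append({last_writer[k] for k in keys if k in last_writer})
        for k in keys:
            last_writer[k] = i
    return deps

def run(requests, deps, workers=4):
    """Workers execute each request as soon as its predecessors finish."""
    futures = [None] * len(requests)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, (fn, _) in enumerate(requests):
            def task(i=i, fn=fn):
                for d in deps[i]:        # wait only on true predecessors
                    futures[d].result()
                return fn()
            futures[i] = pool.submit(task)
        return [f.result() for f in futures]

store = {}
def write(key, value):
    """A toy request: a function to run plus the set of keys it touches."""
    def fn():
        store[key] = value
        return value
    return (fn, {key})

reqs = [write("a", 1), write("b", 2), write("a", 3)]
print(run(reqs, dispatch(reqs)), store)  # deterministic: store["a"] == 3
```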
ISBN: (Print) 9798400714436
Group testing is a widely used binary classification method that efficiently distinguishes between samples with and without a binary-classifiable attribute by pooling and testing subsets of a group. Bayesian Group Testing (BGT) is the state-of-the-art approach, which integrates prior risk information into a Bayesian Boolean lattice framework to minimize test counts and reduce false classifications. However, BGT, like other existing group testing techniques, struggles with multinomial group testing, where samples have multiple binary-classifiable attributes that can be distinguished individually and simultaneously. We address this need by proposing Bayesian Multinomial Group Testing (BMGT), which includes a new Bayesian-based model and supporting theorems for an efficient and precise multinomial pooling strategy. We further design and develop SBMGT, a high-performance and scalable framework that tackles BMGT's computational challenges through three key innovations: (1) a parallel binary-encoded product lattice model with up to 99.8% efficiency; (2) the Bayesian Balanced Partitioning Algorithm (BBPA), a multinomial pooling strategy optimized for parallel computation with up to 97.7% scaling efficiency on 4,096 cores; and (3) a scalable multinomial group testing analytics framework, demonstrated in a real-world disease surveillance case study using AIDS and STD datasets from Uganda, where SBMGT reduced tests by up to 54% and lowered false classification rates by 92% compared to BGT.
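For context, the following toy shows the basic economics of plain (non-Bayesian, single-attribute) group testing that BMGT builds on: pool the samples, test each pool once, and retest individuals only in positive pools.

```python
# Two-round group testing toy (illustrative; not the BMGT/SBMGT algorithms).
def pool_test(samples, pool_size, is_positive):
    """Return the positive sample indices and the number of tests used."""
    tests, positives = 0, []
    for start in range(0, len(samples), pool_size):
        pool = list(range(start, min(start + pool_size, len(samples))))
        tests += 1                                  # one test for the whole pool
        if any(is_positive(samples[i]) for i in pool):
            for i in pool:                          # retest members individually
                tests += 1
                if is_positive(samples[i]):
                    positives.append(i)
    return positives, tests

samples = [0] * 100
samples[17] = samples[42] = 1                       # two positive samples
found, tests = pool_test(samples, pool_size=10, is_positive=bool)
print(found, f"{tests} tests instead of {len(samples)}")  # 30 tests instead of 100
```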
Speculative data-parallel algorithms for language recognition have been widely explored for various types of finite-state automata (FA), deterministic (DFA) and nondeterministic (NFA), often derived from regular e...
ISBN: (Print) 9798400714436
As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are also achievable in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be supported with close to the maximum (4x) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8x) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
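The arithmetic behind the claimed speedup can be sketched in NumPy (the real kernel is a hand-tuned GPU implementation; this illustrates only the math): weights stored at 4 bits per element with per-group scales cut weight traffic roughly 4x versus fp16, and are dequantized on the fly before the matmul.

```python
# Mixed-precision weight-quantization math sketch (not the MARLIN kernel).
import numpy as np

def quantize_4bit(w, group_size=128):
    """Symmetric 4-bit quantization with one float scale per group of weights."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7     # int4 range: -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale, shape):
    """Dequantize then multiply (a fused GPU kernel does this in registers)."""
    w = (q.astype(np.float32) * scale).reshape(shape)
    return x @ w

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((16, 256)).astype(np.float32)   # batch of 16 clients
q, s = quantize_4bit(w)
err = np.abs(dequant_matmul(x, q, s, w.shape) - x @ w).max()
print(f"max abs error from 4-bit weights: {err:.3f}")
```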