We have modified the C language to support a programming model based on a shared address space with physically distributed memory. With this model, users can write programs in which the nodes of a massively parallel processor can access remote memory without message passing. AC provides support for distributed arrays as well as pointers to distributed data. Simple array references and pointer dereferencing are sufficient to generate low-overhead remote reads and writes. We have implemented these ideas in a compiler based on the GNU C compiler and targeted at Cray Research's T3D. Initial performance measurements show that AC generates code for remote accesses which is considerably faster than that of the native compiler for structures up to about 16 words in size and virtually equivalent for larger transfers.
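To make the access model concrete, here is a hedged sketch in plain C rather than AC's own syntax of how dereferencing a pointer to distributed data can lower to a one-word remote read or write. The `global_ptr` type and the `remote_read`/`remote_write` helpers are hypothetical stand-ins for what such a compiler would emit, and the per-node memories are simulated locally so the example is self-contained.

```c
/* Illustrative sketch only: not AC syntax. Shows how a "global pointer"
 * (node id + offset) dereference could lower to a one-word remote access,
 * here simulated with ordinary local memory. All names are hypothetical. */
#include <stdio.h>

#define NODES 4
#define WORDS 8

/* Simulated per-node memories standing in for physically distributed memory. */
static long node_mem[NODES][WORDS];

typedef struct {
    int node;    /* owning node              */
    int offset;  /* word offset on that node */
} global_ptr;

/* What the compiler might emit for "x = *p" on a remote global pointer. */
static long remote_read(global_ptr p)
{
    return node_mem[p.node][p.offset];   /* real code would issue a network get */
}

/* What the compiler might emit for "*p = v". */
static void remote_write(global_ptr p, long v)
{
    node_mem[p.node][p.offset] = v;      /* real code would issue a network put */
}

int main(void)
{
    global_ptr p = { .node = 2, .offset = 5 };
    remote_write(p, 42);                 /* "*p = 42" */
    printf("%ld\n", remote_read(p));     /* "x = *p"  */
    return 0;
}
```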
ISBN (print): 9781450301190
We introduce the major ideas of a wait-free, linearizable, and disjoint-access-parallel NCAS library called RTNCAS. It focuses on the construction of wait-free data structure operations (DSOs) in real-time settings. RTNCAS is able to conditionally swap multiple independent words (NCAS) in an atomic manner. Furthermore, it allows arbitrary DSOs to be implemented by means of their sequential specifications.
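As an illustration of what an NCAS interface might look like, the sketch below defines a hypothetical `ncas` function and uses it to update two independent words atomically. The lock-based body only models NCAS's atomic semantics; it is neither wait-free nor disjoint-access parallel, so it should not be read as RTNCAS's algorithm.

```c
/* Sketch of a possible NCAS interface (names hypothetical). The lock-based
 * body models only the atomic semantics of NCAS; RTNCAS itself is wait-free
 * and disjoint-access parallel, which this simulation is not. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically: if *addr[i] == expected[i] for all i, store new_val[i] everywhere. */
static bool ncas(int n, intptr_t *addr[], const intptr_t expected[],
                 const intptr_t new_val[])
{
    bool ok = true;
    pthread_mutex_lock(&big_lock);
    for (int i = 0; i < n; i++)
        if (*addr[i] != expected[i]) { ok = false; break; }
    if (ok)
        for (int i = 0; i < n; i++)
            *addr[i] = new_val[i];
    pthread_mutex_unlock(&big_lock);
    return ok;
}

int main(void)
{
    /* Example: move a value between two independent slots in one atomic step. */
    intptr_t head = 7, tail = 0;
    intptr_t *addrs[2]   = { &head, &tail };
    intptr_t expected[2] = { 7, 0 };
    intptr_t desired[2]  = { 0, 7 };
    printf("ncas %s: head=%ld tail=%ld\n",
           ncas(2, addrs, expected, desired) ? "succeeded" : "failed",
           (long)head, (long)tail);
    return 0;
}
```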
Scalable busy-wait synchronization algorithms are essential for achieving good parallel program performance on large scale multiprocessors. Such algorithms include mutual exclusion locks, reader-writer locks, and barrier synchronization. Unfortunately, scalable synchronization algorithms are particularly sensitive to the effects of multiprogramming: their performance degrades sharply when processors are shared among different applications, or even among processes of the same application. In this paper we describe the design and evaluation of scalable scheduler-conscious mutual exclusion locks, reader-writer locks, and barriers, and show that by sharing information across the kernel/application interface we can improve the performance of scheduler-oblivious implementations by more than an order of magnitude.
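A minimal sketch of the kernel/application information-sharing idea, not the paper's actual algorithms: a test-and-set lock whose waiters consult a hypothetical `holder_preempted` flag (standing in for state the kernel would publish to user space) and yield the processor instead of spinning when the lock holder has been descheduled.

```c
/* Hedged sketch of "scheduler-conscious" spinning (not the paper's algorithms).
 * `holder_preempted` stands in for state the kernel would publish to user
 * space; here it is an ordinary flag so the example is self-contained. */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

typedef struct {
    atomic_int  locked;            /* 0 = free, 1 = held       */
    atomic_bool holder_preempted;  /* hypothetically kernel-set */
} sc_lock;

static void sc_lock_acquire(sc_lock *l)
{
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&l->locked, &expected, 1))
            return;
        /* Scheduler-conscious part: if the holder lost its processor,
         * spinning is wasted work, so give the CPU back instead. */
        if (atomic_load(&l->holder_preempted))
            sched_yield();
    }
}

static void sc_lock_release(sc_lock *l)
{
    atomic_store(&l->locked, 0);
}

int main(void)
{
    static sc_lock l;   /* zero-initialized: free, holder not preempted */
    sc_lock_acquire(&l);
    sc_lock_release(&l);
    return 0;
}
```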
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
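The following hand-written C example illustrates the kind of data layout transformation such a compiler might apply automatically (it is not the paper's algorithm): when each processor works on one column of a row-major array, storing the array transposed turns that processor's strided accesses into unit-stride, cache-friendly ones.

```c
/* Hand-written illustration of a data layout transformation of the kind such
 * a compiler might apply automatically (not its actual algorithm). If each
 * processor p works on column p of A, the row-major layout puts that data at
 * stride N; storing A transposed makes it contiguous. */
#include <stdio.h>

#define N 4

/* Original layout: processor p touches A[0][p] .. A[N-1][p] (stride N). */
static double A[N][N];

/* Transformed layout: processor p touches At[p][0] .. At[p][N-1]
 * (contiguous, so few cache lines and no false sharing between processors). */
static double At[N][N];

static double column_sum_original(int p)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += A[i][p];   /* strided accesses */
    return s;
}

static double column_sum_transformed(int p)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += At[p][i];  /* unit-stride accesses */
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = At[j][i] = i * N + j;
    printf("%g %g\n", column_sum_original(1), column_sum_transformed(1));
    return 0;
}
```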
A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley Threaded Abstract Machine (TAM). One implementation takes an Active Messages approach, maintaining a scheduling hierarchy in software in order to improve data cache performance. The other relies on the J-Machine's message queues and fast task switch, lowering the control costs at the expense of data locality. Our analysis measures the costs and benefits of each approach for a variety of programs and cache configurations. The Active Messages implementation is strongest when miss penalties are high and for the finest-grained programs. The hardware-buffered implementation is strongest in direct-mapped caches, where it achieves substantially better instruction cache performance.
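A minimal, single-node sketch of the Active Messages style of dispatch assumed here, not the J-Machine or TAM implementations themselves: each message carries a handler pointer and an argument, and a small software queue lets the runtime batch and drain related work, which is the structure that can be scheduled for data cache locality.

```c
/* Minimal single-node sketch of Active Messages style dispatch (not the
 * J-Machine/TAM implementations). Each message names its handler; a software
 * queue lets the runtime batch related work for cache locality. */
#include <stdio.h>

typedef void (*am_handler)(int arg);

typedef struct { am_handler handler; int arg; } am_msg;

#define QLEN 64
static am_msg queue[QLEN];
static int head, tail;

static void am_send(am_handler h, int arg)   /* enqueue instead of a real network send */
{
    queue[tail % QLEN] = (am_msg){ h, arg };
    tail++;
}

static void am_poll(void)                    /* software scheduler: drain the queue */
{
    while (head != tail) {
        am_msg m = queue[head % QLEN];
        head++;
        m.handler(m.arg);                    /* run the handler named by the message */
    }
}

static void add_one(int x) { printf("add_one(%d) -> %d\n", x, x + 1); }

int main(void)
{
    am_send(add_one, 41);
    am_send(add_one, 1);
    am_poll();
    return 0;
}
```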
ISBN (print): 9781450319225
We present a simple yet effective technique for improving performance of lock-based code using the hardware lock elision (HLE) feature in Intel's upcoming Haswell processor. We also describe how to extend Haswell's HLE mechanism to achieve a similar effect to our lock elision scheme entirely in hardware.
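For context, GCC exposes HLE through extra flags on its atomic builtins; the sketch below shows a generic elided test-and-set lock built that way (x86 target, GCC 4.8 or later, compiled with -mhle). It illustrates how HLE is typically driven from C and is not necessarily the specific elision scheme the paper proposes.

```c
/* Generic HLE usage via GCC's atomic builtins (x86, GCC >= 4.8, -mhle);
 * not necessarily the elision scheme the paper proposes. */
#include <immintrin.h>   /* _mm_pause */

static int lockvar;          /* 0 = free, 1 = held */
static int shared_counter;

static void hle_lock(void)
{
    /* __ATOMIC_HLE_ACQUIRE makes GCC emit an XACQUIRE-prefixed exchange, so
     * the hardware may elide the lock and run the critical section
     * speculatively, falling back to real locking on conflict. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();         /* polite spin while the lock appears held */
}

static void hle_unlock(void)
{
    /* __ATOMIC_HLE_RELEASE emits an XRELEASE-prefixed store, ending elision. */
    __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}

int main(void)
{
    hle_lock();
    shared_counter++;        /* critical section; runs transactionally when elided */
    hle_unlock();
    return shared_counter == 1 ? 0 : 1;
}
```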
ISBN (print): 9781450362252
In this tutorial participants learn how to build their own parallelprogramming language features by developing them as language extensions in the ableC [4] extensible C compiler framework. By implementing new parallelprogramming abstractions as language extensions one can build on an existing host language and thus avoid re-implementing common language features such as the type checking and code generation of arithmetic expressions and control flow statements. Using ableC, one can build expressive language features that fit seamlessly into the C11 host language.
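To suggest how an extension can lean on the host language, the sketch below shows, in ordinary C11 with pthreads, the kind of code a hypothetical parallel-for extension might translate into; the extension author would specify the new syntax and this translation, while type checking and code generation of the loop body remain the host compiler's job. All names here are hypothetical, and this is not ableC's actual output or API.

```c
/* Hypothetical illustration: plain C11 code that a made-up "parallel for"
 * extension might generate, so the extension only specifies syntax and this
 * translation rather than re-implementing C's type checking or codegen.
 * Not ableC's actual output or API. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static double a[N], b[N];

typedef struct { int begin, end; } range;

static void *body(void *arg)               /* loop body, outlined per thread */
{
    range r = *(range *)arg;
    for (int i = r.begin; i < r.end; i++)
        a[i] = 2.0 * b[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) b[i] = i;

    pthread_t tid[NTHREADS];
    range     part[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {   /* static partition of [0, N) */
        part[t] = (range){ t * N / NTHREADS, (t + 1) * N / NTHREADS };
        pthread_create(&tid[t], NULL, body, &part[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("a[999] = %g\n", a[999]);
    return 0;
}
```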
ISBN (print): 9781605587080
The Pilot library is a new method for programming MPI-enabled clusters in C, targeted at novice parallel programmers. Formal elements from Communicating Sequential Processes (CSP) are used to realize a process/channel model of parallel computation that reduces opportunities for deadlock and other communication errors. This simple model, plus an application programming interface (API) styled after C's formatted I/O, is designed to make the library easy to learn. The Pilot library exists as a thin layer on top of any standard Message Passing Interface (MPI) implementation, preserving MPI's portability and efficiency, with little performance overhead arising as a result of Pilot's additional features.
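A hedged sketch of the process/channel, formatted-I/O style the abstract describes follows; the PI_* names and signatures are an approximation reconstructed from memory of Pilot's documentation and may not match the library's current API exactly.

```c
/* Sketch of Pilot-style process/channel code. The PI_* names and signatures
 * are an approximation of the library's API (reconstructed from memory),
 * shown only to illustrate the printf/scanf-style channel calls. */
#include <pilot.h>
#include <stdio.h>

static PI_PROCESS *worker;
static PI_CHANNEL *to_worker, *from_worker;

static int work(int index, void *arg)
{
    int x;
    PI_Read(to_worker, "%d", &x);          /* scanf-style receive */
    PI_Write(from_worker, "%d", x * x);    /* printf-style send   */
    return 0;
}

int main(int argc, char *argv[])
{
    PI_Configure(&argc, &argv);            /* wraps MPI startup; begins configuration phase */

    worker      = PI_CreateProcess(work, 0, NULL);
    to_worker   = PI_CreateChannel(PI_MAIN, worker);
    from_worker = PI_CreateChannel(worker, PI_MAIN);

    PI_StartAll();                         /* configuration ends; processes start running */

    int result;
    PI_Write(to_worker, "%d", 7);
    PI_Read(from_worker, "%d", &result);
    printf("7 squared is %d\n", result);

    PI_StopMain(0);                        /* shuts down Pilot and MPI */
    return 0;
}
```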
ISBN (print): 9781450301190
We describe two novel constructs for programming parallel machines with multi-level memory hierarchies: call-up, which allows a child task to invoke computation on its parent, and spawn, which spawns a dynamically determined number of parallel children until some termination condition in the parent is met. We show that together these constructs allow applications with irregular parallelism to be programmed in a straightforward manner, and furthermore that they complement and can be combined with constructs for expressing regular parallelism. We have implemented spawn and call-up in Sequoia, and we present an experimental evaluation on a number of irregular applications.
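A plain-C, single-threaded simulation of the control flow of the two constructs, shown only to fix intuition (this is not Sequoia syntax or its runtime): the parent keeps spawning children until a termination condition it owns is satisfied, and each child can call up into a function the parent provides.

```c
/* Plain-C, single-threaded simulation of the control flow of spawn and
 * call-up (not Sequoia syntax or its runtime). The parent spawns children
 * until its own termination test says to stop; each child may "call up" to
 * run a function supplied by the parent. */
#include <stdio.h>

typedef struct parent {
    int  work_found;                        /* state only the parent owns     */
    void (*call_up)(struct parent *, int);  /* what children invoke on parent */
} parent;

static void report_result(parent *p, int result)   /* the call-up target */
{
    p->work_found += result;
}

static void child(parent *p, int id)
{
    int result = id % 3;            /* stand-in for irregular work discovery */
    p->call_up(p, result);          /* call-up: child invokes parent code    */
}

int main(void)
{
    parent p = { .work_found = 0, .call_up = report_result };

    /* spawn: launch a dynamically determined number of children until the
     * parent's termination condition is met. */
    int id = 0;
    while (p.work_found < 10)
        child(&p, id++);

    printf("spawned %d children, accumulated %d\n", id, p.work_found);
    return 0;
}
```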
ISBN (print): 9781450392044
We present KUMQUAT, a system for automatically generating data-parallel implementations of UNIX shell commands and pipelines. The generated parallel versions split input streams, execute multiple instantiations of the original pipeline commands to process the splits in parallel, then combine the resulting parallel outputs to produce the final output stream. KUMQUAT automatically synthesizes the combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners. We present experimental results that show that these combiners enable the effective parallelization of our benchmark scripts.
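The hand-written C example below illustrates the split / parallel-execute / combine shape that the system automates for shell pipelines, not its implementation: the "command" counts newlines in each chunk, so the correct combiner is simply addition over the partial counts.

```c
/* Hand-written illustration of the split / parallel-execute / combine shape
 * that KUMQUAT automates for shell pipelines (not its implementation). The
 * "command" counts newlines, so the correct combiner is addition. */
#include <pthread.h>
#include <stdio.h>

typedef struct { const char *chunk; long count; } task;

static void *count_lines(void *arg)          /* one instantiation of the "command" */
{
    task *t = arg;
    t->count = 0;
    for (const char *p = t->chunk; *p; p++)
        if (*p == '\n') t->count++;
    return NULL;
}

int main(void)
{
    /* Split the input stream into chunks (here, split by hand at a line boundary). */
    task parts[2] = { { .chunk = "a\nb\nc\n" }, { .chunk = "d\ne\n" } };

    pthread_t tid[2];
    for (int i = 0; i < 2; i++)              /* execute the splits in parallel */
        pthread_create(&tid[i], NULL, count_lines, &parts[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    long total = parts[0].count + parts[1].count;   /* combiner: addition */
    printf("wc -l equivalent: %ld\n", total);
    return 0;
}
```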