检索结果-内蒙古大学图书馆

您好，读者！请登录

内蒙古大学图书馆

首页
概况
党建
资源
服务
科研支持
- 论文收录引用证明
- 科技查新
知识产权
档案馆
帮助

咨询与建议

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

您的常用邮箱：*

您的手机号码：*

问题描述：

当前已输入0个字，您还可以输入200个字

全部搜索
期刊论文
图书
学位论文
标准
纸本馆藏
外文资源发现
数据库导航
超星发现

高级检索

时间限定

出版年份：

文献类型

图书期刊文献学位论文多媒体

馆藏选择

电子馆藏纸本馆藏

核心期刊

全部期刊 SCI 收录期刊 SSCI 收录期刊 EI 收录期刊 CSCD 收录期刊 CSSCI 收录期刊

语言

中文英文

文献类型

期刊文献图书学位论文标准纸本馆藏

帮助

文字说明：

T=题名（书名、题名），A=作者（责任者），K=主题词，P=出版物名称，PU=出版社名称，O=机构（作者单位、学位授予单位、专利申请人），L=中图分类号，C=学科分类号，U=全部字段，Y=年（出版发行年、学位年度、标准发布年）

检索规则说明：

AND代表“并且”；OR代表“或者”；NOT代表“不包含”；(注意必须大写,运算符两边需空一格)

检索范例：

范例一：(K=图书馆学 OR K=情报学) AND A=范并思 AND Y=1982-2016
范例二：P=计算机应用与软件 AND (U=C++ OR U=Basic) NOT K=Visual AND Y=2011-2016

分类表

所选分类

>> <<

限定检索结果

文献类型

322 篇 会议
18 篇 期刊文献

馆藏范围

340 篇 电子文献
0 种 纸本馆藏

日期分布

学科分类号

288 篇 工学
- 248 篇 软件工程
- 232 篇 计算机科学与技术...
- 13 篇 电子科学与技术（可...
- 7 篇 信息与通信工程
- 5 篇 控制科学与工程
- 4 篇 机械工程
- 4 篇 生物工程
- 3 篇 生物医学工程（可授...
- 1 篇 力学（可授工学、理...
- 1 篇 动力工程及工程热...
- 1 篇 电气工程
- 1 篇 核科学与技术
- 1 篇 农业工程
- 1 篇 环境科学与工程（可...
53 篇 理学
- 49 篇 数学
- 4 篇 生物学
- 4 篇 系统科学
- 4 篇 统计学（可授理学、...
- 2 篇 化学
14 篇 管理学
- 10 篇 管理科学与工程(可...
- 8 篇 工商管理
- 4 篇 图书情报与档案管...
3 篇 经济学
- 3 篇 应用经济学
2 篇 法学
- 2 篇 社会学
1 篇 教育学
- 1 篇 教育学
1 篇 农学
- 1 篇 作物学

主题

54 篇 performance
48 篇 parallel process...
33 篇 algorithms
33 篇 parallel program...
27 篇 languages
25 篇 design
20 篇 parallel algorit...
20 篇 gpu
9 篇 experimentation
9 篇 measurement
7 篇 graphics process...
7 篇 theory
7 篇 parallel
6 篇 scalability
6 篇 mpi
6 篇 parallel computi...
6 篇 concurrency
5 篇 parallelism
5 篇 graph algorithms
5 篇 multicore

机构

7 篇 carnegie mellon ...
4 篇 indiana univ blo...
4 篇 shanghai jiao to...
3 篇 univ of tokyo
3 篇 tsinghua univ de...
3 篇 univ chinese aca...
3 篇 massachusetts in...
3 篇 univ illinois ur...
3 篇 swiss fed inst t...
3 篇 mit csail united...
3 篇 tsinghua univ pe...
3 篇 univ calif berke...
2 篇 ist austria klos...
2 篇 fudan univ sch c...
2 篇 georgetown univ ...
2 篇 univ wisconsin d...
2 篇 shanghai key lab...
2 篇 univ of wisconsi...
2 篇 tsinghua univers...
2 篇 shanghai jiao to...

作者

8 篇 blelloch guy e.
7 篇 chen haibo
6 篇 hoefler torsten
6 篇 garland michael
6 篇 zhai jidong
6 篇 shun julian
5 篇 sun yihan
4 篇 dhulipala laxman
4 篇 chen wenguang
4 篇 tsigas philippas
4 篇 tan guangming
4 篇 wang haojie
4 篇 nikolopoulos dim...
4 篇 mellor-crummey j...
4 篇 gu yan
4 篇 kennedy ken
3 篇 taura kenjiro
3 篇 li jiajia
3 篇 yonezawa akinori
3 篇 pingali keshav

语言

338 篇 英文
2 篇 其他

检索条件"任意字段=Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"

共 340 条记录，以下是251-260 订阅

全选清除本页清除全部题录导出标记到"检索档案"

详细简洁

排序：

Ownership Passing: Efficient Distributed Memory programming on Multi-core Systems 13

Ownership Passing: Efficient Distributed Memory Programming ...

引用

18th acm sigplan symposium on principles and practice of parallel programming

作者： Friedley, Andrew Hoefler, Torsten Bronevetsky, Greg Lumsdaine, Andrew Ma, Ching-Chen Indiana Univ Bloomington IN 47405 USA ETH Zurich Switzerland Lawrence Livermore Natl Lab Livermore CA USA Rose Hulman Inst Technol Terre Haute IN 47803 USA

ISBN: (纸本)9781450319225

the number of cores in multi- and many-core high-performance processors is steadily increasing. MPI, the de-facto standard for programming high-performance computing systems offers a distributed memory programming model. MPI's semantics force a copy from one process' send buffer to another process' receive buffer. this makes it difficult to achieve the same performance on modern hardware than shared memory programs which are arguably harder to maintain and debug. We propose generalizing MPI's communication model to include ownership passing, which make it possible to fully leverage the shared memory hardware of multi-and many-core CPUs to stream communicated data concurrently with the receiver's computations on it. the benefits and simplicity of message passing are retained by extending MPI with calls to send (pass) ownership of memory regions, instead of their contents, between processes. Ownership passing is achieved with a hybrid MPI implementation that runs MPI processes as threads and is mostly transparent to the user. We propose an API and a static analysis technique to transform legacy MPI codes automatically and transparently to the programmer, demonstrating that this scheme is easy to use in practice. Using the ownership passing technique, we see up to 51% communication speedups over a standard message passing implementation on state-of-the art multicore systems. Our analysis and interface will lay the groundwork for future development of MPI-aware optimizing compilers and multi-core specific optimizations, which will be key for success in current and next-generation computing platforms.

关键词： Ownership Passing Distributed Memory Shared Memory Message Passing Multi-core

来源：评论

学校读者我要写书评

暂无评论

TileSpGEMM: A Tiled Algorithm for parallel Sparse General Matrix-Matrix Multiplication on GPUs 22

TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Ma...

引用

27th acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Niu, Yuyao Lu, Zhengyang Ji, Haonan Song, Shuhui Jin, Zhou Liu, Weifeng China Univ Petr Super Sci Software Lab Beijing Peoples R China

ISBN: (纸本)9781450392044

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks and machine learning applications. the existing parallel approaches for shared memory SpGEMM mostly use the row-row style with possibly good parallelism. However, because of the irregularity in sparsity structures, the existing row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and unsatisfactory data locality, and (3) sparse accumulator selection. We in this paper propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method in dense general matrix-matrix multiplication (GEMM), and saves each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, but not a row possibly very long. thus the load imbalance issue can be naturally alleviated. Secondly, the temporary space needed for each tile is small and can always be in on-chip scratchpad memory. thus there is no need to allocate an off-chip space for a large amount of intermediate products, and the data locality can be much better. thirdly, because the computations are restricted within a single tile, it is relatively easier to select a fast sparse accumulator for a sparse tile. Our experimental results on two newest NVIDIA GPUs show that our TileSpGEMM outperforms four state-of-the-art SpGEMM methods cuSPARSE, bhSPARSE, NSPARSE and spECK in 139, 138, 127 and 94 out of all 142 square matrices executing no less than one billion flops for an SpGEMM operation, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups, respectively.

关键词： Sparse matrix SpGEMM tiled algorithm GPU

来源：评论

学校读者我要写书评

暂无评论

DAPPLE: A pipelined data parallel approach for training large models 21

DAPPLE: A pipelined data parallel approach for training larg...

引用

26th acm sigplan symposium on principles and practice of parallel programming, PPoPP 2021

作者： Fan, Shiqing Rong, Yi Meng, Chen Cao, Zongyan Wang, Siyu Zheng, Zhen Wu, Chuan Long, Guoping Yang, Jun Xia, Lixue Diao, Lansong Liu, Xiaoyong Lin, Wei Alibaba Group China

ISBN: (纸本)9781450382946

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, there are still several tricky issues to address: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to re-computation approach and does not come at the expense of training throughput. Experiments show that DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23× speedup under synchronous training scenarios, and DAPPLE runtime outperforms GPipe by 1.6× speedup of training throughput and saves 12% of memory consumption at the same time. © 2021 acm.

关键词： Scheduling algorithms

来源：评论

学校读者我要写书评

暂无评论

Provably Fast and Space-Efficient parallel Biconnectivity 23

Provably Fast and Space-Efficient Parallel Biconnectivity

引用

28th acm sigplan Annual symposium on principles and practice of parallel programming, PPoPP 2023

作者： Dong, Xiaojun Wang, Letong Gu, Yan Sun, Yihan UC Riverside

ISBN: (纸本)9798400700156

Computing biconnected components (BCC) of a graph is a fundamental graph problem. the canonical parallel BCC algorithm is the Tarjan-Vishkin algorithm, which has O(n + m) optimal work and polylogarithmic span on a graph with n vertices and m edges. However, Tarjan-Vishkin is not widely used in practice. We believe the reason is the space-inefficiency (it uses O(m) extra space). In practice, existing parallel implementations are based on breath-first search (BFS). Since BFS has span proportional to the diameter of the graph, existing parallel BCC implementations suffer from poor performance on large-diameter graphs and can be slower than the sequential algorithm on many real-world graphs. We propose the first p arallel b iconnectivity algorithm (FAST-BCC) that has optimal work, polylogarithmic span, and is space-efficient. Our algorithm creates a skeleton graph based on any spanning tree of the input graph. then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. We carefully analyze the correctness of our algorithm, which is highly non-trivial. We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We tested them on a 96-core machine on 27 graphs with varying edge distributions. FAST-BCC is the fastest on all graphs. On average (geometric means), FAST-BCC is 3.1× faster than the best existing baseline on each graph. © 2023 Owner/Author.

关键词： Graphic methods

来源：评论

学校读者我要写书评

暂无评论

Scheduling parallel Programs by Work Stealing with Private Deques 13

Scheduling Parallel Programs by Work Stealing with Private D...

引用

18th acm sigplan symposium on principles and practice of parallel programming

作者： Acar, Umut A. Chargueraud, Arthur Rainey, Mike Carnegie Mellon Univ Dept Comp Sci Pittsburgh PA 15213 USA Univ Paris 11 CNRS LRI Orsay France

ISBN: (纸本)9781450319225

Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, which are assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences in modern weak-memory architectures, 2) they can be very difficult to extend to support various optimizations and flexible forms of task distribution strategies needed many applications, e. g., those that do not fit nicely into the divide-and-conquer, nested data parallel paradigm. For these reasons, there has been a lot recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences from local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. these advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of concurrent deques and whether they can be effective in practice. In this paper, we propose two work-stealing algorithms with private deques and prove that the algorithms guarantee similar theoretical bounds as work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables implementing flexible task creation and distribution strategies. As a specific example, we show how to impleme

关键词： work stealing nested parallelism dynamic load balancing

来源：评论

学校读者我要写书评

暂无评论

An ownership policy and deadlock detector for promises 21

An ownership policy and deadlock detector for promises

引用

26th acm sigplan symposium on principles and practice of parallel programming, PPoPP 2021

作者： Voss, Caleb Sarkar, Vivek Georgia Institute of Technology United States

ISBN: (纸本)9781450382946

Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or Transitive Joins. However, the promise, a popular synchronization primitive for parallel tasks, does not enjoy deadlock-freedom guarantees. Promises can exhibit deadlock-like bugs;however, the concept of a deadlock is not currently well-defined for promises. To address these challenges, we propose an ownership semantics in which each promise is associated to the task which currently intends to fulfill it. Ownership immediately enables the identification of bugs in which a task fails to fulfill a promise for which it is responsible. Ownership further enables the discussion of deadlock cycles among tasks and promises and allows us to introduce a robust definition of deadlock-like bugs for promises. Cycle detection in this context is non-trivial because it is concurrent with changes in promise ownership. We provide a lock-free algorithm for precise runtime deadlock detection. We show how to obtain the memory consistency criteria required for the correctness of our algorithm under TSO and the Java and C++ memory models. An evaluation compares the execution time and memory usage overheads of our detection algorithm on benchmark programs relative to an unverified baseline. Our detector exhibits a 12% (1.12×) geometric mean time overhead and a 6% (1.06×) geometric mean memory overhead, which are smaller overheads than in past approaches to deadlock cycle detection. © 2021 acm.

关键词： parallel programming

来源：评论

学校读者我要写书评

暂无评论

Data structures for task-based priority scheduling 14

Data structures for task-based priority scheduling

引用

2014 19th acm sigplan symposium on principles and practice of parallel programming, PPoPP 2014

作者： Wimmer, Martin Versaci, Francesco Träff, Jesper Larsson Cederman, Daniel Tsigas, Philippas Faculty of Informatics Parallel Computing Vienna University of Technology 1040 Vienna/Wien Austria Computer Science and Engineering Chalmers University of Technology 412 96 Göteborg Sweden

ISBN: (纸本)9781450326568

We present three lock-free data structures for priority task scheduling: a priority work-stealing one, a centralized one with ρ-relaxed semantics, and a hybrid one combining both concepts. With the single-source shortest path (SSSP) problem as example, we show how the different approaches affect the prioritization and provide upper bounds on the number of examined nodes. We argue that priority task scheduling allows for an intuitive and easy way to parallelize the SSSP problem, notoriously a hard task. Experimental evidence supports the good scalability of the resulting algorithm. the larger aim of this work is to understand the trade-offs between scalability and priority guarantees in task scheduling systems. We show that ρ-relaxation is a valuable technique for improving the first, while still allowing semantic constraints to be satisfied: the lock-free, hybrid κ-priority data structure can scale as well as work-stealing, while still providing strong priority scheduling guarantees, which depend on the parameter κ. Our theoretical results open up possibilities for even more scalable data structures by adopting a weaker form of ρ-relaxation, which still enables the semantic constraints to be respected.

关键词： Scalability

来源：评论

学校读者我要写书评

暂无评论

Merchandiser: Data Placement on Heterogeneous Memory for Task-parallel HPC Applications with Load-Balance Awareness 23

Merchandiser: Data Placement on Heterogeneous Memory for Tas...

引用

28th acm sigplan Annual symposium on principles and practice of parallel programming, PPoPP 2023

作者： Xie, Zhen Liu, Jie Li, Jiajia Li, Dong University of California Argonne National Laboratory Merced United States University of California Merced United States North Carolina State University United States

ISBN: (纸本)9798400700156

the emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. Deciding the placement of data objects on HM is critical for high performance. We reveal a performance problem related to data placement on HM. the problem is manifested as load imbalance among tasks in task-parallel HPC applications. the root of the problem comes from being unaware of parallel-task semantics and an incorrect assumption that bringing frequently accessed pages to fast memory always leads to better performance. To address this problem, we introduce a load balance-aware page management system, named Merchandiser. Merchandiser introduces task semantics during memory profiling, rather than being application-agnostic. Using the limited task semantics, Merchandiser effectively sets up coordination among tasks on the usage of HM to finish all tasks fast instead of only considering any individual task. Merchandiser is highly automated to enable high usability. Evaluating with memory-consuming HPC applications, we show that Merchandiser reduces load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution. © 2023 acm.

关键词： Semantics

来源：评论

学校读者我要写书评

暂无评论

Data locality and load balancing in COOL 93

Data locality and load balancing in COOL

引用

4th acm sigplan symposium on principles and practice of parallel programming, PPOPP 1993

作者： Chandra, Rohit Gupta, Anoop Hennessy, John L. Center for Integrated Systems Stanford University StanfordCA94305 United States

ISBN: (纸本)0897915895

Large-scale shared memory multiprocessors typically support a multilevel memory hierarchy consisting of per-processor caches, a local portion of shared memory, and remote shared memory. On such machines, the performance of parallel programs is often limited by the high latency of remote memory references. In this paper we explore how knowledge of the underlying memory hierarchy can be used to schedule computation and distribute data structures, and thereby improve data locality. Our study is done in the context of COOL, a concurrent object-oriented language developed at Stanford. We develop abstractions for the programmer to supply optional information about the data reference patterns of the program. this information is used by the runtime system to distribute tasks and objects so that the tasks execute close (in the memory hierarchy) to the objects they reference. We demonstrate the effectiveness of these techniques by applying them to several applications chosen from the SPLASH parallel benchmark suite. Our experience suggests that improving data locality can be simple through a combination of programmer abstractions and smart runtime scheduling.

关键词： Object oriented programming

来源：评论

学校读者我要写书评

暂无评论

thread to Strand Binding of parallel Network Applications in Massive Multi-threaded Systems

Thread to Strand Binding of Parallel Network Applications in...

引用

15th acm sigplan symposium on principles and practice of parallel programming

作者： Radojkovic, Petar Cakarevic, Vladimir Verdu, Javier Pajuelo, Alex Cazorla, Francisco J. Nemirovsky, Mario Valero, Mateo Univ Politecn Cataluna E-08028 Barcelona Spain CSIC Madrid Spain

ISBN: (纸本)9781605587080

In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the thread to Strand Binding or TSB. In this paper, we show that the TSB impact on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both, the characteristics of multithreaded processors under the study and the structure of the software pipelined applications. Once calibrated for a given processor architecture, our proposal does not require hardware knowledge on the side of the programmer, nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.

关键词： Algorithms Measurement Performance Process Scheduling Simultaneous Multithreading CMT UltraSPARC T2

来源：评论

学校读者我要写书评

暂无评论

没有更多数据了...

全选清除本页清除全部题录导出标记到“检索档案”

共34页 << < 22 23 24 25 26 27 28 29 30 31 > >>

检索报告对象比较合并检索0

隐藏清空

合并搜索

回到顶部

执行限定条件

内容：

评分：

请选择保存的检索档案：

请选择收藏分类：

订阅名称：

通借通还

温馨提示：

图书名称：

借书校区：

取书校区：

手机号码：

邮箱地址：

一卡通帐号：

电话和邮箱必须正确填写，我们会与您联系确认。

联系人：

所在院系：

联系邮箱：

联系电话：

内蒙古自治区呼和浩特市赛罕区大学西街235号邮编: 010021

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：