Search Results - Inner Mongolia University Library

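Help

Field codes:

T = title (book title, article title), A = author (responsible party), K = subject term, P = publication name, PU = publisher name, O = organization (author affiliation, degree-granting institution, patent applicant), L = Chinese Library Classification number, C = discipline classification number, U = all fields, Y = year (publication year, degree year, standard release year)

Search rules:

AND means "and"; OR means "or"; NOT means "does not contain". Operators must be uppercase, with one space on each side.

Search examples:

Example 1: (K=图书馆学 OR K=情报学) AND A=范并思 AND Y=1982-2016
Example 2: P=计算机应用与软件 AND (U=C++ OR U=Basic) NOT K=Visual AND Y=2011-2016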
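
The field codes and operators described above can be composed programmatically. The following is a minimal sketch that rebuilds the first search example as a string; the helper names (`term`, `any_of`, `all_of`) are hypothetical and are not part of the catalogue, which only documents the expression syntax itself.

```python
# Hypothetical helpers; only the field codes and operators come from the help text.
FIELD_CODES = {"T", "A", "K", "P", "PU", "O", "L", "C", "U", "Y"}


def term(field: str, value: str) -> str:
    """Return a single FIELD=value term, rejecting unknown field codes."""
    if field not in FIELD_CODES:
        raise ValueError(f"unknown field code: {field}")
    return f"{field}={value}"


def any_of(*terms: str) -> str:
    """Join terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Join terms with AND (uppercase, one space on each side)."""
    return " AND ".join(terms)


query = all_of(
    any_of(term("K", "图书馆学"), term("K", "情报学")),
    term("A", "范并思"),
    term("Y", "1982-2016"),
)
print(query)
```

Keeping the operators in one place makes it harder to violate the uppercase and spacing rule by accident.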


Search criteria: "Any field = 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"

379 records in total; showing records 41-50

Parallel Block-Delayed Sequences

27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Westrick, Sam; Rainey, Mike; Anderson, Daniel; Blelloch, Guy E.
Affiliation: Carnegie Mellon Univ, Pittsburgh, PA 15213, USA

ISBN: (print) 9781450392044

Programming languages using functions on collections of values, such as map, reduce, scan and filter, have been used for over fifty years. Such collections have proven to be particularly useful in the context of parallelism because such functions are naturally parallel. However, if implemented naively they lead to the generation of temporary intermediate collections that can significantly increase memory usage and runtime. To avoid this pitfall, many approaches use "fusion" to combine operations and avoid temporary results. However, most of these approaches involve significant changes to a compiler and are limited to a small set of functions, such as maps and reduces. In this paper we present a library-based approach that fuses widely used operations such as scans, filters, and flattens. In conjunction with existing techniques, this covers most of the common operations on collections. Our approach is based on a novel technique which parallelizes over blocks, with streams within each block. We demonstrate the approach by implementing libraries targeting multicore parallelism in two languages: Parallel ML and C++, which have very different semantics and compilers. To help users understand when to use the approach, we define a cost semantics that indicates when fusion occurs and how it reduces memory allocations. We present experimental results for a dozen benchmarks that demonstrate significant reductions in both time and space. In most cases the approach generates code that is near optimal for the machines it is running on.

Keywords: parallel programming fusion collections functional programming
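
To illustrate the general idea of fusing operations over blocks, here is a minimal sketch, not the authors' library. It assumes a hypothetical function `fused_map_filter_sum` that splits the input into fixed-size blocks and chains map and filter as lazy streams inside each block, so no intermediate collection is materialized. The blocks are processed sequentially here; in a parallel setting each block could be handled by a separate worker.

```python
def fused_map_filter_sum(data, block_size, f, keep):
    """Sum f(x) over the elements whose mapped value passes keep, block by block."""
    partial_sums = []
    for start in range(0, len(data), block_size):
        end = min(start + block_size, len(data))
        # Inside a block, map and filter are chained generators (a stream),
        # so no temporary list of mapped or filtered values is built.
        mapped = (f(data[i]) for i in range(start, end))
        kept = (y for y in mapped if keep(y))
        # Blocks are independent of each other, so this loop body is the
        # unit of work that could be run in parallel.
        partial_sums.append(sum(kept))
    return sum(partial_sums)


total = fused_map_filter_sum(
    list(range(10)), block_size=3, f=lambda x: x * x, keep=lambda y: y % 2 == 0
)
print(total)
```

The only per-block state is a single partial sum, which is what keeps the memory footprint independent of the input length.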

Lifetime-Based Optimization for Simulating Quantum Circuits on a New Sunway Supercomputer

28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023

Authors: Chen, Yaojian; Liu, Yong; Shi, Xinmin; Song, Jiawei; Liu, Xin; Gan, Lin; Guo, Chu; Fu, Haohuan; Gao, Jie; Chen, Dexun; Yang, Guangwen
Affiliations: Tsinghua University Beijing China National Supercomputing Center in Wuxi Zhejiang Lab Hangzhou China Information Engineering University Zhengzhou China National Supercomputing Center in Wuxi China Tsinghua University National Supercomputing Center in Wuxi China National Research Centre of Parallel Engineering and Technology Beijing China Tsinghua University National Supercomputing Center in Wuxi Zhejiang Lab Hangzhou China

ISBN: (print) 9798400700156

The high-performance classical simulator for quantum circuits, in particular the tensor network contraction algorithm, has become an important tool for the validation of noisy quantum computing. To address the memory limitations, the slicing technique is used to reduce the tensor dimensions, but it can also introduce additional computation overhead that greatly slows down the overall performance. This paper proposes novel lifetime-based methods to reduce the slicing overhead and improve the computing efficiency, including an interpretation method to deal with slicing overhead, an inplace slicing strategy to find the smallest slicing set, and an adaptive tensor network contraction path refiner customized for the Sunway architecture. Experiments show that in most cases the slicing overhead with our inplace slicing strategy is lower than that of Cotengra, which is the most widely used graph path optimization software at present. Finally, the resulting simulation time is reduced to 96.1 s for the Sycamore quantum processor RQC, with a sustainable single-precision performance of 308.6 Pflops using over 41M cores to generate 1M correlated samples, which is more than a 5-fold performance improvement over the 60.4 Pflops of the 2021 Gordon Bell Prize work. © 2023 Owner/Author.

Keywords: Timing circuits
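
The slicing technique mentioned in the abstract can be shown on the smallest possible case. The sketch below is an illustration under simplifying assumptions, not the paper's method: it contracts two matrices over their shared index `k`, once directly and once as a sum of independent sub-contractions with `k` fixed. The function names are hypothetical. In a larger tensor network, work that does not involve the sliced index is repeated for every slice, which is the overhead the abstract refers to; this two-tensor example has no such repeated work.

```python
def contract(a, b):
    """Full contraction over the shared index k: C[i][j] = sum_k a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)] for i in range(n)]


def contract_sliced(a, b):
    """Same result, computed as a sum of sub-contractions, one per fixed value of k."""
    n, m, p = len(a), len(b), len(b[0])
    total = [[0] * p for _ in range(n)]
    for k in range(m):
        col = [a[i][k] for i in range(n)]  # slice of a with k fixed (one dimension fewer)
        row = b[k]                         # slice of b with k fixed
        for i in range(n):
            for j in range(p):
                total[i][j] += col[i] * row[j]
    return total


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(contract(a, b))
print(contract_sliced(a, b))
```

Each iteration of the outer loop touches only lower-dimensional slices, so slices can be computed on separate workers and summed at the end.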

Poster: The Problem-Based Benchmark Suite (PBBS), V2

27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Anderson, Daniel; Blelloch, Guy E.; Dhulipala, Laxman; Dobson, Magdalen; Sun, Yihan
Affiliations: Carnegie Mellon Univ, Pittsburgh, PA, USA; Univ Maryland, Coll Park, MD, USA; UC Riverside, Riverside, CA, USA

ISBN: (print) 9781450392044

The Problem-Based Benchmark Suite (PBBS) is a set of benchmark problems designed for comparing algorithms, implementations and platforms. For each problem, the suite defines the problem in terms of the input-output relationship, and supplies a set of input instances along with input generators, a default implementation, code for checking correctness or accuracy, and a timing harness. The suite makes it possible to compare different algorithms, platforms (e.g. GPU vs. CPU), and implementations using different programming languages or libraries. The purpose is to better understand how well a wide variety of problems parallelize, and what techniques/algorithms are most effective. The suite was first announced in 2012 with 14 benchmark problems. Here we describe some significant updates. In particular, we have added nine new benchmarks from a mix of problems in text processing, computational geometry and machine learning. We have further optimized the default implementations; several are the fastest available for multicore CPUs, often achieving near perfect speedup on the 72-core machine we test them on. The suite now also supplies significantly larger default test instances, as well as a broader variety, with many derived from real-world data.

Keywords: benchmarking parallel algorithms performance
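
The four ingredients listed in the abstract (input generator, implementation, correctness checker, timing harness) can be sketched in a few lines. This is not PBBS code: sorting is used purely as a hypothetical stand-in problem, and `generate_input`, `check` and `run_benchmark` are made-up names.

```python
import random
import time
from collections import Counter


def generate_input(n, seed=0):
    """Input generator: n pseudo-random integers, reproducible through the seed."""
    rng = random.Random(seed)
    return [rng.randrange(1_000_000) for _ in range(n)]


def check(inp, out):
    """Checker: the output must be non-decreasing and a permutation of the input."""
    ordered = all(x <= y for x, y in zip(out, out[1:]))
    return ordered and Counter(inp) == Counter(out)


def run_benchmark(implementation, n=10_000, rounds=3):
    """Timing harness: return the best wall-clock time over several checked rounds."""
    inp = generate_input(n)
    best = float("inf")
    for _ in range(rounds):
        data = list(inp)  # fresh copy so every round sees the same input
        start = time.perf_counter()
        out = implementation(data)
        best = min(best, time.perf_counter() - start)
        if not check(inp, out):
            raise AssertionError("implementation produced an incorrect result")
    return best


print(f"best time: {run_benchmark(sorted):.6f} s")
```

Because the problem is specified only by the checker, any implementation with the same input-output behaviour can be passed to `run_benchmark` and compared on equal terms.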

A Scalable Hybrid Total FETI Method for Massively Parallel FEM Simulations

28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023

Authors: Lin, Kehao; Zhou, Chunbao; Zeng, Yan; Nie, Ningming; Wang, Jue; Li, Shigang; Feng, Yangde; Wang, Yangang; Yao, Kehan; Yao, Tiechui; Zhang, Jilin; Wan, Jian
Affiliations: Hangzhou Dianzi University, Hangzhou, China; Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China

ISBN: (print) 9798400700156

The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method plays an important role in solving large-scale and complex engineering problems. This method needs to handle numerous matrix-vector multiplications. Directly calling the vendor-optimized library for general matrix-vector multiplication (gemv) on GPU leads to low performance, since it does not consider optimizations for different matrix sizes in HTFETI, i.e. different row and column sizes. In addition, state-of-the-art graph partitioning methods cannot guarantee load balancing for HTFETI, since the matrix size is determined by the length of the subdomain boundary. To solve the problems above, we first port gemv to the multi-stream pipeline scheme and develop a new batched kernel function on GPU, which bring a 15% to 30% throughput improvement and a 37% average GFLOPs improvement, respectively. We also propose a multi-grained load-balancing scheme based on graph repartitioning and work-stealing, and the load imbalance ratio is reduced from 1.5 to between 1.05 and 1.09. We have successfully applied the scalable HTFETI method to simulate the whole core assembly of the China Experimental Fast Reactor (CEFR) for steady-state analysis, and the efficiencies of weak scalability and strong scalability reach 78% and 72% on 12,288 GPUs, respectively. As far as we know, this is the first time that HTFETI has been used in large-scale and high-fidelity whole core assembly simulation. © 2023 Owner/Author.

Keywords: Scalability
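
The abstract reports load imbalance ratios without defining them. The sketch below assumes the common definition, maximum load divided by mean load, which may differ from the one the authors use. The per-worker loads are invented numbers chosen only to reproduce ratios of the same magnitude as those quoted.

```python
def load_imbalance_ratio(loads):
    """Maximum load divided by mean load; 1.0 means perfectly balanced.

    Assumption: this is a common definition, not one confirmed by the abstract.
    """
    if not loads:
        raise ValueError("loads must be non-empty")
    mean = sum(loads) / len(loads)
    return max(loads) / mean


# Hypothetical per-worker loads before and after rebalancing (same total work).
before = [150, 50, 100, 100]
after = [105, 95, 100, 100]
print(load_imbalance_ratio(before))
print(load_imbalance_ratio(after))
```

Under this definition the slowest worker determines the step time, so lowering the ratio from 1.5 to 1.05 shortens the critical path while the total work stays the same.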

PERFLOW: A Domain Specific Framework for Automatic Performance Analysis of Parallel Applications

27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Jin, Yuyang; Wang, Haojie; Zhong, Runxin; Zhang, Chen; Zhai, Jidong
Affiliation: Tsinghua Univ, Beijing, Peoples R China

ISBN: (print) 9781450392044

Performance analysis is widely used to identify performance issues of parallel applications. However, complex communications and data dependence, as well as the interactions between different kinds of performance issues, make high-efficiency performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing root causes for such complex performance issues still needs specific in-depth analysis. To implement each such analysis, significant human efforts and domain knowledge are normally required. To reduce the burden of implementing accurate performance analysis, we propose a domain specific programming framework, named PERFLOW. PERFLOW abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of main performance analysis sub-tasks, called passes, which can either be provided by PERFLOW's built-in analysis library, or be implemented by developers to meet their requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution and then leverage various graph algorithms to automate the analysis. We demonstrate the efficacy of PERFLOW by three case studies of real-world applications with up to 700K lines of code. Results show that PERFLOW significantly eases the implementation of customized analysis tasks. In addition, PERFLOW is able to perform analysis and locate performance bugs automatically and effectively.

Keywords: Performance Analysis Domain Specific Framework Dataflow Graph
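
The idea of expressing an analysis as a dataflow graph of passes can be sketched generically. The code below does not use PERFLOW's API, which is not shown in the abstract. Every name in it (`run_passes`, the three passes, the sample format) is hypothetical, and it relies on `graphlib` from the Python standard library (3.9 or later) to order the passes.

```python
from graphlib import TopologicalSorter


def run_passes(passes, deps, source):
    """Run every pass after its dependencies, feeding it their outputs in order."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        # A pass without dependencies receives the raw source data.
        inputs = [results[d] for d in deps.get(name, ())] or [source]
        results[name] = passes[name](*inputs)
    return results


def aggregate(samples):
    """Sum the sampled seconds per (function, rank) pair."""
    totals = {}
    for func, rank, seconds in samples:
        totals[(func, rank)] = totals.get((func, rank), 0.0) + seconds
    return totals


def hotspot(totals):
    """Return the function with the largest total time across ranks."""
    per_func = {}
    for (func, _rank), seconds in totals.items():
        per_func[func] = per_func.get(func, 0.0) + seconds
    return max(per_func, key=per_func.get)


def slowest_rank(totals):
    """Return the rank with the largest total time across functions."""
    per_rank = {}
    for (_func, rank), seconds in totals.items():
        per_rank[rank] = per_rank.get(rank, 0.0) + seconds
    return max(per_rank, key=per_rank.get)


passes = {
    "aggregate": aggregate,
    "hotspot": hotspot,
    "slowest_rank": slowest_rank,
    "report": lambda func, rank: {"hotspot": func, "slowest_rank": rank},
}
# Each key lists the passes whose outputs it consumes.
deps = {
    "hotspot": ["aggregate"],
    "slowest_rank": ["aggregate"],
    "report": ["hotspot", "slowest_rank"],
}
samples = [("solve", 0, 2.0), ("solve", 1, 5.0), ("exchange", 0, 1.0), ("exchange", 1, 0.5)]
results = run_passes(passes, deps, samples)
print(results["report"])
```

Adding a custom analysis then amounts to registering one more function and one more entry in `deps`, without touching the existing passes.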

Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization

28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023

Authors: Zhang, Weihua; Zhao, Chuanlei; Peng, Lu; Lin, Yuzhe; Zhang, Fengzhe; Lu, Yunping
Affiliations: School of Computer Science, Fudan University, China; Institute of Big Data, Fudan University, China; State Key Laboratory of Mathematical Engineering and Advanced Computing, China; Parallel Processing Institute, Fudan University, China; Department of Computer Science, Tulane University, United States

ISBN: (print) 9798400700156

Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPUs have shown their potential to accelerate concurrent B+tree performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and increased variance in response time. Moreover, previous methods do not guarantee linearizability among concurrent requests. In this paper, we design a combining-based concurrency control framework, called Eirene, for GPU B+trees to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of retries reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%. © 2023 ACM.

Keywords: Graphics processing unit
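
The combining step described in the abstract (group requests by key, issue one request per key, derive every return value locally) can be illustrated on the CPU with a plain dictionary standing in for the B+tree. This is a sequential sketch under stated assumptions, not Eirene's GPU implementation. The request format `(op, key, value)` and the single read-and-optionally-write access in `issue` are hypothetical modelling choices.

```python
_NO_WRITE = object()  # sentinel meaning "read only"


def issue(store, key, new_value=_NO_WRITE):
    """The single issued request per key: return the old value, optionally write a new one."""
    old = store.get(key)
    if new_value is not _NO_WRITE:
        store[key] = new_value
    return old


def process_batch(store, batch):
    """Process (op, key, value) requests in batch order with one store access per key."""
    by_key = {}
    for idx, (op, key, value) in enumerate(batch):
        by_key.setdefault(key, []).append((idx, op, value))

    results = [None] * len(batch)
    for key, requests in by_key.items():
        puts = [value for _idx, op, value in requests if op == "put"]
        # Only the last put needs to reach the store; earlier ones are overwritten.
        current = issue(store, key, puts[-1]) if puts else issue(store, key)
        # Replay the requests in order to determine what each get observes.
        for idx, op, value in requests:
            if op == "put":
                current = value
            else:
                results[idx] = current
    return results


store = {"a": 1}
batch = [("get", "a", None), ("put", "a", 2), ("get", "a", None), ("get", "b", None), ("put", "a", 3)]
print(process_batch(store, batch))
print(store)
```

Because all requests on one key are resolved inside a single group, two groups never touch the same key, which is why key conflicts disappear in this model.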

Parallel k-Core Decomposition with Batched Updates and Asynchronous Reads

29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2024

Authors: Liu, Quanquan C.; Shun, Julian; Zablotchi, Igor
Affiliations: Yale University, United States; MIT CSAIL, United States; Mysten Labs, Switzerland

ISBN: (print) 9798400704352

Maintaining a dynamic k-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for maintaining an approximate k-core decomposition. In their solution, both reads and updates need to be batched, and therefore each type of operation can incur high latency waiting for the other type to finish. To tackle most real-world workloads, which are dominated by reads, this paper presents a novel hybrid concurrent-parallel dynamic k-core data structure where asynchronous reads can proceed concurrently with batches of updates, leading to significantly lower read latencies. Our approach is based on tracking causal dependencies between updates, so that causally related groups of updates appear atomic to concurrent readers. Our data structure guarantees linearizability and liveness for both reads and updates, and maintains the same approximation guarantees as prior work. Our experimental evaluation on a 30-core machine shows that our approach reduces read latency by orders of magnitude compared to the batch-dynamic algorithm, by up to a factor of 4.05 × 10^5. Compared to an unsynchronized (non-linearizable) baseline, our read latency overhead is at most a factor of 3.21 greater, while improving accuracy of coreness estimates by up to a factor of 52.7. © 2024 Copyright held by the owner/author(s).

Keywords: Data structures
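
For readers unfamiliar with the underlying problem, the sketch below computes an exact, static k-core decomposition with the standard sequential peeling procedure. It is background only: it is neither dynamic, parallel nor approximate, so it is not the data structure the paper proposes. The function name and the adjacency-dictionary input format are assumptions.

```python
def core_numbers(adj):
    """adj maps each vertex to the set of its neighbours (undirected graph).

    Returns a dict mapping each vertex to its coreness, obtained by repeatedly
    removing a vertex of minimum remaining degree.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])  # coreness never decreases along the peeling order
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# A triangle (0, 1, 2) with one pendant vertex 3 attached to vertex 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(core_numbers(graph))
```

This version rescans the remaining vertices at every step and therefore runs in quadratic time; bucket-based variants achieve linear time but are longer to write down.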

OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code

28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023

Authors: Schardl, Tao B.; Lee, I.-Ting Angelina
Affiliations: MIT CSAIL, United States; Washington University, St. Louis, United States

ISBN: (print) 9798400700156

This paper presents OpenCilk, an open-source software infrastructure for task-parallel programming that allows for substantial code reuse and easy exploration of design choices in language abstraction, compilation strategy, runtime mechanism, and productivity-tool development. The OpenCilk infrastructure consists of three main components: a compiler designed to compile fork-join task-parallel code, an efficient work-stealing runtime scheduler, and a productivity-tool development framework based on compiler instrumentation designed for fork-join parallel computations. OpenCilk is modular - modifying one component for the most part does not necessitate modifications to the other components - and easy to extend - its construction naturally encourages code reuse. Despite being modular and easy to extend, OpenCilk produces high-performing code. We investigated OpenCilk's modularity, extensibility, and performance through several case studies, including a study to extend OpenCilk to support multiple parallel runtime systems, including Cilk Plus, OpenMP, and oneTBB. OpenCilk's design enables rapid prototyping of new compiler back ends to target different parallel-runtime ABIs. Each back end required fewer than 2000 new lines of code. We examined the OpenCilk runtime's performance empirically on 15 benchmark Cilk programs and found that it outperforms the other runtimes by a geometric mean of 4% - 26% on 1 core and 10% - 120% on 48 cores. © 2023 Owner/Author.

Keywords: parallel programming

TileSpGEMM: A Tiled Algorithm for Parallel Sparse General Matrix-Matrix Multiplication on GPUs

27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Niu, Yuyao; Lu, Zhengyang; Ji, Haonan; Song, Shuhui; Jin, Zhou; Liu, Weifeng
Affiliation: China Univ Petr, Super Sci Software Lab, Beijing, Peoples R China

ISBN: (print) 9781450392044

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks and machine learning applications. The existing parallel approaches for shared memory SpGEMM mostly use the row-row style with possibly good parallelism. However, because of the irregularity in sparsity structures, the existing row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and unsatisfactory data locality, and (3) sparse accumulator selection. In this paper, we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method in dense general matrix-matrix multiplication (GEMM), and saves each non-empty tile in a sparse form. Its first advantage is that the basic working unit is now a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row. Thus the load imbalance issue can be naturally alleviated. Secondly, the temporary space needed for each tile is small and can always be in on-chip scratchpad memory. Thus there is no need to allocate an off-chip space for a large amount of intermediate products, and the data locality can be much better. Thirdly, because the computations are restricted within a single tile, it is relatively easier to select a fast sparse accumulator for a sparse tile. Our experimental results on two newest NVIDIA GPUs show that our TileSpGEMM outperforms four state-of-the-art SpGEMM methods cuSPARSE, bhSPARSE, NSPARSE and spECK in 139, 138, 127 and 94 out of all 142 square matrices executing no less than one billion flops for an SpGEMM operation, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups, respectively.

Keywords: Sparse matrix SpGEMM tiled algorithm GPU
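
The core idea of storing only non-empty tiles in sparse form and multiplying tile by tile can be sketched sequentially. The code below is a CPU illustration under simplifying assumptions, not the TileSpGEMM GPU kernel: tiles are plain dictionaries, the tile edge `TILE` is an arbitrary value, and all function names are hypothetical.

```python
from collections import defaultdict

TILE = 2  # hypothetical tile edge length


def to_tiles(entries):
    """entries maps (row, col) to a value. Only non-empty tiles are created."""
    tiles = defaultdict(dict)
    for (r, c), v in entries.items():
        tiles[(r // TILE, c // TILE)][(r % TILE, c % TILE)] = v
    return dict(tiles)


def tile_spgemm(a_tiles, b_tiles):
    """Multiply two tiled sparse matrices; each output tile has its own small accumulator."""
    # Index the tiles of B by tile row so matching tiles are found directly.
    b_by_row = defaultdict(list)
    for (k, j), tile in b_tiles.items():
        b_by_row[k].append((j, tile))

    c_tiles = defaultdict(lambda: defaultdict(float))
    for (i, k), a_tile in a_tiles.items():
        for j, b_tile in b_by_row.get(k, ()):
            acc = c_tiles[(i, j)]
            for (r, x), av in a_tile.items():
                for (y, c), bv in b_tile.items():
                    if x == y:  # inner indices must match within the tile pair
                        acc[(r, c)] += av * bv
    # Drop tiles that received no products.
    return {t: dict(vals) for t, vals in c_tiles.items() if vals}


def from_tiles(tiles):
    """Convert tiles back to a dict keyed by global (row, col)."""
    return {
        (ti * TILE + r, tj * TILE + c): v
        for (ti, tj), tile in tiles.items()
        for (r, c), v in tile.items()
    }


a = {(0, 0): 1, (0, 2): 2, (2, 1): 3}
b = {(0, 1): 4, (2, 1): 5, (1, 0): 6}
product = from_tiles(tile_spgemm(to_tiles(a), to_tiles(b)))
print(product)
```

Every unit of work is bounded by the tile size, which mirrors the abstract's argument that fixed-size tiles even out the load compared with rows of very different lengths.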

Multi-Queues Can Be State-of-the-Art Priority Schedulers

27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Postnikova, Anastasiia; Koval, Nikita; Nadiradze, Giorgi; Alistarh, Dan
Affiliations: ITMO Univ, St Petersburg, Russia; JetBrains, Prague, Czech Republic; IST Austria, Klosterneuburg, Austria

ISBN: (print) 9781450392044

Designing and implementing efficient parallel priority schedulers is an active research area. An intriguing proposed design is the Multi-Queue: given n threads and m >= n distinct priority queues, task insertions are performed uniformly at random, while, to delete, a thread picks two queues uniformly at random, and removes the observed task of higher priority. This approach scales well, and has probabilistic rank guarantees: roughly, the rank of each task removed, relative to remaining tasks in all other queues, is O(m) in expectation. Yet, the performance of this pattern is below that of well-engineered schedulers, which eschew theoretical guarantees for practical efficiency. We investigate whether it is possible to design and implement a Multi-Queue-based task scheduler that is both highly efficient and has analytical guarantees. We propose a new variant called the Stealing Multi-Queue (SMQ), a cache-efficient variant of the Multi-Queue, which leverages both queue affinity (each thread has a local queue, from which tasks are usually removed; but, with some probability, threads also attempt to steal higher-priority tasks from the other queues) and task batching, that is, the processing of several tasks in a single insert/remove step. These ideas are well-known for task scheduling without priorities; our theoretical contribution is showing that, despite relaxations, this design can still provide rank guarantees, which in turn implies bounds on total work performed. We provide a general SMQ implementation which can surpass state-of-the-art schedulers such as OBIM and PMOD in terms of performance on popular graph-processing benchmarks. Notably, the performance improvement comes mainly from the superior rank guarantees provided by our scheduler, confirming that analytically-reasoned approaches can still provide performance improvements for priority task scheduling. The full version of this paper is available in [24].

Keywords: priority scheduling relaxed algorithms concurrency parallel graph processing
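
The basic Multi-Queue rule stated in the abstract (insert into a random queue, delete the better of two randomly sampled queue tops) can be written down in a few lines. This single-threaded sketch omits the locking a concurrent version needs and does not include the stealing or batching of the SMQ variant. The class and method names are hypothetical, lower numbers are assumed to mean higher priority, and the two queues are sampled with replacement, a detail the abstract does not specify.

```python
import heapq
import random


class MultiQueue:
    """Single-threaded sketch of a relaxed priority queue built from several heaps."""

    def __init__(self, num_queues, seed=None):
        self._queues = [[] for _ in range(num_queues)]
        self._rng = random.Random(seed)

    def insert(self, priority, task):
        """Push the task onto a queue chosen uniformly at random."""
        heapq.heappush(self._rng.choice(self._queues), (priority, task))

    def delete(self):
        """Sample two queues and pop the better of their top tasks.

        Returns None if both sampled queues are empty, even though other
        queues may still hold tasks.
        """
        first = self._rng.choice(self._queues)
        second = self._rng.choice(self._queues)
        candidates = [q for q in (first, second) if q]
        if not candidates:
            return None
        best = min(candidates, key=lambda q: q[0])
        return heapq.heappop(best)


mq = MultiQueue(num_queues=4, seed=1)
for p in range(20):
    mq.insert(p, f"task-{p}")

removed = []
while len(removed) < 20:
    item = mq.delete()
    if item is not None:
        removed.append(item)
print([priority for priority, _task in removed])
```

The removal order is only approximately sorted, which is the relaxation that the rank guarantees in the abstract quantify.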
