Search results - Inner Mongolia University Library



Search query: "Any field = Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming"

330 records in total; showing records 131-140


Extracting logical structure and identifying stragglers in parallel execution traces

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Isaacs, Katherine E.; Gamblin, Todd; Bhatele, Abhinav; Bremer, Peer-Timo; Schulz, Martin; Hamann, Bernd

Affiliations: Department of Computer Science, University of California, Davis, United States; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, United States

ISBN: (print) 9781450326568

We introduce a new approach to automatically extract an idealized logical structure from a parallel execution trace. We use this structure to define intuitive metrics such as the lateness of a process involved in a parallel execution. By analyzing and illustrating traces in terms of logical steps, we leverage a developer's understanding of the happened-before relations in a parallel program. This technique can uncover dependency chains, elucidate communication patterns, and highlight sources and propagation of delays, all of which may be obscured in a traditional trace visualization.

Keywords: Visualization


Fine-grain parallel megabase sequence comparison with multiple heterogeneous GPUs

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: De Sandes, Edans F.O.; Miranda, Guillermo; Melo, Alba C.M.A.; Martorell, Xavier; Ayguadé, Eduard

Affiliations: University of Brasilia, Brazil; Universitat Politècnica de Catalunya, Barcelona Supercomputing Center, Spain

ISBN: (print) 9781450326568

This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism that hides the communication overhead. We compared 4 pairs of human-chimpanzee homologous chromosomes using 2 different GPU environments, obtaining a performance of up to 140.36 GCUPS (billions of cells processed per second) with 3 heterogeneous GPUs.

Keywords: Graphics processing unit
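
For readers unfamiliar with the recurrence behind the record above, the following is a minimal single-threaded sketch of Smith-Waterman local-alignment scoring together with the GCUPS metric. It is not the paper's multi-GPU strategy: the function names, the scoring parameters (`match`, `mismatch`, `gap`) and the linear gap penalty are assumptions chosen for illustration only.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local-alignment score of strings a and b.

    Assumed scoring scheme (illustrative): linear gap penalty.
    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def gcups(len_a, len_b, seconds):
    """Billions of matrix cells processed per second."""
    return len_a * len_b / seconds / 1e9


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # the shared "ACGT" scores 4
```

Each cell depends only on its left, upper and upper-left neighbours, so a matrix split into column blocks needs only the border column of the neighbouring block, which is consistent with the border-element exchange the abstract describes.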


Triolet: A programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Rodrigues, Christopher; Jablin, Thomas; Dakkak, Abdul; Hwu, Wen-Mei

Affiliations: University of Illinois at Urbana-Champaign, United States

ISBN: (print) 9781450326568

Functional algorithmic skeletons promise a high-level programming interface for distributed-memory clusters that free developers from concerns of task decomposition, scheduling, and communication. Unfortunately, prior distributed functional skeleton frameworks do not deliver performance comparable to that achievable in a low-level distributed programming model such as C with MPI and OpenMP, even when used in concert with high-performance array libraries. There are several causes: they do not take advantage of shared memory on each cluster node; they impose a fixed partitioning strategy on input data; and they have limited ability to fuse loops involving skeletons that produce a variable number of outputs per input. We address these shortcomings in the Triolet programming language through a modular library design that separates concerns of parallelism, loop nesting, and data partitioning. We show how Triolet substantially improves the parallel performance of algorithms involving array traversals and nested, variable-size loops over what is achievable in Eden, a distributed variant of Haskell. We further demonstrate how Triolet can substantially simplify parallel programming relative to C with MPI and OpenMP while achieving 23 to 100% of its performance on a 128-core cluster. Copyright © 2014 ACM.

Keywords: parallel programming


Designing and auto-tuning parallel 3-D FFT for computation-communication overlap

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Song, Sukhyun; Hollingsworth, Jeffrey K.

Affiliations: Department of Computer Science, University of Maryland, College Park, United States

ISBN: (print) 9781450326568

This paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code performs fully asynchronous communication without any support from special hardware. We also improve cache performance through loop tiling. To cope with the complex tradeoff regarding our optimization techniques, we parameterize our code and auto-tune the parameters efficiently in a large parameter space. Experimental results from two systems confirm that our code achieves a speedup of up to 1.76× over the FFTW library. Copyright © 2014 ACM.

Keywords: Fast Fourier transforms


Efficient deterministic multithreading without global barriers

Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Lu, Kai; Zhou, Xu; Bergan, Tom; Wang, Xiaoping

Affiliations: Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha, China; College of Computer, National University of Defense Technology, Changsha, China; University of Washington, Computer Science and Engineering, United States

ISBN: (print) 9781450326568

Multithreaded programs execute nondeterministically on conventional architectures and operating systems. This complicates many tasks, including debugging and testing. Deterministic multithreading (DMT) makes the output of a multithreaded program depend on its inputs only, which can fully solve this problem. However, current DMT implementations suffer from a common inefficiency: they use frequent global barriers to enforce a deterministic ordering on memory accesses. In this paper, we eliminate that inefficiency using an execution model we call deterministic lazy release consistency (DLRC). Our execution model uses the Kendo algorithm to enforce a deterministic ordering on synchronization, and it uses a deterministic version of the lazy release consistency memory model to propagate memory updates across threads. Our approach guarantees that programs execute deterministically even when they contain data races. We implemented a DMT system based on these ideas (RFDet) and evaluated it using 17 parallel applications. Our implementation targets C/C++ programs that use POSIX threads. Results show that RFDet gains nearly 2x speedup compared with Dthreads, a state-of-the-art DMT system. Copyright © 2014 ACM.

Keywords: C++ (programming language)


Well-structured futures and cache locality

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Herlihy, Maurice; Liu, Zhiyu

Affiliations: Computer Science Department, Brown University, United States

ISBN: (print) 9781450326568

In fork-join parallelism, a sequential program is split into a directed acyclic graph of tasks linked by directed dependency edges, and the tasks are executed, possibly in parallel, in an order consistent with their dependencies. A popular and effective way to extend fork-join parallelism is to allow threads to create futures. A thread creates a future to hold the results of a computation, which may or may not be executed in parallel. That result is returned when some thread touches that future, blocking if necessary until the result is ready. Recent research has shown that while futures can, of course, enhance parallelism in a structured way, they can have a deleterious effect on cache locality. In the worst case, futures can incur $\Omega(PT_\infty + tT_\infty)$ deviations, which implies $\Omega(CPT_\infty + CtT_\infty)$ additional cache misses, where $C$ is the number of cache lines, $P$ is the number of processors, $t$ is the number of touches, and $T_\infty$ is the computation span. Since cache locality has a large impact on software performance on modern multicores, this result is troubling. In this paper, however, we show that if futures are used in a simple, disciplined way, then the situation is much better: if each future is touched only once, either by the thread that created it, or by a later descendant of the thread that created it, then parallel executions with work stealing can incur at most $O(CPT_\infty^2)$ additional cache misses, a substantial improvement. This structured use of futures is characteristic of many (but not all) parallel applications. Copyright © 2014 ACM.

Keywords: parallel programming
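
The disciplined pattern described in the abstract above (a future that is touched exactly once, by the thread that created it) can be sketched with Python's standard `concurrent.futures` module. This is only an illustration of the usage pattern: `fib` and `sum_with_future` are hypothetical helper names, and Python's thread pool is not a work-stealing scheduler, so the sketch says nothing about the cache-miss bounds.

```python
from concurrent.futures import ThreadPoolExecutor


def fib(n):
    """Plain recursive Fibonacci, used here only as placeholder work."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def sum_with_future(pool, n):
    """Create one future and touch it exactly once, from the creating thread."""
    fut = pool.submit(fib, n)      # create the future
    local = fib(n - 1)             # the creator continues with other work
    return local + fut.result()    # the single touch; blocks until ready


with ThreadPoolExecutor(max_workers=2) as pool:
    print(sum_with_future(pool, 10))  # fib(9) + fib(10) = 34 + 55 = 89
```

Passing `fut` to several unrelated threads that each call `result()` would be an instance of the unstructured use that the abstract contrasts with this pattern.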


Data structures for task-based priority scheduling

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Wimmer, Martin; Versaci, Francesco; Träff, Jesper Larsson; Cederman, Daniel; Tsigas, Philippas

Affiliations: Faculty of Informatics, Parallel Computing, Vienna University of Technology, 1040 Vienna/Wien, Austria; Computer Science and Engineering, Chalmers University of Technology, 412 96 Göteborg, Sweden

ISBN: (print) 9781450326568

We present three lock-free data structures for priority task scheduling: a priority work-stealing one, a centralized one with ρ-relaxed semantics, and a hybrid one combining both concepts. With the single-source shortest path (SSSP) problem as example, we show how the different approaches affect the prioritization and provide upper bounds on the number of examined nodes. We argue that priority task scheduling allows for an intuitive and easy way to parallelize the SSSP problem, notoriously a hard task. Experimental evidence supports the good scalability of the resulting algorithm. The larger aim of this work is to understand the trade-offs between scalability and priority guarantees in task scheduling systems. We show that ρ-relaxation is a valuable technique for improving the first, while still allowing semantic constraints to be satisfied: the lock-free, hybrid κ-priority data structure can scale as well as work-stealing, while still providing strong priority scheduling guarantees, which depend on the parameter κ. Our theoretical results open up possibilities for even more scalable data structures by adopting a weaker form of ρ-relaxation, which still enables the semantic constraints to be respected.

Keywords: Scalability
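
To make the trade-off between priority guarantees and examined nodes concrete, here is a small sequential sketch of SSSP driven by a relaxed priority queue. The relaxation rule used here (a pop may return any of the `k` smallest entries) is an assumption chosen for illustration and is not the paper's ρ-relaxed or κ-priority semantics; `relaxed_sssp` and its parameters are hypothetical names, and edge weights are assumed to be non-negative.

```python
import bisect
import random


def relaxed_sssp(graph, source, k=3, seed=0):
    """Label-correcting SSSP over a relaxed priority queue.

    graph maps every node to a list of (neighbour, weight) pairs.
    Returns (distances, number of examined nodes).
    """
    rng = random.Random(seed)
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    queue = [(0, source)]          # kept sorted by tentative distance
    examined = 0
    while queue:
        # Relaxed pop: any of the k smallest entries may be returned.
        d, u = queue.pop(rng.randrange(min(k, len(queue))))
        if d > dist[u]:
            continue               # stale entry, a shorter path was found since
        examined += 1
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                bisect.insort(queue, (dist[v], v))
    return dist, examined


graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2), ("d", 6)], "c": [("d", 3)], "d": []}
print(relaxed_sssp(graph, "a", k=1))  # k=1 behaves like Dijkstra's algorithm
```

Because a node is re-examined whenever its tentative distance improves, the final distances do not depend on `k`; what can change is how many nodes are examined, which is the quantity the abstract bounds.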


SCCMulti: An improved parallel strongly connected components algorithm

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Tomkins, Daniel; Smith, Timmie; Amato, Nancy M.; Rauchwerger, Lawrence

Affiliations: Parasol Laboratory, Department of Computer Science and Engineering, Texas A&M University, United States

ISBN: (print) 9781450326568

Tarjan's famous linear time, sequential algorithm for finding the strongly connected components (SCCs) of a graph relies on depth first search, which is inherently sequential. Deterministic parallel algorithms solve this problem in logarithmic time using matrix multiplication techniques, but matrix multiplication requires a large amount of total work. Randomized algorithms based on reachability - the ability to get from one vertex to another along a directed path - greatly improve the work bound in the average case. However, these algorithms do not always perform well; for instance, Divide-and-Conquer Strong Components (DCSC), a scalable, divide-and-conquer algorithm, has good expected theoretical limits, but can perform very poorly on graphs for which the maximum reachability of any vertex is small. A related algorithm, MultiPivot, gives very high probability guarantees on the total amount of work for all graphs, but this improvement introduces an overhead that increases the average running time. This work introduces SCCMulti, a multi-pivot improvement of DCSC that offers the same consistency as MultiPivot without the time overhead. We provide experimental results demonstrating SCCMulti's scalability; these results also show that SCCMulti is more consistent than DCSC and is always faster than MultiPivot.

Keywords: Matrix algebra
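
The reachability-based divide-and-conquer idea mentioned in the abstract above can be sketched sequentially as follows. This is a single-pivot forward/backward scheme in the spirit of DCSC, not SCCMulti or MultiPivot; `reachable` and `fw_bw_scc` are hypothetical names, and the three sub-problems that a parallel version would process independently are simply pushed on a work list here.

```python
import random


def reachable(adj, start, allowed):
    """Nodes reachable from start along adj edges, staying inside allowed."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in allowed and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def fw_bw_scc(succ, seed=0):
    """Strongly connected components via forward/backward reachability.

    succ maps every node (including sinks) to a list of its successors.
    """
    rng = random.Random(seed)
    pred = {v: [] for v in succ}
    for u, outs in succ.items():
        for v in outs:
            pred[v].append(u)
    components = []
    work = [set(succ)]
    while work:
        nodes = work.pop()
        if not nodes:
            continue
        pivot = rng.choice(list(nodes))       # randomly chosen pivot
        fwd = reachable(succ, pivot, nodes)   # descendants of the pivot
        bwd = reachable(pred, pivot, nodes)   # ancestors of the pivot
        scc = fwd & bwd                       # the pivot's component
        components.append(scc)
        # Every other SCC lies entirely inside one of these three sets.
        work.extend([fwd - scc, bwd - scc, nodes - fwd - bwd])
    return components


succ = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: []}
print(sorted(sorted(c) for c in fw_bw_scc(succ)))  # [[1, 2, 3], [4, 5], [6]]
```

When the pivot reaches only a few vertices, the third set stays almost as large as the input, which matches the abstract's remark that DCSC can perform poorly when the maximum reachability of any vertex is small.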


CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Yang, Yi; Zhou, Huiyang

Affiliations: Department of Computing Systems Architecture, NEC Laboratories America Inc., United States; Department of Electrical and Computer Engineering, North Carolina State University, United States

ISBN: (print) 9781450326568

Parallel programs consist of series of code sections with different thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest Nvidia Kepler architecture introduces dynamic parallelism, which allows a GPU thread to start another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degrees of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implemented our proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections. Then, our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times and 2.18 times on average. Copyright © 2014 ACM.

Keywords: Application programming interfaces (API)


Resilient X10: Efficient failure-aware programming

2014 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2014

Authors: Cunningham, David; Grove, David; Herta, Benjamin; Iyengar, Arun; Kawachiya, Kiyokuni; Murata, Hiroki; Saraswat, Vijay; Takeuchi, Mikio; Tardieu, Olivier

Affiliations: IBM T. J. Watson Research Center, Japan; Google Inc., Japan; IBM Research Tokyo, Japan

ISBN: (print) 9781450326568

Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of Map Reduce, Resilient Data Sets and MillWheel has shown dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically. We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a Happens Before Invariance Principle and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show this reduces much of the burden of resilient programming. The programmer is only responsible for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge. We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. We show that for these programs the overhead of resilience is a small fraction of overall runtime by comparing to equivalent non-resilient X10 programs. On one program we show end-to-end performance of Resilient X10 is ∼100x faster than Hadoop. Copyright © 2014 ACM.

Keywords: Fault tolerance



235 Daxue West Street, Saihan District, Hohhot, Inner Mongolia Autonomous Region. Postal code: 010021
