检索结果-内蒙古大学图书馆

您好，读者！请登录

内蒙古大学图书馆

首页
概况
党建
资源
服务
科研支持
- 论文收录引用证明
- 科技查新
知识产权
档案馆
帮助

咨询与建议

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

您的常用邮箱：*

您的手机号码：*

问题描述：

当前已输入0个字，您还可以输入200个字

全部搜索
期刊论文
图书
学位论文
标准
纸本馆藏
外文资源发现
数据库导航
超星发现

高级检索

分类表

所选分类

>> <<

限定检索结果

标题

标题
作者
主题词
出版物名称
出版社
机构
学科分类号
摘要
ISBN
ISSN
基金资助
索书号

作者

作者
标题
主题词
出版物名称
出版社
机构
学科分类号
摘要
ISBN
ISSN
基金资助
索书号

文献类型

361 篇 会议
46 篇 期刊文献

馆藏范围

407 篇 电子文献
0 种 纸本馆藏

日期分布

学科分类号

351 篇 工学
- 296 篇 软件工程
- 287 篇 计算机科学与技术...
- 13 篇 电子科学与技术（可...
- 7 篇 信息与通信工程
- 7 篇 控制科学与工程
- 4 篇 机械工程
- 4 篇 电气工程
- 4 篇 生物工程
- 3 篇 生物医学工程（可授...
- 2 篇 动力工程及工程热...
- 1 篇 力学（可授工学、理...
- 1 篇 建筑学
- 1 篇 土木工程
- 1 篇 化学工程与技术
- 1 篇 核科学与技术
- 1 篇 农业工程
- 1 篇 环境科学与工程（可...
61 篇 理学
- 55 篇 数学
- 6 篇 系统科学
- 4 篇 生物学
- 4 篇 统计学（可授理学、...
- 3 篇 化学
- 1 篇 物理学
17 篇 管理学
- 12 篇 管理科学与工程(可...
- 9 篇 工商管理
- 5 篇 图书情报与档案管...
4 篇 教育学
- 4 篇 教育学
3 篇 经济学
- 3 篇 应用经济学
2 篇 法学
- 2 篇 社会学
1 篇 农学
- 1 篇 作物学

主题

72 篇 performance
49 篇 parallel process...
46 篇 parallel program...
43 篇 algorithms
40 篇 languages
34 篇 design
22 篇 gpu
21 篇 parallel algorit...
12 篇 experimentation
12 篇 measurement
10 篇 parallel computi...
9 篇 theory
8 篇 mpi
7 篇 parallelism
7 篇 graphics process...
7 篇 parallel
7 篇 openmp
7 篇 concurrency
6 篇 multicore
5 篇 reliability

机构

7 篇 carnegie mellon ...
4 篇 univ wisconsin d...
4 篇 indiana univ blo...
4 篇 shanghai jiao to...
3 篇 univ of tokyo
3 篇 tsinghua univ de...
3 篇 univ chinese aca...
3 篇 massachusetts in...
3 篇 univ illinois ur...
3 篇 swiss fed inst t...
3 篇 mit csail united...
3 篇 tsinghua univ pe...
3 篇 univ utah sch co...
3 篇 rice univ housto...
3 篇 univ calif berke...
3 篇 univ texas austi...
2 篇 ist austria klos...
2 篇 fudan univ sch c...
2 篇 princeton univ d...
2 篇 georgetown univ ...

作者

8 篇 blelloch guy e.
7 篇 chen haibo
6 篇 hoefler torsten
6 篇 garland michael
6 篇 zhai jidong
6 篇 shun julian
5 篇 sun yihan
5 篇 tsigas philippas
4 篇 dhulipala laxman
4 篇 pingali keshav
4 篇 chen wenguang
4 篇 tan guangming
4 篇 wang haojie
4 篇 nikolopoulos dim...
4 篇 long guoping
4 篇 valero mateo
4 篇 mellor-crummey j...
4 篇 gu yan
4 篇 leiserson charle...
4 篇 kennedy ken

语言

380 篇 英文
26 篇 其他
1 篇 葡萄牙文

检索条件"任意字段=16th ACM Symposium on Principles and Practice of Parallel Programming"

共 407 条记录，以下是251-260 订阅

全选清除本页清除全部题录导出标记到"检索档案"

详细简洁

排序：

相关度排序

相关度排序
时效性降序
时效性升序

Leveraging parallel Nesting in Transactional Memory

Leveraging Parallel Nesting in Transactional Memory

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Barreto, Joao Dragojevic, Aleksandar Ferreira, Paulo Guerraoui, Rachid Kapalka, Michal Univ Tecn Lisboa INESC ID Lisbon Portugal Swiss Fed Inst Technol CH-1015 Lausanne Switzerland

ISBN: (纸本)9781605587080

Exploiting the emerging reality of affordable multi-core architectures goes through providing programmers with simple abstractions that would enable them to easily turn their sequential programs into concurrent ones that expose as much parallelism as possible. While transactional memory promises to make concurrent programming easy to a wide programmer community, current implementations either disallow nested transactions to run in parallel or do not scale to arbitrary parallel nesting depths. this is an important obstacle to the central goal of transactional memory, as programmers can only start parallel threads in restricted parts of their code. this paper addresses the intrinsic difficulty behind the support for parallel nesting in transactional memory, and proposes a novel solution that, to the best of our knowledge, is the first practical solution to meet the lowest theoretical upper bound known for the problem. Using a synthetic workload configured to test parallel transactions on a multi-core machine, a practical implementation of our algorithm yields substantial speed-ups (up to 22x with 33 threads) relatively to serial nesting, and shows that the time to start and commit transactions, as well as to detect conflicts, is independent of nesting depth.

关键词： Algorithms Nested parallel programs fork-join work-stealing transactional memory

来源：评论

学校读者我要写书评

暂无评论

Scaling LAPACK Panel Operations Using parallel Cache Assignment

Scaling LAPACK Panel Operations Using Parallel Cache Assignm...

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Castaldo, Anthony M. Whaley, R. Clint Univ Texas San Antonio Dept Comp Sci San Antonio TX 78249 USA

ISBN: (纸本)9781605587080

In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. the Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl's law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, an inflation of the high-order flop count, and do not perform as well. By demonstrating a straight-forward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.

关键词： ATLAS LAPACK QR LU factorization parallel multicore multi-core GPU

来源：评论

学校读者我要写书评

暂无评论

Helper Locks for Fork-Join parallel programming

Helper Locks for Fork-Join Parallel Programming

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Agrawal, Kunal Leiserson, Charles E. Sukha, Jim Washington Univ St Louis MO 63130 USA

ISBN: (纸本)9781605587080

Helper locks allow programs with large parallel critical sections, called parallel regions, to execute more efficiently by enlisting processors that might otherwise be waiting on the helper lock to aid in the execution of the parallel region. Suppose that a processor p is executing a parallel region A after having acquired the lock L protecting A. If another processor p' tries to acquire L, then instead of blocking and waiting for p to complete A, processor p' joins p to help it complete A. Additional processors not blocked on L may also help to execute A. the HELPER runtime system can execute fork-join computations augmented with helper locks and parallel regions. HELPER supports the unbounded nesting of parallel regions. We provide theoretical completion-time and space-usage bounds for a design of HELPER based on work stealing. Specifically, let V be the number of parallel regions in a computation, let T-1 be its work, and let (T) over tilde (infinity) be its "aggregate span" - the sum of the spans (critical-path lengths) of all its parallel regions. We prove that HELPER completes the computation in expected time O(T-1/P + (T) over tilde (infinity) + PV) on P processors. this bound indicates that programs with a small number of highly parallel critical sections can attain linear speedup. For the space bound, we prove that HELPER completes a program using only O(P (S) over tilde (1)) stack space, where e (S) over tilde (1) is the sum, over all regions, of the stack space used by each region in a serial execution. Finally, we describe a prototype of HELPER implemented by modifying the Cilk multithreaded runtime system. We used this prototype to implement a concurrent hash table with a resize operation protected by a helper lock.

关键词： Algorithms Performance theory Cilk fork-join multithreading helper lock nested parallelism parallel region scheduling work stealing

来源：评论

学校读者我要写书评

暂无评论

Using Data Structure Knowledge for Efficient Lock Generation and Strong Atomicity

Using Data Structure Knowledge for Efficient Lock Generation...

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Upadhyaya, Gautam Midkiff, Samuel P. Pai, Vijay S. Purdue Univ W Lafayette IN 47907 USA

ISBN: (纸本)9781605587080

To achieve high-performance on multicore systems, shared-memory parallel languages must efficiently implement atomic operations. the commonly used and studied paradigms for atomicity are fine-grained locking, which is both difficult to program and error-prone;optimistic software transactions, which require substantial overhead to detect and recover from atomicity violations;and compiler-generation of locks from programmer-specified atomic sections, which leads to serialization whenever imprecise pointer analysis suggests the mere possibility of a conflicting operation. this paper presents a new strategy for compiler-generated locking that uses data structure knowledge to facilitate more precise alias and lock generation analyses and reduce unnecessary serialization. Implementing and evaluating these ideas in the Java language shows that the new strategy achieves eight-thread speedups of 0.83 to 5.9 for the five STAMP benchmarks studied, outperforming software transactions on all but one benchmark, and nearly matching programmer-specified fine-grained locks on all but one benchmark. the results also indicate that compiler knowledge of data structures improves the effectiveness of compiler analysis, boosting eight-thread performance by up to 300%. Further, the new analysis allows for software support of strong atomicity with less than 1% overhead for two benchmarks and less than 20% for three others. the strategy also nearly matches the performance of programmer-specified fine-grained locks for the SPECjbb2000 benchmark, which has traditionally not been amenable to static analyses.

关键词： Algorithms Performance Design Automatic lock generation transactional memory parallel programming

来源：评论

学校读者我要写书评

暂无评论

An Adaptive Performance Modeling Tool for GPU Architectures

An Adaptive Performance Modeling Tool for GPU Architectures

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Baghsorkhi, Sara S. Delahaye, Matthieu Patel, Sanjay J. Gropp, William D. Hwu, Wen-mei W. Univ Illinois Urbana IL 61801 USA

ISBN: (纸本)9781605587080

this paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. the model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on the NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but represent challenges to the analytical performance models. the proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from a kernel code.

关键词： Design Measurement Performance Analytical model GPU parallel programming Performance estimation

来源：评论

学校读者我要写书评

暂无评论

Improving parallelism and Locality with Asynchronous Algorithms

Improving Parallelism and Locality with Asynchronous Algorit...

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Liu, Lixia Li, Zhiyuan Purdue Univ Dept Comp Sci W Lafayette IN 47907 USA

ISBN: (纸本)9781605587080

As multicore chips become the main building blocks for high performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and the off-chip memory. this is especially true for large computing problems solved by iterative algorithms because of the large data set typically used. Loop tiling, also known as loop blocking, was shown previously to be an effective way to enhance data locality, and hence to reduce the memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, the tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose to use the asynchronous model to enable effective loop tiling such that both parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony and loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to simultaneously attained data locality and loop-level parallelism. We present supporting evidence from experiments with three well-known numerical kernels.

关键词： asynchronous algorithms loop tiling parallel numerical programs data locality memory performance

来源：评论

学校读者我要写书评

暂无评论

PHANTOM: Predicting Performance of parallel Applications on Large-Scale parallel Machines Using a Single Node

PHANTOM: Predicting Performance of Parallel Applications on ...

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Zhai, Jidong Chen, Wenguang Zheng, Weimin Tsinghua Univ Tsinghua Natl Lab Informat Sci & Technol Dept Comp Sci & Technol Beijing 100084 Peoples R China

ISBN: (纸本)9781605587080

For designers of large-scale parallel computers, it is greatly desired that performance of parallel applications can be predicted at the design phase. However, this is difficult because the execution time of parallel applications is determined by several factors, including sequential computation time in each process, communication time and their convolution. Despite previous efforts, it remains an open problem to estimate sequential computation time in each process accurately and efficiently for large-scale parallel applications on non-existing target machines. this paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that there is at least one node of the target platform but the whole target system need not be available. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process one by one. Second, we observe that computation behavior of processes in parallel applications can be clustered into a few groups while processes in each group have similar computation behavior. this observation helps us reduce measurement time significantly because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM presents better prediction accuracy across different platforms.

关键词： Performance Prediction parallel Application Deterministic Replay Trace-driven Simulation

来源：评论

学校读者我要写书评

暂无评论

Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs?

Does Cache Sharing on Modern CMP Matter to the Performance o...

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Zhang, Eddy Z. Jiang, Yunlian Shen, Xipeng Coll William & Mary Dept Comp Sci Williamsburg VA 23187 USA

ISBN: (纸本)9781605587080

Most modern Chip Multiprocessors (CMP) feature shared cache on chip. For multithreaded applications, the sharing reduces communication latency among co-running threads, but also results in cache contention. A number of studies have examined the influence of cache sharing on multithreaded applications, but most of them have concentrated on the design or management of shared cache, rather than a systematic measurement of the influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and the limited coverage of deciding factors. the influence of CMP cache sharing on contemporary multithreaded applications remains preliminarily understood. In this work, we conduct a systematic measurement of the influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, with a number of potentially important factors on program, OS, and architecture levels considered. the measurement shows some surprising results. Contrary to commonly perceived importance of cache sharing, neither positive nor negative effects from the cache sharing are significant for most of the program executions, regardless of the types of parallelism, input datasets, architectures, numbers of threads, and assignments of threads to cores. After a detailed analysis, we find that the main reason is the mismatch of current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe up to 36% performance increase when the threads are placed on cores appropriately.

关键词： Shared Cache thread Scheduling parallel Program Optimizations Chip Multiprocessors

来源：评论

学校读者我要写书评

暂无评论

thread to Strand Binding of parallel Network Applications in Massive Multi-threaded Systems

Thread to Strand Binding of Parallel Network Applications in...

引用

15th acm SIGPLAN symposium on principles and practice of parallel programming

作者： Radojkovic, Petar Cakarevic, Vladimir Verdu, Javier Pajuelo, Alex Cazorla, Francisco J. Nemirovsky, Mario Valero, Mateo Univ Politecn Cataluna E-08028 Barcelona Spain CSIC Madrid Spain

ISBN: (纸本)9781605587080

In processors with several levels of hardware resource sharing, like CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to simultaneously schedule on the processor (workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the thread to Strand Binding or TSB. In this paper, we show that the TSB impact on the performance of processors with several levels of shared resources is high. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor which has three levels of resource sharing. In our view, this problem is going to be more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing. We propose a resource-sharing aware TSB algorithm (TSBSched) that significantly facilitates the problem of thread to strand binding for software-pipelined applications, representative of multithreaded network applications. Our systematic approach encapsulates both, the characteristics of multithreaded processors under the study and the structure of the software pipelined applications. Once calibrated for a given processor architecture, our proposal does not require hardware knowledge on the side of the programmer, nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications on which we report improvements of up to 46% compared to the current state-of-the-art dynamic schedulers.

关键词： Algorithms Measurement Performance Process Scheduling Simultaneous Multithreading CMT UltraSPARC T2

来源：评论

学校读者我要写书评

暂无评论

Diversification and Intensification in parallel SAT Solving

Diversification and Intensification in Parallel SAT Solving

引用

16th Annual International Conference on the principles and practice of Constraint programming

作者： Guo, Long Hamadi, Youssef Jabbour, Said Sais, Lakhdar Univ Lille Nord France CRIL CNRS UMR 8188 F-62307 Artois Lens France Microsoft Res Cambridge England Ecole Polytech LIX F-91128 Palaiseau France Microsoft Res Joint Ctr INRIA F-91893 Orsay France

ISBN: (纸本)9783642153952

In this paper, we explore the two well-known principles of diversification and intensification in portfolio-based parallel SAT solving. these dual concepts play an important role in several search algorithms including local search, and appear to be a key point in modern parallel SAT solvers. To study their trade-off, we define two roles for the computational units. Some of them classified as Masters perform an original search strategy, ensuring diversification. the remaining units, classified as Slaves are there to intensify their master's strategy. Several important questions have to be answered. the first one is what information should be given to a slave in order to intensify a given search effort? the second one is, how often, a subordinated unit has to receive such information? Finally, the question of finding the number of subordinated units and their connections with the search efforts has to be answered. Our results lead to an original intensification strategy which outperforms the best parallel SAT solver ManySAT, and solves some open SAT instances.

关键词： Satisfiability SAT and CSP Search

来源：评论

学校读者我要写书评

暂无评论

没有更多数据了...

全选清除本页清除全部题录导出标记到“检索档案”

共41页 << < 22 23 24 25 26 27 28 29 30 31 > >>

检索报告对象比较合并检索0

隐藏清空

合并搜索

回到顶部

执行限定条件

内容：

评分：

请选择保存的检索档案：

请选择收藏分类：

订阅名称：

通借通还

温馨提示：

图书名称：

借书校区：

取书校区：

手机号码：

邮箱地址：

一卡通帐号：

电话和邮箱必须正确填写，我们会与您联系确认。

联系人：

所在院系：

联系邮箱：

联系电话：

内蒙古自治区呼和浩特市赛罕区大学西街235号邮编: 010021

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：