Search Results - Inner Mongolia University Library

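Help

Field codes:

T = title (book title or item title); A = author (responsible party); K = subject term; P = publication name; PU = publisher name; O = institution (author affiliation, degree-granting institution, patent applicant); L = Chinese Library Classification number; C = subject classification number; U = all fields; Y = year (publication year, degree year, or standard release year)

Search rules:

AND means "and"; OR means "or"; NOT means "does not contain". Operators must be written in uppercase, with one space on each side.

Examples:

Example 1: (K=图书馆学 OR K=情报学) AND A=范并思 AND Y=1982-2016 (subject library science or information science, author Fan Bingsi, years 1982-2016)
Example 2: P=计算机应用与软件 AND (U=C++ OR U=Basic) NOT K=Visual AND Y=2011-2016 (publication Computer Applications and Software, C++ or Basic in any field, excluding subject Visual, years 2011-2016)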
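The query rules above can be sketched as a few string helpers that enforce uppercase operators with one space on each side. This is a minimal sketch, assuming you want to assemble a query in a script before pasting it into the advanced-search box; the helper names `term`, `any_of`, `all_of` and `exclude` are hypothetical and are not an API offered by the library catalogue.

```python
# Hypothetical helpers that build an advanced-search string following the
# rules above: uppercase AND/OR/NOT, exactly one space on each side.

FIELD_CODES = {"T", "A", "K", "P", "PU", "O", "L", "C", "U", "Y"}


def term(code: str, value: str) -> str:
    """Return one field condition such as 'A=范并思'."""
    if code not in FIELD_CODES:
        raise ValueError(f"unknown field code: {code}")
    return f"{code}={value}"


def any_of(*conditions: str) -> str:
    """OR the conditions together and parenthesize the group."""
    return "(" + " OR ".join(conditions) + ")"


def all_of(*conditions: str) -> str:
    """AND the conditions together."""
    return " AND ".join(conditions)


def exclude(expr: str, condition: str) -> str:
    """Append a NOT clause to an existing expression."""
    return f"{expr} NOT {condition}"


example_one = all_of(
    any_of(term("K", "图书馆学"), term("K", "情报学")),
    term("A", "范并思"),
    term("Y", "1982-2016"),
)

example_two = all_of(
    exclude(
        all_of(term("P", "计算机应用与软件"), any_of(term("U", "C++"), term("U", "Basic"))),
        term("K", "Visual"),
    ),
    term("Y", "2011-2016"),
)
```

Printing `example_one` and `example_two` reproduces Examples 1 and 2 character for character; rejecting unknown field codes catches typos before the query reaches the catalogue.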

Refine results

Document type

1,504 conference papers
105 journal articles

Holdings

1,609 electronic documents
0 print holdings

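Subject classification

1,168 Engineering
- 1,111 Computer Science and Technology...
- 557 Software Engineering
- 118 Electrical Engineering
- 75 Information and Communication Engineering
- 46 Control Science and Engineering
- 37 Electronic Science and Technology (may...
- 13 Materials Science and Engineering (may...
- 13 Agricultural Engineering
- 11 Mechanical Engineering
- 11 Optical Engineering
- 8 Chemical Engineering and Technology
- 8 Bioengineering
- 7 Architecture
- 7 Biomedical Engineering (may confer...
- 6 Power Engineering and Engineering Thermo...
- 5 Civil Engineering
- 3 Mechanics (may confer Engineering, Sci...
579 Science
- 557 Mathematics
- 55 Statistics (may confer Science, ...
- 16 Physics
- 9 Biology
- 9 Systems Science
- 8 Chemistry
73 Management
- 64 Management Science and Engineering (may...
- 40 Business Administration
- 10 Library, Information and Archives Manag...
16 Agriculture
- 16 Crop Science
6 Economics
- 6 Applied Economics
3 Law
- 3 Sociology
3 Education
- 3 Education
2 Medicine
1 Literature
1 Military Science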

Topics

237 parallel algorit...
173 parallel process...
80 computer archite...
74 parallel process...
57 parallel program...
55 algorithms
47 parallel archite...
41 hardware
30 scheduling
27 computer program...
21 graph algorithms
20 computer systems...
18 approximation al...
18 processor schedu...
18 computational mo...
18 field programmab...
17 parallel computi...
16 computer science
16 performance
16 delay

Institutions

32 carnegie mellon ...
15 swiss fed inst t...
15 carnegie mellon ...
11 univ maryland de...
11 stanford univ st...
10 univ maryland co...
10 mit 77 massachus...
10 univ calif berke...
8 eth zurich
7 georgetown univ ...
7 mit cambridge ma...
7 univ texas austi...
6 penn state univ ...
6 mit csail cambri...
5 univ calif river...
5 princeton univer...
5 university of ma...
5 microsoft res re...
5 carnegie mellon ...
5 harvard univ cam...

Authors

38 blelloch guy e.
20 gu yan
18 gibbons phillip ...
18 shun julian
18 goodrich michael...
16 fineman jeremy t...
15 sun yihan
14 dhulipala laxman
13 vishkin uzi
12 agrawal kunal
11 leiserson charle...
10 ballard grey
10 hoefler torsten
10 anon
10 miller gary l.
10 harris david g.
9 ghaffari mohsen
9 tangwongsan kana...
9 reif john h.
9 demmel james

Language

1,569 English
40 Other

Search query: Any field = "Annual ACM Symposium on Parallel Algorithms and Architectures"

1,609 records in total; showing records 491-500

Hardware/Software Vectorization for Closeness Centrality on Multi-/Many-Core Architectures

28th IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)

Authors: Sariyuce, Ahmet Erdem; Saule, Erik; Kaya, Kamer; Catalyurek, Umit V.
Affiliations: Ohio State Univ, Dept Biomed Informat, Columbus OH 43210 USA; Ohio State Univ, Dept Comp Sci & Engn, Columbus OH 43210 USA; Ohio State Univ, Dept Elect & Comp Engn, Columbus OH 43210 USA; Univ N Carolina, Dept Comp Sci, Charlotte NC 28223 USA

ISBN: (print) 9781479941162

Centrality metrics have been shown to be highly correlated with the importance and load of the nodes in a network. Given the scale of today's social networks, efficient algorithms and high-performance computing techniques are essential for computing these metrics quickly. In this work, we exploit hardware and software vectorization in combination with fine-grain parallelization to compute closeness centrality values. The proposed vectorization approach enables concurrent breadth-first search operations and significantly increases performance. We compare different vectorization schemes and experimentally evaluate our contributions against existing parallel CPU-based solutions on cutting-edge hardware. Our implementations are 11 times faster than the state-of-the-art implementation for a graph with 234 million edges. The proposed techniques show how vectorization can be used efficiently to execute other graph kernels that require multiple traversals over a large-scale network on cutting-edge architectures.

Keywords: Centrality closeness centrality vectorization breadth-first search Intel Xeon Phi
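The core idea of running several breadth-first searches at once, by packing one bit per source into each vertex's state, can be sketched in plain Python with arbitrary-precision integers standing in for vector registers. This is a minimal sketch under stated assumptions: the graph is undirected and given as an adjacency dict, `multi_source_farness` and `closeness` are hypothetical names, and closeness is taken as the reciprocal of farness (the sum of BFS distances), which may differ from the exact definition used in the paper.

```python
def multi_source_farness(adj, sources):
    """Run one BFS per source simultaneously using bitmasks.

    adj: dict mapping each vertex to a list of neighbours (undirected graph).
    sources: list of source vertices; source i owns bit i.
    Returns a dict source -> sum of distances to every vertex it reaches.
    """
    visited = {v: 0 for v in adj}   # bit i set: source i has reached v
    frontier = {v: 0 for v in adj}  # bit i set: source i reached v in the last level
    for i, s in enumerate(sources):
        visited[s] |= 1 << i
        frontier[s] |= 1 << i

    farness = [0] * len(sources)
    level = 0
    while any(frontier.values()):
        level += 1
        # Push all frontier bits of a vertex to its neighbours with one OR per edge.
        reached = {v: 0 for v in adj}
        for v, bits in frontier.items():
            if bits:
                for w in adj[v]:
                    reached[w] |= bits
        # Keep only the sources that see a vertex for the first time.
        for v in adj:
            new_bits = reached[v] & ~visited[v]
            visited[v] |= new_bits
            reached[v] = new_bits
            while new_bits:  # each newly set bit i means dist(source i, v) == level
                lowest = new_bits & -new_bits
                farness[lowest.bit_length() - 1] += level
                new_bits ^= lowest
        frontier = reached
    return {s: farness[i] for i, s in enumerate(sources)}


def closeness(adj, sources):
    """Closeness as 1 / farness (0.0 for a source that reaches nothing)."""
    far = multi_source_farness(adj, sources)
    return {s: (1.0 / f if f else 0.0) for s, f in far.items()}
```

On the path 0-1-2-3, sources 0 and 1 get farness 6 and 4 from a single shared traversal. In the paper's setting the per-vertex words map to SIMD registers (the keywords mention Intel Xeon Phi); here a Python `int` only imitates that layout.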

Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling

26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)

Authors: Kaler, Tim; Hasenplaugh, William; Schardl, Tao B.; Leiserson, Charles E.
Affiliation: MIT Comp Sci & Artificial Intelligence Lab, 32 Vassar St, Cambridge MA 02139 USA

ISBN: (print) 9781450328210

A data-graph computation, popularized by programming systems such as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. PRISM uses a multibag data structure to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let $G = (V, E)$ be a degree-$\Delta$ graph colored with $\chi$ colors, and suppose that $Q \subseteq V$ is the set of active vertices in a round. Define $\mathrm{size}(Q) = |Q| + \sum_{v \in Q} \deg(v)$, which is proportional to the space required to store the vertices of $Q$ using a sparse-graph layout. We show that a $P$-processor execution of PRISM performs updates in $Q$ using $O(\chi(\lg(Q/\chi) + \lg \Delta) + \lg P)$ span and $\Theta(\mathrm{size}(Q) + \chi + P)$ work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2 to 2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations.

Keywords: Data-graph computations multicore multithreading parallel programming chromatic scheduling determinism scheduling work stealing
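Chromatic scheduling can be illustrated with a small sequential sketch: color the graph, bucket the active vertices by color (a simple list-of-lists stand-in for PRISM's multibag), and process one color class at a time. Because no two vertices in a color class are adjacent, an update that reads a vertex's neighbors and writes only that vertex could run in parallel with the rest of its class without locks, and the result does not depend on thread timing. The greedy coloring, the function names, and the min-label-propagation update are illustrative assumptions, not PRISM's implementation.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest color not used by an already-colored neighbour."""
    color = {}
    for v in sorted(adj):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def chromatic_schedule(adj, data, update, active):
    """Run rounds of a dynamic data-graph computation, one color class at a time.

    update(v, adj, data) writes only data[v] and returns True if it changed,
    in which case v's neighbours become active in the next round.
    """
    color = greedy_coloring(adj)
    num_colors = max(color.values()) + 1
    while active:
        bags = [[] for _ in range(num_colors)]  # active set partitioned by color
        for v in sorted(active):
            bags[color[v]].append(v)
        next_active = set()
        for bag in bags:
            # Vertices in one bag are pairwise non-adjacent, so these updates
            # could run in parallel without locks; here they run in a loop.
            for v in bag:
                if update(v, adj, data):
                    next_active.update(adj[v])
        active = next_active
    return data


def min_label(v, adj, data):
    """Connected-components style update: adopt the smallest label nearby."""
    best = min([data[v]] + [data[w] for w in adj[v]])
    if best < data[v]:
        data[v] = best
        return True
    return False
```

For example, running `chromatic_schedule` with `min_label` on two components {0, 1, 2} and {3, 4}, starting from `data[v] = v`, labels every vertex with the smallest vertex id in its component, and repeated runs always perform the same sequence of updates.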

SupMR: Circumventing Disk and Memory Bandwidth Bottlenecks for Scale-up MapReduce

28th IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)

Authors: Sevilla, Michael; Nassi, Ike; Ioannidou, Kleoni; Brandt, Scott; Maltzahn, Carlos
Affiliation: Univ Calif Santa Cruz, Comp Sci Dept, Santa Cruz CA 95060 USA

ISBN: (print) 9781479941162

Reading input from primary storage (i.e., the ingest phase) and aggregating results (i.e., the merge phase) are important pre- and post-processing steps in large batch computations. Unfortunately, today's data sets are so large that the ingest and merge job phases have become performance bottlenecks. In this paper, we mitigate the ingest and merge bottlenecks by leveraging the scale-up MapReduce model. We introduce an ingest chunk pipeline and a merge optimization that increase CPU utilization (50-100%) and yield job-phase speedups of 1.16x-3.13x for the ingest and merge phases. Our techniques are based on well-known algorithms and scale-out MapReduce optimizations, but applying them to a scale-up computation framework to mitigate the ingest and merge bottlenecks is novel.

Keywords: Applications architectures Distributed applications Distributed systems Performance measurements
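The ingest-side idea of overlapping the reading of the next input chunk with processing of the current one can be sketched with a bounded producer/consumer queue. This is a minimal sketch, assuming in-memory strings stand in for chunks read from disk and a word count stands in for the map function; `pipelined_word_count` and its `depth` parameter are hypothetical, and the final loop is a plain sequential merge rather than SupMR's merge optimization.

```python
import queue
import threading
from collections import Counter


def pipelined_word_count(chunks, depth=2):
    """Count words while a reader thread ingests upcoming chunks concurrently.

    chunks: iterable of text chunks (a stand-in for blocks read from storage).
    depth: how many chunks the reader may run ahead of the mapper.
    """
    pipe = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the input

    def reader():
        for chunk in chunks:
            pipe.put(chunk)  # blocks once the reader is `depth` chunks ahead
        pipe.put(done)

    ingest = threading.Thread(target=reader)
    ingest.start()

    partials = []
    while True:
        chunk = pipe.get()
        if chunk is done:
            break
        partials.append(Counter(chunk.split()))  # map phase on one chunk
    ingest.join()

    # Merge phase: combine the per-chunk partial results.
    total = Counter()
    for part in partials:
        total.update(part)
    return total
```

The bounded queue is the design choice that matters: it lets I/O and computation overlap while capping how much ingested data sits in memory at once.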

The future of accelerator programming: Abstraction, performance or can we have both?

29th Annual ACM Symposium on Applied Computing, SAC 2014

Authors: Rocki, Kamil; Burtscher, Martin; Suda, Reiji
Affiliations: IBM Research, 650 Harry Road, San Jose CA 95120, United States; University of Tokyo, Department of Computer Science, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan; Texas State University, Department of Computer Science, San Marcos TX 78666, United States

ISBN: (print) 9781450324694

In a perfect world, code would be written only once and would run efficiently on different devices. To a degree, that used to be the case in the era of single-core frequency scaling. However, due to power limitations, parallel programming has become necessary to obtain performance gains. But parallel architectures differ substantially from each other, often require specialized knowledge to exploit, and typically necessitate reimplementing and fine-tuning programs. These slow tasks frequently result in situations where most of the time is spent reimplementing old code rather than writing new code. The goal of our research is to find programming techniques that increase productivity, maintain high performance, and provide abstraction that frees the programmer from these unnecessary and time-consuming tasks. However, such techniques usually come at the cost of substantial performance degradation. This paper investigates current approaches to portable accelerator programming, seeking to answer whether they make it possible to combine high efficiency with sufficient algorithm abstraction. It discusses OpenCL as a potential solution and presents three approaches to writing portable code: GPU-centric, CPU-centric, and combined. By applying the three approaches to a real-world program, we show that it is at least sometimes possible to run exactly the same code on many different devices with minimal performance degradation by using parameterization. The main contributions of this paper are an extensive review of the current state of the art and our original approach to the stated problem, the copious-parallelism technique. Copyright 2014 ACM.

Keywords: parallel architectures

Arbitrary Modulus Indexing

47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Authors: Diamond, Jeffrey R.; Fussell, Donald S.; Keckler, Stephen W.
Affiliations: Univ Texas Austin, Austin TX 78712 USA; NVIDIA, Santa Clara CA USA

ISBN: (print) 9781479969982

Modern high-performance processors require memory systems that can provide access to data at a rate well matched to the processor's computation rate. Such systems commonly organize memory into local high-speed memory banks that can be accessed in parallel. Associative lookup of values is made efficient through indexing instead of associative memories. These techniques lose effectiveness when data locations are not mapped uniformly to the banks or cache locations, leading to bottlenecks that arise from excess demand on a subset of locations. Address mapping is most easily performed by indexing the banks with a mod-$2^N$ indexing scheme, but such schemes interact poorly with the memory access patterns of many computations, making resource conflicts a significant memory-system bottleneck. Previous work has assumed that prime moduli are the best choices for alleviating conflicts and has concentrated on finding efficient implementations for them. In this paper, we introduce a new scheme called Arbitrary Modulus Indexing (AMI) that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs. We also demonstrate that, for a memory-intensive workload on a modern replay-style GPU architecture, prime moduli are not in general the best choices for memory bank and cache set mappings. Applying AMI to a set of memory-intensive benchmarks eliminates 98% of bank and set conflicts, resulting in an average speedup of 24% over an aggressive baseline system and a 64% average reduction in memory-system replays at reasonable implementation cost.

Keywords: prime banking index schemes fast division and modulus GPU caches replay architectures
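The conflict problem that motivates AMI can be reproduced with a few lines of arithmetic: a strided access pattern piles onto a single bank under a power-of-two modulus but spreads out under other moduli. This sketch only counts bank loads for one assumed pattern (32 accesses with a stride of 16 words); it does not implement AMI's fast modulus computation, and `max_bank_load` is a hypothetical helper.

```python
from collections import Counter


def max_bank_load(addresses, num_banks):
    """Largest number of addresses mapped to one bank by addr mod num_banks."""
    loads = Counter(addr % num_banks for addr in addresses)
    return max(loads.values())


# 32 accesses with a stride of 16 words, e.g. walking down one column of a
# row-major matrix with 16 words per row (an assumed access pattern).
strided = [i * 16 for i in range(32)]

for banks in (16, 17, 15):
    print(f"{banks} banks: worst-case load {max_bank_load(strided, banks)}")
```

With 16 banks every access lands in bank 0 (a load of 32), whereas 17 banks give a worst-case load of 2 and 15 banks a load of 3. The point is only that the modulus matters a great deal for a given pattern, which is why an indexing scheme that is cheap for any modulus widens the design space.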

Compiler Support for Optimizing Memory Bank-Level Parallelism

47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Authors: Ding, Wei; Guttman, Diana; Kandemir, Mahmut
Affiliation: Penn State Univ, University Pk PA 16802 USA

ISBN: (print) 9781479969982

Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one factor in the overall performance of applications running on emerging multicores or manycores. For example, memory stalls can constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is a lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where the last-level cache misses will occur in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, we propose two tile scheduling algorithms that maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory-controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed BLP optimization improves BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory-controller-level parallelism and row-buffer locality (in addition to BLP) raises the average improvement in memory access latency to 22.2%.

Keywords: parallel processing Scheduling Optimization Arrays Vectors Random access memory Schedules
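The goal of the tile-scheduling step, grouping tiles that run concurrently so that together they touch as many distinct memory banks as possible, can be illustrated with a greedy heuristic. This is a hypothetical sketch, assuming the per-tile bank sets (which the paper derives from Cache Miss Equations) are already known and that one tile per core runs in each step; it is not either of the paper's two scheduling algorithms, and `schedule_tiles` is an invented name.

```python
def schedule_tiles(tile_banks, cores):
    """Greedily form steps of up to `cores` tiles that maximize distinct banks.

    tile_banks: dict tile name -> set of memory banks the tile's misses touch.
    Returns a list of (tiles in the step, number of distinct banks in the step).
    """
    remaining = {t: set(b) for t, b in tile_banks.items()}
    steps = []
    while remaining:
        step, covered = [], set()
        for _ in range(min(cores, len(remaining))):
            # Pick the tile adding the most new banks; ties go to the smallest name.
            best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
            step.append(best)
            covered |= remaining.pop(best)
        steps.append((step, len(covered)))
    return steps
```

With tiles A and B both missing in banks {0, 1} and tiles C and D both missing in banks {2, 3}, running them in name order on two cores keeps only two banks busy per step; the greedy schedule pairs A with C and B with D, so every step keeps four banks busy.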

The complexity of optimal mechanism design

25th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014

Authors: Daskalakis, Constantinos; Deckelbaum, Alan; Tzamos, Christos
Affiliations: EECS, MIT, United States; Department of Mathematics, MIT, United States

ISBN: (print) 9781611973389

Myerson's seminal work provides a computationally efficient revenue-optimal auction for selling one item to multiple bidders [18]. Generalizing this work to selling multiple items at once has been a central question in economics and algorithmic game theory, but its complexity has remained poorly understood. We answer this question by showing that a revenue-optimal auction in multi-item settings cannot be found and implemented computationally efficiently, unless $\mathrm{ZPP} \supseteq \mathrm{P}^{\#\mathrm{P}}$. This is true even for a single additive bidder whose values for the items are independently distributed on two rational numbers with rational probabilities. Our result is very general: we show that it is hard to compute any encoding of an optimal auction of any format (direct or indirect, truthful or non-truthful) that can be implemented in expected polynomial time. In particular, under well-believed complexity-theoretic assumptions, revenue optimization in very simple multi-item settings can only be tractably approximated. We note that our hardness result applies to randomized mechanisms in a very simple setting, and is not an artifact of introducing combinatorial structure to the problem by allowing correlation among item values, introducing combinatorial valuations, or requiring the mechanism to be deterministic (whose structure is readily combinatorial). Our proof is enabled by a flow interpretation of the solutions of an exponential-size linear program for revenue maximization with an additional supermodularity constraint. Copyright © 2014 by the Society for Industrial and Applied Mathematics.

Keywords: parallel processing systems

A Stream Processing Framework for On-line Optimization of Performance and Energy Efficiency on Heterogeneous Systems

28th IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)

Authors: Ranft, Benjamin; Denninger, Oliver; Pfaffe, Philip
Affiliations: FZI Res Ctr Informat Technol, D-76131 Karlsruhe, Germany; Karlsruhe Inst Technol, D-76131 Karlsruhe, Germany

ISBN: (print) 9781479941162

Modern processors have the potential to execute compute-intensive programs quickly and efficiently, but require applications to be adapted to their ever-increasing parallelism. Heterogeneous systems add further complexity by combining processing units with different characteristics. Scheduling should thus consider the performance of each processor as well as competing workloads and varying inputs. To help programmers of stream processing applications face this challenge, we present libHawaii, an open-source library for easily and efficiently using all processors of a heterogeneous system cooperatively. It supports exploiting data-flow, data-element and task parallelism via pipelining, partitioning and demand-based allocation of consecutive work items. Scheduling is automatically adapted on-line to continuously optimize performance and energy efficiency. Our C++ library does not depend on specific hardware architectures or parallel computing frameworks. However, it facilitates maximizing the throughput of compatible GPUs by overlapping computations and memory transfers while maintaining low latencies. This paper describes the algorithms and implementation of libHawaii and demonstrates its usage on existing applications. We experimentally evaluate our library using two examples. General matrix multiplication (GEMM) is a simple yet important building block of many high-performance computing applications. Complementarily, the detection, extraction and matching of sparse image features exhibit greater complexity, including nondeterministic memory access and synchronization.

Keywords: heterogeneous computing stream processing load balancing energy efficiency real-time parallel programming
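Demand-based allocation, where each processor takes the next work item as soon as it becomes free, can be illustrated with a small discrete-event simulation. This sketch assumes fixed, known per-item processing times (the "cpu" and "gpu" costs used below are made up) and ignores memory transfers; libHawaii itself adapts its scheduling on-line, and `demand_based_allocation` is a hypothetical name, not part of its C++ API.

```python
import heapq


def demand_based_allocation(num_items, item_time):
    """Hand out work items 0..num_items-1 to whichever processor is free first.

    item_time: dict processor name -> (assumed constant) time to process one item.
    Returns (items assigned per processor, time at which the last item finishes).
    """
    # Min-heap of (time the processor becomes free, processor name);
    # ties are broken by name so the result is deterministic.
    free_at = [(0, name) for name in sorted(item_time)]
    heapq.heapify(free_at)
    assigned = {name: [] for name in item_time}
    for item in range(num_items):
        t, name = heapq.heappop(free_at)
        assigned[name].append(item)
        heapq.heappush(free_at, (t + item_time[name], name))
    makespan = max(t for t, _ in free_at)
    return assigned, makespan
```

With `{"cpu": 3, "gpu": 1}` and 8 items, the GPU ends up with six items and the CPU with two, and everything finishes at time 6; a static even split of four items each would finish at time 12, because the CPU would need 4 x 3 time units.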

A Sound and Complete Abstraction for Reasoning about Parallel Prefix Sums

41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL)

Authors: Chong, Nathan; Donaldson, Alastair F.; Ketema, Jeroen
Affiliation: Univ London Imperial Coll Sci Technol & Med, London SW7 2AZ, England

ISBN: (print) 9781450325448

Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance. We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length $n$ if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the input and result require $O(n \lg n)$ space. This improves upon an existing result by Sheeran where the input requires $O(n \lg n)$ space and the result $O(n^2 \lg n)$ space, and is more feasible for large $n$ than a method by Voigtländer that uses $O(n)$ space for the input and result but requires running $O(n^2)$ tests. We then extend our abstraction and results to the context of data-parallel programs, developing an automated verification method for GPU implementations of prefix sums. Our method uses static verification to prove that a generic prefix sum implementation is data-race-free, after which functional correctness of the implementation can be determined by running a single test case under the interval of summations abstraction. We present an experimental evaluation using four different prefix sum algorithms, showing that our method is highly automatic, scales to large thread counts, and significantly outperforms Voigtländer's method when applied to large arrays.

Keywords: parallel prefix sum computation GPUs abstraction formal verification
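The interval-of-summations test can be tried directly: feed a generic scan the singleton intervals [i, i] and check that output i is [0, i]; any wrong combination order produces an error element. This is an illustrative sketch, assuming inclusive scans, a tuple `(i, j)` for an interval, the string "identity" for the unit and "error" for an absorbing element; the function names are hypothetical, and the static race-freedom proof the abstract describes for GPU kernels is not modeled.

```python
IDENTITY = "identity"  # unit element of the monoid
ERROR = "error"        # absorbing element: summands were combined out of order


def combine(a, b):
    """Interval of summations: [i, j] + [j+1, k] = [i, k]; anything else is an error."""
    if a == IDENTITY:
        return b
    if b == IDENTITY:
        return a
    if a == ERROR or b == ERROR:
        return ERROR
    (i, j), (k, l) = a, b
    return (i, l) if j + 1 == k else ERROR


def sequential_scan(xs, op):
    """Inclusive left-to-right prefix sum over any associative op."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def hillis_steele_scan(xs, op):
    """Inclusive data-parallel (Hillis-Steele) scan, simulated one step at a time."""
    out = list(xs)
    d = 1
    while d < len(out):
        out = [out[i] if i < d else op(out[i - d], out[i]) for i in range(len(out))]
        d *= 2
    return out


def swapped_scan(xs, op):
    """Deliberately buggy variant: combines the two operands in the wrong order."""
    out = list(xs)
    d = 1
    while d < len(out):
        out = [out[i] if i < d else op(out[i], out[i - d]) for i in range(len(out))]
        d *= 2
    return out


def passes_interval_test(scan, n):
    """Run the single interval-of-summations test case for arrays of length n."""
    result = scan([(i, i) for i in range(n)], combine)
    return result == [(0, i) for i in range(n)]
```

`swapped_scan([1, 2, 3, 4], lambda a, b: a + b)` still returns `[1, 3, 6, 10]` because integer addition is commutative, so ordinary numeric tests miss the bug; the interval monoid is not commutative, and `passes_interval_test(swapped_scan, 8)` fails.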

Flexible cooperation in parallel local search

29th Annual ACM Symposium on Applied Computing, SAC 2014

Authors: Munera, Danny; Diaz, Daniel; Abreu, Salvador; Codognet, Philippe
Affiliations: University of Paris 1, France; University of Évora, CENTRIA, Portugal; JFLI - CNRS, UPMC, University of Tokyo, Japan

ISBN: (print) 9781450324694

Constraint-Based Local Search (CBLS) consists of using Local Search methods [4] to solve Constraint Satisfaction Problems (CSPs). To further improve the performance of Local Search, one option is to take advantage of the increasing availability of parallel computational resources. Parallel implementation of local search metaheuristics has been studied since the early 1990s, when multiprocessor machines started to become widely available; see [6]. One usually distinguishes between single-walk and multiple-walk methods. Single-walk methods use parallelism inside a single search process, e.g., to parallelize the exploration of the neighborhood, while multiple-walk methods (also called multi-start methods) develop concurrent explorations of the search space, either independently (IW) or cooperatively (CW) with some communication between concurrent processes. Although good results can be achieved with IW alone [1], a more sophisticated paradigm featuring cooperation between independent walks should bring better performance. We thus propose a general framework for cooperative search, which defines a flexible and parametric strategy based on the cooperative multi-walk (CW) scheme. The framework is oriented towards distributed architectures based on clusters of nodes, with the notion of "teams" running on nodes, each grouping several individual search engines (e.g., on multicore nodes). The idea is that teams are distributed and thus have limited inter-node communication. This framework allows the programmer to define aspects such as the degree of intensification and diversification in the parallel search process. A good trade-off is essential to reach high performance. A preliminary implementation of the general CW framework has been done in the X10 programming language [5], and performance evaluation over a set of well-known benchmark CSPs shows that CW consistently outperforms IW.
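The difference between independent walks (IW) and cooperative walks (CW) can be sketched on a toy CSP. This is an illustrative sketch, not the paper's X10 framework: it swaps in the classic min-conflicts heuristic on N-queens as the local search engine, uses a single team of walks, and models cooperation as "every `share_every` steps, the worst walk adopts the best walk's configuration with probability `adopt_prob`". That probability is an assumed knob trading diversification for intensification, and all names and parameter values are hypothetical.

```python
import random


def attacks(board, c, row):
    """Number of other queens attacking a queen placed at (row, column c)."""
    return sum(
        1
        for o in range(len(board))
        if o != c and (board[o] == row or abs(board[o] - row) == abs(o - c))
    )


def conflicts(board):
    """Number of attacking pairs; board[c] is the row of the queen in column c."""
    return sum(attacks(board, c, board[c]) for c in range(len(board))) // 2


def min_conflicts_step(board, rng):
    """Move one conflicted queen to a row with the fewest conflicts."""
    bad = [c for c in range(len(board)) if attacks(board, c, board[c]) > 0]
    if not bad:
        return board
    col = rng.choice(bad)
    costs = [attacks(board, col, r) for r in range(len(board))]
    best = min(costs)
    new_board = list(board)
    new_board[col] = rng.choice([r for r, cost in enumerate(costs) if cost == best])
    return new_board


def cooperative_walks(n, walkers=4, share_every=50, adopt_prob=0.5, max_steps=2000, seed=0):
    """Run several walks in lockstep; adopt_prob=0.0 gives independent walks (IW)."""
    rng = random.Random(seed)
    boards = [[rng.randrange(n) for _ in range(n)] for _ in range(walkers)]
    for step in range(1, max_steps + 1):
        boards = [min_conflicts_step(b, rng) for b in boards]
        for b in boards:
            if conflicts(b) == 0:
                return b
        if step % share_every == 0:
            # Cooperation: the worst walk may restart from the best walk's
            # configuration (intensification) instead of continuing alone.
            ranked = sorted(range(walkers), key=lambda i: conflicts(boards[i]))
            if rng.random() < adopt_prob:
                boards[ranked[-1]] = list(boards[ranked[0]])
    return None
```

Setting `adopt_prob=0.0` turns the same code into pure IW, which makes it easy to compare the two regimes on the same seeds; the function returns `None` if no walk finds a solution within `max_steps`.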

No. 235 Daxue West Street, Saihan District, Hohhot, Inner Mongolia Autonomous Region. Postal code: 010021
