ISBN: (Print) 9781665497473
Sparse matrices are a very common type of data in scientific and machine learning applications, including deep neural networks. Sparse data representations save storage by avoiding storing zero values. However, sparse representations incur metadata overheads: software first needs to find the row/column locations of non-zero values before performing the actual computations. These metadata accesses involve indirect memory accesses (of the form a[b[i]], where a[.] and b[.] are large arrays) that are cache- and prefetch-unfriendly, resulting in frequent load stalls. In this paper, we explore dedicated hardware for a memory-side accelerator called the Hardware Helper Thread (HHT), which performs all the necessary index computations to fetch only the nonzero elements of the sparse matrix and sparse vector and supplies those values to the primary core, creating heterogeneity within a single CPU core. We show both performance gains and energy savings of HHT for sparse matrix-dense vector multiplication (SpMV) and sparse matrix-sparse vector multiplication (SpMSpV). The ASIC HHT shows average performance gains between 1.7x and 3.5x, depending on the sparsity level, the vector widths used by the RISC-V vector instructions, and whether the vector in the matrix-vector multiplication is sparse or dense. We also show energy savings of 19% on average when the ASIC HHT is used compared to the baseline (for SpMV), and the HHT requires 38.9% of a RISC-V core's area.
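As background for the indirect-access pattern the abstract describes, the sketch below shows a plain CSR-based SpMV kernel; the CSR format and the variable names are generic illustrative assumptions, not the paper's HHT interface.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR sparse matrix: row_ptr has n_rows+1 entries; col_idx/vals hold nonzeros.
struct CsrMatrix {
    std::size_t n_rows;
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> col_idx;
    std::vector<double> vals;
};

// y = A * x. The load x[col_idx[k]] is the indirect, cache-unfriendly access
// (the a[b[i]] pattern) that a memory-side helper such as the paper's HHT
// would resolve on behalf of the primary core.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n_rows, 0.0);
    for (std::size_t i = 0; i < A.n_rows; ++i) {
        double acc = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            acc += A.vals[k] * x[A.col_idx[k]];  // indirect gather through col_idx
        }
        y[i] = acc;
    }
    return y;
}
```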
ISBN: (Print) 9781665481069
Finding small vertex covers in a graph has applications in numerous domains such as scheduling, computational biology, telecommunication networks, artificial intelligence, social science, and many more. Two common formulations of the problem are Minimum Vertex Cover (MVC), which finds the smallest vertex cover in a graph, and Parameterized Vertex Cover (PVC), which finds a vertex cover whose size is less than or equal to some parameter k. Algorithms for both formulations involve traversing a search tree, which grows exponentially with the size of the graph or the value of k. Parallelizing the traversal of the vertex cover search tree on GPUs is challenging for multiple reasons. First, the search tree is a narrow binary tree, which makes it difficult to extract enough sub-trees to process in parallel to fully utilize the GPU's massively parallel execution resources. Second, the search tree is highly imbalanced, which makes load balancing across a massive number of parallel GPU workers especially challenging. Third, keeping around all the intermediate state needed to traverse many sub-trees in parallel puts high pressure on the GPU's memory resources and may act as a limiting factor to parallelism. To address these challenges, we propose an approach to traverse the vertex cover search tree in parallel using GPUs while handling dynamic load balancing. Each thread block traverses a different sub-tree using a local stack; however, we use a global worklist to balance the load and ensure that all blocks remain busy. Blocks contribute branches of their sub-trees to the global worklist on an as-needed basis, while blocks that finish their sub-trees pick up new ones from the global worklist. We use degree arrays to represent intermediate graphs so that the representation is compact in memory, to avoid limiting parallelism, yet self-contained, which is necessary for the load-balancing process. Our evaluation shows that, compared to approaches used in prior work, our hybrid approach...
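To make the shape of that search tree concrete, here is a minimal sequential sketch of the classic binary branching rule for parameterized vertex cover (pick an uncovered edge; one of its endpoints must be in the cover). It only illustrates why the tree is binary, narrow, and imbalanced; it is not the paper's GPU worklist scheme or degree-array representation.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Graph = std::vector<std::set<std::size_t>>;  // adjacency sets

// Returns true if the graph has a vertex cover of size <= k.
// Classic binary branching: pick any remaining edge (u, v); at least one
// endpoint must be in the cover, so try "take u" and "take v" in turn.
// The recursion tree is binary, at most depth k, and typically very
// imbalanced -- the property that makes GPU load balancing hard.
bool hasCoverOfSize(Graph g, std::size_t k) {
    for (std::size_t u = 0; u < g.size(); ++u) {
        if (g[u].empty()) continue;          // no edge incident to u remains
        std::size_t v = *g[u].begin();       // found an uncovered edge (u, v)
        if (k == 0) return false;            // an edge remains but the budget is spent
        for (std::size_t endpoint : {u, v}) {
            Graph h = g;                     // copy, then delete the chosen endpoint
            for (std::size_t w : h[endpoint]) h[w].erase(endpoint);
            h[endpoint].clear();
            if (hasCoverOfSize(std::move(h), k - 1)) return true;
        }
        return false;
    }
    return true;                             // no edges left: the current cover suffices
}
```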
ISBN: (Print) 9781665481069
Sparse matrix-vector multiplication (SpMV) is a core routine in many applications. Its performance is limited by the memory bandwidth needed to transfer the matrix between processors and memory, and by instruction latency in the computation. Vectorized (SIMD) operations can dramatically improve execution efficiency, but the sparsity patterns of irregular matrices are not compatible with the SIMD execution style. We present a new matrix format, Compressed Sparse Column Vector (CSCV), and a corresponding vectorized SpMV algorithm for matrices arising from integral equations. This SpMV algorithm inherently suits wide SIMD instructions and reduces the memory bandwidth used. We implement this algorithm for Computed Tomography (CT) imaging reconstruction on both Intel and AMD x86 platforms and compare it with seven state-of-the-art SpMV implementations using different CT imaging matrices. Experimental results show that CSCV achieves up to 96.9 GFLOP/s in single-precision tests, with a speedup of 3.70x over MKL and 3.48x over the second-best implementation. Furthermore, the CSCV SpMV implementation is performance portable: it contains almost no SIMD assembly code and achieves promising performance with compiler-assisted vectorization. Code availability: https://***/sysu-compsci/cscv
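For reference, the sketch below is a standard scalar Compressed Sparse Column (CSC) SpMV kernel, the kind of column-oriented baseline that a format such as CSCV reorganizes for SIMD; it is not the paper's CSCV layout, and the struct and names are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Standard Compressed Sparse Column (CSC) storage.
struct CscMatrix {
    std::size_t n_rows, n_cols;
    std::vector<std::size_t> col_ptr;   // n_cols + 1 entries
    std::vector<std::size_t> row_idx;   // row index of each nonzero
    std::vector<double> vals;
};

// y = A * x. Each column j contributes vals[k] * x[j] scattered to y[row_idx[k]].
// The irregular row_idx stream is what defeats straightforward SIMD execution
// and motivates repacking nonzeros into fixed-width column vectors.
std::vector<double> spmv_csc(const CscMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.n_rows, 0.0);
    for (std::size_t j = 0; j < A.n_cols; ++j) {
        const double xj = x[j];
        for (std::size_t k = A.col_ptr[j]; k < A.col_ptr[j + 1]; ++k) {
            y[A.row_idx[k]] += A.vals[k] * xj;   // irregular scatter into y
        }
    }
    return y;
}
```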
ISBN: (Print) 9781665483322
Sorting is one of the most fundamental operations for many applications. For efficient sorting, data locality can be exploited by processing subdivided data in parallel. this work presents a high-performance and area-efficient near-memory radix sort accelerator where end-to-end sorting is performed locally. With a parallel 1-bit radix sorter, it achieves high throughput by processing multiple keys per cycle. Tested with Xilinx Zynq UltraScale+ ZCU104 FPGA, the experimental result shows up to 10x performance speedup over CPU. It is highly area-efficient and can be integrated into each processing node of a distributed computing system with low area cost.
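To illustrate what a "1-bit radix sorter" does per pass, here is a minimal software LSD radix sort that stably partitions keys by one bit per pass; the accelerator evaluates such a partition for several keys per cycle in hardware, but the pass structure is the same idea. The function name and 32-bit key width are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

// Least-significant-digit radix sort with a 1-bit digit: each pass stably
// partitions the keys into "bit == 0" followed by "bit == 1". After one pass
// per key bit, the sequence is fully sorted.
void radixSort1Bit(std::vector<std::uint32_t>& keys) {
    std::vector<std::uint32_t> zeros, ones;
    zeros.reserve(keys.size());
    ones.reserve(keys.size());
    for (int bit = 0; bit < 32; ++bit) {
        zeros.clear();
        ones.clear();
        for (std::uint32_t k : keys) {
            ((k >> bit) & 1u) ? ones.push_back(k) : zeros.push_back(k);
        }
        keys.assign(zeros.begin(), zeros.end());
        keys.insert(keys.end(), ones.begin(), ones.end());
    }
}
```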
ISBN: (Print) 9781665497473
High-Performance Computing (HPC) centers and cloud providers support an increasingly diverse set of applications on heterogeneous hardware. As Artificial Intelligence (AI) and Machine Learning (ML) workloads have become an increasingly larger share of the compute workloads, new approaches to optimizing resource usage, allocation, and the deployment of new AI frameworks are needed. By identifying compute workloads and their utilization characteristics, HPC systems may be able to better match available resources with the application demand. By leveraging datacenter instrumentation, it may be possible to develop AI-based approaches that can identify workloads and provide feedback to researchers and datacenter operators for improving operational efficiency. To enable this research, we released the MIT Supercloud Dataset, which provides detailed monitoring logs from the MIT Supercloud cluster. This dataset includes CPU and GPU usage by jobs, memory usage, and file system logs. In this paper, we present a workload classification challenge based on this dataset. We introduce a labelled dataset that can be used to develop new approaches to workload classification and present initial results based on existing approaches. The goal of this challenge is to foster algorithmic innovations in the analysis of compute workloads that can achieve higher accuracy than existing methods. Data and code will be made publicly available via the Datacenter Challenge website: https://***.
ISBN: (Print) 9781665481069
The ever-increasing number of layers, millions of parameters, and large data volumes make deep learning workloads resource-intensive and power-hungry. In this paper, we develop a convolutional neural network (CNN) acceleration framework, named MLCNN, which explores algorithm-hardware co-design to achieve cross-layer cooperative optimization and acceleration. MLCNN dramatically reduces computation and on-/off-chip communication, improving CNN performance. To achieve this, MLCNN reorders the positions of the nonlinear activation layers and pooling layers, which we prove results in negligible accuracy loss; the convolutional and pooling layers are then co-optimized by means of redundant multiplication elimination, local addition reuse, and global addition reuse. To the best of our knowledge, MLCNN is the first of its kind to incorporate cooperative optimization across the convolutional, activation, and pooling layers. We further customize the MLCNN accelerator to take full advantage of cross-layer CNN optimization to reduce both computation and on-/off-chip communication. Our analysis shows that MLCNN can significantly reduce (by up to 98%) multiplications and additions. We have implemented a prototype of MLCNN and evaluated its performance on several widely used CNN models using both an accelerator-level cycle and energy model and an RTL implementation. Experimental results show that MLCNN achieves 3.2x speedup and 2.9x energy efficiency compared with dense CNNs. MLCNN's optimization methods are orthogonal to other CNN acceleration techniques, such as quantization and pruning. Combined with quantization, our quantized MLCNN gains a 12.8x speedup and 11.3x energy efficiency compared with DCNN.
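One piece of intuition behind such reordering (not necessarily the paper's exact argument, which also covers the co-optimized convolution) is that a monotonic activation like ReLU commutes with max pooling, so pooling can run first and the work feeding the activation shrinks by the pooling window size. The toy check below only illustrates that commutation property; the function names and 1-D setting are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Elementwise ReLU.
static std::vector<float> relu(std::vector<float> v) {
    for (float& x : v) x = std::max(x, 0.0f);
    return v;
}

// Non-overlapping 1-D max pooling with the given window size.
static std::vector<float> maxPool(const std::vector<float>& v, std::size_t window) {
    std::vector<float> out;
    for (std::size_t i = 0; i + window <= v.size(); i += window) {
        out.push_back(*std::max_element(v.begin() + i, v.begin() + i + window));
    }
    return out;
}

int main() {
    std::vector<float> x = {-1.5f, 2.0f, 0.5f, -3.0f, 4.0f, -0.25f, 1.0f, 0.0f};
    // ReLU is monotonic non-decreasing, so max-pool(ReLU(x)) == ReLU(max-pool(x));
    // pooling first means the activation touches window-size-times fewer values.
    assert(maxPool(relu(x), 4) == relu(maxPool(x, 4)));
    return 0;
}
```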
ISBN: (Print) 9781665481069
Maximum likelihood estimation is an essential tool for imputing missing data in climate/weather applications. By defining a particular statistical model, maximum likelihood estimation can be used to understand the underlying structure of given geospatial data. The Gaussian random field has been widely used to describe geospatial data, as one of the most popular models under the hood of maximum likelihood estimation. Computing the Gaussian log-likelihood demands operations on a dense symmetric positive definite matrix, often parameterized by the Matérn correlation function. This computation requires O(n^2) storage and O(n^3) operations, which can be a huge task considering that the number of geographical locations, n, now commonly reaches into the millions. However, despite its appealing theoretical properties, the Gaussianity assumption may be unrealistic, since real data often show signs of skewness or have extreme values. Herein, we consider the Tukey g-and-h (TGH) random field as an example of a non-Gaussian random field that is more robust for modeling geospatial data, including two additional parameters that incorporate skewness and heavy-tail features into the model. This work provides the first HPC implementation of inference for the TGH random field on parallel hardware architectures. Using task-based programming models associated with dynamic runtime systems, our implementation leverages the high concurrency of current parallel systems. This permits running the exact log-likelihood evaluation of the TGH random field for a sizable number of geospatial locations. To tackle large-scale problems, we additionally provide an implementation of the model using two different low-rank approximations. We compress the aforementioned symmetric positive-definite matrix for computing the log-likelihood and rely on the Tile Low-Rank (TLR) and Hierarchical Off-Diagonal Low-Rank (HODLR) matrix approximations...
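For reference, the Gaussian log-likelihood whose O(n^2) storage and O(n^3) cost are cited above, and the standard Tukey g-and-h transformation that injects skewness (g) and tail heaviness (h), can be written as below; the exact parameterization used in the paper (e.g., how location and scale enter) may differ.

```latex
% Gaussian log-likelihood for observations z at n locations, with covariance
% Sigma(theta) built from, e.g., the Matérn correlation function:
\ell(\boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi)
  - \frac{1}{2}\log\lvert\Sigma(\boldsymbol{\theta})\rvert
  - \frac{1}{2}\,\mathbf{z}^{\top}\Sigma(\boldsymbol{\theta})^{-1}\mathbf{z}

% Tukey g-and-h transformation of a standard Gaussian field Z(s), for g \neq 0,
% with location xi and scale omega:
X(s) = \xi + \omega\,\tau_{g,h}\bigl(Z(s)\bigr),
\qquad
\tau_{g,h}(z) = \frac{e^{g z}-1}{g}\,\exp\!\Bigl(\frac{h z^{2}}{2}\Bigr)
```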
In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches...
ISBN: (Print) 9781665481069
Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared-memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or for the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed-memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input/output data layouts. Further, we give two communication-eliding strategies to reduce costs further for FusedMM kernels: reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős-Rényi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding approach save at least 30% of the time spent exclusively in communication compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. Our 2.5D communication-eliding algorithms save 21% of communication time compared to the unoptimized sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques achieve runtimes up to 1.6x faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating least squares and inference for attention-based graph neural networks.
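To make the local fusion idea concrete, here is a minimal shared-memory sketch of an SDDMM followed immediately by an SpMM over the same sparsity pattern, fused into a single pass over the nonzeros. The CSR layout, the function names, and the choice to reuse the same dense matrix B for both steps are illustrative assumptions, not the paper's distributed 1.5D/2.5D algorithms.

```cpp
#include <cstddef>
#include <vector>

// CSR pattern/values for the sparse matrix S (m x n); A is m x d and B is n x d,
// both stored row-major. The output C is m x d.
struct Csr {
    std::size_t m, n;
    std::vector<std::size_t> row_ptr, col_idx;
    std::vector<double> vals;
};

// Fused SDDMM + SpMM: for each nonzero (i, j) of S, first form the SDDMM value
// t = S(i,j) * <A[i,:], B[j,:]>, then immediately accumulate t * B[j,:] into
// C[i,:]. The intermediate sparse matrix is never materialized and B[j,:] is
// read once for both steps.
std::vector<double> fusedSddmmSpmm(const Csr& S,
                                   const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   std::size_t d) {
    std::vector<double> C(S.m * d, 0.0);
    for (std::size_t i = 0; i < S.m; ++i) {
        for (std::size_t k = S.row_ptr[i]; k < S.row_ptr[i + 1]; ++k) {
            const std::size_t j = S.col_idx[k];
            double t = 0.0;                                   // SDDMM entry (i, j)
            for (std::size_t c = 0; c < d; ++c) t += A[i * d + c] * B[j * d + c];
            t *= S.vals[k];
            for (std::size_t c = 0; c < d; ++c) C[i * d + c] += t * B[j * d + c];  // SpMM
        }
    }
    return C;
}
```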