ISBN: 9781479941322 (print)
In this paper, we propose a new fast parallel sparse matrix-vector multiplication (SpMV) algorithm for GPU platforms. The new algorithm, called segSpMV, is based on the compressed sparse row (CSR) format and can be applied to a wide range of computational applications with both structured and unstructured matrices. The SpMV operation has a very low computation-to-communication ratio and is bandwidth-limited. The new SpMV algorithm tries to reduce memory accesses by partitioning the rows, whose nonzero patterns are irregular in general, into a number of fixed-length segments. As a result, both the multiplication and summation phases can benefit from coalesced memory access, and they can be finished in one kernel launch. The summation phase can be further improved by using GPU reduction techniques for large segment lengths. The resulting SpMV method consistently outperforms all published algorithms and the SpMV method in the recent CUSPARSE library on a set of public matrix benchmarks.
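The row-segmentation idea described in this abstract can be illustrated with a minimal CPU sketch in Python. This is not the segSpMV kernel itself: the function names `csr_to_segments` and `seg_spmv`, the zero-padding scheme, and the default segment length are assumptions made for illustration, and the sequential loops only stand in for what would be parallel GPU threads.

```python
def csr_to_segments(row_ptr, col_idx, vals, seg_len):
    """Split every CSR row into fixed-length, zero-padded segments.

    Hypothetical helper: padding uses column 0 with value 0.0, which assumes
    x[0] is finite so that the padded products are exactly zero.
    """
    seg_cols, seg_vals, seg_row = [], [], []
    for row in range(len(row_ptr) - 1):
        start, end = row_ptr[row], row_ptr[row + 1]
        for s in range(start, end, seg_len):
            chunk_end = min(s + seg_len, end)
            pad = seg_len - (chunk_end - s)
            seg_cols.append(col_idx[s:chunk_end] + [0] * pad)
            seg_vals.append(vals[s:chunk_end] + [0.0] * pad)
            seg_row.append(row)  # remember which row owns this segment
    return seg_cols, seg_vals, seg_row


def seg_spmv(row_ptr, col_idx, vals, x, seg_len=4):
    """Compute y = A @ x from segments of equal length."""
    seg_cols, seg_vals, seg_row = csr_to_segments(row_ptr, col_idx, vals, seg_len)
    y = [0.0] * (len(row_ptr) - 1)
    for cols, seg, row in zip(seg_cols, seg_vals, seg_row):
        # Multiplication phase: every slot of the segment does the same work.
        products = [v * x[c] for v, c in zip(seg, cols)]
        # Summation phase: reduce the segment, then accumulate into its row.
        y[row] += sum(products)
    return y


# A = [[1, 0, 2], [0, 0, 0], [3, 4, 5]] in CSR form (lists, not tuples).
row_ptr = [0, 2, 2, 5]
col_idx = [0, 2, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
y = seg_spmv(row_ptr, col_idx, vals, [1.0, 2.0, 3.0], seg_len=2)
```

Because every segment has the same length, each unit of work is uniform regardless of how irregular the rows are, which is the property the abstract credits for coalesced access.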
Multi- and many-core processors have emerged as the dominant solution for processing across the whole range of computer systems, from small devices to large-scale installations. Chip multi-processors, which are homogeneous multi- and many-core processors, offer an unprecedented amount of on-chip shared resources and bring a unique set of challenges. Given the importance of Last-Level Cache management techniques for achieving near-perfect isolation, we survey the state of the art and propose research directions to address the most pressing issues in modern computer systems. To better understand the various research directions in the field, we propose a classification of the presented techniques. Finally, we discuss possible research directions.
ISBN: 9781450329248 (print)
The demand for mining large datasets using shared-nothing clusters is steadily on the rise. Despite the availability of parallel processing paradigms such as MapReduce, scalable data mining is still a tough problem. Naïve ports of existing algorithms to platforms like Hadoop exhibit various scalability bottlenecks, which prevent their application to large real-world datasets. These bottlenecks arise from various pitfalls that have to be overcome, including the scalability of the mathematical operations of the algorithm, the performance of the system when executing iterative computations, as well as its ability to efficiently execute meta learning techniques such as cross-validation and ensemble learning. In this paper, we present our work on overcoming these pitfalls. In particular, we show how to scale the mathematical operations of two popular recommendation mining algorithms, discuss an optimistic recovery mechanism that improves the performance of distributed iterative data processing, and outline future work on efficient sample generation for scalable meta learning. Early results of our work have been contributed to open source libraries, such as Apache Mahout and Stratosphere, and are already deployed in industry use cases. Copyright is held by the owner/author(s).
Energy consumption optimization of HPC applications inherently requires measurements for reference and comparison. However, most of today's systems lack the necessary hardware support for power or energy measurements. Furthermore, in-band data availability is preferred for specific optimization techniques such as auto-tuning. For this reason, we present in-band energy consumption models for the IBM POWER7 processor based on hardware counters. We demonstrate that linear regression is a suitable means for modeling energy consumption, and we rely on already available, high-level benchmarks for training instead of self-written or hand-tuned micro-kernels. We compare modeling efforts for different instruction mixes caused by two compilers (GCC and IBM XL) as well as various multi-threading usage scenarios, and validate across our training benchmarks and two real-world applications. Results show mean errors of approximately 1% and overall max errors of 5.3% for GCC.
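The counter-based modeling approach in this abstract can be sketched as an ordinary least-squares fit in Python. This is only an illustration under stated assumptions: the counter values and energies below are synthetic, the function names `fit_linear_model`, `predict`, and `mean_relative_error` are hypothetical, and the abstract does not say which counters or fitting procedure the authors used.

```python
def fit_linear_model(samples, energies):
    """Least-squares fit of energy = w0 + w1*c1 + ... via the normal equations.

    Assumes the counter columns are linearly independent; otherwise a pivot
    becomes zero and the division below fails.
    """
    rows = [[1.0] + list(s) for s in samples]  # leading 1.0 is the intercept
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * e for r, e in zip(rows, energies))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(weights, counters):
    """Estimate energy for one sample of hardware counter readings."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], counters))


def mean_relative_error(weights, samples, energies):
    """Average of |predicted - measured| / measured over all samples."""
    errors = [abs(predict(weights, s) - e) / e for s, e in zip(samples, energies)]
    return sum(errors) / len(errors)


# Synthetic training data: two made-up counters per run, energy in joules.
samples = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0)]
energies = [2.0 + 3.0 * c1 + 0.5 * c2 for c1, c2 in samples]
weights = fit_linear_model(samples, energies)
```

Once fitted, `predict` needs only counter readings that are available in-band, which is the property the abstract highlights for auto-tuning.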
Corner detection is an extremely important technique in image recognition, which is widely employed in various applications for image recognition. With the widespread use of mobile devices, image recognition technique...
Cloud computing services enable applications by providing virtualized resources that can be dynamically allocated to virtual clusters. Nowadays IT companies and business companies make use of cloud environment...
ISBN: 9781509066070 (print)
With the advent of big data, processing large graphs quickly has become increasingly important. Most existing approaches either utilize in-memory processing techniques, which can only process graphs that fit completely in RAM, or disk-based techniques that sacrifice performance. In this work, we propose a novel RAM-Disk hybrid approach to graph processing that can scale well from a single shared-memory node to large distributed-memory systems. It works by partitioning the graph into subgraphs that fit in RAM and uses a paging-like technique to load subgraphs. We show that without modifying the algorithms, this approach can scale from small memory-constrained systems (such as tablets) to large-scale distributed machines with 16,000+ cores.
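The partition-and-page idea in this abstract can be sketched in Python as follows. This is a toy illustration, not the authors' system: the JSON file layout, the partitioning by source vertex, and the names `write_partitions` and `in_degrees` are all assumptions, and the partitioning step here still reads the whole edge list from memory for simplicity.

```python
import json
import os
import tempfile


def write_partitions(edges, num_vertices, part_size, directory):
    """Group edges by the partition of their source vertex, one file each."""
    num_parts = (num_vertices + part_size - 1) // part_size
    buckets = [[] for _ in range(num_parts)]
    for src, dst in edges:
        buckets[src // part_size].append([src, dst])
    paths = []
    for i, bucket in enumerate(buckets):
        path = os.path.join(directory, f"part_{i}.json")
        with open(path, "w") as f:
            json.dump(bucket, f)
        paths.append(path)
    return paths


def in_degrees(paths, num_vertices):
    """Compute in-degrees while holding only one subgraph in memory at a time."""
    degree = [0] * num_vertices
    for path in paths:
        with open(path) as f:
            subgraph = json.load(f)  # "page in" one partition
        for _, dst in subgraph:
            degree[dst] += 1
        # subgraph is dropped here before the next partition is loaded
    return degree


edges = [(0, 1), (0, 2), (1, 2), (3, 0), (4, 2)]
with tempfile.TemporaryDirectory() as tmp:
    parts = write_partitions(edges, num_vertices=5, part_size=2, directory=tmp)
    degrees = in_degrees(parts, num_vertices=5)
```

The per-vertex state (`degree`) stays resident while edge data is streamed, so the choice of `part_size` bounds the peak memory used for edges.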
ISBN: 9783319018621 (print)
The proceedings contain 51 papers. The special focus in this conference is ADBIS Short Contributions, Special Session on Big Data: New Trends and Applications, The Second International Workshop on GPUs in Databases, The Second International Workshop on Ontologies Meet Advanced Information Systems, The First International Workshop on Social Business Intelligence: Integrating Social Content in Decision Making, and Doctoral Consortium. The topics include: New Trends in databases and information systems; New trends in databases and information systems; New ontological alignment system based on a non-monotonic description logic; Spatiotemporal co-occurrence rules; An efficient spatial access method for highly redundant point data; Labeling association rule clustering through a genetic algorithm approach; Time series queries processing with GPU support; Rule-based multi-dialect infrastructure for conceptual problem solving over heterogeneous distributed information resources; Distributed processing of XPath queries using MapReduce; A Query language for workflow instance data; When too similar is bad; Viable systems model based information flows; On materializing paths for faster recursive querying; XSLTmark II - a simple, extensible and portable XSLT benchmark; ReMoSSA; DSD; Designing parallel relational data warehouses; Big data new frontiers; Extraction, sentiment analysis and visualization of massive public messages; Desidoo, a big-data application to join the online and real-world marketplaces; GraphDB - storing large graphs on secondary memory; Hadoop on a low-budget general purpose HPC cluster in academia; and Discovering contextual association rules in relational databases.
In this paper, an adaptive architecture for dynamic management and allocation of on-chip FPGA Block Random Access Memory (BRAM) resources is presented. This facilitates the dynamic sharing of valuable and scarce on-ch...
ISBN: 9781479938018 (print)
Bit-reproducibility has many advantages in the context of high-performance computing. Besides making debugging and testing of the code simpler and more accurate, it can allow the deployment of applications on heterogeneous systems while maintaining the consistency of the computations. In this work we analyze the basic operations performed by scientific applications and identify the possible sources of non-reproducibility. In particular, we consider the tasks of evaluating transcendental functions and performing reductions using non-associative operators. We present a set of techniques to achieve reproducibility and we propose improvements over existing algorithms to perform reproducible computations in a portable way, while obtaining good performance and accuracy. By applying these techniques to more complex tasks we show that bit-reproducibility can be achieved on a broad range of scientific applications.
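The non-associativity problem named in this abstract can be demonstrated with a short Python sketch. The remedy shown here, exact rational accumulation with a single final rounding, is a simple stand-in chosen for illustration and is not one of the techniques proposed by the authors; the function names are hypothetical.

```python
from fractions import Fraction


def naive_sum(values):
    """Left-to-right floating-point sum; the result depends on the order."""
    total = 0.0
    for v in values:
        total += v
    return total


def reproducible_sum(values):
    """Accumulate exactly as rationals, then round to a float once.

    Every float converts to a Fraction without error, so the exact total is
    independent of the order and the final rounding is always the same.
    """
    exact = sum((Fraction(v) for v in values), Fraction(0))
    return float(exact)


order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]  # same values, different reduction order

print(naive_sum(order_a), naive_sum(order_b))                # 1.0 2.0
print(reproducible_sum(order_a), reproducible_sum(order_b))  # 2.0 2.0
```

In the first ordering, `1e16 + 1.0` rounds back to `1e16`, so one of the two additions of `1.0` is lost. Exact accumulation avoids this at a substantial performance cost, which is why practical reproducible reductions need more careful algorithms.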