Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversa...
详细信息
ISBN:
(纸本)9780769546759
Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithms such as reachability and graph matching. Since the scale of data stored and queried in these databases is increasing, it is important to obtain high performing implementations of graph traversal that can efficiently utilize the processing power of modern processors. In this work, we present a scalable Breadth-First Search Traversal algorithm for modern multi-socket, multi-core CPUs. Our algorithm uses lock- and atomic-free operations on a cache-resident structure for arbitrary sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources. We propose a work distribution approach for multi-socket platforms that ensures load-balancing while keeping cross-socket communication low. We provide a detailed analytical model that accurately projects the performance of our single- and multi-socket traversal algorithms to within 5-10% of obtained performance. Our analytical model serves as a useful tool to analyze performance bottlenecks on modern CPUs. When measured on various synthetic and real-world graphs with a wide range of graph sizes, vertex degrees and graph diameters, our implementation on a dual-socket Intel (R) Xeon (R) X5570 (Intel microarchitecture code name Nehalem) system achieves 1.5X-13.2X performance speedup over the best reported numbers. We achieve around 1 Billion traversed edges per second on a scale-free R-MAT graph with 64M vertices and 2 Billion edges on a dual-socket Nehalem system. Our optimized algorithm is useful as a building block for efficient multi-node implementations and future exascale systems, thereby allowing them to ride the trend of increasing per-node compute and bandwidth resources.
Advances in communication for parallel programming have yielded one-sided messaging systems. The MPI bindings for Ruby have been augmented to include the remote memory access functions of MPI-2.
ISBN:
(纸本)0780321754
Advances in communication for parallel programming have yielded one-sided messaging systems. The MPI bindings for Ruby have been augmented to include the remote memory access functions of MPI-2.
Due to the rapid growth in the multicore and GPU based computing devices, the need to teach parallel computing in CS/CE curriculum has become almost mandatory nowadays. A course on parallel Computing systems (PCS) has...
详细信息
ISBN:
(纸本)9781538655559
Due to the rapid growth in the multicore and GPU based computing devices, the need to teach parallel computing in CS/CE curriculum has become almost mandatory nowadays. A course on parallel Computing systems (PCS) has been designed to provide an understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems as well as to teach parallel programming techniques necessary to effectively utilize these machines. An activity based learning approach was adopted for teaching the course and several parallel programming paradigms and technologies such OpenMP, MPI, and CUDA have been covered. This course was offered as a required course to graduate students. This paper describes the implementation of the course at Thiagarajar College of Engineering. Evaluation of the implementation of the course reveals that for students who have not been exposed to parallel and distributed computing, i) activity based learning results in better knowledge gain compared to the traditional approach, ii) learning OpenMP was much easier than MPI or CUDA, iii) some parallel and distributed Computing (PDC) concepts such as false sharing were harder to grasp compared to basic concepts, and iv) it is essential to introduce parallel computing in the undergraduate curriculum.
We employ probabilistic causality analysis to study the performance data of 301 students from the upper-level undergraduate parallel programming class at the University of Central Florida. To our surprise, we discover...
详细信息
ISBN:
(纸本)9781538655559
We employ probabilistic causality analysis to study the performance data of 301 students from the upper-level undergraduate parallel programming class at the University of Central Florida. To our surprise, we discover that good performance in our lower-level undergraduate programming CS-1 and CS-II classes is not a significant causal factor that contributed to good performance in our parallel programming class. On the other hand, good performance in systems classes like Operating systems, Information Security, Computer Architecture, Object Oriented Software and systems Software coupled with good performance in theoretical classes like Introduction to Discrete Structures, Artificial Intelligence and Discrete Structures-II are strong indicators of good performance in our upper-level undergraduate parallel programming class. We believe that such causal analysis may be useful in identifying whether parallel and distributed computing concepts have effectively penetrated the lower-level computer science classes at an institution.
Index is an important element in databases, and the existence of index is unavoidable. When an index has been built on a particular attribute, database operations (e.g. selection, join) on this attribute will become m...
详细信息
ISBN:
(纸本)0769515797
Index is an important element in databases, and the existence of index is unavoidable. When an index has been built on a particular attribute, database operations (e.g. selection, join) on this attribute will become more efficient by utilizing the index. In this paper we focus on parallel algorithms for selection queries involving index that is data searching on indexed attributes. In this paper, we propose two categories of parallel selection queries using index: parallel exact match and range selections;depending on the type of selection conditions. As parallel algorithms for these selection queries are very much influenced by indexing schemes, we will also describe various index partitioning methods for paralleldatabases, and discusses their efficiency in supporting parallel selection query processing.
As processors and systems on chip in the embedded world increasingly become multicore, parallel programming remains a difficult, time-consuming and complicated task. End users who are not parallel programming experts ...
详细信息
ISBN:
(纸本)9781479942930
As processors and systems on chip in the embedded world increasingly become multicore, parallel programming remains a difficult, time-consuming and complicated task. End users who are not parallel programming experts have a need to exploit such processors and architectures, using high level programming languages, like Scilab or MATLAB. The ALMA toolset solves this problem: it takes Scilab code as input and produces parallel code for embedded multiprocessor systems on chip, using platform quasi-agnostic optimizations. The platform information is provided by an architecture description language designed for the purpose of a flexible system description as well as simulation. A hierarchical system description in combination with a parameterizable simulation environment allows fine-grained trade-offs between simulation performance and simulation accuracy.
A parallel technique with a coupled inductor is proposed for an LCL resonant push-pull converter. The coupled inductor is inserted in the front of the common resonant capacitor of two parallel soft-switching push-pull...
详细信息
ISBN:
(纸本)9781424456703
A parallel technique with a coupled inductor is proposed for an LCL resonant push-pull converter. The coupled inductor is inserted in the front of the common resonant capacitor of two parallel soft-switching push-pull converters to share the load currents of the two converters. The equivalent circuit and state equation of the proposed converter is described, and then the effect of the mismatch of the equivalent resistor and leakage inductor on current-sharing degree is derived. Experimental results of a 2kW inverter showed that the current difference of the two parallel units is less than 2%.
Novel platforms of modular robot systems have been developed with important applications in safety, transportation and sensing domains. In such systems, modular robots are able to change their organization in order to...
详细信息
ISBN:
(纸本)9781479942930
Novel platforms of modular robot systems have been developed with important applications in safety, transportation and sensing domains. In such systems, modular robots are able to change their organization in order to obtain different shapes. The conception of distributed programs allowing the "optimal" reorganization of a set of robots into a specific shape appears as a very challenging problem. In this paper we present an original distributed meta-algorithm for micro-robots shape-shifting problem. We show that this meta-algorithm, described as a general functioning schema, presents a good framework to easily conceive distributed algorithms for shape-shifting problems. We also prove the facility to instantiate the algorithm for special target shapes and we give an adaptation of the algorithm to reach any horizontally convex form. The presented meta-algorithm presents two main advantages: first, there is no need to exact positioning of the robots and secondly, the memory storage and communication requirements are significantly reduced.
SPICE is widely used for transistor-level circuit simulation. However, with the growing complexity of the VLSI at nano-scale, the traditional SPICE simulator has become inefficient to provide accurate verifications. T...
详细信息
ISBN:
(纸本)9780769546766
SPICE is widely used for transistor-level circuit simulation. However, with the growing complexity of the VLSI at nano-scale, the traditional SPICE simulator has become inefficient to provide accurate verifications. This thesis tries to accelerate transistor-level simulation on multi/many-core systems, and we will solve 3 problems: 1) develop a parallel sparse LU factorization algorithm for circuit simulation;2) implement the matrix solver on GPU to further accelerate the solver;3) develop a circuit partitioning based parallel simulation approach on distributed machines to obtain better scalability. The experimental results show that the proposed parallel LU factorization algorithm effectively accelerates the matrix solver for circuit simulation on both CPU and GPU.
The amount of data generated by social media, social networks and distributed platforms such as blockchain, have reached quite high levels. Various data analysis methods could be applied this big data. One of these me...
详细信息
ISBN:
(纸本)9781728138015
The amount of data generated by social media, social networks and distributed platforms such as blockchain, have reached quite high levels. Various data analysis methods could be applied this big data. One of these methods is to classify geo-tagged social network data in order to report geographical area associated with the data. We propose an efficient parallel classification approach and implement a classifier tool which is capable of processing huge amount of data. To test our approach, we collect Twitter data over five densest areas of Turkey. There are important factors affecting the classification performance such as the spatial indexing and the parallelization strategies. Hierarchical Triangular Mesh (HTM) and R-Tree spatial indexes are used for indexing regions. For parallel processing data streams classifier tool is implemented based on Apache Spark and Kafka platforms in order to obtain high scalability. To show effectiveness of our method, we perform tests on Amazon Web Services (AWS) Cloud environment and compare our method against a method which implements HTM on a Microsoft SQL Server. Results show that 1.6 - 4.5 fold speed-up is obtained and Twitter data that is collected over a month can be processed effectively in three hours.
暂无评论