ISBN (digital): 9783642314643
ISBN (print): 9783642314636; 9783642314643
The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et al. [LAPACK Working Note #247, 2011], the bidiagonal transformation using tile algorithms with a two-stage approach has shown very promising results on square matrices. However, for tall and skinny matrices, the inherent problem of processing the panel in a domino-like fashion generates unnecessary sequential tasks. By using tree reduction, the panel is horizontally split, which creates another dimension of parallelism and engenders many concurrent tasks to be dynamically scheduled on the available cores. The results reported in this paper are very encouraging. The new tile bidiagonal transformation, targeting tall and skinny matrices, outperforms the state-of-the-art numerical linear algebra libraries LAPACK V3.2 and Intel MKL ver. 10.3 by up to a 29-fold speedup and the standard two-stage PLASMA BRD by up to a 20-fold speedup, on an eight-socket hexa-core AMD Opteron multicore shared-memory system.
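The contrast between the domino-like panel sweep and a tree reduction can be illustrated with a small scheduling sketch. It is not the PLASMA implementation: the function names and the choice of a binary tree are assumptions made for illustration, and the tiles are just integer labels. The sketch only counts how many dependent steps each scheme needs to reduce the tiles of one panel.

```python
from math import ceil, log2


def flat_reduction(tiles):
    """Domino-style chain (illustrative): the top tile absorbs every other
    tile in turn, so each step holds exactly one task."""
    return [[(tiles[0], tiles[i])] for i in range(1, len(tiles))]


def tree_reduction(tiles):
    """Binary tree (illustrative): surviving tiles are paired at every level,
    and all pairs within a level are independent of each other."""
    levels = []
    alive = list(tiles)
    while len(alive) > 1:
        pairs = [(alive[i], alive[i + 1]) for i in range(0, len(alive) - 1, 2)]
        levels.append(pairs)
        survivors = [top for top, _ in pairs]
        if len(alive) % 2 == 1:
            survivors.append(alive[-1])  # unpaired tile waits for the next level
        alive = survivors
    return levels


if __name__ == "__main__":
    panel = list(range(8))  # 8 tiles in a tall-and-skinny panel
    flat = flat_reduction(panel)
    tree = tree_reduction(panel)
    print("flat steps:", len(flat))   # number of dependent steps in the chain
    print("tree levels:", len(tree))  # number of dependent levels in the tree
    print("tree level sizes:", [len(level) for level in tree])
    assert len(tree) == ceil(log2(len(panel)))
```

Both schemes perform the same number of pairwise reductions; the tree only shortens the chain of dependencies, which is what gives a dynamic scheduler more concurrent tasks to place on the available cores.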
Classical grammar for natural languages, as defined by linguistics, is widely used in many natural language processing (NLP) tasks, such as information extraction, machine translation and parsing. The class...
ISBN (print): 9781467308595
This paper describes the design of a unified support vector machine (SVM) circuit for pedestrian and car detection. By unifying the algorithms and architectures of linear and nonlinear SVM classification, the proposed circuit supports both linear and nonlinear classification very efficiently in terms of circuit size and performance. The circuit size is minimized by sharing most of the resources required in the computation for both classification types. A parallel, pipelined architecture is adopted to accelerate processing and handle the large number of operations needed for real-time processing. 48x96 and 64x64 sliding windows with 6 window strides are used to detect pedestrians and cars, respectively. The circuit, synthesized with a 65nm standard cell library, consists of 848,349 gates, and its maximum operating frequency is 435MHz. The circuit can process 91.9 640x480 image frames per second, assuming three cameras mounted at the front, right, and left of the vehicle.
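To give a feel for the workload behind these figures, the following back-of-the-envelope sketch counts sliding-window positions per frame. It rests on assumptions the abstract does not confirm: that "6 window strides" means a 6-pixel step in both directions, that windows are given as width x height, and that only a single image scale is scanned. The helper name is hypothetical.

```python
def window_positions(frame_w, frame_h, win_w, win_h, stride):
    """Number of window placements that fit fully inside the frame
    (assumed: same stride horizontally and vertically, single scale)."""
    across = (frame_w - win_w) // stride + 1
    down = (frame_h - win_h) // stride + 1
    return across * down


FRAME_W, FRAME_H = 640, 480
STRIDE = 6  # assumption: stride in pixels

pedestrian = window_positions(FRAME_W, FRAME_H, 48, 96, STRIDE)
car = window_positions(FRAME_W, FRAME_H, 64, 64, STRIDE)

# The abstract reports 91.9 frames/s in total for three cameras.
per_camera_fps = 91.9 / 3

print("pedestrian windows per frame:", pedestrian)
print("car windows per frame:", car)
print(f"frames per second per camera: {per_camera_fps:.1f}")
```

Under these assumptions each frame requires several thousand classifier evaluations per object class, which is consistent with the abstract's emphasis on a parallel, pipelined architecture.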
To make parallel programming as widespread as parallel architectures, more structured parallel programming paradigms are necessary. One possible approach is algorithmic skeletons. They can be seen as higher ...
The GAIA Extended Research Infrastructure is located in the southeast of Spain. It targets research on Future Internet architectures and comprises several facilities from the University of Murcia and the Spanish g...
Most Data Warehouses (DW) are stored in Relational Database Management Systems (RDBMS) using a star-schema model. While this model yields a trade-off between performance and storage requirements, huge data warehouses ...
ISBN (digital): 9783642314643
ISBN (print): 9783642314636; 9783642314643
The parallel FEM package NuscaS allows us to solve adaptive FEM problems with 3D unstructured meshes on distributed-memory parallel computers such as PC clusters. In our previous work, a new method for parallelizing the FEM adaptation was presented, based on the 8-tetrahedra longest-edge partition. This method relies on a decentralized approach and is more scalable than previous implementations that require a centralized synchronizing node. Today, cluster nodes contain more and more processing cores. Their efficient utilization is crucial for achieving high performance in numerical codes. In this paper, different schemes for mapping the mesh adaptation algorithm onto such hierarchical architectures are presented and compared. These schemes use either the pure message-passing model or a hybrid approach that combines the shared-memory and message-passing models. We also investigate an approach for adapting the pure MPI model to the hierarchical topology of clusters with multi-core nodes.
ISBN (digital): 9783642297403
ISBN (print): 9783642297403; 9783642297397
In this paper, we study two hierarchical N-Body methods for Network-on-Chip (NoC) architectures. Modern Chip Multiprocessor (CMP) designs are mainly based on a shared-bus communication architecture. As the number of cores increases, this architecture suffers from high communication delays. Therefore, NoC-based architectures have been proposed. The N-Body problem is the classical problem of approximating the motion of bodies. Two methods, namely Barnes-Hut (Barnes) and Fast Multipole (FMM), have been developed for fast simulation. The two algorithms have been implemented and studied on conventional computer systems and Graphics Processing Units (GPUs). However, the evaluation of N-Body methods on a NoC platform, a promising unconventional multicore architecture, has not been well addressed. We define a NoC model based on state-of-the-art systems. Evaluation results are presented using a cycle-accurate full-system simulator. Experiments show that Barnes scales better (53.7x for Barnes and 36.6x for FMM with 64 processing elements) and requires less cache than FMM. However, we observe hot-spot traffic in Barnes. Our analysis and experimental results provide a guideline for studying N-Body methods on a NoC platform.
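The reported scaling figures can be turned into parallel efficiency with a one-line calculation. This sketch assumes the quoted speedups are measured relative to a single processing element, which the abstract does not state explicitly; the function name is illustrative.

```python
def parallel_efficiency(speedup, processing_elements):
    """Efficiency = speedup / number of processing elements
    (1.0 would mean perfectly linear scaling)."""
    return speedup / processing_elements


PES = 64
barnes = parallel_efficiency(53.7, PES)  # Barnes-Hut speedup from the abstract
fmm = parallel_efficiency(36.6, PES)     # Fast Multipole speedup from the abstract

print(f"Barnes-Hut efficiency on {PES} PEs: {barnes:.1%}")
print(f"FMM efficiency on {PES} PEs: {fmm:.1%}")
```

Read this way, Barnes-Hut retains roughly five sixths of ideal scaling on 64 processing elements, while FMM retains somewhat more than half.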
In this paper, two parallel numerical algorithms for the solution of parabolic problems on graphs are investigated. The fully implicit and predictor-corrector finite difference schemes are proposed to approximate the diffe...
ISBN (print): 9780769549569; 9781467362184
Processing of extremely large polygonal (vector-based) spatial datasets has been a long-standing research challenge for scientists in the Geographic Information Systems and Science (GIS) community. Surprisingly, this is not for lack of individual parallel algorithms; we discovered that the irregular and data-intensive nature of the underlying processing is the main reason for the meager amount of work on system design and implementation. Furthermore, of all the systems reported in the literature, very few deal with the complexities of vector-based datasets, and none, including commercial systems, does so on a cloud platform. We have designed and implemented an open-architecture-based system named Crayons for the Windows Azure cloud platform using state-of-the-art techniques. We have implemented three different architectures of Crayons with different load-balancing schemes. Crayons scales well for sufficiently large data sets, achieving an end-to-end absolute speedup of over 28-fold using 100 Azure processors. For smaller and more irregular workloads, it still yields over 10-fold speedup.