Efforts to support the requirements of high performance computing (HPC) applications in the context of cloud computing have motivated us to design HPC Shelf, a cloud computing services platform for building and deploying large-scale parallel computing systems. We introduce Alite, the contextual contract system of HPC Shelf, which selects component implementations according to the requirements of the host application, the characteristics of the target parallel computing platform (e.g., clusters and MPPs), quality of service (QoS) properties, and cost restrictions. Alite is evaluated through a small-scale case study employing two complementary component-based frameworks. The first represents components that implement linear algebra computations based on the BLAS interface. The second represents parallel computing platforms on the IaaS cloud offered by the Amazon EC2 service.
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine GEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as the metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32 K × 32 K matrix with 128 Trident lanes, using matrix-vector operations in the standard Golub and Kahan algorithm gives a speedup of around 1.5 times over using vector operations. However, using matrix operations in the GEBRD subroutine gives a speedup of around 3 times over vector operations, and 2 times over matrix-vector operations in the standard Golub and Kahan algorithm.
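For readers unfamiliar with the algorithm being simulated, the following is a minimal plain-Python sketch of Householder bidiagonalization in the Golub and Kahan style, assuming a dense matrix stored as a list of rows with at least as many rows as columns. It is not the Trident or LAPACK implementation; the function names `householder_vector` and `bidiagonalize` are hypothetical and chosen only for illustration.

```python
import math


def householder_vector(x):
    """Return v so that the reflection I - 2 v v^T / (v^T v) maps x onto a multiple of e_1."""
    norm = math.sqrt(sum(t * t for t in x))
    v = list(x)
    if norm == 0.0:
        return v  # x is already zero, nothing to annihilate
    # Add the norm with the sign of x[0] to avoid cancellation.
    v[0] += math.copysign(norm, x[0])
    return v


def bidiagonalize(a):
    """Reduce an m x n matrix (m >= n, list of rows) to upper bidiagonal form.

    Only the bidiagonal matrix is returned; the orthogonal factors are not accumulated.
    """
    b = [row[:] for row in a]  # work on a copy, leave the input untouched
    m, n = len(b), len(b[0])
    for k in range(n):
        # Left reflection: zero column k below the diagonal.
        v = householder_vector([b[i][k] for i in range(k, m)])
        vtv = sum(t * t for t in v)
        if vtv > 0.0:
            for j in range(k, n):
                s = 2.0 * sum(v[i - k] * b[i][j] for i in range(k, m)) / vtv
                for i in range(k, m):
                    b[i][j] -= s * v[i - k]
        # Right reflection: zero row k to the right of the superdiagonal.
        if k < n - 2:
            v = householder_vector(b[k][k + 1:])
            vtv = sum(t * t for t in v)
            if vtv > 0.0:
                for i in range(k, m):
                    s = 2.0 * sum(v[j - k - 1] * b[i][j] for j in range(k + 1, n)) / vtv
                    for j in range(k + 1, n):
                        b[i][j] -= s * v[j - k - 1]
    return b
```

Each reflection here is applied column by column or row by row, which corresponds to the vector and matrix-vector style of computation. A blocked routine such as GEBRD groups several reflections so that most of the work becomes matrix operations.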
Summary form only given. Apache Hadoop has become the platform of choice for developing large-scale data-intensive applications. In this tutorial, we will discuss the design philosophy of Hadoop and describe how to design and develop Hadoop applications and higher-level application frameworks that crunch several terabytes of data, using anywhere from four to 4,000 computers. We will discuss solutions to common problems encountered in maximizing Hadoop application performance. We will also describe several frameworks and utilities developed using Hadoop that increase programmer productivity and application performance.
To improve the performance of applications on OpenMP/JIAJIA, we present a new abstraction, the Array Relation Vector (ARV), which describes the relation between the data elements of two consistent shared arrays accessed in one computation phase. Based on ARV, we use array grouping to eliminate the pseudo data distribution of small shared data and to improve page locality. Experimental results show that ARV-based array grouping can greatly improve the performance of applications with non-continuous data access and strict access affinity on an OpenMP/JIAJIA cluster. For applications with small shared arrays, array grouping noticeably improves performance when the number of processors is small.
ISBN: 0780373715 (print)
The recent popularity of the Java programming language has brought automatic dynamic memory management (a.k.a. garbage collection) into the mainstream. Traditional garbage collectors suffer from long garbage collection pauses (the stop-the-world mark-sweep algorithm) or from the inability to collect cyclic garbage (the reference counting approach). Generational garbage collection, however, is based only on the weak generational hypothesis that most objects die young. This paper reports the performance evaluation of a new multithreaded concurrent generational garbage collector (MCGC) based on mark-sweep with the assistance of reference counting. The MCGC can take advantage of multiple CPUs in an SMP system and of the merits of lightweight processes. Furthermore, the long garbage collection pause can be reduced and garbage collection efficiency can be enhanced. Measurement results indicate that the MCGC reduces the garbage collection pause time by up to 96.75% compared with the traditional stop-the-world mark-sweep garbage collector. Moreover, the MCGC incurs minimal time and space penalties, as shown by the reported total execution time, memory footprint, and sticky reference count rate.
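The contrast between the two traditional approaches can be illustrated with a toy heap. The sketch below is a single-threaded illustration under simplifying assumptions, not the MCGC: the heap is a hypothetical dictionary mapping each object name to the names it references, and the function names are invented for the example.

```python
def reference_counts(heap, roots):
    """Count incoming references (from roots and from other objects) for every object."""
    counts = {obj: 0 for obj in heap}
    for obj in roots:
        counts[obj] += 1
    for refs in heap.values():
        for target in refs:
            counts[target] += 1
    return counts


def mark(heap, roots):
    """Mark phase: return the set of objects reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj not in marked:
            marked.add(obj)
            stack.extend(heap[obj])
    return marked


def sweep(heap, marked):
    """Sweep phase: return the set of unmarked objects, which are garbage."""
    return {obj for obj in heap if obj not in marked}


# "c" and "d" reference each other but are unreachable from the root "a".
heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
roots = ["a"]

counts = reference_counts(heap, roots)      # no count is zero, so nothing is freed
garbage = sweep(heap, mark(heap, roots))    # the unreachable cycle is found
```

In this example every object keeps a nonzero count, so pure reference counting never reclaims the cycle, whereas the mark and sweep passes identify both `c` and `d` as garbage at the cost of traversing the reachable heap.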
In this paper, we propose a new family of interconnection networks, called cyclic networks (CNs), in which an intercluster connection is defined on a set of nodes whose addresses are cyclic shifts of one another. The node degrees of basic CNs are independent of system size, but can vary from a small constant (e.g., 3) to as large as required, thus providing flexibility and an effective tradeoff between cost and performance. The diameters of suitably constructed CNs can be asymptotically optimal within their lower bounds, given the degrees. We show that packet routing and ascend/descend algorithms can be performed in Θ(log_d N) communication steps on some CNs with N nodes of degree Θ(d). Moreover, CNs can also efficiently emulate homogeneous product networks (e.g., hypercubes and high-dimensional meshes). As a consequence, we obtain a variety of efficient algorithms on such networks, thus proving the versatility of CNs.
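As a small illustration of the addressing idea only, the sketch below enumerates the set of addresses that are cyclic shifts of a given node address. The abstract does not specify how links are placed within such a set, so the function `cyclic_shifts` and the string encoding of addresses are assumptions made for this example.

```python
def cyclic_shifts(address):
    """Return the distinct cyclic shifts of an address, ordered by shift amount."""
    shifts = []
    for s in range(len(address)):
        shifted = address[s:] + address[:s]  # rotate left by s positions
        if shifted not in shifts:
            shifts.append(shifted)
    return shifts


# Nodes sharing one intercluster connection, under the assumed string encoding.
group = cyclic_shifts("0112")
```

An address with internal periodicity, such as `"0101"`, yields fewer distinct shifts than its length, so the size of such a set can vary from node to node.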
ISBN: 9781479979301 (print)
Satellite remote sensing radar technologies provide powerful tools for geohazard monitoring and risk management at synoptic scale. In particular, advanced Multi-Temporal SAR Interferometric algorithms are capable of detecting ground deformations and structural instabilities with millimetric precision, but impose strong requirements in terms of hardware resources. Recent advances in GPU computing and programming hold promise for time-efficient implementation of imaging algorithms, thus enhancing the development of advanced Emergency Management Services based on Earth Observation technologies. In this study, a preliminary assessment of the potential of GPU processing is carried out by comparing CPU (single- and multi-thread) and GPU implementations of time-consuming InSAR algorithm kernels. In particular, the study focuses on the fine coregistration of SAR interferometric pairs, a crucial step in the interferogram generation process. Experimental results are discussed.
A study is reported whose aim was to produce a system to facilitate offline programming of robots and to provide a testbed for alternative algorithms for the services provided. The system was specified using the formal description technique LOTOS (language of temporal ordering specification). LOTOS is best known for its use in the description of OSI protocols and is supported by an ISO standard. LOTOS consists of a process algebra for specifying the structure of the system and the interactions between its components, and an algebraic data typing mechanism for specifying the operations the system carries out. The description of the system was heavily influenced by techniques used in the design of operating systems. Concurrency was introduced at the initial design stage, there was an explicit separation of concerns, and the specification was structured hierarchically, with actions at one level appearing atomic to the next higher level. Each level in the hierarchy provides an increasingly abstract view of the robot. The resulting description was executed, or animated, using the SEDOS tool to help determine that the correct behaviour had been encapsulated by the description. The specification was then implemented on a network of transputers, using 3L parallel Pascal.
Parsec is a parallel programming environment whose goal is to simplify the development of multicomputer programs without, as is often the case, sacrificing performance. We have reconciled these objectives by "compiling" the structure of parallel applications into information that configures each of a small set of communication primitives on a context-sensitive basis. In this paper, we show how Parsec can be used to implement a high-performance processor farm, and we compare Parsec and hand-optimized implementations to demonstrate that Parsec can achieve a similar level of performance. Extensive static analysis and optimization are necessary to achieve these results. We discuss both the tools that perform these tasks and the user interface that provides the necessary declarative structural information. Using the processor farm, we show how Parsec simplifies the task of specifying the structure of a parallel application and improves the result by supporting abstraction, reuse, and scalability.
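The processor farm pattern itself can be sketched in a few lines. The example below is a generic illustration using Python threads and a shared task queue, assuming all tasks are known up front; it does not use Parsec or its communication primitives, and the name `processor_farm` is hypothetical.

```python
import queue
import threading


def processor_farm(task, items, workers=4):
    """Apply task to every item using a pool of workers that pull from a shared queue."""
    tasks = queue.Queue()
    for index, item in enumerate(items):
        tasks.put((index, item))  # keep the index so results can be reordered

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker repeatedly takes the next available task until none remain.
        while True:
            try:
                index, item = tasks.get_nowait()
            except queue.Empty:
                return
            value = task(item)
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Return results in the original item order.
    return [results[i] for i in range(len(results))]


squares = processor_farm(lambda x: x * x, range(6), workers=3)
```

Because workers pull tasks on demand, faster workers naturally take on more tasks, which is the load-balancing property that makes the farm a common structure for parallel applications.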