ISBN: 9781424442379 (print)
Distributed applications tend to have a complex design due to issues such as concurrency, synchronization and communication. Researchers in the past have proposed simpler abstractions to hide these complexities. However, many of the proposed techniques use messaging protocols which incur high overhead and are not very scalable. To address these limitations, in our previous work [20], we proposed an efficient Distributed Data Sharing Substrate (DDSS) using the features of high-speed networks. In this paper we propose several design optimizations for DDSS on multi-core systems, such as combining shared memory and message queues for inter-process communication, and using a dedicated thread for communication progress and for onloading DDSS operations such as get and put. Our micro-benchmark results not only show a very low latency in DDSS operations but also demonstrate the scalability of DDSS with an increasing number of processes. Application evaluations with R-Tree and B-Tree query processing and distributed STORM show improvements of up to 56%, 45% and 44%, respectively, compared to traditional implementations. Evaluations with application checkpointing using DDSS demonstrate scalability with an increasing number of checkpointing applications. Further, in our evaluations, we demonstrate the portability of DDSS across multiple modern interconnects, including InfiniBand and iWARP-capable 10-Gigabit Ethernet networks (applicable to both LAN and WAN environments).
ISBN: 9783540898931 (print)
Multi-core processors introduce many challenges at both the system and application levels that need to be addressed in order to attain the best performance. In this paper, we study the impact of multi-core technologies in the context of two scalable, production-level molecular dynamics simulation frameworks. The experimental analysis and observations in this paper provide a better understanding of the interactions between the application and underlying system features such as memory bandwidth, architectural optimization, and communication library implementation. In particular, we observe that parallel efficiencies could be as low as 50% on quad-core systems, while a set of dual-core processors connected with a high-speed interconnect can easily outperform the same number of cores on a socket or in a package. This indicates that certain modifications to the software stack and application implementations are necessary in order to fully exploit the performance of multi-core based systems.
With the progress of semiconductor technologies and then the advent of multi-core processors, the age of serial computing is over and parallel computing technology is now emerging as mainstream. Parallel programming m...
ISBN: 9781424416936 (print)
MPI_Alltoall is one of the most communication-intensive collective operations used in many parallel applications. Recently, the supercomputing arena has witnessed phenomenal growth of commodity clusters built using InfiniBand and multi-core systems. In this context, it is important to optimize this operation for these emerging clusters to allow for good application scaling. However, optimizing MPI_Alltoall on these emerging systems is not a trivial task. The InfiniBand architecture allows for varying implementations of the network protocol stack. For example, the protocol can be totally on-loaded to a host processing core, it can be off-loaded onto the NIC, or it can use any combination of the two. Understanding the characteristics of these different implementations is critical in optimizing a communication-intensive operation such as MPI_Alltoall. In this paper, we systematically study these different architectures and propose new schemes for MPI_Alltoall tailored to these architectures. Specifically, we demonstrate that no single common scheme performs optimally on each of these varying architectures. For example, on-loaded implementations can exploit multiple cores to achieve better network utilization, while in offload interfaces aggregation can be used to avoid congestion on multi-core systems. We employ shared memory aggregation techniques in these schemes and elucidate the impact of these schemes on multi-core systems. The proposed design achieves a reduction in MPI_Alltoall time of 55% for 512-byte messages and speeds up the CPMD application by 33%.
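To make the all-to-all exchange pattern and the motivation for aggregation concrete, the following is a minimal pure-Python sketch. It is not the scheme proposed in the paper and does not use MPI: `alltoall` only models the data movement of the collective on in-memory lists, and `internode_messages` is a hypothetical counting model that assumes a direct pairwise exchange and one aggregating leader per node.

```python
def alltoall(send):
    """Model the data movement of an all-to-all exchange.

    send[j][i] is the block that rank j sends to rank i.
    The result satisfies recv[i][j] == send[j][i], i.e. a transpose.
    """
    p = len(send)
    if any(len(row) != p for row in send):
        raise ValueError("each rank must provide exactly one block per rank")
    return [[send[j][i] for j in range(p)] for i in range(p)]


def internode_messages(nodes, cores_per_node, aggregate):
    """Count network messages for one exchange (assumed simple model).

    Without aggregation, every rank sends one message to every rank
    on another node. With aggregation, one leader per node sends a
    single combined message to each other node.
    """
    if aggregate:
        return nodes * (nodes - 1)
    ranks = nodes * cores_per_node
    return ranks * (ranks - cores_per_node)


if __name__ == "__main__":
    print(alltoall([[1, 2], [3, 4]]))
    print(internode_messages(4, 8, aggregate=False))
    print(internode_messages(4, 8, aggregate=True))
```

Under this model the aggregated messages are fewer but larger, since each one carries every block exchanged between a pair of nodes. Whether that trade-off pays off depends on the network interface, which is the kind of architecture dependence the abstract describes.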
ISBN: 9781424428205 (print)
The emergence of multi-core and many-core processors has introduced new opportunities and challenges to EDA research and development. While the availability of increasing parallel computing power holds new promise to address many computing challenges in CAD, hardware parallelism can only be leveraged with a new generation of parallel CAD applications. In this paper, we propose a novel Multi-Algorithm Parallel circuit Simulation approach (MAPS) and its multi-core implementation to expedite one of the most fundamental CAD applications: transistor-level transient circuit simulation. MAPS starts multiple simulation algorithms in parallel for a given simulation task. By properly synchronizing these algorithms on-the-fly, we exploit the diversity in simulation algorithms to achieve possibly superlinear overall speedup in transient simulation. In addition, our multi-algorithm framework allows safe exploration of simulation methods that are conventionally discarded due to convergence concerns. As a coarse-grained parallel simulation approach, the implementation of MAPS demands a minimum of parallel programming effort and allows for reuse of existing serial simulation codes.
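The idea of starting several algorithms on the same task can be illustrated with a first-to-finish race. The sketch below is a deliberate simplification and not MAPS itself: it races two whole root-finding routines instead of synchronizing circuit simulation algorithms on-the-fly, and all names (`bisection`, `newton`, `race`) are hypothetical. It does show one property the abstract mentions, namely that an algorithm which may fail to converge can be tried safely because a more robust one runs alongside it.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def bisection(f, lo, hi, tol=1e-12):
    """Robust but slow: halve a bracketing interval until it is tiny."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def newton(f, df, x, tol=1e-12, max_iter=50):
    """Fast when it converges, but convergence is not guaranteed."""
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")


def race(tasks):
    """Run all callables concurrently; return (name, result) of the
    first one that finishes without raising an exception."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {pool.submit(fn): name for name, fn in tasks.items()}
        pending = set(futures)
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                if fut.exception() is None:
                    return futures[fut], fut.result()
    raise RuntimeError("all algorithms failed")


if __name__ == "__main__":
    f = lambda x: x * x - 2.0
    df = lambda x: 2.0 * x
    winner, root = race({
        "bisection": lambda: bisection(f, 0.0, 2.0),
        "newton": lambda: newton(f, df, 1.0),
    })
    print(winner, root)
```

Which routine wins is not deterministic, so callers should rely only on the returned result and treat the winner's name as informational.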
We present a novel rails approach so that future e-Science applications can effectively exploit future system architectures, including multi-core and many-core architectures, multiple network cards, multiple graphical...
Author: Dennis, Jack B. (MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139, USA)
ISBN: 9781424416936 (print)
The Fresh Breeze Project concerns the architecture and design of a multicore chip that can achieve superior performance while supporting composability of parallel programs. The requirements of composability imply that processor allocation and memory management must be sufficiently flexible to permit reassignment of resources according to the current needs of computations. The Fresh Breeze programming model combines the spawn/join threading model of Cilk [4] with a write-once memory model based on fixed-size chunks that are allocated and freed by efficient hardware mechanisms. This model supports computing jobs by many users, each consisting of a hierarchy of function activations. The model satisfies all six principles for supporting modular program construction [3]. Within this programming model, it is possible for any parallel program to be used, without change, as a component in building larger parallel programs.
During the project we worked on physical conflicts that occurred in a multiagent system. To resolve the problem, many different conflict resolution strategies were examined and many proposed solutions were found. ...
The leap from single-core to multi-core has permanently altered the course of computing, enabling increased productivity, powerful energy-efficient performance, and leading-edge advanced computing experiences. Althoug...
ISBN: 9781424418367 (print)
Aggregation is among the core functionalities of OLAP systems. Frequently, such queries are issued in decision support systems to identify interesting groups of data. When more than one aggregation function is involved and the notion of interest is not clearly defined, skyline queries provide a robust mechanism to capture the potentially interesting points, where (i) users do not need to specify a ranking function and (ii) the result is independent of the dimension scales. To provide better exploration functionalities in OLAP systems, we propose to use skyline queries over aggregated data to identify the most interesting groups. Since aggregation functions have to be ad hoc to cover a wide variety of user interests, the skyline over the aggregates has to be computed on the fly. Hence any algorithm to compute such a skyline must be fast and able to produce the result set progressively, with potential skyline groups being produced as early as possible. We explore a family of algorithms which try to consume only as many data records as are necessary to compute the skyline, and we design an optimal algorithm. We further refine the algorithm by taking into account system issues such as disk behavior, which are often ignored but have a strong impact on real system performance. Experimental results validate the performance and progressive benefits of our algorithm.
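As an illustration of what a skyline over aggregated data returns, here is a minimal sketch. It is a naive baseline that reads every record and compares all pairs of groups, not the progressive or optimal algorithm the abstract refers to. The choice of SUM and AVG as the two aggregates, the assumption that larger is better for both, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def aggregate(records):
    """Group (group_key, value) pairs and return {group: (total, mean)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    return {g: (sums[g], sums[g] / counts[g]) for g in sums}


def dominates(a, b):
    """a dominates b if a is at least as good in every dimension
    and strictly better in at least one (larger is better here)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def skyline(points):
    """Return the keys whose aggregate vector no other group dominates."""
    return {
        key for key, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    }


if __name__ == "__main__":
    records = [
        ("A", 10), ("A", 10), ("A", 10),  # total 30, mean 10
        ("B", 25),                        # total 25, mean 25
        ("C", 5), ("C", 5),               # total 10, mean 5
        ("D", 12), ("D", 12),             # total 24, mean 12
    ]
    print(sorted(skyline(aggregate(records))))
```

In this example groups A and B survive because each is best on one aggregate, while C and D are dominated on both. No ranking function had to be supplied, and rescaling either aggregate would not change the result.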