ISBN: 9781728176505 (print)
Distributed machine learning (DML) has become a feasible solution for dealing with growing training data and models. Among existing DML architectures, the parameter server (PS) architecture stands out for iterative convergence algorithms and is widely deployed in practice, thanks to advantages such as flexible expansion. Under this architecture, parameter synchronization based on the Bulk Synchronous Parallel (BSP) model has become a research hotspot. In BSP mode, the efficiency of each iteration is determined by the slowest node in the cluster, so the straggler problem becomes the main cause of reduced DML training efficiency, and it is even more prominent in heterogeneous cloud services. Existing work mainly focuses on the straggler problem and usually ignores the importance of communication. However, inefficient communication is also one of the reasons for inefficient DML iterations. In this paper, we propose DSANA, which first alleviates certain straggler problems by dynamically scheduling computation tasks. Second, DSANA improves the overlap of computation and communication by dividing larger transmission parameters, thus further improving the iteration efficiency of DML training. We conduct comparison experiments with the classic iterative algorithm PageRank on four data sets of different scales in two cloud service scenarios. The experimental results show that DSANA improves training efficiency by 36.6% to 56.4% compared with the baseline solution.
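The straggler effect under BSP can be illustrated with a small sketch. This is not DSANA's scheduling algorithm, which the abstract does not specify; it is a minimal model that assumes each worker has a known processing speed (the `speeds` values below are hypothetical) and that tasks are handed out in proportion to that speed.

```python
def bsp_iteration_time(task_counts, speeds):
    """Time of one BSP iteration: every worker waits for the slowest one."""
    return max(n / s for n, s in zip(task_counts, speeds))


def rebalance(total_tasks, speeds):
    """Assign tasks in proportion to speed, using largest-remainder rounding."""
    total_speed = sum(speeds)
    shares = [total_tasks * s / total_speed for s in speeds]
    counts = [int(x) for x in shares]
    # Hand out the tasks lost to rounding, largest fractional part first.
    leftovers = total_tasks - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts


speeds = [4.0, 2.0, 1.0, 1.0]   # hypothetical tasks per second for each worker
even = [25, 25, 25, 25]         # static, equal split of 100 tasks
balanced = rebalance(100, speeds)

print(bsp_iteration_time(even, speeds))      # bounded by the slowest workers
print(balanced, bsp_iteration_time(balanced, speeds))
```

With an equal split, the two slowest workers dictate the iteration time; a speed-proportional split shortens the barrier wait without changing the total amount of work.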
ISBN: 9781467301206 (print)
We present a general-purpose software framework that allows different multi-disciplinary communities to take advantage of a distributed computational infrastructure. The ultimate goal is to provide organizations that need to exploit resources for CPU-intensive, loosely parallel tasks with a software service capable of offering user-friendly, standard and highly customizable access to the Grid. The software suite we developed has been designed specifically for organizations that cannot afford the adoption costs of the more specialized and complex frameworks developed in the High Energy Physics (HEP) environment, but that still require an easy-to-use interface to the Grid. Our framework relies heavily on a bookkeeping database, storing both application-specific and infrastructure metadata, which is tightly coupled with a web-based user interface. The former makes available to users information on the execution status of jobs and their specific meaning and parameters, and contributes to orchestrating the submission mechanism. The latter provides job submission management, bookkeeping database interactions, basic monitoring functionality and an eLog system. Multi-site submissions based on user-defined requests and fine-grained parametric submission interfaces are available. The structure of the framework services follows a centralized design: the job management service and the bookkeeping database are hosted at a European Grid Infrastructure (EGI) site. Jobs executed at remote sites transfer their output to a predefined target site repository and update the bookkeeping database. In addition, the framework requires proper configuration of the remote Grid sites on which the jobs will run. Results from a large production of Monte Carlo simulated events submitted to 15 Grid sites are reported, and a comparison in terms of features, scope, and targets with a broad spectrum of general-purpose solutions in the same field of application is presented as well.
Parallel computational scientific applications have been described by their computation and communication patterns. From a storage and I/O perspective, these applications can also be grouped into separate data models based on the way data is organized and accessed during simulation, analysis, and visualization. Parallel netCDF is a popular library used in many scientific applications to store scientific datasets, and it provides high-performance parallel I/O. Although the metadata-rich netCDF file format can effectively store and describe regular multi-dimensional array datasets, it does not address the full range of current and future computational science data models. In this paper, we present a new storage scheme in Parallel netCDF to represent a broad variety of data models used in modern computational scientific applications. This scheme also allows concurrent metadata construction for different data objects from multiple groups of application processes, an important feature for obtaining a high degree of I/O parallelism for data models exhibiting irregular data distribution. Furthermore, we employ non-blocking I/O functions to aggregate irregularly distributed data requests into large, contiguous data requests to achieve high-performance I/O. Using an adaptive mesh refinement data model as an example, we demonstrate that the proposed scheme can produce scalable performance results for both data and metadata creation and access.
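The aggregation idea mentioned above, combining many small, irregular requests into a few large contiguous ones, can be sketched in plain Python. This is not the Parallel netCDF API and not the paper's storage scheme; it assumes each pending request is a hypothetical `(offset, length)` byte range and only shows the coalescing step that would run before the I/O is issued.

```python
def coalesce(requests):
    """Merge (offset, length) byte ranges that touch or overlap."""
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1]:
            # Adjacent to or overlapping the previous range: grow it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical pending requests, posted in arbitrary order.
pending = [(96, 32), (0, 16), (16, 48), (64, 32), (256, 8)]
combined = coalesce(pending)
print(combined)
```

Five posted requests collapse into two contiguous ranges here, which is the kind of reduction that lets a deferred (non-blocking) interface issue fewer and larger I/O operations.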
Hadoop is an efficient and simple parallel framework that follows the MapReduce paradigm and has recently made parallel processing a hot topic in data-intensive applications. Since Hadoop can be easily deployed on large-scale clusters of up to thousands of computers, various studies aim to process common relational database operations on this new platform and expect to achieve remarkable performance. However, these works have to prepare customized programs for each input format, which makes communication between co-workers difficult. Additionally, all intermediate data have to be transformed into key-value pairs and then transferred through the underlying HDFS, so that the data can be processed by Map and Reduce tasks and the workload on the cluster stays balanced. During this period, unnecessary overhead decreases both the speed-up and the scale-up of these systems. Therefore, this paper proposes a light and efficient coupling structure that combines Hadoop with single-computer databases at the engine level. On the one hand, it uses a well-designed parallel data model that lets end users express parallel queries like common queries. All current and future data types and algorithms can be used directly, with no need to be specifically changed for the parallel platform. On the other hand, it provides a simple and independent distributed file system to transfer data among database engines directly, without passing through HDFS, and hence removes as much unnecessary transform and transfer overhead as possible. For the purpose of demonstration, a prototype, Parallel Secondo, is introduced in this paper. It has been fully evaluated on both small and large-scale clusters, achieving satisfactory performance for different database operations.
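To make the key-value transformation described above concrete, the following in-memory sketch runs a relational group-by-and-sum through map, shuffle, and reduce phases. It does not use Hadoop or Secondo; the table `orders` and its column names are hypothetical, and the single-process "shuffle" only stands in for the data transfer that the abstract identifies as overhead.

```python
from collections import defaultdict


def map_phase(rows, key_col, val_col):
    """Turn each relational row into a (key, value) pair."""
    for row in rows:
        yield row[key_col], row[val_col]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; here, SUM(amount) GROUP BY customer."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input relation.
orders = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]
totals = reduce_phase(shuffle(map_phase(orders, "customer", "amount")))
print(totals)
```

Every row is rewritten as a pair and regrouped before any aggregation happens, which is the conversion and transfer cost that a tighter engine-level coupling tries to avoid.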
ISBN: 9783540631019 (print)
The proceedings contain 18 papers. The special focus in this conference is on Modeling Techniques and Tools for Computer Performance Evaluation. The topics include: A performability modeling environment tool; dependability evaluation and the optimization of performability; design and implementation of a network computing platform using JAVA; storage alternatives for large structured state spaces; an efficient disk-based tool for solving very large Markov models; efficient transient overload tests for real-time systems; towards an analytical tool for performance modeling of ATM networks by decomposition; an embedded network simulator to support network protocols’ development; synchronized two-way voice simulation for internet phone performance analysis and evaluation; processes as language-oriented building blocks of stochastic Petri nets; measurement tools and modeling techniques for evaluating WEB server performance; workload characterization of input/output intensive parallel applications; interval based workload characterization for distributed systems; bounding the loss rates in a multistage ATM switch; simple bounds for queues fed by Markovian sources and on queue length moments in fork and join queuing networks with general service times.
ISBN: 9783031083402 (print)
The proceedings contain 35 papers. The special focus in this conference is on Mining Humanistic Data. The topics include: Digitally Assisted Planning and Monitoring of Supportive Recommendations in Cancer Patients; CAIPI in Practice: Towards Explainable Interactive Medical Image Classification; A Deep Q Network-Based Multi-connectivity Algorithm for Heterogeneous 4G/5G Cellular Systems; Simulating Blockchain Consensus Protocols in Julia: Proof of Work vs Proof of Stake; Maximum Likelihood Estimators on MCMC Sampling Algorithms for Decision Making; Employing Natural Language Processing Techniques for Online Job Vacancies Classification; Probabilistic Quantile Multi-step Forecasting of Energy Market Prices: A UK Case Study; Proactive Buildings: A Prescriptive Maintenance Approach; Performance Meta-analysis for Big-Data Univariate Auto-Imputation in the Building Sector; Non-intrusive Diagnostics for Legacy Heat-Pump Performance Degradation; A 5G-Based Architecture for Localization Accuracy; Anomaly Detection in Small-Scale Industrial and Household Appliances; An Innovative Software Platform for Efficient Energy, Environmental and Cost Planning in Buildings Retrofitting; Deep Learning-Based Segmentation of the Atherosclerotic Carotid Plaque in Ultrasonic Images; An Intelligent Grammar-Based Platform for RNA H-type Pseudoknot Prediction; An Automated 2D U-Net Segmentation Method for the Identification of Cancer Brain Metastases Using MRI Images; The Use of Robotics in Critical Use Cases: The 5G-ERA Project Solution; Fundamental Features of the Smart5Grid Platform Towards Realizing 5G Implementation; Experimentation Scenarios for Machine Learning-Based Resource Management; Efficient Data Management and Interoperability Middleware in Business-Oriented Smart Port Use Cases; 5G for the Support of Smart Power Grids: Millisecond Level Precise Distributed Generation Monitoring and Real-Time Wide Area Monitoring; Monitoring Neurological Disorder Patients via Deep Learning Based Facial Expressions An
ISBN: 9781509042982 (print)
Molecular dynamics (MD) simulation is a common tool for studying the physical movements of atoms and molecules in many research fields. However, it is an extremely time-consuming application: a single simulation can take researchers weeks or months as simulation sizes scale up and computing demands keep growing. In this paper, an improved MD implementation on the Sunway TaihuLight supercomputer is developed to address these issues. The new implementation extends an existing, widely used MD application (LAMMPS). Sunway TaihuLight is a heterogeneous supercomputer with a fully customized integration approach and a brand-new many-core processor, the SW26010. Sunway TaihuLight ranks first in the world, with a peak performance of over 100 PFlops. We propose an optimization method for the MD simulation in three steps: parallel extension to the SW26010, memory-access optimization, and vectorization. After the optimization process, an 8x speedup is achieved on a single computing node. Moreover, we scale up to 95 thousand computing nodes (6 million cores) with an almost linear speedup. In addition, the proposed methods can be applied to many other molecular dynamics codes and to similar applications.
ISBN: 9780769520353 (print)
We propose a practical parallel on-the-fly algorithm for enumerative LTL (linear temporal logic) model checking. The algorithm is designed for a cluster of workstations communicating via MPI (Message Passing Interface). The detection of cycles (faulty runs) effectively employs so-called back-level edges. In particular, a parallel level-synchronized breadth-first search of the graph is performed to discover back-level edges. For each level, the back-level edges are checked in parallel by a nested depth-first search to confirm or refute the presence of a cycle. Several optimizations of the basic algorithm are presented, and the advantages and drawbacks of their application to distributed LTL model checking are discussed. An experimental implementation of the algorithm shows promising results.
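The two phases described above can be sketched sequentially. This is not the paper's distributed algorithm: it runs in one process, ignores acceptance conditions, and assumes that a back-level edge is an edge whose target lies at a BFS level no deeper than its source. The graph and the function names are hypothetical.

```python
def bfs_back_level_edges(graph, root):
    """Level-by-level BFS that records edges pointing to the same or a shallower level."""
    level = {root: 0}
    frontier = [root]
    back_edges = []
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in level:
                    level[v] = level[u] + 1
                    next_frontier.append(v)
                elif level[v] <= level[u]:
                    back_edges.append((u, v))
        frontier = next_frontier
    return level, back_edges


def closes_cycle(graph, edge):
    """Depth-first search from the edge's target; a cycle exists if it reaches the source."""
    u, v = edge
    stack, seen = [v], {v}
    while stack:
        node = stack.pop()
        if node == u:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Hypothetical state graph: c -> a closes a cycle, b -> a does not.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["c", "a"], "c": ["a", "d"], "d": []}
level, back_edges = bfs_back_level_edges(graph, "s")
cycles = [e for e in back_edges if closes_cycle(graph, e)]
print(back_edges, cycles)
```

Each candidate edge is checked independently of the others, which is what makes the confirmation step a natural fit for parallel execution.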
This research presents a quantitative comparison of the computational performance of several parallel programming approaches. The comparison is performed for the Computational Dynamics Problem solved by the MacCormack scheme. The parallel computation properties of this sample problem are well understood. The parallel programming techniques were chosen considering recent trends in high-performance computing. Both high-level framework-based implementations (using oneAPI DPC++ and ArrayFire) and low-level implementations (based on CUDA C++) are reviewed, and their performance is compared. Additionally, single SMP systems with multiple CUDA-capable GPUs were studied using GPUDirect and Unified Memory technologies. Wall time was used as the performance metric. The comparison was performed using Student’s t-test for the Gaussian-distributed experimental results and the non-parametric Wilcoxon signed-rank test for other distributions. The results show that CUDA-based solutions outperform the other approaches, though development-time considerations can often favor higher-level approaches.
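For readers unfamiliar with the MacCormack scheme, a minimal serial sketch follows. The abstract does not state which equations were solved, so this example assumes the simplest model problem, 1D linear advection with periodic boundaries, where `c` is the Courant number (wave speed times time step divided by grid spacing). Each grid point is updated independently within a stage, which is why the scheme maps well onto GPUs.

```python
def maccormack_step(u, c):
    """One MacCormack step for u_t + a*u_x = 0 on a periodic grid."""
    n = len(u)
    # Predictor: forward difference.
    pred = [u[i] - c * (u[(i + 1) % n] - u[i]) for i in range(n)]
    # Corrector: backward difference on the predicted values, then average.
    # pred[i - 1] wraps around at i == 0, giving the periodic boundary.
    return [0.5 * (u[i] + pred[i] - c * (pred[i] - pred[i - 1]))
            for i in range(n)]


pulse = [0.0, 1.0, 0.0, 0.0]
shifted = maccormack_step(pulse, 1.0)   # c = 1: the pulse moves one cell
smeared = maccormack_step(pulse, 0.5)   # c < 1: dispersive, but conservative
print(shifted, sum(smeared))
```

At a Courant number of exactly 1 the update reduces to a pure shift by one cell, which gives a convenient correctness check before porting the kernel to a parallel framework.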
ISBN: 9781479954711 (print)
Developing a comprehensive performance benchmark for on-line transaction processing (OLTP) applications in a cloud environment is a challenging task. Fundamental features of clouds, such as the pay-as-you-go pricing model and the unknown underlying configuration of the system, run contrary to the basic assumptions of available benchmarks such as TPC-W or RUBiS. In this paper, we introduce a systematic performance benchmark approach for OLTP applications on public clouds that use virtual machines (VMs). We propose the WPress benchmark, which uses the widespread blogging software WordPress as a representative OLTP application, and implement an open-source workload generator. Furthermore, we use a CPU micro-benchmark to investigate the CPU performance of cloud-based VMs in greater detail. Average response time and total VM cost are the performance metrics measured by WPress. We evaluate small and large instance types from three real-life cloud providers: Amazon EC2, Microsoft Azure and Rackspace Cloud. The results suggest that Rackspace Cloud has better average response times and total VM cost on small instances, whereas Microsoft Azure is preferable for the large instance type.