ISBN (print): 9781450305525
Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class of bugs consists of those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may be no bug-free runs at large scales, and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.
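Vrisha builds its models with kernel canonical correlation analysis (KCCA), which does not fit in a short example. The sketch below illustrates only the underlying idea with a swapped-in, much simpler technique: an ordinary least-squares fit of one observed property against one scale feature, trained on bug-free small-scale runs and then used to flag deviations at a larger scale. The choice of log2(process count) as the feature, the 10% tolerance, and all function names are hypothetical assumptions, not part of Vrisha.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def scale_feature(procs):
    # Hypothetical scale feature: log2 of the process count.
    return math.log2(procs)


def train(scales, observations):
    """Build a model from bug-free runs at small scales only."""
    return fit_line([scale_feature(p) for p in scales], observations)


def predict(model, procs):
    """Extrapolate the modeled property to a (possibly much larger) scale."""
    a, b = model
    return a + b * scale_feature(procs)


def is_anomalous(model, procs, observed, rel_tol=0.10):
    """Flag a run whose observed value deviates from the extrapolated
    prediction by more than rel_tol (hypothetical 10% threshold)."""
    expected = predict(model, procs)
    return abs(observed - expected) > rel_tol * abs(expected)


# Bug-free training runs at 4, 8, 16 and 32 processes.
model = train([4, 8, 16, 32], [4, 6, 8, 10])
print(predict(model, 1024))           # 20.0
print(is_anomalous(model, 1024, 20))  # False: consistent with the model
print(is_anomalous(model, 1024, 35))  # True: deviates, so flagged
```

Here the training data follow y = 2 log2(p) exactly, so a 1024-process run reporting 20 passes while one reporting 35 is flagged. KCCA generalizes this by correlating many scale features with many observed features at once, which a single regression line cannot do.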
Current distributed computing systems comprising commodity computers, such as Networks of Workstations (NOW), are obliged to deploy multicore processors to raise their performance. However, because multicore processors w...
Optimization of task scheduling represents one of the most important open issues in large-scale distributed systems. Generally, the overall performance of a distributed system is highly influenced by the quality of...
Volunteer computing systems such as BOINC use several interacting scheduling policies, which must address multiple requirements across a large space of usage scenarios. In developing BOINC, we need to design and optim...
Autonomic computing systems promise to manage themselves according to a set of basic rules specified for higher-level objectives. One of the challenges in making this possible is dependable collaboration among peers in a large-s...
ISBN (print): 0769529364
The proceedings contain 52 papers. The topics discussed include: knowledge-based platform for environmental risk management; distributed symbolic computations; building an HPC ecosystem in Europe; job management in WebCom; whole genome comparison on a network of workstations; dual communication network in program control based on global application state monitoring; checkpoint and recovery for parallel applications with dynamic number of processes; a prototype of social and economic based resource allocation system in grid computing; credentials management for authentication in a grid-based e-learning platform; integrating distributed component and mobile agents programming models in grid computing; interactive fusion simulation and visualization on the grid; fully distributed active and passive task management for grid computing; a new iterative method to improve network coordinates-based internet distance estimation; and dynamic workflow control with global states monitoring.
Orchestrating composite applications inside distributed systems requires complex coordination. In this context, workflow orchestration engines provide a viable solution. Unlike their centralized counterparts, decent...
Classification is perhaps the most widely used and most in-demand task in data mining. Most existing classification algorithms require access to all the data used to construct the classification model, or at least a good part of...
ISBN (print): 9781450305525
Heterogeneous architectures are becoming an important way to build massively parallel computer systems, e.g., the CPU-GPU heterogeneous systems ranked in the Top500 list. However, efficiently utilizing the massive parallelism of both applications and architectures on such heterogeneous systems is challenging. In this paper, we describe how to exploit and orchestrate parallelism at the algorithm level to take advantage of the underlying parallelism at the architecture level. Cryo-EM 3D reconstruction, a potential petaflops application, is selected as an example. We exploit all possible parallelism in cryo-EM 3D reconstruction and leverage a self-adaptive dynamic scheduling algorithm to create a proper mapping of parallelism between the application and the architecture. The parallelized programs are evaluated on a subsystem of the Dawning Nebulae supercomputer, each node of which is composed of two Intel six-core Xeon CPUs and one Nvidia Fermi GPU. The experiment confirms that hierarchical parallelism is an efficient parallel programming pattern for utilizing the capabilities of both the CPU and the GPU in a heterogeneous system. The CUDA kernels run more than 3 times faster than the OpenMP-parallelized ones using 12 cores (threads). Relative to the GPU-only version, the hybrid CPU-GPU program further improves the whole application's performance by 30% on average.
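The abstract does not detail its self-adaptive dynamic scheduling algorithm. As a generic illustration of dynamic CPU-GPU work sharing, not the authors' algorithm, the sketch below simulates pull-based scheduling: each device takes the next chunk of work from a shared queue as soon as it becomes idle, so work is divided according to the devices' actual speeds without a fixed static split. The function name, chunk size, and throughput figures (the GPU set to roughly three times the CPU, loosely echoing the reported kernel speedup) are hypothetical.

```python
import heapq


def dynamic_schedule(n_items, chunk, speeds):
    """Simulate pull-based dynamic scheduling of n_items work items.

    speeds maps a worker name to its (hypothetical) throughput in items per
    time unit. Whenever a worker becomes idle it pulls the next chunk from the
    shared queue, so faster workers end up processing more chunks.
    Returns (items processed per worker, makespan).
    """
    done = {w: 0 for w in speeds}
    # Min-heap of (time at which the worker becomes idle, worker name).
    idle_at = [(0.0, w) for w in sorted(speeds)]
    heapq.heapify(idle_at)
    remaining, makespan = n_items, 0.0
    while remaining > 0:
        t, w = heapq.heappop(idle_at)      # earliest idle worker
        take = min(chunk, remaining)       # last chunk may be smaller
        remaining -= take
        done[w] += take
        finish = t + take / speeds[w]
        makespan = max(makespan, finish)
        heapq.heappush(idle_at, (finish, w))
    return done, makespan


# Hypothetical node: GPU about 3x the throughput of the multicore CPU path.
done, makespan = dynamic_schedule(400, 10, {"cpu": 1.0, "gpu": 3.0})
print(done, round(makespan, 2))
```

Smaller chunks balance load more finely at the cost of more scheduling operations; larger chunks reduce overhead but can leave the faster device idle while the slower one finishes its last chunk.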
MATLAB® and its open-source counterpart Octave have proven to be among the most productive environments for scientific computing in recent years. There have been multiple efforts to develop an efficient paral...