In this paper, we apply data mining classification schemes to predict failures in a high-performance computing system. Failure and usage logs collected on supercomputing clusters at Los Alamos National Laboratory (LANL) were used to extract instances of failure information. For each failure instance, past and future failure information is accumulated: time of usage, system idle time, time of unavailability, time since last failure, and time to next failure. We performed two separate analyses, with and without classifying the failures by their root cause. Based on this data, we applied several popular decision tree classifiers to predict whether a failure would occur within one hour. Our experiments show that our prediction system achieves precision of up to 73% and recall of about 80%. We also observed that employing the usage data along with the failure data improved the accuracy of prediction.
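The pipeline this abstract describes (per-failure features labeled by whether the next failure falls within one hour, fed to a decision tree learner) can be sketched in miniature. This is an illustrative reconstruction, not the paper's code: the feature set is reduced to a single attribute (time since last failure), the classifier to a one-level decision stump, and the timestamps are invented.

```python
# Hypothetical sketch of the abstract's setup: extract per-failure features and
# a "failure within 1 hour" label, then fit a one-level decision stump.
# All names, thresholds, and data below are illustrative, not from the paper.

def extract_instances(failure_times):
    """For each failure, compute time-since-last-failure (seconds) and whether
    the next failure occurs within 1 hour (3600 s)."""
    instances = []
    for i in range(1, len(failure_times) - 1):
        since_last = failure_times[i] - failure_times[i - 1]
        to_next = failure_times[i + 1] - failure_times[i]
        instances.append((since_last, to_next <= 3600))
    return instances

def fit_stump(instances):
    """Pick the threshold on time-since-last-failure maximizing training accuracy."""
    best = (None, 0.0)
    for thresh, _ in instances:
        correct = sum((since <= thresh) == label for since, label in instances)
        acc = correct / len(instances)
        if acc > best[1]:
            best = (thresh, acc)
    return best  # (threshold, training accuracy)

# Failures often cluster in bursts: short gaps tend to precede more failures.
times = [0, 1000, 1800, 2500, 20000, 90000, 91000, 91500, 92200, 180000]
instances = extract_instances(times)
threshold, acc = fit_stump(instances)
print(threshold, round(acc, 2))
```

A real experiment would use the full LANL feature set (usage time, idle time, unavailability) and a full decision tree learner rather than a single stump.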
In this paper, we investigate the accuracy of position estimation for distributed sensor nodes. The positions are estimated from the connectivity status of local communication between the nodes; the actual positions of the nodes are unknown. A host computer, which collects the connectivity conditions of the entire network, iteratively refines the estimated positions so as to reduce the mismatch between the observed connectivity and the connectivity implied by the estimated positions. We carried out simulation studies with respect to the spatial extent of the deployment, the density of the sensor nodes, and dynamic deployments in which the sensor positions move and change.
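A toy version of connectivity-based refinement can illustrate the idea. The update rule below is our assumption, not the paper's algorithm: nodes observed as neighbors are pulled together when their estimates sit beyond 90% of an assumed radio range R, and non-neighbors are pushed apart when their estimates sit within 110% of it, so a settled configuration also satisfies the hard range-R connectivity test.

```python
import math

R = 1.0  # assumed radio range; nodes within R of each other can communicate

def connected(p, q):
    return math.dist(p, q) <= R

def refine(obs, est, steps=200, lr=0.05):
    """Nudge estimated positions so that the connectivity they imply
    matches the observed connectivity matrix `obs`."""
    n = len(est)
    for _ in range(steps):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(est[i], est[j]) or 1e-9
                ux = (est[j][0] - est[i][0]) / d
                uy = (est[j][1] - est[i][1]) / d
                if obs[i][j] and d > 0.9 * R:        # should be neighbors: pull
                    step = lr * (d - 0.9 * R)
                    est[i] = (est[i][0] + step * ux, est[i][1] + step * uy)
                elif not obs[i][j] and d < 1.1 * R:  # should not be: push
                    step = lr * (1.1 * R - d)
                    est[i] = (est[i][0] - step * ux, est[i][1] - step * uy)
    return est

def mismatches(obs, est):
    n = len(est)
    return sum(obs[i][j] != connected(est[i], est[j])
               for i in range(n) for j in range(i + 1, n))

actual = [(0.0, 0.0), (0.8, 0.0), (2.0, 0.0)]   # unknown to the host computer
obs = [[connected(p, q) for q in actual] for p in actual]
est = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0)]      # initial guesses
before = mismatches(obs, est)
est = refine(obs, est)
after = mismatches(obs, est)
print(before, after)
```

Note that connectivity only constrains relative geometry: the recovered layout can be translated, rotated, or mirrored relative to the actual positions.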
Many modern embedded systems, such as cyber-physical systems, feature close integration of computation and physical components. Configurability, efficiency, adaptability, reliability, and usability are essential features for such systems. A workflow engine is a software application that manages workflows. It helps developers separate control flows from the activities of the system, and thus can enhance configurability and development efficiency. This work aims at the design and implementation of a workflow engine for cyber-physical systems, so that workflows can be configured with less effort. The engine schedules activities to meet their timing requirements and provides admission control, so that the timing requirements of a set of workflows are guaranteed as long as the set is admitted. A humanoid robot is used as the test-bed for the workflow engine. We model robot applications as workflows and show how the workflow engine provides real-time guarantees and enhances configurability.
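The abstract does not specify the engine's admission test. One classical form such a test could take, assuming a periodic task model for workflows (an assumption of ours), is the rate-monotonic utilization bound: admit a workflow set only if its total utilization stays within n(2^(1/n) - 1), which guarantees all deadlines under rate-monotonic scheduling.

```python
# Illustrative admission control under an assumed periodic model: each
# workflow is (worst-case execution time, period). Admit the set only if
# the Liu-Layland rate-monotonic bound guarantees every deadline.

def admit_rm(workflows):
    n = len(workflows)
    utilization = sum(c / t for c, t in workflows)
    return utilization <= n * (2 ** (1 / n) - 1)   # bound ~0.828 for n = 2

print(admit_rm([(10, 100), (20, 200)]))   # U = 0.20: admitted
print(admit_rm([(60, 100), (50, 100)]))   # U = 1.10: rejected
```

An engine using EDF instead could admit any set with total utilization up to 1.0; the abstract leaves the actual policy open.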
Runtime adaptation of business processes and their IT system implementations to changes can usually be done in several ways. The MiniZnMASC middleware makes adaptation decisions that maximize business value while satisfying all given constraints. All necessary information about alternative adaptations and their business metrics is specified as policies in WS-Policy4MASC. Using an example loan application business process, we demonstrate how MiniZnMASC supports four different autonomic business-driven decision-making algorithms for adaptation.
This study considers a heterogeneous computing system and corresponding workload being investigated by the Extreme Scale Systems Center (ESSC) at Oak Ridge National Laboratory (ORNL). The ESSC is part of a collaborative effort between the Department of Energy (DOE) and the Department of Defense (DoD) to deliver research, tools, software, and technologies that can be integrated, deployed, and used in both DOE and DoD environments. The heterogeneous system and workload described here are representative of a prototypical computing environment being studied as part of this collaboration. Each task can exhibit a time-varying importance or utility to the overall enterprise. In this system, an arriving task has an associated priority and precedence. The priority is used to describe the importance of a task, and precedence is used to describe how soon the task must be executed. These two metrics are combined to create a utility function curve that indicates how valuable it is for the system to complete a task at any given moment. This research focuses on using time-utility functions to generate a metric that can be used to compare the performance of different resource schedulers in a heterogeneous computing system. The contributions of this paper are: (a) a mathematical model of a heterogeneous computing system where tasks arrive dynamically and need to be assigned based on their priority, precedence, utility characteristic class, and task execution type, (b) the use of priority and precedence to generate time-utility functions that describe the value a task has at any given time, (c) the derivation of a metric based on the total utility gained from completing tasks to measure the performance of the computing environment, and (d) a comparison of the performance of resource allocation heuristics in this environment.
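A small sketch can make the priority/precedence combination concrete. The specific curve shape below (full value until a precedence-derived point, then linear decay to zero) and the metric normalization are our assumptions, not the paper's definitions.

```python
# Hypothetical time-utility function: priority sets the maximum utility,
# precedence sets how soon that utility starts to decay after task arrival.

def make_tuf(priority, precedence):
    """Return u(t): full value until `precedence` seconds after arrival,
    then linear decay to zero over another `precedence` seconds."""
    def u(t):
        if t <= precedence:
            return float(priority)
        if t >= 2 * precedence:
            return 0.0
        return priority * (2 * precedence - t) / precedence
    return u

def utility_metric(completions):
    """Fraction of the maximum achievable utility actually earned.
    `completions` is a list of (tuf, max_utility, completion_time)."""
    earned = sum(tuf(t) for tuf, _, t in completions)
    possible = sum(m for _, m, _ in completions)
    return earned / possible

high = make_tuf(priority=10, precedence=60)   # important and urgent
low = make_tuf(priority=2, precedence=600)    # less important, relaxed
score = utility_metric([(high, 10, 30),       # in time: full 10
                        (high, 10, 90),       # half-decayed: 5
                        (low, 2, 300)])       # within window: full 2
print(score)
```

Under this metric a scheduler is rewarded both for finishing important tasks and for finishing urgent ones promptly, which is the comparison axis the paper proposes for resource allocation heuristics.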
Recently, PCM (phase change memory) has emerged as a new storage medium, and there is a strong prospect that PCM will be used as a storage device in the near future. Since the optimistic access time of PCM is expected to be almost identical to that of DRAM, a natural question is whether the traditional buffer cache will still be effective for high-speed secondary storage such as PCM. This paper answers that question by showing that the buffer cache remains effective in such environments due to the software overhead and the bimodal block reference characteristics. Based on this observation, we present a new buffer cache management scheme appropriate for systems where the speed gap between the cache and storage is small. To this end, we analyze the conditions under which caching yields gains and identify characteristics of I/O traces that can be exploited in managing the buffer cache for PCM storage.
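A back-of-the-envelope version of the caching-gain condition can be written down directly (this is our simplification, not the paper's model): with a buffer cache, an access costs the software overhead plus a DRAM hit or a PCM miss; without one, every access goes straight to PCM. The latency numbers are illustrative assumptions.

```python
# Caching pays off when  t_sw + h*t_dram + (1-h)*t_pcm < t_pcm,
# i.e. when the hit ratio h exceeds t_sw / (t_pcm - t_dram).
# All latency values below are made-up round numbers for illustration.

def break_even_hit_ratio(t_sw, t_dram, t_pcm):
    """Minimum hit ratio for the buffer cache to reduce average access time."""
    return t_sw / (t_pcm - t_dram)

# If PCM were only 2x slower than DRAM and the cache software overhead were
# half a DRAM access, caching would pay off only above a 50% hit ratio:
h_min = break_even_hit_ratio(t_sw=25, t_dram=50, t_pcm=100)
print(h_min)
```

This is why the bimodal reference pattern matters: hot blocks easily clear the break-even hit ratio while cold blocks never would, so a scheme for small speed gaps should spend cache space only on the hot mode.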
The implementation of communication protocols is an important development task that appears frequently in software projects. This article is a vision paper that describes the components of the currently available implementation strategies and the problems that arise. The article introduces the main existing protocol engineering techniques and puts them into the context of model-driven software development. Finally, a methodology is introduced for the automatic generation of manager interfaces of Device Agent protocols for use in a distributed, component-oriented environment, using ASN.1 and SDL.
To rewrite a sequential program into a concurrent one, the programmer has to enforce atomic execution of a sequence of accesses to shared memory to avoid unexpected inconsistency. There are two means of enforcing this atomicity: the use of lock-based synchronization and the use of software transactional memory (STM). However, it is difficult to predict which mechanism is more suitable for a given application without trying both, because their performance heavily depends on the application. We have developed a system named SAW that decouples the synchronization mechanism from the application logic of a Java program and enables the programmer to statically select a suitable synchronization mechanism, either a lock or an STM. We introduce annotations to specify critical sections and shared objects. In accordance with the annotated source program and the programmer's choice of synchronization mechanism, SAW generates aspects representing the synchronization processing. By comparing the rewriting cost using SAW with that of using each synchronization mechanism directly, we show that SAW relieves the programmer's burden. Through several benchmarks, we demonstrate that SAW is an effective way of switching synchronization mechanisms according to the characteristics of each application.
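SAW itself targets Java, using annotations and generated aspects; as a rough analogue only, the same decoupling can be sketched in Python with decorators, where the choice of mechanism lives at the marker rather than inside the application logic. The names below (`atomic`, `lock_based`) are ours, not SAW's.

```python
import threading

# Illustrative analogue of SAW's decoupling: the critical-section marker is
# separate from the application logic, so swapping lock-based synchronization
# for an STM-style mechanism means changing one configuration point.

_global_lock = threading.Lock()

def lock_based(fn):
    """One selectable mechanism: run the critical section under a lock."""
    def wrapper(*args, **kwargs):
        with _global_lock:
            return fn(*args, **kwargs)
    return wrapper

def atomic(mechanism):
    """Stand-in for SAW's annotation: the mechanism is chosen when the
    program is configured, not inside the application logic."""
    return mechanism

@atomic(lock_based)   # an STM-style decorator could be swapped in here
def transfer(accounts, src, dst, amount):
    accounts[src] -= amount
    accounts[dst] += amount

accounts = {"a": 100, "b": 0}
threads = [threading.Thread(target=transfer, args=(accounts, "a", "b", 1))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(accounts)
```

The application logic in `transfer` never mentions locks or transactions, which is the property SAW provides for Java via aspect generation.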
The trend in architectural design has been towards using simple cores to build multicore chips, instead of a single complex out-of-order (OOO) core, due to the increased complexity and energy requirements of OOO processors. Multicore chips provide better performance than OOO cores when executing parallel applications. However, they are not able to exploit the parallelism inherent in single-threaded applications. To this end, this paper presents a compiler optimization methodology, coupled with minimal hardware extensions, to extract simple fine-grained threads from a single-threaded application for execution on multiple cores of a chip multiprocessor (CMP). These fine-grained threads are independent and eliminate the need for communication between cores, avoiding costly communication latencies. This approach, which we call Parabilis, scales to up to eight cores and does not require complex hardware additions to simple multicore systems. Our evaluation shows that Parabilis yields an average speedup of 1.51 on an 8-core CMP architecture.
Network and systems management platforms were originally based on simple centralized architectures. Centralized architectures have proved deficient in managing current complex networks, such as the Internet. This has led to more complex and distributed architectures for network and system management. Throughout their development, management platforms have passed through intermediate stages such as weakly distributed control systems, strongly distributed control systems, domain-based systems, and active distributed management systems [1]. In order to facilitate the development of such a platform, we build a management collaborative community around grid concepts, so that it supports the integration of multiple management tasks in a parallel manner. Access to the information of different management domains requires computational resources that are provided through a grid interface and virtual organizations. Some system prototype results are presented.