The current trend for building an ontology-based data management system (DMS) is to capitalize on efforts made to design a preexisting, well-established DMS (a reference system). The method amounts to extracting from the reference DMS a piece of schema relevant to the new application needs (a module), possibly personalizing it with extra constraints w.r.t. the application under construction, and then managing a data set using the resulting schema. In this paper, we extend the existing definitions of modules and we introduce novel properties of robustness that provide means for easily checking that a robust module-based DMS evolves safely w.r.t. both the schema and the data of the reference DMS. We carry out our investigations in the setting of description logics, which underlie modern ontology languages such as RDFS, OWL, and OWL2 from W3C. Notably, we focus on the DL-liteA dialect of the DL-lite family, which encompasses the foundations of the QL profile of OWL2 (i.e., DL-liteR): the W3C recommendation for efficiently managing large data sets.
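The module notion above can be illustrated with a toy, purely syntactic closure over subclass axioms. This is only a sketch: the paper's modules and robustness properties rest on DL-Lite semantics, and the axioms, class names, and closure rule here are invented for illustration.

```python
# Naive "module" extraction from a set of subclass axioms, each a
# (subclass, superclass) pair: start from a seed signature and keep
# pulling in axioms that mention any symbol collected so far.
# Note this closure over-approximates (it drags in the Student axiom
# through the shared symbol Person); real locality-based extraction
# is more selective.

def extract_module(axioms, signature):
    """Return all axioms reachable from the seed signature."""
    sig = set(signature)
    module = set()
    changed = True
    while changed:
        changed = False
        for sub, sup in axioms:
            if (sub, sup) not in module and (sub in sig or sup in sig):
                module.add((sub, sup))
                sig.update((sub, sup))
                changed = True
    return module

reference = {
    ("Professor", "Staff"), ("Staff", "Person"),
    ("Student", "Person"), ("Car", "Vehicle"),
}
mod = extract_module(reference, {"Professor"})
```

The extracted schema piece can then be personalized with extra constraints and used to manage the new application's data, as the abstract describes.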
With the Semantic Web relying on ontologies to establish online machine-interpretable information, the Internet is growing into a semantically aware computing paradigm that facilitates Web entities' discovery of the knowledge and resources they need. Ambient intelligence aims to enable smart interaction beyond the Internet by embedding intelligence into our environment to unobtrusively support users' daily activities. To accomplish these goals, ontologies and semantic awareness are crucial for better understanding a user's context. While interest in the Semantic Web has spurred the development of large-scale semantic grid architectures, expanding the Semantic Web to the other end of the computing spectrum is a complex undertaking. The techniques and tools that support the Semantic Web aren't designed to deal with the resource-constrained devices with which people frequently interact in an ambient-intelligence environment. A proposed coding scheme for ontologies embeds semantic awareness in devices with limited memory and processing capabilities, such as sensory nodes and smart phones. This scheme provides a compact representation of an ontology and is enhanced with an efficient and effective semantic-matching algorithm similar to subsumption testing in many ontology reasoners.
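One classic way to make subsumption testing cheap on memory-constrained devices is an interval (pre-order) encoding of the class hierarchy, where subsumption reduces to two integer comparisons. The sketch below illustrates that general idea under toy assumptions; it is not the article's actual coding scheme, and the taxonomy is invented.

```python
# Compact interval encoding of a class hierarchy (a tree): each class
# gets (entry, last-descendant-entry) from a DFS numbering, so
# "sup subsumes sub" becomes an interval-containment check.

def encode(tree, root):
    """DFS numbering: map each class to (entry, last descendant entry)."""
    intervals, counter = {}, [0]
    def visit(node):
        start = counter[0]
        counter[0] += 1
        for child in tree.get(node, []):
            visit(child)
        intervals[node] = (start, counter[0] - 1)
    visit(root)
    return intervals

def subsumes(intervals, sup, sub):
    s1, e1 = intervals[sup]
    s2, e2 = intervals[sub]
    return s1 <= s2 and e2 <= e1

taxonomy = {"Thing": ["Device", "Agent"],
            "Device": ["Sensor", "Phone"],
            "Agent": ["User"]}
iv = encode(taxonomy, "Thing")
```

With two integers per class, even a sensory node can answer taxonomy queries without a full reasoner on board.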
Nowadays, storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not exploit their structure in a satisfying manner. A number of reference-based compression algorithms have been developed, but they are less adequate when approaching new species without fully sequenced genomes, or nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized to be further compressed with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, losing some of the original information (IDs and/or qualities) but resulting in smaller files; it is also possible to define a quality cutoff under which the corresponding base calls are converted to N. We achieve 2.82 to 7.77 compression ratios on various fastq files without losing information, and 5.37 to 8.77 when losing IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with known tools, usually obtaining higher compression levels.
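A minimal sketch of this kind of pipeline, assuming a toy read and illustrative 4-bit base codes (not the tool's actual binary format): mask base calls below the quality cutoff as N, pack the sequence densely, and hand the result to a standard compressor (zlib here, standing in for gzip).

```python
# Quality-cutoff masking + dense packing + standard compression.
# The field layout and codes are invented for illustration.
import zlib

CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def mask_low_quality(seq, qual, cutoff, offset=33):
    """Convert base calls whose Phred score is below `cutoff` to N."""
    return "".join(b if ord(q) - offset >= cutoff else "N"
                   for b, q in zip(seq, qual))

def pack(seq):
    """Two 4-bit base codes per byte; 0xF pads an odd-length tail."""
    out = bytearray()
    for i in range(0, len(seq), 2):
        hi = CODE[seq[i]]
        lo = CODE[seq[i + 1]] if i + 1 < len(seq) else 0xF
        out.append((hi << 4) | lo)
    return bytes(out)

seq, qual = "ACGTACGTNN", "IIII!!IIII"   # '!' = Phred 0, 'I' = Phred 40
masked = mask_low_quality(seq, qual, cutoff=20)
compressed = zlib.compress(pack(masked))
```

The single pass and fixed tables mirror the constant-memory, one-scan property claimed in the abstract.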
Autonomic decision-making based on rules and metrics is inevitably on the rise in distributed software systems. Often, the metrics are acquired from system observations such as static checks and runtime traces. To avoid bias propagation and hence reduce wrong decisions in increasingly autonomous systems due to poor observation data quality, multiple independent observers can exchange their findings and produce a majority-accepted, complete, and outlier-cleaned ground truth in the form of consensus-supported metrics. In this work, we motivate the growing importance of metrics for informed and autonomic decisions in clouds and other distributed systems, present reasons for diverging observations, and describe a federated approach to produce ground truth with data-centric consensus voting for more reliable decision-making processes. We validate the system design with experiments in the area of cloud software artefact observations and highlight the benefits for reproducible distributed system behaviour.
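Data-centric consensus over observer readings could look roughly like this; the median-based tolerance rule and the readings are invented for illustration, not the paper's voting protocol.

```python
# Consensus-supported metric: discard readings far from the overall
# median (outlier cleaning), then agree on the median of the rest.
from statistics import median

def consensus(readings, tolerance=0.2):
    """Majority-accepted value: median of readings within
    `tolerance` (relative) of the overall median."""
    m = median(readings)
    accepted = [r for r in readings if abs(r - m) <= tolerance * abs(m)]
    return median(accepted), accepted

# Five observers report the same latency metric; one is way off.
value, kept = consensus([99.1, 100.0, 100.4, 180.0, 99.7])
```

The outlier (180.0) is voted out, so a downstream autonomic rule sees a ground-truth value near 100 rather than a biased mean.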
Performance analysis of computing systems is an increasingly difficult task due to growing system complexity. Traditional tools rely on ad hoc procedures. With these, determining which of the manifold system and workload parameters to examine is often a lengthy and highly speculative process. The analysis is often incomplete and, therefore, prone to yielding faulty conclusions and leaving useful tuning knowledge undiscovered. We address this problem by introducing a data mining approach called ADMiRe (Analyzer for data Mining Results). In this scheme, regression analysis is first applied to performance data to discover correlations between various system and workload parameters. The results of this analysis are summarized in sets of regression rules. The user can then formulate intuitive algebraic expressions to manipulate these sets of rules to capture critical information. To demonstrate this approach, we use ADMiRe to analyze an Oracle database system running the TPC-C (Transaction Processing Performance Council) benchmark. The results generated by ADMiRe were confirmed by Oracle experts. We also show that by applying ADMiRe to Microsoft Internet Information Server performance data, we can improve system performance by 20 percent.
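The first step of such a scheme, regressing a performance metric on candidate workload parameters and keeping only influential ones as rule material, can be sketched as follows. The parameters, data, and significance threshold are invented; ADMiRe's actual rule language is richer.

```python
# Per-parameter least-squares slope of latency vs. each workload
# parameter; parameters with a large slope magnitude become
# candidate "regression rules".

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

samples = {
    "concurrent_users": [10, 20, 30, 40],
    "cache_size_mb":    [64, 64, 128, 128],
}
latency_ms = [100, 200, 300, 400]   # response time per sample

rules = {p: slope(xs, latency_ms) for p, xs in samples.items()}
significant = [p for p, s in rules.items() if abs(s) > 5.0]
```

Here the analyst's attention is directed to `concurrent_users` (about 10 ms of latency per user) rather than a lengthy speculative search over all parameters.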
In recent years, we have witnessed an increasing demand to process big data in numerous applications. It is observed that there often exist substantial amounts of repetitive data in different portions of a big data repository/dataset for applications such as genome sequence analyses. In this paper, we present a novel method, called the VA-Store, to reduce the large space requirement for repetitive data in prevailing genome sequence analysis tasks using k-mers (i.e., subsequences of length k) with multiple k values. The VA-Store maintains a physical store for one portion of the input dataset (i.e., k0-mers for a fixed length k0) and supports multiple virtual stores for other portions of the dataset (i.e., k-mers with k ≠ k0). Utilizing important relationships among repetitive data, the VA-Store transforms a given query on a virtual store into one or more queries on the physical store for execution. Both precise and approximate transformations are considered. Accuracy estimation models for approximate solutions are derived. Query optimization strategies are suggested to improve query performance. Our experiments using real and synthetic datasets demonstrate that the VA-Store is quite promising in providing effective storage and efficient query processing for solving a kernel database problem on repetitive big data for genome sequence analysis applications.
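The virtual-to-physical transformation can be illustrated for the simple case k < k0: a k-mer count is approximated by aggregating the physical k0-mer counts that share the k-length prefix. This is a toy sketch, not the VA-Store's actual rewrite; the boundary loss it exhibits is exactly why some transformations are only approximate and need accuracy estimation.

```python
# Physical store: exact k0-mer counts.  Virtual store for k < k0:
# derive k-mer counts by summing counts of k0-mers with that prefix.
# Occurrences within the last k0-1 positions of the sequence are not
# covered by any k0-mer prefix, hence the approximation.
from collections import Counter

def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def virtual_counts(physical, k):
    """Approximate k-mer counts from a k0-mer physical store (k < k0)."""
    derived = Counter()
    for k0mer, n in physical.items():
        derived[k0mer[:k]] += n
    return derived

seq = "ACGTACGA"
physical = kmer_counts(seq, 4)          # physical store, k0 = 4
approx = virtual_counts(physical, 2)    # virtual store,  k  = 2
exact = kmer_counts(seq, 2)
```

Comparing `approx` against `exact` shows interior occurrences recovered exactly while tail-end ones (here "CG" at position 5 and "GA" at 6) are missed, the kind of error an accuracy model would bound.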
The widespread adoption of the Internet of Things (IoT) has motivated the emergence of mixed workloads in smart cities, where fast-arriving geo-referenced big data streams are joined with archive tables, aiming at enriching streams with descriptive attributes that enable insightful analytics. Applications now rely on finding, in real time, which geographical regions streaming data tuples belong to. This problem requires a computationally intensive stream-static join for joining a dynamic stream with a disk-resident static table. In addition, the time-varying fluctuation of geospatial data arriving online calls for an approximate solution that can trade off QoS constraints while ensuring that the system survives sudden spikes in data loads. In this paper, we present SpatialSSJP, an adaptive spatial-aware approximate query processing system that specifically focuses on stream-static joins in a way that guarantees achieving an agreed set of Quality-of-Service goals and maintains geo-statistics of stateful online aggregations over stream-static join results. SpatialSSJP employs a state-of-the-art stratified-like sampling design to select well-balanced, representative geospatial data stream samples and serve them to a stream-static geospatial join operator downstream. We implemented a prototype atop Spark Structured Streaming. Our extensive evaluations on big real datasets show that our system can survive and mitigate harsh join workloads and outperform state-of-the-art baselines by significant margins, without violating rigorous error bounds on the accuracy of the output results. SpatialSSJP achieves a relative accuracy gain over plain Spark joins of approximately 10% in the worst cases, reaching up to 50% in the best-case scenarios.
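The stratified-like sampling idea can be sketched as follows, with an invented schema, regions, and sampling rate; the real system's QoS-driven rate control and Spark operators are far more involved. Under load, sampling per region (stratum) keeps every region represented before the enrichment join.

```python
# Per-stratum sampling of a geo-referenced stream, then a toy
# "stream-static join" that enriches each sampled tuple with
# descriptive attributes from a static table keyed by region.
import random
from collections import defaultdict

def stratified_sample(tuples, rate, seed=0):
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in tuples:
        strata[t["region"]].append(t)
    sample = []
    for region, rows in strata.items():
        n = max(1, round(rate * len(rows)))   # never drop a region entirely
        sample.extend(rng.sample(rows, n))
    return sample

static_table = {"R1": {"name": "Downtown"}, "R2": {"name": "Harbour"}}
stream = [{"region": "R1", "v": i} for i in range(8)] + \
         [{"region": "R2", "v": 99}]
joined = [{**t, **static_table[t["region"]]}
          for t in stratified_sample(stream, rate=0.5)]
```

A uniform sample would likely drop the sparse region R2 altogether; the per-stratum minimum keeps the join output's geo-statistics balanced.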
Despite being commonly used in big-data analytics, the outcome of dimensionality reduction remains a black box to most of its users. Understanding the quality of a low-dimensional embedding is important, as not only does it enable trust in the transformed data, but it can also help to select the most appropriate dimensionality reduction algorithm in a given scenario. As existing research primarily focuses on the visual exploration of embeddings, there is still a need to enhance the interpretability of such algorithms. To bridge this gap, we propose two novel interactive explanation techniques for low-dimensional embeddings obtained from any dimensionality reduction algorithm. The first technique, LAPS, produces a local approximation of the neighborhood structure to generate interpretable explanations of the preserved locality for a single instance. The second method, GAPS, explains the retained global structure of a high-dimensional dataset in its embedding by combining non-redundant local approximations from a coarse discretization of the projection space. We demonstrate the applicability of the proposed techniques using 16 real-life tabular, text, image, and audio datasets. Our extensive experimental evaluation shows the utility of the proposed techniques in interpreting the quality of low-dimensional embeddings, as well as in selecting the most suitable dimensionality reduction algorithm for any given dataset.
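A much-simplified stand-in for the locality question LAPS addresses: for a single instance, compare its k nearest neighbours in the original space with those in the embedding (Jaccard overlap). The data are toy and the "embedding" simply drops a coordinate; LAPS itself builds an interpretable local approximation rather than this bare score.

```python
# Per-instance neighbourhood preservation: Jaccard overlap of the
# k-NN sets computed in the high-dimensional and embedded spaces.

def knn(points, idx, k):
    """Indices of the k nearest neighbours of points[idx] (squared L2)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(points[idx], p)), j)
                   for j, p in enumerate(points) if j != idx)
    return {j for _, j in dists[:k]}

def locality_score(high, low, idx, k):
    a, b = knn(high, idx, k), knn(low, idx, k)
    return len(a & b) / len(a | b)

high = [(0, 0, 0), (1, 0, 0), (0, 0, 9), (0, 2, 0)]
low = [p[:2] for p in high]          # toy "embedding": drop the z axis
score = locality_score(high, low, idx=0, k=2)
```

Dropping the z axis collapses point 2 onto point 0, so the embedded neighbourhood disagrees with the original one and the score falls below 1, flagging poorly preserved locality for that instance.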
In this paper, a production planning problem, named k-most demanding products (k-MDP) discovering, is formulated. Given a set of customers demanding a certain type of product with multiple attributes, a set of existing products of the type, a set of candidate products that can be offered by a company, and a positive integer k, we want to help the company select k products from the candidate products such that the expected number of total customers for the k products is maximized. We show that the problem is NP-hard when the number of attributes for a product is 3 or more. One greedy algorithm is proposed to find an approximate solution to the problem. We also attempt to find the optimal solution of the problem by estimating the upper bound of the expected number of total customers for a set of k candidate products, thereby reducing the search space of the optimal solution. An exact algorithm is then provided to find the optimal solution of the problem using this pruning strategy. The experimental results demonstrate that both the efficiency and memory requirement of the exact algorithm are comparable to those of the greedy algorithm, and that the greedy algorithm scales well with respect to k.
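A minimal sketch of the greedy approach under an illustrative customer-choice model: each customer buys one product chosen uniformly at random among the available products acceptable to them, and we greedily add the candidate with the largest marginal gain in expected customers. Both this choice model and the data are assumptions for illustration, not the paper's formulation.

```python
# Greedy selection of k candidate products maximizing the expected
# number of customers who buy one of *our* offered products.

def expected_customers(offered, existing, customers):
    total = 0.0
    for acceptable in customers:          # each customer: set of acceptable products
        avail = acceptable & (offered | existing)
        ours = acceptable & offered
        if avail:
            total += len(ours) / len(avail)   # uniform choice among available
    return total

def greedy_kmdp(candidates, existing, customers, k):
    offered = set()
    for _ in range(k):
        best = max(candidates - offered,
                   key=lambda c: expected_customers(offered | {c},
                                                    existing, customers))
        offered.add(best)
    return offered

existing = {"e1"}
candidates = {"c1", "c2", "c3"}
customers = [{"c1", "e1"}, {"c1"}, {"c2", "e1"}, {"c3", "e1", "c2"}]
chosen = greedy_kmdp(candidates, existing, customers, k=2)
```

Each greedy step re-evaluates the objective, mirroring how the abstract's greedy algorithm trades optimality for scalability in k; the exact algorithm would instead prune the search over all k-subsets with an upper bound.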
This paper presents flott, a fast, low-memory T-transform algorithm which can be used to compute the string complexity measure T-complexity. The algorithm uses approximately one third of the memory of its predecessor while reducing the running time by about 20 percent. The flott implementation has the same worst-case memory requirements as state-of-the-art suffix tree construction algorithms. A suffix tree can be used to efficiently compute the Lempel-Ziv production complexity, which is another measure of string complexity. The C implementation of flott is available as open-source software.
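The Lempel-Ziv production complexity mentioned above can also be computed naively, without a suffix tree, by counting LZ76-style factors: each factor is the longest prefix of the remainder that already occurs earlier (self-overlap allowed), extended by one fresh symbol. This quadratic-time sketch in the Kaspar-Schuster style is only for illustration; the suffix-tree route runs in linear time.

```python
# Naive O(n^2) LZ76 factor count as a string-complexity measure.

def lz76_complexity(s):
    i, factors = 0, 0
    while i < len(s):
        length = 1
        # Grow the factor while its current form already occurs
        # in the text seen so far (overlapping copies permitted).
        while i + length <= len(s) and s[i:i + length] in s[:i + length - 1]:
            length += 1
        factors += 1
        i += length
    return factors
```

Highly repetitive strings yield few factors (low complexity), matching the intuition behind both T-complexity and Lempel-Ziv production complexity as measures of string structure.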