Stream processing prevails, and SQL querying over streams has become one of its most popular application scenarios; in 2021, for example, the number of active IoT endpoints worldwide reached 12.3 billion. Unfortunately, the increasing scale of data and stringent user requirements place heavy pressure on existing stream processing systems, which must deliver high processing throughput at low latency. To further improve the performance of current stream processing systems, we propose a compression-based stream processing engine, called CompressStreamDB, which enables adaptive fine-grained stream processing directly on compressed streams, without decompression. Specifically, CompressStreamDB incorporates eight compression methods targeting the various data types found in streams, together with a cost model for dynamically selecting the appropriate method. By exploiting data redundancy in streams, CompressStreamDB not only saves space in data transmission between client and server, but also achieves high throughput and low latency for SQL queries over streams. Our experimental results show that, compared to a state-of-the-art stream processing system running on uncompressed streams, CompressStreamDB achieves a 3.24× throughput improvement and 66.0% lower latency on average, while saving 66.8% of space.
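To make the adaptive selection concrete, here is a minimal sketch of a cost-model-driven compressor choice: estimate the compressed size under each candidate method and pick the cheapest. The three encodings and their size estimates are illustrative assumptions, not CompressStreamDB's actual eight methods or its cost model.

```python
# Hypothetical sketch of cost-model-driven compressor selection. The method
# names and cost formulas below are illustrative assumptions.

def rle_cost(values):
    """Estimated run-length-encoding size: one (value, run_length) pair per run."""
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    return runs * 2

def dict_cost(values):
    """Estimated dictionary-encoding size: dictionary entries plus one code per value."""
    return len(set(values)) + len(values) * 0.5  # codes assumed half-width

def delta_cost(values):
    """Estimated delta-encoding size: first value plus narrow deltas."""
    deltas = [abs(b - a) for a, b in zip(values, values[1:])]
    width = 1 if deltas and max(deltas) < 128 else 2
    return 1 + len(deltas) * width

def pick_compressor(values):
    """Cost model: choose the method with the smallest estimated output size."""
    candidates = {"rle": rle_cost, "dict": dict_cost, "delta": delta_cost}
    return min(candidates, key=lambda name: candidates[name](values))

print(pick_compressor([7, 7, 7, 7, 8, 8, 8]))  # long runs -> "rle"
print(pick_compressor(list(range(100))))       # monotonic  -> "delta"
```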
Lexical simplification, the process of replacing complex words in a given sentence with simpler alternatives of equivalent meaning, has attracted much attention across many languages. Although the richness of voca...
To support dramatically increased traffic loads, communication networks have grown denser and more complex. Traditional cell association (CA) schemes are time-consuming, forcing researchers to seek fast alternatives. This paper proposes a deep Q-learning based scheme, whose main idea is to train a deep neural network (DNN) to calculate the Q values of all the state-action pairs; the cell holding the maximum Q value is selected. In the training stage, the intelligent agent continuously generates samples through the trial-and-error method to train the DNN until convergence. In the application stage, the state vectors of all the users are fed into the trained DNN to quickly obtain a satisfactory CA result for a scenario with the same BS locations and user distribution. Simulations demonstrate that the proposed scheme provides satisfactory CA results in a computational time several orders of magnitude shorter than traditional schemes. Moreover, performance metrics such as capacity and fairness can be guaranteed.
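The application stage described above reduces to a single forward pass followed by an argmax over cells. Below is a minimal sketch of that step; the network shape, the state features, and the stand-in weights are assumptions for illustration, and the trial-and-error training stage is omitted.

```python
# Sketch of greedy cell association with a (stand-in) trained Q-network:
# state vector -> Q value per candidate cell -> pick the max-Q cell.

import numpy as np

rng = np.random.default_rng(0)
NUM_CELLS, STATE_DIM, HIDDEN = 5, 8, 16

# Stand-in weights; in practice these come from the converged training stage.
W1 = rng.normal(size=(STATE_DIM, HIDDEN))
W2 = rng.normal(size=(HIDDEN, NUM_CELLS))

def q_values(state):
    """Forward pass of the Q-network: state -> Q value for every candidate cell."""
    hidden = np.maximum(state @ W1, 0.0)  # ReLU layer
    return hidden @ W2

def associate(state):
    """Greedy cell association: the cell holding the maximum Q value is selected."""
    return int(np.argmax(q_values(state)))

user_state = rng.normal(size=STATE_DIM)  # e.g., SINRs/loads of nearby cells
print("associate user with cell", associate(user_state))
```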
Machine reading comprehension (MRC) has been a research focus in natural language processing and intelligence analysis. However, there is a lack of models and datasets for MRC tasks in the anti-terrorism domain. Moreover, current research lacks the ability to embed accurate background knowledge and provide precise answers. To address these two problems, this paper first builds a text corpus and testbed focused on the anti-terrorism domain in a semi-automatic manner. It then proposes a knowledge-based machine reading comprehension model that fuses domain-related triples from a large-scale encyclopedic knowledge base to enhance the semantics of the passage. To eliminate knowledge noise that could lead to semantic deviation, the model uses a mixed mutual attention mechanism among questions, passages, and knowledge triples to select the most relevant triples before embedding their semantics into the passage. Experimental results indicate that the proposed approach achieves a 70.70% EM value and an 87.91% F1 score, improvements of 4.23% and 3.35% over existing methods, respectively.
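The triple-selection step can be pictured as scoring each triple against both the question and the passage and keeping only the jointly relevant ones. The sketch below loosely follows that mixed mutual-attention idea; the random embeddings, the multiplicative combination, and the top-k cutoff are all stand-ins for the learned mechanism.

```python
# Illustrative triple filtering: a triple is kept only if it attends strongly
# to BOTH the question and the passage, suppressing knowledge noise.

import numpy as np

rng = np.random.default_rng(1)
DIM, NUM_TRIPLES, TOP_K = 64, 10, 3

question = rng.normal(size=DIM)
passage = rng.normal(size=DIM)
triples = rng.normal(size=(NUM_TRIPLES, DIM))  # encoded (head, relation, tail)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Mutual attention: relevance to the question and to the passage.
q_att = softmax(triples @ question)
p_att = softmax(triples @ passage)
score = q_att * p_att  # joint relevance; noisy triples score low on one side

keep = np.argsort(score)[::-1][:TOP_K]
print("triples fused into the passage representation:", keep.tolist())
```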
ISBN (print): 9781665454452
GPU's powerful computational capacity holds great potential for processing hierarchically-compressed data without decompression in the data science domain. Unfortunately, existing GPU approaches offer only traversal-based data analytics; random access is extremely inefficient, substantially limiting their utility. To solve this problem, we develop a novel and broadly applicable optimization that enables efficient random access to hierarchically-compressed data, without decompression, in GPU memory. We address three major challenges for enabling efficient random access to compressed data on GPUs. The first challenge is designing GPU data structures that support random access. The second challenge is efficiently generating these data structures on the GPU: generating them for random access is costly on the CPU, and the inefficiency increases dramatically once PCIe data transmission is incorporated. The third challenge is query processing on compressed data in GPU memory: random accesses, including data updates, result in significant conflicts between massive threads. To solve the first challenge, we propose and adapt a number of compressed data structures, including indexing within the complicated GPU memory hierarchy. To address the second challenge, we develop a two-phase process for generating these data structures on the GPU. To handle the third challenge, we propose a double-parsing design that avoids data conflicts. We evaluate our solution on two GPU platforms using five real-world datasets. Experiments show that random access operations on the GPU achieve a 65.04× average speedup compared to the state-of-the-art method.
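The core idea of random access without decompression can be shown on a CPU with a toy grammar-compressed sequence: store the expanded length of every rule, then descend the hierarchy to the requested offset, touching only O(depth) nodes instead of materializing the whole sequence. The grammar below is a made-up example; the paper's GPU data structures and two-phase construction are not reproduced here.

```python
# Random access into a grammar-compressed string via precomputed rule lengths.

from functools import lru_cache

# Rule: name -> list of symbols (terminal chars or other rule names).
rules = {
    "S": ["A", "A", "B"],
    "A": ["a", "b"],
    "B": ["A", "c"],
}

@lru_cache(maxsize=None)
def expanded_len(sym):
    """Expanded length of a symbol: 1 for terminals, memoized for rules."""
    if sym not in rules:
        return 1
    return sum(expanded_len(s) for s in rules[sym])

def access(sym, i):
    """Return character i of sym's expansion without materializing it."""
    if sym not in rules:
        return sym
    for child in rules[sym]:
        n = expanded_len(child)
        if i < n:
            return access(child, i)
        i -= n
    raise IndexError(i)

print("".join(access("S", i) for i in range(expanded_len("S"))))  # "abababc"
print(access("S", 4))  # 'a', found by descending the hierarchy, not decompressing
```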
ISBN (digital): 9798350377613
ISBN (print): 9798350377620
Knowledge graphs (KGs) play an increasingly important role in many knowledge-aware tasks. However, existing KGs struggle with incompleteness, which motivates knowledge graph completion (KGC), that is, predicting missing links between entities based on observed triples. Reasoning over relation paths in incomplete KGs is a popular approach, yet several significant issues remain to be addressed, such as path noise and the ambiguity of inferred relations. To address these problems, we propose a novel path-augmented Reasoning model with avoidance of Path noise and Disambiguation of inferred relations, referred to as RPD. In this model, we calculate the sum of resource allocation along each relation path to measure its reliability, thereby filtering out path noise. To address the ambiguity of an inferred relation, we introduce position embeddings that encode each relation's position along the path when learning path representations. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed RPD model on KGC tasks compared to state-of-the-art methods.
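The resource-allocation reliability measure can be sketched directly: a unit resource starts at the head entity, is split over each relation's out-edges step by step, and the amount arriving at the tail scores the path. The toy graph and the even-splitting rule below are assumptions for illustration, not RPD's exact formulation.

```python
# Path reliability via resource allocation over a toy knowledge graph.

from collections import defaultdict

# Knowledge graph as relation -> {head: [tails]}.
kg = {
    "born_in": {"alice": ["paris"], "bob": ["lyon", "paris"]},
    "city_of": {"paris": ["france"], "lyon": ["france"]},
}

def path_reliability(head, tail, path):
    """Amount of resource flowing from head to tail along the relation path."""
    resource = defaultdict(float)
    resource[head] = 1.0
    for rel in path:
        nxt = defaultdict(float)
        for node, amount in resource.items():
            tails = kg[rel].get(node, [])
            for t in tails:
                nxt[t] += amount / len(tails)  # split evenly over out-edges
        resource = nxt
    return resource[tail]

# A reliable path supporting the candidate triple (alice, nationality, france):
print(path_reliability("alice", "france", ["born_in", "city_of"]))  # 1.0
```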
In multi-label learning, each instance is associated with a set of labels simultaneously. Most existing studies assume that the label set of each instance is complete. In practice, however, it is generally difficult to obtain all the relevant labels of an instance; only a partial, or even empty, set of relevant labels is available, a setting called semi-supervised multi-label learning with missing labels. To tackle this problem, we propose a novel framework that exploits label correlations and instance correlations to recover the missing labels, while simultaneously utilizing a large amount of unlabeled data to improve classification performance. Specifically, a supplementary label matrix is first obtained by learning the label correlations. Second, since each class label may be decided by some specific characteristics of its own, a label-specific data representation is learned for each class label. Third, instance correlations are utilized not only to recover the missing labels but also to propagate supervision information from labeled instances to unlabeled ones. A unified objective function is designed to integrate the above components, and an accelerated proximal gradient method is adopted to solve the optimization problem. Finally, extensive experimental results on several benchmark datasets demonstrate the effectiveness of the proposed method compared with competing ones.
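The first step, building a supplementary label matrix from label correlations, can be sketched in a few lines: estimate how strongly labels co-occur, then propagate each instance's observed labels through that correlation matrix. The cosine-similarity correlation and the threshold below are illustrative stand-ins for the learned correlation in the actual framework; the label-specific representations, instance correlations, and the accelerated proximal gradient solver are omitted.

```python
# Recovering candidate missing labels via a label-correlation matrix.

import numpy as np

# Observed label matrix: rows = instances, cols = labels; 0 may mean "missing".
Y = np.array([
    [1, 1, 0],   # labels 0 and 1 co-occur; label 2 is unobserved here
    [1, 1, 1],
    [0, 1, 1],
], dtype=float)

# Label correlation from co-occurrence (cosine similarity between label columns).
norms = np.linalg.norm(Y, axis=0, keepdims=True)
C = (Y.T @ Y) / (norms.T @ norms)

# Supplementary labels: propagate each instance's known labels through C.
Y_supp = Y @ C
recovered = (Y_supp > 1.0).astype(int)  # threshold is an arbitrary choice
print(recovered)  # instance 0 now also receives label 2
```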
Data science is a rapidly growing academic field with significant implications for all conventional scientific studies. However, most relevant studies have been limited to one or several facets of data science from a specific application-domain perspective and rarely discuss its theoretical framework. Data science is unique in that its research goals, perspectives, and body of knowledge are distinct from those of other sciences. The core theories of data science are the DIKW pyramid, data-intensive scientific discovery, the data science life cycle, data wrangling (munging), big data analytics, data management and governance, data product DevOps, and big data visualization. Six main trends characterize recent theoretical studies on data science: (1) the growing significance of DataOps, (2) the rise of citizen data scientists, (3) enabling augmented data science, (4) integrating data warehouses with data lakes, (5) the diversity of domain-specific data science, and (6) implementing data stories as data products. Further development of data science should prioritize four ways of turning challenges into opportunities: (1) accelerating theoretical studies of data science, (2) navigating the trade-off between explainability and performance, (3) achieving data ethics, privacy, and trust, and (4) aligning academic curricula with industrial needs.
Large-scale pre-trained models such as GPT and BERT have demonstrated remarkable performance in information extraction tasks. However, their black-box nature poses challenges for reliability and interpretability. In contrast, rule-based extraction methods offer better interpretability but typically require domain experts to establish rules manually, limiting their generalization ability. In industry, there is often a demand for reliable knowledge extraction that reduces the time spent manually verifying each piece of extracted knowledge. In this paper, we explore the idea of combining GPT with symbolic methods to automatically discover reliable extraction patterns in text with a particular writing style. The method leverages the high information density and similar writing patterns of such text to generate verifiable and reliable patterns. Experiments on two datasets with specific writing styles demonstrate the method's effectiveness, validating the idea of combining large models with symbolic methods for reliable extraction-pattern discovery on the tested datasets.
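One way to picture the propose-and-verify combination: a large model suggests candidate patterns for stylistically uniform text, and a symbolic verifier keeps only those whose matches agree with a small labeled sample. In the sketch below, `llm_propose_patterns` is a hypothetical stand-in for the GPT call (it returns hard-coded candidates so the example stays self-contained), and the corpus and precision cutoff are made-up illustrations.

```python
# Verifying LLM-proposed extraction patterns against a labeled sample.

import re

def llm_propose_patterns():
    # Hypothetical stand-in: a real pipeline would prompt an LLM with examples.
    return [r"born in (\d{4})", r"in (\d{4})"]

corpus = [
    ("Ada Lovelace was born in 1815.", "1815"),
    ("The device, built in 2003, failed.", None),  # "in 2003" is not a birth year
    ("Alan Turing was born in 1912.", "1912"),
]

def precision(pattern):
    """Fraction of a pattern's matches that agree with the labeled sample."""
    hits, correct = 0, 0
    for text, gold in corpus:
        m = re.search(pattern, text)
        if m:
            hits += 1
            correct += m.group(1) == gold
    return correct / hits if hits else 0.0

# Keep only patterns reliable enough to skip per-item manual verification.
reliable = [p for p in llm_propose_patterns() if precision(p) >= 0.9]
print("patterns kept for automatic extraction:", reliable)
```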
Purpose: Using the metaphor of "unicorn," we identify scientific papers and technical patents characterized by the informetric feature of very high citations in the first ten years after publication, which may provide a new pattern for understanding very-high-impact works in science and technology.
Design/methodology/approach: Setting CT as the total citations of a paper or patent in the first ten years after publication, with CT ≥ 5,000 for a scientific "unicorn" and CT ≥ 500 for a technical "unicorn," we obtain an absolute standard for identifying scientific and technical "unicorn" works.
Findings: We identify 165 scientific "unicorns" among 14,301,875 WoS papers and 224 technical "unicorns" among 13,728,950 DII patents published during 2001–…. About 50% of the "unicorns" belong to biomedicine, in which selected cases are analyzed individually. The rare "unicorns" increase following a linear model; the fitted data show 95% confidence, with an RMSE of 0.2127 for the scientific "unicorn" fit and … for the technical one.
Research limitations: A "unicorn" is a purely quantitative notion that does not consider quality, and "potential unicorns" with CT ≤ 5,000 for papers and CT ≤ 500 for patents are left for future studies.
Practical implications: Scientific and technical "unicorns" provide a new pattern for understanding high-impact works in science and technology. The "unicorn" pattern supplies a concise approach to identifying very-high-impact scientific papers and technical patents.
Originality/value: The "unicorn" pattern supplies a concise approach to identifying very-high-impact scientific papers and technical patents.
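The identification rule above reduces to a simple threshold filter over ten-year citation counts. The records below are made-up illustrations, not data from the study.

```python
# "Unicorn" filter: CT is a work's citations in its first ten years,
# with CT >= 5,000 for scientific and CT >= 500 for technical unicorns.

records = [
    {"title": "paper A",  "kind": "paper",  "ct": 6200},
    {"title": "paper B",  "kind": "paper",  "ct": 800},
    {"title": "patent X", "kind": "patent", "ct": 740},
]

THRESHOLD = {"paper": 5000, "patent": 500}

unicorns = [r["title"] for r in records if r["ct"] >= THRESHOLD[r["kind"]]]
print(unicorns)  # ['paper A', 'patent X']
```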