When the global pandemic struck in 2020, most countries established task forces to meet a challenge that impacted governmental resources. It became apparent that data, intelligence gathering, and both modelling and pr...
ISBN (print): 9781728191843
As the number of graph applications increases rapidly in many domains, new graph algorithms (or queries) have become more important than ever before. The current two-step approach to developing and testing a graph algorithm is very expensive for the trillion-scale graphs required in many industrial applications. In this paper, we propose the concept of graph processing simulation, a single-step approach that generates a graph and processes a graph algorithm simultaneously. It consists of a top-down graph upscaling method called V-Upscaler and a graph processing simulation method following the vertex-centric GAS model called T-GPS. Users can develop a graph algorithm and check its correctness and performance conveniently and cost-efficiently, even for trillion-scale graphs. Through extensive experiments, we have demonstrated that our single-step approach of V-Upscaler and T-GPS significantly outperforms the conventional two-step approach, even though ours runs on a single machine while the conventional approach uses a cluster of eleven machines.
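The vertex-centric GAS (Gather-Apply-Scatter) model the abstract refers to can be illustrated with a minimal sketch. T-GPS itself is not reproduced here; the dictionary-based graph representation and the PageRank-style computation are illustrative assumptions, chosen only to show the gather/apply/scatter split.

```python
# Minimal sketch of the vertex-centric GAS (Gather-Apply-Scatter) model,
# run on a toy three-vertex cycle. The graph encoding and the PageRank
# example are assumptions for illustration, not T-GPS internals.

def gas_pagerank(out_edges, iterations=20, d=0.85):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    # Precompute reverse adjacency so each vertex can gather from in-neighbours.
    in_edges = {v: [] for v in out_edges}
    for u, nbrs in out_edges.items():
        for v in nbrs:
            in_edges[v].append(u)
    for _ in range(iterations):
        new_rank = {}
        for v in out_edges:
            # Gather: sum contributions from in-neighbours.
            acc = sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
            # Apply: combine the gathered value with the damping factor.
            new_rank[v] = (1 - d) / n + d * acc
        # Scatter is implicit here: updated ranks become visible to
        # neighbours in the next superstep.
        rank = new_rank
    return rank

ranks = gas_pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

On the symmetric cycle every vertex ends up with rank 1/3, which makes the example easy to check by hand.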
Non-Fungible Tokens (NFTs) are digital assets based on a blockchain, characterized as unique, non-interchangeable cryptographic tokens. To date, research into the NFT marketplace has been relatively li...
Evaluating the readability of text has been a critical step in several applications, ranging from text simplification, learning new languages, and providing school children with appropriate reading material to conveying important medical information in an easily understandable way. A lot of research has been dedicated to evaluating readability on larger bodies of text, like articles and paragraphs, but the application to single sentences has received less attention. In this paper, we explore several machine learning techniques - logistic regression, random forest, Naive Bayes, KNN, MLP, XGBoost - on a corpus of sentences from the English and Simple English Wikipedia. We build and compare a series of binary readability classifiers using extracted features as well as generated all-MiniLM-L6-v2-based embeddings, and evaluate them against standard classification evaluation metrics. To the authors' knowledge, this is the first time this sentence transformer is used in the task of readability assessment. Overall, we found that the MLP models, with and without embeddings, as well as the Random Forest, outperformed the other machine learning algorithms.
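The extracted-features pipeline described above can be sketched end to end: hand-crafted surface features feed a binary classifier. The two features (sentence length, average word length), the tiny corpus, and the from-scratch logistic regression are illustrative assumptions, not the paper's actual feature set or models.

```python
# Toy sketch of a feature-based binary readability classifier:
# surface features + logistic regression trained by SGD. The features,
# corpus, and hyperparameters are assumptions for illustration only.
import math

def features(sentence):
    words = sentence.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    # Bias term plus two scaled surface features.
    return [1.0, len(words) / 10.0, avg_word_len / 10.0]

def train(samples, labels, lr=0.5, epochs=2000):
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

corpus = [
    ("The cat sat on the mat.", 0),          # 0 = simple / readable
    ("He went home after school.", 0),
    ("Epistemological considerations notwithstanding, the committee deliberated interminably.", 1),
    ("The juxtaposition of heterogeneous methodologies complicates reproducibility assessments.", 1),
]
w = train([features(s) for s, _ in corpus], [label for _, label in corpus])

def predict(sentence):
    z = sum(wi * xi for wi, xi in zip(w, features(sentence)))
    return 1 if z > 0 else 0
```

In the paper's actual setup the feature vector would instead be a 384-dimensional all-MiniLM-L6-v2 sentence embedding or a richer set of extracted features.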
ISBN (digital): 9781665490627
ISBN (print): 9781665490627
In the fields of machine learning and data mining, unsupervised feature selection plays an important role in processing large amounts of high-dimensional unlabeled data. This paper proposes an original and novel unsupervised feature selection method based on feature grouping and orthogonal constraints. We consider the domain relationships in the original data and reconstruct the similarity matrix based on the correlations between features. We use a generalized incoherent regression model based on orthogonal constraints. Furthermore, a graph regularization term with local-structure-preservation constraints is added to ensure that the feature subset does not lose local structural features of the original data space. In addition, an iterative algorithm is proposed to solve the optimization problem by iteratively updating the global similarity matrix and constructing the weight matrix, pseudo-label matrix, and transformation matrix. Through experiments on 6 benchmark datasets, the clustering performance of the proposed method outperforms state-of-the-art unsupervised feature selection methods. The source code is available at: https://***/misteru/FGOC.
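The first step described above, building a feature-feature similarity matrix from correlations between features, can be sketched in plain Python. Using absolute Pearson correlation as the similarity measure and the small data matrix below are illustrative assumptions; the paper's full model (incoherent regression, orthogonal constraints, graph regularization) is not reproduced here.

```python
# Sketch of constructing a feature similarity matrix from pairwise
# correlations, the grouping signal described in the abstract. The
# Pearson-based similarity and the toy data are assumptions.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

# Rows are samples, columns are features.
X = [
    [1.0, 2.0, 10.0],
    [2.0, 4.1, 8.0],
    [3.0, 6.0, 6.5],
    [4.0, 7.9, 4.0],
]
cols = list(zip(*X))  # transpose: one vector per feature
d = len(cols)
# Absolute correlation, so strongly anti-correlated features also
# count as redundant and can be grouped together.
S = [[abs(pearson(cols[i], cols[j])) for j in range(d)] for i in range(d)]
```

High off-diagonal entries of S mark redundant feature pairs that grouping-based selection would collapse into one representative.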
ISBN (print): 9783031232350; 9783031232367
Classifying and recognizing voice pathologies non-invasively using acoustic analysis saves patient and specialist time and can improve the accuracy of assessments. In this work, we intend to determine which models provide better accuracy rates in distinguishing between healthy and pathological voices, to later be implemented in a system for the detection of vocal pathologies. 194 control subjects and 350 pathological subjects distributed across 17 pathologies were used. Each subject has 3 vowels in 3 tones, which is equivalent to 9 sound files per subject. For each sound file, 13 parameters were extracted (jitta, jitter, Rap, PPQ5, ShdB, Shim, APQ3, APQ5, F0, HNR, autocorrelation, Shannon entropy, and logarithmic entropy). For the classification between healthy and pathological, several classifiers were used (Decision Trees, Discriminant Analysis, Logistic Regression Classifiers, Naive Bayes Classifiers, Support Vector Machines, Nearest Neighbor Classifiers, Ensemble Classifiers, Neural Network Classifiers) with various models. For each patient, 118 parameters were used (13 acoustic parameters * 9 sound files per subject, plus the subject's gender). As pre-processing of the input matrix, outliers were treated using the quartile method, then the data were normalized, and finally Principal Component Analysis (PCA) was applied to reduce the dimension. The best model was the Wide Neural Network, with an accuracy of 98% and an AUC of 0.99.
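The pre-processing chain described above can be sketched in plain Python. The quartile method is interpreted here as clipping to [Q1 - 1.5*IQR, Q3 + 1.5*IQR], a common reading of IQR-based outlier treatment, followed by min-max normalization; both that interpretation and the sample jitter values are assumptions, and the PCA step is omitted to keep the sketch self-contained.

```python
# Sketch of the described pre-processing: IQR-based outlier clipping
# followed by min-max normalisation. The exact quartile rule used in
# the paper is an assumption here; sample values are invented.
import statistics

def clip_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Clip rather than drop, so every subject keeps all 118 parameters.
    return [min(max(v, lo), hi) for v in values]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

jitter = [0.4, 0.5, 0.45, 0.55, 0.5, 9.0]  # one extreme measurement
clipped = clip_outliers(jitter)
normalised = min_max(clipped)
```

In the full pipeline this would run per column of the 544-subject input matrix before PCA.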
Python software is combined with contrastive analysis, structural analysis, and trend analysis to analyze the current bamboo furniture market, and the Baidu Index is used to analyze the future development trend of ba...
ISBN (digital): 9798331527662
ISBN (print): 9798331527679
Against the background of accelerating globalization, accurate and efficient language translation is very important for cross-cultural communication. In this paper, ATTEBSC (Algorithm for Translation Template Extraction Based on Sentence Comparison) is adopted, aiming to automatically extract and compare translation templates from large-scale text data by combining natural language processing technology and machine learning methods. The specific methods include using a deep learning framework to analyze sentence structure, using syntactic and semantic analysis tools to identify key translation units, and then extracting high-frequency, efficient translation templates through a comparative analysis algorithm. In addition, ATTEBSC introduces a dynamic updating mechanism, which continuously optimizes the translation template library as new data arrive. BLEU (Bilingual Evaluation Understudy) scores in all fields are higher than 0.8, and TER is lower than 30%, which indicates that the translation quality of the machine translation system is high in all fields. The research results show that ATTEBSC has clear advantages in improving translation quality and efficiency, especially when dealing with professional or technical text translation, where it can significantly improve translation accuracy and fluency.
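The BLEU-style evaluation mentioned above can be illustrated with a deliberately simplified version: modified unigram precision with a brevity penalty. Real BLEU combines n-gram precisions up to 4-grams at the corpus level; this single-sentence, unigram-only sketch is an assumption made for brevity.

```python
# Toy sketch of a BLEU-style score: modified unigram precision with a
# brevity penalty. Real BLEU uses n-grams up to 4 and corpus-level
# statistics; this simplification is for illustration only.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Modified precision: each candidate word is credited at most as
    # many times as it appears in the reference.
    overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("the cat is on the mat", "the cat is on the mat")
```

The clipping in the modified precision is what stops a degenerate candidate like "the the the" from scoring highly against a reference that contains "the" only once.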
Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting the performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, placement of parallel instances on machines, and resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute-based integrated system to support multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level RO decisions well under a second. Evaluation using production workloads shows that our new RO system reduces latency by 37-72% and cost by 43-78% at the same time, compared to the current optimizer and scheduler, while running in 0.02-0.23s.
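The multi-objective optimization (MOO) idea underlying the system can be sketched as filtering candidate resource configurations down to their Pareto front over (latency, cost). The candidate list and the two-objective formulation are illustrative assumptions; the paper's predictive models and scheduling constraints are not reproduced.

```python
# Sketch of the MOO core: keep only Pareto-optimal (latency, cost)
# configurations, i.e. those not dominated by any other candidate.
# The candidate values are invented for illustration.

def pareto_front(points):
    # a dominates b if a is no worse in both objectives and differs,
    # i.e. strictly better in at least one.
    def dominates(a, b):
        return a[0] <= b[0] and a[1] <= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (latency_seconds, dollar_cost) per candidate configuration
candidates = [(10.0, 1.0), (6.0, 2.0), (6.0, 5.0), (3.0, 4.0), (9.0, 3.0)]
front = pareto_front(candidates)
```

A scheduler can then pick one point from the front according to the user's latency/cost preference, instead of solving the full trade-off per query.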
ISBN (print): 9783031416781; 9783031416798
Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML or LaTeX) which represent the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how the table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimized vocabulary and specific rules. The benefits of OTSL are that it reduces the number of tokens to 5 (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs. Popular table-structure datasets will be published in OTSL format to the community.