In the current era of extensive data usage across industries, data collection, preservation, utilization, and organization have become more challenging and nuanced because, beyond efficiency, they must address critical concerns such as data security, privacy, and legal issues. As a result, the Thai government initiated an effort to implement data governance throughout government agencies. This paper showcases the implementation of data governance in a governmental research organization with highly diverse structured and unstructured data. The implementation follows international standards and the guidelines of the Digital Government Development Agency (DGA). The executives set up a working body, including the Data Governance Council and Data Stewards, responsible for setting up and deploying policies and regulations. Creating awareness and building the necessary infrastructure were the main focuses of the first-year phase. The metadata was designed to extend the DGA's version and match the organization's unique requirements, and a data catalog platform was developed accordingly. We organized activities to boost employee awareness and participation, including advertising and data catalog platform training. By the end of the first year of implementation, every organizational unit had registered at least one data record in the data catalog.
Difficulties in replicating and reproducing empirical evidence in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable...
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
Graph Neural Networks (GNNs) have achieved great success in various data mining tasks, but they rely heavily on a large number of annotated nodes, which requires considerable human effort. Despite the effectiveness of existing GNN-based active learning (AL) methods, they assume that the annotated labels are always correct, which contradicts the error-prone labeling process of a practical crowdsourcing environment. Moreover, because of this impractical assumption, existing works focus only on optimizing node selection in AL and neglect the labeling process. We therefore present NC-ALG, the first GNN-based AL framework that optimizes both node selection and node labeling under a noisy crowd. For node selection, NC-ALG introduces a new measurement to model influence reliability and an effective influence maximization objective to select nodes. For node labeling, NC-ALG significantly reduces the labeling cost by considering the model-predicted labels and the labels of mirror nodes. To the best of our knowledge, this is the first attempt to consider GNN-based AL under a practical noisy crowd. Empirical studies on public datasets demonstrate that NC-ALG significantly outperforms existing methods in terms of labeling efficiency. Notably, NC-ALG needs only one-third of the labeling budget that the competitive baseline GRAIN requires to achieve an accuracy of 70.7% on PubMed.
It is challenging to implement kernel methods if the data sources are distributed and cannot be joined at a trusted third party for privacy reasons. It is even more challenging if the use case rules out privacy-preserving approaches that introduce noise or entail significant computational overhead. An example of such a use case is machine learning on clinical data. To realize exact and efficient privacy-preserving computation of kernel methods, we propose FLAKE, a Framework for Learning with Anonymized KErnels on horizontally distributed data. With our method, the data sources mask their data so that a Gram matrix can be computed without compromising privacy or utility. Many kernel matrices can be calculated from the Gram matrix and then used to train kernel-based machine learning algorithms such as support vector machines. We prove that our framework prevents an adversary from learning the input data or the number of input features under a semi-honest threat model. Experiments on clinical, genomic, and image data confirm that our approach is applicable across a wide range of settings. Additionally, our method outperforms comparable approaches in both computational efficiency and accuracy. Thus, FLAKE is a lightweight, applicable approach suitable for various use cases.
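The step from a Gram matrix to several kernel matrices can be illustrated with a minimal sketch. It assumes the Gram matrix is already available in the clear and does not reproduce FLAKE's masking or any privacy mechanism. The function names (`gram_matrix`, `polynomial_kernel`, `rbf_kernel`) and the parameters `degree`, `coef0`, and `gamma` are illustrative assumptions, not taken from the paper.

```python
import math


def gram_matrix(rows):
    """Plain (unmasked) Gram matrix: G[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(x, y)) for y in rows] for x in rows]


def polynomial_kernel(gram, degree=2, coef0=1.0):
    """Polynomial kernel (<x_i, x_j> + coef0) ** degree, computed from G only."""
    return [[(g + coef0) ** degree for g in row] for row in gram]


def rbf_kernel(gram, gamma=0.5):
    """RBF kernel exp(-gamma * ||x_i - x_j||^2), computed from G only.

    Uses the identity ||x_i - x_j||^2 = G[i][i] + G[j][j] - 2 * G[i][j],
    so the raw feature vectors are never needed at this stage.
    """
    n = len(gram)
    return [
        [math.exp(-gamma * (gram[i][i] + gram[j][j] - 2.0 * gram[i][j]))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: three samples with two features each.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
G = gram_matrix(samples)
K_poly = polynomial_kernel(G)
K_rbf = rbf_kernel(G)
```

Because both kernels depend on the samples only through their inner products, any party that holds the Gram matrix can derive them without seeing the individual feature vectors.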
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
The imbalanced data classification problem has attracted considerable attention from both academia and industry, since data imbalance is a widespread phenomenon in many real-world scenarios. Although this problem has been well researched from the view of imbalanced class samples, we further argue that graph neural networks (GNNs) expose a unique source of imbalance in the nodes influenced by labeled nodes of different classes, i.e., labeled nodes are imbalanced in terms of the number of nodes they influence during influence propagation in GNNs. To tackle this previously unexplored influence-imbalance issue, we connect social influence maximization with the imbalanced node classification problem and propose balanced influence maximization (BIM). Specifically, BIM greedily assigns a pseudo label to the node that maximizes the number of influenced nodes in GNN training while making the influence of each class more balanced. Experimental results on five public datasets demonstrate the effectiveness of our method in relieving the influence-imbalance issue. For example, when training a GCN with an imbalance ratio of 0.1, BIM outperforms the most competitive baseline by 0.6% to 9.8% in F1 score across the five public datasets.
Chatbot platforms, e.g., Facebook and Line, have revolutionized human interaction in the digital age. Developing automatic chatbot classification poses several challenges, especially for Thai chat mess...
Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to...
Decentralized stochastic optimization has become a crucial tool for addressing large-scale machine learning and control problems. In decentralized algorithms, all computing nodes are connected through a network topology, and each node communicates only with its direct neighbors. Decentralized algorithms can significantly reduce communication overhead by eliminating the need for global communication. However, existing research on the linear speedup analysis of decentralized stochastic algorithms is limited to the condition of network-dependent learning rates, which rarely holds in practice since the network connectivity is typically unknown to each node. As a result, it remains an open question whether a linear speedup bound can be achieved using network-independent learning rates. This paper provides an affirmative answer. By utilizing a new analysis framework, we prove that D-SGD and Exact-Diffusion, two representative decentralized stochastic algorithms, can achieve linear speedup with network-independent learning rates. Simulations are provided to validate our theories.
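The neighbor-only communication pattern described above can be sketched with a toy decentralized SGD loop. This is a minimal illustration under stated assumptions, not the paper's algorithm or experimental setup: the ring topology, the uniform 1/3 mixing weights, the scalar quadratic losses f_i(x) = 0.5 (x - b_i)^2, and the names `ring_weights` and `dsgd` are all hypothetical. The update shown (mix with neighbors, then take a local gradient step) is one common way of writing D-SGD. The learning rate `lr` is a fixed constant chosen without reference to the topology.

```python
import random


def ring_weights(n):
    """Doubly stochastic mixing matrix for a ring.

    Each node puts weight 1/3 on itself and on each of its two neighbors.
    """
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] += 1.0 / 3.0
    return w


def dsgd(targets, steps=200, lr=0.1, noise_std=0.0, seed=0):
    """Toy D-SGD on local losses f_i(x) = 0.5 * (x - targets[i]) ** 2."""
    rng = random.Random(seed)
    n = len(targets)
    w = ring_weights(n)
    x = [0.0] * n  # one scalar iterate per node
    for _ in range(steps):
        # Local stochastic gradient: exact gradient plus Gaussian noise.
        grads = [x[i] - targets[i] + rng.gauss(0.0, noise_std) for i in range(n)]
        # Communication step: w[i][j] is nonzero only for ring neighbors.
        mixed = [sum(w[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Local gradient step with a topology-independent learning rate.
        x = [mixed[i] - lr * grads[i] for i in range(n)]
    return x


targets = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical local minimizers
final = dsgd(targets)
average = sum(final) / len(final)
```

With the noise switched off, the network average of the iterates approaches the minimizer of the average loss (here the mean of `targets`), while the individual nodes stay slightly apart because the step size is constant and the local losses differ.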
This article describes the architecture of a system module for processing and interpreting analog medical data. Patients often undergo examinations at various medical institutions, and because the results are usually handed to the patient in printed form, the receiving institution has to enter them into its database manually. There is also a trend toward abandoning analog media entirely in favor of digital ones, but this raises another problem: the accumulated analog records must either be converted to digital format or be lost. Automated document management systems for medical institutions, known as health information systems (HIS), are now under active development. The software module built according to the architecture described in the article can be used by developers of various HIS to automate work with analog data. If necessary, it can also be extended with new modules for other kinds of analog data. In this article, we use ECG scans and medical test results as examples of such data. As a result of this work, a prototype of the designed system was developed and tested.
The retrieval of sun-induced fluorescence (SIF) from hyperspectral imagery is an ill-posed problem that has been tackled in different ways. We present a novel retrieval method combining semi-supervised deep learning with an existing spectral fitting method. A validation study with in situ SIF measurements shows that the deep learning method is highly sensitive to SIF changes, even though systematic shifts degrade its absolute prediction accuracy. A detailed analysis of diurnal SIF dynamics and of SIF prediction in topographically variable terrain highlights the benefits of this deep learning approach.