The advent of big data has led to rapid growth in the use of parallel clustering algorithms that work over distributed computing frameworks such as MPI, MapReduce, and *** important step for any parallel clustering algorithm is the distribution of data amongst the cluster *** step governs the methodology and performance of the entire *** typically use random, or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning, as per the requirements of the ***, these strategies are generic and are not tailor-made for any specific parallel clustering *** this paper, we give a comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they *** also propose three new data distribution strategies, namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS, Cell-Based Dimensional Split for dGridSLINK, which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution, and Projection-Based Split, which is a generic distribution *** of these preserve spatial locality, achieve disjoint partitioning, and ensure good data load *** experimental analysis shows the benefits of using the proposed data distribution strategies for the algorithms they are designed for, based on which we give appropriate recommendations for their usage.
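To make the generic kd-tree based partitioning mentioned above concrete, the following is a minimal sketch in Python. It is not one of the three strategies the paper proposes; the function name `kd_partition`, the median-style cut, and the axis-cycling rule are assumptions chosen for illustration. It splits the points recursively along one coordinate at a time, so that each worker receives a disjoint, spatially contiguous, roughly equal-sized share.

```python
from typing import List, Sequence

Point = Sequence[float]


def kd_partition(points: List[Point], num_parts: int, depth: int = 0) -> List[List[Point]]:
    """Split points into num_parts disjoint groups, kd-tree style.

    Hypothetical helper for illustration: at each level the points are
    sorted along one axis (cycling through the axes by depth) and cut so
    that each side receives a share proportional to its number of parts.
    """
    if num_parts <= 1:
        return [list(points)]
    if not points:
        # Nothing left to split: hand out empty partitions.
        return [[] for _ in range(num_parts)]

    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])

    left_parts = num_parts // 2
    right_parts = num_parts - left_parts
    # Proportional cut keeps the load balanced even when num_parts is odd.
    cut = len(ordered) * left_parts // num_parts

    return (kd_partition(ordered[:cut], left_parts, depth + 1)
            + kd_partition(ordered[cut:], right_parts, depth + 1))


# Example: a 4x4 grid of 2-D points distributed over 4 workers.
points = [(x, y) for x in range(4) for y in range(4)]
parts = kd_partition(points, 4)
```

In an MPI setting, each entry of `parts` would be sent to one rank; the sketch covers only the partitioning step and leaves communication out.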
Data valuation quantifies the contribution of each data point to the performance of a machine learning model. Existing works typically define the value of data by its improvement of the validation performance of the trained model. However, this approach can be impractical to apply in collaborative machine learning and data marketplaces since it is difficult for the parties/buyers to agree on a common validation dataset or determine the exact validation distribution a priori. To address this, we propose a distributionally robust data valuation approach to perform data valuation without known/fixed validation distributions. Our approach defines the value of data by its improvement of the distributionally robust generalization error (DRGE), thus providing a worst-case performance guarantee without a known/fixed validation distribution. However, since computing DRGE directly is infeasible, we propose using model deviation as a proxy for the marginal improvement of DRGE (for kernel regression and neural networks) to compute data values. Furthermore, we identify a notion of uniqueness where low uniqueness characterizes low-value data. We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction). Copyright 2024 by the author(s).
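The conventional, validation-based notion of data value that the abstract contrasts itself with can be sketched as follows. This is not the proposed DRGE approach; it is a minimal leave-one-out illustration, and the helper names (`fit_line`, `mse`, `leave_one_out_values`), the one-dimensional least-squares model, and the toy data are all assumptions made for the example. A point's value is the increase in validation error when that point is removed from the training set.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, target y)


def fit_line(data: List[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    if var_x == 0:
        return 0.0, mean_y  # degenerate case: predict the mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def mse(model: Tuple[float, float], data: List[Sample]) -> float:
    """Mean squared error of the fitted line on a dataset."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)


def leave_one_out_values(train: List[Sample], val: List[Sample]) -> List[float]:
    """Value of point i = validation error without i minus error with all points.

    A positive value means the point helps; a negative value means it hurts.
    """
    base = mse(fit_line(train), val)
    return [
        mse(fit_line(train[:i] + train[i + 1:]), val) - base
        for i in range(len(train))
    ]


# Toy data: the last training point is an outlier relative to y = x.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 10.0)]
val = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
values = leave_one_out_values(train, val)
```

Every value here depends on the choice of `val`, which is the dependence on an agreed validation set that the abstract identifies as impractical in collaborative settings.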
In today’s corporate landscape, the creation of questionnaires, surveys or evaluation forms for employees is a widespread practice. These tools are regularly used to check various aspects such as motivation, opportun...
A language L is said to be regular-measurable if there exists an infinite sequence of pairs of regular languages that "converges" to L. Instead of regular languages, this paper examines measuring power of se...
This paper explores the application of state-of-the-art natural language processing (NLP) technologies to improve the user experience in games. Our motivation stems from the realization that a virtual assistant’s inp...
The automatic segmentation of tumours or lesions from magnetic resonance imaging (MRI) pictures is a critical but difficult task in clinical settings, sometimes necessitating laborious and time-consuming techniques. D...
This research work presents a novel language intervention system for Tamil-speaking children with autism spectrum disorder (ASD). The system satisfies the considerable requirement for tools aimed at one more section o...
Leveraging AI to analyze key topics on African social media can enhance public governance. Our study analyzes social media discourse within African society on development concerns by (1) evaluating AI techniques for s...
Adam has become one of the most favored optimizers in deep learning problems. Despite its success in practice, numerous mysteries persist regarding its theoretical understanding. In this paper, we study the implicit b...
Knee Osteoarthritis (KOA), the most prevalent joint disease, significantly impacts elderly mobility due to progressive cartilage degeneration. Early prediction is crucial for preventing disease progression and guiding...