Big data programming frameworks have become increasingly important for developing applications in which performance and scalability are critical. In these complex frameworks, optimizing code by hand is hard and time-consuming, which makes automated optimization particularly necessary. A prerequisite for automating optimization is to find suitable abstractions to represent programs; for instance, algebras based on monads or monoids to represent distributed data collections. Currently, however, such algebras do not represent recursive programs in a way that allows them to be analyzed or rewritten. In this paper, we extend a monoid algebra with a fixpoint operator that represents recursion as a first-class citizen, and we show how it enables new optimizations. Experiments with the Spark platform illustrate the performance gains brought by these systematic optimizations.
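To make the idea of a fixpoint operator over a collection monoid more concrete, the following is a minimal sketch only, not the algebra defined in the paper. It assumes sets under union as the monoid and uses transitive closure as the recursive program; the names `fixpoint`, `transitive_closure`, and `extend` are hypothetical and chosen for illustration.

```python
from typing import Callable, FrozenSet, Tuple

Edge = Tuple[int, int]
EdgeSet = FrozenSet[Edge]


def fixpoint(step: Callable[[EdgeSet], EdgeSet], seed: EdgeSet) -> EdgeSet:
    """Iterate X <- X | step(X) from the seed until X stops growing.

    Assumption: `step` only produces elements from a finite universe,
    so the loop terminates.
    """
    current = seed
    while True:
        grown = current | step(current)
        if grown == current:
            return current
        current = grown


def transitive_closure(edges: EdgeSet) -> EdgeSet:
    """Reachability pairs of a directed graph, expressed as a fixpoint."""

    def extend(paths: EdgeSet) -> EdgeSet:
        # Join known paths (a, b) with base edges (b, c) to obtain (a, c).
        return frozenset(
            (a, c) for (a, b) in paths for (b2, c) in edges if b == b2
        )

    return fixpoint(extend, edges)


closure = transitive_closure(frozenset({(1, 2), (2, 3), (3, 4)}))
```

Because the recursion is an explicit value (`fixpoint` applied to `extend`) rather than a hand-written driver loop, an optimizer could in principle inspect and rewrite it, which is the kind of opportunity the abstract refers to.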
Long-term streamflow data are essential for water resources planning and management, cascade reservoir scheduling, and understanding the response of water resources to climate change and human activities. Streamflow reconstructions can effectively fill in gaps in runoff data. However, given the scarcity of observational monitoring stations and the limitations of distributed hydrological models, reconstructing long-term runoff time series under varying surface and climatic conditions remains a challenge. Here, we propose a hydrological knowledge-informed Long Short-Term Memory (LSTM) model (Hydro-LSTM) for monthly streamflow reconstruction using open-access distributed data. Hydrological knowledge was derived from hydrological governing equations and parameters for each independent water cycle component. The Hydro-LSTM addresses the lack of physical consistency inherent in data-driven models, along with missing observations. The approach was applied to simulate monthly runoff of representative rivers in the Tibetan Plateau (TP) from 1980 to 2018. The results show that streamflow reconstructions for the eight stations studied performed well; the trends in dynamic change and the range of runoff in the model training and test periods are consistent with the measured values. Values of NSE, CC, and KGE range between 0.715-0.968, 0.847-0.985, and 0.786-0.969, respectively. The influence of hydrological expertise and distributed data on the model is discussed. Introducing hydrological knowledge gives the driving elements hydrological significance, which improves the physical consistency and interpretability of the Hydro-LSTM model. The proposed Hydro-LSTM is expected to (1) achieve accurate and efficient reconstructions of long-term runoff time series using open-access distributed data and limited observations and (2) provide a new perspective for runoff reconstruction and prediction, with promising application prospec
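For readers unfamiliar with the three skill scores reported above, the sketch below shows how they are commonly computed. It assumes NSE is the Nash-Sutcliffe efficiency, CC is the Pearson correlation coefficient, and KGE is the original (2009) Kling-Gupta form; the abstract does not say which KGE variant was used, so that choice is an assumption. The function names are illustrative.

```python
import math
from statistics import fmean, pstdev
from typing import Sequence


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = fmean(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / variance_sum


def cc(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Pearson correlation coefficient between observed and simulated series."""
    mean_obs, mean_sim = fmean(obs), fmean(sim)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov / math.sqrt(var_obs * var_sim)


def kge(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Kling-Gupta efficiency (assumed 2009 form): correlation, variability, bias."""
    r = cc(obs, sim)
    alpha = pstdev(sim) / pstdev(obs)  # variability ratio
    beta = fmean(sim) / fmean(obs)     # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

All three scores equal 1 for a perfect reconstruction, so the reported ranges can be read as distances from that ideal.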
The composite quantile regression estimator is widely acknowledged for its robustness and efficiency, offering a compelling alternative to both ordinary least squares and quantile regression estimators in linear models. However, when data is not randomly distributed across different workers in distributed settings, existing methods for composite quantile regression become statistically inefficient. To address this limitation, we present a novel one-step upgraded pilot composite quantile regression method. Our proposed approach involves two essential steps. In the first step, we obtain a pilot estimator by leveraging a small random sample collected from different workers. Subsequently, in the second step, we perform one-step updating based on the pilot estimator, involving the summarization of sample moment quantities on each worker. The resulting estimator is theoretically proven to be as statistically efficient as the composite quantile regression estimator using the entire sample, without relying on restrictive assumptions about randomness. Furthermore, the resulting estimator inherits the robustness and efficiency advantages of the composite quantile regression estimator, while also being computationally efficient in terms of communication cost and storage usage. To validate the practical performance of our proposed method, we conduct numerical studies using simulated and real data, demonstrating its effectiveness in real-world scenarios.
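As background for the estimator discussed above, the following sketch evaluates the standard composite quantile regression objective: a single shared slope, one intercept per quantile level, and the sum of the quantile check losses. It covers only the objective for a single covariate, not the pilot or one-step update described in the abstract, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def check_loss(u: float, tau: float) -> float:
    """Quantile check loss rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def cqr_loss(
    xs: Sequence[float],
    ys: Sequence[float],
    slope: float,
    intercepts: Sequence[float],
    taus: Sequence[float],
) -> float:
    """Composite quantile regression objective for one covariate.

    The slope is shared across all quantile levels; each level tau_k
    has its own intercept b_k.
    """
    total = 0.0
    for b_k, tau in zip(intercepts, taus):
        for x, y in zip(xs, ys):
            total += check_loss(y - b_k - slope * x, tau)
    return total
```

Sharing the slope across quantile levels is what distinguishes the composite objective from fitting a separate quantile regression at each level.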
Foundation models have achieved remarkable success across various domains, but still face critical challenges such as limited data availability, high computational requirements, and rapid knowledge obsolescence. To address these issues, we propose a novel framework that integrates model merging with federated learning to enable continual foundation model updates without centralizing sensitive data. In this framework, each client fine-tunes a local model, and the server merges these models using multiple merging strategies. We experimentally evaluate the effectiveness of these methods using the CLIP model for image classification tasks across diverse datasets. The results demonstrate that advanced merging methods can surpass simple averaging in terms of accuracy, although they introduce challenges such as catastrophic forgetting and sensitivity to hyperparameters. This study defines a realistic and practical problem setting for decentralized foundation model updates, and provides a comparative analysis of merging techniques, offering valuable insights for scalable and privacy-preserving model evolution in dynamic environments.
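The simple-averaging baseline mentioned above can be sketched as follows. This is a minimal illustration assuming each client model is a mapping from parameter name to a flat list of floats with identical keys and shapes across clients; the more advanced merging strategies evaluated in the paper are not shown, and `average_merge` is a hypothetical name.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def average_merge(client_models: List[Weights]) -> Weights:
    """Merge fine-tuned client models by element-wise averaging.

    Assumes every client model has the same parameter names and shapes.
    """
    if not client_models:
        raise ValueError("need at least one client model")
    n_clients = len(client_models)
    merged: Weights = {}
    for name in client_models[0]:
        # Group the i-th value of this parameter across all clients.
        columns = zip(*(model[name] for model in client_models))
        merged[name] = [sum(values) / n_clients for values in columns]
    return merged
```

Only the parameters leave each client in this scheme, which is what allows the server to update the model without centralizing the underlying data.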
Divide and conquer is a common strategy in big data analysis. Model averaging has a natural divide-and-conquer structure, but its theory has not been developed for big data scenarios. The goal of this paper is to fill this gap. We propose two divide-and-conquer-type model averaging estimators for linear models with distributed data. Under some regularity conditions, we show that the weights from the Mallows model averaging criterion converge in L2 to the theoretically optimal weights that minimize the risk of the model averaging estimator. We also give bounds on the in-sample and out-of-sample mean squared errors and prove the asymptotic optimality of the proposed model averaging estimators. Our conclusions hold even when the dimensions and the number of candidate models are divergent. Simulation results and an analysis of real airline data illustrate that the proposed model averaging methods perform better than commonly used model selection and model averaging methods in distributed data cases. Our approaches contribute to model averaging theory for distributed data and parallel computation, and can be applied in big data analysis to save time and reduce the computational burden.
For massive data stored on multiple machines, we propose a distributed subsampling procedure for the composite quantile regression. By establishing the consistency and asymptotic normality of the composite quantile regression estimator from a general subsampling algorithm, we derive the optimal subsampling probabilities and the optimal allocation sizes under the L-optimality criteria. A two-step algorithm is developed to approximate the optimal subsampling procedure. The proposed methods are illustrated through numerical experiments on simulated and real datasets.
In recent years, many methodologies for distributed data have been developed. However, there are two problems. First, most of these methods require the data to be randomly and uniformly distributed across different machines. Second, most of these methods are not robust. To solve these problems, we propose a distributed pilot modal regression (MR) estimator, which achieves robustness and can adapt when the data are stored nonrandomly. First, we collect a random pilot sample from different machines; then, we approximate the global MR objective function by a communication-efficient surrogate that can be efficiently evaluated using the pilot sample and the local gradients. The final estimator is obtained by minimizing the surrogate function on the master machine, while the other machines only need to calculate their gradients. Theoretical results show that the new estimator is asymptotically as efficient as the global MR estimator. Simulation studies illustrate the utility of the proposed approach.
Stochastic configuration networks (SCNs), as a class of randomized learning models, are incrementally built under a supervisory mechanism, and theoretically ensure error-free learning for training sets. This paper proposes a federated version of SCNs (FSCNs) for large-scale data that are geographically distributed among different end-user clients and cannot be shared due to privacy and security concerns. Unlike centralized learning, which needs to collect data from clients and store them collectively on a cloud server, FSCNs enable distributed analytics in a collaborative learning paradigm without centrally aggregating new data, thereby preventing the leakage of private information. Considering different supervisory and aggregation schemes for model parameters, two FSC algorithms with two aggregation strategies are presented. The experimental results on both data regression and classification show the effectiveness and feasibility of our proposed federated learning scheme. (c) 2022 Elsevier Inc. All rights reserved.
ISBN: (print) 9783031609961; 9783031609978
In this era of transformation, businesses and organizations are navigating the intricate landscape of digital markets. These markets rely on data-driven insights to make decisions and achieve success. This research paper explores the world of market research, focusing specifically on the synergy between distributed data and knowledge-based systems. Our goal is to understand and capitalize on emerging trends by unraveling the dynamics of this combination. We begin our investigation with an exploration of methods and technologies for gathering and harmonizing data from various sources such as social media platforms, e-commerce websites, Internet of Things (IoT) devices, and more. By integrating these sources, we create datasets that form the foundation for our research. Next, we dive into knowledge-based systems, utilizing intelligence and machine learning algorithms to extract valuable insights and patterns from our integrated data. These insights not only deepen our understanding of emerging market trends but also serve as a basis for developing effective digital marketing strategies and campaigns. Throughout this journey, we also consider ethical aspects and respect privacy concerns, since responsible data usage is crucial in today's information age. Our paper showcases real-life examples and practical uses from industries to demonstrate the advantages of our approach. In essence, we take a glimpse into the future, speculating on how digital market research will evolve, mapping out paths, and emphasizing areas for exploration and innovation. This research aims to equip businesses with the insights and resources needed to navigate the changing digital market landscape.
Knowledge discovery is one of the key areas in predictive data mining tasks. Performing classification tasks on a single source of data using a decision tree algorithm is a relatively straightforward process. However,...