ISBN (digital): 9798350374889
ISBN (print): 9798350374896
Federated learning (FL), a distributed learning strategy, improves security and privacy by eliminating the need for clients to share their local data; however, FL struggles with non-IID (not independent and identically distributed) data. Clustered FL aims to remedy this by grouping similar clients and training a model per group; nevertheless, it faces difficulties in determining clusters without sharing local data and in conducting model evaluation. Clustered FL evaluation on unseen clients typically applies all models and selects the best performer for each client, an approach known as best-fit cluster evaluation. This paper challenges this evaluation process, arguing that it violates a fundamental machine learning principle: test dataset labels should be used only for performance calculation, not for model selection. We show that best-fit cluster evaluation results in significant accuracy overestimates. Moreover, we present an evaluation approach that maintains the separation between model selection and evaluation by reserving a portion of the target client's data for model selection while using the remaining data for accuracy estimation. Experiments on four datasets, encompassing various IID and non-IID scenarios, demonstrate that best-fit cluster evaluation produces overestimates that are statistically different from the estimates of our evaluation.
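To make the distinction concrete, here is a minimal sketch (not the authors' code) of the two evaluation procedures, assuming scikit-learn-style cluster models with a score(X, y) accuracy method; all names and the 0.2 split ratio are illustrative assumptions.

```python
# Sketch of best-fit cluster evaluation vs. the proposed split evaluation.
from sklearn.model_selection import train_test_split

def best_fit_evaluation(cluster_models, client_x, client_y):
    # The criticized procedure: test labels drive both model selection
    # and the reported accuracy, so the estimate is optimistically biased.
    return max(m.score(client_x, client_y) for m in cluster_models)

def split_evaluation(cluster_models, client_x, client_y, select_frac=0.2, seed=0):
    # The proposed procedure: reserve a selection split for choosing the
    # cluster model, and report accuracy only on the held-out remainder.
    x_sel, x_eval, y_sel, y_eval = train_test_split(
        client_x, client_y, train_size=select_frac, random_state=seed)
    best = max(cluster_models, key=lambda m: m.score(x_sel, y_sel))
    return best.score(x_eval, y_eval)
```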
Is process migration useful for load balancing? We present experimental results indicating that the answer to this question depends largely on the characteristics of the applied workload. Experiments with our Shiva system, which supports remote execution and process migration, show that only those CPU-bound workloads that were generated using an unrealistic exponential distribution for execution times show improvements under dynamic load balancing. (We use the term 'dynamic' to indicate remote execution determined at, and not prior to, run time. The latter is known as 'static' load balancing.) Using a more realistic workload distribution and adding a number of short-lived tasks prevents dynamic algorithms from working. Migration is only useful with heterogeneous workloads. We find the migration of executing tasks to remote data to be effective for balancing I/O-bound workloads, and we indicate the region of 'workload variable space' for which this migrate-to-data approach is useful.
ISBN (digital): 9798350367331
ISBN (print): 9798350367348
In this paper, we propose a blind synchronization method for signals with sampling rate offset (SRO) and missing data, which occasionally occurs in distributed recording for acoustic scene classification. In our method, the correspondence between short-time frames is first estimated using cross-correlation and dynamic programming (DP) matching. Then, two methods for producing synchronized signals are compared. The first method is based on the overlap-add along the DP path, while the second method uses the DP path only to identify missing data positions and compensates for the SRO with a linear phase model. The performance of these methods is evaluated through experiments. The results are promising, and further applications to acoustic scene classification are expected.
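As a rough illustration of the first stage, the sketch below computes a frame-to-frame similarity matrix from normalized cross-correlations and finds a monotonic DP alignment path; the frame sizes and the zero-lag correlation are simplifying assumptions, not the paper's exact configuration.

```python
# Frame correspondence via cross-correlation similarity and DP matching.
import numpy as np

def frame_similarity(sig_a, sig_b, frame_len=1024, hop=512):
    # Normalized correlation (at zero lag) between every pair of frames.
    fa = np.array([sig_a[i:i+frame_len] for i in range(0, len(sig_a)-frame_len, hop)])
    fb = np.array([sig_b[i:i+frame_len] for i in range(0, len(sig_b)-frame_len, hop)])
    fa /= np.linalg.norm(fa, axis=1, keepdims=True) + 1e-12
    fb /= np.linalg.norm(fb, axis=1, keepdims=True) + 1e-12
    return fa @ fb.T

def dp_path(sim):
    # Accumulate cost (negative similarity), then backtrack the best
    # monotonic path; horizontal/vertical runs flag missing data.
    cost, acc = -sim, np.full(sim.shape, np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(sim.shape[0]):
        for j in range(sim.shape[1]):
            if i or j:
                prev = min(acc[i-1, j] if i else np.inf,
                           acc[i, j-1] if j else np.inf,
                           acc[i-1, j-1] if i and j else np.inf)
                acc[i, j] = cost[i, j] + prev
    i, j = sim.shape[0] - 1, sim.shape[1] - 1
    path = [(i, j)]
    while i or j:
        cands = [(a, b) for a, b in ((i-1, j-1), (i-1, j), (i, j-1))
                 if a >= 0 and b >= 0]
        i, j = min(cands, key=lambda s: acc[s])
        path.append((i, j))
    return path[::-1]
```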
Nowadays, the need for fast and reliable communication is increasing, which leads us to look for new ways to enhance channel coding. In this paper, we study the case of distributed coding between two users that aim to transmit data to a common destination, where each user transmits a partial redundancy to the destination and relies on the second user for the remainder. The purpose of distributing the creation and transmission of redundancy is to benefit from each user's channel quality for more accurate decoding. In our analysis, we use a rate-1/2 convolutional code between the users and a distributed turbo code for transmission to the destination. This study aims to highlight the key factors involved, as well as the advantages of choosing a distributed encoding.
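For context, the sketch below implements a textbook rate-1/2 convolutional encoder (octal generators 7 and 5); it is only a generic stand-in, since the paper does not specify the generator polynomials, and the distributed turbo structure itself is not reproduced.

```python
# Rate-1/2 convolutional encoder: two output bits per input bit.
def conv_encode_r12(bits, g1=0b111, g2=0b101):
    state, out = 0, []
    for b in bits:
        state = ((state << 1) | b) & 0b111          # 3-bit shift register
        out.append(bin(state & g1).count("1") % 2)  # parity over taps g1
        out.append(bin(state & g2).count("1") % 2)  # parity over taps g2
    return out

# In the distributed setting described above, each user could forward one
# of the two parity streams, splitting the redundancy across two channels.
print(conv_encode_r12([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```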
COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably, in both accuracy and training time, to learning on a subsample of the data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation, which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100x or more.
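The abstract does not spell out the Gaussian stopping rule, so the following is a hedged sketch of the general idea for binary classification: evaluate trees one at a time and stop once a normal confidence interval around the running vote fraction excludes the decision boundary. The z and min_votes values are assumptions, not the paper's parameters.

```python
# Lazy ensemble evaluation: stop early once the outcome is settled.
import numpy as np

def lazy_predict(trees, x, z=2.0, min_votes=10):
    votes = []
    for tree in trees:
        votes.append(tree.predict([x])[0])   # 0/1 labels assumed
        if len(votes) >= min_votes:
            p = np.mean(votes)
            half_width = z * np.sqrt(p * (1 - p) / len(votes))
            if abs(p - 0.5) > half_width:    # interval excludes 0.5
                break
    return int(round(np.mean(votes))), len(votes)  # prediction, trees used
```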
The paper analyzes the characteristics of digital library resources and their storage requirements. It presents the technology and characteristics of grid storage, analyzes the advantages of grid storage, and discusses the application of grid storage technology to the storage of digital library resources from three aspects.
Bounds are developed on the probability that the Cartesian product of a given number of finite random sets does not intersect (avoids) a given fixed set. These bounds are then used to estimate the probability of data loss in a distributed storage system that uses erasure codes to protect against data loss when disks fail. These are the first bounds on the probability of data loss that we are aware of. We compare our upper bound on the probability of data loss to approximations that are used in the literature, and show that our bounds are tighter, with a significant gap in some cases. Our bounds also suggest that in some cases a more efficient (higher-rate) code than the one predicted by approximations widely used in industry will suffice to meet a data loss probability target.
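A Monte Carlo estimate is one simple baseline against which such bounds can be checked. The sketch below is not from the paper: it assumes an (n, k) MDS erasure code where a stripe is lost once more than n - k of its n disks fail, with independent failures, and all parameter values are illustrative.

```python
# Monte Carlo estimate of data-loss probability for an (n, k) erasure code.
import numpy as np

def mc_loss_probability(n=14, k=10, p_fail=0.01, stripes=10, trials=50_000, seed=0):
    rng = np.random.default_rng(seed)
    failures = rng.random((trials, stripes, n)) < p_fail   # disk failures
    lost = (failures.sum(axis=2) > n - k).any(axis=1)      # any stripe lost
    return lost.mean()

print(mc_loss_probability())
```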
ISBN (print): 9781509052646
The Hadoop Distributed File System (HDFS) is a distributed storage system that stores large volumes of data reliably and provides high-bandwidth access to the data for applications. HDFS achieves high reliability and availability by replicating data, typically keeping three copies, and distributing these replicas across multiple data nodes. The placement of data replicas is one of the key issues that affect the performance of HDFS. Under the current HDFS replica placement policy, the replicas of data blocks cannot be evenly distributed across cluster nodes, so HDFS has to rely on a load balancing utility to balance the replica distribution, which consumes additional time and resources. These challenges drive the need for intelligent methods that solve the data placement problem to achieve high performance without requiring the load balancing utility. In this paper, we propose an intelligent policy for data placement in cloud storage systems that addresses the above challenges.
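Since the abstract leaves the proposed policy unspecified, the following is only a toy illustration of the stated goal: keep replica counts even at placement time so no separate balancer pass is needed. Rack awareness and node capacity, which real HDFS placement considers, are omitted.

```python
# Greedy placement: put each new block's replicas on the least-loaded nodes.
import heapq

def place_replicas(block_ids, nodes, replicas=3):
    load = [(0, n) for n in nodes]                 # (block count, node) heap
    heapq.heapify(load)
    placement = {}
    for block in block_ids:
        chosen = [heapq.heappop(load) for _ in range(replicas)]
        placement[block] = [n for _, n in chosen]
        for count, n in chosen:                    # update loads, re-insert
            heapq.heappush(load, (count + 1, n))
    return placement

print(place_replicas(range(6), ["dn1", "dn2", "dn3", "dn4"]))
```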
ISBN (print): 9781467390576
Owing to the proliferation of smartphones, communication services such as video streaming are common in mobile situations. For these services, quality evaluation and communication control based on Quality of Experience (QoE), the degree of a user's subjective satisfaction, are very important because the final goal of delivering a high-quality service is improving user satisfaction. QoE tends to be affected by several factors, including Quality of Service (QoS). Therefore, collecting QoS data from the users of a mobile application, known as crowdsourced data, has become one of the promising schemes for meeting QoE targets. Crowdsourced data are, however, apt to be affected by sensing errors and low accuracy. In this study, we propose to estimate densities solely from the records of application use and their location information in order to mitigate these sensing errors and low accuracy. A kernel density estimator is used to derive a continuous density function from discretely distributed sample data such as the collected QoS data. From the viewpoint of QoS, it is important to extract high-density fields of application use in order to find QoS degradation. After estimating the kernel density, we determine the borderline between the high-density field and the other fields using a reference value that can be determined from the observed data. Simulated experiments verify the effectiveness of the proposed method.
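A minimal sketch of the density step is given below, assuming SciPy is available; the paper's reference value for the borderline is not specified, so a simple quantile of the sample densities stands in for it here.

```python
# Kernel density estimate over app-use locations, thresholded into a
# high-density field and the rest.
import numpy as np
from scipy.stats import gaussian_kde

def high_density_mask(locations, quantile=0.75):
    # locations: (2, n) array of app-use coordinates.
    kde = gaussian_kde(locations)               # continuous density function
    density = kde(locations)                    # density at each sample
    threshold = np.quantile(density, quantile)  # stand-in reference value
    return density >= threshold                 # True = high-density field

rng = np.random.default_rng(1)
points = rng.normal(size=(2, 500))
print(high_density_mask(points).sum(), "of 500 samples in the high-density field")
```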