ISBN: 9781728135557 (print)
Missing data is an inevitable and ubiquitous problem in data-driven Intelligent Transportation Systems (ITS), and it seriously affects the accuracy of urban traffic planning and management. Most existing traffic data processing methods exploit only the characteristics of single-source data. In this paper, we present a novel coupled tensor model that uses multi-source traffic data for missing data imputation, and propose a tensor completion algorithm based on a modified CMTF-WOPT (Coupled Matrix and Tensor Factorization-Weighted OPTimization) algorithm to recover the missing traffic data. We also present extensive simulation results on real-world traffic datasets to evaluate the performance of the proposed algorithm. The simulation results show that the proposed coupled tensor completion algorithm significantly improves recovery accuracy compared with existing tensor completion algorithms, especially under high missing rates.
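As a rough illustration of the weighted, coupled objective that CMTF-WOPT-style methods minimise, the sketch below evaluates a loss over observed tensor entries only, plus a coupling term with a side matrix that shares one factor. It follows the general form found in the CMTF literature, not the modified coupled-tensor variant of this paper; the names `cp_reconstruct`, `coupled_masked_loss`, and all shapes are hypothetical.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rebuild a 3-way tensor from CP factor matrices (one column per component)."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def coupled_masked_loss(X, W, Y, A, B, C, V):
    """Weighted tensor fit on observed entries plus a coupled matrix fit.

    W is a 0/1 mask (1 = observed). The matrix Y shares factor A with the tensor.
    This is an assumed, generic CMTF-style objective for illustration only.
    """
    tensor_term = 0.5 * np.sum((W * (X - cp_reconstruct(A, B, C))) ** 2)
    matrix_term = 0.5 * np.sum((Y - A @ V.T) ** 2)
    return tensor_term + matrix_term

# Toy data: exact rank-2 factors, with about half of the tensor entries missing.
rng = np.random.default_rng(0)
A, B, C, V = (rng.standard_normal((n, 2)) for n in (4, 5, 6, 3))
X_full = cp_reconstruct(A, B, C)
Y = A @ V.T
W = (rng.random(X_full.shape) > 0.5).astype(float)
X_obs = np.where(W == 1, X_full, 99.0)  # missing entries hold arbitrary junk
loss_exact = coupled_masked_loss(X_obs, W, Y, A, B, C, V)
```

Because the mask zeroes out the missing entries, the junk values placed there do not affect the loss; a gradient-based solver would minimise this quantity over the factors and then read the imputed values from the reconstruction.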
ISBN: 9781479981311 (print)
In this paper, we present a method to handle data imbalance for classification with neural networks, and apply it to the acoustic event detection (AED) problem. The common approach to tackling data imbalance is to use class weights in the objective function during training. A more sophisticated existing approach is to map the input to clusters in an embedding space, so that learning is locally balanced by incorporating inter-cluster and inter-class margins. Along these lines, we propose a method to learn the embedding using a novel objective function, called triple-header cross entropy. Our scheme integrates in a simple way with back-propagation-based training, and is computationally more efficient than general hinge-loss-based embedding learning schemes. The empirical evaluation results demonstrate the effectiveness of the proposed method for AED with imbalanced training data.
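The class-weighting baseline mentioned above can be sketched as follows. This is only the common approach the abstract refers to, not the proposed triple-header cross entropy; the inverse-frequency weighting and the function names are assumptions chosen for illustration.

```python
import numpy as np

def class_weights(labels, num_classes):
    """Inverse-frequency weights, scaled so a balanced set gives all ones."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0  # assumption: treat absent classes as count 1
    return len(labels) / (num_classes * counts)

def weighted_cross_entropy(probs, labels, weights):
    """Batch mean of -w[y] * log p[y], where y is the true class of each sample."""
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(weights[labels] * -np.log(picked + eps)))

# Toy batch: three samples of class 0, one of the rare class 1.
labels = np.array([0, 0, 0, 1])
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]])
w = class_weights(labels, 2)
loss = weighted_cross_entropy(probs, labels, w)
```

With these weights the single rare-class sample contributes three times as much per unit of log-loss as each majority-class sample, which is the rebalancing effect the baseline relies on.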
ISBN: 9781479981311 (print)
Data fusion (DF) from multiple heterogeneous sources is a typical task in many multisensor applications, including remote sensing classification problems. Multiple classifier systems (MCS) provide a natural way to solve DF at the decision level by training individual classifiers separately, each on its own data source, and then combining their outputs. In this paper, we consider a dynamic selection (DS) framework to select and fuse competent classifiers of an MCS. For this, we propose a competence estimation and selection method to improve the performance of the DF system, especially under class imbalance. We evaluate the method with synthetic and real datasets, demonstrating the applicability of the proposed framework.
ISBN: 9781479981311 (print)
This paper describes the CMU Wilderness Multilingual Speech dataset, which covers over 700 different languages and provides audio, aligned text, and word pronunciations. On average, each language provides around 20 hours of sentence-length transcriptions. We describe our multi-pass alignment techniques and evaluate the results by building speech synthesizers on the aligned data. Most of the resulting synthesizers are good enough for deployment and use. The tools to do this work are released as open source, and instructions on how to apply such alignment to novel languages are given.
ISBN: 9781479981311 (print)
Computational methods for identifying hidden structures in high-order data are critical for exploratory data analysis tasks. This work proposes a joint dimensionality reduction and co-clustering algorithm for tensors. A compressed representation of a tensor is obtained via a Tucker-like decomposition model, whose factor matrices capture the tensor co-clustering structure. Factor matrices correspond to the cluster centroids of the tensor fibers per mode, whose entries interact nonlinearly to build the tensor approximation. The algorithm, developed based on the alternating-direction method of multipliers, has computational complexity similar to that of a single Tucker decomposition.
ISBN: 9781479981311 (print)
As information security is increasingly valued, privacy-preserving data mining has become a research hotspot in the field of big data and signal processing. We propose a new differentially private greedy decision forest algorithm called DPGDF to help improve the accuracy of privacy-preserving data mining. Unlike previous algorithms that employed only greedy decision trees or random forests, our algorithm uses a combination of greedy trees and parallel combination theory to construct a greedy decision forest and to balance privacy protection against prediction accuracy. Combined with smooth sensitivity, the introduction of noise is minimized, making the prediction accuracy of the algorithm notably better than that of current state-of-the-art algorithms. Experiments on the UCI datasets show that the prediction accuracy of our algorithm is about 10% higher than that of those algorithms.
ISBN: 9781479981311 (print)
We propose a novel "big data" application of geometric feature extraction techniques to autonomously identify and track the temporal evolution of charged particle trails in the Martian ionosphere. Specifically, we propose a Radon transform extension to the geometric distance transform to algorithmically isolate potentially overlapping trail features in energy spectrograms. Our methods seek to connect large-scale statistical analysis with individual case studies and thus provide the computational framework for connecting theoretical models with potential terabytes of remote sensing data. Based on individual ion populations as the basic unit of observation, we provide data-driven results of applying our method over representative energy spectrograms generated from the NASA Mars Atmosphere and Volatile Evolution (MAVEN) mission data from the Solar Wind Ion Analyzer (SWIA) instrument.
ISBN: 9781479981311 (print)
This paper proposes a canonical correlation based feature extraction method with application to anomaly detection in electric appliances. Electric appliances in homes, offices, or manufacturing factories are nowadays monitored by Internet of Things (IoT) platforms and systems. For unsupervised anomaly detection in such IoT systems, learning a model is challenging, since normal and anomalous behavior coexist in time-domain signals and are difficult to tell apart. For accurate model training, we propose to split the odd and even frequency harmonics of electric current signals and transform them using canonical correlation analysis to extract discriminative features. Evaluations on real-world data demonstrate that the proposed approach outperforms conventional unsupervised feature extraction methods.
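The first step described above, separating odd and even harmonics of a current signal, might look like the sketch below. It assumes a known fundamental frequency `f0`, a window spanning whole periods, and nearest-bin FFT magnitudes as the harmonic estimates; the function name and parameters are hypothetical, and the subsequent canonical correlation analysis step is not shown.

```python
import numpy as np

def split_harmonics(signal, fs, f0, n_harmonics=6):
    """Return (odd, even) harmonic magnitudes of f0 for one window of samples."""
    spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    orders = np.arange(1, n_harmonics + 1)
    # Pick the FFT bin closest to each harmonic frequency k * f0.
    bins = np.array([np.argmin(np.abs(freqs - k * f0)) for k in orders])
    mags = spectrum[bins]
    return mags[orders % 2 == 1], mags[orders % 2 == 0]

# Synthetic one-second current: 50 Hz fundamental plus 3rd and 2nd harmonics.
fs, f0 = 1000, 50
t = np.arange(fs) / fs
current = (np.sin(2 * np.pi * f0 * t)
           + 0.3 * np.sin(2 * np.pi * 3 * f0 * t)
           + 0.1 * np.sin(2 * np.pi * 2 * f0 * t))
odd, even = split_harmonics(current, fs, f0)
```

Collecting `odd` and `even` over many windows gives two feature views of the same signal, which is the kind of paired input that canonical correlation analysis operates on.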
ISBN: 9781479981311 (print)
Dataset reshuffling across mobile devices can speed up on-device distributed machine learning, but it requires significant communication bandwidth. In this paper, we propose a pliable data shuffling approach to significantly reduce the communication cost of on-device distributed learning via joint data placement and transmission design. This is achieved by establishing novel interference alignment conditions and diversity constraints for data shuffling to improve the statistical learning performance. Unfortunately, the presented pliable data shuffling problem is a highly intractable mixed combinatorial optimization problem, for which a novel sparse and low-rank framework is developed, supported by the computationally efficient difference-of-convex (DC) algorithm. Numerical results demonstrate that the proposed pliable data shuffling is able to significantly reduce the communication bandwidth while achieving desirable learning performance.
ISBN: 9781479981311 (print)
The huge volume of data available today requires data-selective processing approaches that avoid unnecessary computational cost by appropriately treating non-innovative data. In this paper, extensions of the well-known adaptive filtering LMS-Newton and LMS-Quasi-Newton algorithms are developed that enable data selection while also addressing the censorship of outliers that emerge due to high measurement errors. The proposed solutions allow one to prescribe how often the acquired data are expected to be incorporated into the learning process, based on some a priori information regarding the environment. Simulation results on both synthetic and real-world data verify the effectiveness of the proposed algorithms, which can achieve significant reductions in computational cost through data selection without sacrificing estimation accuracy.
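The idea of data selection with outlier censoring can be illustrated with a minimal sketch. It uses the plain LMS update rather than the LMS-Newton or LMS-Quasi-Newton variants developed in the paper, and the fixed thresholds `lower` and `upper` are hypothetical stand-ins for the thresholds the paper derives from a prescribed update rate.

```python
import numpy as np

def data_selective_lms(X, d, mu=0.05, lower=0.1, upper=5.0):
    """Plain LMS that updates only when lower < |error| <= upper.

    Small errors are treated as non-innovative data, large ones as outliers.
    Both thresholds are assumed values for illustration.
    """
    w = np.zeros(X.shape[1])
    updates = 0
    for x, target in zip(X, d):
        e = target - w @ x
        if lower < abs(e) <= upper:
            w = w + mu * e * x
            updates += 1
    return w, updates

# Synthetic system identification with a gross outlier every 100th sample.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0, 2.0])
X = rng.standard_normal((2000, 3))
d = X @ w_true + 0.01 * rng.standard_normal(2000)
d[::100] += 50.0
w_hat, n_updates = data_selective_lms(X, d)
```

Comparing `n_updates` with the number of samples shows how many coefficient updates the selection rule skipped, which is where the computational saving comes from.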