ISBN: (print) 9783319387826; 9783319387819
A major computational challenge in analyzing metagenomic sequencing reads is identifying the unknown sources of massive, heterogeneous sets of short DNA reads. A promising approach is to efficiently extract and exploit sequence features, i.e., k-mers, to bin the reads according to their sources. Shorter k-mers may capture base composition information, while longer k-mers may represent read abundance information. We present a novel Poisson-Markov mixture Model (PMM) that systematically integrates the information in both long and short k-mers, and we develop a parallel algorithm that improves both read binning performance and running time. We compare the performance and running time of our PMM approach with selected competing approaches using simulated data sets, and we also demonstrate the utility of our PMM approach using a time-course metagenomics data set. The probabilistic modeling framework is sufficiently flexible and general to solve a wide range of supervised and unsupervised learning problems in metagenomics.
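As a rough illustration of the k-mer features mentioned in this abstract, the sketch below counts k-mers in a single read and normalizes them into a composition profile. It is not the PMM model; the function names `kmer_counts` and `kmer_profile` and the choice to skip windows containing non-ACGT characters are assumptions made for this example.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def kmer_counts(read, k):
    """Count every length-k substring of a read (hypothetical helper).

    Windows containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    read = read.upper()
    counts = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= VALID_BASES:
            counts[kmer] += 1
    return counts

def kmer_profile(read, k):
    """Normalize k-mer counts into relative frequencies that sum to 1."""
    counts = kmer_counts(read, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

# A short k gives a composition-style profile of the read.
short_profile = kmer_profile("ACGTAC", 2)
```

With a small k the profile summarizes base composition; with a larger k the individual counts become sparse, which is why long k-mers are usually aggregated across many reads before being interpreted as abundance signals.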
ISBN: (print) 9781479999880
The identification and separation of contributions associated with different sources or processes is a general problem in signal and image processing. Here, we focus on the decomposition of multiple linear relationships and introduce a non-negative formulation. The proposed models can be viewed as generalizations of latent class regression models and account for possibly varying magnitudes of the linear transfer functions. Along with these models, we present model calibration algorithms. We first demonstrate their performance on simulated data. We also report an application to the analysis of upper ocean dynamics from remote sensing data, namely satellite-derived Sea Surface Height (SSH) and Sea Surface Temperature (SST) image series. This application further stresses the relevance of the proposed formulation compared to state-of-the-art regression models.
This thesis addresses two important problems in modern statistics: discriminant analysis of big data and dimension reduction of high-dimensional data such as microarray gene expression data. These problems are commonly encountered in various scientific fields and can pose considerable challenges, since traditional approaches might not work properly or might even break down in the high-dimensional setting. For the first problem, discriminant analysis of big data, one widely used parametric approach is to model the distribution of the feature vector in each of the predefined classes via a normal mixture distribution. The component-covariance matrices in the normal mixture for a class are highly parameterized, rendering them impractical for high-dimensional datasets. Therefore, as the dimension increases, some form of regularization needs to be implemented. In this thesis, an innovative factor model approach, called mixtures of common factor analyzers for discriminant analysis (MCFDA), is proposed. With this approach, the component-covariance matrices are taken to have a factor-analytic form with common loadings across the classes (common before the transformation of the factors into white noise). This approach also allows the data to be viewed in low-dimensional spaces by plotting the (estimated) values of the latent factors corresponding to the observed data points. To improve the robustness of our MCFDA approach for data that have heavy tails or atypical observations, we also adopt the multivariate t-family for the component-error and factor distributions. We refer to this model as the mixtures of common t-factor analyzers for discriminant analysis (MCtFDA). With this approach, both the common factor loadings and the diagonal matrix of error terms need to be specified as the same across the classes. This approach has great flexibility for modelling data that are non-normal or contain outliers. For the second problem of dimension reduction, we focus on
In this paper we developed the estimation implementation of the generalized hyperbolic multivariate (GH) distribution with a non-fixed Bessel function. The covariance matrix estimated through the GH distribution compl...
ISBN: (print) 9781509050819
Word alignment is a basic task in natural language processing, and it usually serves as the starting point when building a modern statistical machine translation system. However, the state-of-the-art parallel algorithm for word alignment is still time-consuming. In this work, we explore a parallel implementation of a word alignment algorithm on the Graphics Processing Unit (GPU), which is widely available in the field of high-performance computing. We use the Compute Unified Device Architecture (CUDA) programming model to re-implement a state-of-the-art word alignment algorithm, the IBM expectation-maximization (EM) algorithm. A Tesla K40M card with 2880 cores is used for the experiments, and the execution times obtained with the proposed algorithm are compared with those of a sequential algorithm and a multi-threaded algorithm on an IBM X3850 server, which has two Intel Xeon E7 CPUs (2.0GHz * 10 cores). The best experimental results show a 16.8-fold speedup over the multi-threaded algorithm and a 234.7-fold speedup over the sequential algorithm.
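For orientation, the following is a minimal sequential sketch of EM for word alignment, assuming the simplest member of the IBM family (IBM Model 1, without a NULL word). The abstract does not say which IBM model is used, and this is plain CPU Python rather than the CUDA implementation it describes; the function name `ibm_model1` and the toy corpus are illustrative assumptions.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with EM (IBM Model 1 assumed, no NULL word).

    corpus: list of (source_tokens, target_tokens) sentence pairs.
    """
    source_vocab = {f for fs, _ in corpus for f in fs}
    uniform = 1.0 / len(source_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (f, e) pairs
        total = defaultdict(float)  # expected counts for each e
        # E-step: distribute each source word over the target words
        # of its sentence in proportion to the current t values.
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

# Toy parallel corpus (illustrative only).
toy_corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t_table = ibm_model1(toy_corpus, iterations=10)
```

The inner loops over sentence pairs are independent within the E-step, which is the part a GPU or multi-threaded implementation would typically parallelize before reducing the expected counts.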
Joint modeling of longitudinal measurements and time-to-event data is often performed by fitting a shared parameter model. Another method that may be used for joint modeling is a marginal model. As a marginal model, we use a Gaussian model for joint modeling of longitudinal measurements and time-to-event data. We consider a regression model for the longitudinal data and a Weibull proportional hazards model for the event time data. A Gaussian copula is used to account for the association between these two models. A Monte Carlo expectation-maximization approach is used for parameter estimation. Simulation studies are conducted to illustrate the proposed method, which is also applied to a clinical trial dataset.
In this paper, a frequency-domain expectation-maximization (EM)-based channel estimation algorithm for Space Time Block Coded-Orthogonal Frequency Division Multiplexing (STBC-OFDM) systems is investigated to support higher data rate applications in wireless communications. The computational complexity of frequency-domain EM-based channel estimation increases when higher-order constellations are used, because the search set grows. Thus, a search set reduction algorithm is proposed to decrease the complexity without sacrificing system performance. The performance of the proposed algorithm is evaluated in terms of Bit Error Rate (BER) and Mean Square Error (MSE) for 16QAM and 64QAM modulation schemes.
ISBN: (print) 9781509048281
In this paper, a sparse Bayesian learning framework is proposed for direction-of-arrival (DOA) estimation in multiple input multiple output (MIMO) radar with unknown nonuniform noise. In the proposed method, the redundant elements of the MIMO radar are eliminated by using the reduced dimensional (RD) transformation. A sparse Bayesian model of the covariance vector is then formulated by assuming that the prior source power follows an independent zero-mean Gaussian distribution with hyperparameters for its unknown variance. The hyperparameters and the nonuniform noise variances are estimated using the expectation-maximization (EM) algorithm and the least squares (LS) criterion, respectively. Finally, the spectrum of hyperparameters is used to obtain a coarse DOA estimate, and a high-precision DOA estimate is achieved by a refined 1-D search procedure based on the reconstruction result. Simulation results demonstrate that the proposed method works well under different nonuniform noise conditions and achieves better performance.
To reduce the cost of large-scale re-sequencing, multiple individuals can be pooled together and sequenced, a strategy called pooled sequencing. Pooled sequencing could provide a cost-effective alternative to sequencing individuals separately. To facilitate the application of pooled sequencing in haplotype-based disease association analysis, the critical step is to accurately estimate haplotype frequencies from pooled samples. Here we present Ehapp2 for estimating haplotype frequencies from pooled sequencing data by utilizing a database that provides prior information on known haplotypes. We first translate the problem of estimating the frequency of each haplotype into finding a sparse solution of a system of linear equations, where the NNREG algorithm is employed to obtain the solution. Simulation experiments reveal that Ehapp2 is robust to sequencing errors and is able to estimate haplotype frequencies with less than 3% average relative difference for pooled sequencing of a mixture of real Drosophila haplotypes at 50x total coverage, even when the sequencing error rate is as high as 0.05. Because proportions for local haplotypes spanning multiple SNPs are accurately calculated first, Ehapp2 retains excellent estimation for recombinant haplotypes resulting from chromosomal crossover. Comparisons with existing methods reveal that Ehapp2 is state-of-the-art for many sequencing study designs and more suitable for current massively parallel sequencing.
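The reduction to a non-negative linear system described in this abstract can be sketched as follows. The abstract does not describe how NNREG works, so this example substitutes a generic projected gradient descent for non-negative least squares; it is not Ehapp2 or NNREG. The function name `nonneg_least_squares`, the 0/1 haplotype matrix, and the noise-free allele frequencies are all assumptions made for illustration.

```python
def nonneg_least_squares(A, b, iterations=2000):
    """Minimize 0.5 * ||A f - b||^2 subject to f >= 0 by projected gradient descent.

    A: list of rows (one per SNP), columns are candidate haplotypes (0/1 alleles).
    b: observed alternate-allele frequency at each SNP in the pool.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Step size 1/L, where L is bounded above by the squared Frobenius norm of A.
    step = 1.0 / sum(a * a for row in A for a in row)
    f = [1.0 / n_cols] * n_cols  # start from uniform frequencies
    for _ in range(iterations):
        residual = [sum(A[i][j] * f[j] for j in range(n_cols)) - b[i]
                    for i in range(n_rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        f = [max(0.0, f[j] - step * grad[j]) for j in range(n_cols)]
    return f

# Hypothetical database of four known haplotypes over four SNPs (columns).
haplotype_matrix = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
]
true_freqs = [0.5, 0.3, 0.2, 0.0]  # the fourth haplotype is absent from the pool
allele_freqs = [0.5, 0.3, 0.2, 0.8]  # noise-free A @ true_freqs
estimated_freqs = nonneg_least_squares(haplotype_matrix, allele_freqs)
```

In this noise-free toy case the estimate recovers the true frequencies, including the zero for the absent haplotype, which is the sense in which the solution is sparse. Real pooled data would add sequencing noise and a much larger haplotype database.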
This paper introduces a generative model of voice fundamental frequency (F0) contours that allows us to extract prosodic features from raw speech data. The present F0 contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. There are two motivations behind this formulation. One is to derive a general parameter estimation framework for the Fujisaki model that allows the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model that can be incorporated into statistical-model-based text-to-speech synthesizers, so that the Fujisaki-model parameters can be learned from a speech corpus in a unified manner. The model could also be useful for other speech applications in which prosodic information plays a significant role, such as emotion recognition, speaker identification, speech conversion, and dialogue systems. We quantitatively evaluated the performance of the proposed Fujisaki model parameter extractor using real speech data. Experimental results revealed that our method was superior to a state-of-the-art Fujisaki model parameter extractor.