Workshop on High-dimensional Data Analysis

(27 – 29 Feb 2008)

Jointly organized with the Department of Statistics & Applied Probability

~ Abstracts ~

Sliced regression for dimension reduction
Hansheng Wang, Peking University, China

By slicing the range of the response (Li, 1991) and applying local kernel regression (MAVE; Xia et al., 2002) within each slice, a new dimension reduction method is proposed.

Compared with traditional inverse regression methods, e.g., sliced inverse regression (Li, 1991), the new method is free of the linearity condition (Li, 1991) and enjoys much improved estimation accuracy. Compared with direct estimation methods (e.g., MAVE), the new method is much more robust against extreme values and can recover the entire central subspace (CS; Cook, 1998) exhaustively. To determine the dimension of the CS, a consistent cross-validation (CV) criterion is developed. Extensive numerical studies, including one real example, confirm our theoretical findings.
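
For contrast with the proposed method, the classic sliced inverse regression estimator cited in the abstract (Li, 1991) can be sketched in a few lines; the simulated single-index data, slice number, and function names below are purely illustrative:

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Sliced inverse regression (Li, 1991): slice on the order of y,
    average the whitened predictors within each slice, and take the
    leading eigenvectors of the between-slice covariance."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    A = np.linalg.inv(np.linalg.cholesky(cov)).T   # whitening: Cov((X - mu) A) = I
    Z = (X - mu) @ A
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)       # weighted covariance of slice means
    _, vecs = np.linalg.eigh(M)                    # eigenvalues in ascending order
    B = A @ vecs[:, ::-1][:, :n_dirs]              # map back to the original X scale
    return B / np.linalg.norm(B, axis=0)

# toy single-index model: y depends on X only through one direction beta
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = (X @ beta) ** 3 + 0.5 * rng.standard_normal(2000)
b_hat = sir_directions(X, y)[:, 0]
print(abs(b_hat @ beta))   # near 1 when the direction is recovered
```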



A binary response transformation-expectation estimation in dimension reduction
Lixing Zhu, Hong Kong Baptist University, Hong Kong

Slicing estimation is one of the most popular methods in the sufficient dimension reduction area. However, the efficacy of slicing estimation for many inverse regression methods depends heavily on the choice of the number of slices when the response variable is continuous. The problem is similar to, but more difficult than, classical tuning-parameter selection in nonparametric function estimation. Thus, how to select the slice number is a long-standing and still open problem. In this paper, we propose a binary response transformation-expectation (BRTE) method. It completely avoids selecting the number of slices while preserving the integrity of the original central subspace. This generic method also ensures the root-$n$ consistency and asymptotic normality of slicing estimators for many inverse regression methods, and can be applied to multivariate response cases. Finally, BRTE is compared with existing estimators through extensive simulations and an illustrative real data example.




Central limit theorem for linear spectral statistics of large dimensional F matrix
Shurong Zheng, Northeast Normal University, China

A central limit theorem (CLT) for linear spectral statistics (LSS) of the product of a large dimensional sample covariance matrix and a nonnegative definite Hermitian matrix was established in Bai and Silverstein (2004). However, their results do not cover the product of one sample covariance matrix and the inverse of another, independent covariance matrix (the F matrix). This is because, for the F matrix, their CLT established the asymptotic normality of the difference of two dependent statistics: one defined by the empirical spectral distribution (ESD) of the F matrix and the other by the ESD of the inverse of the second sample covariance matrix. In many applications of the F matrix, however, one is interested in making statistical inference for a parameter defined by the limiting spectral distribution (LSD) of the F matrix, and hence in the asymptotic distribution of the difference between that parameter and the estimator defined by the LSS of the F matrix. In this paper, we establish the CLT for LSS of the F matrix. As a consequence, we also establish the CLT for LSS of the beta matrix.
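
The object of study can be simulated directly. The sketch below (dimensions and sample sizes are arbitrary illustrative choices) builds an F matrix from two independent sample covariance matrices and evaluates one linear spectral statistic, the average log-eigenvalue, over its empirical spectral distribution:

```python
import numpy as np

# F matrix: S2^{-1} S1 with S1, S2 independent sample covariance
# matrices, p growing proportionally with both sample sizes
rng = np.random.default_rng(1)
p, n1, n2 = 100, 300, 400                    # p/n1 = 1/3, p/n2 = 1/4
X1 = rng.standard_normal((p, n1))
X2 = rng.standard_normal((p, n2))
S1 = X1 @ X1.T / n1
S2 = X2 @ X2.T / n2

# the spectrum of S2^{-1} S1 equals that of L^{-1} S1 L^{-T} where
# S2 = L L^T, a symmetric matrix, so the eigenvalues are real positive
L = np.linalg.cholesky(S2)
Linv = np.linalg.inv(L)
eigs = np.linalg.eigvalsh(Linv @ S1 @ Linv.T)

# one linear spectral statistic: the average of f(x) = log(x) over the
# ESD; the CLT in the abstract concerns the fluctuations of such sums
lss_log = np.mean(np.log(eigs))
```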

Key words and phrases: Linear spectral statistics, central limit theorem, large dimensional random matrix, large dimensional data analysis.



Clustering curves via subspace projection
Jeng-Min Chiou, Institute of Statistical Science, Academia Sinica, Taiwan

This study considers a functional clustering method, k-centers functional clustering, for random curves. The k-centers approach accounts for differences in both the means and the modes of variation among clusters, and predicts cluster memberships via projection and reclassification. The distance measures considered, which are embedded in the clustering criteria, include the L2 distance and the functional correlation defined in this study. The cluster membership predictions are based on nonparametric random effect models of the truncated Karhunen-Loeve expansion, coupled with a nonparametric iterative mean and covariance updating scheme. The properties of the proposed clustering methods reveal the qualities of the resulting clusters. Simulation studies and practical examples illustrate the performance of the proposed methods.
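
As a rough point of reference only (not the k-centers method, which additionally models cluster-specific modes of variation and reclassifies via projection), curves observed on a common grid can be clustered under the discretized L2 distance with plain k-means; all data and settings below are an illustrative toy:

```python
import numpy as np

def kmeans_curves(Y, k, n_iter=25):
    """Plain k-means on curves sampled on a common grid, with the
    squared L2 distance approximated by the squared Euclidean
    distance between the sampled values."""
    centers = Y[:: max(1, len(Y) // k)][:k].copy()   # deterministic, spread-out init
    for _ in range(n_iter):
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.vstack([Y[labels == j].mean(axis=0)
                             if np.any(labels == j) else centers[j]
                             for j in range(k)])
    return labels

# two groups of noisy curves with different mean shapes
rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
curves = np.vstack([
    np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal((30, t.size)),
    np.cos(2 * np.pi * t) + 0.2 * rng.standard_normal((30, t.size)),
])
labels = kmeans_curves(curves, k=2)
```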




Nonlinear dimension reduction with kernel methods
Su-Yun Huang, Institute of Statistical Science, Academia Sinica, Taiwan

Dimension reduction has long been an important technique for high-dimensional data analysis. Principal component analysis (PCA), canonical correlation analysis (CCA), and sliced inverse regression (SIR) are important tools in classical statistical analysis for linear dimension reduction. In this talk we introduce their nonlinear extensions using kernel methods.

The essence of kernel-based nonlinear dimension reduction is to map the pattern data, originally observed in Euclidean space, into a high-dimensional Hilbert space, called the feature space, by an appropriate kernel transformation. Low-dimensional projections of the high-dimensional feature data are approximately elliptically contoured and approximately Gaussian distributed. The notions of PCA, CCA and SIR can be extended to the framework of the kernel-associated feature Hilbert space, known as a reproducing kernel Hilbert space, for nonlinear dimension reduction. Computing algorithms, including the handling of large data, and numerical examples will be presented.
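
A minimal sketch of one such extension, PCA in the feature space of a Gaussian kernel, on the standard concentric-circles example; all parameter choices below (kernel bandwidth, radii, sample sizes) are illustrative:

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=0.2):
    """PCA in the feature space of a Gaussian (RBF) kernel:
    double-center the Gram matrix (centering about the feature-space
    mean) and eigen-decompose it."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * sq)                        # Gram matrix
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(J @ K @ J)         # ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]
    # scale eigenvectors by sqrt(eigenvalue) to get feature-space scores
    return vecs[:, :n_components] * np.sqrt(np.clip(vals[:n_components], 0, None))

# two concentric circles: not linearly separable, but the leading
# kernel principal component separates the two radii
rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.repeat([1.0, 4.0], 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
scores = kernel_pca(X)
```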




Variable selection and coefficient estimation via regularized rank regression
Chenlei Leng, National University of Singapore

The penalized least squares method, with an appropriately defined penalty, is widely used for simultaneous variable selection and coefficient estimation in linear regression. However, least squares (LS) based methods may be adversely affected by outlying observations and heavy-tailed distributions.
On the other hand, the least absolute deviation (LAD) estimator is more robust, but may be inefficient for many distributions of interest.
To overcome these issues, we propose a novel method, termed the regularized rank regression estimator, that combines the LAD and penalized LS methods for variable selection. We show that the proposed estimator has attractive theoretical properties and is easy to implement.
Simulations and real data analysis both show that the proposed method performs well in finite samples.
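
The rank-based estimator itself is not specified in the abstract; as a hedged illustration of the general idea of robust penalized regression, here is a toy LAD-lasso fit by fixed-step subgradient descent on simulated heavy-tailed data (all names and tuning values are ours, not the authors'):

```python
import numpy as np

def lad_lasso(X, y, lam=0.05, lr=0.01, n_iter=5000):
    """Subgradient descent for the LAD-lasso objective
        (1/n) * sum_i |y_i - x_i'b| + lam * ||b||_1,
    a toy robust alternative to penalized least squares."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ b
        g = -(X.T @ np.sign(r)) / n + lam * np.sign(b)   # a subgradient
        b -= lr * g
    return b

# heavy-tailed noise: plain penalized least squares would be dragged
# around by the outliers, while the LAD loss is insensitive to them
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 6))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.standard_t(df=1.5, size=200)
b_hat = lad_lasso(X, y)
```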





Dimension reduction for unsupervised and partially supervised learning
Debasis Sengupta, Indian Statistical Institute, India

Machine learning is often attempted through clustering and/or classification of multidimensional input data. While classification and clustering are used in supervised and unsupervised learning, respectively, there are also clustering problems in partially supervised learning, where the classes represented in the training data are far from exhaustive. In all these cases, the problem of high dimensionality has to be addressed. We consider dimension reduction for clustering on the basis of a mixture model, where observations are normally distributed around a cluster center, and the cluster centers themselves have a multivariate normal distribution. We propose an intuitively appealing objective function for this problem, and work out a solution in the cases of unsupervised and partially supervised clustering.
We apply the methods to the problem of pug-mark-based estimation of the total tiger population, and to that of clustering organisms in terms of the tetranucleotide content pattern of ribosomal DNA sequences.





Spectra of large dimensional random matrices (LDRM)
Arup Bose, Indian Statistical Institute, India

We shall consider (square) matrices with random entries (real or complex), such as the sample variance-covariance matrix, the IID matrix, the Wigner matrix and the Toeplitz matrix, where the dimension grows to infinity. Properties of the eigenvalues of such matrices are of interest.

In this talk we will mostly look at real symmetric matrices and discuss in a broad way the limiting spectral distribution (LSD) of these matrices under suitable conditions.

We shall provide some simulations with these matrices, give a loose description of some results on the LSD, and pose some questions that should be of interest to statisticians and probabilists.





RKHS formulations of some functional data analysis problems
Tailen Hsing, University of Michigan, USA

We discuss inference for two problems in functional data analysis, canonical correlation analysis and regression. The common approach defines the canonical variables or regressors in terms of projections in a Hilbert space. While this is conceptually straightforward, it has a number of weaknesses. We describe an approach that does not require the specification of a Hilbert space, which leads to more general theory and inference procedures.





Supervised singular value decomposition and its application to independent component analysis for fMRI
Young Truong, The University of North Carolina, USA

Functional magnetic resonance imaging (fMRI) has been used by neuroscientists as a powerful tool to study brain function. Independent component analysis (ICA) is an effective method for exploring spatio-temporal features in fMRI data. It has been especially successful in recovering brain-function-related signals from recorded mixtures of unrelated signals. Owing to the high sensitivity of MR scanners, spikes are commonly observed in fMRI data, and they degrade the analysis. No particular method exists yet to address this problem. In this paper, we introduce a supervised singular value decomposition technique into the data reduction step of ICA. Two major advantages are discussed: first, the proposed method improves the robustness of ICA against spikes; second, it uses the particular fMRI experimental design to guide the otherwise fully data-driven ICA, making the computation more efficient. The advantages are demonstrated through a spatio-temporal simulation study as well as a real data analysis. This is a joint work with Bai, P., Shen, H. and Huang, X.
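
The standard (unsupervised) SVD data-reduction step that the proposed method modifies can be sketched as follows; the simulated sources, mixing matrix and dimensions are all illustrative, and the supervised guidance by the experimental design is not shown:

```python
import numpy as np

# simulate a time-by-voxel fMRI-like matrix: two latent time courses
# mixed into 500 voxels, plus noise
rng = np.random.default_rng(5)
t = np.arange(200)
sources = np.vstack([np.sin(0.1 * t),                # 2 latent signals
                     np.sign(np.sin(0.05 * t))])
mixing = rng.standard_normal((500, 2))               # spatial maps
data = sources.T @ mixing.T + 0.1 * rng.standard_normal((200, 500))

# data reduction: keep the top singular subspace of the centered
# matrix; ICA is then run on the low-dimensional time courses
Xc = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]                           # reduced time courses
explained = (s[:k] ** 2).sum() / (s ** 2).sum()      # fraction of variance kept
```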





Model selection, dimension reduction and liquid association: a trilogy via Stein’s lemma
Ker-Chau Li, Institute of Statistical Science, Academia Sinica, Taiwan and
University of California, Los Angeles, USA

In this talk, I will describe how a basic idea from Stein’s monumental work in decision theory has led to my earlier research in model selection (generalized cross validation, honest confidence region), dimension reduction (sliced inverse regression and principal Hessian direction) and more recently in the development of liquid association for bioinformatics applications.

Li, K. C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist. 13, 1352-1377.
Li, K. C. (1992). On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J. Amer. Statist. Assoc. 87, 1025-1039.
Li, K. C., Palotie, A., Yuan, S., Bronnikov, D., Chen, D., Wei, X., Choi, O., Saarela, J. and Peltonen, L. (2007). Finding candidate disease genes by liquid association. Genome Biology 8, R205. doi:10.1186/gb-2007-8-10-r205






Functional mixture regression
Thomas Lee, The Chinese University of Hong Kong

This talk introduces functional mixture regression (FMR), a natural and useful extension of the classical functional linear regression (FLR) model. FMR generalizes FLR in essentially the same way as linear mixture regression generalizes linear regression: the observed predictor processes are allowed to form sub-groups, each with its own regression parameter function. Both the theoretical and empirical properties of FMR will be discussed.

This is joint work with Yuejiao Fu and Fang Yao.