(2 - 31 January 2004)


Considerations on sample classification and gene selection with microarray data using machine learning approaches
Xuegong Zhang, MOE Key Lab of Bioinformatics/ Department of Automation, Tsinghua University, China

With the advance of microarray techniques, high expectations have been placed on better sample classification (such as the classification of disease vs. normal, or of subtypes of a cancer) at the molecular level with microarray data. The combination of high dimensionality (typically thousands of genes) and small sample sizes (typically tens or hundreds of cases) makes this task very challenging. The complexity of the diseases, the poor understanding of the underlying biology and the imperfection of the data make the problem even harder. With examples on the classification of lymph-node metastasis status and ER status of breast cancers, and on the well-known leukemia data sets, this talk will introduce an SVM-based strategy for sample classification and gene selection (named R-SVM) and the observations made in the experiments. However, the emphasis of the talk will be on some general opinions on the task of sample classification and gene selection, some possible pitfalls therein, and some considerations on the general strategy.

Normalization for cDNA microarray experiments having many differentially expressed genes
I-Shou Chang, National Health Research Institute, Taiwan

This talk discusses two normalization methods for cDNA microarray data in which a substantial proportion of genes differ in expression between the two mRNA samples, or in which there is no symmetry in the expression levels of up- and down-regulated genes. The first method concerns the situation in which there are no control DNA sequences on the slide. The first step of this approach is to perform global normalization based on dye-swap experiments and then to use a statistical criterion to select a set of (almost) constantly expressed genes. Based on this set, intensity-dependent normalization is carried out using a local regression method. The usefulness of this method is demonstrated in simulation studies and in the analysis of real data sets. In particular, the simulation studies show that this method identifies genes with a lower false positive rate and a lower false negative rate than a commonly used method when a large number of genes are turned up or down. The second method concerns the situation in which there are control sequences on the slide. Calibration curves relating fluorescence signal intensities to gene expression levels are considered in the context of Bayesian isotonic regression, which makes use of smooth priors on Bernstein polynomials and Markov chain Monte Carlo methods to study the isotonic regression problem. The second method is applied to identify early onset genes in a transcriptional profiling study of Autographa californica multiple polyhedrosis virus.

A new web-based mouse phenotype analysis system (MPHASYS) to integrate molecular and pathophysiological end points of aging
Jan Vijg, University of Texas Health Science Center at San Antonio, USA

Progress in the science of aging is largely driven by the use of model systems, ranging from yeast and nematodes to mice. Mouse models are especially suitable for studying the complexities of aging in view of their short evolutionary distance to humans, their comparable genome size and the recently emerged opportunities for altering genetic pathways considered to be critically important in determining aging phenotypes and life span. The study of such mutants, however, has been hampered by the lack of objective standards embedded in new information science and data management technologies for the comparative analysis of pathophysiological characteristics over their life span. The severity of this problem, exacerbated by the rapid increase in the number of mouse mutants, is increased by orders of magnitude by the emergence of tools for global molecular characterization, such as RNA and protein profiling using microarrays. Hence, new databases and integrated tool sets are needed to bring together biological information obtained at the molecular, cellular, organ and system level of the various mouse models, in order to understand the functional interactions that underlie a genetic alteration. Here we describe the creation and use of MPHASYS, a new, web-based mouse phenotype analysis system. MPHASYS includes: (1) a pathology ontology, describing clinical observations, gross pathology, anatomy and histopathology; (2) an objective pathology data entry system; and (3) transparent query and data analysis systems that can interact with currently available standards for molecular database systems, such as microarrays. This “federated mouse database” provides a solid basis for downstream data analysis of the occurrence of adverse biological effects in cohorts of aging mice.
As an example of the use of MPHASYS, a comparative analysis is presented of pathophysiological and gene expression endpoints, related to aging, in cohorts of mutant mice with defects in genome maintenance systems.

Spot shape modelling and saturated pixels in microarrays
Mats Rudemo, Chalmers University of Technology, Sweden

To be able to study lowly expressed genes in microarray experiments it is useful to increase the gain in the scanning. However, a large gain may cause some pixels for highly expressed genes to become saturated.

Techniques for adjustment of highly expressed signal intensities are given by Wit and McClure (2003) based on a small set of available spot summaries such as spot mean, spot median and spot variance. As mentioned by Wit and McClure it should be possible to get more accurate adjustments when all pixel values are available. In the present project we study spatial statistical models for pixel values which should enable such adjustments.

A convenient type of modelling is to transform data to become approximately Gaussian distributed with a mean value function determined by gene intensities and spot shapes and a corresponding covariance function. For such models censored pixel values can be optimally estimated. We study different types of transformations, spot shapes and covariance functions. The transformations include logarithmic and power transforms with an offset and the inverse hyperbolic sine transform of Huber et al. (2002) and Durbin et al. (2002). The spot shapes include three types suggested in Wierling et al. (2002): (i) an isotropic 2D Gaussian distribution, (ii) a crater spot distribution consisting of a difference between two scaled isotropic 2D Gaussian distributions and (iii) a plateau spot distribution. An additional model with a polynomial-hyperbolic spot shape is introduced which gives a considerably improved performance for the dataset studied.
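
As an illustration of the first two spot shapes listed above, the following minimal Python sketch evaluates an isotropic 2D Gaussian spot and a crater spot built as the difference of two scaled isotropic Gaussians. The function names, parameter names and numerical values are hypothetical and chosen only for illustration; they are not taken from Wierling et al. or from the talk.

```python
import math


def gaussian_spot(x, y, amplitude, sigma):
    """Isotropic 2D Gaussian spot centred at the origin."""
    return amplitude * math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))


def crater_spot(x, y, amp_outer, sigma_outer, amp_inner, sigma_inner):
    """Crater shape: a wide Gaussian minus a narrower, scaled Gaussian."""
    return (gaussian_spot(x, y, amp_outer, sigma_outer)
            - gaussian_spot(x, y, amp_inner, sigma_inner))


# Illustrative (assumed) parameter values: the intensity dips at the
# spot centre and is higher on the surrounding rim.
center = crater_spot(0.0, 0.0, 1.0, 3.0, 0.6, 1.0)
rim = crater_spot(2.0, 0.0, 1.0, 3.0, 0.6, 1.0)
```

With these values the modelled intensity at the centre is lower than on the rim, which is the qualitative feature that distinguishes a crater spot from a plain Gaussian spot.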

The models are applied to the analysis of a dataset obtained with a specially designed 50mer oligonucleotide microarray. Here 452 selected genes in transgenic Arabidopsis plants are compared to the corresponding genes in wild-type plants. Data include scans with different gains ranging from no saturation to heavy saturation. This is joint work with Claus Ekstrom, Charlotte Kristensen and Soren Bak, Copenhagen.

Error modeling, data transformation and robust calibration for microarray data
Anja von Heydebreck, Max Planck Institute for Molecular Genetics, Germany

Microarray gene expression measurements are affected by a number of variable experimental conditions, e.g. in sample preparation, labelling, and hybridisation, which lead to systematic and stochastic variation in the data. Normalization tries to correct for the systematic experimental variation, whereas error models are used to represent the remaining stochastic variation. In replicate microarray data, one often observes a systematic intensity-dependence of the variances of log-transformed intensities. Thus the significance of a measured fold change depends on the intensity level at which it was observed.

We show how a variance-stabilizing transformation can be derived from a simple error model for microarray data. For large intensities, the transformation coincides with the usual logarithmic transformation, such that expression differences can still be interpreted in terms of fold changes. For small intensities, the transformation diminishes the fluctuation of the intensities that is usually visible in log-transformed data.
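
The behaviour described above can be checked numerically. The sketch below assumes, purely for illustration, that the transformation has an inverse hyperbolic sine (arsinh) form; the abstract itself does not state the functional form, and the offset and scale parameters are hypothetical.

```python
import math


def vst(x, offset=0.0, scale=1.0):
    """Assumed arsinh-type variance-stabilizing transform (illustrative)."""
    return math.asinh(offset + scale * x)


def log_fold_change(x1, x2):
    """Difference of natural logarithms, i.e. the usual log fold change."""
    return math.log(x1) - math.log(x2)


def vst_difference(x1, x2, offset=0.0, scale=1.0):
    """Difference of transformed intensities."""
    return vst(x1, offset, scale) - vst(x2, offset, scale)


# Large intensities: the transformed difference is very close to the
# log fold change.
large_gap = abs(vst_difference(20000.0, 10000.0)
                - log_fold_change(20000.0, 10000.0))

# Small intensities: the transformed difference is compressed relative
# to the log fold change, and the transform stays finite at zero.
small_vst = vst_difference(2.0, 1.0)
small_log = log_fold_change(2.0, 1.0)
at_zero = vst(0.0)
```

Under this assumed form, differences at high intensity can still be read as log fold changes, while differences near zero are damped rather than inflated.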

Using a parametric statistical model, we simultaneously estimate the parameters of the transformations for variance stabilization and normalization. A robust estimation technique is used in order to avoid a bias due to differentially expressed genes. In applications to benchmark datasets, this approach compares favorably to other normalization algorithms.

A comparison of microarray platforms
Darlene Goldstein, Swiss Institute for Experimental Cancer Research, Switzerland

Several different platforms are available for quantifying gene expression. My talk will introduce the technologies and present results of a study comparing some of these, including commercial arrays from Affymetrix and Agilent, in-house spotted cDNA arrays, and MPSS methodology. Advantages and disadvantages of the platforms will be discussed, as well as assessments of reproducibility within and agreement between technologies. Comparisons with results of quantitative PCR are currently in progress and will also be reported if available.

GPmerge – a computing program for cDNA microarray raw data processing
Jinming Li, Nanyang Technological University, Singapore

Microarray experiments generate millions of data points, but these data are useful only when biologically meaningful information can be extracted from them. This involves many facets of data processing, statistical analysis and data visualization.

Novel computational tools and reliable data processing procedures are essential for the meaningful and accurate interpretation of microarray data. However, current computing tools are inefficient and sometimes even unreliable, so the development of novel algorithms and approaches for microarray data processing is a challenge for bioinformaticians. We will introduce GPmerge, a computer program developed by us that can be used to process the microarray raw data generated by image analysis software such as Axon GenePix Pro or Quantarray.

It is widely accepted that any single microarray output is subject to substantial variability. By pooling data from replicates, we can provide a more reliable classification of gene expression. Designing experiments with replication will greatly reduce misclassification rates. The development of GPmerge reflects our efforts toward pooling replicated data sets generated by image analysis software, so that a user can use the overall information provided by these replicated microarray slides.

Building genetic networks in gene expression patterns
Eric Fung Siu Leung, University of Hong Kong

Building genetic regulatory networks from time series data of gene expression patterns is an important and useful topic in bioinformatics. Probabilistic Boolean Networks (PBNs) were recently developed as a model of gene regulatory networks. PBNs are able to cope with uncertainty, incorporate rule-based dependencies between genes and discover the relative sensitivity of genes in their interactions with other genes. However, in the absence of prior knowledge about the nature of the predictors (i.e., which set of predictors exists), PBNs are unlikely to be used in practice because of the huge number of possible predictors and their corresponding probabilities. In this paper we propose a multivariate Markov chain model to model the dynamics of a genetic network for gene expression patterns. One of the contributions here is to preserve the strengths of PBNs while reducing the complexity of the networks: the number of states and parameters of the proposed model is linear with respect to the number of genes. We also develop an efficient estimation method for the model parameters. Numerical examples with applications to yeast data are given to illustrate the effectiveness of the model.
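
To make the idea of a multivariate Markov chain concrete, the sketch below implements one update step of a commonly used formulation in which the next state distribution of each gene is a weighted mixture of transition matrices applied to the current distributions of all genes. This formulation, the function names and all numerical values are assumptions made for illustration; the abstract does not specify the exact form of the proposed model.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * vector[k] for k in range(len(vector)))
            for row in matrix]


def step(dists, trans, weights):
    """One step of an assumed multivariate Markov chain.

    dists[k]      -- current state distribution of gene k
    trans[j][k]   -- column-stochastic matrix from gene k's state to gene j's
    weights[j][k] -- non-negative mixing weights; each row sums to 1
    """
    n_genes = len(dists)
    new_dists = []
    for j in range(n_genes):
        acc = [0.0] * len(dists[j])
        for k in range(n_genes):
            contrib = mat_vec(trans[j][k], dists[k])
            acc = [a + weights[j][k] * c for a, c in zip(acc, contrib)]
        new_dists.append(acc)
    return new_dists


# Two genes, two expression states (off, on); values are hypothetical.
P = [[0.9, 0.2],
     [0.1, 0.8]]
trans = [[P, P], [P, P]]
weights = [[0.7, 0.3],
           [0.4, 0.6]]
dists = [[1.0, 0.0], [0.0, 1.0]]
new_dists = step(dists, trans, weights)
```

Each updated vector remains a probability distribution because the weights in a row sum to one and every matrix is column-stochastic.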

Bayesian hierarchical modelling of gene expression data
Sylvia Richardson, Imperial College, UK

We show how Bayesian hierarchical modelling strategies can be usefully applied to gene expression data for signal extraction and differential expression and carry out joint estimation of model parameters in a full Bayesian framework, using MCMC techniques.

For signal extraction from Affymetrix GeneChip data at the gene-probe level, the proposed models use both perfect match (PM) and mismatch (MM) intensities. They include background and cross-hybridization terms and allow for part of the MM intensity being the true signal. At the gene level, we pool information across probe sets and over repeated measurements, to obtain gene expression measurements under given conditions.

For investigating differential gene expression, we propose a flexible method for choosing a list of genes for further investigation based on a model of the sources of variability of the experimental set-up. We give empirical evidence that expression-level dependent array effects are frequently needed, and explore different non-linear functions as part of a model-based approach to normalisation. The model includes gene-specific variances but imposes some necessary shrinkage through a hierarchical structure. Model criticism via posterior predictive checks is discussed. To choose a list of genes, we propose to combine various criteria (for instance, fold change and overall expression) into a single indicator variable for each gene. The posterior distribution of these variables is then used to pick the list of genes, thereby taking into account uncertainty in parameter estimates.

Lewin A, Richardson S, Marshall C, Glazier A and Aitman T (2003) Bayesian Modelling of Differential Gene Expression, submitted for publication, available at

A practical projected clustering algorithm for gene expression profiles
Kevin Yip, The University of Hong Kong

In gene expression data, clusters can be found in subspaces in which a set of related genes have similar expression patterns in a set of samples. Traditional clustering algorithms may fail to identify such clusters as the expression patterns of the cluster members are not similar in the full input space. A number of algorithms have been proposed to identify clusters in subspaces, but most of them require the input of some parameter values that are hard for users to determine. In this talk, I will introduce a new projected clustering algorithm that dynamically adjusts its internal thresholds in response to the clustering status. This allows the algorithm to avoid using any hard-to-determine parameters, which simplifies the analysis of complex gene expression data. I will present the experimental results on some synthetic and real datasets to show that the algorithm is able to identify projected clusters that make both statistical and biological sense.

Hidden Markov modelling of genomic interactions
Ernst Wit, University of Glasgow, UK

Microarray technology has made the simultaneous measurement of gene transcription a routine activity. Although gene transcription is only one stage in the complex genomic process of living organisms, it gives a fascinating insight into one aspect of this activity across the whole genome. Gene regulation is a complex biological process, which involves gene-gene and gene-protein interactions. An operator region, to which the enzyme polymerase can bind to start transcription, precedes the gene sequence. Such local features regulating transcription pose the question of whether there might be local spatial gene interactions.

We define a Hidden Markov Model (HMM) to relate the observed expression levels to hidden states "Up", "Down" and "Same" for a time-series gene expression dataset. A Potts model is used to describe the interactions between neighboring states. A typical problem in this type of model is the estimation of the hidden parameters, because the normalizing constant is intractable. Recent work by Pettitt et al. (2002) suggests a way to avoid the use of pseudolikelihood and to solve this issue for a wide class of HMMs.

This is joint work with N. Friel (University of Glasgow).

Hierarchical Bayesian modelling of multiple arrays experiments
Annibale Biggeri, University of Florence, Italy

We propose a hierarchical Bayesian model for the simultaneous analysis of replicated gene expression profiles in a standard reference design. Gene expression is modelled as the sum, on an appropriate scale, of fixed terms representing different sources of variation (e.g. pin effect or dye effect for normalization, and one or more parameters to describe the effect of an experimental factor). Treatment effects are represented by a set of parameters whose prior distribution is a mixture of three components. This corresponds to the definition of a discrete latent variable with three possible states (labelling a given gene as under-expressed, over-expressed or not differentially expressed with respect to the reference sample). The Bayesian approach uses all the information collected to make inference and allows estimation of the posterior probability of each single gene being differentially expressed. All the sources of variation are modelled in a common and consistent framework, avoiding the need for multiple distinct steps of analysis (for example, normalization and testing).
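
The role of the three-state latent variable can be illustrated with a much simplified calculation. The sketch below assumes a fixed mixture of three normal components with known weights, means and standard deviations and applies Bayes' rule to a single treatment-effect value; the full hierarchical model estimates all of these quantities jointly, and every name and number here is hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def state_posteriors(effect, weights, means, sds):
    """Posterior probability of each mixture component given one effect."""
    joint = [w * normal_pdf(effect, m, s)
             for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]


STATES = ("under-expressed", "not differentially expressed", "over-expressed")
weights = (0.05, 0.90, 0.05)   # assumed prior state probabilities
means = (-2.0, 0.0, 2.0)       # assumed component means
sds = (1.0, 0.3, 1.0)          # assumed component standard deviations

post = state_posteriors(2.5, weights, means, sds)
prob_de = post[0] + post[2]    # probability of being differentially expressed
post_null = state_posteriors(0.0, weights, means, sds)
```

A gene with a large positive effect receives a posterior probability of differential expression close to one, while a gene with an effect near zero is assigned mostly to the middle state.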

We applied the model to some cDNA microarray experiments designed to evaluate signal variation related to temperature of hybridization on Saccharomyces cerevisiae, diet effects in rat carcinogenesis experiments, and human gene expression profiles from patients affected by dysplastic disease.

More on the analysis of time-course microarray data
Terry Speed, UC Berkeley, USA and WEHI, Australia

Time course microarray data sometimes involve autocorrelation, that is, there are biological reasons for expecting expression measurements at one time to be correlated with expression measurements at nearby times. Sometimes time course microarray data are replicated. And frequently, time course cDNA microarray data are collected with one channel being a common reference mRNA source.

In this talk I discuss replicated time course cDNA microarray data with autocorrelation, measured relative to a common reference. The question I focus on is this: given such replicated time course data, one series for each spot on the slide, how can we best determine which genes have constant and which genes have varying (relative) expression levels across the times?

One common approach in longitudinal data problems like this is to carry out F-tests, treating the times as levels of a one-way classification. This approach ignores any autocorrelation which may be present, and simply compares between-time to within-time variation. An alternative approach is to treat this as a multivariate problem, seeking a likelihood ratio test of the null hypothesis that a (vector) mean is constant, against the alternative that it is not constant. A difficulty here is that we typically have more times than we do replicates, so estimating a covariance matrix for the replicate series is problematic, but a variety of solutions to this difficulty suggest themselves. A third strategy is to extend a familiar empirical Bayes approach to the scalar version of the same problem.
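
The first of these approaches can be written down in a few lines. The following sketch computes the one-way F-statistic for a single gene, treating each time point as a level and the replicates as observations within that level; the function name and the example values are made up for illustration, and, as noted above, the statistic ignores any autocorrelation between times.

```python
def one_way_f(groups):
    """One-way ANOVA F-statistic.

    groups -- list of lists, one per time point, holding the replicate
              measurements of a single gene at that time.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-time variation: spread of the time means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-time variation: spread of replicates around their time mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical gene: three time points, three replicates each.
f_stat = one_way_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

For these values the between-time mean square is 13 and the within-time mean square is 1, so the statistic equals 13.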

In the talk, which represents work in progress carried out jointly with Yu Chuan Tai of the Program in Biostatistics at UC Berkeley, I'll explain these approaches and compare them on a data set with the characteristics described.

Directed indices for exploring gene expression data
Charles Kooperberg, Fred Hutchinson Cancer Research Center

Expression studies with clinical outcome data are becoming available for analysis. An important goal is to identify genes or clusters of genes where expression is related to patient outcome. While clustering methods are useful data exploration tools, they do not directly allow one to relate the expression data to clinical outcome. Alternatively, methods which rank genes based on their univariate significance do not incorporate gene function or relationships to genes that have been previously identified. We consider a gene index technique that generalizes methods that rank genes by their univariate associations to patient outcome. Genes are ordered based on simultaneously linking their expression both to patient outcome and to a specific gene of interest. The technique can also be used to suggest profiles or means of bundles of gene expression related to patient outcome. The methods are illustrated on a gene expression data set based on patients with Diffuse Large Cell Lymphoma.

This is joint work with Michael LeBlanc.

The analysis of proteomics spectra from serum samples
Keith Baggerly, M.D. Anderson Cancer Center

Just as microarrays allow us to measure the relative RNA expression levels of thousands of genes at once, mass spectrometry profiles can provide quick summaries of the expression levels of hundreds of proteins. Using spectra derived from easily available biological samples such as serum or urine, we hope to identify proteins linked with a difference of interest such as the presence or absence of cancer. In this talk, we will briefly introduce two of the more common mass spectrometry techniques, matrix-assisted laser desorption and ionization/time of flight (MALDI-TOF) and surface-enhanced laser desorption and ionization/time of flight (SELDI-TOF). We then describe two case studies, one using each of the above techniques. While we do uncover some structure of interest, aspects of the data clearly illustrate the need for careful experimental design, data cleaning, and data preprocessing to ensure that the structure found is due to biology. Time permitting, we will then discuss further examples using data collected at MD Anderson, in some cases illustrating that these lessons have been learned.

Analyzing data from a splice array experiment
Jean Yee Hwa Yang, University of California, San Francisco

Splice-specific microarrays provide a basis for investigating the effect of mutations and other factors on splicing events in the creation of mature mRNA. This talk will illustrate various statistical design and analysis issues arising from a study aimed at detecting differential gene expression between selected spliceosome mutants. The data feature an unbalanced, nested design with a minimal degree of replication. I will begin the talk with a brief overview of the splice array technology and discuss potential methods for synthesizing results from various approaches. The design of these arrays also provides a platform for comparing the performance of different normalization methods.

Unsupervised determination of gene significance in time-course microarray data
Radha Krishna Murthy Karuturi, Genome Institute of Singapore

Motivation: If a significant portion of the temporal induction-repression expression pattern of a gene is abundant among the other genes in a time-course data set, this is an indication of its non-randomness. The significance of the portion that matches between two gene profiles can be derived using binomial analysis and/or its variants. Considering the induction-repression pattern alone is both meaningful and significant, since related genes induced or repressed in a given period may not show exactly the same shape of induction or repression. Further, microarray measurements are of low quality, which might make the expression patterns of related genes less similar. Based on this observation, we developed an approach called friendly neighbors (FNs). In this approach, the significance score of a gene is the number of genes in the same experiment that share its induction-repression pattern beyond a certain threshold.
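
A rough sketch of this scoring idea follows. It reduces each profile to the signs of its successive changes, counts matching signs between two genes, and uses a binomial tail probability to decide whether the match is significant. The match probability of 0.5, the significance threshold and all function names are assumptions made for this illustration and are not details of the FNs method as presented in the talk.

```python
import math


def pattern(profile):
    """Induction (+1) / repression (-1) / no change (0) between time points."""
    return [1 if b > a else -1 if b < a else 0
            for a, b in zip(profile, profile[1:])]


def matches(p, q):
    """Number of steps at which both genes are induced or both repressed."""
    return sum(1 for a, b in zip(p, q) if a == b and a != 0)


def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p); p = 0.5 is an assumed null value."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def friendly_neighbor_scores(profiles, alpha=0.05):
    """For each gene, count the other genes with a significant pattern match."""
    pats = [pattern(p) for p in profiles]
    scores = []
    for i, pi in enumerate(pats):
        count = 0
        for j, pj in enumerate(pats):
            if i != j and binom_tail(matches(pi, pj), len(pi)) <= alpha:
                count += 1
        scores.append(count)
    return scores


# Hypothetical profiles over seven time points.
profiles = [
    [1, 2, 3, 2, 1, 2, 3],
    [0, 1, 5, 4, 2, 3, 9],   # same up/down pattern, different shape
    [3, 2, 1, 2, 3, 2, 1],   # opposite pattern
]
scores = friendly_neighbor_scores(profiles)
```

The first two genes share the same induction-repression pattern despite having differently shaped profiles, so each counts the other as a neighbor, while the third gene has none.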

Results: The FNs approach has been applied to discover putative estrogen target genes, to detect cell cycle regulated genes in S. cerevisiae, and to elicit the modes of expression of immune genes in SARS-infected samples. The new approach performed better than the paired t-test and simple expression-level-based filtering methods on estrogen target gene discovery. It also performed well on cell cycle regulated gene discovery in the absence of task-specific knowledge. Using the new approach, we discovered trends in the SARS-infected sample data that might not be elicited by typical hierarchical clustering.

Practical use of Bayesian mixture model for comparative microarray analyses in clinical oncology
Philippe Broët, Institut Curie and INSERM, France

Recent developments in transcriptome-oriented biotechnologies have made possible the comparative analysis of thousands of mRNA expression levels in parallel. Typically, these data consist of measurements of gene expression under various experimental or biological conditions that can potentially provide information on the complex transcriptional activity of the biological system under study. In parallel with the rapid development of these technologies, research into ways of identifying gene expression changes in microarray experiments while taking false conclusions into account has become an active area. Up to now, statistical procedures have mostly relied on the multiple testing framework in order to control false positive conclusions. In this framework, two quantities have been considered: the familywise error rate (FWER) and the false discovery rate (FDR). The latter criterion is now widely used for microarray analyses since it controls an error quantity which is relevant and leads to more powerful procedures than those relying on the FWER. In this spirit, important work has been done on estimating the FDR or the pFDR in a non-parametric mixture approach. However, a drawback of these procedures is that they focus only on protecting against false positive conclusions. In the exploratory and screening context of most microarray data analysis, investigators may be seriously concerned that such methods do not account for false negative results and lead to discarding too large a proportion of meaningful experimental information. Since in many cases complex biological pathways are of interest, it is difficult to envisage exploratory strategies which only protect against false positive results without controlling for false negative results.
As a large variation in gene expression does not necessarily translate into a major role in the biological process studied, and vice versa, genes with small variation in expression should not be discarded by a blind selection process. This is especially true for genome-wide microarray experiments which are followed by large-scale RT-PCR or custom microarrays focusing on specific pathways.
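
The difference in power between FWER and FDR control mentioned above can be seen in a small example. The sketch below uses two standard multiple testing procedures, the Bonferroni correction (FWER) and the Benjamini-Hochberg step-up procedure (FDR), as stand-ins; they are not the Bayesian mixture approach of this talk, and the p-values are invented for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices rejected while controlling the familywise error rate."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, q=0.05):
    """Indices rejected while controlling the false discovery rate at q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank whose p-value lies under the step-up line.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fwer_hits = bonferroni(pvals)
fdr_hits = benjamini_hochberg(pvals)
```

On these values the FDR procedure selects two genes where the FWER procedure selects one. Neither procedure says anything about the genes that were missed, which is the false negative concern raised above.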

In this presentation, we will consider the problem of detecting differentially expressed genes in multiclass response microarray experiments and of providing false discovery rate estimates for a defined subset of genes, to help the investigator in the gene selection process. Multiclass response (MCR) experiments correspond to a situation where there are more than two groups to be compared. Although this situation is frequently encountered in biomedical microarray studies, it has received less attention than the classical two-class comparison problem. For this purpose, we propose a mixture model-based approach built on a modified F-statistic that allows the identification of gene expression change profiles for MCR experiments. This new approach is based on a fully Bayesian mixture model that extends previous work on two-class comparison in microarray experiments. We illustrate its performance in estimating false discovery and non-discovery rates using simulated microarray data sets. The usefulness of this new approach will be illustrated on real data from a breast cancer study.

Understanding array CGH data
Jane Fridlyand, The Jain Lab, UCSF Cancer Center, USA

The development of solid tumors is associated with acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic derangement seen in tumors reflect underlying failures in maintenance of genetic stability, as well as selection for changes that provide growth advantage. In order to investigate genomic alterations we are using microarray-based comparative genomic hybridization (array CGH). The computational task is to map and characterize the number and types of copy number alterations present in the tumors, and so define copy number phenotypes as well as to associate them with known biological markers.

We discuss general analytical and visualization approaches applicable to array CGH data. We also use an unsupervised Hidden Markov Model approach to exploit the spatial coherence between nearby clones. The clones are partitioned into states which represent the underlying copy number of each group of clones. The method is demonstrated on primary melanoma data and on two cell line datasets, one of which has known copy number alterations. The biological conclusions drawn from the analyses are discussed.
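
The way an HMM uses spatial coherence between neighboring clones can be sketched with a Viterbi decoder. The example below assumes three states (loss, normal, gain) with fixed Gaussian emission means, a common standard deviation and a fixed probability of staying in the same state. The approach in the talk is unsupervised and estimates the states from the data, so the fixed parameters, names and log-ratio values here are purely illustrative assumptions.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def viterbi(obs, means, sd, stay_prob):
    """Most likely state path for log ratios ordered along the genome."""
    n_states = len(means)
    switch = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [[math.log(stay_prob if i == j else switch)
                  for j in range(n_states)] for i in range(n_states)]
    # Uniform initial distribution over states.
    score = [math.log(1.0 / n_states) + log_normal_pdf(obs[0], m, sd)
             for m in means]
    back = []
    for x in obs[1:]:
        pointers, new_score = [], []
        for j in range(n_states):
            best = max(range(n_states),
                       key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][j]
                             + log_normal_pdf(x, means[j], sd))
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical log2 ratios for eight consecutive clones.
obs = [0.02, -0.05, 0.03, 0.55, 0.61, 0.48, 0.04, -0.01]
path = viterbi(obs, means=(-0.5, 0.0, 0.5), sd=0.1, stay_prob=0.95)
```

Here state 1 stands for a normal copy number and state 2 for a gain, so the decoded path marks the three middle clones as a gained segment.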

Biological and practical issues
Mark Reimers, Karolinska Institute, Stockholm

Individual differences: some genes are more variable between individuals than others. Certain classes of genes are quite often regulated by large factors in a few individuals in comparison to the majority.

Scales for Analysis: what are the benefits and drawbacks of transforming the scale? The variability of most measures increases with the signal. Hence some sort of concave transforming function is often used, most commonly the logarithm. However in some cases this seems to actually hurt the analysis, as when the treatment down-regulates genes, and these are more variable.

Experimental consistency: the details of experiment setup make a huge difference to the results; often these differences can be detected at an early stage of the analysis.

Spatial effects on chips: although the idea is to have massively parallel measures, often the hybridization reaction proceeds differently on different regions of the chip. Sometimes this can be normalized.

Affymetrix low-level analysis
Mark Reimers, Karolinska Institute, Stockholm

The probe sets used by Affymetrix contain between 11 and 20 probes for each gene. Sometimes different probes map to different splice variants, but the aim has been to have a probe set that consistently matches one splice variant. However, the values of signal strength across samples differ greatly for probes in a single probe set, although the patterns are often similar. This suggests the use of a linear model to fit the probe affinities and the gene abundance estimates simultaneously. Several authors have presented such a measure. We’ll look at the two leading measures: dChip and RMA.
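
One simple way to fit an additive probe-plus-chip model on the log scale is Tukey's median polish, which is the summarization step used by RMA; dChip instead fits a multiplicative model on the original scale. The sketch below is a plain implementation of median polish for a single probe set. The function name, the fixed number of iterations and the toy data are assumptions made for illustration, and background correction and normalization are omitted.

```python
import statistics


def median_polish(table, n_iter=10):
    """Fit table[i][j] ~ overall + probe_effect[i] + chip_effect[j].

    table -- rows are probes, columns are chips (log-scale intensities).
    Returns (overall, probe_effects, chip_effects, residuals).
    """
    resid = [row[:] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for i in range(n_rows):
            med = statistics.median(resid[i])
            row_eff[i] += med
            resid[i] = [v - med for v in resid[i]]
        med = statistics.median(col_eff)
        overall += med
        col_eff = [c - med for c in col_eff]
        # Sweep out column (chip) medians.
        for j in range(n_cols):
            med = statistics.median(resid[i][j] for i in range(n_rows))
            col_eff[j] += med
            for i in range(n_rows):
                resid[i][j] -= med
        med = statistics.median(row_eff)
        overall += med
        row_eff = [r - med for r in row_eff]
    return overall, row_eff, col_eff, resid


# Toy probe set: three probes with affinities 0, +1, -1 on three chips
# with true log expression 5, 7 and 6.
table = [[5.0, 7.0, 6.0],
         [6.0, 8.0, 7.0],
         [4.0, 6.0, 5.0]]
overall, probe_eff, chip_eff, resid = median_polish(table)
expression = [overall + c for c in chip_eff]
```

On this exactly additive toy table the fit recovers the chip-level values 5, 7 and 6 and the probe affinities 0, +1 and -1 with zero residuals; on real data the residuals indicate probes or chips that do not follow the shared pattern.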

Practical issues in Affymetrix analysis
Mark Reimers, Karolinska Institute, Stockholm

The multi-chip methods for Affymetrix analysis seem better in principle and often seem to do better in practice. However, there are differences between them; some experiments have shown fairly consistent differences; others show interesting but unexplained systematic differences. Comparisons with spotted arrays run on the same test samples suggest some models are consistently better for certain purposes, but no model is uniformly best. Most multi-chip models assume that probes behave consistently. Some data sets suggest considerable variation in probe performance; this may reflect differences in non-specific hybridization between tissues.


Improvement of DNA microarray data analysis and better interpretation of microarray results
Henry Yang He, Bioinformatics Institute, Singapore

cDNA/oligo microarrays provide simple and economical ways to explore gene expression patterns on a genomic scale, and are used by an increasing number of biologists. In comparison to conventional methods, microarray technology can be used for guided gene discovery, meaning that microarray data are used to select a handful of genes out of the whole genome. This selection process involves a two-stage classification: 1) classification of genes as differentially or non-differentially expressed and 2) classification of genes as biomarkers or non-marker genes. Although microarray technology is still in its infancy, many computational methods have evolved. Thus, the question arises of how to choose proper microarray data analysis methods. We try to address this question by developing validation methods for each analysis step. This talk will highlight what kind of microarray experiments are needed to obtain useful and reliable information. The talk will also give suggestions on how to choose analysis algorithms and software suites for proper microarray data analysis.


Integration of gene expression and protein activity data to estimate structure of a metabolic pathway
Marek Kimmel, Rice University, USA

The NF-kB transcription factor and its signaling pathway play a major role in triggering the immune response in humans. Its regulation involves at least two feedback loops, which can be modeled by means of ordinary differential equations. A deterministic model involves two-compartment kinetics of the activators IkB kinase (IKK) and NF-kB, the inhibitors A20 and IkBa, and their complexes. In resting cells the unphosphorylated IkBa binds to NF-kB and sequesters it in an inactive form in the cytoplasm. In response to extracellular signals such as TNF or IL-1, IKK is transformed from its neutral form (IKKn) into its active form (IKKa), a form capable of phosphorylating IkBa, leading to IkBa degradation. Degradation of IkBa releases the main activator NF-kB, which then enters the nucleus and triggers transcription of the inhibitors and numerous other genes. The newly synthesized IkBa leads NF-kB out of the nucleus and sequesters it in the cytoplasm, while A20 inhibits IKK by easing its transformation into the inactive form (IKKi), a form different from IKKn that is no longer capable of phosphorylating IkBa. After parameter fitting, the proposed model properly reproduces the time behavior of all variables for which data are now available: nuclear NF-kB, cytoplasmic IkBa, A20 and IkBa mRNA transcripts, IKK and IKK catalytic activity, in both wild-type and A20-deficient cells. The model allows detailed analysis of the kinetics of the involved proteins and their complexes, and predicts the possible responses of the whole kinetics to a change in the level of a given activator or inhibitor. However, the NF-kB transcription factor acts by attaching to one or two sites of the promoter region of any gene; this attachment is random and is followed by random detachment.
We build a stochastic model, which allows simulating this process: In each particular cell, the effect of the extracellular signal leads to non-vanishing oscillations, which, at the population level, cancel due to phase shifts. This unexpected effect leads to testable predictions, which we are trying to verify using single-cell observations.
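The random attachment and detachment at a promoter site can be illustrated with a two-state simulation. This is only a sketch under strong assumptions, not the model of the talk: a single binding site, constant rates `k_on` and `k_off` (hypothetical parameters), and exponentially distributed waiting times.

```python
import random

def simulate_promoter(k_on, k_off, t_end, seed=0):
    """Simulate one binding site that switches between free and bound.

    k_on:  assumed attachment rate while the site is free
    k_off: assumed detachment rate while the site is bound
    Returns the fraction of the time interval [0, t_end] spent bound.
    """
    rng = random.Random(seed)
    t = 0.0
    bound = False
    time_bound = 0.0
    while t < t_end:
        rate = k_off if bound else k_on
        # exponential waiting time until the next switch, cut off at t_end
        dwell = min(rng.expovariate(rate), t_end - t)
        if bound:
            time_bound += dwell
        t += dwell
        bound = not bound
    return time_bound / t_end

# With equal rates the site should be bound about half of the time.
fraction = simulate_promoter(k_on=1.0, k_off=1.0, t_end=10000.0)
```

Over a long run the bound fraction approaches k_on / (k_on + k_off), while any single cell shows an irregular on/off pattern.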


Spearman’s footrule as a measure of cDNA microarray reproducibility
Byung Soo Kim, Yonsei University, Korea

Replication is a crucial aspect of microarray experiments, due to various sources of error that persist even after removing systematic effects. It has been confirmed that replication in microarray studies is not equivalent to duplication, and hence it is not a waste of scientific resources. Replication and reproducibility are the most important issues for the microarray application in genomics. However, little attention has been paid to the assessment of reproducibility among replicates. Here we use Spearman's footrule to develop a new measure of the reproducibility of cDNA microarrays, based on how consistently a gene's relative rank is maintained in two replicates. The reproducibility measure, termed index.R, has an R²-type operational interpretation. The index.R assesses reproducibility at the initial stage of the microarray data analysis, even before normalization is done. We first define three layers of replicates: biological, technical and hybridizational replicates, which refer to different biological units, different mRNAs from the same tissue, and different cDNAs from the same mRNA, respectively. As the replicate layer moves down to a lower level, the experiment has fewer sources of error and is thus expected to be more reproducible. To validate the method we applied the index.R to two sets of controlled cDNA microarray experiments, each of which had two or three layers of replicates. The index.R showed a uniform increase as the layer of the replicates moved into a more homogeneous environment. We also noted that the index.R had a larger jump size than Pearson's correlation or Spearman's rank correlation for each replicate layer move, and therefore it has greater expandability as a measure in [0,1] than these two other measures.
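Spearman's footrule is the sum of absolute differences between the ranks a gene receives in two replicates. The sketch below computes it and rescales it to [0,1]. The abstract does not give the definition of index.R, so the rescaling used here (one minus the distance divided by its maximum, floor(n²/2)) is an assumption for illustration and should not be read as index.R itself; the function names are hypothetical.

```python
import numpy as np

def ranks(x):
    """Ordinal ranks 1..n (assumes no ties among the values)."""
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=int)
    r[order] = np.arange(1, len(x) + 1)
    return r

def footrule(x, y):
    """Spearman's footrule: sum over genes of |rank in x - rank in y|."""
    return int(np.abs(ranks(x) - ranks(y)).sum())

def footrule_agreement(x, y):
    """Assumed rescaling to [0, 1]; requires at least two genes."""
    n = len(x)
    max_distance = (n * n) // 2  # largest possible footrule for n items
    return 1.0 - footrule(x, y) / max_distance

# Two toy replicates of four genes: same ordering, then reversed ordering.
same = footrule_agreement([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
reversed_order = footrule_agreement([1.0, 2.0, 3.0, 4.0], [40.0, 30.0, 20.0, 10.0])
```

Because only ranks enter the calculation, the value is unchanged by any monotone rescaling of either replicate, which is why such a measure can be applied before normalization.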


Statistical study of inter-lab and inter-platform agreement of DNA microarray data
Lei Liu, University of Illinois at Urbana-Champaign

As gene expression profile data from DNA microarrays accumulate rapidly, a natural need arises to compare data across different data sets. Unlike DNA sequence comparison, comparison of microarray data can be quite challenging due to the complexity of the data. Different laboratories may adopt different technology platforms. How reliably can we compare data from different labs and different platforms? To address this question, we conducted a statistical study of inter-lab and inter-platform agreement of microarray data from the same type of experiment using Intra-Class Correlation, Kappa Statistics, and Pearson Correlation. The platforms involved include Affymetrix GeneChip, custom cDNA arrays, and custom oligo arrays. We investigated the consistency of replicates, agreement by pairwise comparison, two-fold change agreement, and overall agreement. We also discuss the effects of data filtering and the duplication of genes on the arrays.
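As an illustration of how two-fold change agreement might be scored with a kappa statistic, the sketch below turns log2 ratios from two labs into up/unchanged/down calls and computes Cohen's kappa. The thresholds, function names and toy numbers are assumptions made for the example; the study's actual procedure is not described in the abstract.

```python
def two_fold_calls(log2_ratios):
    """Call each gene up (1), down (-1) or unchanged (0) at a two-fold cutoff."""
    return [1 if r >= 1 else -1 if r <= -1 else 0 for r in log2_ratios]

def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa: agreement between two call lists, corrected for chance.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(calls_a)
    labels = set(calls_a) | set(calls_b)
    # observed proportion of genes with identical calls
    p_obs = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # agreement expected by chance from the marginal call frequencies
    p_exp = sum((calls_a.count(lab) / n) * (calls_b.count(lab) / n)
                for lab in labels)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy log2 ratios for four genes measured in two hypothetical labs.
lab_a = two_fold_calls([1.5, -1.2, 0.1, 2.0])
lab_b = two_fold_calls([1.1, -2.0, 0.3, 0.2])
kappa = cohen_kappa(lab_a, lab_b)
```

Kappa is preferred over raw percent agreement here because most genes are typically unchanged, so two platforms can agree often by chance alone.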


Strategic design and meta-analysis of expression genomic experiments
Edison Liu, Genome Institute of Singapore

DNA microarrays make possible the rapid and comprehensive assessment of the transcriptional activity of a cell, and as such, have proven valuable in assessing the molecular logic of biological processes and human diseases. With the focus on the post hoc statistical analysis of data, attention to the design of the array experiments, to the strategic convergence of results, and to quality control measures may be limited. Our premise is that optimal analysis requires an accounting and control of the many sources of variance within the system, the structuring of experiments to optimally answer specific questions, the ability to make sense of the results through intelligent database interrogation, and then the finality of data validation. We will describe the sources and impact of technical and analytical error, offer solutions to circumvent these problems, and discuss experiment-appropriate design and validation through experimental and database interrogations. Specific mention will be made of strategic design whereby convergence of the results from a series of experiments using different systems can be used to uncover fundamental biological truths.


Common parameters in parallel regressions: extracting information from within-array replicate spots
Gordon Smyth, Walter and Eliza Hall Institute of Medical Research, Australia

Spotted microarrays are printed robotically from DNA plates. Very often the robot is programmed to print more than one spot from each DNA well on each array resulting in within-array replicate spots for each gene. Within-array replicate spots are heavily correlated through spatial proximity on the same array and hence the usual approach is to average the results of the replicate spots before undertaking further analysis. This talk shows that substantial information about gene variability and hence differential expression can be extracted from the within-array replicates by analysing the replicates individually using a pooled correlation estimator.


Recognizing and dealing with common problems in proteomic mass spectrometry
Keith Baggerly, M.D. Anderson Cancer Center

What types of problems are commonly encountered in proteomics? How should we design an experiment in this context? Why do peaks change shape with mass? How do we define "a peak"?

In this talk the speaker will present some answers to the above questions, with illustrations drawn from case studies encountered at MD Anderson. This talk will focus primarily on MALDI and SELDI as described in Friday's lecture, but time permitting the speaker will touch briefly on one or two other modalities being explored.


Smoothing application in microarray analysis
Paul Eilers, Leiden University Medical Centre, Netherlands

In many areas of microarray analysis smoothing can be applied fruitfully. I will present the application of penalized likelihood (P-splines) to:

  • Trend correction in MA plots, to improve normalization. P-splines can be modified to make an extremely fast smoother.
  • Improved presentation of scatterplots. The many dots in scatterplots can hide the patterns and they make display and printing slow. Fast smoothing of a two-dimensional histogram and color-coded display can help.
  • Modelling and correction of spatial trends and pin effects in background and signal estimates. Tensor products of B-splines and spatial penalties give an effective 2-D smoother.
  • Presentation and analysis of time series data. Smooth trends improve displays and help to define distance measures between curves.
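The penalty idea behind these applications can be shown with the simplest relative of P-splines, a Whittaker-type smoother, in which every observation has its own coefficient and roughness is penalized through differences of neighbouring coefficients. This is a minimal sketch, not the speaker's implementation; the function name, the default `lam` and the toy data are assumptions, and a full P-spline fit would replace the identity basis with a B-spline basis.

```python
import numpy as np

def whittaker_smooth(y, lam=10.0, d=2):
    """Minimize sum (y - z)^2 + lam * sum (d-th differences of z)^2.

    Assumes equally spaced observations. lam controls the smoothness,
    d is the order of the difference penalty.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # d-th order difference matrix, shape (n - d, n)
    D = np.diff(np.eye(n), n=d, axis=0)
    # the penalized least-squares solution solves (I + lam * D'D) z = y
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Toy example: a noisy trend such as the log ratio M against intensity A
# (values are synthetic).
rng = np.random.default_rng(0)
a = np.linspace(0.0, 1.0, 50)
m = np.sin(2 * np.pi * a) + rng.normal(scale=0.3, size=a.size)
trend = whittaker_smooth(m, lam=50.0)
```

With a second-order penalty a straight line is left unchanged, so increasing `lam` pulls the fit towards a linear trend rather than towards a constant. For large arrays a sparse solver would be the natural replacement for the dense `numpy.linalg.solve` used here.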
