University of Rochester Medical Center
 

Fall 2008 Biostatistics Colloquia


Nonparametric Variance Estimation for Systematic Samples

Jean Opsomer, PhD
Colorado State University


Systematic sampling is a frequently used sampling method in natural resource surveys, because of its ease of implementation and its design efficiency. An important drawback of systematic sampling, however, is that no direct estimator of the design variance is available. We describe a new estimator of the model-based expectation of the design variance, under a nonparametric model for the population. The nonparametric model is sufficiently flexible that it can be expected to hold at least approximately for many practical situations. We prove the consistency of the estimator for both the anticipated variance and the design variance under the nonparametric model. The approach is used on a forest survey dataset, on which we compare a number of design-based and model-based variance estimators.
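To illustrate why a single systematic sample gives no direct estimate of the design variance, the following is a minimal sketch, not the estimator described in the talk. It assumes a small made-up population whose size is divisible by the sampling interval k; the function names are hypothetical. Because the population is fully known here, the exact design variance can be computed by enumerating all k possible samples, something a survey analyst holding one sample cannot do.

```python
def systematic_samples(y, k):
    """Return all k possible 1-in-k systematic samples of the population list y."""
    return [y[r::k] for r in range(k)]

def design_variance_of_mean(y, k):
    """Exact design variance of the sample mean under 1-in-k systematic
    sampling with a uniformly random start (assumes len(y) is divisible by k)."""
    pop_mean = sum(y) / len(y)
    means = [sum(s) / len(s) for s in systematic_samples(y, k)]
    return sum((m - pop_mean) ** 2 for m in means) / k

def within_sample_variance(s):
    """Ordinary sample variance of one observed sample."""
    mean = sum(s) / len(s)
    return sum((v - mean) ** 2 for v in s) / (len(s) - 1)

# Made-up periodic population whose period equals the sampling interval.
population = [0, 1, 2] * 4
k = 3

true_variance = design_variance_of_mean(population, k)
one_sample = systematic_samples(population, k)[0]
naive_spread = within_sample_variance(one_sample)

print("exact design variance of the mean:", true_variance)
print("variance observed within one sample:", naive_spread)
```

In this deliberately extreme population every systematic sample is constant, so the spread within the observed sample is zero even though the sample mean varies from one random start to another. This is the gap that model-based approaches, such as the one in the abstract, aim to fill.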

Thursday, November 20, 2008
3:30 PM
K-207 (Room 2-6408) Medical Center

 

Bayesian Inference for High Dimensional Functional and Image Data using Functional Mixed Models

Jeffrey S. Morris, PhD
Department of Biostatistics
The University of Texas MD Anderson Cancer Center

 High dimensional, irregular functional data are increasingly encountered in scientific research.  For example, MALDI-MS yields proteomics data consisting of one-dimensional spectra with many peaks, array CGH or SNP chip arrays yield one-dimensional functions of copy number information along the genome, 2D gel electrophoresis and LC-MS yield two-dimensional images with spots that correspond to peptides present in the sample, and fMRI yields four-dimensional data consisting of three-dimensional brain images observed over a sequence of time points on a fine grid.  In this talk, I will discuss how to identify regions of the functions/images that are related to factors of interest using Bayesian wavelet-based functional mixed models.  The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while borrowing strength between observations in all dimensions.  I will demonstrate how to identify regions of the functions that are significantly associated with factors of interest, in a way that takes both statistical and practical significance into account and controls the Bayesian false discovery rate to a pre-specified level.    I will also discuss how to extend this framework to include functional predictors with coefficient surfaces.  These methods will be applied to a series of functional data sets.

Thursday, November 13, 2008
3:30 PM
Adolph Auditorium (Room 1-7619) Medical Center


Semiparametric Analysis of Recurrent and Terminal Event Data

Douglas E. Schaubel, PhD
Department of Biostatistics, University of Michigan

In clinical and observational studies, the event of interest is often one which can occur multiple times for the same subject (i.e., a recurrent event). Moreover, there may be a terminal event (e.g. death) which stops the recurrent event process and, typically, is strongly correlated with the recurrent event process. We consider the recurrent/terminal event setting and model the dependence through a shared gamma frailty that is included in both the recurrent event rate and terminal event hazard functions.  Conditional on the frailty, a model is specified only for the marginal recurrent event process, hence avoiding the strong Poisson-type assumptions traditionally used. Analysis is based on estimating functions that allow for estimation of covariate effects on the marginal recurrent event rate and terminal event hazard. The method also permits estimation of the degree of association between the two processes. Closed-form asymptotic variance estimators are proposed. The proposed methods are evaluated through simulations to assess the applicability of the asymptotic results in finite samples, and to evaluate the sensitivity of the method to departures from its underlying assumptions. The methods are illustrated in an analysis of hospitalization data for patients in an international multi-center study of outcomes among peritoneal dialysis patients.  This is joint work with Yining Ye and Jack Kalbfleisch.

Thursday, November 6, 2008
3:30 PM
K-207 (Room 2-6408) Medical Center



Spatio-temporal Analysis via Generalized Additive Models

Kung-Sik Chan, PhD
The University of Iowa

The Generalized Additive Model (GAM) has been widely used in practice. However, the GAM assumes iid errors, which invalidates its use for many spatio-temporal data sets. For the latter kind of data, the Generalized Additive Mixed Model (GAMM) may be more appropriate. While there exist several approaches for estimating a GAMM, these approaches suffer from being numerically unstable or computer-intensive.

In this talk, I will discuss some recent, joint work with Xiangming Fang. We develop an iterative algorithm for Penalized Maximum Likelihood (PML) and Restricted Penalized Maximum Likelihood (REML) estimation of a GAM with correlated errors. Although the new approach does not assume any specific correlation structure, the Matérn spatial correlation model is of particular interest, as motivated by our biological applications. As some of the Matérn parameters are not consistently estimable under fixed-domain asymptotics, situations for the spatio-temporal case are investigated, where the spatial design is assumed to be fixed with temporally independent repeated measurements and the spatial correlation structure does not change over time. Our theoretical investigation exploits the fact that penalized likelihood estimation can be given a Bayesian interpretation. The conditions under which the asymptotic posterior normality holds are discussed. We also develop a model diagnosis method for checking the assumption of independence across time for spatio-temporal data. In practice, selecting the best model is often of interest. A model selection criterion based on the Bayesian framework is proposed to compare different candidate models. The proposed methods are illustrated by simulation and a fisheries application.

Thursday, October 23, 2008
3:30 PM
Adolph Auditorium (Room 1-7619) Medical Center

Challenges in Joint Modeling of Longitudinal and Survival Data

Jane-Ling Wang, PhD
Department of Statistics
University of California at Davis

It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches to marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties.

We will discuss the challenges in this area and provide several solutions. One of the difficulties is with the likelihood approaches when the survival component is modeled semiparametrically, as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches.

Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.

This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).

 


Yakovlev Colloquium*: Detecting Disparities in Long-term Cancer Survivals: Challenges and Possible Solutions

Yi Li, PhD
Department of Biostatistics
Harvard University
Dana-Farber Cancer Institute

This talk deals with long-term disease-specific survival among prostate cancer patients in the NIH Surveillance Epidemiology and End Results (SEER) program, wherein the main endpoint (e.g. deaths from prostate cancer) and the censoring causes (e.g. deaths from heart diseases) may be dependent. While a number of authors have studied the mixture survival model to analyze survival data with non-negligible long-term survival fractions, none has studied the mixture model in the presence of dependent censoring. To account for such dependence, we propose a more general long-term survival model that allows for dependent censoring. We derive the models from the perspective of competing risks and model the dependence between the censoring time and the survival time using a class of Archimedean copula models. Within this framework, we consider the parameter estimation, the long-term survival detection, and the two-sample comparison of latency distributions in the presence of dependent censoring when a proportion of patients is deemed to be long-term survivors. Large sample results using the martingale theory are obtained. We examine the finite sample performance of the proposed methods via simulation and apply them to analyze the SEER prostate cancer data.

Thursday, September 18, 2008
3:30 PM
K-207 (Room 2-6408) Medical Center

*To honor Dr. Andrei Yakovlev’s major contributions to the department, our first colloquium each academic year will be dedicated to his memory.


Spring 2008 Biostatistics Colloquia


Discovery of Latent Patterns in Disability Data and the Issue of Model Choice

Tanzy Mae Love, PhD
Department of Statistics
Carnegie Mellon University

Model choice is a major methodological issue in the explosive growth of data-mining models involving latent structure for clustering and classification. Here, we work from a general formulation of hierarchical Bayesian mixed-membership models and present several model specifications and variations, both parametric and nonparametric, in the context of learning the number of latent groups and associated patterns for clustering units. We elucidate strategies for comparing models and specifications by producing novel analyses of the following data set: data on functionally disabled American seniors from the National Long Term Care Survey.

Thursday, April 24, 2008
3:30 PM
Upper Auditorium (Room 3-7619) Medical Center


Funding Opportunities at the National Science Foundation

Grace Yang, PhD
Program Director, Statistics & Probability
National Science Foundation
Division of Mathematical Sciences

Thursday, April 17, 2008
3:30 PM
K-307 (Room 3-6408) Medical Center


Multiple imputation methods in application of a random slope coefficient linear model to randomized clinical trial data

Moonseong Heo, PhD
Department of Psychiatry
Weill Medical College of Cornell University

Two types of multiple imputation methods, proper and improper, for imputing missing not at random (MNAR) continuous data are considered in the context of attrition problems arising from antidepressant clinical trials, whose primary interest is to compare treatment effects on the declines in depressive symptoms over the study period. Both methods borrow information from completers' data to construct pseudo donor sampling distributions from which imputed values are drawn, but differ in characterizing those distributions. A joint likelihood of each method is constructed based on a selection model for missing data. Their performance was evaluated based on maximum likelihood estimates of a random slope coefficient model that fits the imputed data to test the treatment effect via modeling interaction between the treatment and the slope of depressive symptom decline. The following performance evaluation criteria were considered: bias, statistical power, root mean square error, coverage probability of the 95% confidence interval (CI), and width of the CI. The two methods are compared with other analytic strategies for incomplete data: completers-only data analysis, available observations analysis, and last observation carried forward (LOCF) analysis. A simulation study showed that the two multiple imputation methods have favorable results in bias, statistical power and width of the 95% CI, whereas the available observations analysis showed favorable results in bias, root mean square error and coverage rate. Completers-only analysis showed better results than the LOCF analysis. Those findings guided interpretation of results from an antidepressant trial for geriatric depression. Finally, a comparison with a sequential hot deck multiple imputation method in application to analysis with missing binary outcome from a recently completed antipsychotic trial will be discussed.

Wednesday, April 9, 2008
3:45 PM
K-307 (Room 3-6408) Medical Center


Improved Measurement Modeling and Regression with Latent Variables

Karen Bandeen-Roche, PhD
Professor of Biostatistics and Medicine
Johns Hopkins Bloomberg School of Public Health

Latent variable models have long been utilized by behavioral scientists to summarize constructs that are represented by multiple measured variables or are difficult to measure, such as health practices and psychiatric syndromes. They have been regarded as particularly useful when variables that can be measured are highly imperfect surrogates for the construct of inferential interest, but they are also criticized as being overly abstract, weakly estimable, computationally intensive and sensitive to unverifiable modeling assumptions. My talk describes two lines of research to improve the utility of latent variable modeling, counterbalancing strengths and weaknesses. First, it reviews methods I have developed for assessing modeling assumptions and delineating what are the targets of parameter estimation in the case of maximum likelihood fitting, allowing for a mis-specified model. Then, it describes new strategies for developing measurement models for subsequent use in developing regression outcomes. One affords approximately unbiased estimation vis a vis full latent variable regression. A second counterbalances standard latent variable modeling assumptions—focused on internal validity of measurement—with alternative assumptions—say, focused on external or concurrent validation. Small sample performance properties are evaluated. The methods will be illustrated using data on post traumatic stress disorder in a population-based sample and aging and adverse health in older adults. It is hoped that the findings will lead to improved usage of latent variable models in scientific investigations.

Thursday, April 3, 2008
3:30 PM
Class of 62 Auditorium (Room G-9425) Medical Center


Branching Processes as Models of Progenitor Cell Populations and Estimation of the Offspring Distributions

In memory of Andrei Yakovlev

Nikolay Yanev, PhD
Professor and Chair
Dept of Probability and Statistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences

This paper considers two new models of reducible age-dependent branching processes with emigration in conjunction with estimation problems arising in cell biology. Methods of statistical inference are developed using the relevant embedded discrete branching structure. Based on observations of the branching process with emigration, estimators of the offspring probabilities are proposed for the hidden unobservable process without emigration, the latter being of prime interest to investigators. The problem under consideration is motivated by experimental data generated by time-lapse video-recording of cultured cells that provides abundant information on their individual evolutions and thus on the basic parameters of their life cycle in tissue culture. Some parameters, such as the mean and variance of the mitotic cycle time, can be estimated nonparametrically without resorting to any mathematical model of cell population kinetics. For other parameters, such as the offspring distribution, a model-based inference is needed. Age-dependent branching processes have proven to be useful models for that purpose. A special feature of the data generated by time-lapse experiments is the presence of censoring effects due to migration of cells out of the field of observation. For the time-to-event observations, such as the mitotic cycle time, the effects of data censoring can be accounted for by standard methods of survival analysis. No methods are available to accommodate such effects in the statistical inference on the offspring distribution. Within the framework of branching processes, the loss of cells to follow-up can be modeled as a process of emigration. Incorporating the emigration process into a pertinent branching model of cell evolution provides the basis for the proposed estimation techniques. The statistical inference on the offspring distribution is illustrated with an application to the development of oligodendrocytes in cell culture.
     This talk is based on joint work with Drs. A. Yakovlev and V. Stoimenova.

Thursday, March 27, 2008
3:30 PM
Upper Auditorium (Room 3-7619) Medical Center


Challenges in Joint Modeling of Longitudinal and Survival Data

Jane-Ling Wang, PhD
Professor
University of California at Davis

It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches to marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties. We will discuss the challenges in this area and provide several solutions. One of the difficulties is with the likelihood approaches when the survival component is modeled semiparametrically, as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches. Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.
     This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).

Thursday, March 6, 2008
3:30 PM
Upper Auditorium (Medical Center, Room 3-7619)



Fall 2007 Biostatistics Colloquia

General Transformation Models for Joint Analysis of Recurrent Events and Terminal Event

Donglin Zeng, PhD
Associate Professor
University of North Carolina, Chapel Hill

We propose a class of transformation models with random effects for jointly modeling recurrent events and a terminal event. The class of transformation models includes both the proportional hazards model and the proportional odds model as special cases. The nonparametric maximum likelihood estimation method is used to derive the estimators, which are then shown to be consistent, asymptotically normal and asymptotically efficient. A simple algorithm is proposed to calculate the estimators. Simulation studies are conducted to examine the small-sample performance of the proposed method. The method is further applied to a real data set.

Friday, December 7, 2007
1:30 PM
Biostatistics Conference Room (MRBX G-11213)


Sequential evaluation of measurement error in a reliability study

Aiyu Liu, PhD
Senior Investigator
National Institute of Child Health & Human Development

We introduce sequential testing procedures for the planning and analysis of reliability studies to assess the measurement error in measuring the level of a biomarker. The designs allow repeated evaluation of reliability of the measurements and stop testing if early evidence shows the measurement error to be within the level of tolerance. Methods are developed and critical values tabulated for a number of two-stage designs. The methods are exemplified using an example evaluating the reliability of an oxidative stress biomarker.

Thursday, November 15, 2007
3:30 PM
Room 1-7619 (Adolph Auditorium) Medical Center


Resampling-based Multiple Testing Methods with Covariate Adjustment: Application to Investigation of Antiretroviral Drug Susceptibility

Victor DeGruttola, ScD
Professor of Biostatistics
Harvard School of Public Health

Identification of patterns of genetic mutations that are associated with clinical resistance to specific antiretroviral drugs in HIV-infected patients requires adjustment for potential confounders, such as the number of active drugs in a patient's regimen other than the one of interest. A variety of methods (e.g. regression trees, neural networks, support vector regression, least squares regression, least angle regression) are available for fitting high dimensional models, which are especially useful for prediction. Our goal focuses on the discovery of important patterns of mutations associated with resistance to a specific drug, after robust adjustment for the impact of covariates. Motivated by this problem, we investigated resampling-based methods to test equal mean response across multiple groups defined by HIV genotype, after adjustment for covariates. We consider construction of test statistics and their null distributions under two types of model: parametric and semiparametric. The covariate function (e.g., linear or quadratic) is explicitly specified in the parametric but not in the semiparametric approach. The parametric approach is more precise when models are correctly specified, but suffers from bias when they are not; the semiparametric approach is more robust to model misspecification, but may be less efficient. To help preserve Type I error while also improving power in both approaches, we propose resampling approaches based on matching of observations with similar covariate values. Matching reduces the impact of model misspecification as well as imprecision in estimation. These methods are evaluated via simulation studies and applied to a data set that combines results from a variety of clinical studies of salvage regimens. Our focus is on relating HIV genotype to virological response to abacavir after adjustment for the number of active antiretroviral drugs (excluding abacavir) in the patient's regimen.
Illustrative data were provided by the Forum for HIV Collaborative Research, which collected baseline genotype, treatment history, and virological response on over 1300 patients from a range of clinical research studies in North America and Europe. These methods are extended to consider the identification of single nucleotide polymorphisms (SNPs) associated with toxicities related to antiretroviral drugs; an additional challenge in this research arises from the fact that the genotype is unphased.

Thursday, October 25, 2007
3:30 PM
Room 2-6408 (K-207) Medical Center


Variance Estimators of Cross-Validation Estimators of the Generalization Error

Prof. Marianthi Markatou
Department of Biostatistics
Columbia University

We bring together methods from two different disciplines, machine learning and statistics, in order to address the problem of estimating the variance of cross-validation estimators of the generalization error. Specifically, we approach the problem of variance estimation of the CV estimators of the generalization error of computer algorithms as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y, Z, where Y denotes the cardinality of the intersection of two different training sets and Z denotes the cardinality of the intersection of two different test sets. We prove that the distribution of these two random variables is hypergeometric and we compare our estimator with the estimator proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case and the general case of kernel regression.
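The hypergeometric claim can be checked numerically in a simple setting. The sketch below is an illustration only, not the speaker's derivation. It assumes two training sets of equal size m are drawn independently and uniformly without replacement from n observations; the values of n and m, the seed, and the function names are arbitrary choices made for this example.

```python
import random
from math import comb

def overlap_pmf(n, m, k):
    """Hypergeometric probability that two independent, uniformly drawn
    size-m subsets of n items share exactly k elements."""
    return comb(m, k) * comb(n - m, m - k) / comb(n, m)

def simulate_overlap(n, m, reps, seed=0):
    """Empirical distribution of the overlap size over `reps` random pairs."""
    rng = random.Random(seed)
    counts = [0] * (m + 1)
    for _ in range(reps):
        s1 = set(rng.sample(range(n), m))
        s2 = set(rng.sample(range(n), m))
        counts[len(s1 & s2)] += 1
    return [c / reps for c in counts]

n, m = 20, 12  # assumed sizes: 20 observations, training sets of 12
pmf = [overlap_pmf(n, m, k) for k in range(m + 1)]
freq = simulate_overlap(n, m, reps=20000)

for k in range(m + 1):
    print(k, round(pmf[k], 4), round(freq[k], 4))
```

The simulated frequencies should track the hypergeometric probabilities closely, and the mean overlap equals m*m/n, the standard hypergeometric mean for this setup.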

Thursday, October 11, 2007
3:30 PM
Room 1-7619 (Adolph Auditorium) Medical Center


Capturing Heterogeneity and Dependence in Gene Expression Studies by Surrogate Variable Analysis
John Storey
University of Washington
School of Public Health and Community Medicine

It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have widespread effects on gene expression levels. These factors are often unmeasured or unmodeled in the significance analysis of an expression study. We show that this "expression variation heterogeneity" can have a profound impact on the statistical and biological results obtained from nearly every microarray study. We propose surrogate variable analysis (SVA) to reduce the effect of expression heterogeneity in microarray studies, both by removing confounding of signal and by eliminating dependence across genes. We discuss connections between SVA and factor analysis, compare SVA with other methods for addressing dependence in multiple testing, and apply SVA to both simulated and experimental data.

Thursday, Sept 27, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center


Quantification of Protein Lysate Arrays: A Nonparametric Approach

Prof. Ximing He 
Department of Statistics
University of Illinois at Urbana-Champaign

The reverse-phase protein lysate array is an emerging technology that allows us to quantify the relative expression levels of a protein in many different cellular samples. At this moment, the applications of protein lysate arrays are still exploratory, with a lack of reliable analysis tools for quantifying the information from protein arrays. In this talk, we show that a nonparametric protein expression curve often provides a better fit to the data from the dilution series, whereas rigid parametric models such as the commonly used logistic curves are prone to bias. The problem of quantifying protein expression levels demands serious statistical work, and this talk serves as an introduction. In addition, I will discuss some interesting research problems in statistics that are motivated by our work on protein lysate array data. Part of the talk is based on joint work with colleagues at the M.D. Anderson Cancer Center.

Thursday, Sept 6, 2007
3:30 PM
Room 2-6408 (K-207) Medical Center


Spring 2007 Biostatistics Colloquia

Global Influenza Surveillance and Bioinformatics in Genome and Epidemiological Studies

Prof. Oleg I. Kiselev 
State Institute of Influenza
Russian Academy of Medical Sciences
St. Petersburg, Russia

Influenza type A viruses are a leading cause of mass illness and high mortality during pandemics. The United States and the other G8 countries have decided to strengthen their efforts to prepare for an influenza pandemic. In implementing the National Pandemic Plan, the first priority is the prediction of epidemiological situations at the local level and of the genetic properties of potential pandemic strains. Bioinformatics should have a leading role in this direction. In the spring of 1997, an outbreak caused by a highly pathogenic influenza virus was registered in Hong Kong. This outbreak was very unusual in comparison with seasonal flu because of a very high mortality rate among all age groups of patients. The virus was isolated and investigated at the CDC and other laboratories. As a result of these studies, molecular signs of pathogenicity were recognized in the hemagglutinin and NS1 genes. Due to the strong induction of cytokine gene expression, the virus caused systemic organ failure and lung edema as a fatal complication. Based on genetic evidence and fine mapping of the viral genome, vaccines and diagnostics were designed and produced. A growing body of sequence data creates a strong demand for bioinformatics support of molecular biology work. The current epidemiological situation in Indonesia and other Eastern countries is getting worse. The H5N1 virus is spreading in many countries along the flyways of waterfowl. In many countries the epidemiological situation should be characterized as stably endemic. This means that the virus is in a latent phase in animals and can be activated by unknown factors and cause epidemics. In my presentation, the system of influenza surveillance and control within the WHO Global Influenza network will be discussed. The importance of developing new global bioinformatics approaches and software for a genetic and epidemiological influenza surveillance system will be demonstrated, and such approaches will be proposed. Examples of new Russian developments in this field will be provided and reviewed.

Monday, June 11, 2007
2:00 PM
Room 2-6408 (K-207) Medical Center

 

Robust Methods for Personalized Prediction of Clinical Outcomes
Tianxi Cai
Department of Biostatistics
Harvard School of Public Health

Continuing technological advancements allow researchers and clinicians to measure an increasingly vast diversity of clinical and biological markers, rapidly increasing our understanding of disease processes. The wide range of newly available markers holds great potential for the personalization of medical care through accurate prediction of outcomes in individual patients. Traditional statistical methods for using a patient's marker values to make personalized predictions are derived under a strong assumption that the true model relating markers to the response can be identified, at least with a large enough sample. In practice, however, it is difficult if not impossible even to locate a class of models containing the truth. In this talk, I will discuss various methods for construction, evaluation and comparison of prediction rules without having to assume that the fitted regression models are correct. These methods will be illustrated using datasets from an AIDS clinical trial and a breast cancer gene expression study.

Thursday, May 24, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center



High-Dimensional Statistical Models in Genomics:
UIP, MCP and CSI in Perspectives
Pranab K. Sen, Ph.D.
University of North Carolina, Chapel Hill

The ongoing genomics evolution has posed some challenging statistical problems. Most statistical models arising in bioinformatics, data mining and a variety of other computer-intensive interdisciplinary research fields are complex in their design, sampling plan and associated probability law. The curse of dimensionality is so overwhelming that conventional likelihood ratio based statistical inference may not be useful. On top of that, such models are typically constrained by inequality, order, functional, shape or other restraints. Use of variants of likelihood ratio has also encountered similar impasses. S. N. Roy's (1953) ingenious union-intersection principle along with high-dimensional multivariate analysis provide an alternative avenue having some computational advantages, increased scope of application and beyond parametrics formulations. This scenario is illustrated with some microarray data and SNP models.

Thursday, May 10, 2007 at 3:30 p.m.
Room 3-7619 Medical Center (Upper Auditorium)


***Cancelled***
Capturing Heterogeneity and Dependence in Gene Expression Studies by Surrogate Variable Analysis
John Storey
University of Washington
School of Public Health and Community Medicine


It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have widespread effects on gene expression levels. These factors are often unmeasured or unmodeled in the significance analysis of an expression study. We show that this "expression variation heterogeneity" can have a profound impact on the statistical and biological results obtained from nearly every microarray study. We propose surrogate variable analysis (SVA) to reduce the effect of expression heterogeneity in microarray studies, both by removing confounding of signal and by eliminating dependence across genes. We discuss connections between SVA and factor analysis, compare SVA with other methods for addressing dependence in multiple testing, and apply SVA to both simulated and experimental data.

Thursday, April 19, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center


PLASQ: A Generalized Linear Model-Based Procedure to Determine Allelic Dosage in Cancer Cells from SNP Array Data
Thomas Laframboise,
David Harrington,
Barbara A. Weir
Dana-Farber Cancer Institute

Human cancer is largely driven by the acquisition of mutations. One class of such mutations is copy number polymorphisms, comprised of deviations from the normal diploid two copies of each autosomal chromosome per cell.  We describe a probe-level allele-specific quantitation (PLASQ) procedure to determine copy number contributions from each of the parental chromosomes in cancer cells from SNP microarray data. Our approach is based upon a generalized linear model that takes advantage of a novel classification of probes on the array. As a result of this classification, we are able to fit the model to the data using an expectation-maximization algorithm designed for the purpose. We demonstrate a strong model fit to data from a variety of cell types. In normal diploid samples, PLASQ is able to genotype with very high accuracy. Moreover, we are able to provide a generalized genotype in cancer samples (e.g. CCCCT at an amplified SNP). Our approach is illustrated on a variety of lung cancer cell lines and tumors, and a number of events are validated by independent computational and experimental means.  An R software package containing the methods is freely available.

Thursday, April 5, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center


Data Monitoring In Clinical Trials: Experiences of a Biostatistician
Robert F. Woolson, Ph.D.
Professor,
Department of Biostatistics, Bioinformatics & Epidemiology
Medical University of South Carolina
&
Professor Emeritus,
Department of Biostatistics,
Department of Statistics and Actuarial Sciences
University of Iowa


Data Safety and Monitoring Boards (DSMBs) are typically responsible for reviewing safety and efficacy data during the conduct of Phase III randomized clinical trials. These groups are charged with reviewing accumulating evidence to determine whether a trial should be concluded on the basis of benefit, lack of benefit, logistical problems in the study’s conduct, or undue harm to study participants. Biostatisticians generally have an important role as members of a DSMB or, alternatively, as liaisons between the trial and the external DSMB. In this applied talk, I shall discuss some general issues, personal experiences and challenges associated with DSMB activities.

Thursday, March 29, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center

Fall 2006 Biostatistics Colloquia

Frequentist and Bayesian Approaches to High-Dimensional Testing
Dan Spitzner, Ph.D.
Department of Statistics
Virginia Tech

Functional data arise from samples of digitized or otherwise densely-measured random functions. They are intrinsically high-dimensional, and this poses a challenge to classical hypothesis testing by exacerbating difficulties in discerning “significant” large-scale attributes from spurious noise. A common resolution is to apply smooth goodness-of-fit tests using an adaptive mechanism to truncate the dimensionality of the data. In this talk, it will be discussed how such procedures disproportionately balance, in a desirable way, emphasis between large-scale and noise-like data attributes. This will motivate an investigation into tests that taper (rather than truncate) dimensionality through a test statistic given as weighted quadratic form. Main results concerning theoretical performance and near-optimal settings will be discussed within the context of “rates of testing” theory. Parallel results will then be discussed from a Bayesian viewpoint, in which the tapering concept is particularly appropriate.

Thursday, November 30, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center

Genetic Studies for Ordinal Traits
Heping Zhang, Ph.D.
Department of Epidemiology and Public Health
Yale University School of Medicine

For complex diseases, especially mental health conditions including nicotine dependence and substance use, the outcome variables are often recorded on an ordinal rather than a quantitative scale. The naturally recorded ordinal traits are usually analyzed either as quantitative traits or after being dichotomized. It has been demonstrated repeatedly in recent studies that this commonly used approach to dealing with ordinal traits is inadequate and results in loss of power. After discussing general principles and an overview of related work, I will present score test statistics that belong to a general class of family-based association tests (FBATs) for ordinal traits. This new approach can adjust for the effects of covariates. Simulation results will be presented to compare the type I error and power of our proposed tests with existing tests. The empirical results suggest that our test produces reasonable type I error rates and has better power than the existing tests. The proposed test was used to analyze GAW14 data on alcoholism and identified several single nucleotide polymorphisms, including rs485874, rs619, rs718251 and rs1869907, that are significantly associated with alcohol dependence after adjusting for gender and age.

This is a series of joint work with Rui Feng, Xueqin Wang, Hongtu Zhu, and Yuanqing Ye.

Thursday, November 16, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center

Genomic aberration analysis of tumor samples using SNP microarrays
Cheng Li, Ph.D.
Department of Biostatistics, Harvard School of Public Health
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute

Loss of heterozygosity (LOH) and copy number changes of chromosomal regions bearing tumor suppressor genes or oncogenes are key events in the evolution of epithelial and mesenchymal tumors. Identification of LOH regions usually relies on genotyping tumor and counterpart normal DNA and noting regions where heterozygous alleles in the normal DNA become homozygous in the tumor. However, paired normal samples for tumors and cell lines are often not available. With the advent of oligonucleotide arrays that simultaneously assay thousands of single-nucleotide polymorphism (SNP) markers, genotyping can now be done at high enough resolution to allow identification of LOH events by the absence of heterozygous loci, without comparison to normal controls. Here we describe a hidden Markov model-based method to identify LOH from unpaired tumor samples, taking into account SNP intermarker distances, SNP-specific heterozygosity rates, and the haplotype structure of the human genome. In addition, copy number analysis incorporating LOH will be discussed.

Thursday, November 2, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center

From Data to Differential Equations
James Ramsay, Ph.D.
Department of Psychology
McGill University

Differential equations are the natural way to model systems with functional inputs and functional outputs. They allow us to study the system’s dynamics in the sense of explicitly modelling how the output changes in response to sudden changes in input. For example, engineers developing control systems for industrial processes routinely use DIFE’s as modelling tools. A new method is described for going directly from noisy discrete data, not necessarily sampled at equally spaced times, to a system of differential equations of arbitrary orders, linear or nonlinear, that describes the data. The method involves a generalization of nonparametric curve estimation in which the penalty functional rather than the smoothing functions is estimated. Examples are drawn from biology, chemical engineering and medicine.

Thursday, October 19, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center

 

Stratification on Post-Treatment Variables in Causal Inference: A Potential Outcomes Approach to Developmental Toxicity Analyses
Michael Elliott, Ph.D.
Department of Biostatistics
University of Michigan School of Public Health

In investigating the causal effect of a toxin in fetal toxicology studies in a counterfactual framework, we want to restrict consideration of the effect of dose on birthweight or malformation status to the subset of fetuses that would be born alive under the set of doses in question.  Additionally, a toxin may affect both birthweight and malformation status of fetuses (Sammel et al. 1997, Dunson et al. 2003), so that the direct effect of a toxin on birthweight may be confounded by the effect of the toxin on the number of fetuses that implant and are carried to term, since the resources available to the fetus may be different under different doses of toxins.  Use of a principal stratum model (Frangakis et al. 2004) that considers the survival status of fetuses under different doses of a toxin can account for both of these forms of selection that may result from utilizing the observed data rather than the “complete” (counterfactual) data.  This model also addresses issues of incorporating post-randomization observations in a principal stratum framework.

Thursday, October 5, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center

 

Spring 2006 Biostatistics Colloquia

Estimating Mean Response as a Function of Treatment Duration in an Observational Study, Where Duration may be Informatively Censored
Butch Tsiatis, Ph.D.
Department of Statistics
North Carolina State University

In a recent clinical trial "ESPRIT" of patients with coronary heart disease who were scheduled to undergo percutaneous coronary intervention (PCI), patients randomized to receive Integrilin therapy had significantly better outcomes than patients randomized to placebo. The protocol recommended that Integrilin be given as a continuous infusion for 18-24 hours. There was debate among the clinicians on the optimal infusion duration in this 18-24 hour range, and we were asked to study this question statistically. Two issues complicated this analysis: (i) The choice of treatment duration was left to the discretion of the physician and (ii) treatment duration would have to be terminated (censored) if the patient experienced serious complications during the infusion period. To formalize the question, "What is the optimal infusion duration?" in terms of a statistical model, we developed a framework where the problem was cast using ideas developed for adaptive treatment strategies in causal inference. The problem is defined through parameters of the distribution of (unobserved) potential outcomes. We then show how, under some reasonable assumptions, these parameters could be estimated. The methods are illustrated using the data from the ESPRIT trial.


Up and Down Designs for Dose-Finding Trials
Nancy Flournoy, Ph.D.
Department of Statistics
University of Missouri-Columbia

In this talk we review nonparametric treatment allocation procedures for dose-finding trials, focusing on more recent results. We cover (1) the situation in acute toxicity studies in which the toxicity rate is assumed to increase with dose, (2) the situation in which toxicity and efficacy are considered jointly, with the goal of identifying the dose that maximizes P{efficacy and no toxicity}, and (3) how these procedures can be used to approximate optimal designs, which cannot be directly implemented because in these settings the response functions are nonlinear and hence optimal designs are functions of the unknown parameters. Included are group up-and-down designs with and without randomization and their extension, the “zoom-in designs”, optimizing up-and-down designs and balancing up-and-down designs. Where possible we provide theoretical results that aid the comparison of these designs.
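
For readers unfamiliar with the basic idea, the following is a minimal sketch of the classical one-subject-at-a-time up-and-down rule, which is only the simplest member of the family of designs reviewed in the talk. The function names, the dose grid and the toxicity probabilities are hypothetical and chosen purely for illustration; the group, zoom-in, optimizing and balancing variants mentioned in the abstract are not implemented here.

```python
import random


def next_dose(level, toxicity, n_levels):
    """Classical up-and-down rule: step down after a toxicity, up otherwise.

    Dose levels are indexed 0 .. n_levels - 1; moves are clipped at the ends.
    """
    if toxicity:
        return max(level - 1, 0)
    return min(level + 1, n_levels - 1)


def simulate_up_and_down(tox_probs, n_subjects, start=0, seed=0):
    """Simulate one trial and return how many subjects were treated per dose.

    tox_probs[k] is the (assumed, hypothetical) toxicity probability at
    dose level k, taken to be nondecreasing in k.
    """
    rng = random.Random(seed)
    level = start
    visits = [0] * len(tox_probs)
    for _ in range(n_subjects):
        visits[level] += 1
        toxicity = rng.random() < tox_probs[level]
        level = next_dose(level, toxicity, len(tox_probs))
    return visits


if __name__ == "__main__":
    # Hypothetical dose-toxicity curve with five dose levels.
    probs = [0.05, 0.20, 0.50, 0.80, 0.95]
    print(simulate_up_and_down(probs, n_subjects=200))
```

Under this simple rule the allocations tend to cluster around the dose whose toxicity probability is near 0.5, which is one reason modified rules are studied when a different target rate is of interest.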


Rehabilitating LAD Regression: Breakdown, Smoothing, and Robustness
Jeffrey S. Simonoff, Ph.D.
Leonard N. Stern School of Business
New York University

The most common estimation method for regression models is, of course, least squares, which minimizes the sum of squared deviations from the regression surface. It is well-known, however, that least squares regression is highly nonrobust, being sensitive to unusual values in both the response and predictor spaces. An alternative approach is least absolute deviation (LAD) regression, which minimizes the sum of absolute deviations. It is known that LAD regression is more robust than least squares in the presence of outliers in the response variable, but it has not gained favor in the robustness literature because of its sensitivity to unusual values in the predictors (leverage points). In this talk we describe recent research using mixed integer programming designed to evaluate and improve the robustness of LAD regression, through determination of the finite sample breakdown point. We show how recent research on the breakdown point for LAD regression can be adapted to nonparametric (local linear) regression, providing the first quantification of robustness for any nonparametric regression estimator. We show how knowledge of the breakdown point implies good properties of the quadratic (Epanechnikov) kernel for local linear LAD regression, and describe how post-smoothing can result in a more appealing regression curve. We build on these results by demonstrating that the introduction of nonuniform weights can improve the robustness of parametric (linear) LAD regression, and develop an algorithm for choosing those weights with the goal of increasing the breakdown point of the method by downweighting leverage points. We generalize these weights using an easily implemented robustification of Mahalanobis distance. We derive the asymptotic properties of the weighted LAD estimator, and use Monte Carlo simulation and application to real examples to illustrate its effectiveness.

This is joint work with Avi Giloni and Baskar Sengupta.
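
As a small illustration of the contrast between least squares and LAD described in this abstract, the sketch below fits both to a toy data set with a single outlier in the response. It is not the mixed integer programming approach of the talk: it relies only on the standard fact that, for a straight-line fit, some LAD solution passes through at least two data points, so a brute-force search over pairs is exact for small samples. The function names and the data are hypothetical.

```python
from itertools import combinations


def ols_fit(x, y):
    """Least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lad_fit(x, y):
    """LAD fit of y = intercept + slope * x by exhaustive search.

    Tries every line through two data points with distinct x values and
    keeps the one with the smallest sum of absolute deviations.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yi - intercept - slope * xi) for xi, yi in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


if __name__ == "__main__":
    x = list(range(10))
    y = [1 + 2 * xi for xi in x]
    y[4] = 100  # one gross outlier in the response
    print("OLS (intercept, slope):", ols_fit(x, y))
    print("LAD (intercept, slope):", lad_fit(x, y))
```

In this toy example the LAD line ignores the response outlier while the least squares line is pulled toward it. The example does not address leverage points, which are the weakness of LAD that the talk sets out to repair.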


A Nonstationary Negative Binomial Time Series with Time-Dependent Covariates: Enterococcus Counts in Boston Harbor
Brent Coull, Ph.D.
Department of Biostatistics
Harvard School of Public Health

Boston Harbor has had a history of poor water quality, including contamination by enteric pathogens. We conduct a statistical analysis of data collected by the Massachusetts Water Resources Authority (MWRA) between 1996 and 2002 to evaluate the effects of court-mandated improvements in sewage treatment. Motivated by the ineffectiveness of standard Poisson mixture models and their zero-inflated counterparts, we propose a new negative binomial model for time series of Enterococcus counts in Boston Harbor, where nonstationarity and autocorrelation are modeled using a nonparametric smooth function of time in the predictor. Without further restrictions, this function is not identifiable in the presence of time-dependent covariates; consequently we use a basis orthogonal to the space spanned by the covariates and use penalized quasi-likelihood (PQL) for estimation. We conclude that Enterococcus counts were greatly reduced near the Nut Island Treatment Plant (NITP) outfalls following the transfer of wastewaters from NITP to the Deer Island Treatment Plant (DITP) and that the transfer of wastewaters from Boston Harbor to the offshore diffusers in Massachusetts Bay reduced the Enterococcus counts near the DITP outfalls.

This is joint work with Andy Houseman and Jim Shine.


Fall 2005 Biostatistics Colloquia

Hierarchical Bayesian Analysis of Genetic Diversity in Geographically Structured Populations
Dipak K. Dey, Ph.D.
Department of Statistics
University of Connecticut

Populations may become differentiated from one another as a result of genetic drift. The amounts and patterns of differentiation at neutral loci are determined by local population sizes, migration rates among populations, and mutation rates. We provide exact analytical expressions for the mean, variance and covariance of a stochastic model for hierarchically structured populations subject to migration, mutation, and drift. In addition to the expected correlation in allele frequencies among populations in the same geographical region, we demonstrate that there is a substantial correlation in allele frequencies among regions at the top level of the hierarchy. We propose a hierarchical Bayesian model for inference of Wright's F-statistics. We illustrate the approach through an analysis of human microsatellite data, revealing that approaches ignoring the among population correlation of allele frequencies underestimate the amount of genetic differentiation among major geographical population groups by approximately 50%, and we discuss the implications of these results for the use and interpretation of F-statistics in evolutionary studies. We further provide exact expressions for the first two moments of a stochastic model appropriate for studying microsatellite evolution under the assumption that the range of allele sizes is bounded. Using these results we study the behavior of several measures related to Wright's FST, including Slatkin's RST.


Non- and Semiparametric Modeling in Applications
Naisyin Wang, Ph.D.
Professor of Statistics and Toxicology
Texas A&M University

Due to their flexibility and ease of implementation, various non- and semiparametric models have been used more often in recent biological or medical studies. These models allow underlying trends of the responses to be unspecified and nonparametric. In this talk, I will discuss several recent applications of non- and semiparametric modeling. They include a colon tumorigenesis study which links BCL2 expression with DNA adduct, a membrane protein clustering tendency investigation, and, if time allows, a microarray normalization study to normalize partially degraded mRNA bioarray data. Theoretical support behind the methods will be briefly discussed. I will also use examples and simulations to illustrate the connections between the theoretical findings and their implications in applications.


Modeling Viral Infections
Alan S. Perelson, Ph.D.
Senior Fellow
Los Alamos National Laboratory

I will review basic models of viral infection that have been used to model HIV, hepatitis C virus and influenza infection. I will show how such models can be fit to data to estimate basic parameters describing the viral lifecycle and the effects of antiviral therapy.
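
The abstract does not give equations. As a rough orientation, the sketch below integrates one widely used formulation from the viral dynamics literature, the target-cell-limited model, with uninfected target cells T, infected cells I and free virus V: dT/dt = lam - d*T - beta*T*V, dI/dt = beta*T*V - delta*I, dV/dt = p*I - c*V. Whether this is exactly the form covered in the talk is an assumption, and the parameter names, the parameter values and the simple Euler integrator are illustrative only.

```python
def simulate_viral_dynamics(T0, I0, V0, lam, d, beta, delta, p, c,
                            t_end, dt=0.001):
    """Forward-Euler integration of the target-cell-limited model.

    lam: production rate of target cells, d: their death rate,
    beta: infection rate constant, delta: death rate of infected cells,
    p: virion production rate per infected cell, c: virion clearance rate.
    Returns the state (T, I, V) at time t_end.
    """
    T, I, V = T0, I0, V0
    steps = round(t_end / dt)
    for _ in range(steps):
        infection = beta * T * V
        dT = lam - d * T - infection
        dI = infection - delta * I
        dV = p * I - c * V
        T += dT * dt
        I += dI * dt
        V += dV * dt
    return T, I, V


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to produce a visible infection.
    state = simulate_viral_dynamics(
        T0=1000.0, I0=0.0, V0=1.0,
        lam=10.0, d=0.01, beta=2e-4, delta=0.5, p=100.0, c=3.0,
        t_end=10.0,
    )
    print("T, I, V at day 10:", state)
```

Setting p to zero mimics a perfectly effective drug that blocks virion production, in which case free virus decays roughly exponentially at rate c. Reading parameters off the slope of such a decline is the general idea behind fitting these models to data.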


Strength and Frailty of Frailty Modeling in Population Studies of Aging
Anatoli Yashin, Ph.D.
Center for Demographic Studies
Duke University

In this talk I will review recent results and ideas related to frailty (random effect, or hidden heterogeneity) modeling in population studies of aging and longevity in humans and animals. The models and ideas belong to the areas of survival analysis, biostatistics and genetic epidemiology.

The idea of frailty modeling was initially discussed in demographic and actuarial applications to explain the deceleration and leveling off of human mortality rates at advanced ages. Initially these models were investigated without taking observed covariates into account. Non-identifiability is a crucial feature of such models, which substantially restricts their applications. Later such models were applied to the analysis of data from stress experiments with laboratory animals. It turns out that the presence of data from experimental and control groups allows us to solve the identifiability problem. I will define basic models of this class and discuss their strengths and limitations.

Then I will introduce an extension of frailty models to the case with observed covariates. Such models were initially developed in econometrics and biostatistics. In contrast to frailty models without observed covariates, these models are identifiable. This feature motivated the development of statistical methods which allow for evaluation of the role of hidden frailty in estimated effects of observed covariates on survival.

The concept of shared frailty emerged in response to the epidemiological idea related to the design of matched pair experiments. The development of such models was accompanied by a number of confusions indicating that the concept of shared frailty was not well understood. I will discuss the origin of such confusions and approaches capable of avoiding them. One such approach deals with the idea of correlated frailty. I will introduce the correlated frailty models, discuss their properties and elucidate applications of these models to genetic studies of human aging and longevity. I will also discuss applications of these models to the analysis of the dependent competing risks problem as well as directions for further research.


Spring 2005 Biostatistics Colloquia

What Does a Bayesian Approach Offer in Clinical Research?
Donald A. Berry, Ph.D.
Department of Biostatistics and Applied Mathematics
The University of Texas MD Anderson Cancer Center

My presentation is in two parts. First I argue that almost all statistical analyses are wrong, regardless of their philosophical underpinnings! Bayesian analyses are especially susceptible to erroneous conclusions. A sufficiently rigorous frequentist approach is immune. But it comes with heavy baggage that slows progress. In attempting to lighten the load, the second part of my presentation addresses Bayesian innovations in clinical trials, with particular focus on design. There is renewed interest in a greater appreciation of the benefits of using a Bayesian approach in medical research. I will describe some of these benefits and relate them to modern attitudes in pharmaceutical and medical device development, and to attitudes in cancer cooperative groups and at my home institution. Of special importance are the uses of (i) flexible, adaptive designs, (ii) predictive probabilities, and (iii) hierarchical modeling.


Modeling HIV-1 Drug Resistance and Fitness
John Mittler, Ph.D.
Department of Microbiology
University of Washington

Drug resistance is a major obstacle to the successful treatment of human immunodeficiency virus type 1 (HIV-1) infection. Viral fitness strongly influences the within-patient frequency of drug resistant mutants both in the presence and absence of therapy. We have created models for both drug resistance (IC50 values) and viral fitness in the absence of drug. To estimate IC50 values, we used standard stepwise linear regression to construct drug resistance models for 7 protease inhibitors and 10 reverse transcriptase inhibitors using data obtained from the Stanford HIV drug resistance database. We evaluated these models by hold-one-out experiments and by tests on an independent dataset. Our linear model outperformed other publicly available genotypic interpretation algorithms, including decision tree, support vector machine, and four rules-based algorithms (HIVdb, VGI, ANRS and Rega) under both tests. Interestingly, our model did well despite the absence of any terms for interactions between different residues in protease or reverse transcriptase. The resulting linear models are easy to understand, and can potentially assist in choosing combination therapy regimens. To test our ability to predict viral fitness in the absence of drug, we have used an all-atom distance-dependent conditional probability discriminatory function (RAPDF), a function that has been used successfully in protein structure prediction, to estimate pseudoenergies for 132 HIV-1 protease flap-region mutants whose cleavage rates had been determined experimentally. Although individual discrepancies were noted, the overall correlation between RAPDF scores and experimentally determined cleavage rates was excellent (r = 0.93 for binned data). Our RAPDF function was particularly good at identifying mutants with very low fitness, with the 15 mutants with the lowest RAPDF scores all having undetectable cleavage rates. 
Progress in predicting IC50 and viral fitness values may lead to improved strategies for treating HIV-1 patients.


Criteria for Evaluating Models of Absolute Risk
Mitchell H. Gail, M.D., Ph.D.
Division of Cancer Epidemiology and Genetics
National Cancer Institute

Absolute risk is the probability that an individual who is free of a given disease at an initial age, a, will develop that disease in the subsequent interval (a, t]. Absolute risk is reduced by mortality from competing risks. Models of absolute risk that depend on covariates have been used to design intervention studies, to counsel patients regarding their risks of disease, and to inform clinical decisions, such as whether or not to take tamoxifen to prevent breast cancer. Several general criteria have been used to evaluate models of absolute risk, including how well the model predicts the observed numbers of events in subsets of the population ("calibration"), and "discriminatory power", measured by the concordance statistic (e.g. Rockhill et al., J Natl Cancer Inst, 93, 358-366, 2001). In this paper we review some general criteria and develop specific loss function-based criteria for two applications, namely whether or not to screen a population to select subjects for further evaluation or treatment and whether or not to use a preventive intervention that has both beneficial and adverse effects. We find that high discriminatory power is much more crucial in the screening application than in the preventive intervention application. These examples indicate that the usefulness of a general criterion such as concordance depends on the application, and that using specific loss functions can lead to more appropriate assessments.
This is joint work with Ruth M. Pfeiffer, Ph.D.
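
To make the definition of absolute risk concrete, the sketch below evaluates it in the simplest possible setting, assuming constant hazards for the disease and for competing mortality over (a, t]. That constant-hazard assumption, the function name and the numerical values are illustrative and are not taken from the talk; covariate-dependent models of the kind discussed in the abstract are considerably richer.

```python
import math


def absolute_risk(h_disease, h_death, a, t):
    """Probability of developing the disease in (a, t] under constant hazards.

    h_disease: cause-specific hazard of the disease (per year).
    h_death:   hazard of death from competing causes (per year).
    The integral of h_disease * exp(-(h_disease + h_death) * (u - a))
    over u in (a, t] has the closed form used below.
    """
    total = h_disease + h_death
    if total == 0:
        return 0.0
    return (h_disease / total) * (1.0 - math.exp(-total * (t - a)))


if __name__ == "__main__":
    # Hypothetical hazards: 1% per year for disease, with and without
    # a 3% per year competing mortality hazard, from age 50 to age 60.
    print(absolute_risk(0.01, 0.00, 50, 60))
    print(absolute_risk(0.01, 0.03, 50, 60))
```

Comparing the two printed values shows the point made in the abstract: with the disease hazard held fixed, adding competing mortality lowers the absolute risk.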


Microarray studies: Can they be reproduced? Can they be combined?
Giovanni Parmigiani, Ph.D.
Johns Hopkins University
Bloomberg School of Public Health

Investigations of transcript levels on a genomic scale using hybridization-based arrays led to formidable advances in our understanding of the biology of many human illnesses. At the same time, these investigations have generated controversy, because of the probabilistic nature of the conclusions, and the surfacing of noticeable discrepancies between the results of studies addressing the same biological question. In this lecture I will present simple exploratory data analysis tools for gauging the degree to which the findings of one study are reproduced by others, and for integrating multiple studies in a single analysis. I will describe these approaches in the context of studies of both lung and breast cancer. The main conclusion of our work to date is that it is possible to identify a substantial, biologically relevant, subset of the human genome within which hybridization results are reproducible. The subset generally varies with the platform used, the tissues studied, and the populations being sampled. Despite important differences, it is also possible to develop simple expression measures that allow comparison across platforms, studies, labs and populations. While these are not perfect, important biological signal is often preserved or enhanced. Cross-study validation and combination of microarray results requires careful, but not overly complex, statistical thinking, and can become a routine component of genomic analysis.

Design and Analysis of Microarray Assays for Defining Predictive Gene Expression Signatures
Lutz Edler, Ph.D.
German Cancer Research Center
Heidelberg, Germany

Functional genomic data, in particular data on gene expression levels in patient blood or tissue samples obtained from microarrays, have nourished the desire of clinicians as well as the pharmaceutical industry to make use of this wealth of data for the development of new and better targeted drugs. When these high-dimensional data are going to be collected in the course of a clinical trial, new questions arise on how to design and analyze the trial such that ambitious questions on gene expression can be answered properly. The prediction of clinical phenotypes such as tumor class, drug response, toxicity, and even survival by a small set of predictive genes, a so-called gene expression signature or profile, has become an important issue of clinical research, naturally coupled with the question of the best treatment for a subgroup of patients which has been defined by such a gene expression signature (GES). This question actually combines two tasks: the definition of the GES and its validation as a prognostic factor for the chosen clinical endpoint, and the determination of a sensitive subgroup defined through that GES which can then be evaluated for treatment effects.

This presentation discusses statistical methods for the determination of predictive factors. Methods of class prediction/prognostic prediction are reviewed and results from recent comparative studies presented. The need to distinguish carefully between feature selection, construction of the predictor and assessment of the performance of the predictor is emphasized. The positive predictive value, better known from diagnostic studies, is used to design a clinical trial when treatment response is chosen as the predictive endpoint of the GES. Logistic regression is an appropriate analysis method to account for additional prognostic factors. By a simulation with a logistic model where treatment and GES are independent variables one can calculate sample sizes for the logistic regression and compare them with the sizes obtained with the predictive value. These methods were applied in a research project on neo-adjuvant chemotherapy of breast cancer patients. Forthcoming challenges on the combined use of genomic and proteomic data and questions of the validation of findings will be illustrated by examples.


 

Fall 2004 Biostatistics Colloquia

A Class of Bayesian Box-Cox Transformation Hazard Regression Models
Joseph Ibrahim, Ph.D.
Professor, Department of Biostatistics
UNC School of Public Health, University of North Carolina at Chapel Hill

We propose a novel and general class of Box-Cox transformation models on the hazard functions for right censored survival data. This new class of models allows a very broad range of shapes and relationships between the baseline hazard and the hazard function. It includes the Cox proportional hazards model and the additive hazards model as two special cases. Several properties of the model are derived, and interpretations as well as illustrations of the behavior of the Box-Cox transformation parameter are provided. A novel class of joint prior distributions is proposed for the model parameters. Due to the requirement of a positive hazard function in the survival model, complex multidimensional nonlinear parameter constraints must be imposed in the model formulation. As a result, computations for this new Bayesian model pose many new challenges. We propose an efficient Markov chain Monte Carlo (MCMC) computational scheme for sampling from the posterior distribution of the parameters. The proposed prior distributions facilitate a tractable computational algorithm. The joint priors are constructed through a conditional-marginal specification, in which the conditional distribution is univariate and absorbs all of the non-linear parameter constraints. The marginal part of the prior specification is free of any constraints. This novel class of prior distributions allows us to easily compute the full conditionals needed for Gibbs sampling, incorporating the constraints, and hence implement the Markov chain Monte Carlo algorithm in a relatively straightforward fashion. This new class of models is illustrated with a detailed simulation study as well as a real dataset involving a melanoma clinical trial. Extensions to frailty models and cure rate models are discussed.


Statistical Comparison of Medical Images
Eugene Demidenko, Ph.D.
Associate Professor, Section of Biostatistics and Epidemiology
Dartmouth Medical School

Imaging technology has become an essential tool of biomedical research. On the other hand, issues of statistical image analysis and particularly image comparison are underdeveloped. Today, it is hard to publish a paper without providing a p-value when comparing two treatment groups. When it comes to image comparison, researchers just show several arbitrarily picked images to illustrate their findings. We develop a statistical theory of content independent image comparison based on the multinomial distribution of 256 gray level intensities. This statistical model is suitable for medical microscopic images frequently emerging in cellular and molecular biology. Parametric tests, such as the likelihood-ratio test, or nonparametric tests, such as the Kolmogorov-Smirnov test, are applied. We generalize our tests to ensembles of images to account for biological heterogeneity via a mixed effects approach (E. Demidenko. Mixed Models: Theory and Applications, Wiley, Hoboken, NJ, 2004). The advantage of our approach is that images may be adjusted for patient age, gender, experimental conditions, etc. We illustrate our analysis by comparison of cancer cell images from four treatment groups.
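
The content-independent comparison described above treats each image as a sample of gray levels and ignores where the pixels are. As a minimal sketch of that idea, the code below computes the two-sample Kolmogorov-Smirnov statistic between the gray-level distributions of two images, each given as a flat list of integer intensities in 0..255. It is not the speaker's procedure: the function names and toy images are hypothetical, no p-value is computed, and the mixed effects extension to ensembles of images is not attempted.

```python
def gray_histogram(pixels, levels=256):
    """Count how many pixels fall at each gray level 0 .. levels - 1."""
    counts = [0] * levels
    for value in pixels:
        counts[value] += 1
    return counts


def ks_statistic(pixels_a, pixels_b, levels=256):
    """Two-sample Kolmogorov-Smirnov statistic between two images.

    Returns the largest absolute difference between the two empirical
    cumulative distributions of gray levels.
    """
    hist_a = gray_histogram(pixels_a, levels)
    hist_b = gray_histogram(pixels_b, levels)
    n_a, n_b = len(pixels_a), len(pixels_b)
    cum_a = cum_b = 0
    largest_gap = 0.0
    for level in range(levels):
        cum_a += hist_a[level]
        cum_b += hist_b[level]
        largest_gap = max(largest_gap, abs(cum_a / n_a - cum_b / n_b))
    return largest_gap


if __name__ == "__main__":
    # Two hypothetical 100-pixel images, flattened to lists of intensities.
    dark = [10] * 50 + [20] * 50
    bright = [10] * 50 + [200] * 50
    print(ks_statistic(dark, bright))
```

Because only the histogram enters the statistic, rearranging the pixels of either image leaves the result unchanged, which is what "content independent" means here.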


Biostatistical Challenges in Molecular Epidemiology
William D. Shannon, Ph.D.
Associate Professor of Biostatistics in Medicine
Division of General Medical Sciences and Biostatistics
Washington University School of Medicine

Epidemiology is the study of the distribution and size of disease problems in human populations, in particular to identify etiological factors in the pathogenesis of diseases and to provide the data essential for the management, evaluation and planning of services for the prevention, control and treatment of disease (Everitt). Molecular epidemiology uses molecular biology to identify the etiological factors, and is a growing and important area of biomedical research.

Molecular epidemiology presents new challenges to data analysts. Modern molecular biology can measure tens of thousands of molecular variables rapidly and cheaply (e.g., gene chips measure the activity of tens of thousands of genes, genotyping is routinely done at hundreds or thousands of markers, and proteomics has the potential of characterizing the entire protein content of tissues). The limiting step in molecular epidemiology is the small number of human subjects these measurements are made on (the large P, small N problem).

In this talk I address three statistical problems faced when analyzing molecular epidemiology data. The first problem is the proper identification of patient subgroups within which statistical tests of genotype-phenotype association should be applied. The second problem is the testing of clinical covariates against a large number of molecular variables. The third problem is the selection of important molecular factors related to disease. While these problems can be defined in the language of classical statistics (i.e., population stratification, over-determined systems, and variable selection, respectively), classical statistics will fail due to the 'large P, small N' problem (and there ain't no getting around that!).

I will argue that new ways of thinking about statistics will be needed for this data.

http://ilya.wustl.edu/~shannon/UnivRochester.ppt
http://ilya.wustl.edu/~shannon/DIMACsPaperSubmitted.pdf


Quantile Volcano Plots for Identifying Significant Genes in Microarray Data
William D. Shannon, Ph.D.
Associate Professor of Biostatistics in Medicine
Division of General Medical Sciences and Biostatistics
Washington University School of Medicine

Quantile Volcano Plots are proposed as a modification to standard Volcano Plots to improve identification of genes from microarray experiments with statistical and biological significance. Standard Volcano Plots declare genes to have significantly different expression between two sample types based on both biological difference (absolute log2(estimated fold change) greater than some arbitrary threshold) and statistical difference (-log10(P value) greater than some arbitrary threshold). Quantile Volcano Plots improve this method by fitting a quantile regression curve to the null distribution of the standard Volcano Plot data and declaring genes significant based on their relationship to this curve. Since the quantile regression curve adapts to the shape of the data, this method avoids the use of arbitrary constant thresholds for deciding which genes are differentially expressed. In this talk I will describe the algorithm and illustrate its use with pharmacogenomic microarray data.

http://ilya.wustl.edu/~shannon/QVPTalk.ppt
http://ilya.wustl.edu/~shannon/QuantileVolcanoPlot.pdf


Mixed-Effects Models for Ordinal Data with Scaling Terms
Donald Hedeker, Ph.D.
Professor of Biostatistics
Division of Epidemiology and Biostatistics
School of Public Health, University of Illinois at Chicago

Mixed-effects logistic regression models are described for analysis of two-level ordinal outcomes, where observations are nested within clusters. Random effects are included in the model to account for the correlation of the clustered observations. This correlation can be the same for all clusters or allowed to vary by groups of clusters. Additionally, whereas the usual logistic model assumes that the covariate effects are the same across the cumulative logits (i.e., proportional odds assumption), we describe two extensions to relax this assumption. The first permits separate covariate effects to be estimated for each of the C-1 cumulative logits (where C = number of ordered categories). The second extension instead allows covariates to influence the scale of the ordinal response, in addition to their usual influence on the location. This latter extension can be more parsimonious since it adds only one parameter for each covariate. Additionally, it can be used to partition the degree of within- and between-cluster variance. For implementation, a maximum marginal likelihood (MML) solution is described. An analysis is presented of a dataset from an adolescent smoking study, highlighting and comparing these extensions of the proportional odds mixed model.
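
To make the location and scale roles concrete, here is a minimal sketch of category probabilities under a cumulative logit model with a scaling term. It assumes the common parameterization P(Y <= c) = logistic((threshold_c - location) / exp(log_scale)), which may differ from the one used in the talk. Random effects and estimation are omitted, and all names are hypothetical.

```python
import math


def category_probabilities(thresholds, location, log_scale=0.0):
    """Probabilities of the C ordered categories given C-1 increasing thresholds.

    location:  linear predictor shifting the response (covariate effects).
    log_scale: linear predictor for the log of the scale; 0.0 recovers the
               ordinary proportional odds model (assumed parameterization).
    """
    scale = math.exp(log_scale)
    cumulative = [
        1.0 / (1.0 + math.exp(-(gamma - location) / scale))
        for gamma in thresholds
    ]
    bounds = [0.0] + cumulative + [1.0]
    # Successive differences of the cumulative probabilities.
    return [upper - lower for lower, upper in zip(bounds, bounds[1:])]


baseline = category_probabilities([-1.0, 0.0, 1.0], location=0.0)
wider = category_probabilities([-1.0, 0.0, 1.0], location=0.0,
                               log_scale=math.log(2.0))
```

With the larger scale, more probability moves to the extreme categories while the location stays the same. The scale effect costs a single extra parameter per covariate, which is the parsimony the abstract mentions.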


Autoregression and Measurement Error
John Staudenmayer, Ph.D.
Department of Mathematics and Statistics
University of Massachusetts Amherst

Motivated by common experimental designs and models in ecology, we consider the problem of a time series that has been observed with measurement error. Focusing on autoregressive models and additive measurement error, we derive the biases that are caused by ignoring the measurement error. After that, we develop some new methods to correct for the effects of measurement error. The new methods take advantage of estimates of the measurement error model's parameters that commonly are available in applications. The new methods are based on estimating equations and pseudo-maximum likelihood. Asymptotic comparisons and small sample simulations demonstrate (not surprisingly) that new methods that use estimates of the measurement error model's parameters are much more efficient than existing methods that (somewhat surprisingly) do not. There is little difference between the simple estimating equation approach and the more complicated pseudo-maximum likelihood approach. Time permitting, we will also talk about the effect of measurement error on second order bias. This is joint work with John Buonaccorsi.
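
The bias from ignoring measurement error can be illustrated in the simplest case, a stationary AR(1) series observed with independent additive error. The sketch below gives the large-sample limit of the naive lag-1 autocorrelation and a simple moment correction that uses a known error variance. This is a textbook calculation under those assumptions, not the estimating-equation or pseudo-maximum likelihood methods of the talk, and the function names are hypothetical.

```python
def naive_ar1_limit(phi, innovation_var, error_var):
    """Limit of the naive lag-1 autocorrelation of W_t = X_t + U_t.

    X_t is a stationary AR(1) with coefficient phi (|phi| < 1) and
    innovation variance innovation_var; U_t is independent noise with
    variance error_var.
    """
    signal_var = innovation_var / (1.0 - phi ** 2)  # Var(X_t)
    return phi * signal_var / (signal_var + error_var)


def corrected_phi(naive_estimate, observed_var, error_var):
    """Undo the attenuation, given an estimate of the error variance."""
    return naive_estimate * observed_var / (observed_var - error_var)


# Example: true phi = 0.5, Var(X_t) = 1, measurement error variance = 1.
attenuated = naive_ar1_limit(phi=0.5, innovation_var=0.75, error_var=1.0)
recovered = corrected_phi(attenuated, observed_var=2.0, error_var=1.0)
```

In this example the naive estimate converges to half of the true coefficient, and the correction depends entirely on having an estimate of the error variance.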


 

Spring 2004 Biostatistics Colloquia

Information Mining and Services Research: It's not computing prowess alone
Siddhartha R. Dalal, Ph.D.
Vice President, Imaging and Services Technology Service Center
Xerox Corporation

With the need for a tremendous amount of information processing, information technology is the fastest growing area, affecting almost every facet of human life. In spite of impressive gains, there are still many basic technology and business challenges in Information Mining and Services Research that cannot be solved by computational prowess alone. I will describe Information Mining and Services Research and discuss examples of the challenges involving Search Engine technologies, Imaging Science and Software Engineering. On the surface, traditional information theoretic considerations do not offer solutions. Accordingly, researchers looking for conventional solutions would have difficulty in solving these problems. I will describe how alternative information sciences based formulations have played a critical role in addressing these problems.

Biographical Sketch:
Siddhartha Dalal is Vice President of Imaging and Services Technology Center (ISTC) at Xerox. ISTC's staff of world-class scientists and engineers creates Xerox's benchmark digital imaging technologies and document solutions. Prior to Xerox, Sid started his industrial research career in the Math Research Center at Bell Labs, and worked at Bellcore as a Chief Scientist and Telcordia Technologies as an Executive Director. Sid's past research has focused on information extraction, analysis and services. He has published over 70 research papers and has coauthored two reports on Software Engineering on behalf of the National Academy of Sciences. He has an MBA in Marketing (1973) and a PhD in Statistics (1975) from the University of Rochester.


Likelihood Ratio Tests That Certain Variance Components Are Zero
Ciprian Crainiceanu, Ph.D.
Visiting Professor
School of Operations Research and Industrial Engineering
Cornell University

We consider the problem of testing null hypotheses that include the constraint that some specified variance components are zero in a Linear Mixed Model (LMM). The finite sample and asymptotic distributions of the Likelihood Ratio Test (LRT) and Restricted Likelihood Ratio Test (RLRT) statistics are derived for LMMs with one random effects variance component. A parametric bootstrap approach is recommended for LMMs with more than one variance component. In particular, the large sample chi-square mixture approximation of these distributions using the usual asymptotic theory (e.g., Self and Liang) for a parameter on the boundary is shown to be inadequate for this problem.

We discuss possible applications such as testing for subject effect in one-way ANOVA models and linear or nonlinear regression against a general alternative modeled by penalized splines. Extensions to testing semiparametric versus nonparametric regression models are presented. Results apply to virtually all types of basis function used in nonparametric statistics (truncated polynomials, B-splines, trigonometric polynomials, etc.) and for any type of quadratic penalty.


Modeling Prostate Cancer Incidence
Aniko Szabo, Ph.D.
Huntsman Cancer Institute
University of Utah

The introduction of a screening regimen changes disease history, presentation and survival in many ways. A striking example of this phenomenon is prostate cancer screening. Prostate cancer is one of the most common cancers in American men. PSA screening for prostate cancer has been available since the late 80s, and prostate cancer mortality has been decreasing since the early 90s, leading one to hypothesize a causal link. Surprisingly, the benefit of PSA screening still has not been conclusively established or quantified. I will talk about the various effects of screening in general and present a statistical model of prostate cancer incidence that allows us to estimate these effects for PSA screening.


Haplotype Inference, Genotyping Uncertainty, and Disease Mapping
Jun Liu, Ph.D.
Department of Statistics
Harvard University

Haplotypes have become increasingly popular because of the abundance of single nucleotide polymorphisms (SNPs) and the limited power of single-locus analyses. Since experimental procedures for determining haplotype phases for an individual are expensive, many computational methods have been developed to infer haplotypes from genotype data of a group of unrelated individuals. In the past few years, our group has developed the partition-ligation idea for handling data with a large number of SNP markers for each individual. The idea is to partition the whole haplotype into smaller segments. Then we use either the Gibbs sampler or the EM algorithm to construct the partial haplotypes of each segment and to assemble these segments together.

This talk will review some of these haplotype inference models and algorithms, discuss issues and problems in human haplotype structures (e.g., haplotype blocks), and examine the impact of haplotype inference on linkage disequilibrium (LD) mapping of disease mutations. We found that haplotype inference should be carried out jointly with the LD mapping model to achieve the most accurate location estimation.


Testing and adjusting for dependent truncation
Rebecca Betensky, Ph.D.
Department of Biostatistics
Harvard School of Public Health

Randomly truncated survival data arise when the failure time is observed only if it falls within a subject-specific truncating set. Available estimators of the survival function and regression models rely on the key assumption that the joint density of failure and truncation times factors into a product proportional to the individual densities in the observable region. This assumption of quasi-independence may be tested to determine whether standard estimation methods apply. I will describe tests for complex truncation schemes including double truncation and bivariate left-truncation. In addition, I will describe semiparametric structural models for survival analysis that are applicable when quasi-independence does not hold. The aim of these models is to estimate the survival function or the association between failure time and covariates, while accounting for dependent truncation. I will illustrate the methods using several real data sets.


Systems Biology of the Drosophila blastoderm: What can we learn?
John Reinitz, Ph.D.
Department of Applied Mathematics and Statistics
Stony Brook University

A central problem in developmental biology is to understand the dynamics of the determination of a morphogenetic field. This process entails the expression of genes in precise spatial patterns, and is a consequence of transcriptional control, itself a central problem of molecular biology. Spatially controlled gene expression cannot as yet be assayed in microarrays, but certain special properties of the fruit fly Drosophila which make it a premier system for developmental genetics also enable it to be used as a naturally grown differential display system for a systems biology analysis of segment determination and transcriptional control. We are analyzing these problems in the early Drosophila embryo using a combination of experiment, theory, and large scale numerical computation.

In the course of this analysis, we have obtained quantitative gene expression data of unprecedented spatial and temporal resolution. These data show that expression domains in the posterior portion of the embryo move anteriorly during the blastoderm stage of development. I will report on the results of a dynamical analysis of this phenomenon that shows that it is incompatible with the positional information model of Wolpert. In addition, I will present some new results in other areas of investigation.


 

Fall 2003 Biostatistics Colloquia

Distribution-based marginal regression models for longitudinal data
Jianhua Huang
Department of Statistics
University of Pennsylvania

The increasing popularity of longitudinal studies in clinical trials and epidemiological studies has made statistical methodology capable of handling repeated measurements an intensive subject of investigation for the past two decades. By allowing the subjects to be repeatedly measured over time, longitudinal studies are well suited for investigating the temporal trends of the outcomes and the covariates. Currently, most of the statistical methods that have been developed for this type of data, most notably the generalized estimating equations and the mixed-effects models, are focused on modeling the conditional mean of a repeatedly measured outcome variable given time and a set of covariates through regression.

Although successful in many applications, the "conditional mean-based regression" approach is potentially inadequate when the conditional mean is an inappropriate measure for the scientific question being investigated. Such situations may arise when (a) the outcome variable has a highly skewed or non-Gaussian conditional distribution whose characteristics cannot be well captured by the mean, or (b) the outcome variable has ordinal scales or a mixed distribution, so that its mean does not have a meaningful interpretation.

In this talk, we will discuss a class of marginal regression models for longitudinal data based on conditional distributions, which provides an alternative to the traditional conditional mean-based regression. The focus of the talk will be on the two sample problems. More general cases involving arbitrary covariates will only be sketched.


Non-Parametric, Hypothesis-Based Analysis of Molecular Heterogeneity for Comparative Phenotype Characterization
Jeanne Kowalski
Assistant Professor of Oncology and Biostatistics
Johns Hopkins Kimmel Cancer Center, Johns Hopkins University

Advances in technology have led to an explosion of molecular research in many fields. Oncology researchers study molecular markers for diagnostic tools by relating expressions from thousands of genes to cancer status, while HIV researchers study drug resistance by relating genetic mutations to altered drug susceptibility. Both tasks include statistical issues of high dimensionality coupled with small sample sizes and thus preclude formal hypothesis testing based on conventional principles.

In this talk, I describe two novel, inference-based approaches to analysis of molecular heterogeneity associated with phenotypes. A common theme among them is the construction of testable hypotheses with assumptions that reflect the complex structure of genetic data. With a modest sample, I discuss a distance-based approach to analysis of genetic heterogeneity based on population sequence data. With the extreme case of several single samples that are to be compared from a microarray experiment, I introduce a stochastic linear hypothesis approach to estimate a number of genes that meet several criteria, beyond experimental variation. In each setting, I also discuss bioinformatics approaches to characterize genes or locations and mutation patterns that depict phenotypes. As motivation for the methods, I examine two separate problems, one for relating differences in a region of the HIV genome to drug resistance, and a second for relating gene expressions with hypothesized pathways for immunogenetic analysis of T cells.


Parameter Estimation for Stochastic Systems
Peter W. Glynn
Thomas W. Ford Professor
School of Engineering
Management Science & Engineering Department, Stanford University

Stochastic models often give rise to difficult parameter estimation problems. These problems can be both analytically and computationally challenging. In this talk, we will discuss several mathematical and computational issues that arise in this setting, and describe some of the theory and algorithms that are appropriate to solving such problems.


Local Likelihood Density Estimation for Interval Censored Data
W. John Braun
Associate Professor
Department of Statistical & Actuarial Sciences
University of Western Ontario

We propose a class of local likelihood density estimates for data that is either interval-censored or has been aggregated into bins. One member of this class retains the simplicity and intuitive appeal of the usual kernel density estimate for complete data. It results from an algorithm that generalizes the self-consistency algorithms of Efron (1967), Turnbull (1976), and Li et al. (1997) by introducing kernel smoothing at each iteration. Intuition suggests this is unlikely to perturb algorithms known to converge; however, establishing convergence for the class proceeds from implementation of an estimator as a Newton iteration. Newton iteration for the class requires an explicit solution of the local likelihood equations which, when not directly available, can be found by using symbolic Newton-Raphson (Andrews and Stafford 2000).

The entire class results from a local EM approach using the methods of Loader (1996) and Hjort and Jones (1996), who propose local likelihood density estimates for complete data. We focus on local polynomial expansions of the log density that offer adjustments having the potential to reduce bias at peaks and endpoints. Use of the methods for smoothing histograms and for scatterplot smoothing is considered. The methods are applied to HIV data, where interval censoring is common, and to the Ontario health survey, where data has been aggregated into bins.

This is joint work with Thierry Duchesne at Université Laval and Jamie Stafford at the University of Toronto.


Spring 2003 Biostatistics Colloquia

Multivariate regression models for estimating global exposure effects
Jason Roy
Brown University

Nonparametric Regression Methods for Longitudinal Data Modeling with Applications in AIDS Clinical Trials
Hulin Wu
Frontier Science & Technology Research Foundation and Center for Biostatistics in AIDS Research (CBAR) Harvard School of Public Health

Longitudinal data such as repeated measurements taken on each of a number of subjects arise frequently in many clinical and biomedical studies. Parametric mixed-effects models such as linear mixed-effects (LME) models (Laird and Ware 1982, Diggle, Liang and Zeger 1994) and nonlinear mixed-effects (NLME) models (Davidian and Giltinan 1995, Vonesh and Chinchilli 1996) have been widely used in longitudinal data analysis. However, in many cases the parametric models may not be available or the parametric assumption may not be reliable, so nonparametric regression techniques need to be developed for longitudinal data modeling and analysis. In this talk, I will introduce the mixed-effects modeling idea into the local polynomial smoothing approach to deal with the special correlation structure of longitudinal data. We can show that the proposed estimators are more powerful and efficient compared to the standard working-independence estimators. Our modeling strategy accounts for the within-subject and between-subject variations of the longitudinal data in a natural way, and we can obtain the estimate of the population profile as well as the individual profiles using the empirical Bayes method. The bandwidth selection strategies will be discussed. The asymptotic theories of our population estimators are established. Simulation studies are conducted to illustrate the efficiency of the proposed estimators. We apply the proposed methods to an AIDS clinical trial for modeling the repeated measurements of two biomarkers, plasma HIV RNA copies and CD4 cell counts.

If time permits, I will also briefly mention my research in computational biology/bioengineering, modeling HIV RNA/immune cell dynamics and AIDS clinical trial simulations.

Asymptotic Distribution-free Confidence Intervals for a New Measure of Bivariate, Partial and Multiple Correlation
Douglas Bonett
Statistical Laboratory and Department of Statistics
Iowa State University of Science and Technology

Transitive Functional Annotation by Shortest Path Analysis of Gene Expression Data
Jasmine (Xianghong) Zhou
Department of Biostatistics
Harvard School of Public Health

Current methods for the functional analysis of microarray gene expression data make the implicit assumption that genes with similar expression profiles have similar functions in cells. However, among genes involved in the same biological pathway, not all gene pairs show high expression similarity. Here, we propose that transitive expression similarity among genes can be used as an important attribute to link genes of the same biological pathway. Based on large-scale yeast microarray expression data, we use the shortest-path analysis to identify transitive genes between two given genes from the same biological process. We find that not only functionally related genes with correlated expression profiles are identified but also those without. In the latter case, we compare our method to hierarchical clustering, and show that our method can reveal functional relationships among genes in a more precise manner. Finally, we show that our method can be used to reliably predict the function of unknown genes from known genes lying on the same shortest path. We assigned functions for 146 yeast genes that are considered as unknown by the Saccharomyces Genome Database and by the Yeast Proteome Database. These genes constitute around 5% of the unknown yeast ORFome.
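
The shortest-path step can be sketched with Dijkstra's algorithm on a gene graph. The abstract does not say how edge weights are derived from expression similarity, so the weights below are an assumption (smaller weight meaning more similar profiles, for instance one minus the absolute correlation), and the gene names are made up.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbor: weight >= 0}.

    Returns the list of nodes from source to target, or None if unreachable.
    """
    best = {source: 0.0}
    previous = {}
    visited = set()
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# geneA and geneC are only weakly similar to each other, but both are
# strongly similar to geneB (hypothetical weights).
gene_graph = {
    "geneA": {"geneB": 0.1, "geneC": 0.9},
    "geneB": {"geneA": 0.1, "geneC": 0.2},
    "geneC": {"geneA": 0.9, "geneB": 0.2},
    "geneD": {},
}
path = shortest_path(gene_graph, "geneA", "geneC")
```

Here the path runs through geneB, which plays the role of a transitive gene linking two genes whose direct expression similarity is low.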

A Statistical Method for Identifying Informative Genes in Microarrays
James Yang
University of Florida

DNA microarrays can be used to monitor thousands of gene expressions in a single experiment. Statistical analysis on microarray data provides genetics researchers a scientific approach to answering research questions. In this talk, a cost-effective method of making microarrays and reading microarray data will be presented. Statistical methods to solve the following three primary methodological problems in microarray data analysis are proposed: (1) identify differentially expressed genes; (2) estimate the expression difference; and (3) determine the sample size.

This talk provides a comprehensive review of statistical methods for identifying differentially expressed genes in two-condition microarray experiments. Following this review, a new method is proposed to select informative genes. Simulation experiments and statistical analysis on real data were conducted to compare the proposed method with commonly used methods. The results indicate that the proposed gene selection method did better than commonly used methods.

To estimate the gene expression differences under different conditions, a new method has been developed in this study. The estimator is proved to be consistent.

This study investigates a practically important yet relatively unexplored issue: sample size determination. A new statistical method is developed and compared with two existing methods.

A semigroup representation and asymptotic behavior of the Fisher-Wright-Moran coalescent
Marek Kimmel
Rice University

Interval-censoring, Medical Research and Statistical Methods
Tony Sun
Department of Statistics, University of Missouri-Columbia

Interval-censoring is a type of censoring mechanism that biostatisticians often have to face. This talk will discuss interval-censoring problems that often occur in medical research, with a focus on AIDS studies. In particular, the first part of the talk will review several types of interval-censoring that we usually have to deal with and the situations that give rise to them. In the second part of the talk, I will consider a particular type of interval-censoring that frequently occurs in longitudinal studies and may be informative about the response under study. Some statistical methods for inference are presented and the properties of the proposed methods are established.

Functional Response Models and their Applications to Psychosocial Research
Xin M. Tu
Department of Biostatistics and Epidemiology, University of Pennsylvania Medical Center

We introduce a new class of semi-parametric (distribution-free) regression models with functional responses. This class of functional response models (FRM) generalizes the traditional regression models by defining the response variable as a function of several responses from multiple subjects. By using such multiple-subjects-based responses, the FRM not only integrates some popular non- and semi-parametric approaches within a unified modeling framework, but also provides a platform for developing new models for addressing limitations of existing non- and semi-parametric models. For example, by viewing the popular non-parametric two-sample Mann-Whitney-Wilcoxon (MWW) test as a regression under FRM, we can readily generalize it to account for multiple groups and to examine second-order variability of the distributions (MWW is based on comparing the median or first-order variability between two distributions), the latter of which is an important consideration for effectiveness studies. The FRM is also quite effective in addressing limitations of parametric models. For example, latent variable models such as the linear mixed-effects model (LMM) and the structural equation model (SEM) are popular in psychosocial research. By developing new semi-parametric approaches under FRM, we can provide robust estimates for both the population and cluster specific parameters. In addition, these new models can even entertain interactions of random effects, which are difficult to implement under existing inference theory. Because of the dependency introduced by using multiple subjects in defining the response variable, existing generalized estimating equation (GEE) based approaches are not appropriate for making inference about FRM. A novel approach is developed to address the dependence issue by integrating the U-statistic theory with the GEE.
The methodology is illustrated with a real data application in psychosocial research involving modeling correlated correlations within a longitudinal data setting.
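
The idea of a response defined on more than one subject can be illustrated with the Mann-Whitney-Wilcoxon statistic written as a two-sample U-statistic: the response for a pair of subjects, one from each group, is the indicator that the first outcome is smaller than the second. The sketch below only computes that average. It is not the FRM estimation procedure, and the function name and the tie-handling convention are assumptions.

```python
def mww_u_statistic(xs, ys):
    """Average of the pairwise response over all cross-group pairs.

    Estimates P(X < Y) + 0.5 * P(X == Y); ties count one half (an assumed
    convention).
    """
    total = 0.0
    for x in xs:
        for y in ys:
            if x < y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(xs) * len(ys))


# Made-up outcomes for two groups.
group_one = [1, 3]
group_two = [2, 4]
estimate = mww_u_statistic(group_one, group_two)
```

Each subject appears in several pairs, so the pairwise responses are dependent. That dependence is why, as the abstract notes, standard GEE inference does not apply directly.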

Modeling Breast Cancer Screening
Andrei Yakovlev
Department of Biostatistics and Computational Biology, University of Rochester

I will talk about mechanistic models of cancer screening and the natural history of cancer. Our approach is different from the one Dr. M. Zelen will present later. Its distinct advantage is that one can derive the joint distribution of some important clinical covariates (age, tumor size, nodal involvement) at the time of diagnosis. Using this distribution, we obtain estimates of model parameters from the data generated by the Canadian Breast Screening Studies. This approach allows us to model both cancer incidence and mortality, while other models require the incidence to be input by the investigator. The conditional survival function (given covariates) is estimated from the SEER data using an extended hazard regression model allowing for a non-zero cure rate. All this information is put together in a comprehensive simulation model to make predictions of the national trends in breast cancer incidence and mortality. When making such predictions and comparing them with actual observations, we came up with some conclusions that may have serious medical implications. The last story is probably the most interesting part of my presentation.

Inference of multiple pedigree relationships based on genotypic data
Anthony Almudevar
Department of Mathematics and Statistics, Acadia University, Wolfville, Nova Scotia

The estimation of pedigree relationships between individuals is a problem of some interest in the biological, medical and forensic sciences. Many important genetic parameters associated with a group of organisms depend directly on knowledge of their pedigree. In addition, knowledge of pedigree relationships is crucial in selective breeding or conservation programs. However, pedigrees are often unknown or suspect, and must be estimated.

Pedigree estimation can be performed using codominant genetic markers, with a statistical basis in the rules of Mendelian inheritance, from which maximum likelihood pedigree relationships may be deduced. While this procedure is commonly used for pairs or triplets of individuals, the problem of reconstructing a pedigree among numerous individuals introduces considerable computational challenges. The large number of putative pedigrees rules out an enumerative approach for all but the smallest samples (Painter 1997). One approach commonly used is to construct larger pedigrees from separate pairwise or triplet kinship estimates. However, much information can be lost in this approach (Geyer et al. 1993, Thomas & Hill 2000), hence there will be some benefit to the development of algorithms for the maximization of the pedigree likelihood function defined on all individuals simultaneously.

I will discuss a general approach to this problem, which uses a type of hybrid algorithm. A class of constraints on the admissible set of pedigrees is defined in such a way that a constrained optimization of the pedigree likelihood is computationally feasible. A simulated annealing algorithm is then used to determine the constraint yielding the global maximum likelihood pedigree.

This approach will be demonstrated for two types of problem. In the first, it is assumed that parents of all nonfounders are represented in the sample, and that the founders themselves are unrelated (a complete sample). In the second, genetically important individuals need not be in the sample (an incomplete sample). This situation arises, for example, when siblings, but possibly not parents, are present in the sample.

Bayesian Normalization and Identification for Differential Gene Expression Data
Dabao Zhang
Cornell University

A new framework for normalizing spotted microarray data and identifying differentially expressed genes is developed by using a Bayesian analysis. First, we propose a measurement-error model, which improves the usual semiparametric model for intensity-dependent normalization and takes account of the measurement errors in the observed overall intensities. Second, a Bayesian analysis of the semiparametric measurement-error model is constructed. The analysis avoids the potential risk in using the common two-step procedure for intensity-dependent normalization. We also suggest a Bayesian identification of differentially expressed genes which automatically takes into consideration the dimension of multiple tests of hypotheses by shrinking the alternative posteriors to zero. Both simulation and application to real microarray data demonstrate promising results.

Early Detection of Disease and Stochastic Models
Marvin Zelen
Department of Biostatistics
Harvard School of Public Health

The early detection of disease presents opportunities for using existing technologies to significantly improve patient benefit. The possibility of diagnosing a chronic disease early, while it is asymptomatic, may result in diagnosing the disease at an earlier stage, leading to a better prognosis. Many cancers, diabetes, tuberculosis, cardiovascular disease, HIV-related diseases, etc. may have a better prognosis when early detection is combined with an effective treatment. However, gathering scientific evidence to demonstrate benefit has proved to be difficult. Clinical trials have been arduous to carry out because of the need for large numbers of subjects, long follow-up periods and problems of non-compliance. Implementing public health early detection programs has proved to be costly and has not been based on analytic considerations. Many of these difficulties are a result of not understanding the early disease detection process and the disease natural histories. One way to approach these problems is to model the early detection process. This talk will discuss stochastic models for the early detection of disease. Breast cancer will be used to illustrate some of the ideas. The talk will discuss breast cancer randomized trials, stage shift and benefit, scheduling of examinations, the issue of screening younger women and those at elevated risk, and the planning of trials.

Getting Usable Data from Microarrays: The Role of Statisticians
Rafael A. Irizarry
Dept. of Biostatistics, Johns Hopkins University

In this talk I will give some examples of why I think it is important that statisticians be involved in preprocessing of microarray data. I will then describe a specific example related to preprocessing Affymetrix GeneChip high density oligonucleotide array raw data. High density oligonucleotide expression array technology is widely used in many areas of biomedical research for quantitative and highly parallel measurements of gene expression. Affymetrix GeneChip arrays are the most popular. In this technology each gene is typically represented by a set of 11-20 pairs of oligonucleotides separately referred to as probes. Typically 12,000 to 20,000 probe sets are arrayed on a silicon chip. RNA samples are prepared, labeled and hybridized to the arrays. Arrays are then scanned, and images produced and analyzed to obtain an intensity value for each probe. These intensities quantify the extent of the hybridization between the labeled target sample and the oligonucleotide probe. A final step to obtain expression measures is to summarize the probe intensities for a given gene in order to quantify the amount of the corresponding mRNA species in the sample. Using two extensive spike-in studies and a dilution study, we performed a careful assessment of the method of summarizing probe level data provided by the current version of the Affymetrix Microarray Suite (MAS 5.0). We found that the performance of the Affymetrix technology can be greatly improved by the use of expression measures derived from empirically motivated statistical models. The advantages of a new expression measure are assessed through bias, variance, sensitivity, and specificity. In particular, the improvements achieved by a 10-fold decrease in variability for low expression levels are demonstrated. A paper describing this example can be found on the web: http://www.biostat.jhsph.edu/~ririzarr/papers

Generalized Self-Consistency Methods in Cancer Survival
Alexander Tsodikov
Huntsman Cancer Institute, University of Utah

A unified approach is proposed for model building and construction of numerically efficient algorithms for maximum likelihood inference for a large class of semiparametric survival models. The approach is based on a generalization of the idea of self-consistency and links EM algorithms for frailty models and recently developed MM algorithms. A composition technique is developed for building hierarchical model families compatible with the algorithms. Applications of the methodology to various cancer studies are described.

Measurement Errors and Data Transformation for Gene Expression Data, Proteomics and Metabolomics Data
David M. Rocke
University of California, Davis

Gene expression microarrays comprise a suite of related technologies for measuring the expression of thousands of genes simultaneously from a single biological sample. There are also numerous other high-throughput biological assays that can measure large numbers of proteins, lipids, and other biologically active compounds. In this talk, I will describe an important statistical challenge in the use of such data. Using raw data, logarithms, or ratios, the variability of the measurements is strongly dependent on the level of expression, causing a failure of the assumptions of most standard methods of statistical analysis. We present a solution to this problem via a specially tuned data transformation and show how it promotes the effectiveness of simple and sophisticated analyses of the data.
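
As an illustration only: the abstract does not name the transformation, so the sketch below assumes the generalized logarithm (glog), a variance-stabilizing transformation of this kind. The function name `glog` and the parameters `c` (tuning constant) and `alpha` (background offset) are hypothetical choices for this example, and in practice both parameters would have to be estimated from the data.

```python
import math


def glog(y: float, c: float, alpha: float = 0.0) -> float:
    """Generalized log: ln(z + sqrt(z^2 + c)) with z = y - alpha.

    Assumed form for illustration. It behaves like ln(2z) for large z,
    is roughly linear near z = 0, and stays defined for z <= 0 when c > 0.
    """
    z = y - alpha
    return math.log(z + math.sqrt(z * z + c))


# Unlike the plain logarithm, the transform is defined at and below zero.
low_value = glog(0.0, c=4.0)    # equals ln(sqrt(c)) = ln(2)
high_value = glog(1e6, c=4.0)   # close to ln(2 * 1e6)
```

The tuning constant controls where the transformation switches from approximately linear behavior (low expression) to approximately logarithmic behavior (high expression).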

Graphical Analysis of Recurrence Data on Disease Episodes, Product Repairs, and Other Applications
Wayne Nelson
Consultant, Schenectady, New York

Most reliability and survival data analysis methods concern life data on units that fail once and thus have a life distribution. In contrast, in many applications, units experience recurrent events, which require special models and data analyses that are not well known. Examples include number or cost of recurrent disease episodes in patients, repairs of products, customer purchases on Amazon.com, and births of children to statisticians. Then one wants to estimate the population mean cumulative function (MCF) for the 1) number or 2) cost of recurrences per unit. This talk presents a simple nonparametric estimate and plot of the MCF, which is used to a) evaluate whether the population recurrence rate is increasing or decreasing as the population ages, information useful for product burn-in, overhaul, and retirement decisions, b) predict future numbers and costs of recurrences for a unit or population, c) compare data sets from different populations, e.g., different disease treatments or different product designs or production periods, d) reveal unexpected information, a great advantage of data plots.

This talk also presents approximate confidence limits for a population MCF, allowing one to assess the accuracy of an MCF estimate. Previous counting process methods for recurrent events data apply only to counts of recurrences, but the methods here also apply to costs, product downtimes, and other measures of events. The analyses are illustrated with data on auto and locomotive repairs, recurrent bladder tumors, births of children to statisticians, and other applications.
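
A minimal sketch of a nonparametric MCF estimate for the number of recurrences, under assumptions not stated in the abstract: each unit is observed from age 0 up to a known end-of-observation age, every recorded event age is no later than that unit's end age, and the increment at each event age is the number of events at that age divided by the number of units still under observation. The function name `mcf_estimate` and the input layout are hypothetical.

```python
from typing import List, Sequence, Tuple

History = Tuple[Sequence[float], float]  # (event ages, end-of-observation age)


def mcf_estimate(histories: Sequence[History]) -> List[Tuple[float, float]]:
    """Return (age, estimated MCF) pairs at each distinct event age."""
    event_ages = sorted(age for ages, _ in histories for age in ages)
    curve: List[Tuple[float, float]] = []
    cumulative = 0.0
    i = 0
    while i < len(event_ages):
        t = event_ages[i]
        # Count tied events at age t.
        j = i
        while j < len(event_ages) and event_ages[j] == t:
            j += 1
        n_events = j - i
        # Units still under observation at age t.
        at_risk = sum(1 for _, end_age in histories if end_age >= t)
        cumulative += n_events / at_risk
        curve.append((t, cumulative))
        i = j
    return curve


# Three units: two with recurrences, one observed to age 15 with none.
example = [([5.0, 12.0], 20.0), ([8.0], 10.0), ([], 15.0)]
curve = mcf_estimate(example)
```

For costs rather than counts, the same sketch would sum the costs of the events at each age instead of counting them.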


Fall 2002 Biostatistics Colloquia

Analysis of Controlled Experiments in Which the Response is a Curve
Naomi Altman
Department of Statistics, Pennsylvania State University

In this talk I discuss the use of self-modeling regression to analyze experiments in which the response is a curve. Differences among treatments and covariate effects are summarized by a parametric model, while the shape of the curve is modeled nonparametrically. Tests based on linear and nonlinear mixed models are discussed along with a simulation study of the null distribution for the test statistics. An example using data on a bird growth experiment will be presented. The experiment includes fixed and random factors, and a covariate. There are several response variables, each of which has a different growth curve - however, because the parametric part of the model has an interpretation which is free of the shape of the growth curves, comparisons among responses are simplified.

An Estimator for Treatment Comparisons among Survivors in Randomized Trials
David A. Schoenfeld
Professor of Medicine, Harvard Medical School and Professor in the Department of Biostatistics, Harvard School of Public Health

This work is joint with Douglas Hayden and Donna Pauler.
Abstract: In clinical trials of advanced-stage disease it is often of interest to perform treatment comparisons in the subgroup of survivors. For example, in ventilation studies a primary endpoint is time on ventilation, which is only of interest in survivors. In health-related quality of life (QOL) studies, an endpoint of interest that is secondary to the primary endpoint of survival is change in QOL over the observation period. Randomized treatment comparisons for these endpoints cannot be performed since the outcomes are only observable in the non-randomly selected subgroup of survivors. In cancer studies, duration of response to therapy has the same problem; see Schroder and Schumacher (1997) and Morgan (1988). Following Rubin (1998, 2000), we propose evaluation of the Survivor Average Causal Effect (SACE) for treatment evaluations for endpoints censored by death. We provide an estimator of SACE in the presence of no unmeasured confounders, a nontestable assumption which identifies SACE, and outline a sensitivity analysis for exploring robustness of conclusions to deviations from this assumption. We apply the method to three applications, duration of ventilation from a clinical trial of Acute Respiratory Distress Syndrome (ARDS), and QOL for patients treated for advanced-stage colorectal cancer in a clinical trial of several chemotherapeutic regimes performed by the Southwest Oncology Group.

Remeasurement and Corrected-Score Methods for Statistical Inference in the Presence of Measurement Error
Leonard A. Stefanski
Department of Statistics, North Carolina State University

Abstract: This talk will start with an introduction and overview of inference problems in the presence of measurement error. Two general approaches for tackling measurement error problems, remeasurement methods and corrected score methods, will be described and a connection between the two methods will be examined. The latter part of the talk will focus on some recent results on corrected scores for the case of replicate measurements and heteroscedastic measurement errors.

Nonparametric Inference Under Constraints
Peter Hall
Australian National University

The greater part of contemporary nonparametric inference employs methods that are linear in the data. The exceptions to this rule generally involve estimators with empirically chosen tuning parameters; examples of those parameters include the bandwidth in kernel-type estimation, and the threshold in wavelet methods. Nevertheless, the estimator is still "intrinsically" linear, not least because its first-order theoretical properties are equivalent to those of a linear estimator. However, estimators are often nonlinear in a substantial way if constraints are imposed; examples of constraints include those based on order, such as monotonicity or unimodality of a regression estimator or a density estimator. The talk wil