Fall 2008 Biostatistics Colloquia
Nonparametric Variance Estimation for Systematic Samples
Jean Opsomer, PhD
Colorado State University
Systematic sampling is a frequently used sampling method in natural resource surveys, because of its ease of implementation and its design efficiency. An important drawback of systematic sampling, however, is that no direct estimator of the design variance is available. We describe a new estimator of the model-based expectation of the design variance, under a nonparametric model for the population. The nonparametric model is sufficiently flexible that it can be expected to hold at least approximately for many practical situations. We prove the consistency of the estimator for both the anticipated variance and the design variance under the nonparametric model. The approach is used on a forest survey dataset, on which we compare a number of design-based and model-based variance estimators.
Thursday, November 20, 2008
3:30 PM
K-207 (Room 2-6408) Medical Center
Bayesian Inference for High Dimensional Functional and Image Data using Functional Mixed Models
Jeffrey S. Morris, PhD
Department of Biostatistics
The University of Texas MD Anderson Cancer Center
High dimensional, irregular functional data are increasingly encountered in scientific research. For example, MALDI-MS yields proteomics data consisting of one-dimensional spectra with many peaks, array CGH or SNP chip arrays yield one-dimensional functions of copy number information along the genome, 2D gel electrophoresis and LC-MS yield two-dimensional images with spots that correspond to peptides present in the sample, and fMRI yields four-dimensional data consisting of three-dimensional brain images observed over a sequence of time points on a fine grid. In this talk, I will discuss how to identify regions of the functions/images that are related to factors of interest using Bayesian wavelet-based functional mixed models. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while borrowing strength between observations in all dimensions. I will demonstrate how to identify regions of the functions that are significantly associated with factors of interest, in a way that takes both statistical and practical significance into account and controls the Bayesian false discovery rate to a pre-specified level. I will also discuss how to extend this framework to include functional predictors with coefficient surfaces. These methods will be applied to a series of functional data sets.
Thursday, November 13, 2008
3:30 PM
Adolph Auditorium (Room 1-7619) Medical Center
Semiparametric Analysis of Recurrent and Terminal Event Data
Douglas E. Schaubel, PhD
Department of Biostatistics, University of Michigan
In clinical and observational studies, the event of interest is often one which can occur multiple times for the same subject (i.e., a recurrent event). Moreover, there may be a terminal event (e.g. death) which stops the recurrent event process and, typically, is strongly correlated with the recurrent event process. We consider the recurrent/terminal event setting and model the dependence through a shared gamma frailty that is included in both the recurrent event rate and terminal event hazard functions. Conditional on the frailty, a model is specified only for the marginal recurrent event process, hence avoiding the strong Poisson-type assumptions traditionally used. Analysis is based on estimating functions that allow for estimation of covariate effects on the marginal recurrent event rate and terminal event hazard. The method also permits estimation of the degree of association between the two processes. Closed-form asymptotic variance estimators are proposed. The proposed methods are evaluated through simulations to assess the applicability of the asymptotic results in finite samples, and to evaluate the sensitivity of the method to departures from its underlying assumptions. The methods are illustrated in an analysis of hospitalization data for patients in an international multi-center study of outcomes among peritoneal dialysis patients. This is joint work with Yining Ye and Jack Kalbfleisch.
Thursday, November 6, 2008
3:30 PM
K-207 (Room 2-6408) Medical Center
Spatio-temporal Analysis via Generalized Additive Models
Kung-Sik Chan, PhD
The University of Iowa
The Generalized Additive Model (GAM) has been widely used in practice. However, the GAM assumes iid errors, which invalidates its use for many spatio-temporal data sets. For the latter kind of data, the Generalized Additive Mixed Model (GAMM) may be more appropriate. While there exist several approaches for estimating a GAMM, these approaches tend to be numerically unstable or computer-intensive.
In this talk, I will discuss some recent joint work with Xiangming Fang. We develop an iterative algorithm for Penalized Maximum Likelihood (PML) and Restricted Penalized Maximum Likelihood (REML) estimation of a GAM with correlated errors. Although the new approach does not assume any specific correlation structure, the Matérn spatial correlation model is of particular interest, as motivated by our biological applications. As some of the Matérn parameters are not consistently estimable under the fixed domain asymptotics, situations for the spatio-temporal case are investigated, where the spatial design is assumed to be fixed with temporally independent repeated measurements and the spatial correlation structure does not change over time. Our theoretical investigation exploits the fact that penalized likelihood estimation can be given a Bayesian interpretation. The conditions under which the asymptotic posterior normality holds are discussed. We also develop a model diagnosis method for checking the assumption of independence across time for spatio-temporal data. In practice, selecting the best model is often of interest. A model selection criterion based on the Bayesian framework is proposed to compare different candidate models. The proposed methods are illustrated by simulation and a fisheries application.
Thursday, October 23, 2008
3:30 PM
Adolph Auditorium (Room 1-7619) Medical Center
Challenges in Joint Modeling of Longitudinal and Survival Data
Jane-Ling Wang, PhD
Department of Statistics
University of California at Davis
It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches to marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties.
We will discuss the challenges in this area and provide several solutions. One of the difficulties is with the likelihood approaches when the survival component is modeled semiparametrically as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches.
Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.
This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).
Yakovlev Colloquium*: Detecting Disparities in Long-term Cancer Survivals: Challenges and Possible Solutions
Yi Li, PhD
Department of Biostatistics
Harvard University
Dana-Farber Cancer Institute
This talk deals with long-term disease-specific survival among the prostate cancer patients in the NIH Surveillance, Epidemiology, and End Results (SEER) program, wherein the main endpoint (e.g. deaths from prostate cancer) and the censoring causes (e.g. deaths from heart diseases) may be dependent. While a number of authors have studied the mixture survival model to analyze survival data with non-negligible long-term survival fractions, none has studied the mixture model in the presence of dependent censoring. To account for such dependence, we propose a more general long-term survival model that allows for dependent censoring. We derive the models from the perspective of competing risks and model the dependence between the censoring time and the survival time using a class of Archimedean copula models. Within this framework, we consider the parameter estimation, the long-term survival detection, and the two-sample comparison of latency distributions in the presence of dependent censoring when a proportion of patients is deemed to be long-term survivors. Large sample results using the martingale theory are obtained. We examine the finite sample performance of the proposed methods via simulation and apply them to analyze the SEER prostate cancer data.
Thursday, September 18, 2008
3:30 PM
K-207 (Room 2-6408) Medical Center
*To honor Dr. Andrei Yakovlev’s major contributions to the department, our first colloquium each academic year will be dedicated to his memory.
Spring 2008 Biostatistics Colloquia
Discovery of Latent Patterns in Disability Data and the Issue of Model Choice
Tanzy Mae Love, PhD
Department of Statistics
Carnegie Mellon University
Model choice is a major methodological issue in the explosive growth of data-mining models involving latent structure for clustering and classification. Here, we work from a general formulation of hierarchical Bayesian mixed-membership models and present several model specifications and variations, both parametric and nonparametric, in the context of learning the number of latent groups and associated patterns for clustering units. We elucidate strategies for comparing models and specifications by producing novel analyses of the following data set: data on functionally disabled American seniors from the National Long Term Care Survey.
Thursday, April 24, 2008
3:30 PM
Upper Auditorium (Room 3-7619) Medical Center
Funding Opportunities at the National Science Foundation
Grace Yang, PhD
Program Director, Statistics & Probability
National Science Foundation
Division of Mathematical Sciences
Thursday, April 17, 2008
3:30 PM
K-307 (Room 3-6408) Medical Center
Multiple imputation methods in application of a random slope coefficient linear model to randomized clinical trial data
Moonseong Heo, PhD
Department of Psychiatry
Weill Medical College of Cornell University
Two types of multiple imputation methods, proper and improper, for imputing missing not at random (MNAR) continuous data are considered in the context of attrition problems arising from antidepressant clinical trials, whose primary interest is to compare treatment effects on the declines in depressive symptoms over the study period. Both methods borrow information from completers' data to construct pseudo donor sampling distributions from which imputed values are drawn, but differ in characterizing those distributions. A joint likelihood of each method is constructed based on a selection model for missing data. Their performance was evaluated based on maximum likelihood estimates of a random slope coefficient model that fits the imputed data to test the treatment effect via modeling interaction between the treatment and the slope of depressive symptom decline. The following performance evaluation criteria were considered: bias, statistical power, root mean square error, coverage probability of the 95% confidence interval (CI), and width of the CI. The two methods are compared with other analytic strategies for incomplete data: completers-only data analysis, available observations analysis, and last observation carried forward (LOCF) analysis. A simulation study showed that the two multiple imputation methods have favorable results in bias, statistical power, and width of the 95% CI, whereas the available observations analysis showed favorable results in bias, root mean square error, and coverage rate. Completers-only analysis showed better results than the LOCF analysis. Those findings guided interpretation of results from an antidepressant trial for geriatric depression. Finally, a comparison with a sequential hot deck multiple imputation method in application to analysis with missing binary outcome from a recently completed antipsychotic trial will be discussed.
Wednesday, April 9, 2008
3:45 PM
K-307 (Room 3-6408) Medical Center
Improved Measurement Modeling and Regression with Latent Variables
Karen Bandeen-Roche, PhD
Professor of Biostatistics and Medicine
Johns Hopkins Bloomberg School of Public Health
Latent variable models have long been utilized by behavioral scientists to summarize constructs that are represented by multiple measured variables or are difficult to measure, such as health practices and psychiatric syndromes. They have been regarded as particularly useful when variables that can be measured are highly imperfect surrogates for the construct of inferential interest, but they are also criticized as being overly abstract, weakly estimable, computationally intensive and sensitive to unverifiable modeling assumptions. My talk describes two lines of research to improve the utility of latent variable modeling, counterbalancing strengths and weaknesses. First, it reviews methods I have developed for assessing modeling assumptions and delineating the targets of parameter estimation in the case of maximum likelihood fitting, allowing for a mis-specified model. Then, it describes new strategies for developing measurement models for subsequent use in developing regression outcomes. One affords approximately unbiased estimation vis-à-vis full latent variable regression. A second counterbalances standard latent variable modeling assumptions (focused on internal validity of measurement) with alternative assumptions (say, focused on external or concurrent validation). Small sample performance properties are evaluated. The methods will be illustrated using data on post-traumatic stress disorder in a population-based sample and aging and adverse health in older adults. It is hoped that the findings will lead to improved usage of latent variable models in scientific investigations.
Thursday, April 3, 2008
3:30 PM
Class of 62 Auditorium (Room G-9425) Medical Center
Branching Processes as Models of Progenitor Cell Populations and Estimation of the Offspring Distributions
In memory of Andrei Yakovlev
Nikolay Yanev, PhD
Professor and Chair
Dept of Probability and Statistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences
This paper considers two new models of reducible age-dependent branching processes with emigration in conjunction with estimation problems arising in cell biology. Methods of statistical inference are developed using the relevant embedded discrete branching structure. Based on observations of the branching process with emigration, estimators of the offspring probabilities are proposed for the hidden unobservable process without emigration, the latter being of prime interest to investigators. The problem under consideration is motivated by experimental data generated by time-lapse video-recording of cultured cells that provides abundant information on their individual evolutions and thus on the basic parameters of their life cycle in tissue culture. Some parameters, such as the mean and variance of the mitotic cycle time, can be estimated nonparametrically without resorting to any mathematical model of cell population kinetics. For other parameters, such as the offspring distribution, a model-based inference is needed. Age-dependent branching processes have proven to be useful models for that purpose. A special feature of the data generated by time-lapse experiments is the presence of censoring effects due to migration of cells out of the field of observation. For the time-to-event observations, such as the mitotic cycle time, the effects of data censoring can be accounted for by standard methods of survival analysis. No methods are available to accommodate such effects in the statistical inference on the offspring distribution. Within the framework of branching processes, the loss of cells to follow-up can be modeled as a process of emigration. Incorporating the emigration process into a pertinent branching model of cell evolution provides the basis for the proposed estimation techniques. The statistical inference on the offspring distribution is illustrated with an application to the development of oligodendrocytes in cell culture.
This talk is based on joint work with Drs. A. Yakovlev and V. Stoimenova.
Thursday, March 27, 2008
3:30 PM
Upper Auditorium (Room 3-7619) Medical Center
Challenges in Joint Modeling of Longitudinal and Survival Data
Jane-Ling Wang, PhD
Professor
University of California at Davis
It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches to marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties. We will discuss the challenges in this area and provide several solutions. One of the difficulties is with the likelihood approaches when the survival component is modeled semiparametrically as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches. Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.
This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).
Thursday, March 6, 2008
3:30 PM
Upper Auditorium (Medical Center, Room 3-7619)
Fall 2007 Biostatistics Colloquia
General Transformation Models for Joint Analysis of Recurrent Events and Terminal Event
Donglin Zeng, PhD
Associate Professor
University of North Carolina, Chapel Hill
We propose a class of transformation models with random effects for jointly modeling recurrent events and a terminal event. The class of transformation models includes both the proportional hazards model and the proportional odds model as special cases. The nonparametric maximum likelihood estimation method is used to derive the estimators, which are then shown to be consistent, asymptotically normal and asymptotically efficient. A simple algorithm is proposed to calculate the estimators. Simulation studies are conducted to examine the small-sample performance of the proposed method. The method is further applied to a real data set.
Friday, December 7, 2007
1:30 PM
Biostatistics Conference Room (MRBX G-11213)
Sequential evaluation of measurement error in a reliability study
Aiyi Liu, PhD
Senior Investigator
National Institute of Child Health & Human Development
We introduce sequential testing procedures for the planning and analysis of reliability studies to assess the measurement error in measuring the level of a biomarker. The designs allow repeated evaluation of reliability of the measurements and stop testing if early evidence shows the measurement error to be within the level of tolerance. Methods are developed and critical values tabulated for a number of two-stage designs. The methods are illustrated with an example evaluating the reliability of an oxidative stress biomarker.
Thursday, November 15, 2007
3:30 PM
Room 1-7619 (Adolph Auditorium) Medical Center
Resampling-based Multiple Testing Methods with Covariate Adjustment: Application to Investigation of Antiretroviral Drug Susceptibility
Victor DeGruttola, ScD
Professor of Biostatistics
Harvard School of Public Health
Identification of patterns of genetic mutations that are associated with clinical resistance to specific antiretroviral drugs in HIV-infected patients requires adjustment for potential confounders, such as the number of active drugs in a patient's regimen other than the one of interest. A variety of methods (e.g. regression trees, neural networks, support vector regression, least squares regression, least angle regression) are available for fitting high dimensional models, which are especially useful for prediction. Our goal focuses on the discovery of important patterns of mutations associated with resistance to a specific drug, after robust adjustment for the impact of covariates. Motivated by this problem, we investigated resampling-based methods to test equal mean response across multiple groups defined by HIV genotype, after adjustment for covariates. We consider construction of test statistics and their null distributions under two types of model: parametric and semiparametric. The covariate function (e.g., linear or quadratic) is explicitly specified in the parametric but not in the semiparametric approach. The parametric approach is more precise when models are correctly specified, but suffers from bias when they are not; the semiparametric approach is more robust to model misspecification, but may be less efficient. To help preserve Type I error while also improving power in both approaches, we propose resampling approaches based on matching of observations with similar covariate values. Matching reduces the impact of model misspecification as well as imprecision in estimation. These methods are evaluated via simulation studies and applied to a data set that combines results from a variety of clinical studies of salvage regimens. Our focus is on relating HIV genotype to virological response to abacavir after adjustment for the number of active antiretroviral drugs (excluding abacavir) in the patient's regimen.
Illustrative data were provided by the Forum for HIV Collaborative Research, which collected baseline genotype, treatment history, and virological response on over 1300 patients from a range of clinical research studies in North America and Europe. These methods are extended to consider the identification of single nucleotide polymorphisms (SNPs) associated with toxicities related to antiretroviral drugs; an additional challenge in this research arises from the fact that the genotype is unphased.
Thursday, October 25, 2007
3:30 PM
Room 2-6408 (K-207) Medical Center
Variance Estimators of Cross-Validation Estimators of the Generalization Error
Prof. Marianthi Markatou
Department of Biostatistics
Columbia University
We bring together methods from two different disciplines, machine learning and statistics, in order to address the problem of estimating the variance of cross-validation estimators of the generalization error. Specifically, we approach the problem of variance estimation of the CV estimators of the generalization error of computer algorithms as a problem in approximating the moments of a statistic. The approximation illustrates the role of training and test sets in the performance of the algorithm. It provides a unifying approach to evaluation of various methods used in obtaining training and test sets and it takes into account the variability due to different training and test sets. For the simple problem of predicting the sample mean and in the case of smooth loss functions, we show that the variance of the CV estimator of the generalization error is a function of the moments of the random variables Y, Z, where Y denotes the cardinality of the intersection of two different training sets and Z denotes the cardinality of the intersection of two different test sets. We prove that the distribution of these two random variables is hypergeometric and we compare our estimator with the estimator proposed by Nadeau and Bengio (2003). We extend these results to the regression case and the case of absolute error loss, and indicate how the methods can be extended to the classification case and the general case of kernel regression.
Thursday, October 11, 2007
3:30 PM
Room 1-7619 (Adolph Auditorium) Medical Center
Capturing Heterogeneity and Dependence in Gene Expression Studies by Surrogate Variable Analysis
John Storey
University of Washington
School of Public Health and Community Medicine
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have widespread effects on gene expression levels. These factors are often unmeasured or unmodeled in the significance analysis of an expression study. We show that this "expression variation heterogeneity" can have a profound impact on the statistical and biological results obtained from nearly every microarray study. We propose surrogate variable analysis (SVA) to reduce the effect of expression heterogeneity in microarray studies, both by removing confounding of signal and by eliminating dependence across genes. We discuss connections between SVA and factor analysis, compare SVA with other methods for addressing dependence in multiple testing, and apply SVA to both simulated and experimental data.
Thursday, Sept 27, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Quantification of Protein Lysate Arrays: A Nonparametric Approach
Prof. Xuming He
Department of Statistics
University of Illinois at Urbana-Champaign
The reverse-phase protein lysate array is an emerging technology that allows us to quantify the relative expression levels of a protein in many different cellular samples. At this moment, the applications of protein lysate arrays are still exploratory, with a lack of reliable analysis tools for quantifying the information from protein arrays. In this talk, we show that a nonparametric protein expression curve often provides a better fit to the data from the dilution series, whereas rigid parametric models such as the commonly used logistic curves are prone to bias. The problem of quantifying protein expression levels demands serious statistical work, and this talk serves as an introduction. In addition, I will discuss some interesting research problems in statistics that are motivated by our work on protein lysate array data. Part of the talk is based on joint work with colleagues at the M.D. Anderson Cancer Center.
Thursday, Sept 6, 2007
3:30 PM
Room 2-6408 (K-207) Medical Center
Spring 2007 Biostatistics Colloquia
Global Influenza Surveillance and Bioinformatics in Genome and Epidemiological Studies
Prof. Oleg I. Kiselev
State Institute of Influenza
Russian Academy of Medical Sciences
St. Petersburg, Russia
Influenza type A viruses are a leading cause of mass illness and high mortality during pandemics. The United States and the other G8 countries have decided to strengthen their efforts in preparing for an influenza pandemic. In implementing the National Pandemic Plan, the first priority is prediction of the epidemiological situation at the local level and of the genetic properties of potential pandemic strains. Bioinformatics should have a leading role in this direction. In the spring of 1997, an outbreak caused by a highly pathogenic influenza virus was registered in Hong Kong. This outbreak was very unusual in comparison with seasonal flu because of a very high mortality rate among all age groups of patients. The virus was isolated and investigated at the CDC and other laboratories. As a result of these studies, molecular signs of pathogenicity were recognized in the hemagglutinin and NS1 genes. Due to the strong induction of cytokine gene expression, the virus caused systemic organ failure and lung edema as a fatal complication. Based on genetic evidence and fine mapping of the viral genome, vaccines and diagnostics were designed and produced. A growing body of sequence data creates a strong demand for bioinformatics support of molecular biology work. The current epidemiological situation in Indonesia and other Eastern countries is getting worse. The H5N1 virus is spreading in many countries along the flyways of waterfowl. In many countries the epidemiological situation should be characterized as a stable endemic one. This means that the virus is in the latent phase in animals and can be activated by unknown factors and cause epidemics. In my presentation, the system of influenza surveillance and control within the WHO Global Influenza network will be discussed. The importance of developing new global bioinformatics approaches and software for a genetic and epidemiological influenza surveillance system will be argued, and such approaches will be proposed.
Examples of new Russian developments in this field will be presented and reviewed.
Monday, June 11, 2007
2:00 PM
Room 2-6408 (K-207) Medical Center
Robust Methods for Personalized Prediction of Clinical Outcomes
Tianxi Cai
Department of Biostatistics
Harvard School of Public Health
Continuing technological advancements allow researchers and clinicians to measure an increasingly vast diversity of clinical and biological markers, rapidly increasing our understanding of disease processes. The wide range of newly available markers holds great potential for the personalization of medical care through accurate prediction of outcomes in individual patients. Traditional statistical methods for using a patient's marker values to make personalized predictions are derived under a strong assumption that the true model relating markers to the response can be identified, at least with a large enough sample. In practice, however, it is difficult if not impossible even to locate a class of models containing the truth. In this talk, I will discuss various methods for construction, evaluation and comparison of prediction rules without having to assume that the fitted regression models are correct. These methods will be illustrated using datasets from an AIDS clinical trial and a breast cancer gene expression study.
Thursday, May 24, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
High-Dimensional Statistical Models in Genomics: UIP, MCP and CSI in Perspectives
Pranab K. Sen, Ph.D.
University of North Carolina, Chapel Hill
The ongoing genomics evolution has posed some challenging statistical problems. Most statistical models arising in bioinformatics, data mining and a variety of other computer-intensive interdisciplinary research fields are complex in their design, sampling plan and associated probability law. The curse of dimensionality is so overwhelming that conventional likelihood ratio based statistical inference may not be useful. On top of that, such models are typically constrained by inequality, order, functional, shape or other restraints. Use of variants of the likelihood ratio has also encountered similar impasses. S. N. Roy's (1953) ingenious union-intersection principle, along with high-dimensional multivariate analysis, provides an alternative avenue having some computational advantages, increased scope of application, and formulations that go beyond parametrics. This scenario is illustrated with some microarray data and SNP models.
Thursday, May 10, 2007 at 3:30 p.m.
Room 3-7619 Medical Center (Upper Auditorium)
***Cancelled***
Capturing Heterogeneity and Dependence in Gene Expression Studies by Surrogate Variable Analysis
John Storey
University of Washington
School of Public Health and Community Medicine
It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have widespread effects on gene expression levels. These factors are often unmeasured or unmodeled in the significance analysis of an expression study. We show that this "expression variation heterogeneity" can have a profound impact on the statistical and biological results obtained from nearly every microarray study. We propose surrogate variable analysis (SVA) to reduce the effect of expression heterogeneity in microarray studies, both by removing confounding of signal and by eliminating dependence across genes. We discuss connections between SVA and factor analysis, compare SVA with other methods for addressing dependence in multiple testing, and apply SVA to both simulated and experimental data.
Thursday, April 19, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
PLASQ: A Generalized Linear Model-Based Procedure to Determine Allelic Dosage in Cancer Cells from SNP Array Data
Thomas Laframboise,
David Harrington,
Barbara A. Weir
Dana-Farber Cancer Institute
Human cancer is largely driven by the acquisition of mutations. One class of such mutations is copy number polymorphisms, comprised of deviations from the normal diploid two copies of each autosomal chromosome per cell. We describe a probe-level allele-specific quantitation (PLASQ) procedure to determine copy number contributions from each of the parental chromosomes in cancer cells from SNP microarray data. Our approach is based upon a generalized linear model that takes advantage of a novel classification of probes on the array. As a result of this classification, we are able to fit the model to the data using an expectation-maximization algorithm designed for the purpose. We demonstrate a strong model fit to data from a variety of cell types. In normal diploid samples, PLASQ is able to genotype with very high accuracy. Moreover, we are able to provide a generalized genotype in cancer samples (e.g. CCCCT at an amplified SNP). Our approach is illustrated on a variety of lung cancer cell lines and tumors, and a number of events are validated by independent computational and experimental means. An R software package containing the methods is freely available.
Thursday, April 5, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Data Monitoring In Clinical Trials: Experiences of a Biostatistician
Robert F. Woolson, Ph.D.
Professor,
Department of Biostatistics, Bioinformatics & Epidemiology
Medical University of South Carolina
&
Professor Emeritus,
Department of Biostatistics,
Department of Statistics and Actuarial Sciences
University of Iowa
Data Safety and Monitoring Boards (DSMBs) are typically responsible for reviewing safety and efficacy data during the conduct of Phase III randomized clinical trials. These groups are charged with reviewing accumulating evidence to see whether there is sufficient evidence to conclude a trial on the basis of benefit, lack of benefit, logistical problems in the study’s conduct, or undue harm to study participants. Biostatisticians generally play an important role, either as a member of a DSMB or as a liaison between the trial and the external DSMB. In this applied talk, I shall discuss some general issues, personal experiences and challenges associated with DSMB activities.
Thursday, March 29, 2007 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Fall 2006 Biostatistics Colloquia
Frequentist and Bayesian Approaches to High-Dimensional Testing
Dan Spitzner, Ph.D.
Department of Statistics
Virginia Tech
Functional data arise from samples of digitized or otherwise densely-measured random functions. They are intrinsically high-dimensional, and this poses a challenge to classical hypothesis testing by exacerbating difficulties in discerning “significant” large-scale attributes from spurious noise. A common resolution is to apply smooth goodness-of-fit tests using an adaptive mechanism to truncate the dimensionality of the data. In this talk, it will be discussed how such procedures disproportionately balance, in a desirable way, emphasis between large-scale and noise-like data attributes. This will motivate an investigation into tests that taper (rather than truncate) dimensionality through a test statistic given as a weighted quadratic form. Main results concerning theoretical performance and near-optimal settings will be discussed within the context of “rates of testing” theory. Parallel results will then be discussed from a Bayesian viewpoint, in which the tapering concept is particularly appropriate.
Thursday, November 30, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Genetic Studies for Ordinal Traits
Heping Zhang, Ph.D.
Department of Epidemiology and Public Health
Yale University School of Medicine
For complex diseases, especially mental health conditions including
nicotine dependence and substance use, the outcome variables are often
recorded on an ordinal rather than a quantitative scale. The naturally
recorded ordinal traits are usually analyzed either as quantitative
traits or after being dichotomized. It has been demonstrated repeatedly in
recent studies that this commonly used approach to dealing with ordinal
traits is inadequate and results in loss of power. After discussing
general principles and an overview of related work, I will present score
test statistics that belong to a general class of family-based
association tests (FBATs) for ordinal traits. This new approach can
adjust for the effects of covariates. Simulation results will be presented
to compare the type I error and power of our proposed tests with existing
tests. The empirical result suggests that our test produces reasonable
type I errors and has better power than the existing tests. The proposed
test was used to analyze GAW14 data on alcoholism and identified
several single nucleotide polymorphisms including rs485874, rs619,
rs718251, rs1869907 that are significantly associated with alcohol
dependence after adjusting for gender and age.
This is joint work with Rui Feng, Xueqin Wang, Hongtu Zhu,
and Yuanqing Ye.
Thursday, November 16, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Genomic aberration analysis of tumor samples using SNP microarrays
Cheng Li, Ph.D.
Department of Biostatistics, Harvard School of Public Health
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
Loss of heterozygosity (LOH) and copy number changes of chromosomal regions bearing tumor suppressor genes or oncogenes are key events in the evolution of epithelial and mesenchymal tumors. Identification of LOH regions usually relies on genotyping tumor and counterpart normal DNA and noting regions where heterozygous alleles in the normal DNA become homozygous in the tumor. However, paired normal samples for tumors and cell lines are often not available. With the advent of oligonucleotide arrays that simultaneously assay thousands of single-nucleotide polymorphism (SNP) markers, genotyping can now be done at high enough resolution to allow identification of LOH events by the absence of heterozygous loci, without comparison to normal controls. Here we describe a hidden Markov model-based method to identify LOH from unpaired tumor samples, taking into account SNP intermarker distances, SNP-specific heterozygosity rates, and the haplotype structure of the human genome. In addition, copy number analysis incorporating LOH will be discussed.
Thursday, November 2, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
From Data to Differential Equations
James Ramsay, Ph.D.
Department of Psychology
McGill University
Differential equations are the natural way to model systems with functional inputs and functional outputs. They allow us to study the system’s dynamics in the sense of explicitly modelling how the output changes in response to sudden changes in input. For example, engineers developing control systems for industrial processes routinely use differential equations as modelling tools. A new method is described for going directly from noisy discrete data, not necessarily sampled at equally spaced times, to a system of differential equations of arbitrary orders, linear or nonlinear, that describes the data. The method involves a generalization of nonparametric curve estimation in which the penalty functional rather than the smoothing functions is estimated. Examples are drawn from biology, chemical engineering and medicine.
Thursday, October 19, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Stratification on Post-Treatment Variables in Causal Inference: A Potential Outcomes Approach to Developmental Toxicity Analyses
Michael Elliott, Ph.D.
Department of Biostatistics
University of Michigan School of Public Health
In investigating the causal effect of a toxin in fetal toxicology studies in a counterfactual framework, we want to restrict consideration of the effect of dose on birthweight or malformation status to the subset of fetuses that would be born alive under the set of doses in question. Additionally, a toxin may affect both birthweight and malformation status of fetuses (Sammel et al. 1997, Dunson et al. 2003), so that the direct effect of a toxin on birthweight may be confounded by the effect of the toxin on the number of fetuses that implant and are carried to term, since the resources available to the fetus may be different under different doses of toxins. Use of a principal stratum model (Frangakis et al. 2004) that considers the survival status of fetuses under different doses of a toxin can account for both of these forms of selection that may result from utilizing the observed data rather than the “complete” (counterfactual) data. This model also addresses issues of incorporating post-randomization observations in a principal stratum framework.
Thursday, October 5, 2006 at 3:30 p.m.
Room 2-6408 (K-207) Medical Center
Spring 2006 Biostatistics Colloquia
Estimating Mean Response as a Function of Treatment Duration in an Observational Study, Where Duration may be Informatively Censored
Butch Tsiatis, Ph.D.
Department of Statistics
North Carolina State University
In a recent clinical trial "ESPRIT" of patients with coronary heart
disease who were scheduled to undergo percutaneous coronary
intervention (PCI), patients randomized to receive Integrilin therapy
had significantly better outcomes than patients randomized to
placebo. The protocol recommended that Integrilin be given as a
continuous infusion for 18-24 hours. There was debate among the
clinicians on the optimal infusion duration in this 18-24 hour range,
and we were asked to study this question statistically. Two issues
complicated this analysis: (i) The choice of treatment duration was
left to the discretion of the physician and (ii) treatment duration
would have to be terminated (censored) if the patient experienced
serious complications during the infusion period. To formalize the
question, "What is the optimal infusion duration?" in terms of a
statistical model, we developed a framework where the problem was cast
using ideas developed for adaptive treatment strategies in causal
inference. The problem is defined through parameters of the
distribution of (unobserved) potential outcomes. We then show how,
under some reasonable assumptions, these parameters could be
estimated. The methods are illustrated using the data from the ESPRIT
trial.
Up and Down Designs for Dose-Finding Trials
Nancy Flournoy, Ph.D.
Department of Statistics
University of Missouri-Columbia
In this talk we review nonparametric treatment allocation procedures for dose-finding trials, focusing on more recent results. We cover (1) the situation in acute toxicity studies in which the toxicity rate is assumed to increase with dose, (2) the situation in which both toxicity and efficacy are considered jointly with the goal being to identify the dose that maximizes P{efficacy and no toxicity}, and (3) how these procedures can be used to approximate optimal designs, which cannot be directly implemented because in these settings the response functions are nonlinear and hence optimal designs are functions of the unknown parameters. Included are group up-and-down designs with and without randomization and their extension, the “zoom-in designs”, optimizing up-and-down designs and balancing up-and-down designs. Where possible we provide theoretical results that aid the comparisons of these designs.
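As a rough illustration of the allocation idea, the following is a minimal sketch of the classical one-step up-and-down rule (move down one dose level after a toxicity, up one level otherwise), not of the group, zoom-in, optimizing or balancing designs covered in the talk. The toxicity probabilities, starting level, sample size and seed are hypothetical values chosen only for illustration.

```python
import random

def up_and_down(tox_probs, n_subjects, start=0, seed=0):
    """Simulate the classical up-and-down rule on a grid of dose levels.

    tox_probs[k] is the (assumed) probability of toxicity at dose level k.
    Returns the list of dose levels assigned to successive subjects.
    """
    rng = random.Random(seed)
    level = start
    path = []
    for _ in range(n_subjects):
        path.append(level)
        toxic = rng.random() < tox_probs[level]
        if toxic:
            level = max(level - 1, 0)                   # step down, stay on the grid
        else:
            level = min(level + 1, len(tox_probs) - 1)  # step up, stay on the grid
    return path

# Hypothetical toxicity rates, increasing with dose.
tox_probs = [0.05, 0.15, 0.30, 0.50, 0.70, 0.90]
path = up_and_down(tox_probs, n_subjects=40)
counts = [path.count(k) for k in range(len(tox_probs))]
print("subjects per dose level:", counts)
```

Under this simple rule the assignments tend to cluster around the dose whose toxicity rate is near 0.5; targeting other toxicity rates requires modified rules such as the group or randomized variants mentioned in the abstract.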
Rehabilitating LAD Regression: Breakdown, Smoothing, and Robustness
Jeffrey S. Simonoff, Ph.D.
Leonard N. Stern School of Business
New York University
The most common estimation method for regression models is, of course, least squares,
which minimizes the sum of squared deviations from the regression surface. It is
well-known, however, that least squares regression is highly nonrobust, being
sensitive to unusual values in both the response and predictor spaces. An alternative
approach is least absolute deviation (LAD) regression, which minimizes the sum of
absolute deviations. It is known that LAD regression is more robust than least
squares in the presence of outliers in the response variable, but it has not gained
favor in the robustness literature because of its sensitivity to unusual values in
the predictors (leverage points). In this talk we describe recent research using
mixed integer programming designed to evaluate and improve the robustness of LAD
regression, through determination of the finite sample breakdown point. We show how
recent research on the breakdown point for LAD regression can be adapted to
nonparametric (local linear) regression, providing the first quantification of
robustness for any nonparametric regression estimator. We show how knowledge of the
breakdown point implies good properties of the quadratic (Epanechnikov) kernel for
local linear LAD regression, and describe how post-smoothing can result in a
more appealing regression curve. We build on these results by demonstrating that the
introduction of nonuniform weights can improve the robustness of parametric (linear)
LAD regression, and develop an algorithm for choosing those weights with the goal of
increasing the breakdown point of the method by downweighting leverage points. We
generalize these weights using an easily implemented robustification of Mahalanobis
distance. We derive the asymptotic properties of the weighted LAD estimator, and use
Monte Carlo simulation and application to real examples to illustrate its effectiveness.
This is joint work with Avi Giloni and Baskar Sengupta.
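The contrast between least squares and LAD under a response outlier can be sketched on a toy data set. This is only an illustration with made-up numbers: it fits a single-predictor line, finds the LAD fit by brute force over pairs of points (some optimal LAD line always passes through two data points), and does not implement the mixed integer programming, breakdown point, or weighting methods described in the talk.

```python
from itertools import combinations

def ols_line(xs, ys):
    """Least squares fit of y = a + b*x using the closed-form formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def lad_line(xs, ys):
    """LAD fit of y = a + b*x by brute force over pairs of data points.

    Only practical for tiny data sets; real implementations use linear
    programming or specialized algorithms.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a valid regression line
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        cost = sum(abs(y - a - b * x) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]

# Toy data: y = 1 + 2x exactly, except for one response outlier at x = 5.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 7, 9, 100, 13]

ols_a, ols_b = ols_line(xs, ys)
lad_a, lad_b = lad_line(xs, ys)
print("least squares intercept, slope:", ols_a, ols_b)
print("LAD intercept, slope:          ", lad_a, lad_b)
```

Here the LAD fit recovers the line through the six clean points, while the least squares slope is pulled far from 2 by the single outlier. The example deliberately places the outlier in the response; the sensitivity of LAD to leverage points that the abstract discusses is not shown.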
A Nonstationary Negative Binomial Time Series with Time-Dependent Covariates: Enterococcus Counts in Boston Harbor
Brent Coull, Ph.D.
Department of Biostatistics
Harvard School of Public Health
Boston Harbor has had a history of poor water quality, including
contamination by enteric pathogens. We conduct a statistical analysis
of data collected by the Massachusetts Water Resources Authority
(MWRA) between 1996 and 2002 to evaluate the effects of court-mandated
improvements in sewage treatment. Motivated by the ineffectiveness of
standard Poisson mixture models and their zero-inflated counterparts,
we propose a new negative binomial model for time series of
Enterococcus counts in Boston Harbor, where nonstationarity and
autocorrelation are modeled using a nonparametric smooth function of
time in the predictor. Without further restrictions, this function is
not identifiable in the presence of time-dependent covariates;
consequently we use a basis orthogonal to the space spanned by the
covariates and use penalized quasi-likelihood (PQL) for estimation. We
conclude that Enterococcus counts were greatly reduced near the Nut
Island Treatment Plant (NITP) outfalls following the transfer of
wastewaters from NITP to the Deer Island Treatment Plant (DITP) and
that the transfer of wastewaters from Boston Harbor to the offshore
diffusers in Massachusetts Bay reduced the Enterococcus counts near
the DITP outfalls.
This is joint work with Andy Houseman and Jim Shine.
Fall 2005 Biostatistics Colloquia
Hierarchical Bayesian Analysis of Genetic Diversity in Geographically Structured Populations
Dipak K. Dey, Ph.D.
Department of Statistics
University of Connecticut
Populations may become differentiated from one another as a result of
genetic drift. The amounts and patterns of differentiation at neutral loci
are determined by local population sizes, migration rates among
populations, and mutation rates. We provide exact analytical expressions
for the mean, variance and covariance of a stochastic model for
hierarchically structured populations subject to migration, mutation, and
drift. In addition to the expected correlation in allele frequencies among
populations in the same geographical region, we demonstrate that there is
a substantial correlation in allele frequencies among regions at the top
level of the hierarchy. We propose a hierarchical Bayesian model for
inference of Wright's F-statistics. We illustrate the approach
through an analysis of human microsatellite data, revealing that
approaches ignoring the among-population correlation of allele frequencies
underestimate the amount of genetic differentiation among major
geographical population groups by approximately 50%, and we discuss the
implications of these results for the use and interpretation of
F-statistics in evolutionary studies. We further provide exact
expressions for the first two moments of a stochastic model appropriate
for studying microsatellite evolution under the assumption that the range
of allele sizes is bounded. Using these results we study the behavior of
several measures related to Wright's FST, including
Slatkin's RST.
Non- and Semiparametric Modeling in Applications
Naisyin Wang, Ph.D.
Professor of Statistics and Toxicology
Texas A&M University
Due to their flexibility and ease of implementation, various non- and
semiparametric models have been used more often in recent biological or
medical studies. These models allow underlying trends of the responses to
be unspecified and nonparametric. In this talk, I will discuss several
recent applications of non- and semiparametric modeling. They include a
colon tumorigenesis study which links BCL2 expression with DNA adduct, a
membrane protein clustering tendency investigation, and if time allows, a
microarray normalization study to normalize partially degraded mRNA
bioarray data. Theoretical support behind the methods will be briefly
discussed. I will also use examples and simulations to illustrate the
connections between the theoretical findings and their implication in
applications.
Modeling Viral Infections
Alan S. Perelson, Ph.D.
Senior Fellow
Los Alamos National Laboratory
I will review basic models of viral infection that have been used to model
HIV, hepatitis C virus and influenza infection. I will show how such models
can be fit to data to estimate basic parameters describing the viral lifecycle
and the effects of antiviral therapy.
Strength and Frailty of Frailty Modeling in Population Studies of Aging
Anatoli Yashin, Ph.D.
Center for Demographic Studies
Duke University
In this talk I will review recent results and ideas related to frailty (random effect, or
hidden heterogeneity) modeling in population studies of aging and longevity in humans and
animals. The models and ideas belong to the areas of survival analysis, biostatistics and
genetic epidemiology.
The idea of frailty modeling was initially discussed in demographic and actuarial
applications to explain the deceleration and leveling off of human mortality rates at advanced
ages. Initially these models were investigated without taking observed covariates
into account. Non-identifiability is a crucial feature of such models, which
substantially restricts their applications. Later such models were applied to the
analysis of data from stress experiments with laboratory animals. It turns out that the
presence of data from experimental and control groups allows us to solve the identifiability
problem. I will define basic models of this class and discuss their strengths and limitations.
Then I will introduce an extension of frailty models to the case with observed covariates.
Such models were initially developed in econometrics and biostatistics. In contrast
to frailty models without observed covariates these models are identifiable. This feature
motivated development of statistical methods which allow for evaluation of the role of
hidden frailty in estimated effects of observed covariates on survival.
The concept of shared frailty emerged in response to the epidemiological
idea related to the design of the matched pair experiments. The development
of such models was accompanied by a number of confusions indicating that the
concept of shared frailty was not well understood. I will discuss the origin
of such confusions and approaches capable of avoiding them.
One such approach deals with the idea of correlated frailty. I will introduce
the correlated frailty models, discuss their properties and elucidate
applications of these models to genetic studies of human aging and longevity.
I will also discuss applications of these models to the analysis of the dependent
competing risks problem as well as directions of further research.
Spring 2005 Biostatistics Colloquia
What Does a Bayesian Approach Offer in Clinical Research?
Donald A. Berry, Ph.D.
Department of Biostatistics and Applied Mathematics
The University of Texas MD Anderson Cancer Center
My presentation is in two parts. First I argue that almost all
statistical analyses are wrong, regardless of their philosophical
underpinnings! Bayesian analyses are especially susceptible to erroneous
conclusions. A sufficiently rigorous frequentist approach is immune. But
it comes with heavy baggage that slows progress. In attempting to lighten
the load, the second part of my presentation addresses Bayesian
innovations in clinical trials, with particular focus on design. There is
renewed interest in, and a greater appreciation of, the benefits of using a
Bayesian approach in medical research. I will describe some of these
benefits and relate them to modern attitudes in pharmaceutical and medical
device development, and to attitudes in cancer cooperative groups and at
my home institution. Of special importance are the uses of (i) flexible,
adaptive designs, (ii) predictive probabilities, and (iii) hierarchical
modeling.
Modeling HIV-1 Drug Resistance and Fitness
John Mittler, Ph.D.
Department of Microbiology
University of Washington
Drug resistance is a major obstacle to the successful treatment of
human immunodeficiency virus type 1 (HIV-1) infection. Viral fitness
strongly influences the within-patient frequency of drug resistant mutants
both in the presence and absence of therapy. We have created models for
both drug resistance (IC50 values) and viral fitness in the absence of
drug. To estimate IC50 values, we used standard stepwise linear regression
to construct drug resistance models for 7 protease inhibitors and 10
reverse transcriptase inhibitors using data obtained from the Stanford HIV
drug resistance database. We evaluated these models by hold-one-out
experiments and by tests on an independent dataset. Our linear model
outperformed other publicly available genotypic interpretation algorithms,
including decision tree, support vector machine, and four rules-based
algorithms (HIVdb, VGI, ANRS and Rega) under both tests. Interestingly,
our model did well despite the absence of any terms for interactions
between different residues in protease or reverse transcriptase. The
resulting linear models are easy to understand, and can potentially assist
in choosing combination therapy regimens. To test our ability to predict
viral fitness in the absence of drug, we have used an all-atom
distance-dependent conditional probability discriminatory function
(RAPDF), a function that has been used successfully in protein structure
prediction, to estimate pseudoenergies for 132 HIV-1 protease flap-region
mutants whose cleavage rates had been determined experimentally. Although
individual discrepancies were noted, the overall correlation between RAPDF
scores and experimentally determined cleavage rates was excellent (r =
0.93 for binned data). Our RAPDF function was particularly good at
identifying mutants with very low fitness, with the 15 mutants with the
lowest RAPDF scores all having undetectable cleavage rates. Progress in
predicting IC50 and viral fitness values may lead to improved strategies
for treating HIV-1 patients.
Criteria for Evaluating Models of Absolute Risk
Mitchell H. Gail, M.D., Ph.D.
Division of Cancer Epidemiology and Genetics
National Cancer Institute
Absolute risk is the probability that an individual who is free of a
given disease at an initial age, a, will develop that disease in the
subsequent interval (a, t]. Absolute risk is reduced by mortality from
competing risks. Models of absolute risk that depend on covariates have
been used to design intervention studies, to counsel patients regarding
their risks of disease, and to inform clinical decisions, such as whether
or not to take tamoxifen to prevent breast cancer. Several general
criteria have been used to evaluate models of absolute risk, including how
well the model predicts the observed numbers of events in subsets of the
population ("calibration"), and "discriminatory power", measured by the
concordance statistic (e.g. Rockhill et al., J Natl Cancer Inst,
93, 358-366, 2001). In this paper we review some general criteria and
develop specific loss function-based criteria for two applications, namely
whether or not to screen a population to select subjects for further
evaluation or treatment and whether or not to use a preventive
intervention that has both beneficial and adverse effects. We find that
high discriminatory power is much more crucial in the screening
application than in the preventive intervention application. These
examples indicate that the usefulness of a general criterion such as
concordance depends on the application, and that using specific loss
functions can lead to more appropriate assessments.
This is joint work with Ruth M. Pfeiffer, Ph.D.
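To make "discriminatory power" concrete, here is a minimal sketch of a concordance statistic for a binary outcome: the proportion of (case, non-case) pairs in which the case was assigned the higher predicted risk, with ties counted as one half. It ignores censoring and competing risks, and the predicted risks below are invented for illustration; it is not the loss function-based criteria developed in the talk.

```python
def concordance(case_risks, noncase_risks):
    """Fraction of (case, non-case) pairs ranked correctly by predicted risk.

    Ties in predicted risk contribute one half.
    """
    score = 0.0
    for r_case in case_risks:
        for r_non in noncase_risks:
            if r_case > r_non:
                score += 1.0
            elif r_case == r_non:
                score += 0.5
    return score / (len(case_risks) * len(noncase_risks))

# Hypothetical predicted risks for subjects who did and did not develop disease.
case_risks = [0.8, 0.6, 0.4]
noncase_risks = [0.5, 0.3, 0.2, 0.1]

c_stat = concordance(case_risks, noncase_risks)
print("concordance:", c_stat)  # 11 of the 12 pairs are ranked correctly
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation. Concordance says nothing about calibration, which is the other general criterion named in the abstract.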
Microarray studies: Can they be reproduced? Can they be combined?
Giovanni Parmigiani, Ph.D.
Johns Hopkins University Bloomberg School of Public Health
Investigations of transcript levels on a genomic scale using
hybridization-based arrays have led to formidable advances in our understanding
of the biology of many human illnesses. At the same time, these
investigations have generated controversy, because of the probabilistic
nature of the conclusions, and the surfacing of noticeable discrepancies
between the results of studies addressing the same biological question. In
this lecture I will present simple exploratory data analysis tools for
gauging the degree to which the findings of one study are reproduced by
others, and for integrating multiple studies in a single analysis. I will
describe these approaches in the context of studies of both lung and
breast cancer. The main conclusion of our work to date is that it is
possible to identify a substantial, biologically relevant, subset of the
human genome within which hybridization results are reproducible. The
subset generally varies with the platform used, the tissues studied, and
the populations being sampled. Despite important differences, it is also
possible to develop simple expression measures that allow comparison
across platforms, studies, labs and populations. While these are not
perfect, important biological signal is often preserved or enhanced.
Cross-study validation and combination of microarray results require
careful, but not overly complex, statistical thinking, and can become a
routine component of genomic analysis.
Design and Analysis of Microarray Assays for Defining Predictive Gene Expression Signatures
Lutz Edler, Ph.D.
German Cancer Research Center
Heidelberg, Germany
Functional genomic data, in particular, data on gene expression levels
in patient blood or tissue samples obtained from microarrays, have
nourished the desire of clinicians as well as the pharmaceutical industry to
make use of this wealth of data for the development of new and better
targeted drugs. When these high-dimensional data are going to be collected
in the course of a clinical trial, new questions arise on how to design and
analyze the trial such that ambitious questions on gene expression can be
answered properly. The prediction of clinical phenotypes such as tumor
class, drug response, toxicity, and even survival by a small set of
predictive genes, a so-called gene expression signature or profile, has
become an important issue of clinical research, naturally coupled with the
question on the best treatment for a subgroup of patients which has been
defined by such a gene expression signature (GES). This question combines
actually two tasks: the definition of the GES and its validation as a
prognostic factor for the chosen clinical endpoint, and the determination
of a sensitive subgroup defined through that GES which then can be
evaluated for treatment effects.
This presentation discusses statistical methods for the determination
of predictive factors. Methods of class prediction/prognostic prediction
are reviewed and results from recent comparative studies presented. The
need to distinguish carefully between feature selection, construction of
the predictor and assessment of the performance of the predictor is
emphasized. The positive predictive value, better known from diagnostic
studies, is used to design a clinical trial when treatment response is
chosen as the predictive endpoint of the GES. Logistic regression is an
appropriate analysis method to account for additional prognostic factors.
By a simulation with a logistic model where treatment and GES are
independent variables one can calculate sample sizes for the logistic
regression and compare them with the sizes obtained with the predictive
value. These methods were applied in a research project on neo-adjuvant
chemotherapy of breast cancer patients. Forthcoming challenges on the
combined use of genomic and proteomic data and questions of the validation
of findings will be illustrated by examples.
Fall 2004 Biostatistics Colloquia
A Class of Bayesian Box-Cox Transformation Hazard Regression Models
Joseph Ibrahim, Ph.D.
Professor, Department of Biostatistics
UNC School of Public Health, University of North Carolina at Chapel Hill
We propose a novel and general class of Box-Cox transformation models
on the hazard functions for right censored survival data. This new class
of models allows a very broad range of shapes and relationships between
the baseline hazard and the hazard function. It includes the Cox
proportional hazards model and the additive hazards model as two special
cases. Several properties of the model are derived, and interpretations as
well as illustrations of the behavior of the Box-Cox transformation
parameter are provided. A novel class of joint prior distributions is
proposed for the model parameters. Due to the requirement of a positive
hazard function in the survival model, complex multidimensional nonlinear
parameter constraints must be imposed in the model formulation. As a
result, computations for this new Bayesian model pose many new challenges.
We propose an efficient Markov chain Monte Carlo (MCMC) computational
scheme for sampling from the posterior distribution of the parameters. The
proposed prior distributions facilitate a tractable computational
algorithm. The joint priors are constructed through a conditional-marginal
specification, in which the conditional distribution is univariate and
absorbs all of the nonlinear parameter constraints. The
marginal part of the prior specification is free of any constraints. This
novel class of prior distributions allows us to easily compute the full
conditionals needed for Gibbs sampling, incorporating the constraints, and
hence implement the Markov chain Monte Carlo algorithm in a relatively
straightforward fashion. This new class of models is illustrated with a
detailed simulation study as well as a real dataset involving a melanoma
clinical trial. Extensions to frailty models and cure rate models are
discussed.
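As a hedged sketch of what such a model class can look like, assume the standard Box-Cox transformation is applied to both the hazard and the baseline hazard, with the covariates entering additively on the transformed scale. The symbols \gamma, h_0, x and \beta are notation introduced here for illustration and may differ from the speaker's formulation.

```latex
\[
\phi_\gamma(h) =
\begin{cases}
  (h^{\gamma} - 1)/\gamma, & \gamma \neq 0,\\[2pt]
  \log h, & \gamma = 0,
\end{cases}
\qquad
\phi_\gamma\{h(t \mid x)\} = \phi_\gamma\{h_0(t)\} + x^{\top}\beta .
\]
\[
\gamma = 0:\quad h(t \mid x) = h_0(t)\,\exp(x^{\top}\beta)
\quad \text{(proportional hazards)},
\qquad
\gamma = 1:\quad h(t \mid x) = h_0(t) + x^{\top}\beta
\quad \text{(additive hazards)}.
\]
```

Under this form, for \gamma \neq 0 the hazard satisfies h(t \mid x)^{\gamma} = h_0(t)^{\gamma} + \gamma\, x^{\top}\beta, so a positive hazard requires the right-hand side to be positive for all covariate values and times considered. That is one way to see where nonlinear constraints linking \beta, \gamma and the baseline hazard come from.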
Statistical Comparison of Medical Images
Eugene Demidenko, Ph.D.
Associate Professor, Section of Biostatistics and Epidemiology
Dartmouth Medical School
Imaging technology has become an essential tool of biomedical research. On
the other hand, issues of statistical image analysis and particularly
image comparison are underdeveloped. Today, it is hard to publish a paper
without providing a p-value when comparing two treatment groups. When it
comes to image comparison, researchers just show several arbitrarily picked
images to illustrate their findings. We develop a statistical theory of
content independent image comparison based on the multinomial distribution
of 256 gray level intensities. This statistical model is suitable for
medical microscopic images frequently emerging in cellular and molecular
biology. Parametric tests, such as the likelihood-ratio test, or
nonparametric tests, such as the Kolmogorov-Smirnov test, are applied. We
generalize our tests to ensembles of images to account for biological
heterogeneity via a mixed-effects approach (E. Demidenko, Mixed Models:
Theory and Applications,
Wiley, Hoboken, NJ, 2004). The advantage of our approach is that images
may be adjusted for patient age, gender, experimental conditions, etc. We
illustrate our analysis by comparison of cancer cell images from four
treatment groups.
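As a rough illustration of the kind of comparison described above, the following minimal sketch computes a two-sample Kolmogorov-Smirnov distance and a multinomial likelihood-ratio (G) statistic from the 256-bin gray-level histograms of two images. The function names and the treatment of pixels as independent draws are assumptions made for illustration only; this is not the speaker's implementation and it omits the mixed-effects extension.

```python
import math

def gray_histogram(pixels, levels=256):
    """Count how many pixels fall at each gray level 0..levels-1."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts

def ks_statistic(counts_a, counts_b):
    """Largest absolute gap between the two empirical gray-level CDFs."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    cdf_a = cdf_b = 0.0
    d = 0.0
    for a, b in zip(counts_a, counts_b):
        cdf_a += a / n_a
        cdf_b += b / n_b
        d = max(d, abs(cdf_a - cdf_b))
    return d

def lr_statistic(counts_a, counts_b):
    """Likelihood-ratio (G) statistic for equality of two multinomials."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    g = 0.0
    for a, b in zip(counts_a, counts_b):
        pooled = (a + b) / n  # pooled estimate of the bin probability
        if a > 0:
            g += a * math.log(a / (n_a * pooled))
        if b > 0:
            g += b * math.log(b / (n_b * pooled))
    return 2.0 * g
```

Both statistics depend only on the gray-level distributions, not on where the intensities occur in the image, which is one way to read "content-independent" here.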
Biostatistical Challenges in Molecular Epidemiology
William D. Shannon, Ph.D.
Associate Professor of Biostatistics in Medicine
Division of General Medical Sciences and Biostatistics
Washington University School of Medicine
Epidemiology is the study of the distribution and size of disease
problems in human populations, in particular to identify etiological
factors in the pathogenesis of diseases and to provide the data essential
for the management, evaluation and planning of services for the
prevention, control and treatment of disease (Everitt). Molecular
epidemiology uses molecular biology to identify the etiological factors,
and is a growing and important area of biomedical research.
Molecular epidemiology presents new challenges to data analysts. Modern
molecular biology can measure tens of thousands of molecular variables
rapidly and cheaply (e.g., gene chips measure the activity of tens of
thousands of genes, genotyping is routinely done at hundreds or thousands
of markers, and proteomics has the potential of characterizing the entire
protein content of tissues). The limiting step in molecular epidemiology
is the small number of human subjects these measurements are made on (the
large P, small N problem).
In this talk I address three statistical problems faced when analyzing
molecular epidemiology data. The first problem is the proper
identification of patient subgroups within which statistical tests of
genotype-phenotype association should be applied. The second problem is
the testing of clinical covariates against a large number of molecular
variables. The third problem is the selection of important molecular
factors related to disease. While these problems can be defined in the
language of classical statistics (i.e., population stratification,
over-determined systems, and variable selection, respectively), classical
statistics will fail due to the 'large P, small N' problem (and there
ain't no getting around that!).
I will argue that new ways of thinking about statistics will be needed
for this data.
http://ilya.wustl.edu/~shannon/UnivRochester.ppt http://ilya.wustl.edu/~shannon/DIMACsPaperSubmitted.pdf
Quantile Volcano Plots for Identifying Significant Genes in Microarray Data
William D. Shannon, Ph.D.
Associate Professor of Biostatistics in Medicine
Division of General Medical Sciences and Biostatistics
Washington University School of Medicine
Quantile Volcano Plots are proposed as a modification to standard
Volcano Plots to improve identification of genes from microarray
experiments with statistical and biological significance. Standard Volcano
Plots declare genes to have significantly different expression between two
sample types based on both biological difference (absolute log2(estimated
fold change) greater than some arbitrary threshold) and statistical
difference (-log10(P value) greater than some arbitrary threshold).
Quantile Volcano Plots improve this method by fitting a quantile
regression curve to the null distribution of the standard Volcano Plot
data and declaring genes significant based on their relationship to this
curve. Since the quantile regression curve adapts to the shape of the
data, this method avoids the use of arbitrary constant thresholds for
deciding which genes are differentially expressed. In this talk I will
describe the algorithm and illustrate its use with pharmacogenomic
microarray data.
http://ilya.wustl.edu/~shannon/QVPTalk.ppt http://ilya.wustl.edu/~shannon/QuantileVolcanoPlot.pdf
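For reference, the standard Volcano Plot rule that the abstract sets out to improve can be sketched as follows. The function name, the input layout, and the default thresholds (a two-fold change and P < 0.05) are illustrative assumptions; the quantile regression curve of the proposed method is not implemented here.

```python
import math

def volcano_calls(genes, fc_threshold=1.0, p_threshold=0.05):
    """Return names of genes passing both constant thresholds.

    genes: iterable of (name, estimated_fold_change, p_value) tuples.
    fc_threshold is on the absolute log2 fold-change scale (assumed default).
    """
    min_neg_log10_p = -math.log10(p_threshold)
    selected = []
    for name, fold_change, p_value in genes:
        log2_fc = math.log2(fold_change)
        neg_log10_p = -math.log10(p_value)
        # Biological difference AND statistical difference must both hold.
        if abs(log2_fc) > fc_threshold and neg_log10_p > min_neg_log10_p:
            selected.append(name)
    return selected
```

The two constants define a rectangular cutoff in the plot, which is the arbitrariness the quantile-based curve is meant to remove.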
Mixed-Effects Models for Ordinal Data with Scaling Terms
Donald Hedeker, Ph.D.
Professor of Biostatistics
Division of Epidemiology and Biostatistics
School of Public Health, University of Illinois at Chicago
Mixed-effects logistic regression models are described for analysis of
two-level ordinal outcomes, where observations are nested within
clusters. Random effects are included in the model to account for the
correlation of the clustered observations. This correlation can be the
same for all clusters or allowed to vary by groups of clusters.
Additionally, whereas the usual logistic model assumes that the covariate
effects are the same across the cumulative logits (i.e., proportional odds
assumption), we describe two extensions to relax this assumption. The
first permits separate covariate effects to be estimated for each of the
C-1 cumulative logits (where C = number of ordered categories). The second
extension instead allows covariates to influence the scale of the ordinal
response, in addition to their usual influence on the location. This
latter extension can be more parsimonious since it adds only one parameter
for each covariate. Additionally, it can be used to partition the degree
of within- and between-cluster variance. For implementation, a maximum
marginal likelihood (MML) solution is described. An analysis is presented
of a dataset from an adolescent smoking study, highlighting and comparing
these extensions of the proportional odds mixed model.
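To make the location and scale roles concrete, here is a minimal sketch of category probabilities under a cumulative logit model with a log-linear scale term. The parameterization (thresholds minus location, divided by the exponentiated scale predictor) is one common form assumed here for illustration; the names are hypothetical, and random effects and estimation are omitted.

```python
import math

def logistic(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(thresholds, location, log_scale=0.0):
    """Probabilities of C ordered categories from C-1 increasing thresholds.

    location:  linear predictor shifting the response (x'beta, assumed).
    log_scale: linear predictor for the scale (x'tau, assumed); 0 gives
               the ordinary proportional odds model.
    """
    scale = math.exp(log_scale)
    cumulative = [logistic((g - location) / scale) for g in thresholds]
    cumulative.append(1.0)  # P(Y <= C) is always 1
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs
```

Under this form a covariate adds a single scale coefficient rather than C-1 separate effects, which is the parsimony the abstract refers to.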
Autoregression and Measurement Error
John Staudenmayer, Ph.D.
Department of Mathematics and Statistics
University of Massachusetts Amherst
Motivated by common experimental designs and models in ecology, we
consider the problem of a time series that has been observed with
measurement error. Focusing on autoregressive models and additive
measurement error, we derive the biases that are caused by ignoring the
measurement error. After that, we develop some new methods to correct for
the effects of measurement error. The new methods take advantage of
estimates of the measurement error model's parameters that commonly are
available in applications. The new methods are based on estimating
equations and pseudo-maximum likelihood. Asymptotic comparisons and small
sample simulations demonstrate (not surprisingly) that new methods that
use estimates of the measurement error model's parameters are much more
efficient than existing methods that (somewhat surprisingly) do not. There
is little difference between the simple estimating equation approach and
the more complicated pseudo-maximum likelihood approach. Time permitting,
we will also talk about the effect of measurement error on second order
bias. This is joint work with John Buonaccorsi.
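The bias from ignoring measurement error can be illustrated in the simplest case: a stationary AR(1) series observed with independent additive error. The naive lag-1 autocorrelation then converges to the true coefficient multiplied by the ratio of signal variance to total variance. The sketch below, with hypothetical function names and an assumed Gaussian setup, compares that limit with a simulated estimate; it does not implement the correction methods from the talk.

```python
import random

def attenuated_ar1(phi, innovation_var, error_var):
    """Large-sample limit of the naive lag-1 autocorrelation."""
    signal_var = innovation_var / (1.0 - phi ** 2)  # stationary AR(1) variance
    return phi * signal_var / (signal_var + error_var)

def naive_lag1(series):
    """Sample lag-1 autocorrelation, ignoring measurement error."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

def simulate_observed(phi, innovation_sd, error_sd, n, seed=0):
    """Simulate an AR(1) process and return it with additive white noise."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, innovation_sd / (1.0 - phi ** 2) ** 0.5)
    observed = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, innovation_sd)
        observed.append(x + rng.gauss(0.0, error_sd))
    return observed
```

For example, with a true coefficient of 0.8 and unit innovation and error variances, `attenuated_ar1(0.8, 1.0, 1.0)` gives 0.8 / 1.36, so the naive estimate is pulled well toward zero.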
Spring 2004 Biostatistics Colloquia
Information Mining and Services Research: It's not computing prowess alone
Siddhartha R. Dalal, Ph.D.
Vice President, Imaging and Services Technology Service Center
Xerox Corporation
With the need for a tremendous amount of information processing,
information technology is the fastest-growing area, affecting almost
every facet of human life. In spite of impressive gains, there are still
many basic technology and business challenges in Information Mining and
Services Research that cannot be solved by computational prowess alone. I
will describe Information Mining and Services Research and discuss
examples of the challenges involving Search Engine technologies, Imaging
Science and Software Engineering. On the surface, traditional information
theoretic considerations do not offer solutions. Accordingly, researchers
looking for conventional solutions would have difficulty in solving these
problems. I will describe how alternative information sciences based
formulations have played a critical role in addressing these problems.
Biographical Sketch: Siddhartha Dalal is Vice President of Imaging
and Services Technology Center (ISTC) at Xerox. ISTC's staff of
world-class scientists and engineers creates Xerox's benchmark digital
imaging technologies and document solutions. Prior to Xerox, Sid started
his industrial research career in the Math Research Center at Bell Labs, and
worked at Bellcore as a Chief Scientist and Telcordia Technologies as an
Executive Director. Sid's past research has focused on information
extraction, analysis and services. He has published over 70 research
papers and has coauthored two reports on Software Engineering on behalf of
the National Academy of Sciences. He has an MBA in Marketing (1973) and a PhD
in Statistics (1975) from the University of Rochester.
Likelihood Ratio Tests That Certain Variance Components Are Zero
Ciprian Crainiceanu, Ph.D.
Visiting Professor, School of Operations Research and Industrial Engineering
Cornell University
We consider the problem of testing null hypotheses that include the
constraint that some specified variance components are zero in a Linear
Mixed Model (LMM). The finite-sample and asymptotic distributions of the
Likelihood Ratio Test (LRT) and Restricted Likelihood Ratio Test (RLRT)
statistics are derived for LMMs with one random-effects variance component.
A parametric bootstrap approach is recommended for LMMs with more than one
variance component. In particular, the large-sample chi-square mixture
approximation of these distributions using the usual asymptotic theory
(e.g., Self and Liang) for a parameter on the boundary is shown to be
inadequate for this problem.
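For readers unfamiliar with the approximation being criticized, the sketch below computes a p-value under the usual equal-weight mixture of a point mass at zero and a chi-square with one degree of freedom, the textbook boundary result for testing a single variance component. The function names are hypothetical, and the code is shown only as the baseline; it is not the finite-sample distribution derived in the talk.

```python
import math

def chi2_1_tail(x):
    """Upper tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

def mixture_p_value(lrt_statistic):
    """P-value under the 0.5 * chi2_0 + 0.5 * chi2_1 mixture approximation."""
    if lrt_statistic <= 0.0:
        return 1.0  # the point mass at zero covers non-positive statistics
    return 0.5 * chi2_1_tail(lrt_statistic)
```

An observed statistic of 3.84 gives a mixture p-value near 0.025, half of the naive chi-square value.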
We discuss possible applications such as testing for subject effect in
one-way ANOVA models and linear or nonlinear regression against a general
alternative modeled by penalized splines. Extensions to testing
semiparametric versus nonparametric regression models are presented.
Results apply to virtually all types of basis function used in
nonparametric statistics (truncated polynomials, B-splines, trigonometric
polynomials, etc.) and for any type of quadratic penalty.
Modeling Prostate Cancer Incidence
Aniko Szabo, Ph.D.
Huntsman Cancer Institute
University of Utah
The introduction of a screening regimen changes disease history,
presentation and survival in many ways. A striking example of this
phenomenon is prostate cancer screening. Prostate cancer is one of the
most common cancers in American men. PSA screening for prostate cancer has
been available since the late 80s, and prostate cancer mortality has been
decreasing since the early 90s, leading one to hypothesize a causal link.
Surprisingly, the benefit of PSA screening still has not been conclusively
established or quantified. I will talk about the various effects of
screening in general and present a statistical model of prostate cancer
incidence that allows us to estimate these effects for PSA screening.
Haplotype Inference, Genotyping Uncertainty, and Disease Mapping
Jun Liu, Ph.D.
Department of Statistics
Harvard University
Haplotypes have become increasingly popular because of the abundance of
single nucleotide polymorphisms (SNPs) and the limited power of the
single-locus analyses. Since experimental procedures for determining
haplotype phases for an individual are expensive, many computational
methods have been developed to infer haplotypes from genotype data of a
group of unrelated individuals. In the past few years, our group has
developed the partition-ligation idea for handling data with a large number of SNP
markers for each individual. The idea is to partition the whole haplotype
into smaller segments. Then we use either the Gibbs sampler or the EM
algorithm to construct the partial haplotypes of each segment and to
assemble these segments together.
This talk will review some of these haplotype inference models and
algorithms, discuss issues and problems in human haplotype structures
(e.g., haplotype blocks), and examine the impact of haplotype inference on
linkage disequilibrium (LD) mapping of disease mutations. We found that
haplotype inference should be carried out jointly with the LD mapping
model to achieve the most accurate location estimation.
Testing and adjusting for dependent truncation
Rebecca Betensky, Ph.D.
Department of Biostatistics
Harvard School of Public Health
Randomly truncated survival data arise when the failure time is
observed only if it falls within a subject-specific truncating set.
Available estimators of the survival function and regression models rely
on the key assumption that the joint density of failure and truncation
times factors into a product proportional to the individual densities in
the observable region. This assumption of quasi-independence may be tested
to determine whether standard estimation methods apply. I will describe
tests for complex truncation schemes including double truncation and
bivariate left-truncation. In addition, I will describe semiparametric
structural models for survival analysis that are applicable when
quasi-independence does not hold. The aim of these models is to estimate
the survival function or the association between failure time and
covariates, while accounting for dependent truncation. I will illustrate
the methods using several real data sets.
Systems Biology of the Drosophila blastoderm: What can we learn?
John Reinitz, Ph.D.
Department of Applied Mathematics and Statistics
Stony Brook University
A central problem in developmental biology is to understand the
dynamics of the determination of a morphogenetic field. This process
entails the expression of genes in precise spatial patterns, and is a
consequence of transcriptional control, itself a central problem of
molecular biology. Spatially controlled gene expression cannot as yet be
assayed in microarrays, but certain special properties of the fruit fly
Drosophila which make it a premier system for developmental genetics also
enable it to be used as a naturally grown differential display system for
a systems biology analysis of segment determination and transcriptional
control. We are analyzing these problems in the early Drosophila embryo
using a combination of experiment, theory, and large-scale numerical
computation.
In the course of this analysis, we have obtained quantitative gene
expression data of unprecedented spatial and temporal resolution. These
data show that expression domains in the posterior portion of the embryo
move anteriorly during the blastoderm stage of development. I will report
on the results of a dynamical analysis of this phenomenon that shows that
it is incompatible with the positional information model of Wolpert. In
addition, I will present some new results in other areas of investigation.
Fall 2003 Biostatistics Colloquia
Distribution-based marginal regression models for longitudinal data
Jianhua Huang
Department of Statistics
University of Pennsylvania
The increasing popularity of longitudinal studies in clinical trials
and epidemiological studies has made statistical methodology capable of
handling repeated measurements an intensive subject of investigation for
the past two decades. By allowing the subjects to be repeatedly measured
over time, longitudinal studies are well suited for investigating the
temporal trends of the outcomes and the covariates. Currently, most of the
statistical methods that have been developed for this type of data, most
notably the generalized estimating equations and the mixed-effects models,
are focused on modeling the conditional mean of a repeatedly measured
outcome variable given time and a set of covariates through regression.
Although successful in many applications, the "conditional
mean-based regression" approach is potentially inadequate when the
conditional mean is an inappropriate measure for the scientific question
being investigated. Such situations may arise when (a) the outcome
variable has a highly skewed or non-Gaussian conditional distribution
whose characteristics cannot be well captured by the mean, or (b) the
outcome variable has ordinal scales or a mixed distribution, so that its
mean does not have a meaningful interpretation.
In this talk, we will discuss a class of marginal regression models for
longitudinal data based on conditional distributions, which provides an
alternative to the traditional conditional mean-based regression. The focus
of the talk will be on the two-sample problem. More general cases involving
arbitrary covariates will only be sketched.
Non-Parametric, Hypothesis-Based Analysis of Molecular Heterogeneity for Comparative Phenotype Characterization
Jeanne Kowalski
Assistant Professor of Oncology and Biostatistics
Johns Hopkins Kimmel Cancer Center, Johns Hopkins University
Advances in technology have led to an explosion of molecular research
in many fields. Oncology researchers study molecular markers for
diagnostic tools by relating expressions from thousands of genes to cancer
status, while HIV researchers study drug resistance by relating genetic
mutations to altered drug susceptibility. Both tasks include statistical
issues of high dimensionality coupled with small sample sizes and thus
preclude formal hypothesis testing based on conventional
principles.
In this talk, I describe two novel, inference-based
approaches to analysis of molecular heterogeneity associated with
phenotypes. A common theme among them is the construction of testable
hypotheses with assumptions that reflect the complex structure of genetic
data. With a modest sample, I discuss a distance-based approach to
analysis of genetic heterogeneity based on population sequence data. With
the extreme case of several single samples that are to be compared from a
microarray experiment, I introduce a stochastic linear hypothesis approach
to estimate a number of genes that meet several criteria, beyond
experimental variation. In each setting, I also discuss bioinformatics
approaches to characterize genes or locations and mutation patterns that
depict phenotypes. As motivation for the methods, I examine two separate
problems, one for relating differences in a region of the HIV genome to
drug resistance, and a second for relating gene expressions with
hypothesized pathways for immunogenetic analysis of T cells.
Parameter Estimation for Stochastic Systems
Peter W. Glynn
Thomas W. Ford Professor, School of Engineering
Management Science & Engineering Department, Stanford University
Stochastic models often give rise to difficult parameter estimation
problems. These problems can be both analytically and computationally
challenging. In this talk, we will discuss several mathematical and
computational issues that arise in this setting, and describe some of the
theory and algorithms that are appropriate to solving such problems.
Local Likelihood Density Estimation for Interval Censored Data
W. John Braun
Associate Professor
Department of Statistical & Actuarial Sciences
University of Western Ontario
We propose a class of local likelihood density estimates for data that
is either interval-censored or has been aggregated into bins. One member
of this class retains the simplicity and intuitive appeal of the usual
kernel density estimate for complete data. It results from an algorithm
that generalizes the self-consistency algorithms of Efron (1967), Turnbull
(1976), and Li et al. (1997) by introducing kernel smoothing at each
iteration. Intuition suggests this is unlikely to perturb algorithms known
to converge; however, establishing convergence for the class proceeds from
implementation of an estimator as a Newton iteration. Newton iteration for
the class requires an explicit solution of the local likelihood equations
which, when not directly available, can be found by using symbolic
Newton-Raphson (Andrews and Stafford 2000).
The entire class results from a local EM approach using the methods of
Loader (1996) and Hjort and Jones (1996) who propose local likelihood
density estimates for complete data. We focus on local polynomial
expansions of the log density that offer adjustments having the potential
to reduce bias at peaks and endpoints. Uses of the methods for histogram
smoothing and scatterplot smoothing are considered. The methods are
applied to HIV data, where interval censoring is common, and to the
Ontario health survey, where data has been aggregated into bins.
This is joint work with Thierry Duchesne at Université Laval and Jamie
Stafford at the University of Toronto.
Spring 2003 Biostatistics Colloquia
Multivariate regression models for estimating global exposure effects
Jason Roy
Brown University
Nonparametric Regression Methods for Longitudinal Data Modeling with Applications in AIDS Clinical Trials
Hulin Wu
Frontier Science & Technology Research Foundation and Center for Biostatistics in AIDS Research (CBAR)
Harvard School of Public Health
Longitudinal data such as repeated measurements taken on each of a
number of subjects arise frequently in many clinical and biomedical
studies. Parametric mixed-effects models such as linear mixed-effects
(LME) models (Laird and Ware 1982, Diggle, Liang and Zeger 1994) and
nonlinear mixed-effects (NLME) models (Davidian and Giltinan 1995, Vonesh
and Chinchilli 1996) have been widely used in longitudinal data analysis.
However, in many cases parametric models may not be available or the
parametric assumptions may not be reliable, so nonparametric regression
techniques need to be developed for longitudinal data modeling and
analysis. In this talk, I will introduce the mixed-effects modeling idea
into the local polynomial smoothing approach to deal with the special
correlation structure of longitudinal data. We can show that the proposed
estimators are more powerful and efficient compared to the standard
working-independent estimators. Our modeling strategy accounts for the
within-subject and between-subject variations of the longitudinal data in
a natural way, and we can obtain the estimate of the population profile as
well as the individual profiles using the empirical Bayes method. The
bandwidth selection strategies will be discussed. The asymptotic theories
of our population estimators are established. Simulation studies are
conducted to illustrate the efficiency of the proposed estimators. We
apply the proposed methods to an AIDS clinical trial for modeling the
repeated measurements of two biomarkers, plasma HIV RNA copies and CD4
cell counts.
If time permits, I will also briefly mention my research in
computational biology/bioengineering, modeling HIV RNA/ immune cell
dynamics and AIDS clinical trial simulations.
Asymptotic Distribution-free Confidence Intervals for a New Measure of Bivariate, Partial and Multiple Correlation
Douglas Bonett
Statistical Laboratory and Department of Statistics
Iowa State University of Science and Technology
Transitive Functional Annotation by Shortest Path Analysis of Gene Expression Data
Jasmine (Xianghong) Zhou
Department of Biostatistics
Harvard School of Public Health
Current methods for the functional analysis of microarray gene
expression data make the implicit assumption that genes with similar
expression profiles have similar functions in cells. However, among genes
involved in the same biological pathway, not all gene pairs show high
expression similarity. Here, we propose that transitive expression
similarity among genes can be used as an important attribute to link genes
of the same biological pathway. Based on large-scale yeast microarray
expression data, we use the shortest-path analysis to identify transitive
genes between two given genes from the same biological process. We find
that not only functionally related genes with correlated expression
profiles are identified but also those without. In the latter case, we
compare our method to hierarchical clustering, and show that our method
can reveal functional relationships among genes in a more precise manner.
Finally, we show that our method can be used to reliably predict the
function of unknown genes from known genes lying on the same shortest
path. We assigned functions for 146 yeast genes that are considered as
unknown by the Saccharomyces Genome Database and by the Yeast Proteome
Database. These genes constitute around 5% of the unknown yeast ORFome.
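The shortest-path idea can be sketched with Dijkstra's algorithm on a small gene graph. The graph layout, the gene labels, and the choice of edge weight (for instance, a distance that shrinks as expression correlation grows) are assumptions made for illustration; the abstract does not specify the authors' distance or graph construction.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the node list from source to target.

    graph: dict mapping gene -> {neighbor: non-negative distance}.
    Returns None when target cannot be reached.
    """
    best = {source: 0.0}
    parent = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]
```

In a toy graph where genes A and C are weakly similar but both are close to B, the path from A to C runs through B, which then plays the role of a transitive gene linking the pair.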
A Statistical Method for Identifying Informative Genes in Microarrays
James Yang
University of Florida
DNA microarrays can be used to monitor thousands of gene expressions in
a single experiment. Statistical analysis on microarray data provides
genetics researchers a scientific approach to answering research
questions. In this talk, a cost-effective method of making microarrays and
reading microarray data will be presented. Statistical methods to solve
the following three primary methodological problems in microarray data
analysis are proposed: (1) identify differentially expressed genes; (2)
estimate the expression difference; and (3) determine the sample size.
This talk provides a comprehensive review of statistical methods for
identifying differentially expressed genes in two-condition microarray
experiments. Following this review, a new method is proposed to select
informative genes. Simulation experiments and statistical analysis on real
data were conducted to compare the proposed method with commonly used
methods. The results indicate that the proposed gene selection method
performed better than the commonly used methods.
To estimate the gene expression differences under different conditions,
a new method has been developed in this study. The estimator is proved to
be consistent.
This study investigates a practically important yet relatively
unexplored issue: sample size determination. A new statistical method is
developed and compared with two existing methods.
A semigroup representation and asymptotic behavior of the Fisher-Wright-Moran coalescent
Marek Kimmel
Rice University
Interval-censoring, Medical Researches and Statistical Methods
Tony Sun
Department of Statistics, University of Missouri-Columbia
Interval-censoring is a type of censoring mechanism that
biostatisticians often have to face. This talk will discuss
interval-censoring problems that often occur in medical research, with a
focus on AIDS studies. In particular, the first part of the talk will
review several types of interval-censoring that we usually have to deal
with and the situations that give rise to them. In the
second part of the talk, I will consider a particular type of
interval-censoring that frequently occurs in longitudinal studies and may
be informative about the response under study. Some statistical methods
for inference are presented and the properties of the proposed methods are
established.
Functional Response Models and their Applications to Psychosocial Research
Xin M. Tu
Department of Biostatistics and Epidemiology, University of Pennsylvania Medical Center
We introduce a new class of semi-parametric (distribution-free)
regression models with functional responses. This class of functional
response models (FRM) generalizes the traditional regression models by
defining the response variable as a function of several responses from
multiple subjects. By using such multiple-subjects-based responses, the
FRM not only integrates some popular non- and semi-parametric approaches
within a unified modeling framework, but also provides a platform for
developing new models for addressing limitations of existing non- and
semi-parametric models. For example, by viewing the popular non-parametric
two-sample Mann-Whitney-Wilcoxon (MWW) test as a regression under FRM, we can readily
generalize it to account for multiple groups and to examine second-order
variability of the distributions (MWW is based on comparing the median or
first-order variability between two distributions), the latter of which is
an important consideration for effectiveness studies. The FRM is also
quite effective in addressing limitations of parametric models. For
example, latent variable models such as the linear mixed-effects model
(LMM) and the structural equation model (SEM) are popular in psychosocial
research. By developing new semi-parametric approaches under FRM, we can
provide robust estimates for both the population and cluster specific
parameters. In addition, these new models can even entertain interactions
of random effects, which are difficult to implement under existing
inference theory. Because of the dependency introduced by using multiple
subjects in defining the response variable, existing generalized
estimating equation (GEE) based approaches are not appropriate for making
inference about FRM. A novel approach is developed to address the
dependence issue by integrating the U-statistic theory with the GEE. The
methodology is illustrated with a real data application in psychosocial
research involving modeling correlated correlations within a longitudinal
data setting.
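The notion of a response defined on several subjects at once can be illustrated with the two-sample MWW statistic, which averages an indicator over all cross-group pairs. The sketch below is a plain U-statistic estimate of the probability that an observation from the first group falls below one from the second, counting ties as one half; the function name is hypothetical and no FRM estimating equations are implemented.

```python
def mww_probability(group_x, group_y):
    """U-statistic estimate of P(X < Y) + 0.5 * P(X = Y).

    Each term is a function of a PAIR of subjects, one from each group,
    rather than of a single subject's response.
    """
    total = 0.0
    for x in group_x:
        for y in group_y:
            if x < y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(group_x) * len(group_y))
```

A value of 0.5 corresponds to no tendency for either group to yield larger responses. Because every subject appears in many pairs, the pairwise terms are dependent, which is the issue the U-statistic-based extension of GEE addresses.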
Modeling Breast Cancer Screening
Andrei Yakovlev
Department of Biostatistics and Computational Biology, University of Rochester
I will talk about mechanistic models of cancer screening and the
natural history of cancer. Our approach is different from the one that Dr. M.
Zelen will present later. Its distinct advantage is that one can derive
the joint distribution of some important clinical covariates (age, tumor
size, nodal involvement) at the time of diagnosis. Using this
distribution, we obtain estimates of model parameters from the data
generated by the Canadian Breast Screening Studies. This approach allows
us to model both cancer incidence and mortality, while other models
require the incidence to be input by the investigator. The conditional
survival function (given covariates) is estimated from the SEER data using
an extended hazard regression model allowing for a non-zero cure rate. All
this information is put together in a comprehensive simulation model to
make predictions of the national trends in breast cancer incidence and
mortality. When making such predictions and comparing them with actual
observations, we came up with some conclusions that may have serious
medical implications. The last story is probably the most interesting part
of my presentation.
Inference of multiple pedigree relationships based on genotypic data
Anthony Almudevar
Department of Mathematics and Statistics, Acadia University, Wolfville, Nova Scotia
The estimation of pedigree relationships between individuals is a
problem of some interest in the biological, medical and forensic sciences.
Many important genetic parameters associated with a group of organisms
depend directly on knowledge of their pedigree. In addition, knowledge of
pedigree relationships is crucial in selective breeding or conservation
programs. However, pedigrees are often unknown or suspect, and must be
estimated.
Pedigree estimation can be performed using codominant genetic markers,
with a statistical basis in the rules of Mendelian inheritance, from which
maximum likelihood pedigree relationships may be deduced. While this
procedure is commonly used for pairs or triplets of individuals, the
problem of reconstructing a pedigree among numerous individuals introduces
considerable computational challenges. The large number of putative
pedigrees rules out an enumerative approach for all but the smallest
samples (Painter 1997). One approach commonly used is to construct larger
pedigrees from separate pairwise or triplet kinship estimates. However,
much information can be lost in this approach (Geyer et al. 1993, Thomas
& Hill 2000), hence there will be some benefit to the development of
algorithms for the maximization of the pedigree likelihood function
defined on all individuals simultaneously.
I will discuss a general approach to this problem, which uses a type of
hybrid algorithm. A class of constraints on the admissible set of
pedigrees is defined in such a way that a constrained optimization of the
pedigree likelihood is computationally feasible. A simulated annealing
algorithm is then used to determine the constraint yielding the global
maximum likelihood pedigree.
This approach will be demonstrated for two types of problem. In the
first, it is assumed that parents of all nonfounders are represented in
the sample, and that the founders themselves are unrelated (a complete
sample). In the second, genetically important individuals need not be in
the sample (an incomplete sample). This situation arises, for example,
when siblings, but possibly not parents, are present in the sample.
Bayesian Normalization and Identification for Differential Gene Expression Data
Dabao Zhang
Cornell University
A new framework for normalizing spotted microarray data and identifying
differentially expressed genes is developed by using a Bayesian analysis.
First, we propose a measurement-error model, which improves the usual
semiparametric model for intensity-dependent normalization and takes
account of the measurement errors in the observed overall intensities.
Second, a Bayesian analysis of the semiparametric measurement-error model
is constructed. The analysis avoids the potential risk in using the common
two-step procedure for intensity-dependent normalization. We also suggest
a Bayesian identification of differentially expressed genes which
automatically takes into account the dimension of the multiple hypothesis
tests by shrinking the alternative posteriors to zero. Both
simulation and application to real microarray data demonstrate promising
results.
Early Detection of Disease and Stochastic Models
Marvin Zelen
Department of Biostatistics, Harvard School of Public Health
The early detection of disease presents opportunities for using
existing technologies to significantly improve patient benefit. The
possibility of diagnosing a chronic disease early, while it is
asymptomatic, may result in diagnosing the disease at an earlier stage,
leading to a better prognosis. Many cancers, diabetes, tuberculosis,
cardiovascular disease, HIV related diseases, etc. may have a better
prognosis when early diagnosis is combined with an effective treatment. However, gathering
scientific evidence to demonstrate benefit has proved to be difficult.
Clinical trials have been arduous to carry out, because of the need to
have large numbers of subjects, long follow-up periods and problems of
non-compliance. Implementing public health early detection programs has
proved to be costly and not based on analytic considerations. Many of
these difficulties are a result of not understanding the early disease
detection process and the disease natural histories. One way to approach
these problems is to model the early detection process. This talk will
discuss stochastic models for the early detection of disease. Breast
cancer will be used to illustrate some of the ideas. The talk will discuss
breast cancer randomized trials, stage shift and benefit, scheduling of
examinations, the issue of screening younger women and those at elevated risk,
and the planning of trials.
Getting Usable Data from Microarrays: The Role of Statisticians
Rafael A. Irizarry
Dept. of Biostatistics, Johns Hopkins University
In this talk I will give some examples of why I think it is important
that statisticians be involved in preprocessing of microarray data. I will
then describe a specific example related to preprocessing Affymetrix
GeneChip high density oligonucleotide array raw data. High density
oligonucleotide expression array technology is widely used in many areas
of biomedical research for quantitative and highly parallel measurements
of gene expression. Affymetrix GeneChip arrays are the most popular. In
this technology each gene is typically represented by a set of 11-20 pairs
of oligonucleotides separately referred to as probes. Typically 12,000 to
20,000 probe sets are arrayed on a silicon chip. RNA samples are prepared,
labeled and hybridized to the arrays. Arrays are then scanned, and images
produced and analyzed to obtain an intensity value for each probe. These
intensities quantify the extent of the hybridization between the labeled
target sample and the oligonucleotide probe. A final step to obtain
expression measures is to summarize the probe intensities for a given gene
in order to quantify the amount of the corresponding mRNA species in the
sample. Using two extensive spike-in studies and a dilution study, we
performed a careful assessment of the method of summarizing probe level
data provided by the current version of the Affymetrix Microarray Suite
(MAS 5.0). We found that the performance of the Affymetrix technology can
be greatly improved by the use of expression measures derived from
empirically motivated statistical models. The advantages of a new
expression measure are assessed through bias, variance, sensitivity, and
specificity. In particular, the improvements achieved by a 10-fold
decrease in variability for low expression levels are demonstrated. A
paper describing this example can be found on the web:
http://www.biostat.jhsph.edu/~ririzarr/papers
Generalized Self-Consistency Methods in Cancer Survival
Alexander Tsodikov
Huntsman Cancer Institute, University of Utah
A unified approach is proposed for model building and construction of
numerically efficient algorithms for maximum likelihood inference for a
large class of semiparametric survival models. The approach is based on a
generalization of the idea of self-consistency and links EM algorithms for
frailty models and recently developed MM algorithms. A composition technique
is developed for building hierarchical model families compatible with the
algorithms. Applications of the methodology to various cancer studies are
described.
Measurement Errors and Data Transformation for Gene Expression Data, Proteomics and Metabolomics Data
David M. Rocke
University of California, Davis
Gene expression microarrays comprise a suite of related technologies
for measuring the expression of thousands of genes simultaneously from a
single biological sample. There are also numerous other high-throughput
biological assays that can measure large numbers of proteins, lipids, and
other biologically active compounds. In this talk, I will describe an
important statistical challenge in the use of such data. Using raw data,
logarithms, or ratios, the variability of the measurements is strongly
dependent on the level of expression, causing a failure of the assumptions
of most standard methods of statistical analysis. We present a solution to
this problem via a specially tuned data transformation and show how it
promotes the effectiveness of simple and sophisticated analyses of the
data.
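The abstract does not name the transformation. One transformation of this kind known from the microarray literature is the generalized logarithm, and it is used here purely as an assumed illustration. The sketch assumes a two-component error model in which a measurement with true level mu has standard deviation sqrt(sigma^2 + (s*mu)^2), that is, additive noise sigma plus proportional noise s. The function names and parameter values are hypothetical. A first-order (delta method) approximation then shows that the transformed standard deviation no longer depends on the expression level.

```python
import math


def glog(y, c):
    """Generalized logarithm: log(y + sqrt(y^2 + c)), defined for all real y when c > 0."""
    return math.log(y + math.sqrt(y * y + c))


def approx_sd_after_glog(mu, sigma, s):
    """Delta-method standard deviation of glog(y) at true level mu.

    Assumed error model: sd(y) = sqrt(sigma^2 + (s * mu)^2).
    The tuning constant is taken as c = (sigma / s)^2.
    """
    c = (sigma / s) ** 2
    sd_raw = math.sqrt(sigma ** 2 + (s * mu) ** 2)
    slope = 1.0 / math.sqrt(mu ** 2 + c)  # derivative of glog at mu
    return sd_raw * slope


# Hypothetical noise parameters, for illustration only.
sigma, s = 20.0, 0.1
levels = [0.0, 50.0, 500.0, 50000.0]
sds = [approx_sd_after_glog(mu, sigma, s) for mu in levels]
```

Under these assumptions every entry of `sds` is approximately equal to `s`, whereas the raw standard deviation grows with the level and the standard deviation of a plain logarithm blows up near zero.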
Graphical Analysis of Recurrence Data on Disease Episodes, Product Repairs, and Other Applications
Wayne Nelson
Consultant, Schenectady, New York
Most reliability and survival data analysis methods concern life data
on units that fail once and thus have a life distribution. In contrast, in
many applications, units experience recurrent events, which require
special models and data analyses, which are not well known. Examples
include number or cost of recurrent disease episodes in patients, repairs
of products, customer purchases on Amazon.com, and births of children to
statisticians. Then one wants to estimate the population mean cumulative
function (MCF) for the 1) number or 2) cost of recurrences per unit. This
talk presents a simple nonparametric estimate and plot of the MCF, which is
used to a) evaluate whether the population recurrence rate is increasing
or decreasing as the population ages, information useful for product
burn-in, overhaul, and retirement decisions, b) predict future numbers and
costs of recurrences for a unit or population, c) compare data sets from
different populations, e.g., different disease treatments or different
product designs or production periods, d) reveal unexpected information,
a great advantage of data plots.
This talk also presents approximate confidence limits for a population MCF,
allowing one to assess
the accuracy of an MCF estimate. Previous counting process methods for
recurrent events data apply only to counts of recurrences, but the methods
here also apply to costs, product downtimes, and other measures of events.
The analyses are illustrated with data on auto and locomotive repairs,
recurrent bladder tumors, births of children to statisticians, and other
applications.
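A minimal sketch of a nonparametric MCF estimate of the kind described above is given below. It assumes each unit is recorded as its age at the end of observation plus a list of (event age, value) pairs, where the value is 1 for a count or a cost otherwise. The data layout and function name are assumptions for this example, not the speaker's notation. At each event age the estimate increases by the event's value divided by the number of units still under observation at that age.

```python
def mcf_estimate(units):
    """Nonparametric mean cumulative function estimate.

    units: list of (end_age, events), where events is a list of
           (event_age, value) pairs with event_age <= end_age.
    Returns a list of (event_age, estimated MCF just after that age).
    """
    end_ages = [end for end, _ in units]
    events = sorted((age, value) for _, evs in units for age, value in evs)

    curve = []
    total = 0.0
    for age, value in events:
        # Units whose observation window still covers this age.
        at_risk = sum(1 for end in end_ages if end >= age)
        total += value / at_risk
        curve.append((age, total))
    return curve


# Hypothetical data: three units observed to ages 10, 6 and 4.
example_units = [
    (10.0, [(2.0, 1.0), (5.0, 1.0)]),
    (6.0, [(3.0, 1.0)]),
    (4.0, []),
]
example_curve = mcf_estimate(example_units)
```

Plotting the returned pairs as a step function gives the MCF plot. A curve that bends upward suggests an increasing recurrence rate, and one that flattens suggests a decreasing rate.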
Fall 2002 Biostatistics Colloquia
Analysis of Controlled Experiments in Which the Response is a Curve
Naomi Altman
Department of Statistics, Pennsylvania State University
In this talk I discuss the use of self-modeling regression to analyze
experiments in which the response is a curve. Differences among treatments
and covariate effects are summarized by a parametric model, while the
shape of the curve is modeled nonparametrically. Tests based on linear and
nonlinear mixed models are discussed along with a simulation study of the
null distribution for the test statistics. An example using data on a bird
growth experiment will be presented. The experiment includes fixed and
random factors, and a covariate. There are several response variables,
each of which has a different growth curve. However, because the
parametric part of the model has an interpretation which is free of the
shape of the growth curves, comparisons among responses are simplified.
An Estimator for Treatment Comparisons among Survivors in Randomized Trials
David A. Schoenfeld
Professor of Medicine, Harvard Medical School, and Professor in the Department of Biostatistics, Harvard School of Public Health
This work is joint with Douglas Hayden and Donna Pauler. Abstract:
In clinical trials of advanced-stage disease it is often of interest to
perform treatment comparisons in the subgroup of survivors. For example,
in ventilation studies a primary endpoint is time on ventilation, which is
only of interest in survivors. In health-related quality of life (QOL)
studies, a secondary endpoint of interest, after the primary endpoint of
survival, is change in QOL over the observation period. Randomized
treatment comparisons for these endpoints cannot be performed since the
outcomes are only observable in the non-randomly selected subgroup of
survivors. In cancer studies duration of response to therapy has the same
problem (Schroder and Schumacher 1997, Morgan 1988). Following Rubin
(1998, 2000), we propose evaluation of the Survivor Average Causal Effect
(SACE) for treatment evaluations for endpoints censored by death. We
provide an estimator of SACE under the assumption of no unmeasured confounders,
a nontestable assumption that identifies SACE, and outline a sensitivity
analysis for exploring robustness of conclusions to deviations from this
assumption. We apply the method to three applications, duration of
ventilation from a clinical trial of Acute Respiratory Distress Syndrome
(ARDS), and QOL for patients treated for advanced-stage colorectal cancer
in a clinical trial of several chemotherapeutic regimes performed by the
Southwest Oncology Group.
Remeasurement and Corrected-Score Methods for Statistical Inference in the Presence of Measurement Error
Leonard A. Stefanski
Department of Statistics, North Carolina State University
Abstract: This talk will start with an introduction and overview of
inference problems in the presence of measurement error. Two general
approaches for tackling measurement error problems, remeasurement methods
and corrected score methods, will be described and a connection between
the two methods will be examined. The latter part of the talk will focus
on some recent results on corrected scores for the case of replicate
measurements and heteroscedastic measurement errors.
Nonparametric Inference Under Constraints
Peter Hall
Australian National University
The greater part of contemporary nonparametric inference employs
methods that are linear in the data. The exceptions to this rule generally
involve estimators with empirically chosen tuning parameters; examples of
those parameters include the bandwidth in kernel-type estimation, and the
threshold in wavelet methods. Nevertheless, the estimator is still
"intrinsically" linear, not least because its first-order theoretical
properties are equivalent to those of a linear estimator. However,
estimators are often nonlinear in a substantial way if constraints are
imposed; examples of constraints include those based on order, such as
monotonicity or unimodality of a regression estimator or a density
estimator. The talk wil |