# Departmental Colloquia

## Spring 2017 Biostatistics Colloquia

### A Big Datum; A Statistical Role is to Wrangle It

*Scott Zeger, PhD*

*Johns Hopkins University*

Today, the American healthcare system wastes at least one trillion dollars or 6% of GDP or 140 Xerox Corporations every year. More than 125 years ago, American universities created the current academic medical model on a foundation of the emerging biological sciences. Today, because of the intertwined revolutions in biological and information technologies, biomedicine has become data intensive. Medical research is being driven by new biomedical measurements and analyses with the goal of providing, at lower cost, more accurate and precise answers to clinical questions including: what is an individual’s current health state; what is her health trajectory; and what are the likely benefits and harms associated with each available intervention.

This talk discusses a statistical perspective for pursuing scientific answers to clinical questions like the ones above and for reducing services that add little value. We use hierarchical models that describe the flow of information from the population level to the individual to inform health decisions whose outcomes in turn update population-level knowledge. We consider some of the roles statisticians can play to effectively mobilize modern biomedical and data science/engineering to improve clinical care of Americans with reduced waste.

Thursday, May 18, 2017

2:30 P.M.

Helen Wood Hall – Room 1W-502

### How You Can Use 70,000 RNA-seq Samples in Your Research

*Jeff Leek, PhD*

*Johns Hopkins University*

There are hundreds of thousands of sequencing samples available through public archives. In this talk I'll discuss a long-term collaboration focused on organizing, tidying, and labeling more than 70,000 human RNA-seq samples. I'll discuss some insights we've gained from analyzing all these data together and how you can use the data we've processed to improve analyses in your group.

Thursday, April 13, 2017

### Statistical Analysis of SMART Studies via Artificial Randomization

*Abdus Wahed, PhD*

*University of Pittsburgh*

Hypothesis testing to compare dynamic treatment regimes (DTR) from a sequential multiple assignment randomization trial (SMART) is generally based on inverse probability weighting or g-estimation. However, regression methods allowing for comparison of DTRs that flexibly adjust for baseline covariates using these methods are not as straightforward because one patient can belong to multiple DTRs. This poses a challenge for data analysts as it violates the basic regression modeling assumption of unique group membership. In this talk, we will propose an “artificial randomization” technique to make the data appear as if each subject belongs to a single DTR. This enables treatment strategy indicators to be inserted as covariates in a regression model and standard tests such as t- or F-tests to be applied. The properties of this method are investigated analytically and through simulation. We demonstrate the application of this method by applying it to a SMART study of cancer regimes. This is joint work with Semhar Ogbagaber, PhD of the Food and Drug Administration.

Thursday, April 6, 2017

### Identifying Critical Pregnancy Windows of Susceptibility to Ambient Air Pollution Exposure

*Joshua Warren, PhD*

*Yale University*

Exposure to ambient air pollution during pregnancy is associated with a number of adverse birth outcomes for the child. The majority of past statistical models investigating these associations incorporate exposures within a regression framework using averages based on different pre-specified periods of the pregnancy (e.g., trimesters). These models are typically fit separately for the different averaging periods. Recently, there is increasing interest in identifying more precise periods of vulnerability to environmental exposures, known as critical windows (CWs), within a single modeling framework. An improved understanding of the specific timing of exposure and outcome development could lead to improved mechanistic explanations for disease development as well as focused guidelines for protection of the unborn child. To this end, a number of statistical methods have recently been developed to estimate CWs of development and have been successfully applied to study adverse birth outcomes including preterm birth, low birthweight, and cardiac congenital anomalies. These models have identified important and particularly vulnerable daily and weekly CWs with respect to exposure to a number of ambient air pollutants. In this talk, I discuss the current framework for CW estimation as well as recent extensions and future directions.

Thursday, March 23, 2017

### Collaborative Targeted Learning using Regression Shrinkage

*Mireille Schnitzer, PhD*

*Universite de Montreal*

Causal inference practitioners are routinely presented with the challenge of wanting to adjust for large numbers of covariates despite limited sample sizes. Collaborative Targeted Maximum Likelihood Estimation (CTMLE) is a general framework for constructing doubly robust semiparametric causal estimators that data-adaptively reduce model complexity in the propensity score in order to optimize a preferred loss function. This stepwise complexity reduction is based on a loss function placed on a strategically updated model for the outcome variable, assessed through cross-validation. New work involves integrating penalized regression methods into a stepwise CTMLE procedure that may allow for a more flexible type of model selection than existing variable selection techniques. Two new algorithms are presented and methods to reduce computational complexity, based on previous work, are assessed.

Thursday, March 2, 2017

### Variable Selection in the Presence of Measurement Error

*Julian Wolfson, PhD*

*University of Minnesota*

Existing methods for performing variable selection mostly ignore the fact that covariates may be measured with error. In this talk, I will introduce MEBoost, a novel variable selection algorithm which follows a path defined by estimating equations that correct for covariate measurement error. MEBoost is simple to implement and much faster than competing methods which are based on computationally expensive matrix projections, allowing it to be applied to large-scale problems. Across a wide range of simulated scenarios, MEBoost outperforms the "naive" Lasso (which makes no correction for measurement error) as well as the recently-proposed Convex Conditioned Lasso. I will illustrate the use of MEBoost in practice by analyzing data from the Box Lunch Study, a clinical trial in nutrition where several variables are based on self-report and hence measured with error. This is joint work with my current PhD student, Ben Brown.

Thursday, February 16, 2017

### Bayesian Adaptive Trial Design Utilizing Multiple Efficacy Endpoints for Predictive Biomarker-Based Subgroups

*Lindsay Renfro, PhD*

*Mayo Clinic*

Randomized clinical trials are the cornerstone of evidence-based medicine and the gold standard for establishing causal relationships between new treatments and improved patient outcomes. However, as diseases like cancer are increasingly understood on a molecular level, clinical trials that are designed for too-general patient populations will often fail to reveal subpopulations where a therapy is more or less effective. Uncertainty regarding the best clinical endpoint(s) compounds the challenge of reaching early and precise conclusions regarding treatment benefit, particularly when different patient subgroups enrolled in a trial may respond to treatment via different mechanisms of action, thereby "washing out" a treatment effect measured from one endpoint and across all patients. We develop a novel randomized clinical trial design capable of detecting efficacy through multiple endpoints (e.g., binary and time-to-event) and in the presence of possible patient subpopulations (e.g., biomarker defined). This design allows patients from different subgroups to respond to treatment via different mechanisms of action, i.e., through different endpoints, and further allows for early stopping for efficacy or futility, either overall or within a marker-based subgroup. Using both simulations and patient-level data from a collection of clinical trials in advanced colorectal cancer, we derive the operating characteristics of our design framework and evaluate its practical performance compared to traditional group-sequential designs assuming a single primary endpoint and homogeneous patient population.

Thursday, February 9, 2017

## Fall 2016 Biostatistics Colloquia

### Regularity of a Renewal Process Estimated from Binary Data

*John Rice, PhD*

*University of Rochester*

Assessment of the regularity of a sequence of events over time is an important public health problem. However, there is no commonly accepted definition of “regular.” In our motivating example, a study of HIV self-testing behavior among men who have sex with men, the primary interest lies in determining the effect of an intervention on regularity of the self-testing events. However, we only observe the presence or absence of any testing events in a sequence of predefined time intervals, not exact event times. The goal of this work is to develop suitable methods for estimating the parameters associated with a gamma renewal process when only binary summaries of the process are observed and baseline covariates are available. To facilitate measurement of regularity, we propose two approaches to estimation and inference: first, an efficient likelihood-based method in the case where only the data on the first interval containing at least one testing event is utilized; and second, a quasi-likelihood approach that uses all of the available data on each subject. We conduct simulation studies to evaluate performance of the proposed methods and apply them to our motivating example, concluding that the use of text message reminders significantly improves the regularity of self-testing. A discussion on interesting directions for further research is provided.

Thursday, December 15, 2016
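
The observation scheme in the abstract above (a gamma renewal process seen only through binary per-interval indicators) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the speaker's method: the function name `binary_summaries`, its parameters, and the choice to draw the first event time from the same gamma distribution as the later gaps are all hypothetical. It only generates data of the kind described and does no estimation.

```python
import random


def binary_summaries(shape, mean_gap, interval_length, n_intervals, rng):
    """Simulate one subject's gamma renewal process and return 0/1 indicators.

    Gaps between events are Gamma(shape, scale) with mean `mean_gap`.
    The returned list has one entry per interval: 1 if at least one event
    fell in that interval, 0 otherwise. Exact event times are discarded,
    mirroring the observation scheme described in the abstract.
    Assumption (hypothetical): the first event time follows the same
    gamma distribution as the later gaps.
    """
    scale = mean_gap / shape  # gamma mean = shape * scale
    horizon = interval_length * n_intervals
    indicators = [0] * n_intervals
    t = rng.gammavariate(shape, scale)
    while t < horizon:
        # Guard the index against floating-point edge cases near the horizon.
        index = min(int(t // interval_length), n_intervals - 1)
        indicators[index] = 1
        t += rng.gammavariate(shape, scale)
    return indicators


if __name__ == "__main__":
    rng = random.Random(2016)
    # A larger shape gives less variable gaps (coefficient of variation
    # 1 / sqrt(shape)), which is one way to think about "regularity".
    print(binary_summaries(shape=1.0, mean_gap=30.0, interval_length=30.0,
                           n_intervals=12, rng=rng))
    print(binary_summaries(shape=16.0, mean_gap=30.0, interval_length=30.0,
                           n_intervals=12, rng=rng))
```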

### Bioconductor for Analysis and Comprehension of Genomic Data

*Martin Morgan, PhD*

*Roswell Park Cancer Institute*

Bioconductor is a well-established, widely used project for the statistical analysis and comprehension of high-throughput genomic data. Major areas of application include sequence-based assays (e.g., RNAseq, ChIPseq, called variants), microarrays (e.g., methylation, expression, copy number), flow cytometry, and proteomics. This talk introduces the project, including overall goals, common statistical approaches, and computational challenges. Emphasis is on reduction of individual raw data sets to statistically and biologically comprehensible insights, and integration of these insights across data sets. This process exposes shared as well as idiosyncratic issues in the development and application of appropriate statistical methodology. Technological advances, especially single-cell assays and the application of multiple assays to the same samples, confront researchers with important statistical and computational challenges.

Thursday, December 1, 2016

### Multiple Imputation Using the Weighted Finite Population Bayesian Bootstrap

*Michael Elliott, PhD*

*University of Michigan*

Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for design features when generating datasets for multiple imputation is a non-trivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that fail to have correct nominal coverage. Here we extend a weighted finite population Bayesian bootstrap procedure (Dong et al. 2014) to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where probabilities of selection are known at the first and second stage of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We also consider extensions to clustered and/or stratified sample designs. We show via simulation study that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency even when compared with correct fully parametric models. We illustrate the method with examples from the Behavioral Risk Factor Surveillance System and the National Health and Examination Survey.

Thursday, November 17, 2016

### Penalized Maximum Likelihood Estimation of Multi-layered Gaussian Graphical Models

*Sumanta Basu, PhD*

*Cornell University*

Analyzing multi-layered graphical models provides insight into understanding the conditional relationships among nodes within layers after adjusting for and quantifying the effects of nodes from other layers. We obtain the penalized maximum likelihood estimator for Gaussian multilayered graphical models, based on a computational approach involving screening of variables, iterative estimation of the directed edges between layers and undirected edges within layers and a final refitting and stability selection step that provides improved performance in finite sample settings. We establish the consistency of the estimator in a high-dimensional setting. To obtain this result, we develop a strategy that leverages the biconvexity of the likelihood function to ensure convergence of the developed iterative algorithm to a stationary point, as well as careful uniform error control of the estimates over iterations. The performance of the maximum likelihood estimator is illustrated on synthetic data.

Thursday, November 3, 2016

### Confounding in Imaging-based Predictive Modeling

*Kristin Linn, PhD*

*University of Pennsylvania - Perelman School of Medicine*

The multivariate pattern analysis (MVPA) of neuroimaging data typically consists of one or more statistical learning models applied within a broader image analysis pipeline. The goal of MVPA is often to learn about patterns of variation encoded in magnetic resonance images (MRI) of the brain that are associated with brain disease incidence, progression, and response to therapy. Every model choice that is made during image processing and analysis can have implications with respect to the results of neuroimaging studies. Here, attention is given to two important steps within the MVPA framework: 1) the standardization of features prior to training a supervised learning model, and 2) the training of learning models in the presence of confounding. Specific examples focus on the use of the support vector machine, as it is a common model choice for MVPA, but the general concepts apply to a large set of models employed in the field. We propose novel methods that lead to improved classifier performance and interpretability, and we illustrate the methods on real neuroimaging data from a study of Alzheimer’s disease.

Thursday, October 27, 2016

### Computational Methods for Neuroimaging in R, an Example in Hemorrhagic Stroke Hemorrhage

*John Muschelli, PhD*

*Johns Hopkins - Bloomberg School of Public Health*

Intracranial hemorrhage (ICH), or hemorrhagic stroke, is a potentially lethal condition in which a blood vessel ruptures in the brain. Currently, the location of the hemorrhage is described manually and qualitatively. I will present a full pipeline to describe the location of hemorrhage quantitatively using X-ray computed tomography (CT) scans. As many pieces of software were used to preprocess and analyze the data, I will present the Neuroconductor project, an attempt to integrate commonly used neuroimaging software packages into R. This integration and additional tutorials will hopefully lead more statisticians and R users to perform full analyses of neuroimaging data.

Thursday, October 13, 2016

### Bayesian Kernel Machine Regression for Estimating the Health Effects of Multi-Pollutant Mixtures

*Jennifer F. Bobb, PhD
Group Health Research Institute*

Studying the health effects of mixtures of environmental stressors is an important problem with applications ranging from estimating how simultaneous exposure to multiple air pollutants impacts mortality, to investigating how metal mixtures jointly affect cognitive function. However, most epidemiological studies tend to focus on single agents at a time, because we currently lack statistical methods to more realistically capture the complexity of true exposure. In this talk, we describe a new approach for estimating the health effects of mixtures, Bayesian kernel machine regression (BKMR). This approach simultaneously estimates the (potentially high-dimensional) exposure-response function and incorporates variable selection to identify important mixture components. An R package that flexibly implements the methods and provides features for summarizing the multivariate exposure-response surface is also described. We then introduce a newly developed extension of BKMR to the setting of exposure to time-varying, multi-pollutant mixtures. Simulation studies demonstrate the performance of the methods under realistic exposure-response scenarios, and application of the methods to environmental health studies highlights their potential to lead to new scientific insights.

Thursday, September 29, 2016

## Spring 2016 Biostatistics Colloquia

### The Parametric t-test’s Latent Weakness

*Daniel P. Gaile, PhD
State University of New York at Buffalo*

When a latent class structure is present, parametric t-tests conducted on the observed continuous variable can be anti-conservative. This problem is exacerbated by: A) test multiplicity across large numbers of manifest assays, each with a latent structure, and B) increased accuracy of the manifest assays to discriminate underlying latent structures. While it is not surprising that violations of the parametric t-test's underlying assumptions can impact its performance, we demonstrate that latent state conditions can lead to profound overstatements of statistical significance and profound loss of error control. For example, we provide a motivating 'toy' data-set for which the parametric t-test quantifies the evidence against the null hypothesis as approximately 12.5 million to 1 when it should be quantified as approximately 250 to 1. This result is relevant in many modern experimental settings, such as pilot array / next-generation sequencing studies, where an underlying latent structure is either known to be true (e.g., methylation and array comparative genomic hybridization) or plausible (e.g., down/up-regulated gene networks). Our findings are also applicable to small animal studies (e.g., mouse and rat studies), for which latent state biological mechanisms are often plausible and the parametric t-test is often applied. Time permitting, we will briefly discuss the design and analysis of pilot studies to compare the "Capability of Detection" of two different biomarker platforms. We demonstrate that, in small sample settings, commonly employed estimators for "Limit of Blank" cut-offs do not control the false positive rates as intended and we provide a corrected estimator. We also demonstrate that a naive analysis of such data can lead to a loss of type I error control. We relate this, and our previous result, to the general problem of recognizing conditional tests and their complications.

Thursday, April 21, 2016

### Finite Sample Post-Model Selection Inference

*Anand N. Vidyashankar, PhD
George Mason University*

In several instances of statistical practice it is common, when using parametric or semi-parametric models, that a preliminary test of a subset of parameters is made and, based on the results of the pre-test, an appropriate model is chosen. However, when evaluating the resulting inferential procedure, the variability induced by the model selection needs to be taken into account. In this presentation, we describe two distinct approaches to data analyses in these contexts. We evaluate both these approaches, using recently developed concentration inequalities, in finite samples as opposed to the traditional large sample investigations. We also describe new central limit theorems facilitating comparisons between the methods.

Thursday, April 7, 2016

### Statistical Methods for Profiling the Functional Evolution of Tumor Cell Subpopulations in Response to Chemotherapy

*W. Evan Johnson, PhD
Boston University School of Medicine*

Cancer is a heterogeneous disease, as a typical tumor contains multiple evolutionarily related subpopulations of cells with a different complement of somatically acquired mutations, microenvironment, and functional characteristics. When chemotherapeutic agents are administered to the patient, some of these subpopulations may gain a selective advantage and develop resistance to the treatment, resulting in cancer relapse. In this study, we use multiple ‘-omic’ profiling data types to provide a multi-dimensional, longitudinal ‘window’ into a patient’s tumor biology and selective response to treatment. We present novel approaches for standardizing and integrating heterogeneous data produced by different labs, protocols, or profiling platforms. In addition, we present robust Bayesian factor analysis and structural equation models for the simultaneous profiling of functional oncogenic pathways and for the adaptation of pathway signatures into specific disease contexts. We will discuss appropriate future extensions to our models that include integrated longitudinal profiling, accommodations for multiple cancer subpopulations, and adaptations for single cell profiling. We will demonstrate our methods and results on metastatic breast cancer samples and illustrate the potential implications for precision therapeutics in this context.

Thursday, March 17, 2016

### Competing Risks Predictions on Two Time Scales

*Jason Fine, ScD
Department of Biostatistics, Department of Statistics
University of North Carolina - Chapel Hill*

In the standard analysis of competing risks data, proportional hazards models are fit to the cause-specific hazard functions for all causes on the same time scale. These regression analyses are the foundation for predictions of cause-specific cumulative incidence functions based on combining the estimated cause-specific hazard functions. However, in predictions arising from disease registries, where only subjects with disease enter the database, disease-related mortality may be more naturally modelled on the time-since-diagnosis time scale while death from other causes may be more naturally modelled on the age time scale. The single time scale methodology may be biased if an incorrect time scale is employed for one of the causes and alternative methodology is not available. We propose inferences for the cumulative incidence function in which regression models for the cause-specific hazard functions may be specified on different time scales. Using the disease registry data, the analysis of other-cause mortality on the age scale requires left truncating the event time at the age of disease diagnosis, complicating the analysis. In addition, standard martingale theory is not applicable when combining regression models on different time scales. We establish that the covariate conditional predictions are consistent and asymptotically normal using empirical process techniques and propose consistent variance estimators which may be used to construct confidence intervals. Simulation studies show that the proposed two time scale methods perform well, outperforming the single time scale predictions when the time scale is misspecified. The methods are illustrated with stage III colon cancer data obtained from the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute.

Thursday, February 11, 2016

### Nonparametric Functional Divergence-Based Flow Cytometric Classifiers

*Ollivier Hyrien, PhD
Department of Biostatistics and Computational Biology
University of Rochester*

Flow cytometry is routinely used in clinical settings for disease diagnosis and prognosis. The construction of supervised classifiers in this context often begins by preliminarily extracting candidate features from the data from which a decision rule is subsequently constructed. We propose a different approach to flow cytometric classification in which we perform class assignment by summarizing the evidence provided by the data that the subject belongs to each class by means of divergences. The proposed approach makes no distributional assumptions about phenotypes. It automatically integrates predictive patterns in the classifier, eliminating the construction and selection of candidate features from the process. The finite sample performances of the approach are studied in simulations. An application to real data is also presented in which we evaluate a panel of leukemia stem cell markers for detecting aberrancy in blood samples of patients with acute myeloid leukemia.

Thursday, January 14, 2016

## Fall 2015 Biostatistics Colloquia

### Matrix Completion Discriminant Analysis

*Tongtong Wu, PhD
Department of Biostatistics and Computational Biology
University of Rochester*

Matrix completion discriminant analysis (MCDA) is designed for semi-supervised learning where the rate of missingness is high and predictors vastly outnumber cases. MCDA operates by mapping class labels to the vertices of a regular simplex. With $k$ classes, these vertices are arranged on the surface of the unit sphere in $k-1$ dimensional Euclidean space. To assign unlabeled cases to classes, the data is entered into a large matrix that is augmented by vertex coordinates stored in the last $k-1$ columns. Once the matrix is constructed, its missing entries can be filled in by matrix completion. To carry out matrix completion, one minimizes a sum of squares plus a nuclear norm penalty. The simplest solution invokes an MM algorithm and singular value decomposition. A new MM Proximal Distance algorithm is also introduced for fast computation. Once the matrix is completed, an unlabeled case is assigned to the class vertex closest to the point deposited in its last $k-1$ columns. Additionally, we investigate the case where the predictor variables are contaminated. A variety of examples drawn from the statistical literature demonstrate that MCDA is competitive on traditional problems and outperforms alternatives on large-scale problems.

Wednesday, December 16, 2015
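
The label encoding and the final assignment step described in the MCDA abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speaker's implementation: the function names `simplex_vertices` and `nearest_vertex` are hypothetical, the vertex formula is one standard construction of a regular simplex on the unit sphere, and the matrix completion step itself (nuclear norm penalty, MM algorithm, singular value decomposition) is deliberately left out.

```python
import math


def simplex_vertices(k):
    """Return k unit vectors in (k-1)-dimensional space forming a regular simplex.

    One standard construction (an assumption here, not taken from the talk):
    the first vertex is (k-1)^(-1/2) * (1, ..., 1); each remaining vertex is
    c * (1, ..., 1) + d * e_j, with c and d chosen so that every vertex has
    unit length and all pairwise distances are equal.
    """
    if k < 2:
        raise ValueError("need at least two classes")
    dim = k - 1
    c = -(1.0 + math.sqrt(k)) / dim ** 1.5
    d = math.sqrt(k / dim)
    vertices = [[1.0 / math.sqrt(dim)] * dim]
    for j in range(dim):
        vertex = [c] * dim
        vertex[j] += d
        vertices.append(vertex)
    return vertices


def nearest_vertex(point, vertices):
    """Index of the class vertex closest (in Euclidean distance) to `point`.

    In MCDA terms, `point` stands for the completed last k-1 entries of an
    unlabeled case's row.
    """
    return min(range(len(vertices)),
               key=lambda i: math.dist(point, vertices[i]))


if __name__ == "__main__":
    verts = simplex_vertices(3)
    for v in verts:
        print([round(x, 4) for x in v])
    # A point near the second vertex is assigned to class index 1.
    print(nearest_vertex([0.2, -0.9], verts))
```

Because the vertices are equidistant, no class is favoured by the geometry of the encoding; the assignment depends only on where the completed coordinates land.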

### Optimal Tests of Treatment Effects for the Overall Population and Two Subpopulations in Randomized Trials, using Sparse Linear Programming

*Michael Rosenblum, PhD
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health*

We propose new, optimal methods for analyzing randomized trials, when it is suspected that treatment effects may differ in two predefined subpopulations. Such subpopulations could be defined by a biomarker or risk factor measured at baseline. The goal is to simultaneously learn which subpopulations benefit from an experimental treatment, while providing strong control of the familywise Type I error rate. We formalize this as a multiple testing problem and show it is computationally infeasible to solve using existing techniques. Our solution involves first transforming the original multiple testing problem into a large, sparse linear program. We then solve this problem using advanced optimization techniques. This general method can solve a variety of multiple testing problems and decision theory problems related to optimal trial design, for which no solution was previously available. Specifically, we construct new multiple testing procedures that satisfy minimax and Bayes optimality criteria. For a given optimality criterion, our approach yields the optimal tradeoff between power to detect an effect in the overall population versus power to detect effects in subpopulations. We give examples where this tradeoff is a favorable one, in that improvements in power to detect subpopulation treatment effects are possible at relatively little cost in additional sample size. We demonstrate our approach in examples motivated by two randomized trials of new treatments for HIV. This work has been accepted for publication in the Journal of the American Statistical Association (Theory and Methods). This and related papers are available here: https://mrosenblumbiostat.wordpress.com/learners.

Thursday, December 3, 2015

### Statistics over Computations: Inference, Structure, and Human Thought

*Steven T. Piantadosi, PhD
Department of Brain and Cognitive Sciences
University of Rochester*

An overview of research in cognitive science that is aimed at understanding how human learners solve complex inductive problems. Recent theories of human learning have developed techniques to combine statistical inference with hypothesis spaces that are capable of expressing rich computational and algorithmic processes. Human learners are modeled as idealized statistical inferrers of these unobserved computations. This approach shows promise in explaining human behavior across a variety of domains, including language learning and conceptual development. It also allows the field to address even more basic questions about what types of knowledge might be "built in" for humans, and how children move beyond this initial state to develop the rich systems of knowledge found in adults. More broadly, the ideas used in this work have the potential to inform the discovery of structure, algorithmic processes, scientific laws, and causal relations in other data sets by using techniques inspired by the remarkable statistical inferences carried out by human learners.

Thursday, November 12, 2015

### How Small Clinical Trials Can Have Big Vision: Beyond Sample Size

*Steven Piantadosi, MD, PhD
Samuel Oschin Comprehensive Cancer Institute
Cedars-Sinai Medical Center*

Thursday, October 29, 2015

### Using Integrative Networks to Identify Disease Mechanisms

*Kimberly Glass, PhD
Harvard Medical School
Channing Division of Network Medicine
Department of Medicine, Brigham and Women’s Hospital*

Rapidly evolving genomic technologies are providing unprecedented opportunities to develop a more unified understanding of the processes driving disease. We now recognize that in many cases a single gene or pathway cannot fully characterize a biological system. Instead, disease states are often better represented by shifts in the underlying cellular regulatory network. PANDA (Passing Attributes between Networks for Data Assimilation) is a network inference method that uses “message passing” to integrate multiple sources of genomic data. PANDA begins with a map of potential regulatory “edges” and uses phenotype-specific expression and other data types to maximize the flow of information through the transcriptional network. We have applied PANDA to reconstruct subtype-specific regulatory networks in ovarian cancer and sex-specific networks in Chronic Obstructive Pulmonary Disease (COPD). By comparing networks between phenotypic states, we have identified potential therapeutic interventions that would not have been discovered by comparing gene-state information in isolation from the regulatory context. Finally, as the data available on individual samples increase, we are now working to extend PANDA to better integrate additional complementary data types and developing methods to infer the most likely regulatory network for individual patients.

Thursday, October 15, 2015

### Nonparametric Estimation of Spatial Functions on Regions with Irregular Boundaries

*Julie McIntyre, PhD
Department of Mathematics and Statistics
University of Alaska Fairbanks*

The estimation of a density or regression function from geospatial data is a common goal in many studies. For example, ecologists use such data to construct maps of the locations or characteristics of plants and animals. Popular methods for estimating these spatial functions include nonparametric kernel density and regression estimation as well as parametric spatial regression (e.g. kriging). However, these estimators ignore irregular boundaries and holes in a region, leading to biases. We present an alternative estimator that accounts for boundaries and holes in the estimation process. The estimator is a type of kernel estimator that is based on the density of random walks of length $k$ on a lattice constrained to stay within the region's boundaries. Unbiased cross validation is used to find the optimal walk length $k$. Simulations show that the estimator is superior to other estimators in the presence of boundaries, and comparable in the absence of boundaries. Several examples illustrate the method.
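
The idea of a random-walk kernel that respects boundaries can be sketched in a few lines. This is a minimal illustration, not the speaker's estimator: the uniform move rule (stay or step to one of the four neighbours), the helper names `walk_matrix` and `walk_density`, and the toy region and data points are all assumptions, and the cross-validated choice of $k$ is not implemented.

```python
import numpy as np

def walk_matrix(mask):
    """One-step transition matrix of a random walk confined to cells where mask is True."""
    cells = [(i, j) for i in range(mask.shape[0])
             for j in range(mask.shape[1]) if mask[i, j]]
    index = {c: n for n, c in enumerate(cells)}
    P = np.zeros((len(cells), len(cells)))
    for (i, j), n in index.items():
        # Candidate moves: stay put or step to one of the four neighbours.
        moves = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        # Moves that would leave the region (or enter a hole) are not allowed.
        allowed = [m for m in moves if m in index]
        for m in allowed:
            P[n, index[m]] = 1.0 / len(allowed)
    return index, P

def walk_density(mask, points, k):
    """Average the k-step walk distributions started from each observed point."""
    index, P = walk_matrix(mask)
    start = np.zeros(len(index))
    for p in points:
        start[index[p]] += 1.0 / len(points)
    dens = start @ np.linalg.matrix_power(P, k)
    out = np.zeros(mask.shape)
    for c, n in index.items():
        out[c] = dens[n]
    return out

# Hypothetical 5x5 region with an interior barrier (a "hole").
mask = np.ones((5, 5), dtype=bool)
mask[2, 1:4] = False
points = [(0, 2), (1, 2), (0, 1)]  # invented observation locations
est = walk_density(mask, points, k=3)
```

Because the walk never enters excluded cells, the estimate places no mass in the hole, and mass reaches the far side of the barrier only by going around it.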

Thursday, October 1, 2015

## Spring 2015 Biostatistics Colloquia

### Joint Modeling Approaches for Longitudinal Studies

*Cheng Yong Tang, PhD
Department of Statistics
Fox School of Business
Temple University*

In longitudinal studies, it is fundamentally important to understand the dynamics in the mean function, variance function, and correlations of the repeated or clustered measurements. We will discuss new joint mean-variance-correlation regression approaches for modeling continuous and discrete repeated measurements from longitudinal studies. By applying hyperspherical coordinates, we obtain an unconstrained interpretable parametrization of the correlation matrix. We then propose regression approaches to model the correlation matrix of the longitudinal measurements by exploiting the unconstrained parametrization. The proposed modeling framework is parsimonious, interpretable, flexible, and it automatically guarantees the resulting correlation matrix to be non-negative definite. Data examples and simulations support the effectiveness of the proposed approaches. This talk is based on joint works with Weiping Zhang and Chenlei Leng.
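
A hyperspherical parametrization of a correlation matrix can be sketched as follows. The abstract does not give formulas, so this uses one standard construction as an assumption: angles in $(0, \pi)$ fill a lower-triangular factor whose rows have unit length, and the correlation matrix is that factor times its transpose. The function name `corr_from_angles` and the random angles are illustrative only.

```python
import numpy as np

def corr_from_angles(theta):
    """Map angles (strictly lower triangle of theta, each in (0, pi)) to a correlation matrix."""
    d = theta.shape[0]
    L = np.zeros((d, d))
    for i in range(d):
        sin_prod = 1.0  # running product of sines along row i
        for j in range(i):
            L[i, j] = np.cos(theta[i, j]) * sin_prod
            sin_prod *= np.sin(theta[i, j])
        L[i, i] = sin_prod  # makes every row of L have unit Euclidean norm
    return L @ L.T

# Any angles in (0, pi) yield a valid correlation matrix.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, np.pi, size=(4, 4))
R = corr_from_angles(theta)
```

Since the angles are unconstrained within their interval, they can be linked to covariates without any risk of producing an invalid matrix, which is the property the abstract highlights.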

Thursday, April 9, 2015

### Real-Time Prediction in Clinical Trials: A Statistical History of REMATCH

*Daniel Francis Heitjan, PhD
Department of Statistical Science
Southern Methodist University*

Randomized clinical trial designs often incorporate one or more planned interim analyses. In event-based trials, one may prefer to schedule the interim analyses at the times of occurrence of specified landmark events, such as the 100th event, the 200th event, and so on. Because an interim analysis can impose a considerable logistical burden, and the timing of the triggering event in this kind of study is itself a random variable, it is natural to seek to predict the times of future landmark events as accurately as possible. Early approaches to prediction used data only from previous trials, which are of questionable value when, as commonly occurs, enrollment and event rates differ unpredictably across studies. With contemporary clinical trial management systems, however, one can populate trial databases essentially instantaneously. This makes it possible to create predictions from the trial data itself — predictions that are as likely as any to be reliable and well calibrated statistically. This talk will describe work that some colleagues and I have done in this area. I will set the methodologic development in the context of the study that motivated our research: REMATCH, an RCT of a heart assist device that ran from 1998 to 2001 and is considered a landmark of rigor in the device industry.

Thursday, March 26, 2015

### Causal and statistical inference with social network data: Massive challenges and meager progress

*Elizabeth L. Ogburn, PhD
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health*

Interest in and availability of social network data has led to increasing attempts to make causal and statistical inferences using data collected from subjects linked by social network ties. But inference about all kinds of estimands, from simple sample means to complicated causal peer effects, is challenging when only a single network of non-independent observations is available. There is a dearth of principled methods for dealing with the dependence that such observations can manifest. We demonstrate the dangerously anticonservative inference that can result from a failure to account for network dependence, explain why results on spatial-temporal dependence are not immediately applicable to this new setting, and describe a few different avenues towards valid statistical and causal inference using social network data.

Thursday, February 12, 2015

### Bayesian models for multiple outcomes in domains with application to the Seychelles Child Development Study

*Sally Thurston, PhD
Department of Biostatistics and Computational Biology
University of Rochester*

The Seychelles Child Development Study (SCDS) examines the effects of prenatal exposure to methylmercury on the functioning of the central nervous system. The SCDS data include 20 outcomes measured on 9-year-old children that can be classified broadly in four outcome classes or "domains": cognition, memory, motor, and social behavior. We first consider the problem of estimating the effect of exposure on multiple outcomes in a single model when each outcome belongs to one domain and domain memberships are known. We then extend this to the situation in which the outcomes may belong to more than one domain and where we also want to learn about the assignment of outcomes to domains. In this talk I will discuss the Seychelles data, motivate and give results of the models with the aid of pictures, and very briefly compare these models to a structural equation model.

Thursday, January 22, 2015

## Fall 2014 Biostatistics Colloquia

### Clustering Tree-structured Data on Manifold

*Hongyu Miao, PhD
Department of Biostatistics and Computational Biology
University of Rochester*

Tree-structured data usually contain both topological and geometrical information, and are necessarily considered on a manifold instead of Euclidean space for appropriate data parameterization and analysis. In this study, we propose a novel tree-structured data parameterization, called the Topology-Attribute matrix (T-A matrix), so the data clustering task can be conducted on a matrix manifold. We incorporate the structure constraints embedded in the data into the non-negative matrix factorization method to determine meta-trees from the T-A matrix, and the signature vector of each single tree can then be extracted by meta-tree decomposition. The meta-tree space turns out to be a cone space, in which we explore the distance metric and implement the clustering algorithm based on concepts such as the Fréchet mean. Finally, the T-A matrix based clustering (TAMBAC) framework is evaluated and compared using both simulated data and real retinal images to illustrate its efficiency and accuracy.

Thursday, December 4, 2014

### Walking, sliding, and detaching: time series analysis for kinesin

*John Fricks, PhD
Department of Statistics
Pennsylvania State University*

Kinesin is a molecular motor that, along with dynein, moves cargo such as organelles and vesicles along microtubules through axons. Studying these transport processes is vital, since non-functioning kinesin has been implicated in a number of neurodegenerative diseases, such as Alzheimer’s disease. Over the last twenty years, these motors have been extensively studied through in vitro experiments of single molecular motors using laser traps and fluorescence techniques. However, an open challenge has been to explain in vivo behavior of these systems when incorporating the data from in vitro experiments into straightforward models.

In this talk, I will discuss recent work with experimental collaborator, Will Hancock (Penn State), to understand more subtle behavior of a single kinesin than has previously been studied, such as sliding and detachment and how such behavior can contribute to our understanding of in vivo transport. Data from these experiments include time series taken from fluorescence experiments for kinesin. In particular, we will use novel applications of switching time series models to explain the shifts between different modes of transport.

Thursday, November 13, 2014

### Estimating Physical Activity with an Accelerometer

*John Staudenmayer, PhD
Department of Mathematics and Statistics
University of Massachusetts, Amherst*

Measurement of physical activity (PA) in a free-living setting is essential for several purposes: understanding why some people are more active than others, evaluating the effectiveness of interventions designed to increase PA, performing PA surveillance, and quantifying the relationship between PA dose and health outcomes. One way to estimate PA is to use an accelerometer (a small electronic device that keeps a time-stamped record of acceleration) and a statistical model that predicts aspects of PA (such as energy expenditure, time walking, time sitting, etc.) from the acceleration signals. This talk will describe methods to do this. We will present several calibration studies where acceleration is measured concurrently with objective measurements of PA, describe the statistical models used to relate the two sets of measurements, and examine independent evaluations of the methods.

Thursday, October 30, 2014

### The Importance of Simple Tests in High-Throughput Biology: Case Studies in Forensic Bioinformatics

*Keith A. Baggerly, PhD
Department of Bioinformatics and Computational Biology*

*University of Texas MD Anderson Cancer Center*

Modern high-throughput biological assays let us ask detailed questions about how diseases operate, and promise to let us personalize therapy. Careful data processing is essential, because our intuition about what the answers “should” look like is very poor when we have to juggle thousands of things at once. When documentation of such processing is absent, we must apply “forensic bioinformatics” to work from the raw data and reported results to infer what the methods must have been. We will present several case studies where simple errors may have put patients at risk. This work has been covered in both the scientific and lay press, and has prompted several journals to revisit the types of information that must accompany publications. We discuss steps we take to avoid such errors, and lessons that can be applied to large data sets more broadly.

Thursday, October 16, 2014

### Transforming Antibiotic Stewardship Using Response Adjusted for Duration of Antibiotic Risk (RADAR): Using Endpoints to Analyze Patients Rather than Patients to Analyze Endpoints

*Scott R. Evans, PhD, MS
Department of Biostatistics
Harvard School of Public Health*

Unnecessary antibiotic (AB) use is unsafe, wasteful, and leads to emergence of AB resistance. AB stewardship trials are limited by noninferiority (NI) design complexities (e.g., NI margin selection, constancy assumption validity), competing risks of mortality and other outcomes, lack of patient-level interpretation incorporating benefits and harms, and feasibility issues (large sample sizes).

Response Adjusted for Duration of Antibiotic Risk (RADAR) is a novel methodology for stewardship trial design that effectively addresses these challenges. RADAR utilizes a superiority design framework evaluating if new strategies are better (in totality) than current strategies. RADAR has 2 steps: (1) creation of an ordinal overall clinical outcome variable incorporating important patient benefits, harms, and quality of life, and (2) construction of a desirability of outcome ranking (DOOR) where (i) patients with better clinical outcomes receive higher ranks than patients with worse outcomes, and (ii) patients with similar clinical outcomes are ranked by AB exposure, with lower exposure achieving higher ranks. The sample size (N) is based on a superiority test of ranks.
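
The two-step ranking can be illustrated with a small sketch. It assumes each patient is summarized by a pair (ordinal outcome category, days of antibiotic exposure), where a lower category means a better outcome; the function name `door_probability` and the patient data are invented for illustration, and no formal test of ranks is performed.

```python
def door_probability(new, current):
    """Estimate P(new-strategy patient has the more desirable DOOR) + 0.5 * P(tie).

    Each patient is a tuple (outcome_category, antibiotic_days).
    Lower category = better clinical outcome; fewer days = more desirable.
    """
    wins = ties = 0
    for a in new:
        for b in current:
            # Tuples compare by outcome first, then by antibiotic days,
            # which matches the DOOR rule of breaking outcome ties by exposure.
            if a < b:
                wins += 1
            elif a == b:
                ties += 1
    return (wins + 0.5 * ties) / (len(new) * len(current))

# Hypothetical patients: (outcome category, antibiotic days).
new_arm = [(1, 5), (1, 7), (2, 5), (3, 10)]
current_arm = [(1, 10), (2, 10), (2, 12), (3, 14)]
p = door_probability(new_arm, current_arm)
```

A value of 0.5 corresponds to no difference between strategies; values above 0.5 favour the new strategy.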

Conclusions: RADAR will transform and enhance clinical trials in antibacterial stewardship. RADAR avoids the complexities associated with NI trials (resulting in reduced sample size in many cases), alleviates competing risk problems, provides more informative benefit:risk evaluation, and allows for patient-level interpretation. Researchers should consider using endpoints to analyze patients rather than patients to analyze endpoints in clinical trials.

Thursday, October 2, 2014

## Spring 2014 Biostatistics Colloquia

### Teaching Statistics for the Future, the MOOC Revolution and Beyond

*Brian S. Caffo, PhD
Department of Biostatistics
Johns Hopkins University*

Massive open online classes (MOOCs) have become an international phenomenon where millions of students are accessing free educational materials from top universities. MOOC startups, such as Coursera, EdX and Udacity, are growing at an astounding rate. At the same time, industry predictions suggest a massive deficit in STEM, and particularly statistics, graduates to meet the growing demand for statistics, machine learning, data analysis, data science and big data expertise. The Johns Hopkins Department of Biostatistics has been a leading content generator for MOOCs, with over half a million students enrolled in just under a year and more MOOC courses than most universities. The new MOOC Data Science series (www.coursera.org/specialization/jhudatascience/1) created by Brian Caffo, Jeff Leek and Roger Peng is a novel concept featuring a complete redesign of the standard statistical master’s program. Notably, it features a completely open source educational model. This talk discusses MOOCs, development technology, financial models, and the future of statistics education. We end with a discussion of post-MOOC technology via a novel intelligent tutoring system called SWIRL (http://swirlstats.com/).

Thursday, May 15, 2014

### Linear differential equations: Their statistical history, their roles in dynamic systems, and some new tools

*Jim Ramsay, PhD
Department of Psychology
McGill University*

Most of the statistical literature on dynamic systems deals with a single time series with equally-spaced observations, focuses on its internal linear structure, and has little to say about causal factors and covariates. Stochastic time series and dynamic systems have adopted a “data emulation” approach to modelling that is foreign to most of statistical science. We need a more versatile approach that focusses on input/output systems unconstrained by how the data happen to be distributed.

A parameter estimation framework for fitting linear differential equations to data promises to extend and strengthen classical time series analysis in several directions. Its capacity to model forcing functions makes it especially suitable for input/output systems, including spread of disease models. A variety of examples will serve as illustrations.

Thursday, April 10, 2014

### Convex Banding of the Covariance Matrix

*Jacob Bien, PhD
Department of Statistical Science
Cornell University*

We introduce a sparse and positive definite estimator of the covariance matrix designed for high-dimensional situations in which the variables have a known ordering. Our estimator is the solution to a convex optimization problem that involves a hierarchical group lasso penalty. We show how it can be efficiently computed, compare it to other methods such as tapering by a fixed matrix, and develop several theoretical results that demonstrate its strong statistical properties. Finally, we show how using convex banding can improve the performance of high-dimensional procedures such as linear and quadratic discriminant analysis.

Thursday, March 27, 2014

### Statistical techniques for the normalization and segmentation of structural MRI

*Russell Taki Shinohara, PhD
Perelman School of Medicine
University of Pennsylvania*

While computed tomography and other imaging techniques are measured in absolute units with physical meaning, magnetic resonance images are expressed in arbitrary units that are difficult to interpret and differ between study visits and subjects. Much work in the image processing literature has centered on histogram matching and other histogram mapping techniques, but little focus has been on normalizing images to have biologically interpretable units. We explore this key goal for statistical analysis and the impact of normalization on cross-sectional and longitudinal segmentation of pathology.

Thursday, March 6, 2014

### Estimating the average treatment effect on mean survival time when treatment is time-dependent and censoring is dependent

*Douglas E. Schaubel, PhD
Department of Biostatistics
University of Michigan*

We propose methods for estimating the average difference in restricted mean survival time attributable to a time-dependent treatment. In the data structure of interest, the time until treatment is received and the pre-treatment death hazard are both heavily influenced by a longitudinal process. In addition, subjects may experience periods of treatment ineligibility. The pre-treatment death hazard is modeled using inverse weighted partly conditional methods, while the post-treatment hazard is handled through Cox regression. Subject-specific differences in pre- versus post-treatment survival are estimated, then averaged in order to estimate the average treatment effect among the treated. Asymptotic properties of the proposed estimators are derived and evaluated in finite samples through simulation. The proposed methods are applied to liver failure data obtained from a national organ transplant registry. This is joint work with Qi Gong.

Thursday, January 23, 2014

## Fall 2013 Biostatistics Colloquia

### Statistical Learning for Complex Data: Targeted Local Classification and Logic Rules

*Yuanjia Wang, PhD
Department of Biostatistics
Columbia University
Mailman School of Public Health*

We discuss two statistical learning methods to build effective classification models for predicting at-risk subjects and diagnosing disease. In the first example, we develop methods to predict whether pre-symptomatic individuals are at risk of a disease based on their marker profiles, which offers an opportunity for early intervention well before a definitive clinical diagnosis. For many diseases, the risk of disease varies with some marker of biological importance such as age, and the markers themselves may be age dependent. To identify effective prediction rules using nonparametric decision functions, standard statistical learning approaches treat markers with clear biological importance (e.g., age) and other markers without prior knowledge on disease etiology interchangeably as input variables. Therefore, these approaches may be inadequate in singling out and preserving the effects from the biologically important variables, especially in the presence of high-dimensional markers. Using age as an example of a salient marker to receive special care in the analysis, we propose a local smoothing large margin classifier to construct effective age-dependent classification rules. The method adaptively adjusts for age effect and separately tunes age and other markers to achieve optimal performance. We apply the proposed method to a registry of individuals at risk for Huntington's disease (HD) and controls to construct age-sensitive predictive scores for the risk of receiving HD diagnosis during the study period in premanifest individuals. In the second example, we develop methods for building formal diagnostic criteria sets for a new psychiatric disorder introduced in the recently released fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). The methods take into account the unique logic structure of the DSM-like criteria sets and domain knowledge from experts’ opinions.

Thursday, November 7, 2013

### Delta what? Choice of Outcome Scale in Non-Inferiority Trials

*Rick Chappell, PhD
University of Wisconsin
Department of Statistics
Department of Biostatistics and Medical Informatics*

Non-inferiority (equivalence) trials are clinical experiments which attempt to show that one intervention is not too much inferior to another on some quantitative scale. The cutoff value is commonly denoted as Delta. For example, one might wish to show that the hazard ratio of disease-free survival among patients given an experimental chemotherapy versus a currently approved regimen is Delta = 1.3 or less, especially if the former is thought to be less toxic than or otherwise advantageous over the latter.

Naturally, a lot of attention is given to choice of Delta. In addition to this, I assert that even more than in superiority clinical trials the scale of Delta in equivalence trials must be carefully chosen. Since null hypotheses in superiority studies generally imply no effect, they are often identical or at least compatible when formulated on different scales. However, nonzero Deltas on one scale usually conflict with those on another. For example, the four hypotheses of arithmetic or multiplicative differences of either survival or hazard in general all mean different things unless Delta = 0 for differences or 1 for ratios. This can lead to problems in interpretation when the clinically natural scale is not a statistically convenient one.
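
The conflict between scales can be made concrete with a short calculation. It assumes proportional hazards, under which the experimental survival probability equals the control survival probability raised to the power of the hazard ratio; the function name and the baseline survival values are hypothetical.

```python
def survival_difference(s0, hazard_ratio):
    """Control survival minus experimental survival under proportional hazards.

    Under proportional hazards, S1(t) = S0(t) ** hazard_ratio.
    """
    return s0 - s0 ** hazard_ratio

# The same hazard-ratio margin (Delta = 1.3) at several hypothetical control survival levels.
margin = 1.3
differences = {s0: survival_difference(s0, margin) for s0 in (0.9, 0.7, 0.5)}
for s0, diff in differences.items():
    print(f"control survival {s0:.1f}: implied survival difference {diff:.3f}")
```

A single margin on the hazard-ratio scale implies a different absolute survival difference at each baseline level, so a fixed Delta on one scale is not a fixed Delta on the other. Only at a ratio of 1 (a difference of 0) do the two scales agree.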

Thursday, October 24, 2013

## Spring 2013 Biostatistics Colloquia

### Revolutionizing Policy Analysis Using "Big Data" Analytics

*Siddhartha R. Dalal, PhD
RAND Corporation and Columbia University*

As policy analysis becomes applicable to new domains, it is being challenged by the “curse of dimensionality”, that is, the vastness of available information, the need for increasingly detailed and delicate analysis, and the speed with which new analysis is needed and old analysis must be refreshed. Moreover, with the proliferation of digital information available at one’s fingertips, and the expectation that this information be quickly leveraged, policy analysis in these new domains is being handicapped by the lack of scalable methods.

I will describe the results of a new initiative I started at RAND which developed new methods to apply to these “big data” problems to create new information and to convert enormous amounts of existing information into the knowledge needed for policy analysis. These methods draw on social interaction models, information analytics, and web technologies that have already revolutionized research in other areas. The specific examples include applications in medical informatics to find adverse effects of drugs and chemicals, terrorism analysis, and improvement in the efficiency of communicating individually tailored, policy-relevant information directly to policymakers. On the surface, traditional information theoretic considerations do not offer solutions. Accordingly, researchers looking for conventional solutions would have difficulty in solving these problems. I will describe how alternative formulations based on statistical underpinnings including Bayesian methods, sequential stopping and combinatorial designs have played a critical role in addressing these challenges.

Tuesday, June 4, 2013

Biographical Sketch: Siddhartha Dalal is currently Adjunct Professor at RAND Corporation and at Columbia University. Prior to this, he was Chief Technology Officer at RAND, and Vice President of Research at Xerox. Sid’s industrial research career began at Math Research Center at Bell Labs followed by Bellcore/Telcordia Technologies. He has co-authored over 100 publications, patents and monographs covering the areas of medical informatics, risk analysis, image processing, stochastic optimization, data/document mining, software engineering and Bayesian methods.

### Efficient and optimal estimation of a marginal mean in a dynamic infusion length regimen

*Brent A. Johnson, PhD
Department of Biostatistics and Bioinformatics*

*Rollins School of Public Health*

*Emory University*

In post-operative medical care, some drugs are administered intravenously through an infusion pump. For example, in infusion studies following heart surgery, anti-coagulants are delivered intravenously for many hours, even days while a patient recovers. A common primary endpoint of infusion studies is to compare two or more infusion drugs or rates and one can employ standard statistical analyses to address the primary endpoint in an intent-to-treat analysis. However, the presence of infusion-terminating events can adversely affect the analysis of primary endpoints and complicate statistical analyses of secondary endpoints. In this talk, I will focus on a popular secondary analysis of evaluating infusion lengths. The analysis is complicated due to presence or absence of infusion-terminating events and potential time-dependent confounding in treatment assignment. I will show how the theory of dynamic treatment regimes lends itself to this problem and offers a principled approach to construct adaptive, personalized infusion length policies. I will present some recent achievements that allow one to construct an improved, doubly-robust estimator for a particular class of nonrandom dynamic treatment regimes. All techniques will be exemplified through the ESPRIT infusion trial data from Duke University Medical Center.

Thursday, April 18, 2013

### Flexible Modeling of Medical Cost Data

*Lei Liu, PhD
Department of Preventive Medicine*

*Northwestern University School of Medicine*

Medical cost data are often skewed to the right and heteroscedastic, having a nonlinear relation with covariates. To tackle these issues, we consider an extension to generalized linear models by assuming nonlinear covariate effects in the mean function and allowing the variance to be an unknown but smooth function of the mean. We make no further assumption on the distributional form. The unknown functions are described by penalized splines, and the estimation is carried out using nonparametric quasi-likelihood. Simulation studies show the flexibility and advantages of our approach. We apply the model to the annual medical costs of heart failure patients in the clinical data repository (CDR) at the University of Virginia Hospital System. We also discuss how to adapt this modeling framework to correlated medical cost data.

Thursday, March 28, 2013

## Fall 2012 Biostatistics Colloquia

### Optimization of Dynamic Treatment Regimes for Recurrent Diseases

*Xuelin Huang, PhD*

*Department of Biostatistics*

*University of Texas MD Anderson Cancer Center*

Patients with a non-curable disease such as many types of cancer usually go through the process of initial treatment, a varying number of disease recurrences, and salvage treatments. Such multistage treatments are inevitably dynamic. That is, the choice of the next treatment depends on the patient's response to previous therapies. Dynamic treatment regimes (DTRs) are routinely used in clinics, but are rarely optimized. A systematic optimization of DTRs is highly desirable, but it poses immense challenges for statisticians given their complex nature. Our approach to this issue is to optimize by backward induction. That is, we first optimize the treatments for the last stage, conditional on patient treatment and response history. Then, by induction, after optimization for stage k is done, for stage k-1, we plug in the optimized results of all the k+ stages, and assume such optimized survival time from stage k-1 follows an accelerated failure time (AFT) model. Again, the optimization of treatments at stage k-1 is done under the assumed AFT model. This process is repeated until the optimization for the first stage is completed. By doing that, the effects of different treatments at each stage on survival can be consistently estimated and fairly compared, and the overall optimal DTR for each patient can be identified. Simulation studies show that the proposed method performs well and is useful in practical situations. The proposed method is applied to a study for acute myeloid leukemia, to identify the optimal treatment strategies for different subgroups of patients. Potential problems, alternative models, and optimization of the estimation methods are also discussed.
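
The backward-induction logic can be shown on a toy two-stage problem. This sketch replaces the AFT model of the abstract with a lookup table of expected survival gains, so it illustrates only the order of optimization, not the estimation method; every treatment label, probability, and number below is invented.

```python
# Hypothetical expected additional survival (months) after salvage treatment a2,
# given initial treatment a1 and response r (1 = responded, 0 = did not).
stage2_gain = {
    ("A", 0, "X"): 4, ("A", 0, "Y"): 6,
    ("A", 1, "X"): 12, ("A", 1, "Y"): 10,
    ("B", 0, "X"): 5, ("B", 0, "Y"): 3,
    ("B", 1, "X"): 9, ("B", 1, "Y"): 14,
}
stage1_gain = {"A": 10, "B": 8}      # hypothetical survival before recurrence
p_response = {"A": 0.5, "B": 0.6}    # hypothetical response probabilities

def optimize():
    # Step 1: optimize the last stage for every possible history (a1, r).
    best2 = {}
    for a1 in stage1_gain:
        for r in (0, 1):
            a2 = max(("X", "Y"), key=lambda a: stage2_gain[(a1, r, a)])
            best2[(a1, r)] = (a2, stage2_gain[(a1, r, a2)])
    # Step 2: plug the optimized stage-2 values into the stage-1 comparison.
    value1 = {}
    for a1 in stage1_gain:
        p = p_response[a1]
        value1[a1] = (stage1_gain[a1]
                      + p * best2[(a1, 1)][1]
                      + (1 - p) * best2[(a1, 0)][1])
    best1 = max(value1, key=value1.get)
    return best1, best2, value1

best1, best2, value1 = optimize()
```

Comparing initial treatments only after each has been paired with its best follow-up is what makes the comparison fair: a first-line option is not penalized for a poor salvage choice that would never be made under the optimal regime.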

Thursday, December 13, 2012

### Testing with Correlated Data in Genome-wide Association Studies

*Elizabeth Schifano, PhD*

*Department of Statistics*

*University of Connecticut*

The complexity of the human genome makes it challenging to identify genetic markers associated with clinical outcomes. This identification is further complicated by the vast number of available markers, the majority of which are unrelated to outcome. As a consequence, the standard assessment of individual (marginal) marker effects on a single outcome is often ineffective. It is thus desirable to borrow information and strength from the large amounts of observed data to develop more powerful testing strategies. In this talk, I will discuss testing procedures that capitalize on various forms of correlation observed in genome-wide association studies.

This is joint work with Dr. Xihong Lin (Harvard School of Public Health).

Thursday, November 29, 2012

### Misspecification of Cox Regression Models with Composite Endpoints

*Richard J. Cook, PhD
Department of Statistics and Actuarial Science*

*University of Waterloo*

Researchers routinely adopt composite endpoints in multicenter randomized trials designed to evaluate the effect of experimental interventions in cardiovascular disease, diabetes, and cancer. Despite their widespread use, relatively little attention has been paid to the statistical properties of estimators of treatment effect based on composite endpoints. We consider this here in the context of multivariate models for time to event data in which copula functions link marginal distributions with a proportional hazards structure. We then examine the asymptotic and empirical properties of the estimator of treatment effect arising from a Cox regression model for the time to the first event. We point out that even when the treatment effect is the same for the component events, the limiting value of the estimator based on the composite endpoint is usually inconsistent for this common value. We find that in this context the limiting value is determined by the degree of association between the events, the stochastic ordering of events, and the censoring distribution. Within the framework adopted, marginal methods for the analysis of multivariate failure time data yield consistent estimators of treatment effect and are therefore preferred. We illustrate the methods by application to a recent asthma study.

This is joint work with Longyang Wu.

Thursday, October 25, 2012

## Spring 2012 Biostatistics Colloquia

### Growth Trajectories and Bayesian Inverse Problems

*Ian McKeague, PhD
Mailman School of Public Health*

*Columbia University*

Growth trajectories play a central role in life course epidemiology, often providing fundamental indicators of prenatal or childhood development, as well as an array of potential determinants of adult health outcomes. Statistical methods for the analysis of growth trajectories have been widely studied, but many challenging problems remain. Repeated measurements of length, weight and head circumference, for example, may be available on most subjects in a study, but usually only sparse temporal sampling of such variables is feasible. It can thus be challenging to gain a detailed understanding of growth velocity patterns, and smoothing techniques are inevitably needed. Moreover, the problem is exacerbated by the presence of large fluctuations in growth velocity during early infancy, and high variability between subjects. Existing approaches, however, can be inflexible due to a reliance on parametric models, and require computationally intensive methods that are unsuitable for exploratory analyses. This talk introduces a nonparametric Bayesian inversion approach to such problems, along with an R package that implements the proposed method.

Thursday, April 19, 2012

### Study Design and Statistical Inference for Data from an Outcome Dependent Sampling Scheme with a Continuous Outcome

*Haibo Zhou, PhD*

*Department of Biostatistics*

*University of North Carolina at Chapel Hill*

Outcome dependent sampling (ODS) schemes can be cost effective ways to enhance study efficiency. The case-control design has been widely used in epidemiologic studies. However, when the outcome is measured on a continuous scale, dichotomizing the outcome could lead to a loss of efficiency. Recent epidemiologic studies have used ODS sampling schemes where, in addition to an overall random sample, there are also a number of supplemental samples that are collected based on a continuous outcome variable. We consider a semiparametric empirical likelihood inference procedure in which the underlying distribution of covariates is treated as a nuisance parameter and left unspecified. The proposed estimator is asymptotically normal. The likelihood ratio statistic using the semiparametric empirical likelihood function has Wilks type properties in that, under the null, it follows a Chi-square distribution asymptotically and is independent of the nuisance parameters. Simulation results indicate that, for data obtained using an ODS design, the proposed estimator is more efficient than competing estimators with the same sample size.

A data set from the Collaborative Perinatal Project (CPP) is used to illustrate the proposed method by assessing the impact of maternal polychlorinated biphenyl (PCB) exposure on children’s IQ test performance.

Thursday, March 8, 2012

### HARK: A New Method for Regression with Functional Predictors, with Application to the Sleep Heart Health Study

*Dawn Woodard, PhD*

*Assistant Professor
Operations Research and Information Engineering
Cornell University*

We propose a new method for regression using a parsimonious and scientifically interpretable representation of functional predictors. Our approach is designed for data that exhibit features such as spikes, dips, and plateaus whose frequency, location, size, and shape vary stochastically across subjects. Our method is motivated by the goal of quantifying the association between sleep characteristics and health outcomes, using a large and complex dataset from the Sleep Heart Health Study. We propose Bayesian inference of the joint functional and exposure models, and give a method for efficient computation. We contrast our approach with existing state-of-the-art methods for regression with functional predictors, and show that our method is more effective and efficient for data that include features occurring at varying locations.

Thursday, February 9, 2012

### Choice of Optimal Estimators in Structural Nested Mean Models With Application to Initiating HAART in HIV Positive Patients After Varying Duration of Infection

*Judith Lok, PhD*

*Assistant Professor
Department of Biostatistics
Harvard School of Public Health*

We estimate how the effect of a fixed duration of antiretroviral treatment depends on the time from HIV infection to initiation of treatment, using observational data. A major challenge in making inferences from such observational data is that treatment is not randomly assigned; e.g., if time of initiation depends on disease status, this dependence will induce bias in the estimation of the effect of interest. Previously, Lok and De Gruttola developed a new class of Structural Nested Mean Models to estimate this effect. This led to a large class of consistent, asymptotically normal estimators, under the assumption that all confounders are measured. However, estimates and standard errors turn out to depend significantly on the choice of estimators within this class, which motivates the study of optimal ones. We will present an explicit solution for the choice of optimal estimators under some extra conditions. In the absence of those extra conditions, the resulting estimator is still consistent and asymptotically normal, although possibly not optimal. This estimator is also doubly robust: it is consistent and asymptotically normal not only if the model for treatment initiation is correct, but also if a certain outcome-regression model is correct.

We illustrate our methods using the AIEDRP (Acute Infection and Early Disease Research Program) Core01 database on HIV. Delaying the initiation of HAART has the advantage of postponing onset of adverse events or drug resistance, but may also lead to irreversible immune system damage. Application of our methods to observational data on treatment initiation will help provide insight into these tradeoffs. The current interest in using treatment to control epidemic spread heightens interest in these issues, as early treatment can only be ethically justified if it benefits individual patients, regardless of the potential for community-wide benefits.

This is joint work with Victor De Gruttola, Ray Griner, and James Robins.

Thursday, January 19, 2012

## Fall 2011 Biostatistics Colloquia

### Conditional Inference Functions for Mixed-Effects Models with Unspecified Random-Effects Distribution

*Annie Qu, PhD*

*Professor
Department of Biostatistics
University of Illinois at Urbana-Champaign*

In longitudinal studies, mixed-effects models are important for addressing subject-specific effects. However, most existing approaches assume a normal distribution for the random effects, and this could affect the bias and efficiency of the fixed-effects estimator. Even in cases where the estimation of the fixed effects is robust with a misspecified distribution of the random effects, the estimation of the random effects could be invalid. We propose a new approach to estimate fixed and random effects using conditional quadratic inference functions. The new approach does not require the specification of likelihood functions or a normality assumption for random effects. It can also accommodate serial correlation between observations within the same cluster, in addition to mixed-effects modeling. Other advantages include not requiring the estimation of the unknown variance components associated with the random effects, or the nuisance parameters associated with the working correlations. Real data examples and simulations are used to compare the new approach with the penalized quasi-likelihood approach, and SAS GLIMMIX and nonlinear mixed effects model (NLMIXED) procedures.

This is joint work with Peng Wang and Guei-feng Tsai.

Thursday, November 17, 2011

### Collection, Analysis and Interpretation of Dietary Intake Data

*Alicia L. Carriquiry*

*Professor of Statistics
Iowa State University*

The United States government spends billions of dollars each year on food assistance programs, on food safety and food labeling efforts and in general on interventions and other activities with the goal of improving the nutritional status of the population. To do so, the government relies on large, nationwide food consumption and health surveys that are carried out regularly.

Of interest to policy makers, researchers and practitioners is the usual intake of a nutrient or other food components. The distribution of usual intakes in population sub-groups is also of interest, as is the association between consumption and health outcomes. Today we focus on the estimation and interpretation of distributions of nutrient intake and on their use for policy decision-making.

From a statistical point of view, estimating the distribution of usual intakes of a nutrient or other food components is challenging. Usual intakes are unobservable in practice and are subject to large measurement error, skewness and other survey-related effects. The problem of estimating usual nutrient intake distributions can therefore be thought of as the problem of estimating the density of a non-normal random variable that is observed with error. We describe what is now considered to be the standard approach for estimation and will spend some time discussing problems in this area that remain to be addressed. We use data from the most recent NHANES survey to illustrate the methods and provide examples.

Thursday, November 3, 2011

### Merging Surveillance Cohorts in HIV/AIDS Studies

*Peter X. K. Song, PhD*

*Professor of Biostatistics
Department of Biostatistics
University of Michigan School of Public Health*

Rising HIV/AIDS prevalence in China has become a serious public health concern in recent years. Data from established surveillance networks across the country have provided timely information for intervention, control and prevention. In this talk, I will focus on the study population of injection drug users in Sichuan province over the years 2006-2009, and the evaluation of HIV prevalence across regions in this province. In particular, I will introduce a newly developed estimating equation approach to merging clustered/longitudinal cohort study datasets, which enabled us not only to effectively detect risk factors associated with worsening prevalence rates but also to estimate the effect sizes of the detected risk factors. Both simulation studies and real data analysis will be presented.

Thursday, October 13, 2011

## Spring 2011 Biostatistics Colloquia

### Independent Component Analysis Involving Autocorrelated Sources with an Application to Functional Magnetic Resonance Imaging

*Haipeng Shen, PhD
Department of Statistics & Operations Research
University of North Carolina at Chapel Hill*

Independent component analysis (ICA) is an effective data-driven method for blind source separation. It has been successfully applied to separate source signals of interest from their mixtures. Most existing ICA procedures are carried out by relying solely on the estimation of the marginal density functions. In many applications, correlation structures within each source also play an important role besides the marginal distributions. One important example is functional magnetic resonance imaging (fMRI) analysis where the brain-function-related signals are temporally correlated.

I shall talk about a novel ICA approach that fully exploits the correlation structures within the source signals. Specifically, we propose to estimate the spectral density functions of the source signals instead of their marginal density functions. Our methodology is described and implemented using spectral density functions from frequently used time series models such as ARMA processes. The time series parameters and the mixing matrix are estimated via maximizing the Whittle likelihood function. The performance of the proposed method will be illustrated through extensive simulation studies and a real fMRI application. The numerical results indicate that our approach outperforms several popular methods including the most widely used fastICA algorithm.

Thursday, May 19, 2011

### Identification of treatment responders and non-responders via a multivariate growth curve latent class model

*Mary D. Sammel, ScD
Department of Biostatistics and Epidemiology
University of Pennsylvania School of Medicine*

In many clinical studies, the disease of interest is multi-faceted and multiple outcomes are needed to adequately characterize the disease or its severity. In such studies, it is often difficult to determine what constitutes improvement due to the multivariate nature of the response. Furthermore, when the disease of interest has an unknown etiology and/or is primarily a symptom-defined syndrome, there is potential for the study population to be heterogeneous with respect to their symptom profiles. Identification of population subgroups is of interest as it may enable clinicians to provide targeted treatments or develop accurate prognoses. We propose a multivariate growth curve latent class model that groups subjects based on multiple outcomes measured repeatedly over time. These groups or latent classes are characterized by distinctive longitudinal profiles of a latent variable which is used to summarize the multivariate outcomes at each point in time. The mean growth curve for the latent variable in each class defines the features of the class. We develop this model for any combination of continuous, binary, ordinal or count outcomes within a Bayesian hierarchical framework. Simulation studies are used to validate the estimation procedures. We apply our models to data from a randomized clinical trial evaluating the efficacy of Bacillus Calmette-Guerin in treating symptoms of interstitial cystitis (IC), where we are able to identify a class of subjects who were not responsive to treatment, and a class of subjects for whom treatment was effective in reducing symptoms over time.

Thursday, May 12, 2011

### Recent development on a pseudolikelihood method for analysis of multivariate data with nonignorable nonresponse

*Gong Tang, PhD*

*Assistant Professor of Biostatistics
University of Pittsburgh Graduate School of Public Health*

For regression analysis of data with nonignorable nonresponse, standard methods require modeling the nonresponse mechanism. Tang, Little and Raghunathan (2003) proposed a pseudolikelihood method for analysis of data with a class of nonignorable nonresponse mechanisms without modeling the nonresponse mechanism, and extended it to multivariate monotone data with nonignorable nonresponse. In the multivariate case, the joint distribution of response variables was factored into the product of conditional distributions and the pseudolikelihood estimates of the conditional distribution parameters were shown to be asymptotically normal. However, these estimates were based on different subsets of the data, which were dictated by the missing-data pattern, and their joint distribution was unclear. Here we provide a modification of the likelihood functions and derive the asymptotic joint distributions of these estimates. We also consider an imputation approach for this pseudolikelihood method. Usual imputation approaches impute the missing values and summarize via multiple imputations. Without knowing or modeling the nonresponse mechanism in our setting, the missing values cannot be predicted. We propose a novel approach via imputing the necessary sufficient statistics to circumvent this barrier.

Thursday, March 10, 2011

### Asymptotic Properties of Permutation Tests for ANOVA Designs

*John Kolassa, PhD*

*Professor of Statistics
Rutgers University*

We show that under mild conditions, which will allow the application of the approximations in bootstrap, permutation and rank statistics used for multiparameter cases, the integral of the formal saddlepoint density approximation can be used to give an approximation, with relative error of order 1/n, to the tail probability of a likelihood ratio-like statistic. This then permits the approximation to be put into a form analogous to those given either by Lugannani-Rice or Barndorff-Nielsen.

*This is joint work with John Robinson, University of Sydney*

Thursday, February 17, 2011

## Fall 2010 Biostatistics Colloquia

### Hierarchical Commensurate and Power Prior Models for Adaptive Incorporation of Historical Information in Clinical Trials

*Bradley Carlin, PhD*

*Professor and Head, Division of Biostatistics*

*University of Minnesota School of Public Health*

Bayesian clinical trial designs offer the possibility of a substantially reduced sample size, increased statistical power, and reductions in cost and ethical hazard. However, when prior and current information conflict, Bayesian methods can lead to higher than expected Type I error, as well as the possibility of a costlier and lengthier trial. This motivates an investigation of the feasibility of hierarchical Bayesian methods for incorporating historical data that are adaptively robust to prior information that reveals itself to be inconsistent with the accumulating experimental data. In this paper, we present novel modifications to the traditional hierarchical modeling approach that allow the commensurability of the information in the historical and current data to determine how much historical information is used. We describe the method in the Gaussian case, but then add several important extensions, including the ability to incorporate covariates, random effects, and non-Gaussian likelihoods (especially for binary and time-to-event data). We compare the frequentist performance of our methods with that of existing, more traditional alternatives using simulation, calibrating our methods so they could be feasibly employed in FDA-regulated trials. We also give an example in a colon cancer trial setting where our proposed design produces more precise estimates of the model parameters, in particular conferring statistical significance to the observed reduction in tumor size for the experimental regimen as compared to the control. Finally, we indicate how the method may be combined with adaptive randomization to further increase its utility.

Tuesday, December 7, 2010

### Risk Prediction Models from Genome Wide Association Data

*Hongyu Zhao, PhD
Professor of Public Health (Biostatistics)
Professor of Genetics and of Statistics
Yale School of Public Health*

Recent genome wide association studies have identified many genetic variants affecting complex human diseases. It is of great interest to build disease risk prediction models based on these data. In this presentation, we will present the statistical challenges in using genome wide association data for risk predictions, and discuss different methods through both simulation studies and applications to real-world data. This is joint work with Jia Kang and Judy Cho.

Thursday, December 2, 2010

### Delaying Time-to-Event in Parkinson’s Disease

*Nick Holford, PhD
Professor*

*Department of Pharmacology and Clinical Pharmacology*

*University of Auckland, New Zealand*

There are two reasons for studying the time course of disease status as a predictor in time-to-event analysis. Firstly, it is well understood that pharmacologic treatments may influence both the time course of disease progress and a clinical event such as death. Secondly, the two outcome variables (i.e., disease status and clinical event) are highly correlated; for example, the probability of a clinical event may be increased by the worsening disease status. Despite these reasons, a separate analysis for each type of outcome measurement is usually performed and often only baseline disease status is used as a time-constant covariate in the time-to-event analysis. We contend that more useful information can be gained when time course of disease status is modeled as a time-dependent covariate, providing some mechanistic insight for the effectiveness of treatment. Furthermore, an integrated model to describe the effect of treatment on the time course of both outcomes would provide a basis for clinicians to make better prognostic predictions of the eventual clinical outcome. We illustrate these points using data from 800 Parkinson’s disease (PD) patients who participated in the DATATOP trial and were followed for 8 years. It is shown that the hazards for four clinical events in PD (depression, disability, dementia, and death) are not constant over time and are clearly influenced by PD progression. With the integrated model of time course of disease progress and clinical events, differences in the probabilities of clinical events can be explained by the symptomatic and/or protective effects of anti-parkinsonian medications on PD progression. The use of early disease-status measurements may have clinical application in predicting the probability of clinical events and giving patients better individual prognostic advice.

Thursday, November 11, 2010

### Profile Likelihood and Semi-parametric Models, with Application to Multivariate Survival Analysis

*Jerry Lawless, PhD
Distinguished Professor Emeritus*

*Department of Statistics and Actuarial Science*

*University of Waterloo*

We consider semi-parametric models involving a finite dimensional parameter along with functional parameters. Profile likelihoods for finite dimensional parameters have regular asymptotic behaviour in many settings. In this talk we review profile likelihood and then consider several inference problems related to copulas; these include tests for parametric copulas, estimation of marginal distributions and association parameters and semi-parametric likelihood and pseudo-likelihood comparisons. Applications involving parallel and sequentially observed survival times will be considered.

Thursday, November 4, 2010

### Nonparametric Modeling of Next Generation Sequencing Data

*Ping Ma, PhD*

*Assistant Professor, Department of Statistics*

*University of Illinois at Urbana-Champaign*

With the rapid development of next generation sequencing technologies, ChIP-seq and RNA-seq have become popular methods for genome-wide protein-DNA interaction analysis and gene expression analysis, respectively. Compared to their hybridization-based counterparts, e.g., ChIP-chip and microarray, ChIP-seq and RNA-seq offer signals at up to single-base resolution. In particular, the two technologies produce tens of millions of short reads in a single run. After mapping these reads to a reference genome (or transcripts), researchers get a sequence of read counts. That is, at each nucleotide position, researchers get a count which stands for the number of reads whose mapping starts at that position. Depending on research goals, researchers may opt to either analyze these counts directly or derive other types of data based on these counts to facilitate biological discoveries. In this talk, I will present some nonparametric methods we recently developed for analyzing next generation sequencing data.
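
The read-count representation described in this abstract can be illustrated with a small sketch. It assumes the reads have already been mapped and that only each read's 0-based start position is kept; the function name `read_start_counts` and the toy positions are hypothetical and are not taken from the talk.

```python
from collections import Counter

def read_start_counts(start_positions, region_length):
    """Count, for each nucleotide position, the reads whose mapping starts there.

    start_positions: iterable of 0-based mapped start positions (assumed input).
    region_length:   number of positions in the region of interest.
    """
    counts = [0] * region_length
    for pos, n_reads in Counter(start_positions).items():
        # Ignore reads that start outside the region of interest.
        if 0 <= pos < region_length:
            counts[pos] = n_reads
    return counts

# Toy example: six reads mapped to a region of six positions.
starts = [0, 2, 2, 5, 2, 0]
print(read_start_counts(starts, 6))  # [2, 0, 3, 0, 0, 1]
```

The resulting vector of counts is the kind of sequence that could then be analyzed directly or transformed, depending on the research goal.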

Thursday, October 28, 2010

## Summer 2010 Biostatistics Colloquia

### Bayesian Inference in Semiparametric Mixed Models for Longitudinal Data

*Yisheng Li, PhD
University of Texas M.D. Anderson Cancer Center*

We consider Bayesian inference in semiparametric mixed models (SPMMs) for longitudinal data. SPMMs are a class of models that use a nonparametric function to model a time effect, a parametric function to model other covariate effects, and parametric or nonparametric random effects to account for the within-subject correlation. We model the nonparametric function using a Bayesian formulation of a cubic smoothing spline, and the random effect distribution using a normal distribution and alternatively a nonparametric Dirichlet process (DP) prior. When the random effect distribution is assumed to be normal, we propose a uniform shrinkage prior (USP) for the variance components and the smoothing parameter. When the random effect distribution is modeled nonparametrically, we use a DP prior with a normal base measure and propose a USP for the hyperparameters of the DP base measure. We argue that the commonly assumed DP prior implies a nonzero mean of the random effect distribution, even when a base measure with mean zero is specified. This implies weak identifiability for the fixed effects, and can therefore lead to biased estimators and poor inference for the regression coefficients and the spline estimator of the nonparametric function. We propose an adjustment using a postprocessing technique. We show that under mild conditions the posterior is proper under the proposed USP, a flat prior for the fixed effect parameters, and an improper prior for the residual variance. We illustrate the proposed approach using a longitudinal hormone dataset, and carry out extensive simulation studies to compare its finite sample performance with existing methods.

*This is joint work with Xihong Lin and Peter Mueller.*

Friday, July 23, 2010

## Spring 2010 Biostatistics Colloquia

### Statistical challenges in identifying biomarkers for Alzheimer’s disease: Insights from the Alzheimer's Disease Neuroimaging Initiative (ADNI)

*Laurel Beckett, PhD
University of California, Davis*

The aim of the Alzheimer's Disease Neuroimaging Initiative (ADNI) is to evaluate potential biomarkers for clinical disease progression. ADNI has enrolled more than 800 people including normal controls (NC), people with mild cognitive impairment (MCI), and people with mild to moderate Alzheimer’s disease (AD). For each person, we now have two years of follow-up clinical data including neuropsychological tests, functional measures, and clinical diagnosis to detect conversion from normal to MCI or MCI to AD. We also have longitudinal data on potential biomarkers based on MRI and PET neuroimaging and on serum and cerebrospinal fluid samples. Our goal is to find the best biomarkers to help us track the early preclinical and later clinical progression of AD, and to help speed up drug testing.

ADNI poses many challenges for statisticians. There are many potential biomarkers, arising from high-dimensional, correlated longitudinal data. Even if we pick a single biomarker to examine, we don’t have a single “gold standard” for performance. Instead, we have many different tests we would like a biomarker to pass. One very simple criterion is that it should be different in NC, MCI and AD. But we also want biomarkers that are sensitive to change over time, have a high signal-to-noise ratio, and correlate well with clinical endpoints. I will show some statistical approaches to these questions, and illustrate with current ADNI data.

Thursday, May 20, 2010

### Some Recent Statistical Developments on Cancer Clinical Trials and Computational Biology

*Junfeng (Jeffrey) Liu, PhD
The Cancer Institute of New Jersey and*

*University of Medicine & Dentistry of New Jersey*

In the first part, we extend Simon's two-stage design (1989) for single-arm phase II cancer clinical trials by studying a realistic scenario where the standard and experimental treatment overall response rates (ORRs) follow two beta distributions (rather than two single values). Our results show that this type of design retains certain desirable properties for hypothesis testing purposes. However, some designs may not exist under certain hypothesis and error rate (type I and II) setups in practice. Theoretical conditions are derived for asymptotic two-stage design non-existence and for improving design search efficiency. In the second part, we introduce Monte-Carlo simulation based algorithms for rigorously calculating the probability of a set of orthologous DNA sequences within an evolutionary biology framework, where pairwise Needleman-Wunsch alignment (1970) between the imputed and species sequences is utilized to induce the posterior conditional probabilities, which lead to efficient calculation using the central limit theorem. The importance of an evolution-adaptive alignment algorithm is highlighted. If time allows, we will also briefly introduce some necessary conditions for realizing a self-consistent (Chapman-Kolmogorov equation) and self-contained (concurrent substitution-insertion-deletion) continuous-time finite-state Markov chain for modeling nucleotide site evolution under certain assumptions.
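
For readers unfamiliar with the classical two-stage design that the first part of this talk extends, the following sketch computes its operating characteristics at a single fixed response rate. It covers only the standard fixed-rate setting, not the beta-distributed extension described in the abstract. The function names and the example design (stop after 10 patients if at most 1 responds; otherwise continue to 29 patients and declare the treatment unpromising if at most 5 respond in total) are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); 0 if k < 0, 1 if k >= n."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(i, n, p) for i in range(min(k, n) + 1))

def two_stage_characteristics(r1, n1, r, n, p):
    """Operating characteristics of a two-stage single-arm design.

    Stage 1: enroll n1 patients; stop for futility if responses <= r1.
    Stage 2: enroll n - n1 more; declare promising if total responses > r.
    Returns (probability of early termination,
             probability of declaring the treatment promising,
             expected sample size), all at true response rate p.
    """
    n2 = n - n1
    pet = binom_cdf(r1, n1, p)
    promising = sum(
        binom_pmf(x1, n1, p) * (1 - binom_cdf(r - x1, n2, p))
        for x1 in range(r1 + 1, n1 + 1)
    )
    expected_n = n1 + (1 - pet) * n2
    return pet, promising, expected_n

# Hypothetical design evaluated at response rates 0.10 and 0.30.
for rate in (0.10, 0.30):
    pet, promising, expected_n = two_stage_characteristics(1, 10, 5, 29, rate)
    print(f"p={rate:.2f}  PET={pet:.3f}  P(promising)={promising:.3f}  E[N]={expected_n:.1f}")
```

Evaluating the second returned value at the null and alternative response rates gives the type I error and the power of the design, respectively.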

Thursday, May 6, 2010

### Median regression in survival analysis via transform-both-sides model

*Debajyoti Sinha, PhD
Florida State University*

For analysis of survival data, median regression offers a useful alternative to the popular proportional hazards (Cox, 1972) and accelerated failure time models. We propose a new simple method for estimating the parameters of censored median regression based on a transform-both-sides model. Numerical studies are conducted to show that our likelihood based and Bayes estimators perform well compared to existing estimators for censored data with a wide range of skewness. In addition, the simulated variances of our proposed maximum likelihood estimators are substantially lower than those of Portnoy (2003) and other existing estimators. Our Bayesian estimators can handle a semiparametric model in which the distribution is required to be symmetric after transformation of both sides. We also extend our methods to deal with median regressions for multivariate survival data.

*This is joint work with Jianchang Lin and Stuart Lipsitz.*

Thursday, April 29, 2010

### Planning Survival Analysis Studies with Two-Stage Randomization Trials

*Zhiguo Li, PhD
University of Michigan*

Two-stage randomization trials are growing in importance in developing and comparing adaptive treatment strategies (i.e., treatment policies or dynamic treatment regimes). Usually the first stage involves randomization to one of several initial treatments. The second stage of treatment begins when a response (or nonresponse) criterion is met. In the second stage, subjects are again randomized among treatments. With time-to-event outcomes, sample size calculations for planning these two-stage randomization trials are challenging because the variances of common test statistics depend in a complex manner on the joint distribution of the time to the response (or nonresponse) criterion and the primary time-to-event outcome. We produce simple, albeit conservative, sample size formulae by using upper bounds on the variances. The resulting sample size formulae only require the same working assumptions needed to size a single stage randomized trial. Furthermore, in most common settings the sample size formulae are only mildly conservative. These sample size formulae are based on either a weighted Kaplan-Meier estimator of survival probabilities at a fixed time point or a weighted version of the log rank test. We also consider several variants of the two-stage randomization design.

Tuesday, April 27, 2010

### Spatial-Temporal Association Between Daily Mortality and Exposure to Particulate Matter

*Eric J. Kalendra, M.S.
North Carolina State University*

Fine particulate matter (PM2.5) is a mixture of pollutants that has been linked to serious health problems, including premature mortality. Since the chemical composition of PM2.5 varies across space and time, the association between PM2.5 and mortality might also be expected to vary with space and season. This study uses a unique spatial data architecture consisting of geocoded North Carolina mortality data for 2001-2002, combined with US Census 2000 data. We study the association between mortality and air pollution exposure using different metrics (monitoring data and air quality numerical models) to characterize the pollution exposure. We develop and implement a novel statistical multi-stage Bayesian framework that provides a very broad, flexible approach to studying the spatiotemporal associations between mortality and population exposure to daily PM2.5 mass, while accounting for different sources of uncertainty. Most pollution-mortality risk assessment has been done using aggregated mortality and pollution data (e.g., at the county level), which can lead to significant ecological bias and error in the estimated risk. In this work, we introduce a new framework to adjust for the ecological bias in risk assessment analyses that use aggregated data. We present results for the State of North Carolina.

Thursday, April 1, 2010

### Modeling Treatment Efficacy under Screening

*Alexander Tsodikov, PhD
University of Michigan*

Modeling the treatment null hypothesis and the alternative when the population is subject to cancer screening is a challenge. Survival is subject to length and lead-time bias, and there is a shift of the distribution of disease stage towards earlier stages under screen-based diagnosis. However, screening is not a treatment, and all these dynamic changes are expected to occur under the null hypothesis of no treatment effect. Under the alternative hypothesis, the treatment effect for an unscreened person may be different from that for a screen-detected one, as early detection may enhance the effect. The challenge is that these treatments are applied at different points of disease development. We provide a statistical modeling approach to address the question of treatment efficacy in this dynamic situation.

Thursday, March 11, 2010

## Fall 2009 Biostatistics Colloquia

### A General Framework for Combining Information and a Frequentist Approach to Incorporate Expert Opinions

*Minge Xie, PhD
Rutgers University*

Incorporating external information, such as prior information and expert opinions, can play an important role in the design, analysis and interpretation of clinical trials. Seeking effective schemes for incorporating prior information with the primary outcomes of interest has drawn increasing attention in pharmaceutical applications in recent years. Most methods currently used for combining prior information with clinical trial data are Bayesian. But we demonstrate that they may encounter problems in the analysis of clinical trials with binary outcomes, especially when the informative prior distribution is skewed.

In this talk, we present a frequentist framework of combining information using confidence distributions (CDs), and illustrate it through an example of incorporating expert opinions with information from clinical trial data. A confidence distribution (CD), which uses a distribution function to estimate a parameter of interest, contains a wealth of information for inferences; much more than a point estimator or a confidence interval (“interval estimator”). In this talk, we present a formal definition of CDs, and develop a general framework of combining information based on CDs. This CD combining framework not only unifies most existing meta-analysis approaches, but also leads to development of new approaches. In particular, we develop a Frequentist approach to combine surveys of expert opinions with binomial clinical trial data, and illustrate it using data from a collaborative research with Johnson & Johnson Pharmaceuticals. The results from the Frequentist approach are compared with those from Bayesian approaches, and it is demonstrated that the Frequentist approach has distinct advantages.

Thursday, December 10, 2009

### The Emerging Role of the Data and Safety Monitoring Board: Implications of Adaptive Clinical Trial Designs

*Christopher S. Coffey, PhD
University of Iowa*

In recent years, there has been substantial interest in the use of adaptive or novel randomized trial designs. Although there are a large number of proposed adaptations, all generally share the common characteristic that they allow for some design modifications during an ongoing clinical trial. Unfortunately, the rapid proliferation of research on adaptive designs, and inconsistent use of terminology, has created confusion about the similarities and, more importantly, the differences among the techniques. In the first half of this talk, I will attempt to provide some clarification on the topic and describe some of the more commonly proposed adaptive designs.

Furthermore, sequential monitoring of safety and efficacy data has become integral to modern clinical trials. A Data and Safety Monitoring Board (DSMB) is often given the responsibility of monitoring accumulating data over the course of the trial. DSMBs have traditionally had the responsibility to monitor the trial and make recommendations as to when a trial should be stopped for efficacy, futility, or safety. As more trials start to utilize the adaptive framework, the roles and responsibilities of the DSMB are becoming more complex. In the latter half of this talk, I will report on the experience of the DSMB during the clinical trial of high dose Coenzyme Q10 in Amyotrophic Lateral Sclerosis (QALS). This trial utilized an adaptive design involving two stages. The objective of the first stage was to identify which of two doses of CoQ10 (1000 or 2000 mg/day) is preferred for ALS. The objective of the second stage was to conduct a futility test to compare the preferred dose from stage 1 against placebo to determine whether there is sufficient evidence of efficacy to justify proceeding to a definitive phase III trial. As a result of the complexity of the adaptive design for this study, there were a number of issues that the DSMB had to address. I will briefly describe how the DSMB addressed each issue during the conduct of the trial and provide suggestions for how such issues might be addressed in future trials.

Thursday, October 29, 2009

### Modeling the Dynamics of T Cell Responses and Infections

*Rustom Antia, PhD
Emory University*

In the first part of the talk, I will discuss how mathematical models have helped us understand the rules that govern the dynamics of immune responses and the generation of immunological memory. The second part of the talk will focus on the role of different factors (resource limitation, innate and specific immunity) in the control of infections and, if time permits, discuss their application to SIV/HIV, influenza and malaria.

Friday, October 23, 2009

### Local CQR Smoothing: An Efficient and Safe Alternative to Local Polynomial Regression

*Hui Zou, PhD
University of Minnesota*

Local polynomial regression is a useful nonparametric regression tool to explore fine data structures and has been widely used in practice. In this talk, we will introduce a new nonparametric regression technique called local CQR smoothing in order to further improve on local polynomial regression. Sampling properties of the proposed estimation procedure are studied. We derive the asymptotic bias, variance and normality of the proposed estimate. Asymptotic relative efficiency of the proposed estimate with respect to the local polynomial regression is investigated. It is shown that the proposed estimate can be much more efficient than the local polynomial regression estimate for various non-normal errors, while being almost as efficient as the local polynomial regression estimate for normal errors. Simulation is conducted to examine the performance of the proposed estimates. The simulation results are consistent with our theoretical findings. A real data example is used to illustrate the proposed method.

Thursday, October 8, 2009

### Challenges and Statistical Issues in Estimating HIV Incidence

*Ruiguang Song, PhD
Centers for Disease Control*

Knowing the trends and current pattern of HIV infections is important for planning and evaluating prevention efforts and for resource allocation. However, it is difficult to estimate HIV incidence because HIV infections may not be detected or diagnosed until many years after the infection. Historically, HIV incidence was estimated based on the numbers of AIDS diagnoses and the back-calculation method. This method was no longer valid once highly active antiretroviral therapy was introduced in 1996, because the therapy changes the incubation distribution by extending the period from HIV infection to AIDS diagnosis. From then on, an empirical estimate of 40,000 was used until the new estimate was published in 2008. This presentation will describe the development of the new method and discuss the statistical issues in producing the new estimate of HIV incidence in the United States.

Thursday, September 24, 2009

## Spring 2009 Biostatistics Colloquia

### Individual Prediction in Prostate Cancer Studies Using a Joint Longitudinal-Survival Model

*Jeremy M.G. Taylor, PhD
Department of Biostatistics
University of Michigan*

For monitoring patients treated for prostate cancer, Prostate Specific Antigen (PSA) is measured periodically after they receive treatment. Increases in PSA are suggestive of recurrence of the cancer and are used in making decisions about possible new treatments. The data from studies of such patients typically consist of longitudinal PSA measurements, censored event times and baseline covariates. Methods for the combined analysis of both longitudinal and survival data have been developed in recent years, with the main emphasis being on modeling and estimation. We analyze data from a prostate cancer study in which the patients are treated with radiation therapy using a joint model. Here we focus on utilizing the model to make individualized prediction of disease progression for censored and alive patients, based on all their available pre-treatment and follow-up data.

In this model the longitudinal PSA data follows a non-linear hierarchical mixed model. The clinical recurrences are modeled using a time-dependent proportional hazards model where the time dependent covariates include both the current value and the slope of post-treatment PSA profile. Estimates of the parameters in the model are obtained by the Markov chain Monte Carlo (MCMC) technique. The model is used to give individual predictions of both future PSA values and the predicted probability of recurrence up to four years in the future. An efficient algorithm is developed to give individual predictions for subjects who were not part of the original data from which the model was developed. Thus the model can be used by others remotely through a website portal, to give individual predictions that can be updated as more follow-up data is obtained. In this talk I will discuss the data, the models, the estimation methods, the statistical issues and the website, psacalc.sph.umich.edu.

This is joint work with Menggang Yu, Donna Ankerst, Cecile Proust-Lima, Ning Liu, Yongseok Park and Howard Sandler.

Thursday, April 30, 2009

### Recent Developments of Generalized Inference in Small Sample Diagnostic Studies

*Lili Tian, PhD
Department of Biostatistics
SUNY at Buffalo*

The exact generalized method, proposed by Tsui and Weerahandi (JASA, 1989) and Weerahandi (JASA, 1993), has received much research attention recently because it allows us to make exact (non-asymptotic) inference for statistical problems for which standard inference methods do not exist. This method has proved to be very fruitful in providing accessible, admissible and preferable solutions to small sample problems in many practical settings. In this talk, I will present a brief introduction to this field followed by some recent developments including applications in small sample diagnostic studies.

Thursday, April 23, 2009

### High Dimensional Statistics in Genomics: Some New Problems and Solutions

*Hongzhe Li, PhD
Department of Biostatistics and Epidemiology
University of Pennsylvania School of Medicine*

Large-scale systematic genomic datasets have been generated to inform our biological understanding of both the normal workings of organisms in biology and disrupted processes which cause human disease. The integrative analysis of these datasets, which has become an increasingly important part of genomics and systems biology research, poses many interesting statistical problems, largely driven by the complex inter-relationships between high-dimensional genomic measurements. In this talk, I will present three problems in genomics research that require the development of new statistical methods: (1) identification of active transcription factors in microarray time-course experiments; (2) identification of subnetworks that are associated with some clinical outcomes; and (3) identification of the genetic variants that explain higher-order gene expression modules. I will present several regularized estimation methods to address these questions and demonstrate their applications using real data examples. I will also discuss some theoretical properties of these procedures.

Thursday, March 26, 2009

### Multiscale Computational Cell Biology

*Martin Meier-Schellersheim, PhD
National Institute of Allergy and Infectious Diseases
National Institutes of Health*

The modeling and simulation tool Simmune allows for the definition of detailed models of cell biological processes ranging from interactions between molecular binding sites to the behavior of populations of cells. Based on the inputs the user provides through a graphical interface, the software automatically constructs the resulting sets of partial differential equations describing intra- and extra-cellular reaction-diffusion and integrates them, providing numerous ways to display the behavior of the simulated systems and to interact with running simulations in a way that closely resembles wet-lab manipulations. In the talk, I will explain the technical foundations and typical use cases for Simmune.

Thursday, January 15, 2009

## Fall 2008 Biostatistics Colloquia

### Nonparametric Variance Estimation for Systematic Samples

*Jean Opsomer, PhD
Colorado State University*

Systematic sampling is a frequently used sampling method in natural resource surveys, because of its ease of implementation and its design efficiency. An important drawback of systematic sampling, however, is that no direct estimator of the design variance is available. We describe a new estimator of the model-based expectation of the design variance, under a nonparametric model for the population. The nonparametric model is sufficiently flexible that it can be expected to hold at least approximately for many practical situations. We prove the consistency of the estimator for both the anticipated variance and the design variance under the nonparametric model. The approach is used on a forest survey dataset, on which we compare a number of design-based and model-based variance estimators.

Thursday, November 20, 2008

### Bayesian Inference for High Dimensional Functional and Image Data using Functional Mixed Models

*Jeffrey S. Morris, PhD
Department of Biostatistics
The University of Texas MD Anderson Cancer Center*

High dimensional, irregular functional data are increasingly encountered in scientific research. For example, MALDI-MS yields proteomics data consisting of one-dimensional spectra with many peaks, array CGH or SNP chip arrays yield one-dimensional functions of copy number information along the genome, 2D gel electrophoresis and LC-MS yield two-dimensional images with spots that correspond to peptides present in the sample, and fMRI yields four-dimensional data consisting of three-dimensional brain images observed over a sequence of time points on a fine grid. In this talk, I will discuss how to identify regions of the functions/images that are related to factors of interest using Bayesian wavelet-based functional mixed models. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while borrowing strength between observations in all dimensions. I will demonstrate how to identify regions of the functions that are significantly associated with factors of interest, in a way that takes both statistical and practical significance into account and controls the Bayesian false discovery rate to a pre-specified level. I will also discuss how to extend this framework to include functional predictors with coefficient surfaces. These methods will be applied to a series of functional data sets.

Thursday, November 13, 2008

### Semiparametric Analysis of Recurrent and Terminal Event Data

*Douglas E. Schaubel, PhD
Department of Biostatistics, University of Michigan*

In clinical and observational studies, the event of interest is often one which can occur multiple times for the same subject (i.e., a recurrent event). Moreover, there may be a terminal event (e.g. death) which stops the recurrent event process and, typically, is strongly correlated with the recurrent event process. We consider the recurrent/terminal event setting and model the dependence through a shared gamma frailty that is included in both the recurrent event rate and terminal event hazard functions. Conditional on the frailty, a model is specified only for the marginal recurrent event process, hence avoiding the strong Poisson-type assumptions traditionally used. Analysis is based on estimating functions that allow for estimation of covariate effects on the marginal recurrent event rate and terminal event hazard. The method also permits estimation of the degree of association between the two processes. Closed-form asymptotic variance estimators are proposed. The proposed methods are evaluated through simulations to assess the applicability of the asymptotic results in finite samples, and to evaluate the sensitivity of the method to departures from its underlying assumptions. The methods are illustrated in an analysis of hospitalization data for patients in an international multi-center study of outcomes among peritoneal dialysis patients. This is joint work with Yining Ye and Jack Kalbfleisch.

Thursday, November 6, 2008

### Spatio-temporal Analysis via Generalized Additive Models

*Kung-Sik Chan, PhD
The University of Iowa*

The Generalized Additive Model (GAM) has been widely used in practice. However, the GAM assumes iid errors, which invalidates its use for many spatio-temporal data. For the latter kind of data, the Generalized Additive Mixed Model (GAMM) may be more appropriate. While there exist several approaches for estimating a GAMM, these approaches suffer from the problems of being numerically unstable or computer-intensive.

In this talk, I will discuss some recent joint work with Xiangming Fang. We develop an iterative algorithm for Penalized Maximum Likelihood (PML) and Restricted Penalized Maximum Likelihood (REML) estimation of a GAM with correlated errors. Although the new approach does not assume any specific correlation structure, the Matérn spatial correlation model is of particular interest, as motivated by our biological applications. As some of the Matérn parameters are not consistently estimable under the fixed domain asymptotics, situations for the spatio-temporal case are investigated, where the spatial design is assumed to be fixed with temporally independent repeated measurements and the spatial correlation structure does not change over time. Our theoretical investigation exploits the fact that penalized likelihood estimation can be given a Bayesian interpretation. The conditions under which the asymptotic posterior normality holds are discussed. We also develop a model diagnosis method for checking the assumption of independence across time for spatio-temporal data. In practice, selecting the best model is often of interest. A model selection criterion based on the Bayesian framework is proposed to compare different candidate models. The proposed methods are illustrated by simulation and a fisheries application.

Thursday, October 23, 2008

### Challenges in Joint Modeling of Longitudinal and Survival Data

*Jane-Ling Wang, PhD
Department of Statistics
University of California at Davis*

It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches to marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties.

We will discuss the challenges in this area and provide several solutions. One of the difficulties is with the likelihood approaches when the survival component is modeled semiparametrically as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches.

Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.

This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).

## Spring 2008 Biostatistics Colloquia

### Discovery of Latent Patterns in Disability Data and the Issue of Model Choice

*Tanzy Mae Love, PhD
Department of Statistics
Carnegie Mellon University*

Model choice is a major methodological issue in the explosive growth of data-mining models involving latent structure for clustering and classification. Here, we work from a general formulation of hierarchical Bayesian mixed-membership models and present several model specifications and variations, both parametric and nonparametric, in the context of learning the number of latent groups and associated patterns for clustering units. We elucidate strategies for comparing models and specifications by producing novel analyses of the following data set: data on functionally disabled American seniors from the National Long Term Care Survey.

Thursday, April 24, 2008

### Funding Opportunities at the National Science Foundation

*Grace Yang, PhD
Program Director, Statistics & Probability
National Science Foundation
Division of Mathematical Sciences*

Thursday, April 17, 2008

### Multiple Imputation Methods in Application of a Random Slope Coefficient Linear Model to Randomized Clinical Trial Data

*Moonseong Heo, PhD
Department of Psychiatry
Weill Medical College of Cornell University*

Two types of multiple imputation methods, proper and improper, for imputing missing not at random (MNAR) continuous data are considered in the context of attrition problems arising from antidepressant clinical trials, whose primary interest is to compare treatment effects on the declines in depressive symptoms over the study period. Both methods borrow information from completers' data to construct pseudo donor sampling distributions from which imputed values are drawn, but differ in characterizing those distributions. A joint likelihood of each method is constructed based on a selection model for missing data. Their performance was evaluated based on maximum likelihood estimates of a random slope coefficient model that fits the imputed data to test the treatment effect via modeling interaction between the treatment and the slope of depressive symptom decline. The following performance evaluation criteria were considered: bias, statistical power, root mean square error, coverage probability of the 95% confidence interval (CI), and width of the CI. The two methods are compared with other analytic strategies for incomplete data: completers-only data analysis, available observations analysis, and last observation carried forward (LOCF) analysis. A simulation study showed that the two multiple imputation methods have favorable results in bias, statistical power and width of the 95% CI, whereas the available observations analysis showed favorable results in bias, root mean square error and coverage rate. Completers-only analysis showed better results than the LOCF analysis. Those findings guided interpretation of results from an antidepressant trial for geriatric depression. Finally, a comparison with a sequential hot deck multiple imputation method in application to analysis with missing binary outcome from a recently completed antipsychotic trial will be discussed.

Wednesday, April 9, 2008

### Improved Measurement Modeling and Regression with Latent Variables

*Karen Bandeen-Roche, PhD
Professor of Biostatistics and Medicine
Johns Hopkins Bloomberg School of Public Health*

Latent variable models have long been utilized by behavioral scientists to summarize constructs that are represented by multiple measured variables or are difficult to measure, such as health practices and psychiatric syndromes. They have been regarded as particularly useful when variables that can be measured are highly imperfect surrogates for the construct of inferential interest, but they are also criticized as being overly abstract, weakly estimable, computationally intensive and sensitive to unverifiable modeling assumptions. My talk describes two lines of research to improve the utility of latent variable modeling, counterbalancing strengths and weaknesses. First, it reviews methods I have developed for assessing modeling assumptions and delineating the targets of parameter estimation in the case of maximum likelihood fitting, allowing for a mis-specified model. Then, it describes new strategies for developing measurement models for subsequent use in developing regression outcomes. One affords approximately unbiased estimation vis-à-vis full latent variable regression. A second counterbalances standard latent variable modeling assumptions (focused on internal validity of measurement) with alternative assumptions (say, focused on external or concurrent validation). Small sample performance properties are evaluated. The methods will be illustrated using data on post-traumatic stress disorder in a population-based sample and aging and adverse health in older adults. It is hoped that the findings will lead to improved usage of latent variable models in scientific investigations.

Thursday, April 3, 2008

### Branching Processes as Models of Progenitor Cell Populations and Estimation of the Offspring Distributions

In memory of Andrei Yakovlev

*Nikolay Yanev, PhD
Professor and Chair
Dept of Probability and Statistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences*

This paper considers two new models of reducible age-dependent branching processes with emigration in conjunction with estimation problems arising in cell biology. Methods of statistical inference are developed using the relevant embedded discrete branching structure. Based on observations of the branching process with emigration, estimators of the offspring probabilities are proposed for the hidden unobservable process without emigration, the latter being of prime interest to investigators. The problem under consideration is motivated by experimental data generated by time-lapse video-recording of cultured cells that provides abundant information on their individual evolutions and thus on the basic parameters of their life cycle in tissue culture. Some parameters, such as the mean and variance of the mitotic cycle time, can be estimated nonparametrically without resorting to any mathematical model of cell population kinetics. For other parameters, such as the offspring distribution, a model-based inference is needed. Age-dependent branching processes have proven to be useful models for that purpose. A special feature of the data generated by time-lapse experiments is the presence of censoring effects due to migration of cells out of the field of observation. For the time-to-event observations, such as the mitotic cycle time, the effects of data censoring can be accounted for by standard methods of survival analysis. No methods are available to accommodate such effects in the statistical inference on the offspring distribution. Within the framework of branching processes, the loss of cells to follow-up can be modeled as a process of emigration. Incorporating the emigration process into a pertinent branching model of cell evolution provides the basis for the proposed estimation techniques. The statistical inference on the offspring distribution is illustrated with an application to the development of oligodendrocytes in cell culture.

This talk is based on joint work with Drs. A. Yakovlev and V. Stoimenova.

Thursday, March 27, 2008

### Challenges in Joint Modeling of Longitudinal and Survival Data

*Jane-Ling Wang, PhD
Professor
University of California at Davis*

It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches to marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties. We will discuss the challenges in this area and provide several solutions. One of the difficulties is with the likelihood approaches when the survival component is modeled semiparametrically as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches. Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.

This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).

Thursday, March 6, 2008