## Fall 2014 Biostatistics Colloquia

### Clustering Tree-structured Data on Manifold

*Hongyu Miao, PhD
Department of Biostatistics and Computational Biology
University of Rochester*

Tree-structured data usually contain both topological and geometrical information, and must be considered on a manifold rather than in Euclidean space for appropriate data parameterization and analysis. In this study, we propose a novel parameterization of tree-structured data, called the Topology-Attribute matrix (T-A matrix), so that the clustering task can be conducted on a matrix manifold. We incorporate the structural constraints embedded in the data into the nonnegative matrix factorization method to determine meta-trees from the T-A matrix, and the signature vector of each single tree can then be extracted by meta-tree decomposition. The meta-tree space turns out to be a cone space, in which we explore the distance metric and implement the clustering algorithm based on concepts such as the Fréchet mean. Finally, the T-A matrix based clustering (TAMBAC) framework is evaluated and compared using both simulated data and real retinal images to illustrate its efficiency and accuracy.
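For readers unfamiliar with the Fréchet mean: it generalizes the ordinary mean to a metric space as the point minimizing the sum of squared distances to the sample. A minimal sketch over a finite candidate grid (here the metric space is just the real line, where the definition recovers the arithmetic mean; on the meta-tree cone the same definition applies with the tree distance):

```python
# Fréchet mean: the candidate point minimizing the sum of squared
# distances to the sample. Here the "space" is the real line with the
# usual metric, so the Fréchet mean coincides with the ordinary mean.
def frechet_mean(points, candidates, dist):
    return min(candidates, key=lambda c: sum(dist(c, p) ** 2 for p in points))

points = [1.0, 2.0, 6.0]
candidates = [x / 10 for x in range(0, 101)]  # grid on [0, 10]
mean = frechet_mean(points, candidates, dist=lambda a, b: abs(a - b))
# On the real line this recovers the arithmetic mean, 3.0.
```

In a general metric space (such as a cone of meta-trees) no closed form exists, which is why clustering algorithms search over candidates instead of averaging coordinates.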

Thursday, December 4, 2014

3:30 P.M.

Helen Wood Hall – 1W501 Classroom

### Walking, sliding, and detaching: time series analysis for kinesin

*John Fricks, PhD
Department of Statistics
Pennsylvania State University*

Kinesin is a molecular motor that, along with dynein, moves cargo such as organelles and vesicles along microtubules through axons. Studying these transport processes is vital, since non-functioning kinesin has been implicated in a number of neurodegenerative diseases, such as Alzheimer’s disease. Over the last twenty years, these motors have been extensively studied through in vitro experiments on single molecular motors using laser traps and fluorescence techniques. However, an open challenge has been to explain the in vivo behavior of these systems by incorporating data from in vitro experiments into straightforward models.

In this talk, I will discuss recent work with my experimental collaborator, Will Hancock (Penn State), to understand more subtle behavior of a single kinesin than has previously been studied, such as sliding and detachment, and how such behavior can contribute to our understanding of in vivo transport. The data include time series from fluorescence experiments on kinesin. In particular, we use novel applications of switching time series models to explain the shifts between different modes of transport.
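As a sketch of the kind of data such switching models describe (all displacement means and transition probabilities below are hypothetical; the talk concerns the inverse problem of inferring the hidden mode sequence from observed positions):

```python
import random

# Toy regime-switching time series for a motor: a hidden mode sets the
# mean displacement per step (hypothetical values), and the mode evolves
# as a Markov chain. Switching models fit the reverse: recover the modes.
random.seed(0)
means = {"walking": 8.0, "sliding": 0.0, "detached": -4.0}
switch = {"walking": ["walking"] * 8 + ["sliding", "detached"],
          "sliding": ["sliding"] * 6 + ["walking"] * 3 + ["detached"],
          "detached": ["detached"] * 9 + ["walking"]}

mode, pos, path = "walking", 0.0, []
for _ in range(200):
    pos += random.gauss(means[mode], 2.0)   # displacement given current mode
    path.append(pos)
    mode = random.choice(switch[mode])      # Markov transition between modes
```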

Thursday, November 13, 2014

3:30 P.M.

Helen Wood Hall – 1W502 Classroom

### Estimating Physical Activity with an Accelerometer

*John Staudenmayer, PhD
Department of Mathematics and Statistics
University of Massachusetts, Amherst*

Measurement of physical activity (PA) in a free-living setting is essential for several purposes: understanding why some people are more active than others, evaluating the effectiveness of interventions designed to increase PA, performing PA surveillance, and quantifying the relationship between PA dose and health outcomes. One way to estimate PA is to use an accelerometer (a small electronic device that records a time stamped record of acceleration) and a statistical model that predicts aspects of PA (such as energy expenditure, time walking, time sitting, etc.) from the acceleration signals. This talk will describe methods to do this. We will present several calibration studies where acceleration is measured concurrently with objective measurements of PA, describe the statistical models used to relate the two sets of measurements, and examine independent evaluations of the methods.
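A minimal sketch of the calibration idea, with made-up numbers: regress a criterion measurement of energy expenditure (METs) on an acceleration summary recorded concurrently, then use the fitted line to predict PA for new acceleration records.

```python
# Toy calibration: least-squares line relating an acceleration summary
# (e.g., mean absolute acceleration in a window) to concurrently measured
# energy expenditure (METs). All values are hypothetical.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b  # intercept, slope

counts = [0.1, 0.5, 1.0, 2.0]   # acceleration summaries
mets   = [1.2, 2.0, 3.0, 5.0]   # concurrent criterion measurements
a, b = fit_line(counts, mets)
predict = lambda c: a + b * c   # plug-in predictor for new wearers
```

Real calibration models are richer (nonlinear, activity-type-specific, validated on independent samples), but the structure is the same: learn the signal-to-PA map on labeled data, then apply it in the free-living setting.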

Thursday, October 30, 2014

3:30 P.M.

Helen Wood Hall – 1W502 Classroom

### The Importance of Simple Tests in High-Throughput Biology: Case Studies in Forensic Bioinformatics

*Keith A. Baggerly, PhD
Department of Bioinformatics and Computational Biology
University of Texas MD Anderson Cancer Center*

Modern high-throughput biological assays let us ask detailed questions about how diseases operate, and promise to let us personalize therapy. Careful data processing is essential, because our intuition about what the answers “should” look like is very poor when we have to juggle thousands of things at once. When documentation of such processing is absent, we must apply “forensic bioinformatics” to work from the raw data and reported results to infer what the methods must have been. We will present several case studies where simple errors may have put patients at risk. This work has been covered in both the scientific and lay press, and has prompted several journals to revisit the types of information that must accompany publications. We discuss steps we take to avoid such errors, and lessons that can be applied to large data sets more broadly.

Thursday, October 16, 2014

3:30 P.M.

Helen Wood Hall – 1W502 Classroom

### Transforming Antibiotic Stewardship Using Response Adjusted for Duration of Antibiotic Risk (RADAR): Using Endpoints to Analyze Patients Rather than Patients to Analyze Endpoints

*Scott R. Evans, PhD, MS
Department of Biostatistics
Harvard School of Public Health*

Unnecessary antibiotic (AB) use is unsafe, wasteful, and leads to emergence of AB resistance. AB stewardship trials are limited by noninferiority (NI) design complexities (e.g., NI margin selection, constancy assumption validity), competing risks of mortality and other outcomes, lack of patient-level interpretation incorporating benefits and harms, and feasibility issues (large sample sizes).

Response Adjusted for Days of Antibiotic Risk (RADAR) is a novel methodology for stewardship trial design that effectively addresses these challenges. RADAR utilizes a superiority design framework evaluating whether new strategies are better (in totality) than current strategies. RADAR has two steps: (1) creation of an ordinal overall clinical outcome variable incorporating important patient benefits, harms, and quality of life, and (2) construction of a desirability of outcome ranking (DOOR) in which (i) patients with better clinical outcomes receive higher ranks than patients with worse outcomes, and (ii) patients with similar clinical outcomes are ranked by AB exposure, with lower exposure achieving higher ranks. The sample size is based on a superiority test of ranks.
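The DOOR construction can be sketched directly. A minimal, hypothetical example (outcome coded so 1 is best; ties in clinical outcome broken by antibiotic days, with fewer days more desirable):

```python
# DOOR-style ranking sketch: sort patients by ordinal clinical outcome
# (1 = best), breaking ties by antibiotic exposure (fewer days ranks
# higher, i.e., is more desirable). Records are hypothetical.
patients = [
    {"id": "A", "outcome": 1, "ab_days": 10},
    {"id": "B", "outcome": 1, "ab_days": 3},
    {"id": "C", "outcome": 2, "ab_days": 5},
    {"id": "D", "outcome": 3, "ab_days": 0},
]
ranked = sorted(patients, key=lambda p: (p["outcome"], p["ab_days"]))
order = [p["id"] for p in ranked]  # most desirable first
```

A superiority analysis then compares the resulting rank distributions between treatment arms (e.g., with a rank-sum test).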

Conclusions: RADAR will transform and enhance clinical trials in antibacterial stewardship. RADAR avoids the complexities associated with NI trials (resulting in reduced sample sizes in many cases), alleviates competing-risk problems, provides more informative benefit:risk evaluation, and allows for patient-level interpretation. Researchers should consider using endpoints to analyze patients rather than patients to analyze endpoints in clinical trials.

Thursday, October 2, 2014

3:30 P.M.

Helen Wood Hall – 1W502 Classroom

## Spring 2014 Biostatistics Colloquia

### Teaching Statistics for the Future, the MOOC Revolution and Beyond

*Brian S. Caffo, PhD
Department of Biostatistics
Johns Hopkins University*

Massive open online courses (MOOCs) have become an international phenomenon in which millions of students access free educational materials from top universities. MOOC startups, such as Coursera, EdX and Udacity, are growing at an astounding rate. At the same time, industry predictions suggest a massive deficit of STEM graduates, particularly in statistics, to meet the growing demand for statistics, machine learning, data analysis, data science and big data expertise. The Johns Hopkins Department of Biostatistics has been a leading content generator for MOOCs, with over half a million students enrolled in just under a year and more MOOC courses than most universities. The new MOOC Data Science series (www.coursera.org/specialization/jhudatascience/1), created by Brian Caffo, Jeff Leek and Roger Peng, is a novel concept featuring a complete redesign of the standard statistical master’s program. Notably, it features a completely open-source educational model. This talk discusses MOOCs, development technology, financial models, and the future of statistics education. We end with a discussion of post-MOOC technology via a novel intelligent tutoring system called SWIRL (http://swirlstats.com/).

Thursday, May 15, 2014

### Linear differential equations: Their statistical history, their roles in dynamic systems, and some new tools

*Jim Ramsay, PhD
Department of Psychology
McGill University*

Most of the statistical literature on dynamic systems deals with a single time series with equally-spaced observations, focuses on its internal linear structure, and has little to say about causal factors and covariates. Work on stochastic time series and dynamic systems has adopted a “data emulation” approach to modelling that is foreign to most of statistical science. We need a more versatile approach that focuses on input/output systems unconstrained by how the data happen to be distributed.

A parameter estimation framework for fitting linear differential equations to data promises to extend and strengthen classical time series analysis in several directions. Its capacity to model forcing functions makes it especially suitable for input/output systems, including spread of disease models. A variety of examples will serve as illustrations.
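As a toy version of such parameter estimation (assuming noise-free observations and simple finite differencing; the framework in the talk is far more general and handles forcing functions): fit dx/dt = -beta * x by regressing difference-quotient slopes on the state.

```python
import math

# Toy parameter estimation for the linear ODE dx/dt = -beta * x.
# Observe x(t) = exp(-0.5 t) on a fine grid, approximate dx/dt by finite
# differences, and estimate beta by least squares through the origin.
beta_true = 0.5
h = 0.01
ts = [i * h for i in range(101)]
xs = [math.exp(-beta_true * t) for t in ts]

num = den = 0.0
for i in range(len(ts) - 1):
    slope = (xs[i + 1] - xs[i]) / h     # finite-difference derivative
    xmid = 0.5 * (xs[i] + xs[i + 1])    # state value at the midpoint
    num += slope * xmid
    den += xmid * xmid
beta_hat = -num / den                   # least-squares estimate of beta
```

With noisy or sparsely sampled data this naive differencing breaks down, which is one motivation for the principled estimation framework described in the talk.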

Thursday, April 10, 2014

### Convex Banding of the Covariance Matrix

*Jacob Bien, PhD
Department of Statistical Science
Cornell University*

We introduce a sparse and positive definite estimator of the covariance matrix designed for high-dimensional situations in which the variables have a known ordering. Our estimator is the solution to a convex optimization problem that involves a hierarchical group lasso penalty. We show how it can be efficiently computed, compare it to other methods such as tapering by a fixed matrix, and develop several theoretical results that demonstrate its strong statistical properties. Finally, we show how using convex banding can improve the performance of high-dimensional procedures such as linear and quadratic discriminant analysis.
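For intuition, the simplest estimator exploiting a known variable ordering is hard banding with a fixed bandwidth; the convex approach replaces this fixed choice with a hierarchical group lasso penalty that selects the bandwidth adaptively while preserving positive definiteness. A sketch of the fixed-bandwidth version only (hypothetical sample covariance):

```python
# Hard banding: zero out covariance entries more than `bandwidth` steps
# off the diagonal. The convex-banding estimator refines this idea,
# choosing how far each row extends via a hierarchical group lasso penalty.
def band(matrix, bandwidth):
    n = len(matrix)
    return [[matrix[i][j] if abs(i - j) <= bandwidth else 0.0
             for j in range(n)] for i in range(n)]

S = [[4.0, 1.2, 0.3, 0.1],
     [1.2, 4.0, 1.2, 0.3],
     [0.3, 1.2, 4.0, 1.2],
     [0.1, 0.3, 1.2, 4.0]]
S_banded = band(S, bandwidth=1)  # keeps the main and first off-diagonals
```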

Thursday, March 27, 2014

### Statistical techniques for the normalization and segmentation of structural MRI

*Russell Taki Shinohara, PhD
Perelman School of Medicine
University of Pennsylvania*

While computed tomography and other imaging techniques are measured in absolute units with physical meaning, magnetic resonance images are expressed in arbitrary units that are difficult to interpret and differ between study visits and subjects. Much work in the image processing literature has centered on histogram matching and other histogram mapping techniques, but little focus has been on normalizing images to have biologically interpretable units. We explore this key goal for statistical analysis and the impact of normalization on cross-sectional and longitudinal segmentation of pathology.

Thursday, March 6, 2014

### Estimating the average treatment effect on mean survival time when treatment is time-dependent and censoring is dependent

*Douglas E. Schaubel, PhD
Department of Biostatistics
University of Michigan*

We propose methods for estimating the average difference in restricted mean survival time attributable to a time-dependent treatment. In the data structure of interest, the time until treatment is received and the pre-treatment death hazard are both heavily influenced by a longitudinal process. In addition, subjects may experience periods of treatment ineligibility. The pre-treatment death hazard is modeled using inverse weighted partly conditional methods, while the post-treatment hazard is handled through Cox regression. Subject-specific differences in pre- versus post-treatment survival are estimated, then averaged in order to estimate the average treatment effect among the treated. Asymptotic properties of the proposed estimators are derived and evaluated in finite samples through simulation. The proposed methods are applied to liver failure data obtained from a national organ transplant registry. This is joint work with Qi Gong.
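Restricted mean survival time, the quantity being contrasted here, is the area under the survival curve up to a horizon tau; for a step-function survival estimate it reduces to a finite sum. A minimal sketch with a hypothetical curve:

```python
# Restricted mean survival time: area under a step survival curve S(t)
# up to horizon tau. Event times and survival probabilities are hypothetical.
def rmst(times, surv, tau):
    """times: increasing event times; surv[i] is S(t) on [times[i], times[i+1]);
    S(t) = 1 before the first event time."""
    area, prev, s = 0.0, 0.0, 1.0
    for t, st in zip(times, surv):
        if t >= tau:
            break
        area += s * (t - prev)
        prev, s = t, st
    return area + s * (tau - prev)

# S(t) = 1 on [0,1), 0.8 on [1,2), 0.5 on [2,3)
value = rmst(times=[1.0, 2.0, 3.0], surv=[0.8, 0.5, 0.5], tau=3.0)
```

The treatment effect in the abstract is an average of subject-specific differences between such areas under pre- and post-treatment survival curves.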

Thursday, January 23, 2014

## Fall 2013 Biostatistics Colloquia

### Statistical Learning for Complex Data: Targeted Local Classification and Logic Rules

*Yuanjia Wang, PhD
Department of Biostatistics
Mailman School of Public Health
Columbia University*

We discuss two statistical learning methods to build effective classification models for predicting at-risk subjects and diagnosing disease. In the first example, we develop methods to predict whether pre-symptomatic individuals are at risk of a disease based on their marker profiles, which offers an opportunity for early intervention well before a definitive clinical diagnosis. For many diseases, the risk of disease varies with some marker of biological importance, such as age, and the markers themselves may be age dependent. To identify effective prediction rules using nonparametric decision functions, standard statistical learning approaches treat markers with clear biological importance (e.g., age) and other markers without prior knowledge on disease etiology interchangeably as input variables. These approaches may therefore be inadequate in singling out and preserving the effects of the biologically important variables, especially in the presence of high-dimensional markers. Using age as an example of a salient marker that receives special care in the analysis, we propose a local smoothing large-margin classifier to construct effective age-dependent classification rules. The method adaptively adjusts for the age effect and separately tunes age and other markers to achieve optimal performance. We apply the proposed method to a registry of individuals at risk for Huntington's disease (HD) and controls to construct age-sensitive predictive scores for the risk of receiving an HD diagnosis during the study period in premanifest individuals. In the second example, we develop methods for building formal diagnostic criteria sets for a new psychiatric disorder introduced in the recently released fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). The methods take into account the unique logic structure of DSM-like criteria sets and domain knowledge from experts’ opinions.

Thursday, November 7, 2013

3:30 P.M.

Helen Wood Hall – 1W502 Classroom

### Delta what? Choice of Outcome Scale in Non-Inferiority Trials

*Rick Chappell, PhD
Department of Statistics
Department of Biostatistics and Medical Informatics
University of Wisconsin*

Non-inferiority (equivalence) trials are clinical experiments which attempt to show that one intervention is not too much inferior to another on some quantitative scale. The cutoff value is commonly denoted as Delta. For example, one might wish to show that the hazard ratio of disease-free survival among patients given an experimental chemotherapy versus a currently approved regimen is Delta = 1.3 or less, especially if the former is thought to be less toxic than or otherwise advantageous over the latter.

Naturally, a lot of attention is given to the choice of Delta. Beyond this, I assert that, even more than in superiority trials, the scale on which Delta is defined in equivalence trials must be carefully chosen. Since null hypotheses in superiority studies generally imply no effect, they are often identical, or at least compatible, when formulated on different scales. However, nonzero Deltas on one scale usually conflict with those on another. For example, the four hypotheses of arithmetic or multiplicative differences of either survival or hazard in general all mean different things unless Delta = 0 for differences or 1 for ratios. This can lead to problems in interpretation when the clinically natural scale is not a statistically convenient one.
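A quick numeric sketch of this conflict between scales (all numbers hypothetical): under proportional hazards, S1(t) = S0(t) raised to the hazard ratio, so a fixed hazard-ratio margin of 1.3 implies different survival-difference margins at different control survival levels.

```python
# Under proportional hazards, S1(t) = S0(t) ** HR. A fixed hazard-ratio
# margin of 1.3 implies different survival-probability margins at
# different control survival levels, so the two scales cannot agree.
def surv_margin(s0, hr=1.3):
    return s0 - s0 ** hr  # implied arithmetic survival-difference margin

high = surv_margin(0.9)  # small implied margin when control survival is 90%
low = surv_margin(0.5)   # much larger implied margin at 50% control survival
```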

Thursday, October 24, 2013

3:30 P.M.

Helen Wood Hall – 1W502 Classroom

## Spring 2013 Biostatistics Colloquia

### Revolutionizing Policy Analysis Using "Big Data" Analytics

*Siddhartha R. Dalal, PhD
RAND Corporation and Columbia University*

As policy analysis becomes applicable to new domains, it is being challenged by the “curse of dimensionality”—that is, the vastness of available information, the need for increasingly detailed and delicate analysis, and the speed with which new analysis is needed and old analysis must be refreshed. Moreover, with the proliferation of digital information available at one’s fingertips, and the expectation that this information be quickly leveraged, policy analysis in these new domains is being handicapped without scalable methods.

I will describe the results of a new initiative I started at RAND which developed new methods to apply to these “big data” problems to create new information and to convert enormous amounts of existing information into the knowledge needed for policy analysis. These methods draw on social interaction models, information analytics, and web technologies that have already revolutionized research in other areas. The specific examples include applications in medical informatics to find adverse effects of drugs and chemicals, terrorism analysis, and improvement in the efficiency of communicating individually tailored, policy-relevant information directly to policymakers. On the surface, traditional information theoretic considerations do not offer solutions. Accordingly, researchers looking for conventional solutions would have difficulty in solving these problems. I will describe how alternative formulations based on statistical underpinnings including Bayesian methods, sequential stopping and combinatorial designs have played a critical role in addressing these challenges.

Tuesday, June 4, 2013

Biographical Sketch: Siddhartha Dalal is currently Adjunct Professor at RAND Corporation and at Columbia University. Prior to this, he was Chief Technology Officer at RAND, and Vice President of Research at Xerox. Sid’s industrial research career began at Math Research Center at Bell Labs followed by Bellcore/Telcordia Technologies. He has co-authored over 100 publications, patents and monographs covering the areas of medical informatics, risk analysis, image processing, stochastic optimization, data/document mining, software engineering and Bayesian methods.

### Efficient and optimal estimation of a marginal mean in a dynamic infusion length regimen

*Brent A. Johnson, PhD
Department of Biostatistics and Bioinformatics
Rollins School of Public Health
Emory University*

In post-operative medical care, some drugs are administered intravenously through an infusion pump. For example, in infusion studies following heart surgery, anti-coagulants are delivered intravenously for many hours, even days while a patient recovers. A common primary endpoint of infusion studies is to compare two or more infusion drugs or rates and one can employ standard statistical analyses to address the primary endpoint in an intent-to-treat analysis. However, the presence of infusion-terminating events can adversely affect the analysis of primary endpoints and complicate statistical analyses of secondary endpoints. In this talk, I will focus on a popular secondary analysis of evaluating infusion lengths. The analysis is complicated due to presence or absence of infusion-terminating events and potential time-dependent confounding in treatment assignment. I will show how the theory of dynamic treatment regimes lends itself to this problem and offers a principled approach to construct adaptive, personalized infusion length policies. I will present some recent achievements that allow one to construct an improved, doubly-robust estimator for a particular class of nonrandom dynamic treatment regimes. All techniques will be exemplified through the ESPRIT infusion trial data from Duke University Medical Center.

Thursday, April 18, 2013

### Flexible Modeling of Medical Cost Data

*Lei Liu, PhD
Department of Preventive Medicine
Northwestern University School of Medicine*

Medical cost data are often skewed to the right and heteroscedastic, having a nonlinear relation with covariates. To tackle these issues, we consider an extension to generalized linear models by assuming nonlinear covariate effects in the mean function and allowing the variance to be an unknown but smooth function of the mean. We make no further assumption on the distributional form. The unknown functions are described by penalized splines, and the estimation is carried out using nonparametric quasi-likelihood. Simulation studies show the flexibility and advantages of our approach. We apply the model to the annual medical costs of heart failure patients in the clinical data repository (CDR) at the University of Virginia Hospital System. We also discuss how to adopt this modeling framework in correlated medical costs data.

Thursday, March 28, 2013

## Fall 2012 Biostatistics Colloquia

### Optimization of Dynamic Treatment Regimes for Recurrent Diseases

*Xuelin Huang, PhD
Department of Biostatistics
University of Texas MD Anderson Cancer Center*

Patients with a non-curable disease, such as many types of cancer, usually go through an initial treatment, a variable number of disease recurrences, and salvage treatments. Such multistage treatments are inevitably dynamic: the choice of the next treatment depends on the patient's response to previous therapies. Dynamic treatment regimes (DTRs) are routinely used in clinics, but are rarely optimized. A systematic optimization of DTRs is highly desirable, but it poses immense challenges for statisticians given their complex nature. Our approach is to optimize by backward induction. That is, we first optimize the treatments for the last stage, conditional on patient treatment and response history. Then, by induction, after optimization for stage k is done, we plug the optimized results for stages k and beyond into stage k-1, and assume the resulting optimized survival time from stage k-1 follows an accelerated failure time (AFT) model. The optimization of treatments at stage k-1 is then carried out under the assumed AFT model, and this process repeats until the optimization for the first stage is completed. In this way, the effects of different treatments at each stage on survival can be consistently estimated and fairly compared, and the overall optimal DTR for each patient can be identified. Simulation studies show that the proposed method performs well and is useful in practical situations. The proposed method is applied to a study of acute myeloid leukemia to identify the optimal treatment strategies for different subgroups of patients. Potential problems, alternative models, and optimization of the estimation methods are also discussed.
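The backward-induction recipe can be sketched with a toy two-stage example, using hypothetical expected-survival values in place of fitted AFT models:

```python
# Backward induction for a two-stage dynamic treatment regime (toy).
# q2[state][treatment] stands in for the fitted stage-2 model; the
# values are hypothetical expected remaining survival times.
q2 = {"remission": {"a": 5.0, "b": 3.0}, "relapse": {"a": 1.0, "b": 2.0}}
best2 = {s: max(acts, key=acts.get) for s, acts in q2.items()}  # stage-2 rule

# Stage 1: each treatment induces a distribution over stage-2 states;
# plug in the *optimized* stage-2 values when comparing stage-1 options.
transition = {"x": {"remission": 0.7, "relapse": 0.3},
              "y": {"remission": 0.4, "relapse": 0.6}}
value1 = {t: sum(p * q2[s][best2[s]] for s, p in probs.items())
          for t, probs in transition.items()}
best1 = max(value1, key=value1.get)  # optimal first-stage treatment
```

The key point mirrors the abstract: stage-1 options are compared assuming optimal behavior at later stages, not the treatment mix actually observed in the data.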

Thursday, December 13, 2012

### Testing with Correlated Data in Genome-wide Association Studies

*Elizabeth Schifano, PhD
Department of Statistics
University of Connecticut*

The complexity of the human genome makes it challenging to identify genetic markers associated with clinical outcomes. This identification is further complicated by the vast number of available markers, the majority of which are unrelated to outcome. As a consequence, the standard assessment of individual (marginal) marker effects on a single outcome is often ineffective. It is thus desirable to borrow information and strength from the large amounts of observed data to develop more powerful testing strategies. In this talk, I will discuss testing procedures that capitalize on various forms of correlation observed in genome-wide association studies.

This is joint work with Dr. Xihong Lin (Harvard School of Public Health).

Thursday, November 29, 2012

### Misspecification of Cox Regression Models with Composite Endpoints

*Richard J. Cook, PhD
Department of Statistics and Actuarial Science
University of Waterloo*

Researchers routinely adopt composite endpoints in multicenter randomized trials designed to evaluate the effect of experimental interventions in cardiovascular disease, diabetes, and cancer. Despite their widespread use, relatively little attention has been paid to the statistical properties of estimators of treatment effect based on composite endpoints. We consider this here in the context of multivariate models for time to event data in which copula functions link marginal distributions with a proportional hazards structure. We then examine the asymptotic and empirical properties of the estimator of treatment effect arising from a Cox regression model for the time to the first event. We point out that even when the treatment effect is the same for the component events, the limiting value of the estimator based on the composite endpoint is usually inconsistent for this common value. We find that in this context the limiting value is determined by the degree of association between the events, the stochastic ordering of events, and the censoring distribution. Within the framework adopted, marginal methods for the analysis of multivariate failure time data yield consistent estimators of treatment effect and are therefore preferred. We illustrate the methods by application to a recent asthma study.

This is joint work with Longyang Wu.

Thursday, October 25, 2012

## Spring 2012 Biostatistics Colloquia

### Growth Trajectories and Bayesian Inverse Problems

*Ian McKeague, PhD
Mailman School of Public Health
Columbia University*

Growth trajectories play a central role in life course epidemiology, often providing fundamental indicators of prenatal or childhood development, as well as an array of potential determinants of adult health outcomes. Statistical methods for the analysis of growth trajectories have been widely studied, but many challenging problems remain. Repeated measurements of length, weight and head circumference, for example, may be available on most subjects in a study, but usually only sparse temporal sampling of such variables is feasible. It can thus be challenging to gain a detailed understanding of growth velocity patterns, and smoothing techniques are inevitably needed. Moreover, the problem is exacerbated by the presence of large fluctuations in growth velocity during early infancy, and high variability between subjects. Existing approaches, however, can be inflexible due to a reliance on parametric models, and require computationally intensive methods that are unsuitable for exploratory analyses. This talk introduces a nonparametric Bayesian inversion approach to such problems, along with an R package that implements the proposed method.

Thursday, April 19, 2012

### Study Design and Statistical Inference for Data from an Outcome Dependent Sampling Scheme with a Continuous Outcome

*Haibo Zhou, PhD
Department of Biostatistics
University of North Carolina at Chapel Hill*

Outcome dependent sampling (ODS) schemes can be cost-effective ways to enhance study efficiency. The case-control design has been widely used in epidemiologic studies. However, when the outcome is measured on a continuous scale, dichotomizing it can lead to a loss of efficiency. Recent epidemiologic studies have used ODS schemes in which, in addition to an overall random sample, a number of supplemental samples are collected based on a continuous outcome variable. We consider a semiparametric empirical likelihood inference procedure in which the underlying distribution of covariates is treated as a nuisance parameter and left unspecified. The proposed estimator has asymptotic normality properties. The likelihood ratio statistic using the semiparametric empirical likelihood function has Wilks-type properties in that, under the null, it follows a chi-square distribution asymptotically and is independent of the nuisance parameters. Simulation results indicate that, for data obtained under an ODS design, the proposed estimator is more efficient than competing estimators based on the same sample size.

A data set from the Collaborative Perinatal Project (CPP) is used to illustrate the proposed method, assessing the relationship between maternal polychlorinated biphenyl (PCB) exposure and children’s IQ test performance.

Thursday, March 8, 2012

### HARK: A New Method for Regression with Functional Predictors, with Application to the Sleep Heart Health Study

*Dawn Woodard, PhD
Assistant Professor
Operations Research and Information Engineering
Cornell University*

We propose a new method for regression using a parsimonious and scientifically interpretable representation of functional predictors. Our approach is designed for data that exhibit features such as spikes, dips, and plateaus whose frequency, location, size, and shape vary stochastically across subjects. Our method is motivated by the goal of quantifying the association between sleep characteristics and health outcomes, using a large and complex dataset from the Sleep Heart Health Study. We propose Bayesian inference of the joint functional and exposure models, and give a method for efficient computation. We contrast our approach with existing state-of-the-art methods for regression with functional predictors, and show that our method is more effective and efficient for data that include features occurring at varying locations.

Thursday, February 9, 2012

### Choice of Optimal Estimators in Structural Nested Mean Models With Application to Initiating HAART in HIV Positive Patients After Varying Duration of Infection

*Judith Lok, PhD
Assistant Professor
Department of Biostatistics
Harvard School of Public Health*

We estimate how the effect of a fixed duration of antiretroviral treatment depends on the time from HIV infection to initiation of treatment, using observational data. A major challenge in making inferences from such observational data is that treatment is not randomly assigned; for example, if the time of initiation depends on disease status, this dependence will induce bias in the estimation of the effect of interest. Previously, Lok and De Gruttola developed a new class of Structural Nested Mean Models to estimate this effect. This led to a large class of consistent, asymptotically normal estimators, under the assumption that all confounders are measured. However, estimates and standard errors turn out to depend significantly on the choice of estimator within this class, motivating the study of optimal ones. We will present an explicit solution for the choice of optimal estimators under some extra conditions. In the absence of those extra conditions, the resulting estimator is still consistent and asymptotically normal, although possibly not optimal. This estimator is also doubly robust: it is consistent and asymptotically normal not only if the model for treatment initiation is correct, but also if a certain outcome-regression model is correct.

We illustrate our methods using the AIEDRP (Acute Infection and Early Disease Research Program) Core01 database on HIV. Delaying the initiation of HAART has the advantage of postponing onset of adverse events or drug resistance, but may also lead to irreversible immune system damage. Application of our methods to observational data on treatment initiation will help provide insight into these tradeoffs. The current interest in using treatment to control epidemic spread heightens interest in these issues, as early treatment can only be ethically justified if it benefits individual patients, regardless of the potential for community-wide benefits.

This is joint work with Victor De Gruttola, Ray Griner, and James Robins.

Thursday, January 19, 2012

## Fall 2011 Biostatistics Colloquia

### Conditional Inference Functions for Mixed-Effects Models with Unspecified Random-Effects Distribution

*Annie Qu, PhD
Professor
Department of Biostatistics
University of Illinois at Urbana-Champaign*

In longitudinal studies, mixed-effects models are important for addressing subject-specific effects. However, most existing approaches assume a normal distribution for the random effects, and this could affect the bias and efficiency of the fixed-effects estimator. Even in cases where the estimation of the fixed effects is robust with a misspecified distribution of the random effects, the estimation of the random effects could be invalid. We propose a new approach to estimate fixed and random effects using conditional quadratic inference functions. The new approach does not require the specification of likelihood functions or a normality assumption for random effects. It can also accommodate serial correlation between observations within the same cluster, in addition to mixed-effects modeling. Other advantages include not requiring the estimation of the unknown variance components associated with the random effects, or the nuisance parameters associated with the working correlations. Real data examples and simulations are used to compare the new approach with the penalized quasi-likelihood approach, and SAS GLIMMIX and nonlinear mixed effects model (NLMIXED) procedures.

This is joint work with Peng Wang and Guei-feng Tsai.

Thursday, November 17, 2011

### Collection, Analysis and Interpretation of Dietary Intake Data

*Alicia L. Carriquiry*

*Professor of Statistics*

Iowa State University

The United States government spends billions of dollars each year on food assistance programs, on food safety and food labeling efforts, and in general on interventions and other activities with the goal of improving the nutritional status of the population. To do so, the government relies on large, nationwide food consumption and health surveys that are carried out regularly.

Of interest to policy makers, researchers and practitioners is the usual intake of a nutrient or other food components. The distribution of usual intakes in population sub-groups is also of interest, as is the association between consumption and health outcomes. Today we focus on the estimation and interpretation of distributions of nutrient intake and on their use for policy decision-making.

From a statistical point of view, estimating the distribution of usual intakes of a nutrient or other food components is challenging. Usual intakes are unobservable in practice and are subject to large measurement error, skewness and other survey-related effects. The problem of estimating usual nutrient intake distributions can therefore be thought of as the problem of estimating the density of a non-normal random variable that is observed with error. We describe what is now considered to be the standard approach for estimation and will spend some time discussing problems in this area that remain to be addressed. We use data from the most recent NHANES survey to illustrate the methods and provide examples.
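The shrinkage idea behind estimating usual intakes from noisy daily measurements can be illustrated with a deliberately simplified calculation (a toy sketch in the spirit of variance-components approaches, not the standard method the talk describes; the function name is hypothetical): each person's observed mean is pulled toward the grand mean by an amount governed by the ratio of between-person to total variance.

```python
from statistics import fmean, variance

def shrunken_usual_intakes(replicates):
    """Shrink per-person mean intakes toward the grand mean.

    replicates: list of per-person daily intake lists, each with the
    same number k (>= 2) of replicate days.
    """
    k = len(replicates[0])
    person_means = [fmean(days) for days in replicates]
    grand_mean = fmean(person_means)
    # Pooled within-person (day-to-day) variance.
    within = fmean([variance(days) for days in replicates])
    # Between-person variance: variance of person means minus the
    # within-person contribution, truncated at zero.
    between = max(variance(person_means) - within / k, 0.0)
    denom = between + within / k
    shrink = between / denom if denom > 0 else 0.0
    return [grand_mean + shrink * (m - grand_mean) for m in person_means]

# With no day-to-day variation, person means are returned unchanged:
print(shrunken_usual_intakes([[1.0, 1.0], [3.0, 3.0]]))  # [1.0, 3.0]
```

The larger the day-to-day noise relative to true between-person variation, the more the estimated usual-intake distribution is pulled toward the grand mean, which is the sense in which the observed distribution is "deconvolved" of measurement error.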

Thursday, November 3, 2011

### Merging Surveillance Cohorts in HIV/AIDS Studies

*Peter X. K. Song, PhD*

*Professor of Biostatistics*

Department of Biostatistics

University of Michigan School of Public Health

Rising HIV/AIDS prevalence in China has become a serious public health concern in recent years. Data from established surveillance networks across the country have provided timely information for intervention, control and prevention. In this talk, I will focus on the study population of drug injection users in Sichuan province over the years 2006-2009, and the evaluation of HIV prevalence across regions in this province. In particular, I will introduce a newly developed estimating equation approach to merging clustered/longitudinal cohort study datasets, which enabled us not only to effectively detect risk factors associated with worsening prevalence rates but also to estimate the effect sizes of the detected risk factors. Both simulation studies and real data analysis will be presented.

Thursday, October 13, 2011

## Spring 2011 Biostatistics Colloquia

### Independent Component Analysis Involving Autocorrelated Sources with an Application to Functional Magnetic Resonance Imaging

*Haipeng Shen, PhD
Department of Statistics & Operations Research
University of North Carolina at Chapel Hill*

Independent component analysis (ICA) is an effective data-driven method for blind source separation. It has been successfully applied to separate source signals of interest from their mixtures. Most existing ICA procedures are carried out by relying solely on the estimation of the marginal density functions. In many applications, correlation structures within each source also play an important role besides the marginal distributions. One important example is functional magnetic resonance imaging (fMRI) analysis where the brain-function-related signals are temporally correlated.

I shall talk about a novel ICA approach that fully exploits the correlation structures within the source signals. Specifically, we propose to estimate the spectral density functions of the source signals instead of their marginal density functions. Our methodology is described and implemented using spectral density functions from frequently used time series models such as ARMA processes. The time series parameters and the mixing matrix are estimated via maximizing the Whittle likelihood function. The performance of the proposed method will be illustrated through extensive simulation studies and a real fMRI application. The numerical results indicate that our approach outperforms several popular methods including the most widely used fastICA algorithm.
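To make the estimation criterion concrete, here is a minimal sketch of the Whittle likelihood for a single AR(1) source (a simplified, single-signal illustration of the quantity being maximized, not the speaker's full ICA implementation; all function names are illustrative): the periodogram is compared with the model spectral density at the Fourier frequencies.

```python
import cmath
import math

def periodogram(x):
    """Periodogram I(w_j) at Fourier frequencies w_j = 2*pi*j/n."""
    n = len(x)
    out = []
    for j in range(1, n // 2 + 1):
        w = 2 * math.pi * j / n
        d = sum(x[t] * cmath.exp(-1j * w * t) for t in range(n))
        out.append((w, abs(d) ** 2 / (2 * math.pi * n)))
    return out

def ar1_spectrum(w, phi, sigma2):
    """Spectral density of an AR(1) process x_t = phi * x_{t-1} + e_t."""
    return sigma2 / (2 * math.pi * abs(1 - phi * cmath.exp(-1j * w)) ** 2)

def whittle_neg_loglik(x, phi, sigma2):
    """Negative Whittle log-likelihood: sum of log f(w_j) + I(w_j)/f(w_j)."""
    total = 0.0
    for w, i_w in periodogram(x):
        f_w = ar1_spectrum(w, phi, sigma2)
        total += math.log(f_w) + i_w / f_w
    return total
```

Maximizing the Whittle likelihood amounts to minimizing `whittle_neg_loglik` over the time series parameters; in the setting of the talk this is done jointly with the mixing matrix across all sources.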

Thursday, May 19, 2011

### Identification of treatment responders and non-responders via a multivariate growth curve latent class model

*Mary D. Sammel, ScD
Department of Biostatistics and Epidemiology
University of Pennsylvania School of Medicine *

In many clinical studies, the disease of interest is multi-faceted and multiple outcomes are needed to adequately characterize the disease or its severity. In such studies, it is often difficult to determine what constitutes improvement due to the multivariate nature of the response. Furthermore, when the disease of interest has an unknown etiology and/or is primarily a symptom-defined syndrome, there is potential for the study population to be heterogeneous with respect to their symptom profiles. Identification of population subgroups is of interest as it may enable clinicians to provide targeted treatments or develop accurate prognoses. We propose a multivariate growth curve latent class model that groups subjects based on multiple outcomes measured repeatedly over time. These groups or latent classes are characterized by distinctive longitudinal profiles of a latent variable which is used to summarize the multivariate outcomes at each point in time. The mean growth curve for the latent variable in each class defines the features of the class. We develop this model for any combination of continuous, binary, ordinal or count outcomes within a Bayesian hierarchical framework. Simulation studies are used to validate the estimation procedures. We apply our models to data from a randomized clinical trial evaluating the efficacy of Bacillus Calmette-Guerin in treating symptoms of interstitial cystitis (IC), where we are able to identify a class of subjects who were not responsive to treatment, and a class of subjects where treatment was effective in reducing symptoms over time.

Thursday, May 12, 2011

### Recent Development on a Pseudolikelihood Method for Analysis of Multivariate Data with Nonignorable Nonresponse

*Gong Tang, PhD*

*Assistant Professor of Biostatistics*

University of Pittsburgh Graduate School of Public Health

Consider regression analysis on data with nonignorable nonresponse; standard methods require modeling the nonresponse mechanism. Tang, Little and Raghunathan (2003) proposed a pseudolikelihood method for analysis of data with a class of nonignorable nonresponse mechanisms without modeling the nonresponse mechanism, and extended it to multivariate monotone data with nonignorable nonresponse. In the multivariate case, the joint distribution of the response variables was factored into a product of conditional distributions, and the pseudolikelihood estimates of the conditional distribution parameters were shown to be asymptotically normal. However, these estimates were based on different subsets of the data, dictated by the missing-data pattern, and their joint distribution was unclear. Here we provide a modification of the likelihood functions and derive the asymptotic joint distribution of these estimates. We also consider an imputation approach for this pseudolikelihood method. Usual imputation approaches impute the missing values and summarize via multiple imputations. Without knowing or modeling the nonresponse mechanism in our setting, the missing values cannot be predicted. We propose a novel approach that imputes the necessary sufficient statistics to circumvent this barrier.

Thursday, March 10, 2011

### Asymptotic Properties of Permutation Tests for ANOVA Designs

*John Kolassa, PhD*

*Professor of Statistics*

Rutgers University

We show that, under mild conditions that permit application to the bootstrap, permutation, and rank statistics used in multiparameter cases, the integral of the formal saddlepoint density approximation gives an approximation, with relative error of order 1/n, to the tail probability of a likelihood ratio-like statistic. This then permits the approximation to be put into a form analogous to those given by Lugannani-Rice or Barndorff-Nielsen.
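For reference, the Lugannani-Rice form of a saddlepoint tail approximation can be written, in its generic univariate version (a standard textbook statement, not the speakers' multiparameter result), in terms of the cumulant generating function K of the statistic and the saddlepoint ŝ solving K'(ŝ) = x:

```latex
% \Phi and \phi denote the standard normal CDF and density.
\begin{aligned}
w &= \operatorname{sgn}(\hat{s})\,\sqrt{2\{\hat{s}\,x - K(\hat{s})\}}, \qquad
u = \hat{s}\,\sqrt{K''(\hat{s})},\\
P(X \ge x) &\approx 1 - \Phi(w) + \phi(w)\left(\frac{1}{u} - \frac{1}{w}\right).
\end{aligned}
```

For a mean of n i.i.d. terms this approximation has relative error of order 1/n, matching the rate quoted in the abstract.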

*This is joint work with John Robinson, University of Sydney*

Thursday, February 17, 2011

## Fall 2010 Biostatistics Colloquia

### Hierarchical Commensurate and Power Prior Models for Adaptive Incorporation of Historical Information in Clinical Trials

*Bradley Carlin, PhD*

*Professor and Head, Division of Biostatistics*

*University of Minnesota School of Public Health*

Bayesian clinical trial designs offer the possibility of a substantially reduced sample size, increased statistical power, and reductions in cost and ethical hazard. However, when prior and current information conflict, Bayesian methods can lead to higher than expected Type I error, as well as the possibility of a costlier and lengthier trial. This motivates an investigation of the feasibility of hierarchical Bayesian methods for incorporating historical data that are adaptively robust to prior information that reveals itself to be inconsistent with the accumulating experimental data. In this paper, we present novel modifications to the traditional hierarchical modeling approach that allow the commensurability of the information in the historical and current data to determine how much historical information is used. We describe the method in the Gaussian case, but then add several important extensions, including the ability to incorporate covariates, random effects, and non-Gaussian likelihoods (especially for binary and time-to-event data). We compare the frequentist performance of our methods as well as existing, more traditional alternatives using simulation, calibrating our methods so they could be feasibly employed in FDA-regulated trials. We also give an example in a colon cancer trial setting where our proposed design produces more precise estimates of the model parameters, in particular conferring statistical significance to the observed reduction in tumor size for the experimental regimen as compared to the control. Finally, we indicate how the method may be combined with adaptive randomization to further increase its utility.

Tuesday, December 7, 2010

### Risk Prediction Models from Genome Wide Association Data

*Hongyu Zhao, PhD
Professor of Public Health (Biostatistics)
Professor of Genetics and of Statistics
Yale School of Public Health*

Recent genome wide association studies have identified many genetic variants affecting complex human diseases. It is of great interest to build disease risk prediction models based on these data. In this presentation, we will present the statistical challenges in using genome wide association data for risk predictions, and discuss different methods through both simulation studies and applications to real-world data. This is joint work with Jia Kang and Judy Cho.

Thursday, December 2, 2010

**Delaying Time-to-Event in Parkinson’s Disease**

*Nick Holford, PhD
Professor*

*Department of Pharmacology and Clinical Pharmacology*

*University of Auckland, New Zealand*

There are two reasons for studying the time course of disease status as a predictor in time-to-event analysis. Firstly, it is well understood that pharmacologic treatments may influence both the time course of disease progress and a clinical event such as death. Secondly, the two outcome variables (i.e., disease status and clinical event) are highly correlated; for example, the probability of a clinical event may be increased by the worsening disease status. Despite these reasons, a separate analysis for each type of outcome measurement is usually performed and often only baseline disease status is used as a time-constant covariate in the time-to-event analysis. We contend that more useful information can be gained when time course of disease status is modeled as a time-dependent covariate, providing some mechanistic insight for the effectiveness of treatment. Furthermore, an integrated model to describe the effect of treatment on the time course of both outcomes would provide a basis for clinicians to make better prognostic predictions of the eventual clinical outcome. We illustrate these points using data from 800 Parkinson’s disease (PD) patients who participated in the DATATOP trial and were followed for 8 years. It is shown that the hazards for four clinical events in PD (depression, disability, dementia, and death) are not constant over time and are clearly influenced by PD progression. With the integrated model of time course of disease progress and clinical events, differences in the probabilities of clinical events can be explained by the symptomatic and/or protective effects of anti-parkinsonian medications on PD progression. The use of early disease-status measurements may have clinical application in predicting the probability of clinical events and giving patients better individual prognostic advice.

Thursday, November 11, 2010

### Profile Likelihood and Semi-parametric Models, with Application to Multivariate Survival Analysis

*Jerry Lawless, PhD
Distinguished Professor Emeritus*

*Department of Statistics and Actuarial Science*

*University of Waterloo*

We consider semi-parametric models involving a finite dimensional parameter along with functional parameters. Profile likelihoods for finite dimensional parameters have regular asymptotic behaviour in many settings. In this talk we review profile likelihood and then consider several inference problems related to copulas; these include tests for parametric copulas, estimation of marginal distributions and association parameters and semi-parametric likelihood and pseudo-likelihood comparisons. Applications involving parallel and sequentially observed survival times will be considered.

Thursday, November 4, 2010

### Nonparametric Modeling of Next Generation Sequencing Data

*Ping Ma, PhD*

*Assistant Professor, Department of Statistics*

*University of Illinois at Urbana-Champaign*

With the rapid development of next generation sequencing technologies, ChIP-seq and RNA-seq have become popular methods for genome-wide protein-DNA interaction analysis and gene expression analysis, respectively. Compared to their hybridization-based counterparts, e.g., ChIP-chip and microarrays, ChIP-seq and RNA-seq offer signals at down to single-base resolution. In particular, the two technologies produce tens of millions of short reads in a single run. After mapping these reads to a reference genome (or transcripts), researchers obtain a sequence of read counts; that is, at each nucleotide position, a count of the number of reads whose mapping starts at that position. Depending on research goals, researchers may opt either to analyze these counts directly or to derive other types of data from these counts to facilitate biological discoveries. In this talk, I will present some nonparametric methods we recently developed for analyzing next generation sequencing data.
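The per-position count sequence described above is simple to construct once read start positions are known; a minimal sketch (the function name is hypothetical, and real pipelines work from alignment files rather than bare position lists):

```python
from collections import Counter

def start_counts(read_starts, genome_length):
    """Per-position read counts: the number of mapped reads whose
    alignment starts at each nucleotide position of the reference."""
    tally = Counter(read_starts)
    return [tally.get(pos, 0) for pos in range(genome_length)]

# Four reads mapped onto an 8-base reference:
print(start_counts([0, 2, 2, 5], 8))  # [1, 0, 2, 0, 0, 1, 0, 0]
```

This count vector is the raw input that the nonparametric methods in the talk then smooth or model.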

Thursday, October 28, 2010

## Summer 2010 Biostatistics Colloquia

### Bayesian Inference in Semiparametric Mixed Models for Longitudinal Data

*Yisheng Li, PhD
University of Texas M.D. Anderson Cancer Center*

We consider Bayesian inference in semiparametric mixed models (SPMMs) for longitudinal data. SPMMs are a class of models that use a nonparametric function to model a time effect, a parametric function to model other covariate effects, and parametric or nonparametric random effects to account for the within-subject correlation. We model the nonparametric function using a Bayesian formulation of a cubic smoothing spline, and the random effect distribution using a normal distribution and alternatively a nonparametric Dirichlet process (DP) prior. When the random effect distribution is assumed to be normal, we propose a uniform shrinkage prior (USP) for the variance components and the smoothing parameter. When the random effect distribution is modeled nonparametrically, we use a DP prior with a normal base measure and propose a USP for the hyperparameters of the DP base measure. We argue that the commonly assumed DP prior implies a nonzero mean of the random effect distribution, even when a base measure with mean zero is specified. This implies weak identifiability for the fixed effects, and can therefore lead to biased estimators and poor inference for the regression coefficients and the spline estimator of the nonparametric function. We propose an adjustment using a postprocessing technique. We show that under mild conditions the posterior is proper under the proposed USP, a flat prior for the fixed effect parameters, and an improper prior for the residual variance. We illustrate the proposed approach using a longitudinal hormone dataset, and carry out extensive simulation studies to compare its finite sample performance with existing methods.

*This is joint work with Xihong Lin and Peter Mueller.*

Friday, July 23, 2010

## Spring 2010 Biostatistics Colloquia

### Statistical challenges in identifying biomarkers for Alzheimer’s disease: Insights from the Alzheimer's Disease Neuroimaging Initiative (ADNI)

*Laurel Beckett, PhD
University of California, Davis*

The aim of the Alzheimer's Disease Neuroimaging Initiative (ADNI) is to evaluate potential biomarkers for clinical disease progression. ADNI has enrolled more than 800 people, including normal controls (NC), people with mild cognitive impairment (MCI), and people with mild to moderate Alzheimer’s disease (AD). For each person, we now have two years of follow-up clinical data including neuropsychological tests, functional measures, and clinical diagnosis to detect conversion from normal to MCI or MCI to AD. We also have longitudinal data on potential biomarkers based on MRI and PET neuroimaging and on serum and cerebrospinal fluid samples. Our goal is to find the best biomarkers to help us track the early preclinical and later clinical progression of AD, and to help speed up drug testing.

ADNI poses many challenges for statisticians. There are many potential biomarkers, arising from high-dimensional, correlated longitudinal data. Even if we pick a single biomarker to examine, we don’t have a single “gold standard” for performance. Instead, we have many different tests we would like a biomarker to pass. One very simple criterion is that it should be different in NC, MCI and AD. But we also want biomarkers that are sensitive to change over time, have a high signal-to-noise ratio, and correlate well with clinical endpoints. I will show some statistical approaches to these questions, and illustrate with current ADNI data.

Thursday, May 20, 2010

**Some Recent Statistical Developments on Cancer Clinical Trials and Computational Biology**

*Junfeng (Jeffrey) Liu, PhD
The Cancer Institute of New Jersey and
University of Medicine & Dentistry of New Jersey*

In the first part, we extend Simon's two-stage design (1989) for single-arm phase II cancer clinical trials by studying a realistic scenario where the standard and experimental treatment overall response rates (ORRs) follow two beta distributions (rather than two single values). Our results show that this type of design retains certain desirable properties for hypothesis testing purposes. However, some designs may not exist under certain hypothesis and error-rate (type I & II) setups in practice. Theoretical conditions are derived for asymptotic two-stage design non-existence and for improving design search efficiency. In the second part, we introduce Monte Carlo simulation based algorithms for rigorously calculating the probability of a set of orthologous DNA sequences within an evolutionary biology framework, where pairwise Needleman-Wunsch alignment (1970) between the imputed and species sequences is utilized to induce the posterior conditional probabilities, which lead to efficient calculation using the central limit theorem. The importance of an evolution-adaptive alignment algorithm is highlighted. If time allows, we will also briefly introduce some necessary conditions for realizing a self-consistent (Chapman-Kolmogorov equation) and self-contained (concurrent substitution-insertion-deletion) continuous-time finite-state Markov chain for modeling nucleotide site evolution under certain assumptions.

Thursday, May 6, 2010

### Median regression in survival analysis via transform-both-sides model

*Debajyoti Sinha, PhD
Florida State University*

For analysis of survival data, median regression offers a useful alternative to the popular proportional hazards (Cox, 1972) and accelerated failure time models. We propose a new, simple method for estimating the parameters of censored median regression based on a transform-both-sides model. Numerical studies show that our likelihood-based and Bayes estimators perform well compared to existing estimators for censored data with a wide range of skewness. In addition, the simulated variances of our proposed maximum likelihood estimators are substantially lower than those of Portnoy (2003) and other existing estimators. Our Bayesian estimators can handle the semiparametric model, which requires the distribution to be symmetric after transforming both sides. We also extend our methods to median regression for multivariate survival data.

*This is joint work with Jianchang Lin and Stuart Lipsitz*.

Thursday, April 29, 2010

### Planning Survival Analysis Studies with Two-Stage Randomization Trials

*Zhiguo Li, PhD
University of Michigan*

Two-stage randomization trials are growing in importance in developing and comparing adaptive treatment strategies (i.e., treatment policies or dynamic treatment regimes). Usually the first stage involves randomization to one of several initial treatments. The second stage of treatment begins when a response (or nonresponse) criterion is met. In the second stage subjects are again randomized among treatments. With time-to-event outcomes, sample size calculations for planning these two-stage randomization trials are challenging because the variances of common test statistics depend in a complex manner on the joint distribution of the time to the response (or nonresponse) criterion and the primary time-to-event outcome. We produce simple, albeit conservative, sample size formulae by using upper bounds on the variances. The resulting sample size formulae only require the same working assumptions needed to size a single stage randomized trial. Furthermore, in most common settings the sample size formulae are only mildly conservative. These sample size formulae are based on either a weighted Kaplan-Meier estimator of survival probabilities at a fixed time point or a weighted version of the log rank test. We also consider several variants of the two-stage randomization design.

Tuesday, April 27, 2010

**Spatial-Temporal Association Between Daily Mortality and Exposure to Particulate Matter**

*Eric J. Kalendra, M.S.
North Carolina State University*

Fine particulate matter (PM2.5) is a mixture of pollutants that has been linked to serious health problems, including premature mortality. Since the chemical composition of PM2.5 varies across space and time, the association between PM2.5 and mortality might also be expected to vary with space and season. This study uses a unique spatial data architecture consisting of geocoded North Carolina mortality data for 2001-2002, combined with US Census 2000 data. We study the association between mortality and air pollution exposure using different metrics (monitoring data and air quality numerical models) to characterize the pollution exposure. We develop and implement a novel statistical multi-stage Bayesian framework that provides a very broad, flexible approach to studying the spatiotemporal associations between mortality and population exposure to daily PM2.5 mass, while accounting for different sources of uncertainty. Most pollution-mortality risk assessment has been done using aggregated mortality and pollution data (e.g., at the county level), which can lead to significant ecological bias and error in the estimated risk. In this work, we introduce a new framework to adjust for the ecological bias in the risk assessment analysis when using aggregated data. We present results for the State of North Carolina.

Thursday, April 1, 2010

**Modeling Treatment Efficacy under Screening**

*Alexander Tsodikov, PhD
University of Michigan*

Modeling the treatment null hypothesis and the alternative when the population is subject to cancer screening is a challenge. Survival is subject to length and lead-time bias, and there is a shift of the distribution of disease stage towards earlier stages under screen-based diagnosis. However, screening is not a treatment, and all of these dynamic changes are expected to occur under the null hypothesis of no treatment effect. Under the alternative hypothesis, the treatment effect for an unscreened person may differ from that for a screen-detected one, as early detection may enhance the effect. The challenge is that these treatments are applied at different points of disease development. We provide a statistical modeling approach to address the question of treatment efficacy in this dynamic situation.

Thursday, March 11, 2010

## Fall 2009 Biostatistics Colloquia

**A General Framework for Combining Information and a Frequentist Approach to Incorporate Expert Opinions**

*Minge Xie, PhD
Rutgers University*

Incorporating external information, such as prior information and expert opinions, can play an important role in the design, analysis and interpretation of clinical trials. Seeking effective schemes for incorporating prior information with the primary outcomes of interest has drawn increasing attention in pharmaceutical applications in recent years. Most methods currently used for combining prior information with clinical trial data are Bayesian, but we demonstrate that they may encounter problems in the analysis of clinical trials with binary outcomes, especially when the informative prior distribution is skewed.

In this talk, we present a frequentist framework of combining information using confidence distributions (CDs), and illustrate it through an example of incorporating expert opinions with information from clinical trial data. A confidence distribution (CD), which uses a distribution function to estimate a parameter of interest, contains a wealth of information for inference; much more than a point estimator or a confidence interval (“interval estimator”). In this talk, we present a formal definition of CDs and develop a general framework of combining information based on CDs. This CD combining framework not only unifies most existing meta-analysis approaches, but also leads to the development of new approaches. In particular, we develop a frequentist approach to combine surveys of expert opinions with binomial clinical trial data, and illustrate it using data from collaborative research with Johnson & Johnson Pharmaceuticals. The results from the frequentist approach are compared with those from Bayesian approaches, and it is demonstrated that the frequentist approach has distinct advantages.

Thursday, December 10, 2009

**The Emerging Role of the Data and Safety Monitoring Board: Implications of Adaptive Clinical Trial Designs**

*Christopher S. Coffey, PhD
University of Iowa*

In recent years, there has been substantial interest in the use of adaptive or novel randomized trial designs. Although there are a large number of proposed adaptations, all generally share the common characteristic that they allow for some design modifications during an ongoing clinical trial. Unfortunately, the rapid proliferation of research on adaptive designs, and inconsistent use of terminology, has created confusion about the similarities and, more importantly, the differences among the techniques. In the first half of this talk, I will attempt to provide some clarification on the topic and describe some of the more commonly proposed adaptive designs.

Furthermore, sequential monitoring of safety and efficacy data has become integral to modern clinical trials. A Data and Safety Monitoring Board (DSMB) is often given the responsibility of monitoring accumulating data over the course of the trial. DSMBs have traditionally had the responsibility to monitor the trial and make recommendations as to when a trial should be stopped for efficacy, futility, or safety. As more trials start to utilize the adaptive framework, the roles and responsibilities of the DSMB are becoming more complex. In the latter half of this talk, I will report on the experience of the DSMB during the clinical trial of high dose Coenzyme Q10 in Amyotrophic Lateral Sclerosis (QALS). This trial utilized an adaptive design involving two stages. The objective of the first stage was to identify which of two doses of CoQ10 (1000 or 2000 mg/day) is preferred for ALS. The objective of stage 2 was to conduct a futility test to compare the preferred dose from stage 1 against placebo to determine whether there is sufficient evidence of efficacy to justify proceeding to a definitive phase III trial. As a result of the complexity of the adaptive design for this study, there were a number of issues that the DSMB had to address. I will briefly describe how the DSMB addressed each issue during the conduct of the trial and provide suggestions for how such issues might be addressed in future trials.

Thursday, October 29, 2009

### Modeling the Dynamics of T Cell Responses and Infections

*Rustom Antia, PhD
Emory University*

In the first part of the talk I will discuss how mathematical models have helped us understand the rules that govern the dynamics of immune responses and the generation of immunological memory. The second part of the talk will focus on the role of different factors (resource limitation, innate and specific immunity) in the control of infections and, if time permits, discuss their application to SIV/HIV, influenza and malaria.

Friday, October 23, 2009

### Local CQR Smoothing: An Efficient and Safe Alternative to Local Polynomial Regression

*Hui Zou, PhD
University of Minnesota*

Local polynomial regression is a useful nonparametric regression tool for exploring fine data structures and has been widely used in practice. In this talk, we will introduce a new nonparametric regression technique, called local CQR smoothing, that further improves on local polynomial regression. Sampling properties of the proposed estimation procedure are studied. We derive the asymptotic bias, variance, and normality of the proposed estimate, and we investigate its asymptotic relative efficiency with respect to local polynomial regression. It is shown that the proposed estimate can be much more efficient than the local polynomial regression estimate for various non-normal errors, while being almost as efficient for normal errors. A simulation study examines the performance of the proposed estimates, and the results are consistent with our theoretical findings. A real data example is used to illustrate the proposed method.
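The core idea of local composite quantile regression, kernel-weighted check loss summed over several quantile levels with a shared local slope, averaging the quantile intercepts to estimate the mean function, can be sketched as follows. The optimizer, kernel, and quantile grid are my choices for illustration, not the speaker's.

```python
import numpy as np
from scipy.optimize import minimize

def local_cqr(x, y, x0, h, taus=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Estimate m(x0) by local-linear composite quantile regression."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)        # Gaussian kernel weights
    tau = np.asarray(taus)[:, None]

    def loss(theta):
        slope, intercepts = theta[0], theta[1:]
        r = y[None, :] - intercepts[:, None] - slope * (x[None, :] - x0)
        return np.sum(w[None, :] * r * (tau - (r < 0)))  # check loss, all taus

    theta0 = np.concatenate(([0.0], np.full(len(taus), np.median(y))))
    fit = minimize(loss, theta0, method="Powell")
    return fit.x[1:].mean()  # average of quantile intercepts estimates m(x0)

# Sanity check on noiseless linear data: m(x) = 1 + 2x, so m(0.5) = 2
x = np.linspace(0, 1, 50)
est = local_cqr(x, 1 + 2 * x, x0=0.5, h=0.2)
```

With non-normal errors, the averaging over quantile levels is what buys the efficiency gain over local least squares.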

Thursday, October 8, 2009

### Challenges and Statistical Issues in Estimating HIV Incidence

*Ruiguang Song, PhD
Centers for Disease Control and Prevention*

Knowing the trends and current pattern of HIV infections is important for planning and evaluating prevention efforts and for resource allocation. However, it is difficult to estimate HIV incidence because HIV infections may not be detected or diagnosed until many years after infection. Historically, HIV incidence was estimated from the numbers of AIDS diagnoses using the back-calculation method. This method was no longer valid after highly active antiretroviral therapy was introduced in 1996, because the therapy changes the incubation distribution by extending the period from HIV infection to AIDS diagnosis. Since then, the empirical estimate of 40,000 was used until a new estimate was published in 2008. This presentation will describe the development of the new method and discuss the statistical issues in producing the new estimate of HIV incidence in the United States.
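The classical back-calculation mentioned above treats diagnosis counts as a convolution of past infections with the incubation distribution; a toy deconvolution via nonnegative least squares shows the mechanics. The incubation distribution and counts below are synthetic, and real back-calculation requires smoothing and handles far messier data.

```python
import numpy as np
from scipy.optimize import nnls

# Back-calculation: diagnoses d_t = sum_s infections_s * p_{t-s}, where p is
# the incubation (infection-to-diagnosis) distribution. Recover infections by
# nonnegative least squares on the convolution system A @ infections = d.
T = 15
p = 0.05 * 0.8 ** np.arange(T)            # toy incubation pmf (p[0] > 0)
true_inf = np.array([50, 80, 120, 160, 200, 230, 250, 260, 255, 240,
                     220, 200, 180, 165, 150], dtype=float)

A = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        A[t, s] = p[t - s]                # lower-triangular convolution matrix

d = A @ true_inf                          # observed diagnosis counts (noiseless)
est_inf, _ = nnls(A, d)
```

Changing the incubation pmf `p` (as HAART did in 1996) changes `A`, which is exactly why the pre-1996 back-calculation stopped being valid.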

Thursday, September 24, 2009

## Spring 2009 Biostatistics Colloquia

**Individual Prediction in Prostate Cancer Studies Using a Joint Longitudinal-Survival Model**

*Jeremy M.G. Taylor, PhD
Department of Biostatistics
University of Michigan*

For monitoring patients treated for prostate cancer, Prostate Specific Antigen (PSA) is measured periodically after they receive treatment. Increases in PSA are suggestive of recurrence of the cancer and are used in making decisions about possible new treatments. The data from studies of such patients typically consist of longitudinal PSA measurements, censored event times, and baseline covariates. Methods for the combined analysis of both longitudinal and survival data have been developed in recent years, with the main emphasis on modeling and estimation. We analyze data from a prostate cancer study in which the patients are treated with radiation therapy using a joint model. Here we focus on utilizing the model to make individualized predictions of disease progression for patients who are alive and censored, based on all their available pre-treatment and follow-up data.

In this model the longitudinal PSA data follow a nonlinear hierarchical mixed model. The clinical recurrences are modeled using a time-dependent proportional hazards model in which the time-dependent covariates include both the current value and the slope of the post-treatment PSA profile. Estimates of the parameters in the model are obtained by the Markov chain Monte Carlo (MCMC) technique. The model is used to give individual predictions of both future PSA values and the predicted probability of recurrence up to four years in the future. An efficient algorithm is developed to give individual predictions for subjects who were not part of the original data from which the model was developed. Thus the model can be used by others remotely through a website portal, to give individual predictions that can be updated as more follow-up data are obtained. In this talk I will discuss the data, the models, the estimation methods, the statistical issues, and the website, psacalc.sph.umich.edu.

This is joint work with Menggang Yu, Donna Ankerst, Cecile Proust-Lima, Ning Liu, Yongseok Park and Howard Sandler.
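The hazard structure described above, a baseline hazard scaled by the current value and slope of the PSA profile, leads directly to individualized recurrence probabilities by integrating the hazard over a prediction horizon. The trajectory, coefficients, and baseline hazard below are hypothetical numbers for illustration only.

```python
import numpy as np
from scipy.integrate import trapezoid

def recurrence_prob(t_grid, psa, psa_slope, h0=0.05, b_val=0.3, b_slope=1.0):
    """P(recurrence by end of t_grid) under h(t) = h0*exp(b_val*f(t) + b_slope*f'(t))."""
    hazard = h0 * np.exp(b_val * psa + b_slope * psa_slope)
    cumhaz = trapezoid(hazard, t_grid)    # integrate the hazard over the horizon
    return 1.0 - np.exp(-cumhaz)

t = np.linspace(0, 4, 200)                # 4-year prediction horizon
psa = 0.5 + 0.8 * t                       # hypothetical rising log-PSA profile
prob_rising = recurrence_prob(t, psa, np.gradient(psa, t))
prob_flat = recurrence_prob(t, np.full_like(t, 0.5), np.zeros_like(t))
```

A rising profile yields a higher predicted recurrence probability than a flat one, which is the qualitative behavior the website's individualized predictions convey.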

Thursday, April 30, 2009

**Recent Developments of Generalized Inference in Small Sample Diagnostic Studies**

*Lili Tian, PhD
Department of Biostatistics
SUNY at Buffalo*

The exact generalized inference method, proposed by Tsui and Weerahandi (JASA, 1989) and Weerahandi (JASA, 1993), has received much research attention recently because it allows exact (non-asymptotic) inference for statistical problems for which standard inference methods do not exist. This method has proved very fruitful in providing accessible, admissible, and preferable solutions to small-sample problems in many practical settings. In this talk, I will present a brief introduction to this field, followed by some recent developments, including applications in small-sample diagnostic studies.
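The mechanics of a generalized pivotal quantity can be illustrated for a simple diagnostic-style parameter, theta = P(X < c) under normality. This is a standard textbook construction in the generalized-inference literature, not necessarily the one developed in the talk.

```python
import numpy as np
from scipy import stats

def gpq_ci_prob_below(x, c, B=20000, alpha=0.05, seed=0):
    """Generalized CI for theta = P(X < c), X ~ N(mu, sigma^2), small sample."""
    rng = np.random.default_rng(seed)
    n, xbar, s = len(x), np.mean(x), np.std(x, ddof=1)
    chi2 = rng.chisquare(n - 1, size=B)
    z = rng.standard_normal(B)
    sigma_g = s * np.sqrt((n - 1) / chi2)           # GPQ for sigma
    mu_g = xbar - z * sigma_g / np.sqrt(n)          # GPQ for mu
    theta_g = stats.norm.cdf((c - mu_g) / sigma_g)  # plug in to get GPQ for theta
    return np.quantile(theta_g, [alpha / 2, 1 - alpha / 2])

x = stats.norm.rvs(size=25, random_state=1)         # small diagnostic sample
lo, hi = gpq_ci_prob_below(x, c=1.0)
```

The appeal in small samples is that no asymptotic normal approximation for theta-hat is invoked; the interval comes from the Monte Carlo distribution of the pivotal quantity itself.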

Thursday, April 23, 2009

**High Dimensional Statistics in Genomics: Some New Problems and Solutions**

*Hongzhe Li, PhD
Department of Biostatistics and Epidemiology
University of Pennsylvania School of Medicine*

Large-scale systematic genomic datasets have been generated to inform our biological understanding of both the normal workings of organisms in biology and disrupted processes which cause human disease. The integrative analysis of these datasets, which has become an increasingly important part of genomics and systems biology research, poses many interesting statistical problems, largely driven by the complex inter-relationships between high-dimensional genomic measurements. In this talk, I will present three problems in genomics research that require the development of new statistical methods: (1) identification of active transcription factors in microarray time-course experiments; (2) identification of subnetworks that are associated with some clinical outcomes; and (3) identification of the genetic variants that explain higher-order gene expression modules. I will present several regularized estimation methods to address these questions and demonstrate their applications using real data examples. I will also discuss some theoretical properties of these procedures.

Thursday, March 26, 2009

**Multiscale Computational Cell Biology**

*Martin Meier-Schellersheim, PhD
National Institute of Allergy and Infectious Diseases
National Institutes of Health*

The modeling and simulation tool Simmune (http://www3.niaid.nih.gov/labs/aboutlabs/psiim/computationalBiology/) allows for the definition of detailed models of cell biological processes, ranging from interactions between molecular binding sites to the behavior of populations of cells. Based on the inputs the user provides through a graphical interface, the software automatically constructs the resulting sets of partial differential equations describing intra- and extracellular reaction-diffusion and integrates them, providing numerous ways to display the behavior of the simulated systems and to interact with running simulations in a way that closely resembles wet-lab manipulations. In the talk, I will explain the technical foundations and typical use cases for Simmune.

Thursday, January 15, 2009

## Fall 2008 Biostatistics Colloquia

### Nonparametric Variance Estimation for Systematic Samples

*Jean Opsomer, PhD
Colorado State University*

Systematic sampling is a frequently used sampling method in natural resource surveys, because of its ease of implementation and its design efficiency. An important drawback of systematic sampling, however, is that no direct estimator of the design variance is available. We describe a new estimator of the model-based expectation of the design variance, under a nonparametric model for the population. The nonparametric model is sufficiently flexible that it can be expected to hold at least approximately for many practical situations. We prove the consistency of the estimator for both the anticipated variance and the design variance under the nonparametric model. The approach is used on a forest survey dataset, on which we compare a number of design-based and model-based variance estimators.
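For context, the gap the talk addresses can be seen in the classical workaround: since a systematic sample yields no direct design-variance estimator, practitioners often fall back on a successive-difference estimator. The sketch below shows that estimator, not the talk's nonparametric one.

```python
import numpy as np

def systematic_sample(y, n, start):
    """Take a 1-in-k systematic sample of size n from population array y."""
    k = len(y) // n
    return y[start::k][:n]

def sd_var_of_mean(sample, N):
    """Successive-difference estimator of Var(ybar) for a systematic sample."""
    n = len(sample)
    return (1 - n / N) / n * np.sum(np.diff(sample) ** 2) / (2 * (n - 1))

rng = np.random.default_rng(0)
y = np.sort(rng.normal(size=1000))   # population with a smooth trend
s = systematic_sample(y, n=50, start=3)
v = sd_var_of_mean(s, N=1000)
```

Because neighboring sampled units are far apart in the ordered population, successive differences pick up mostly local variation, which is exactly the model-based reasoning the talk's nonparametric estimator generalizes.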

Thursday, November 20, 2008

### Bayesian Inference for High Dimensional Functional and Image Data using Functional Mixed Models

*Jeffrey S. Morris, PhD
Department of Biostatistics
The University of Texas MD Anderson Cancer Center*

High dimensional, irregular functional data are increasingly encountered in scientific research. For example, MALDI-MS yields proteomics data consisting of one-dimensional spectra with many peaks, array CGH or SNP chip arrays yield one-dimensional functions of copy number information along the genome, 2D gel electrophoresis and LC-MS yield two-dimensional images with spots that correspond to peptides present in the sample, and fMRI yields four-dimensional data consisting of three-dimensional brain images observed over a sequence of time points on a fine grid. In this talk, I will discuss how to identify regions of the functions/images that are related to factors of interest using Bayesian wavelet-based functional mixed models. The flexibility of this framework in modeling nonparametric fixed and random effect functions enables it to model the effects of multiple factors simultaneously, allowing one to perform inference on multiple factors of interest using the same model fit, while borrowing strength between observations in all dimensions. I will demonstrate how to identify regions of the functions that are significantly associated with factors of interest, in a way that takes both statistical and practical significance into account and controls the Bayesian false discovery rate to a pre-specified level. I will also discuss how to extend this framework to include functional predictors with coefficient surfaces. These methods will be applied to a series of functional data sets.
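The wavelet representation underlying such models can be illustrated with a single-level Haar transform, which splits a spectrum into a smooth part and a detail part that concentrates peaks and jumps into a few coefficients. This is a generic sketch of the transform, not the authors' implementation.

```python
import numpy as np

def haar_level1(f):
    """One level of the Haar DWT: approximation and detail coefficients."""
    a = (f[0::2] + f[1::2]) / np.sqrt(2)  # local averages -> smooth part
    d = (f[0::2] - f[1::2]) / np.sqrt(2)  # local differences -> peaks/jumps
    return a, d

def haar_level1_inv(a, d):
    """Exact inverse of one Haar level."""
    f = np.empty(2 * len(a))
    f[0::2] = (a + d) / np.sqrt(2)
    f[1::2] = (a - d) / np.sqrt(2)
    return f

t = np.linspace(0, 1, 64)
spectrum = np.exp(-((t - 0.3) ** 2) / 0.001)  # a single narrow spectral "peak"
a, d = haar_level1(spectrum)
rec = haar_level1_inv(a, d)
```

Modeling in the coefficient domain is what lets the functional mixed model borrow strength across locations while still adapting to spiky features like mass-spectrometry peaks.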

Thursday, November 13, 2008

### Semiparametric Analysis of Recurrent and Terminal Event Data

*Douglas E. Schaubel, PhD
Department of Biostatistics, University of Michigan*

In clinical and observational studies, the event of interest is often one which can occur multiple times for the same subject (i.e., a recurrent event). Moreover, there may be a terminal event (e.g. death) which stops the recurrent event process and, typically, is strongly correlated with the recurrent event process. We consider the recurrent/terminal event setting and model the dependence through a shared gamma frailty that is included in both the recurrent event rate and terminal event hazard functions. Conditional on the frailty, a model is specified only for the marginal recurrent event process, hence avoiding the strong Poisson-type assumptions traditionally used. Analysis is based on estimating functions that allow for estimation of covariate effects on the marginal recurrent event rate and terminal event hazard. The method also permits estimation of the degree of association between the two processes. Closed-form asymptotic variance estimators are proposed. The proposed methods are evaluated through simulations to assess the applicability of the asymptotic results in finite samples, and to evaluate the sensitivity of the method to departures from its underlying assumptions. The methods are illustrated in an analysis of hospitalization data for patients in an international multi-center study of outcomes among peritoneal dialysis patients. This is joint work with Yining Ye and Jack Kalbfleisch.

Thursday, November 6, 2008

### Spatio-temporal Analysis via Generalized Additive Models

*Kung-Sik Chan, PhD
The University of Iowa*

The generalized additive model (GAM) has been widely used in practice. However, GAM assumes i.i.d. errors, which invalidates its use for many spatio-temporal data. For the latter kind of data, the generalized additive mixed model (GAMM) may be more appropriate. While several approaches exist for estimating a GAMM, they tend to be numerically unstable or computationally intensive.

In this talk, I will discuss some recent joint work with Xiangming Fang. We develop an iterative algorithm for penalized maximum likelihood (PML) and restricted penalized maximum likelihood (REML) estimation of a GAM with correlated errors. Although the new approach does not assume any specific correlation structure, the Matérn spatial correlation model is of particular interest, as motivated by our biological applications. Because some of the Matérn parameters are not consistently estimable under fixed-domain asymptotics, we investigate the spatio-temporal case, where the spatial design is assumed to be fixed with temporally independent repeated measurements and the spatial correlation structure does not change over time. Our theoretical investigation exploits the fact that penalized likelihood estimation can be given a Bayesian interpretation. The conditions under which asymptotic posterior normality holds are discussed. We also develop a model diagnosis method for checking the assumption of independence across time for spatio-temporal data. In practice, selecting the best model is often of interest, so a model selection criterion based on the Bayesian framework is proposed to compare candidate models. The proposed methods are illustrated by simulation and a fisheries application.
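The Matérn family mentioned above has simple closed forms at half-integer smoothness, which is how it is usually implemented in practice; general smoothness would require a modified Bessel function. A minimal sketch (parameterization choices are mine):

```python
import numpy as np

def matern(dist, sigma2=1.0, rho=1.0, nu=1.5):
    """Matern covariance at half-integer smoothness nu in {0.5, 1.5, 2.5}."""
    t = np.sqrt(2 * nu) * np.asarray(dist) / rho
    if nu == 0.5:
        return sigma2 * np.exp(-t)                    # exponential covariance
    if nu == 1.5:
        return sigma2 * (1 + t) * np.exp(-t)
    if nu == 2.5:
        return sigma2 * (1 + t + t ** 2 / 3) * np.exp(-t)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented")

d = np.linspace(0, 3, 50)
c = matern(d, nu=1.5)   # covariance decays smoothly from sigma2 at distance 0
```

The smoothness nu and range rho are exactly the kind of parameters that, as noted above, are not jointly consistently estimable under fixed-domain asymptotics.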

Thursday, October 23, 2008

### Challenges in Joint Modeling of Longitudinal and Survival Data

*Jane-Ling Wang, PhD
Department of Statistics
University of California at Davis*

It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches that marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties.

We will discuss the challenges in this area and provide several solutions. One of the difficulties is with likelihood approaches when the survival component is modeled semiparametrically, as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches.

Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.

This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).

## Spring 2008 Biostatistics Colloquia

**Discovery of Latent Patterns in Disability Data and the Issue of Model Choice**

*Tanzy Mae Love, PhD
Department of Statistics
Carnegie Mellon University*

Model choice is a major methodological issue in the explosive growth of data-mining models involving latent structure for clustering and classification. Here, we work from a general formulation of hierarchical Bayesian mixed-membership models and present several model specifications and variations, both parametric and nonparametric, in the context of learning the number of latent groups and associated patterns for clustering units. We elucidate strategies for comparing models and specifications through novel analyses of data on functionally disabled American seniors from the National Long Term Care Survey.

Thursday, April 24, 2008

**Funding Opportunities at the National Science Foundation**

*Grace Yang, PhD
Program Director, Statistics & Probability
National Science Foundation
Division of Mathematical Sciences*

Thursday, April 17, 2008

**Multiple imputation methods in application of a random slope coefficient linear model to randomized clinical trial data**

*Moonseong Heo, PhD
Department of Psychiatry
Weill Medical College of Cornell University*

Two types of multiple imputation methods, proper and improper, for imputing missing-not-at-random (MNAR) continuous data are considered in the context of attrition problems arising from antidepressant clinical trials, whose primary interest is to compare treatment effects on the declines in depressive symptoms over the study period. Both methods borrow information from completers' data to construct pseudo donor sampling distributions from which imputed values are drawn, but they differ in how they characterize those distributions. A joint likelihood for each method is constructed based on a selection model for missing data. Their performance was evaluated based on maximum likelihood estimates of a random slope coefficient model that fits the imputed data to test the treatment effect via modeling the interaction between the treatment and the slope of depressive symptom decline. The following performance evaluation criteria were considered: bias, statistical power, root mean square error, coverage probability of the 95% confidence interval (CI), and width of the CI. The two methods are compared with other analytic strategies for incomplete data: completers-only analysis, available observations analysis, and last observation carried forward (LOCF) analysis. A simulation study showed that the two multiple imputation methods have favorable results in bias, statistical power, and width of the 95% CI, whereas the available observations analysis showed favorable results in bias, root mean square error, and coverage rate. Completers-only analysis showed better results than the LOCF analysis. These findings guided the interpretation of results from an antidepressant trial for geriatric depression. Finally, a comparison with a sequential hot-deck multiple imputation method, applied to an analysis with a missing binary outcome from a recently completed antipsychotic trial, will be discussed.
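Whatever the imputation engine, the M completed-data analyses are combined by Rubin's rules: average the point estimates, and add the between-imputation variance (inflated by 1 + 1/M) to the average within-imputation variance. A minimal sketch:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool M multiply-imputed point estimates and variances by Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    qbar = q.mean()             # pooled point estimate
    W = u.mean()                # within-imputation variance
    B = q.var(ddof=1)           # between-imputation variance
    T = W + (1 + 1 / M) * B     # total variance
    return qbar, T

# Toy numbers: three imputed-data estimates with equal within-variances
qbar, T = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what distinguishes proper multiple imputation from single imputation, which understates uncertainty by ignoring B.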

Wednesday, April 9, 2008

**Improved Measurement Modeling and Regression with Latent Variables**

*Karen Bandeen-Roche, PhD
Professor of Biostatistics and Medicine
Johns Hopkins Bloomberg School of Public Health*

Latent variable models have long been utilized by behavioral scientists to summarize constructs that are represented by multiple measured variables or are difficult to measure, such as health practices and psychiatric syndromes. They have been regarded as particularly useful when the variables that can be measured are highly imperfect surrogates for the construct of inferential interest, but they have also been criticized as overly abstract, weakly estimable, computationally intensive, and sensitive to unverifiable modeling assumptions. My talk describes two lines of research to improve the utility of latent variable modeling, counterbalancing these strengths and weaknesses. First, it reviews methods I have developed for assessing modeling assumptions and delineating the targets of parameter estimation in the case of maximum likelihood fitting, allowing for a mis-specified model. Then, it describes new strategies for developing measurement models for subsequent use in regression modeling. One affords approximately unbiased estimation vis-à-vis full latent variable regression. A second counterbalances standard latent variable modeling assumptions, focused on internal validity of measurement, with alternative assumptions focused, say, on external or concurrent validation. Small-sample performance properties are evaluated. The methods will be illustrated using data on post-traumatic stress disorder in a population-based sample and on aging and adverse health in older adults. It is hoped that the findings will lead to improved usage of latent variable models in scientific investigations.

Thursday, April 3, 2008

**Branching Processes as Models of Progenitor Cell Populations and Estimation of the Offspring Distributions**

In memory of Andrei Yakovlev

*Nikolay Yanev, PhD
Professor and Chair
Department of Probability and Statistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences*

This talk considers two new models of reducible age-dependent branching processes with emigration in conjunction with estimation problems arising in cell biology. Methods of statistical inference are developed using the relevant embedded discrete branching structure. Based on observations of the branching process with emigration, estimators of the offspring probabilities are proposed for the hidden unobservable process without emigration, the latter being of prime interest to investigators. The problem under consideration is motivated by experimental data generated by time-lapse video-recording of cultured cells that provides abundant information on their individual evolutions and thus on the basic parameters of their life cycle in tissue culture. Some parameters, such as the mean and variance of the mitotic cycle time, can be estimated nonparametrically without resorting to any mathematical model of cell population kinetics. For other parameters, such as the offspring distribution, a model-based inference is needed. Age-dependent branching processes have proven to be useful models for that purpose. A special feature of the data generated by time-lapse experiments is the presence of censoring effects due to migration of cells out of the field of observation. For the time-to-event observations, such as the mitotic cycle time, the effects of data censoring can be accounted for by standard methods of survival analysis. No methods are available to accommodate such effects in the statistical inference on the offspring distribution. Within the framework of branching processes, the loss of cells to follow-up can be modeled as a process of emigration. Incorporating the emigration process into a pertinent branching model of cell evolution provides the basis for the proposed estimation techniques. The statistical inference on the offspring distribution is illustrated with an application to the development of oligodendrocytes in cell culture.
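The bias that the emigration correction addresses can be seen in a toy simulation: if each daughter cell independently leaves the field of view with probability q, the observed offspring mean underestimates the true mean by the factor 1 - q. The offspring law and the value of q below are illustrative, and the talk's estimators handle far more than this simple thinning.

```python
import numpy as np

rng = np.random.default_rng(7)
m_true, q = 1.5, 0.2                        # true offspring mean; emigration prob.

offspring = rng.poisson(m_true, size=10000)  # true (hidden) offspring counts
observed = rng.binomial(offspring, 1 - q)    # daughters that stay in view

naive = observed.mean()        # biased: estimates m_true * (1 - q)
corrected = naive / (1 - q)    # recovers the hidden offspring mean
```

In the actual setting q is itself unknown and the thinning interacts with the age structure of the process, which is why the embedded discrete branching structure is needed.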

This talk is based on joint work with Drs. A. Yakovlev and V. Stoimenova.

Thursday, March 27, 2008

**Challenges in Joint Modeling of Longitudinal and Survival Data **

*Jane-Ling Wang, PhD
Professor
University of California at Davis*

It has become increasingly common to observe the survival time of a subject along with baseline and longitudinal covariates. Due to several complications, traditional approaches that marginally model the survival or longitudinal data encounter difficulties. Jointly modeling these two types of data emerges as an effective way to overcome these difficulties. We will discuss the challenges in this area and provide several solutions. One of the difficulties is with likelihood approaches when the survival component is modeled semiparametrically, as in Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches. Another difficulty has to do with the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.

This talk is based on joint work with Jinmin Ding (Washington University) and Fushing Hsieh (University of California at Davis).

Thursday, March 6, 2008