# Andrei Yakovlev Colloquium

Our first colloquium of each academic year is named for the late Dr. Andrei Yakovlev (Chair 2002-2008) in honor of his many major contributions to the Department. The public is invited to attend.

## 2023 Andrei Yakovlev Colloquium

### William F. Rosenberger, PhD

Distinguished University Professor

George Mason University

**Thursday, September 7, 2023**

3:30 p.m. - 5:00 p.m.

Helen Wood Hall - Room 1W-510

### Design and Inference for Enrichment Trials with a Continuous Biomarker

We describe the philosophical approaches to two-stage enrichment designs, in which a benefitting subpopulation is targeted in a second stage, after a first stage identifies the threshold of a predictive, continuous biomarker. The design issue we address is sample size estimation for the first and second stages, and the consequences of poorly estimating the threshold. These design issues are established based on an approach where the two stages are conducted and analyzed separately, and stage two is considered a confirmatory trial. Another approach is to combine the data from the two stages, and we demonstrate how to do that by testing two hypotheses simultaneously with test statistics that (we show) have an asymptotic normal distribution. While a bivariate normal model is used to give insights into the predictive nature of the biomarker, and to visualize some closed-form solutions, in principle other models can certainly be used (but perhaps yield fewer insights). As in many ongoing, long-term research projects, our work probably raises more questions than it answers! (Joint work with Nancy Flournoy, Rosamarie Frieri, and Zhantao Lin)

## Previous Lectures

### 2022 Andrei Yakovlev Colloquium

**Lance Waller, PhD**

Professor

Department of Biostatistics and Bioinformatics

Emory University

**Maps: A Statistical View**

Spatial statistical analysis builds upon the premise that where something happens can influence what happens, i.e., the location of observations can provide information on the observations themselves. Location can be defined on geographic maps and in geometric space, but geography often involves information beyond simple location, distance, and direction. In this presentation, we will explore how geography influences inference in spatial statistical analyses and offer geographic insights on familiar statistical constructs such as data visualization, asymptotics, classical and Bayesian inference, weighted estimation, model diagnostics, and compromises between design and modeling. We will discuss compromises between geographic and statistical precision, statistical precision and local and global probabilistic strategies for ensuring data confidentiality. Using historical and contemporary examples, we will illustrate how maps provide a critical context for data visualization and interpretation, ranging from the known (“You are here”) to the unknown (“Here be dragons”).

Thursday, September 22, 2022

### 2021 Andrei Yakovlev Colloquium

**Joseph Hogan, ScD**

Carole and Lawrence Sirovich Professor of Public Health

Professor and Chair of Biostatistics

Brown University School of Public Health

**What's in a model? The role of Bayesian inference in the age of data science**

The emergence of data science as a multi- and inter-disciplinary field has brought about many advances in data processing and data analysis, particularly for large-scale and high-throughput settings. Many new tools and methods being used for analysis and decision making are purely algorithmic in nature in the sense that they do not require or rely on an underlying probability model. A particular limitation of some high-performing algorithmic methods is lack of formal methods for uncertainty quantification, which can present difficulties for decision making. In this talk I will illustrate, using examples drawn mainly from HIV and infectious disease research, the role of models – and Bayesian models in particular – for drawing principled inferences from complex data. The examples include estimation of SARS-CoV-2 seroprevalence from an incomplete and nonrandom sample; drawing predictive and causal inference about retention in care using electronic health records data; representing uncertainty attributable to unmeasured confounding; and combining information from multiple sources into a mechanistic model of infectious disease dynamics. Recent advances in nonparametric Bayesian methods and Bayesian machine learning have made it possible to use highly flexible likelihood-based models that are competitive with leading algorithms such as random forest, stacked ensembles, and gradient boosting. In each example I will highlight the importance of grounding inferences to a generative model (likelihood) and using prior distributions to represent missing information or untestable assumptions. In settings involving patient-level data, modern Bayesian methods enable both remarkable flexibility in model structure and the ability to quantify uncertainty about key parameters or quantities of interest.

Thursday, September 23, 2021

### 2020 Andrei Yakovlev Colloquium

Delayed due to COVID restrictions

### 2019 Andrei Yakovlev Colloquium

**Emery Brown, MD, PhD**

Edward Hood Taplin Professor of Medical Engineering and Computational Neuroscience

Massachusetts Institute of Technology

Warren M. Zapol Professor of Anesthesia

Harvard Medical School

**Uncovering the Mechanisms of General Anesthesia: Where Neuroscience Meets Statistics**

General anesthesia is a drug-induced, reversible condition involving unconsciousness, amnesia (loss of memory), analgesia (loss of pain sensation), akinesia (immobility), and hemodynamic stability. I will describe a primary mechanism through which anesthetics create these altered states of arousal. Our studies have allowed us to give a detailed characterization of the neurophysiology of loss and recovery of consciousness, in the case of propofol, and we have demonstrated that the state of general anesthesia can be rapidly reversed by activating specific brain circuits. The success of our research has depended critically on tight coupling of experiments, statistical signal processing and mathematical modeling.

Thursday, September 19, 2019

### 2018 Andrei Yakovlev Colloquium

**Amy Herring, PhD**

Sara and Charles Ayres Professor

Department of Statistical Science and Global Health

Duke University

**Statistics for Science's Sake**

In this talk in honor of the career of Dr. Andrei Yakovlev, we will consider two case studies of scientific problems that pose interesting statistical challenges and motivate new methodological development. We will address the motivating scientific problems (in maternal and child health and adolescent development, respectively), drawbacks of existing or standard analysis approaches, subsequent methodological developments, and the process of collaboration in multiple disciplines, with a focus on strategies for generating ideas for research beyond graduate school and throughout one’s career.

Monday, September 17, 2018

### 2017 Andrei Yakovlev Colloquium

**Martin Wells, PhD**

Professor and Chairman

Department of Statistical Science

Cornell University

**A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models**

A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture allowing the explanatory variables to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a non-zero effect. This assumption, in addition to treating the coefficients as random effects, facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a modified form of the EM algorithm that is scalable and leads to significantly faster convergence compared with simulation-based fully Bayesian methods.

Thursday, September 14, 2017

### 2016 Andrei Yakovlev Colloquium

**Richard J. Cook, PhD**

Professor of Statistics

University of Waterloo

**Augmented composite likelihood for the analysis of family data under biased sampling schemes**

The heritability of chronic diseases can be effectively studied by examining the nature and extent of within-family associations in disease onset times. In such studies families are typically recruited through a biased sampling scheme in which affected individuals in a disease registry are sampled, and their relatives are contacted to provide right-censored or current status data on their disease onset times. We develop likelihood and composite likelihood methods for modeling the within-family association in these times through copula models in which dependencies are characterized by Kendall's tau. Auxiliary data from independent individuals are exploited by augmenting composite likelihoods to increase precision of marginal parameter estimates and consequently increase efficiency in dependence parameter estimation. An application to a motivating family study in psoriatic arthritis illustrates the method and provides some evidence of excessive paternal transmission of risk.

Thursday, September 8, 2016

### 2015 Andrei Yakovlev Colloquium

**Heping Zhang, PhD**

Susan Dwight Bliss Professor of Public Health (Biostatistics)

Yale School of Public Health

**Decision Trees for Precision Medicine**

Double-blind, randomized clinical trials are the preferred approach to demonstrating the effectiveness of one treatment against another. The comparison is, however, made on the average group effects. While patients and clinicians have always struggled to understand why patients respond differently to the same treatment, and while much hope has been held for the nascent field of predictive biomarkers (e.g. genetic markers), there is still much utility in exploring whether it is possible to estimate treatment efficacy based on demographic and baseline variables including biomarkers. To address this issue, we focused on a concept of the relative effectiveness of treatments that is of particular importance in precision medicine. The method can identify groups of patients that are more likely to respond to one treatment than the other, in contrast to the traditional approach that searches for a superior treatment in a larger population. We developed an automated algorithm to construct decision trees and performed extensive simulation to evaluate our algorithm. We analyzed data from clinical trials to illustrate the practical potential of our method.

Thursday, September 24, 2015

### 2014 Andrei Yakovlev Colloquium

**Daniel Scharfstein, ScD**

Professor of Biostatistics

Johns Hopkins Bloomberg School of Public Health

**Global Sensitivity Analysis for Repeated Measures Studies with Informative Dropout: A Semi-Parametric Approach**

In 2010, the National Research Council issued the report: "The Prevention and Treatment of Missing Data in Clinical Trials." This report, commissioned by the FDA, provides 18 recommendations. Since inference in the presence of missing data requires untestable assumptions, Recommendation 15 states: “Sensitivity analyses should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting.” Broadly speaking, there are three main types of sensitivity analysis. Ad-hoc sensitivity analysis involves analyzing the data using a few different methods and evaluating whether the inferences are consistent. Local sensitivity analysis evaluates how inferences vary in a small neighborhood of a benchmark identification assumption, such as missing at random. Chapter 5 of the report emphasizes global sensitivity analysis, which considers how inferences vary over a much larger neighborhood of identification assumptions. In this talk, we present a global sensitivity analysis methodology for drawing inference about the mean at the final scheduled visit in a repeated measures study with dropout. We discuss a recently developed semi-parametric approach, the software for which is freely available at www.missingdatamatters.org. We present a detailed case study to illustrate the methodology.

Thursday, September 18, 2014

### 2013 Andrei Yakovlev Colloquium

**David Ruppert, PhD**

School of Operations Research and Information Engineering

Cornell University

**Fast Covariance Estimation for High-Dimensional Functional Data**

High-dimensional functional data are becoming increasingly common, for example, in medical imaging. For such data, we propose fast methods for smooth estimation of the covariance function. These methods scale up linearly with J, the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension J greater than 500; the recently introduced sandwich smoother is an exception, but it is not adapted to smooth covariance matrices of large dimensions, such as J = 10,000. We introduce two new methods that circumvent this problem: 1) an extremely fast implementation of the sandwich smoother for covariance smoothing; and 2) a two-step procedure that first obtains the singular value decomposition of the data matrix and then smooths the eigenvectors. In high dimensions, these new approaches are at least an order of magnitude faster than standard methods and drastically reduce memory requirements. The new approaches provide instantaneous (a few seconds) smoothing for matrices of dimension J = 10,000 and very fast (< 10 minutes) smoothing for J = 100,000.

This is joint work with Luo Xiao, Ciprian Crainiceanu, and Vadim Zipunnikov.

Thursday, September 19, 2013

### 2012 Andrei Yakovlev Colloquium

**Ying Kuen K. Cheung, PhD**

Mailman School of Public Health

Columbia University

**On the Efficiency of Nonparametric Variance Estimation in Sequential Dose Finding**

Phase I clinical trials are experiments in which a drug is administered to humans to determine the maximum tolerated dose, defined as the maximum test dose that causes a toxicity with a target probability. As such, phase I dose-finding is often formulated as a quantile estimation problem. In this talk, I will focus on clinical scenarios where toxicity is defined by dichotomizing a continuous outcome, for which a correct specification of the variance function of the outcomes is important. This is especially true for sequential studies, where the variance assumption is directly involved in the generation of the design points and hence sensitivity analysis may not be feasible after the data are collected. In this light, there is a strong reason for avoiding parametric assumptions on the variance function, although this may incur efficiency loss. This talk will show how much information one may retrieve by making additional parametric assumptions on the variance in the context of a sequential least squares recursion. By asymptotic comparison and simulation study, we demonstrate that assuming homoscedasticity achieves only a modest efficiency gain when compared to nonparametric variance estimation: when homoscedasticity in truth holds, the latter is at worst 88% as efficient as the former in the limiting case, and often achieves well over 90% efficiency for most practical situations.

Thursday, September 6, 2012

### 2011 Andrei Yakovlev Colloquium

**Xihong Lin, PhD**

Department of Biostatistics

Harvard School of Public Health

**Statistical Issues and Challenges in Analyzing High-throughput 'Omics Data in Population-Based Studies**

With the advance of biotechnology, massive "omics" data, such as genomic and proteomic data, are rapidly becoming available in population-based studies to study the interplay of genes and environment in causing human diseases. An increasing challenge is how to design such studies, manage the data, analyze such high-throughput "omics" data, interpret the results, and make the findings reproducible. We discuss several statistical issues in analysis of high-dimensional "omics" data in population-based "omics" studies. We present statistical methods for analysis of several types of "omics" data, including incorporation of biological structures in analysis of data from genome-wide association studies, and next generation sequencing data for rare variants. Data examples are presented to illustrate the methods. Strategies for interdisciplinary training in statistical genetics, computational biology and genetic epidemiology will also be discussed.

Thursday, September 29, 2011

### 2010 Andrei Yakovlev Colloquium

**Michael R. Kosorok, PhD**

University of North Carolina at Chapel Hill

**Reinforcement Learning, Clinical Trials and Personalized Medicine**

In this talk, we discuss using reinforcement learning to discover optimal dynamic treatment regimes for treating cancer and other life-threatening diseases. The approach we propose is to use a specially designed sequence of two randomized clinical trials that enables discovery and validation of these optimal regimens. Because these regimens are optimized over patient characteristics, including biomarkers, they are a form of personalized medicine. We discuss applications in non-small cell lung cancer, colorectal cancer and cystic fibrosis. We will also discuss briefly several open technical questions.

Thursday, September 9, 2010

### 2009 Andrei Yakovlev Colloquium

**Dean Follmann, PhD**

Biostatistics Research Branch

National Institute of Allergy and Infectious Diseases

**Crossover Trials for Survival and Recurrent Event Endpoints**

The crossover is a popular and efficient trial design used in the context of patient heterogeneity to assess the effect of treatments that act relatively quickly and whose benefit disappears with discontinuation. Each patient can serve as her own control as within-individual treatment and placebo responses are compared. Conventional wisdom is that these designs are not appropriate for absorbing binary endpoints, such as death or HIV infection. We explore the use of crossover designs in the context of these non-repeatable binary endpoints and show that they can be more efficient than the standard parallel group design when there is heterogeneity in individuals’ risks. We also introduce a new two-period design where first period “survivors” are re-randomized for the second period. This design combines the crossover design with the parallel design and achieves some of the efficiency advantages of the crossover design while ensuring that the second period groups are comparable by randomization.

We discuss the validity of the new designs and evaluate mixture model and semi-parametric methods of inference. We extend our results to cross-over trials with recurrent events. Simulations are used to compare the different designs and examples are provided to explore practical issues in implementation.

Thursday, September 17, 2009

### 2008 Andrei Yakovlev Colloquium

**Yi Li, PhD**

Department of Biostatistics

Harvard University, Dana-Farber Cancer Institute

**Detecting Disparities in Long-term Cancer Survivals: Challenges and Possible Solutions**

This talk deals with long-term disease-specific survivals among the prostate cancer patients in the NIH Surveillance, Epidemiology, and End Results (SEER) program, wherein the main endpoint (e.g. deaths from prostate cancer) and the censoring causes (e.g. deaths from heart diseases) may be dependent. While a number of authors have studied the mixture survival model to analyze survival data with non-negligible long-term survival fractions, none has studied the mixture model in the presence of dependent censoring. To account for such dependence, we propose a more general long-term survival model that allows for dependent censoring. We derive the models from the perspective of competing risks and model the dependence between the censoring time and the survival time using a class of Archimedean copula models. Within this framework, we consider the parameter estimation, the long-term survival detection, and the two-sample comparison of latency distributions in the presence of dependent censoring when a proportion of patients is deemed to be long-term survivors. Large sample results using the martingale theory are obtained. We examine the finite sample performance of the proposed methods via simulation and apply them to analyze the SEER prostate cancer data.

Thursday, September 18, 2008