Andrei Yakovlev Colloquium

Our first colloquium of each academic year is named for the late Dr. Andrei Yakovlev (Chair 2002-2008) in honor of his many major contributions to the Department. The public is invited to attend.

2025 Andrei Yakovlev Colloquium

Thursday, September 11, 2025
3:30–5:00 p.m.
Helen Wood Hall - Fiaretti Classroom, Room 1W-501

Debashis Ghosh, PhD
Professor, Department of Biostatistics and Informatics
Colorado School of Public Health
University of Colorado Anschutz Medical Campus

Sufficient Dimension Reduction for Survival Analysis and Beyond

Sufficient dimension reduction represents a class of ‘model-free’ methodologies that seek to find directions in the data that can capture the essential information in the regression relationship between an outcome variable and a set of predictors. In this talk, we describe a new approach to sufficient dimension reduction based on transforming a right-censored variable. The proposed approach exploits a simple property of so-called ‘slicing’ approaches in sufficient dimension reduction procedures. It is also easily adapted to work with many existing sufficient dimension reduction procedures for uncensored data in the literature. In contrast to several previous approaches for right-censored data, we also propose an approach that enjoys a double-robustness property. We will also describe the application of sufficient dimension reduction to the study of deep learning.

Previous Lectures

2024 Andrei Yakovlev Colloquium

Veera Baladandayuthapani, PhD
Chair and Professor of Biostatistics
University of Michigan

Quantifying Tumor Geography: Statistical Methods for Spatial Biology

The tumor microenvironment (TME) is emerging as a new frontier in cancer research, revealing how the spatial interactions among cells surrounding a tumor influence immune response, tumor growth, and treatment outcomes. State-of-the-art spatial profiling techniques, including spatial multiplex imaging, spatial transcriptomics, and digital pathology, enable comprehensive assessment, characterization, and visualization of the spatial interactions within the TME. The resulting high-resolution data introduce numerous statistical and modeling challenges, such as complex spatial relationships, high degree of spatial heterogeneity within and between samples, and non-conformable spaces for population-level analyses. In this talk, I will present statistical methods developed to address these challenges, focusing on key scientific questions, including the modeling of spatially varying genomic networks and transcriptional programs, the quantification of intercellular interactions within the TME, and their associations with patient-specific clinical outcomes. Emphasis will be placed on probabilistic formulations and effective dimension reduction techniques that allow for rigorous inference and uncertainty quantifications. The utility of these methods will be demonstrated through several case studies in cancer research.

2023 Andrei Yakovlev Colloquium

William F. Rosenberger, PhD
Distinguished University Professor
George Mason University

Design and Inference for Enrichment Trials with a Continuous Biomarker

We describe the philosophical approaches to two-stage enrichment designs, in which a benefitting subpopulation is targeted in a second stage, after a first stage identifies the threshold of a predictive, continuous biomarker. The design issue we address is sample size estimation for the first and second stages, and the consequences of poorly estimating the threshold. These design issues are established based on an approach where the two stages are conducted and analyzed separately, and stage two is considered a confirmatory trial. Another approach is to combine the data from the two stages, and we demonstrate how to do that by testing two hypotheses simultaneously with test statistics that (we show) have an asymptotic normal distribution. While a bivariate normal model is used to give insights into the predictive nature of the biomarker, and to visualize some closed-form solutions, in principle other models can certainly be used (but perhaps yield fewer insights). As in many ongoing, long-term research projects, our work probably raises more questions than it answers! (Joint work with Nancy Flournoy, Rosamarie Frieri, and Zhantao Lin)

2022 Andrei Yakovlev Colloquium

Lance Waller, PhD
Professor
Department of Biostatistics and Bioinformatics
Emory University

Maps: A Statistical View

Spatial statistical analysis builds upon the premise that where something happens can influence what happens, i.e., the location of observations can provide information on the observations themselves. Location can be defined on geographic maps and in geometric space, but geography often involves information beyond simple location, distance, and direction. In this presentation, we will explore how geography influences inference in spatial statistical analyses and offer geographic insights on familiar statistical constructs such as data visualization, asymptotics, classical and Bayesian inference, weighted estimation, model diagnostics, and compromises between design and modeling. We will discuss compromises between geographic and statistical precision, statistical precision and local and global probabilistic strategies for ensuring data confidentiality. Using historical and contemporary examples, we will illustrate how maps provide a critical context for data visualization and interpretation, ranging from the known (“You are here”) to the unknown (“Here be dragons”).

2021 Andrei Yakovlev Colloquium

Joseph Hogan, ScD
Carole and Lawrence Sirovich Professor of Public Health
Professor and Chair of Biostatistics
Brown University School of Public Health

What's in a model? The role of Bayesian inference in the age of data science

The emergence of data science as a multi- and inter-disciplinary field has brought about many advances in data processing and data analysis, particularly for large-scale and high-throughput settings. Many new tools and methods being used for analysis and decision making are purely algorithmic in nature in the sense that they do not require or rely on an underlying probability model. A particular limitation of some high-performing algorithmic methods is the lack of formal methods for uncertainty quantification, which can present difficulties for decision making. In this talk I will illustrate, using examples drawn mainly from HIV and infectious disease research, the role of models – and Bayesian models in particular – for drawing principled inferences from complex data. The examples include estimation of SARS-CoV-2 seroprevalence from an incomplete and nonrandom sample; drawing predictive and causal inference about retention in care using electronic health records data; representing uncertainty attributable to unmeasured confounding; and combining information from multiple sources into a mechanistic model of infectious disease dynamics. Recent advances in nonparametric Bayesian methods and Bayesian machine learning have made it possible to use highly flexible likelihood-based models that are competitive with leading algorithms such as random forest, stacked ensembles, and gradient boosting. In each example I will highlight the importance of grounding inferences in a generative model (likelihood) and using prior distributions to represent missing information or untestable assumptions. In settings involving patient-level data, modern Bayesian methods enable both remarkable flexibility in model structure and the ability to quantify uncertainty about key parameters or quantities of interest.

2020 Andrei Yakovlev Colloquium

Delayed due to COVID restrictions

2019 Andrei Yakovlev Colloquium

Emery Brown, MD, PhD
Edward Hood Taplin Professor of Medical Engineering and Computational Neuroscience
Massachusetts Institute of Technology
Warren M. Zapol Professor of Anesthesia
Harvard Medical School

Uncovering the Mechanisms of General Anesthesia: Where Neuroscience Meets Statistics

General anesthesia is a drug-induced, reversible condition involving unconsciousness, amnesia (loss of memory), analgesia (loss of pain sensation), akinesia (immobility), and hemodynamic stability. I will describe a primary mechanism through which anesthetics create these altered states of arousal. Our studies have allowed us to give a detailed characterization of the neurophysiology of loss and recovery of consciousness, in the case of propofol, and we have demonstrated that the state of general anesthesia can be rapidly reversed by activating specific brain circuits. The success of our research has depended critically on tight coupling of experiments, statistical signal processing and mathematical modeling.

2018 Andrei Yakovlev Colloquium

Amy Herring, PhD
Sara and Charles Ayres Professor
Department of Statistical Science and Global Health
Duke University

Statistics for Science's Sake

In this talk in honor of the career of Dr. Andrei Yakovlev, we will consider two case studies of scientific problems that pose interesting statistical challenges and new methodological development. We will address the motivating scientific problems (in maternal and child health and adolescent development, respectively), drawbacks of existing or standard analysis approaches, subsequent methodological developments, and the process of collaboration in multiple disciplines, with a focus on strategies for generating ideas for research beyond graduate school and throughout one’s career.

2017 Andrei Yakovlev Colloquium

Martin Wells, PhD
Professor and Chairman
Department of Statistical Science
Cornell University

A Scalable Empirical Bayes Approach to Variable Selection in Generalized Linear Models

A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture allowing the explanatory variables to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a non-zero effect. This assumption, in addition to treating the coefficients as random effects, facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a modified form of the EM algorithm which is scalable, and leads to significantly faster convergence compared with simulation-based fully Bayesian methods.

2016 Andrei Yakovlev Colloquium

Richard J. Cook, PhD
Professor of Statistics
University of Waterloo

Augmented composite likelihood for the analysis of family data under biased sampling schemes

The heritability of chronic diseases can be effectively studied by examining the nature and extent of within-family associations in disease onset times. In such studies families are typically recruited through a biased sampling scheme in which affected individuals in a disease registry are sampled, and their relatives are contacted to provide right-censored or current status data on their disease onset times. We develop likelihood and composite likelihood methods for modeling the within-family association in these times through copula models in which dependencies are characterized by Kendall's tau. Auxiliary data from independent individuals are exploited by augmenting composite likelihoods to increase precision of marginal parameter estimates and consequently increase efficiency in dependence parameter estimation. An application to a motivating family study in psoriatic arthritis illustrates the method and provides some evidence of excessive paternal transmission of risk.

2015 Andrei Yakovlev Colloquium

Heping Zhang, PhD
Susan Dwight Bliss Professor of Public Health (Biostatistics)
Yale School of Public Health

Decision Trees for Precision Medicine

Double-blind, randomized clinical trials are the preferred approach to demonstrating the effectiveness of one treatment against another. The comparison is, however, made on the average group effects. While patients and clinicians have always struggled to understand why patients respond differently to the same treatment, and while much hope has been held for the nascent field of predictive biomarkers (e.g. genetic markers), there is still much utility in exploring whether it is possible to estimate treatment efficacy based on demographic and baseline variables including biomarkers. To address this issue, we focused on a concept of the relative effectiveness of treatments that is of particular importance in precision medicine. The method can identify groups of patients that are more likely to respond to one treatment than the other, in contrast to the traditional approach that searches for a superior treatment in a larger population. We developed an automated algorithm to construct decision trees and performed extensive simulation to evaluate our algorithm. We analyzed data from clinical trials to illustrate the practical potential of our method.

2014 Andrei Yakovlev Colloquium

Daniel Scharfstein, ScD
Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health

Global Sensitivity Analysis for Repeated Measures Studies with Informative Dropout: A Semi-Parametric Approach

In 2010, the National Research Council issued the report: "The Prevention and Treatment of Missing Data in Clinical Trials." This report, commissioned by the FDA, provides 18 recommendations. Since inference in the presence of missing data requires untestable assumptions, Recommendation 15 states: “Sensitivity analyses should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting.” Broadly speaking, there are three main types of sensitivity analysis. Ad-hoc sensitivity analysis involves analyzing the data using a few different methods and evaluating whether the inferences are consistent. Local sensitivity analysis evaluates how inferences vary in a small neighborhood of a benchmark identification assumption, such as missing at random. Chapter 5 of the report emphasizes global sensitivity analysis, which considers how inferences vary over a much larger neighborhood of identification assumptions. In this talk, we present a global sensitivity analysis methodology for drawing inference about the mean at the final scheduled visit in a repeated measures study with dropout. We discuss a recently developed semi-parametric approach, the software for which is freely available at www.missingdatamatters.org. We present a detailed case study to illustrate the methodology.

2013 Andrei Yakovlev Colloquium

David Ruppert, PhD
School of Operations Research and Information Engineering
Cornell University

Fast Covariance Estimation for High-Dimensional Functional Data

High-dimensional functional data are becoming increasingly common, for example, in medical imaging. For such data, we propose fast methods for smooth estimation of the covariance function. These methods scale up linearly with J, the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension J greater than 500; the recently introduced sandwich smoother is an exception, but it is not adapted to smooth covariance matrices of large dimensions, such as J = 10,000. We introduce two new methods that circumvent this problem: 1) an extremely fast implementation of the sandwich smoother for covariance smoothing; and 2) a two-step procedure that first obtains the singular value decomposition of the data matrix and then smooths the eigenvectors. In high dimensions, these new approaches are at least an order of magnitude faster than standard methods and drastically reduce memory requirements. The new approaches provide instantaneous (a few seconds) smoothing for matrices of dimension J = 10,000 and very fast (< 10 minutes) smoothing for J = 100,000. This is joint work with Luo Xiao, Ciprian Crainiceanu, and Vadim Zipunnikov.

2012 Andrei Yakovlev Colloquium

Ying Kuen K. Cheung, PhD
Mailman School of Public Health
Columbia University

On the Efficiency of Nonparametric Variance Estimation in Sequential Dose Finding

Phase I clinical trials are experiments in which a drug is administered to humans to determine the maximum tolerated dose, defined as the maximum test dose that causes a toxicity with a target probability. As such, phase I dose-finding is often formulated as a quantile estimation problem. In this talk, I will focus on clinical scenarios where toxicity is defined by dichotomizing a continuous outcome, for which a correct specification of the variance function of the outcomes is important. This is especially true for sequential studies, where the variance assumption is directly involved in the generation of the design points and hence sensitivity analysis may not be feasible after the data are collected. In this light, there is a strong reason for avoiding parametric assumptions on the variance function, although this may incur efficiency loss. This talk will show how much information one may retrieve by making additional parametric assumptions on the variance in the context of a sequential least squares recursion. By asymptotic comparison and simulation study, we demonstrate that assuming homoscedasticity achieves only a modest efficiency gain when compared to nonparametric variance estimation: when homoscedasticity in truth holds, the latter is at worst 88% as efficient as the former in the limiting case, and often achieves well over 90% efficiency in most practical situations.

2011 Andrei Yakovlev Colloquium

Xihong Lin, PhD
Department of Biostatistics
Harvard School of Public Health

Statistical Issues and Challenges in Analyzing High-throughput 'Omics Data in Population-Based Studies

With the advance of biotechnology, massive "omics" data, such as genomic and proteomic data, have rapidly become available in population-based studies to study the interplay of genes and environment in causing human diseases. An increasing challenge is how to design such studies, manage the data, analyze such high-throughput "omics" data, interpret the results, and make the findings reproducible. We discuss several statistical issues in analysis of high-dimensional "omics" data in population-based "omics" studies. We present statistical methods for analysis of several types of "omics" data, including incorporation of biological structures in analysis of data from genome-wide association studies, and next generation sequencing data for rare variants. Data examples are presented to illustrate the methods. Strategies for interdisciplinary training in statistical genetics, computational biology and genetic epidemiology will also be discussed.

2010 Andrei Yakovlev Colloquium

Michael R. Kosorok, PhD
University of North Carolina at Chapel Hill

Reinforcement Learning, Clinical Trials and Personalized Medicine

In this talk, we discuss using reinforcement learning to discover optimal dynamic treatment regimes for treating cancer and other life-threatening diseases. The approach we propose is to use a specially designed sequence of two randomized clinical trials that enables discovery and validation of these optimal regimens. Because these regimens are optimized over patient characteristics, including biomarkers, they are a form of personalized medicine. We discuss applications in non-small cell lung cancer, colorectal cancer and cystic fibrosis. We will also discuss briefly several open technical questions.

2009 Andrei Yakovlev Colloquium

Dean Follmann, PhD
Biostatistics Research Branch
National Institute of Allergy and Infectious Diseases

Crossover Trials for Survival and Recurrent Event Endpoints

The crossover is a popular and efficient trial design used in the context of patient heterogeneity to assess the effect of treatments that act relatively quickly and whose benefit disappears with discontinuation. Each patient can serve as her own control as within-individual treatment and placebo responses are compared. Conventional wisdom is that these designs are not appropriate for absorbing binary endpoints, such as death or HIV infection. We explore the use of crossover designs in the context of these non-repeatable binary endpoints and show that they can be more efficient than the standard parallel group design when there is heterogeneity in individuals’ risks. We also introduce a new two-period design where first period “survivors” are re-randomized for the second period. This design combines the crossover design with the parallel design and achieves some of the efficiency advantages of the crossover design while ensuring that the second period groups are comparable by randomization.

We discuss the validity of the new designs and evaluate mixture model and semi-parametric methods of inference. We extend our results to crossover trials with recurrent events. Simulations are used to compare the different designs and examples are provided to explore practical issues in implementation.

2008 Andrei Yakovlev Colloquium

Yi Li, PhD
Department of Biostatistics
Harvard University, Dana-Farber Cancer Institute

Detecting Disparities in Long-term Cancer Survivals: Challenges and Possible Solutions

This talk deals with long-term disease-specific survivals among the prostate cancer patients in the NIH Surveillance, Epidemiology, and End Results (SEER) program, wherein the main endpoint (e.g. deaths from prostate cancer) and the censoring causes (e.g. deaths from heart diseases) may be dependent. While a number of authors have studied the mixture survival model to analyze survival data with non-negligible long-term survival fractions, none has studied the mixture model in the presence of dependent censoring. To account for such dependence, we propose a more general long-term survival model that allows for dependent censoring. We derive the models from the perspective of competing risks and model the dependence between the censoring time and the survival time using a class of Archimedean copula models. Within this framework, we consider the parameter estimation, the long-term survival detection, and the two-sample comparison of latency distributions in the presence of dependent censoring when a proportion of patients is deemed to be long-term survivors. Large sample results using the martingale theory are obtained. We examine the finite sample performance of the proposed methods via simulation and apply them to analyze the SEER prostate cancer data.