# 2016 Andrei Yakovlev Colloquium

Our first colloquium of each academic year is named for the late Dr. Andrei Yakovlev (Chair 2002-2008) in honor of his many major contributions to the Department.

**Richard J. Cook, PhD**

Professor of Statistics

University of Waterloo

Thursday, September 8, 2016

2:30 p.m. - 4:00 p.m.

Helen Wood Hall - Room 1W501

**Augmented composite likelihood for the analysis of family data under biased sampling schemes**

The heritability of chronic diseases can be effectively studied by examining the nature and extent of within-family associations in disease onset times. In such studies, families are typically recruited through a biased sampling scheme in which affected individuals in a disease registry are sampled, and their relatives are contacted to provide right-censored or current status data on their disease onset times. We develop likelihood and composite likelihood methods for modeling the within-family association in these times through copula models in which dependencies are characterized by Kendall's tau. Auxiliary data from independent individuals are exploited by augmenting composite likelihoods to increase the precision of marginal parameter estimates and consequently increase efficiency in dependence parameter estimation. An application to a motivating family study in psoriatic arthritis illustrates the method and provides some evidence of excessive paternal transmission of risk.

## Previous Lectures

### 2015 Andrei Yakovlev Colloquium

**Heping Zhang, PhD**

Susan Dwight Bliss Professor of Public Health (Biostatistics)

Yale School of Public Health

**Decision Trees for Precision Medicine**

Double-blind, randomized clinical trials are the preferred approach to demonstrating the effectiveness of one treatment against another. The comparison is, however, made on the average group effects. While patients and clinicians have always struggled to understand why patients respond differently to the same treatment, and while much hope has been held for the nascent field of predictive biomarkers (e.g. genetic markers), there is still much utility in exploring whether it is possible to estimate treatment efficacy based on demographic and baseline variables including biomarkers. To address this issue, we focused on a concept of the relative effectiveness of treatments that is of particular importance in precision medicine. The method can identify groups of patients that are more likely to respond to one treatment than the other, in contrast to the traditional approach that searches for a superior treatment in a larger population. We developed an automated algorithm to construct decision trees and performed extensive simulation to evaluate our algorithm. We analyzed data from clinical trials to illustrate the practical potential of our method.

Thursday, September 24, 2015

### 2014 Andrei Yakovlev Colloquium

**Daniel Scharfstein, ScD**

Professor of Biostatistics

Johns Hopkins Bloomberg School of Public Health

**Global Sensitivity Analysis for Repeated Measures Studies with Informative Dropout: A Semi-Parametric Approach**

In 2010, the National Research Council issued the report: "The Prevention and Treatment of Missing Data in Clinical Trials." This report, commissioned by the FDA, provides 18 recommendations. Since inference in the presence of missing data requires untestable assumptions, Recommendation 15 states: “Sensitivity analyses should be part of the primary reporting of findings from clinical trials. Examining sensitivity to the assumptions about the missing data mechanism should be a mandatory component of reporting.” Broadly speaking, there are three main types of sensitivity analysis. Ad-hoc sensitivity analysis involves analyzing the data using a few different methods and evaluating whether the inferences are consistent. Local sensitivity analysis evaluates how inferences vary in a small neighborhood of a benchmark identification assumption, such as missing at random. Chapter 5 of the report emphasizes global sensitivity analysis, which considers how inferences vary over a much larger neighborhood of identification assumptions. In this talk, we present a global sensitivity analysis methodology for drawing inference about the mean at the final scheduled visit in a repeated measures study with dropout. We discuss a recently developed semi-parametric approach, the software for which is freely available at www.missingdatamatters.org. We present a detailed case study to illustrate the methodology.

Thursday, September 18, 2014

### 2013 Andrei Yakovlev Colloquium

**David Ruppert, PhD**

School of Operations Research and Information Engineering

Cornell University

**Fast Covariance Estimation for High-Dimensional Functional Data**

High-dimensional functional data are becoming increasingly common, for example, in medical imaging. For such data, we propose fast methods for smooth estimation of the covariance function. These methods scale up linearly with J, the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension J greater than 500; the recently introduced sandwich smoother is an exception, but it is not adapted to smooth covariance matrices of large dimensions, such as J = 10,000. We introduce two new methods that circumvent this problem: 1) an extremely fast implementation of the sandwich smoother for covariance smoothing; and 2) a two-step procedure that first obtains the singular value decomposition of the data matrix and then smooths the eigenvectors. In high dimensions, these new approaches are at least an order of magnitude faster than standard methods and drastically reduce memory requirements. The new approaches provide instantaneous (a few seconds) smoothing for matrices of dimension J = 10,000 and very fast (< 10 minutes) smoothing for J = 100,000.

This is joint work with Luo Xiao, Ciprian Crainiceanu, and Vadim Zipunnikov.

Thursday, September 19, 2013

### 2012 Andrei Yakovlev Colloquium

**Ying Kuen K. Cheung, PhD**

Mailman School of Public Health

Columbia University

**On the Efficiency of Nonparametric Variance Estimation in Sequential Dose Finding**

Phase I clinical trials are experiments in which a drug is administered to humans to determine the maximum tolerated dose, defined as the maximum test dose that causes a toxicity with a target probability. As such, phase I dose-finding is often formulated as a quantile estimation problem. In this talk, I will focus on clinical scenarios where toxicity is defined by dichotomizing a continuous outcome, for which a correct specification of the variance function of the outcomes is important. This is especially true for sequential studies, where the variance assumption is directly involved in the generation of the design points and hence sensitivity analysis may not be feasible after the data are collected. In this light, there is a strong reason for avoiding parametric assumptions on the variance function, although this may incur efficiency loss. This talk will show how much information one may retrieve by making additional parametric assumptions on the variance in the context of a sequential least squares recursion. By asymptotic comparison and simulation study, we demonstrate that assuming homoscedasticity achieves only a modest efficiency gain when compared to nonparametric variance estimation: when homoscedasticity truly holds, the latter is at worst 88% as efficient as the former in the limiting case, and often achieves well over 90% efficiency for most practical situations.

Thursday, September 6, 2012

### 2011 Andrei Yakovlev Colloquium

**Xihong Lin, PhD**

Department of Biostatistics

Harvard School of Public Health

**Statistical Issues and Challenges in Analyzing High-throughput 'Omics Data in Population-Based Studies**

With the advance of biotechnology, massive "omics" data, such as genomic and proteomic data, are rapidly becoming available in population-based studies to study the interplay of genes and environment in causing human diseases. An increasing challenge is how to design such studies, manage the data, analyze such high-throughput "omics" data, interpret the results, and make the findings reproducible. We discuss several statistical issues in the analysis of high-dimensional "omics" data in population-based "omics" studies. We present statistical methods for the analysis of several types of "omics" data, including incorporation of biological structures in the analysis of data from genome-wide association studies, and next generation sequencing data for rare variants. Data examples are presented to illustrate the methods. Strategies for interdisciplinary training in statistical genetics, computational biology and genetic epidemiology will also be discussed.

Thursday, September 29, 2011

### 2010 Andrei Yakovlev Colloquium

**Michael R. Kosorok, PhD**

University of North Carolina at Chapel Hill

**Reinforcement Learning, Clinical Trials and Personalized Medicine**

In this talk, we discuss using reinforcement learning to discover optimal dynamic treatment regimes for treating cancer and other life-threatening diseases. The approach we propose is to use a specially designed sequence of two randomized clinical trials that enables discovery and validation of these optimal regimens. Because these regimens are optimized over patient characteristics, including biomarkers, they are a form of personalized medicine. We discuss applications in non-small cell lung cancer, colorectal cancer and cystic fibrosis. We will also discuss briefly several open technical questions.

Thursday, September 9, 2010

### 2009 Andrei Yakovlev Colloquium

**Dean Follmann, PhD**

Biostatistics Research Branch

National Institute of Allergy and Infectious Diseases

**Crossover Trials for Survival and Recurrent Event Endpoints**

The crossover is a popular and efficient trial design used in the context of patient heterogeneity to assess the effect of treatments that act relatively quickly and whose benefit disappears with discontinuation. Each patient can serve as her own control as within-individual treatment and placebo responses are compared. Conventional wisdom is that these designs are not appropriate for absorbing binary endpoints, such as death or HIV infection. We explore the use of crossover designs in the context of these non-repeatable binary endpoints and show that they can be more efficient than the standard parallel group design when there is heterogeneity in individuals’ risks. We also introduce a new two-period design where first period “survivors” are re-randomized for the second period. This design combines the crossover design with the parallel design and achieves some of the efficiency advantages of the crossover design while ensuring that the second period groups are comparable by randomization.

We discuss the validity of the new designs and evaluate mixture model and semi-parametric methods of inference. We extend our results to crossover trials with recurrent events. Simulations are used to compare the different designs and examples are provided to explore practical issues in implementation.

Thursday, September 17, 2009

### 2008 Andrei Yakovlev Colloquium

**Yi Li, PhD**

Department of Biostatistics

Harvard University, Dana-Farber Cancer Institute

**Detecting Disparities in Long-term Cancer Survivals: Challenges and Possible Solutions**

This talk deals with long-term disease-specific survivals among the prostate cancer patients in the NIH Surveillance, Epidemiology, and End Results (SEER) program, wherein the main endpoint (e.g. deaths from prostate cancer) and the censoring causes (e.g. deaths from heart diseases) may be dependent. While a number of authors have studied the mixture survival model to analyze survival data with non-negligible long-term survival fractions, none has studied the mixture model in the presence of dependent censoring. To account for such dependence, we propose a more general long-term survival model that allows for dependent censoring. We derive the models from the perspective of competing risks and model the dependence between the censoring time and the survival time using a class of Archimedean copula models. Within this framework, we consider the parameter estimation, the long-term survival detection, and the two-sample comparison of latency distributions in the presence of dependent censoring when a proportion of patients is deemed to be long-term survivors. Large sample results using the martingale theory are obtained. We examine the finite sample performance of the proposed methods via simulation and apply them to analyze the SEER prostate cancer data.

Thursday, September 18, 2008