Aligning Data Normalization with Analysis Goals for Reproducible Genomic Studies
Li-Xuan Qin, Ph.D.
Memorial Sloan Kettering Cancer Center
Data normalization is an important preprocessing step for genomic data that contain unwanted variation due to disparate experimental handling. Although normalization methods were developed for group comparison with limited differential expression, they are frequently used for inference goals for which they have not been validated, such as sample classification, an important quantitative tool that is urgently needed to tailor treatment choices in personalized medicine. To study this critical yet overlooked disconnect between the use of data normalization and the goal of subsequent analysis, we collected a unique pair of microarray datasets on the same set of tumor samples at Memorial Sloan Kettering Cancer Center and conducted extensive simulation studies based on novel resampling schemes. In this talk, I will report our findings on how data normalization affects sample classification and group comparison with moderate differential expression, and suggest an alternative approach for dealing more effectively with unwanted variation in genomic data.
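For readers unfamiliar with what a normalization step does, here is a minimal sketch of quantile normalization, one widely used method. The abstract does not say which methods were studied, so this choice is an assumption made for illustration, and the function name and toy values are hypothetical.

```python
def quantile_normalize(samples):
    """Force every sample (one list of expression values per array)
    to share the same empirical distribution.

    Illustrative sketch only: ties are broken by position, and all
    samples are assumed to have the same number of features.
    """
    n_features = len(samples[0])
    # Sort each sample, then average across samples at each rank
    # to build the shared reference distribution.
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[rank] for s in sorted_samples) / len(samples)
        for rank in range(n_features)
    ]
    normalized = []
    for s in samples:
        # Indices of this sample's features, ordered from smallest to largest.
        order = sorted(range(n_features), key=lambda i: s[i])
        out = [0.0] * n_features
        for rank, feature_index in enumerate(order):
            out[feature_index] = reference[rank]
        normalized.append(out)
    return normalized


# Toy input: two arrays measured on the same three features.
toy = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
toy_normalized = quantile_normalize(toy)
```

After this step every array has identical quantiles, which removes array-level shifts but can also remove real between-sample signal. That trade-off is the kind of mismatch between normalization and analysis goal that the abstract raises.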
Thursday, October 17, 2019
3:30 p.m. - 4:45 p.m.
Helen Wood Hall - Auditorium 1W-304
Should We Model X in High-Dimensional Inference?
Lucas Janson, Ph.D.
Many important scientific questions are about the relationship between a response variable Y and a set of explanatory variables X. For instance, Y might be a disease state and the X's might be a person's SNPs, and the question is which of these SNPs are related to the disease. To answer such questions, most statistical methods focus their assumptions on the conditional distribution of Y given X (or Y | X for short). I will describe some benefits of shifting those assumptions from the conditional distribution Y | X to the joint distribution of X, especially for high-dimensional data. First, modeling X can lead to assumptions that are more realistic and verifiable. Second, there are substantial methodological payoffs: an analyst gains much greater flexibility in the tools they can bring to bear on their data while still being guaranteed exact (non-asymptotic) inference. I will briefly mention some of my recent and ongoing work on methods for high-dimensional inference that model X instead of Y, as well as some challenges and interesting directions for the future.
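To make the idea of modeling X concrete, the sketch below implements a conditional randomization test, one published approach in this line of work. The abstract does not name specific methods, so treating this test as representative is an assumption. The sketch further assumes the simplest possible model for X, independent standard normal features, so that the conditional distribution of one feature given the others is just N(0, 1). All function names and simulation settings are hypothetical.

```python
import random


def crt_pvalue(x_j, y, sample_xj, stat, n_resamples=199, rng=None):
    """Conditional randomization test p-value for one feature.

    sample_xj(rng, n) must draw a fresh copy of the feature from its
    (assumed known) distribution given the other features. No model
    for Y given X is needed, and stat can be any function of (x, y).
    """
    if rng is None:
        rng = random.Random(0)
    observed = stat(x_j, y)
    exceed = 0
    for _ in range(n_resamples):
        x_null = sample_xj(rng, len(y))
        if stat(x_null, y) >= observed:
            exceed += 1
    # The +1 terms make the p-value valid in finite samples.
    return (1 + exceed) / (1 + n_resamples)


def abs_cross_product(x, y):
    """A simple importance statistic; any other statistic could be used."""
    return abs(sum(a * b for a, b in zip(x, y)))


def draw_standard_normal(rng, n):
    """Assumed model for X: each feature is independent N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(2019)
n = 200
x_cols = [draw_standard_normal(rng, n) for _ in range(3)]
# Only feature 0 affects the response in this simulation.
y = [2.0 * x_cols[0][i] + rng.gauss(0.0, 1.0) for i in range(n)]

p_signal = crt_pvalue(x_cols[0], y, draw_standard_normal, abs_cross_product, rng=rng)
p_null = crt_pvalue(x_cols[2], y, draw_standard_normal, abs_cross_product, rng=rng)
```

The validity of the p-value rests only on the resampling distribution for the feature being correct, which is why the statistic can be chosen freely. With dependent features, `sample_xj` would need to draw from the conditional distribution given the other columns rather than from a fixed N(0, 1).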
Thursday, October 3, 2019
Uncovering the Mechanisms of General Anesthesia: Where Neuroscience Meets Statistics
Emery Brown, M.D., Ph.D.
Massachusetts Institute of Technology
Harvard Medical School
2019 Andrei Yakovlev Colloquium
Thursday, September 19, 2019
Some Inferential Tools for Health Policy & Outcomes Research
Sharon-Lise Normand, Ph.D.
Harvard Medical School
2019 Charles L. Odoroff Memorial Lecture
Thursday, May 9, 2019
Discovering Effect Modification in Observational Studies
Dylan Small, Ph.D.
University of Pennsylvania
There is effect modification if the magnitude of a treatment effect varies with the level of an observed covariate. A larger treatment effect is typically less sensitive to bias from unmeasured covariates, so it is important to recognize effect modification when it is present. Additionally, effect modification is of interest for personalizing treatments based on an individual’s covariates. We present a method for conducting a sensitivity analysis in an observational study that empirically discovers effect modification by exploratory methods, but controls the family-wise error rate or false discovery rate in discovered groups. We will discuss an application of the method to an observational study of the effect of superior nursing at a hospital on surgical mortality.
Thursday, April 25, 2019
Data-Adaptive Regression Modeling in High Dimensions
Ashley Petersen, Ph.D.
University of Minnesota
In recent years, it has become easier and less expensive to collect and store large amounts of data in a number of fields. This has amplified interest in the development of statistical methods to adequately model this data. With high-dimensional data, the traditional plots used in exploratory data analysis can be limiting, given the large number of possible predictors. Thus, it can be helpful to fit sparse regression models, in which variable selection is adaptively performed, to explore the relationships between a large set of predictors and an outcome. For maximal utility, the functional forms of the covariate fits should be flexible enough to adequately reflect the unknown relationships and interpretable enough to be useful as a visualization technique. In this talk, we will provide an overview of recent work in the area of sparse additive modeling that can be used for visualization of relationships in big data. In addition, we will present recent novel work that fuses together the aims of these previous proposals in order to not only adaptively perform variable selection and flexibly fit included covariates, but also adaptively control the complexity of the covariate fits for increased interpretability.
Thursday, April 11, 2019
Statistical and Computational Aspects in the Analysis of Genomic Data from Family Based Designs
Ingo Ruczinski, Ph.D.
Johns Hopkins University
Family based study designs are regaining popularity because large-scale sequencing can help to interrogate the relationship between disease and variants that are too rare in the population to be detected through any test of association in a conventional case-control study but that may nonetheless co-segregate with disease within families. Family based designs also allow for the assessment of de novo events and parent-of-origin effects. In this presentation, we focus on statistical and computational aspects in the analysis of sequencing data from nuclear families with affected probands and from extended multiplex families, with an emphasis on improvements in scalability and new methods for causal variant detection.
Thursday, April 4, 2019
Parallel Markov Chain Monte Carlo Methods for Bayesian Analysis of Big Data
Erin Conlon, Ph.D.
University of Massachusetts
Recently, new parallel Markov chain Monte Carlo (MCMC) methods have been developed for massive data sets that are too large for traditional statistical analysis. These methods partition big data sets into subsets and run Bayesian MCMC computation independently, in parallel, on each subset. The posterior MCMC samples from the subsets are then combined to approximate the full-data posterior distributions. Current strategies for combining the subset samples include averaging, weighted averaging, and kernel smoothing approaches. Here, I will discuss our new method for combining subset MCMC samples, which directly takes the product of the subset densities. While our method is applicable to both Gaussian and non-Gaussian posteriors, we show in simulation studies that it outperforms existing methods when the posteriors are non-Gaussian. I will also discuss computational tools we have developed for carrying out parallel MCMC computing in Bayesian analysis of big data.
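The reason subset densities are multiplied can be seen in the one case where the product has a closed form. The sketch below is not the speaker's method, which targets non-Gaussian posteriors. It assumes a normal mean with known standard deviation and a flat prior, so each subset posterior is Gaussian and the product of the subset posteriors recovers the full-data posterior exactly. Function names and simulation settings are hypothetical.

```python
import random
import statistics


def subset_posterior(values, sigma):
    """Posterior (mean, variance) for a normal mean with known sigma
    and a flat prior: N(sample mean, sigma^2 / n)."""
    return statistics.fmean(values), sigma ** 2 / len(values)


def multiply_gaussians(posteriors):
    """Product of Gaussian densities, renormalized: precisions add,
    and the mean is the precision-weighted average of the means."""
    precisions = [1.0 / var for _, var in posteriors]
    total_precision = sum(precisions)
    mean = sum(p * m for p, (m, _) in zip(precisions, posteriors)) / total_precision
    return mean, 1.0 / total_precision


rng = random.Random(42)
sigma = 2.0
data = [rng.gauss(1.5, sigma) for _ in range(1000)]

# Partition the data into four subsets, as a parallel scheme would.
subsets = [data[i::4] for i in range(4)]
combined = multiply_gaussians([subset_posterior(s, sigma) for s in subsets])

# Reference: the posterior computed from all of the data at once.
full = subset_posterior(data, sigma)
```

Here `combined` and `full` agree up to floating-point error. With a proper prior, each subset would typically use a fractionated prior so that the product does not count the prior several times. For non-Gaussian posteriors no closed form is available and the subset densities have to be estimated from the MCMC samples.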
Thursday, February 14, 2019