Spring 2008 Biostatistics Brown Bag Seminar Abstracts
Modeling Intrahost Sequence Evolution in HIV-1 Infection
Ha Youn Lee, Ph.D.
Quantifying the dynamics of intrahost HIV-1 sequence
evolution is one means of uncovering information about the interaction
between HIV-1 and the host immune system. In this talk, I will
introduce a mathematical model and Monte-Carlo simulation of viral
evolution within an individual during HIV-1 infection that enables us to
explain the universal dynamics of sequence divergence and diversity,
to classify new HIV-1 infections as originating from multiple versus
single transmitted viral strains, and to estimate time since the most
recent common ancestor of a transmitted viral lineage.
From 13 out of 15 longitudinally followed patients (3-12 years), we
found that the rate of intrahost HIV-1 evolution is not constant, but
rather slows down at a rate correlated with the rate of CD4+ T cell
count decline. We studied an HIV-1 sequence evolution model where for
each sequence we keep track of its distance from the founder strain
and assign a fitness and survival probability of mutations based on
the distance from the founder strain.
The model suggests that the saturation of divergence and the decrease
of diversity observed in the later stages of infection are attributable
to a decrease in the probability that mutant strains survive as the
distance from the founder strain increases, rather than to an
increase in viral fitness. In the second part, I will talk about both
synchronous and asynchronous models of the acute phase of HIV-1 evolution
with a single cycle reverse transcriptase error rate, average
generation time, and basic reproductive ratio. These models were used
to analyze 3,475 complete env sequences recently derived by single
genome amplification from 102 subjects with acute HIV-1 (clade B)
infection, distinguishing single-strain from multiple-variant
infections and also identifying transmitted HIV-1 envelope
genes.
Thursday, February 21, 2008
12:30 p.m.
Biostatistics Conference Room
Fall 2007 Biostatistics Brown Bag Seminar Abstracts
Creating an R Package Part I: It's EASY!
Gregory Warnes, Ph.D.
The open source statistical package R provides nice tools for
bundling a set of functions and data together as an R package.
Creating an R package from your R scripts helps to provide good
documentation, and makes it much easier to share with others and to
maintain your code for your own future use. This Brown Bag will
demonstrate how to create an R package, and the advantages that come
from doing so.
Adaptive Simon Two-Stage Design for Preliminary Test of Targeted Sub-Population
Qin Yu, Graduate Student
The trend towards specialized clinical development programs for targeted cancer therapies is growing fast, made possible by significant improvements in the molecular characterization of the biological pathways that foster tumor growth. The proposed two-stage design, an adaptation of Simon's two-stage design, allows a preliminary determination of efficacy for a particular sub-population defined by biomarker status. The advantage of adopting this two-stage design is shown via a real study.
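For readers unfamiliar with the underlying design, the following is a minimal sketch of the operating characteristics of a standard (non-adaptive) Simon two-stage design. It is not the adaptive sub-population design from the talk, whose details are not given here. The function names and the design parameters (r1, n1, r, n, and the response rate p) are illustrative assumptions.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def simon_two_stage(r1, n1, r, n, p):
    """Operating characteristics of a standard Simon two-stage design.

    Stage 1 enrolls n1 patients and stops for futility if responses <= r1.
    Otherwise n - n1 further patients are enrolled, and the treatment is
    declared promising only if the total number of responses exceeds r.
    """
    n2 = n - n1
    # Probability of early termination after stage 1.
    pet = sum(binom_pmf(x, n1, p) for x in range(r1 + 1))
    # Probability of continuing to stage 2 and exceeding r responses overall.
    prob_promising = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = max(r - x1 + 1, 0)  # stage-2 responses still required
        tail = sum(binom_pmf(x2, n2, p) for x2 in range(need, n2 + 1))
        prob_promising += binom_pmf(x1, n1, p) * tail
    expected_n = n1 + (1 - pet) * n2
    return pet, prob_promising, expected_n


# Hypothetical design evaluated at an assumed null response rate of 10%.
pet, prob_promising, expected_n = simon_two_stage(r1=1, n1=10, r=5, n=29, p=0.10)
print(round(pet, 4), round(prob_promising, 4), round(expected_n, 2))
```

Evaluating the same function at the null and at the alternative response rate gives the type I error and the power of a candidate design, respectively.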
Using Auxiliary Variables to Enhance Survival Analysis
Haiyan Su, Graduate Student
One of the primary problems facing statisticians who work with survival data is the loss of information that occurs with right-censored data. Markers, which are prognostic longitudinal variables, can be used to replace some of the information lost due to right-censoring because they are correlated with, and predictive of, the overall survival event. In oncology studies, disease progression status is measured at certain times and is correlated with survival. How to incorporate information on disease progression (the markers) in the analysis of survival to reduce the variance of the treatment effect estimator (e.g., the log hazard ratio in a Cox model) is an interesting and challenging question. In this work, we applied Mackenzie & Abrahamowicz's (MA) plug-in method, which writes the test statistic as a functional of the Kaplan-Meier estimators and then replaces the latter with an efficient estimator of the survival curve that incorporates the information from markers. Possible choices of survival curve estimator are the Murray-Tsiatis (MT) method and the Finkelstein-Schoenfeld (FS) method. The resulting estimators can greatly improve efficiency provided that the marker is highly prognostic and that the frequency of censoring is high. MA's methodology is illustrated with an application to real time-to-event data using the MT survival curve estimator. We will also introduce the FS method with a real data example.
Approximate Iteration Algorithms
Anthony Almudevar, Ph.D.
In this talk I will summarize some work undertaken by the authors in the
area of approximate iterations, ranging from basic theory to applications
in control theory and numerical analysis. The relationship of these
processes to some important medical applications will be reviewed. The
talk will divide naturally into three sections.
1. Models of Approximate Iterative Processes. An iterative process is
usually expressed as a normed space $V$ with some operator $T$, on which a
sequence $v(k+1) = Tv(k), k \geq 0$ is generated, given some starting
value $v(0)$. Ideally, this sequence converges to a fixed point $w = Tw$.
In practice, the operator can only be evaluated approximately, so the
iteration is more accurately written $v(k+1) = T_k v(k) = Tv(k) + u(k)$
where, alternatively, $T_k$ is the $k$th approximation of $T$, or $u(k)$
is the approximation error associated with the $k$th iteration. It is
possible to show that if $T$ is contractive the approximate algorithm will
converge to the fixed point, at a rate equivalent to $\max(r^k, |u(k)|)$,
where $r$ is the contraction constant. The remaining work largely follows
from this result.
2. Numerical Analysis. Many iterative algorithms rely on operators which
may be difficult or impossible to evaluate exactly, but for which
approximations are available. Furthermore, a graduated range of
approximations may be constructed, inducing a functional relationship
between computational complexity and approximation tolerance. In such a
case, a reasonable strategy would be to vary tolerance over iterations,
starting with a cruder approximation, then gradually decreasing tolerance
as the solution is approached.
However, in such an algorithm, because the computational complexity
increases over iterations, the convergence rate of the algorithm is more
appropriately calculated with respect to
cumulative computation time than to iteration number. This leaves open the
problem of determining an optimal rate of change of approximation
tolerance.
Our theory of approximate iterations may be used to show that, under
general conditions, for linearly convergent algorithms the optimal choice
of approximation tolerance convergence rate is the same linear convergence
rate as the exact algorithm itself, regardless of the tolerance-complexity
relationship. This result will be illustrated with several examples of
Markov decision processes.
3. Adaptive Stochastic Decision Processes. A stochastic decision process
is a random sequence whose distribution beyond a time $t$ can be
determined by an action taken by an observer at time $t$, who has access
to all process history up to that time. There is usually some reward
criterion, so that the objective of the action is to maximize the
expected value of the reward. If the process distribution under all
possible action sequences is known then, at least in principle, the
optimal action under any given history can be calculated, and so would be
available to the observer as a control policy. Typically, these
distributions are unknown, but may be estimated by the observer using
process history. In this case, the observer needs to vary the actions
sufficiently in order to estimate the model. This, however, conflicts with
the goal of achieving the optimal expected reward, since this type of
exploratory behavior will be suboptimal. An adaptive decision process is
one which attempts to seek an optimal balance between exploratory behavior
and seeking to maximize reward based on current model estimates. Our
theory can be used to define, for Markov decision processes, an
exploration rate, and then to show that the optimal exploration rate
decreases in proportion to $t^{-1/3}$, resulting in a process in which
regret (difference between optimal and achieved reward) converges to zero
at a rate of $t^{-1/3}$, as distinct from a rate of $t^{-1/2}$ associated
with estimation alone. The theory extends naturally to sequential clinical
trials.
This is joint work with Edilson F. Arruda and Jason LaCombe.
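The approximate iteration $v(k+1) = Tv(k) + u(k)$ from the first section of this abstract can be illustrated with a minimal scalar sketch. The operator, the contraction constant $r = 0.5$, and the geometric error sequence $u(k) = 0.8^k$ are assumptions chosen for illustration and are not taken from the talk.

```python
def approximate_iteration(T, v0, errors):
    """Run v(k+1) = T(v(k)) + u(k) and return the whole path."""
    v = v0
    path = [v]
    for u in errors:
        v = T(v) + u
        path.append(v)
    return path


r = 0.5                                 # contraction constant (assumed)
T = lambda v: r * v + 1.0               # contractive map with fixed point w = 2
w = 1.0 / (1.0 - r)

errors = [0.8 ** k for k in range(60)]  # approximation errors u(k), shrinking to 0
path = approximate_iteration(T, 0.0, errors)

# Distance to the fixed point after each iteration.
gap = [abs(v - w) for v in path]
print(gap[10], gap[30], gap[60])
```

In this toy case the error sequence decays more slowly than $r^k$, so the gap to the fixed point shrinks at the rate of $|u(k)|$, consistent with the $\max(r^k, |u(k)|)$ rate stated above.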
Thursday, October 4, 2007
12:30 PM
Biostatistics Conference Room
Spring 2007 Biostatistics Brown Bag Seminar Abstracts
Where Do We Stand in Microarray Data Analysis? Lessons of the Past and Hopes for the Future
Andrei Yakovlev, Ph.D.
University of Rochester
This presentation discusses numerous pitfalls in the analysis of microarray gene expression data. The current state of the art in this area is far from satisfactory. Many misconceptions still dominate the literature on microarray data analysis. An overview of the most common misconceptions will be given and some constructive alternatives will be proposed. In particular, I will present a new method designed to select differentially expressed genes in non-overlapping gene pairs. This method offers two distinct advantages: (1) it leads to dramatic gains in terms of the mean numbers of true and false discoveries, as well as in stability of the results of testing; (2) its outcomes are entirely free from the log-additive array-specific technical noise.
Thursday, May 17, 2007
11:30 AM
Room 2-6408 (K-207) Medical Center
Integrating Quantitative/Computational Sciences
for Biomedical Research
Hulin Wu, Ph.D.
Our Division (Division of Biomedical Modeling and Informatics)
was formed two years ago. Since we moved to a remote
location, our communication and interaction with our Department
have not been as frequent as before. In this talk, I will give an overview
of the research of our Division in order to promote more interactions
and collaborations with other faculty and students in our Department.
I will also share our experience of how to do 100% our own research
while doing 100% collaboration and consulting. Some tips
on how to find more time to do our "own" research will be given.
Our Division is formed to integrate quantitative (statistics,
mathematics, engineering, physics etc.) and computational sciences
(computer sciences and biomedical informatics) to do biomedical
research. In this new era of high technologies, many new quantitative
and computational sciences have evolved from various disciplines
to become major tools for biomedical research. These include
biostatistics, biomathematics, bioinformatics, biomedical informatics,
computational biology, mathematical biology and theoretical
biology, biophysics, bioengineering etc. This also brings a great
opportunity for biometrical scientists to integrate the various
quantitative/computational methodologies and techniques
to support biomedical discoveries and research. Our Division,
collaborating with biomedical investigators, is currently working
on development of mathematical models, statistical methods,
computer simulation systems, software packages, informatics tools
and data management systems for HIV infections, AIDS
clinical studies, influenza infections and immune response to
infectious diseases. In this talk, I will discuss our experience of
interactions and collaborations among biostatisticians,
biomathematicians, biophysicists, bioengineers and biocomputing
scientists as well as biomedical investigators. In particular,
I will review the three components: (1) mathematical models for
HIV viral fitness experiments, AIDS clinical biomarker data,
immune response to influenza A virus infections; (2) statistical
methods for biomedical dynamic (differential equation) models;
(3) user-friendly computer simulation and estimation software.
Finally I will discuss some challenges and opportunities for
biometrical scientists in biomedical research.
Bayesian multiple outcomes models and the Seychelles data
Sally W. Thurston, Ph.D.
Understanding the relationship between prenatal mercury exposure and neurodevelopment in children is of great interest to many
practitioners. Typically, analyses rely on separate models fit to
each outcome. If the effect of exposure is very similar across
outcomes, separate models lack power to detect a common exposure
effect. Furthermore, the outcomes cluster into broad domains and
domain-specific effects are also of interest. We fit a Bayesian model
which allows the mercury effect to vary across outcomes, while
allowing for shrinkage of these effects within domains, and to a
lesser extent between domains. We will discuss the benefits and
challenges of fitting this model within a Bayesian framework, and
apply the model to multiple outcomes measured in children at 9 years
of age in the Seychelles. This is work in progress, and is joint with
David Ruppert at Cornell University.
An Introduction to Adobe Contribute and Blackboard Academic Suite
Chris Beck, Ph.D., and
Rebekka Cranmer, Senior Web Developer, Web Services Department
Learn how to create new web pages and edit existing ones with Adobe's Contribute. You will learn how to add images and text to a page, as well as edit images and create PDFs. Additionally, you will explore the page review and publishing features. Using Contribute you will be able to easily create content and publish the content to the URMC live Web server.
In the second half of this brown-bag seminar, the Blackboard Academic Suite will be introduced. Blackboard is a secure online course management
tool that is used to facilitate learning objectives, assessment, and
information exchange between instructors and students. It can also be
used for secure information exchange within an organization or other group
of people at the University of Rochester. A brief tutorial and
demonstration of the software aimed at course instructors and organization
leaders will be presented.
Fall 2006 Biostatistics Brown Bag Seminar Abstracts
Correlation Analysis for Longitudinal Data
Wan Tang, Ph.D.
Correlation analysis is widely used in biomedical and psychosocial research to evaluate quality of outcomes and to assess instrument and rater reliability. For continuous outcomes, the product-moment correlation and the associated Pearson estimate are the most popular in applications. Although asymptotic distributions of the Pearson estimates are available for multivariate outcomes, they only apply to complete data. As longitudinal study designs become increasingly popular, missing data is commonplace in most trials and cohort studies. In this talk, we propose new product-moment estimates to extend the Pearson estimates to address missing data within a longitudinal data setting. We discuss non-parametric inference under both the missing completely at random (MCAR) and missing at random (MAR) assumptions. Inference under MAR is quite complex in general and we consider several special cases that not only reduce the complexity but also apply to most real studies. The approach is illustrated with real study data in psychosocial research.
Bayesian Network as a Model of Biological Network
Peter Salzman, Ph.D.
A Bayesian network is a graphical representation of a multivariate
distribution. This representation applied to gene expression data can be
useful for understanding the direct and indirect interactions between genes/
gene products (proteins). In this talk I'll address two issues related to
Bayesian network models. The estimation/reconstruction of a network from
data is a computationally intensive process, as the space of possible models
is superexponential in the number of genes. In the first part of this talk
I'll describe an algorithm that operates on the space of rankings that is
'only' exponential in the number of genes.
In the second part of the talk I'll propose a procedure that tests if a
collection of genes loosely defined as a pathway is differentially
expressed under two conditions. It is based on first reconstructing the
network for each condition and then comparing the two networks. I'll
present results for simulated and real biological data to demonstrate the
applicability of the method.
Adverse Effects of Intergene Correlations in Microarray Data Analysis
Xing Qiu, Ph.D.
In the field of microarray data analysis, a common task is to find those genes that are differentially expressed in two groups of patients.
Inter-gene stochastic dependence plays a critical role in the methods of such statistical inference. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify many methodologies that
resort to pooling test statistics across genes. In this talk, I present two popular methods of this kind, namely the empirical Bayes methodology and a procedure introduced
by Storey et al., which depends on estimation of the false discovery rate. I then provide empirical evidence to
demonstrate that such pooling causes these methods to suffer from high variability and lack of consistency.
Causal Comparisons in Randomized Trials of Two Active Treatments: The Effect of Supervised Exercise to Promote Smoking Cessation
Jason Roy, Ph.D.
In behavioral medicine trials, such as smoking cessation trials, two or more active treatments are often compared. Noncompliance by some subjects with their assigned treatment poses a challenge to the data analyst. Causal parameters of interest might include those defined by subpopulations based on their potential compliance status under each assignment, using the principal stratification framework (e.g., causal effect of new therapy compared to standard therapy among subjects that would comply with either intervention). Even if subjects in one arm do not have access to the other treatment(s), the causal effect of each treatment typically can only be identified from the outcome, randomization and compliance data within certain bounds. We propose to use additional information – compliance-predictive covariates – to help identify the
causal effects. Our approach is to specify marginal compliance models conditional on covariates within each arm of the study. Parameters from these models can be identified from the data. We then link the two compliance models through an association model that depends on a parameter that is not identifiable, but has a meaningful interpretation; this parameter forms the basis for a sensitivity analysis. We demonstrate the benefit of utilizing covariate information in both a simulation study and in an analysis of data from a smoking cessation trial.
Spring 2006 Biostatistics Brown Bag Seminar Abstracts
A Nonparametric Model for Bivariate Distributions Based on Diagonal Copulas
Sungsub Choi, Ph.D.
Department of Mathematics,
Pohang University of Science and Technology
A useful approach in constructing multivariate distributions is based on copula
functions, and, in particular, Archimedean copulas have been in wide use.
The talk will introduce a new class of copulas based on convex diagonal functions,
and explore their distributional properties. Several examples of parametric
diagonal copulas will be given. We will then explore ways of extending them
to construct multivariate proportional hazards models.
Motion Tracking in Wireless Networks Using Artificial Triangulation
Anthony Almudevar, Ph.D.
One important problem in the application of wireless networks is the location of a
mobile node Tx based on the received signal strength (RSS) at a fixed configuration
of receivers of a radio frequency signal transmitted by Tx. Because the RSS is
inversely related to transmission distance, the distance of Tx from each receiver can
be determined, and its location established by geometric triangulation, as long as at
least three well spaced receivers are used.
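The geometric triangulation step described above can be sketched as follows, under the idealized assumption that the distance from Tx to each of three receivers is already known exactly. The function name `trilaterate` and the receiver coordinates are hypothetical. Subtracting the first circle equation from the other two gives a 2x2 linear system in the unknown position.

```python
import math


def trilaterate(receivers, distances):
    """Locate a point in the plane from its distances to three receivers.

    receivers: three (x, y) positions; distances: matching distances.
    """
    (x1, y1), (x2, y2), (x3, y3) = receivers
    d1, d2, d3 = distances
    # Linear system obtained by subtracting circle 1 from circles 2 and 3.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("receivers are collinear; position is not identifiable")
    # Cramer's rule for the 2x2 system.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Hypothetical receiver layout and a transmitter at (3, 4).
receivers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = (3.0, 4.0)
distances = [math.dist(true_pos, rx) for rx in receivers]
x_hat, y_hat = trilaterate(receivers, distances)
print(x_hat, y_hat)
```

The collinearity check reflects the requirement that the receivers be well spaced: when they lie on a line the system is singular and the position cannot be recovered.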
The use of such wireless networks provides a convenient method of collecting a
longitudinal record of motion for patients susceptible to dementia. This can
provide an objective method for the real-time monitoring of noncognitive symptoms
of dementia such as restlessness, pacing, wandering, changes in sleep patterns, changes
in circadian rhythm or specific changes in daily routine.
However, the calibration of the RSS to transmission distance relationship is complicated
by the presence of obstacles, particularly in an indoor setting. The relationship depends
strongly on the geometric configuration of walls and other large obstacles, the proximity
of high voltage devices such as microwave ovens and televisions, as well as the
orientation of any person wearing such a transmitter.
I will present as an interim solution a method of mapping RSS measurements onto a
two-dimensional plane which preserves the topological and directional properties of any
trajectory of Tx without requiring precise knowledge of the receiver configuration or
the RSS to transmission distance relationship. The method works by imposing an artificial
triangulation on suitably transformed RSS measurements.
Such a representation will suffice to capture the essential features of patient motion.
In particular, locations which are frequently occupied (favorite chair, kitchen, etc) can
be identified with sufficient data, leading to the construction of a ‘living space network’
through an unsupervised learning process. The network can be later validated or annotated.
The methodology will be illustrated using data collected under a study funded by an Everyday
Technologies for Alzheimer Care (ETAC) research grant from the Alzheimer's Association,
using monitoring equipment provided by Home Free Systems and GE Global Research. This is joint
work with Dr. Adrian Leibovici and the Center for Future Health, University of Rochester.
Testing Equality of Ordered Means in the General Linear Model
Michael McDermott, Ph.D.
Hypothesis testing problems involving order constrained means arise
frequently in practice. The standard approach to this problem in the
one-way layout is the likelihood ratio test. In many practical settings,
such as a randomized controlled trial, it is useful to include
covariates in the primary statistical model. Likelihood ratio tests for
equality of ordered means that incorporate covariate adjustment are quite
complex and are rarely applied in practice because of difficulties in
their implementation. In this paper, a test is proposed that is based
on multiple contrasts among the adjusted group means. The p-values
associated with these contrasts are, in general, dependent. An overall
significance test is carried out using Fisher’s statistic to combine
the dependent p-values arising from these contrasts; the null distribution
of this statistic can be well approximated by that of a scaled chi-square
random variable. The contrasts can be chosen to yield a test with high
power, for alternatives at a fixed distance from the null hypothesis,
throughout the restricted parameter space. The test is generally easy to
implement for a variety of partial order restrictions. An example from a
randomized clinical trial is used to illustrate the proposed test.
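As a rough illustration of the combining step, Fisher's statistic is $-2 \sum_i \log p_i$. The sketch below computes it and, as a reference point only, its p-value under the assumption of independent p-values, in which case the statistic is chi-square with $2k$ degrees of freedom. The scaled chi-square approximation used in the talk for dependent p-values is not reproduced here, since its scaling constants are not given. The function names are illustrative.

```python
import math


def fisher_statistic(p_values):
    """Fisher's combining statistic: -2 * sum(log p_i)."""
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combined p-value assuming the p-values are independent."""
    return chi2_sf_even(fisher_statistic(p_values), 2 * len(p_values))


print(fisher_combined_p([0.05, 0.05]))
```

With dependent contrasts, the same statistic would be referred to a scaled chi-square distribution rather than to the independence reference used in this sketch.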
Fall 2005 Biostatistics Brown Bag Seminar Abstracts
Rule-based Modeling of Signaling by Epidermal Growth Factor Receptor
Michael L. Blinov
Theoretical Biology and Biophysics Group,
Los Alamos National Laboratory, Los Alamos, NM
Signal transduction networks often exhibit combinatorial complexity:
the number of protein complexes and modification states that potentially
can be generated during the response to a signal is large, because
signaling proteins contain multiple sites of modification and interact with
multiple binding partners. The conventional approach of manually specifying
each term of a mathematical model is impossible. To avoid this problem,
modelers often make assumptions to limit the number of species, but these are
usually poorly justified. As an alternative, we have developed an approach to
represent biomolecular interactions as rules specifying activities, potential
modifications and interactions of the domains of signaling molecules
[Hlavacek et al. (2003) Biotech. Bioeng.]. Rules are evaluated automatically
to generate the reaction network. This approach is implemented in BioNetGen
software [Blinov et al. (2004) Bioinformatics; Blinov et al. (in press) LNCS].
To illustrate this approach, we have developed a model of early events in
signaling by the epidermal growth factor (EGF) receptor (EGFR), which
includes EGF, EGFR, the adapter proteins Grb2 and Shc, and the guanine
nucleotide exchange factor Sos [Blinov et al. (2005) BioSystems]. These
events can potentially generate a diversity of protein complexes and
phosphoforms; however, this diversity has been largely ignored in
computational models of EGFR signaling. The model predicts the dynamics of
356 molecular species connected through 3,749 reactions. This model is
compared with a previously developed model [Kholodenko et al. (1999) JBC]
that incorporates the same protein-protein interactions but is based on several
restrictive assumptions and thus includes only 18 molecular species involved in
Sos activation. The new model is consistent with experimental data and yields
new predictions without requiring new parameters. The model predicts distinct
temporal patterns of phosphorylation for different tyrosines of EGFR, distinct
reaction paths for Sos activation, a large number of distinct protein complexes
at short times, and signaling by receptor monomers. Comparing the two models
helps design experiments to test hypotheses; e.g., a genetic mutation blocking
Shc-dependent pathways helps to distinguish between competitive and non-competitive
mechanisms of adapter protein binding.
Stochastic Curtailment in Multi-Armed Trials
Xiaomin He
Stochastically curtailed procedures in multi-armed trials are
complicated due to repeated significance testing and multiple comparisons.
From either frequentist or Bayesian viewpoints, there exists some
dependence among pairwise test statistics. Investigators must consider
such dependence when testing homogeneity of treatments. This paper studies
the property of the canonical multivariate joint distribution of test
statistics in multi-armed trials. Pairwise and global monitoring are
suggested based on this property. In pairwise monitoring, the Hochberg
step-up procedure is recommended to strongly control the overall
significance level. In global monitoring, the conditional and predictive
power are calculated based on current multivariate test statistics, which
reflect the dependence among pairwise test statistics. Futility monitoring
in multi-armed trials is also considered. Simulation results in
multi-armed trials show that, compared with the traditional group
sequential and non-sequential procedures, stochastic curtailment has
advantages in sample size, time and cost. An example concerning a proposed
study of Coenzyme Q$_{10}$ in early Parkinson Disease is given.
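The Hochberg step-up procedure mentioned for pairwise monitoring can be sketched as follows. This is a generic implementation for a single set of p-values, not the sequential monitoring scheme of the talk. The function name and the example p-values are illustrative assumptions.

```python
def hochberg_reject(p_values, alpha=0.05):
    """Hochberg step-up procedure.

    Finds the largest k with p_(k) <= alpha / (m - k + 1), where p_(1) <= ...
    <= p_(m) are the ordered p-values, and rejects the k hypotheses with the
    smallest p-values. Returns a list of booleans in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    # Step up from the largest p-value towards the smallest.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        if p_values[idx] <= alpha / (m - rank + 1):
            for j in order[:rank]:
                reject[j] = True
            break
    return reject


print(hochberg_reject([0.01, 0.04, 0.03]))
print(hochberg_reject([0.01, 0.06, 0.03]))
```

In the first call the largest p-value already meets its threshold, so all three hypotheses are rejected. In the second, only the smallest p-value meets its threshold.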
Power Analysis for Correlations from Clustered Study Designs
Xin Tu
Power analysis constitutes an important component of modern clinical
trials and research studies. Although a variety of methods and software
packages are available, they are primarily focused on regression models,
with little attention paid to correlation analysis. However, the latter is
a simpler and more appropriate approach for modeling association between
correlated variables that measure a common (latent) construct using
different scales, different assessment methods and different raters as
arising in psychosocial and other health-care related research areas. A
major difficulty for performing power analysis is how to deal with the
excessive number of parameters in the distributions of the correlation
estimates, many of which are nuisance parameters. In addition, as missing
data patterns are unpredictable and dynamic before a study is realized,
their effect must also be addressed when performing power analysis, which
further complicates the analytic problems. With no real data to estimate
the parameters and missing data patterns as in most real study
applications, it is difficult to proceed with estimation of power and
sample size for correlation analysis for a real study. In this talk, we
discuss how to eliminate nuisance parameters and model missing data
patterns to effectively address these issues. We illustrate our approaches
with both real and simulated data.
This is joint work with Paul Crits-Christoph (University of
Pennsylvania), Changyong Feng (University of Rochester), Robert Gallop
(University of Pennsylvania) and Jeanne Kowalski (Johns Hopkins
University).
Branching Processes, Generation, and Applications
Ollivier Hyrien
I will first present results on the distribution of the generation in a
Bellman-Harris branching process starting with a single cell. Approximate
expressions for this distribution have been described in the literature,
and I will present an exact expression. As an application, I will give an
explicit expression for the distribution of the age in the considered
setting. The results are illustrated using a Markov process.
The second part of my talk will focus on the statistical analysis of
CFSE-labeling experiments, a bioassay frequently used by biologists to
study cell proliferation. The data generated by this assay are dependent,
a feature that has never been mentioned in the literature. The dependency
structure is quite complex, making it impossible to use the method of
maximum likelihood. I propose three estimation techniques, and present
their asymptotic and finite sample properties. An application to T
lymphocytes will also be given.
Similarity Searches in Genome-wide Numerical Data Sets
Galina Glazko
Stowers Institute for Medical Research
Many types of genomic data are naturally represented as
multidimensional vectors. The frequent purpose of genome-scale data
analysis is to uncover the subsets in the data that are related by a
similarity of some sort. One way to do it is by computing the distances
between vectors. The major question here is: how to choose the distance
measure, when several of them are available? First, we consider the
problem of functional inference using phyletic patterns. Phyletic patterns
denote presence and absence of orthologous genes in completely sequenced
genomes, and are used to infer functional links, on the assumption that
genes involved in the same pathway or functional system are co-inherited
by the same set of genomes. I demonstrate that the use of an appropriate
distance measure and clustering algorithm increases the sensitivity of
the phyletic pattern method; however, the method itself has a limit of
applicability caused by differential gains, losses, and displacements of
orthologous genes. Second, we study the characteristic properties of
various distance measures and their performance in several tasks of genome
analysis. Most distance measures between binary vectors turn out to belong
to a single parametric family, namely generalized average-based distance
with different exponents. I show that descriptive statistics of distance
distribution, such as skewness and kurtosis, can guide the appropriate
choice of the exponent. In contrast, the more familiar distance
properties, such as metric and additivity, appear to have much less effect
on the performance of distances. Third, we discuss a new approach for
local clustering based on iterative pattern matching and apply the new
approach to identify potential malaria vaccine candidates in the Plasmodium
falciparum transcriptome.
Partially Linear Models and Related Topics
Hua Liang
In this brown-bag seminar I will present the state of
the art of partially linear models, with a particular focus on several
special topics such as error-prone covariates, missing observations, and
nonlinear component checking. Extensions to more general models will be
discussed. The applications of these projects in biology, economics, and
nutrition will be mentioned. The talk covers a series of my publications
in the Annals of Statistics, JASA, Statistica Sinica, Statistical Methods
in Medical Research, and more recent submissions.
Spring 2005 Biostatistics Brown Bag Seminar Abstracts
Estimating Incremental Cost-Effectiveness Ratios
and Their Confidence Intervals with Differentially Censored Data
Hongkun Wang and Hongwei Zhao
With medical costs escalating over recent years, cost analysis is being
conducted more and more to assess the economic impact of new treatment
options. An incremental cost-effectiveness ratio is a measure of the
additional cost of a new treatment per year of life saved. In this talk,
we consider cost-effectiveness analysis for new treatments
evaluated in a randomized clinical trial setting with staggered entries.
In particular, the censoring times are different for cost and survival
data. We propose a method for estimating the incremental
cost-effectiveness ratio and obtaining its confidence interval when
differential censoring exists. Simulation experiments are conducted to
evaluate our proposed method. We also apply our methods to a clinical
trial example comparing the cost-effectiveness of implanted defibrillators
with conventional therapy for individuals with reduced left ventricular
function after myocardial infarction.
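As a point of reference, the ratio itself can be sketched as follows. This is a minimal illustration that ignores censoring entirely (the subject of the talk); the function name and the numbers in the usage note are hypothetical and are not taken from the study.

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect.

    Inputs are mean costs and mean effects (e.g., life-years) per arm.
    Censoring is NOT handled here; this is only the definition of the ratio.
    """
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the two effects are equal")
    return (cost_new - cost_old) / delta_effect
```

For example, with hypothetical mean costs of 50000 and 30000 and mean survival of 5.0 and 4.5 years, `icer(50000, 30000, 5.0, 4.5)` gives 40000 per life-year gained.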
Regression Analysis of ROC Curves and Surfaces
Christopher Beck
Receiver operating characteristic (ROC) curves are commonly used to
describe the performance of a diagnostic test in terms of discriminating
between healthy and diseased populations. A popular index of the
discriminating ability or accuracy of the diagnostic test is the area
under the ROC curve. When there are three or more populations, the concept
of an ROC curve can be generalized to that of an ROC surface, with the
volume under the ROC surface serving as an index of diagnostic accuracy.
After introducing the basic concepts associated with ROC curves and
surfaces, methods for assessing the effects of covariates on diagnostic
test performance will be discussed. Examples from a recent study organized
by the Agency for Toxic Substances and Disease Registry (and conducted
here in Rochester) will be presented to illustrate these methods.
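The area under the ROC curve has a well-known empirical form: the proportion of (diseased, healthy) pairs in which the diseased subject has the higher test value, with ties counted as one half. The sketch below illustrates only that basic index, assuming higher values indicate disease; it does not include covariates or the three-population surface, and the function name is illustrative.

```python
def empirical_auc(healthy, diseased):
    """Empirical AUC as the Mann-Whitney proportion of correctly ordered pairs.

    Assumes larger test values point toward disease. Ties count one half.
    """
    if not healthy or not diseased:
        raise ValueError("both samples must be non-empty")
    score = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                score += 1.0
            elif d == h:
                score += 0.5
    return score / (len(healthy) * len(diseased))
```

A value of 0.5 corresponds to a test that does not discriminate, and 1.0 to perfect separation of the two samples.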
Constructing Prognostic Gene Signatures for Cancer Survival
Derick Peterson
Modern micro-array technologies allow us to simultaneously measure the
expressions of a huge number of genes, some of which are likely to be
associated with cancer survival. While such gene expressions are unlikely
to ever completely replace important clinical covariates, evidence is
already beginning to mount that they can provide significant additional
predictive information. The difficult task is to search among an enormous
number of potential predictors and to correctly identify most of the
important ones, without mistakenly identifying too many spurious
associations. Many commonly used screening procedures unfortunately
over-fit the training data, leading to subsets of selected genes that are
unrelated to survival in the target population, despite appearing
associated with the outcome in the particular sample of data used for
subset selection. And some genes might only be useful when used in concert
with certain other genes and/or with clinical covariates, yet most
available screening methods are inherently univariate in nature, based
only on the marginal associations between each predictor and the outcome.
While it is impossible to simultaneously adjust for a huge number of
predictors in an unconstrained way, we propose a method that offers a
middle ground where some partial adjustments can be made in an adaptive
way, regardless of the number of candidate predictors.
A New Test Statistic for Testing Two-Sample Hypotheses in Microarray Data Analysis
Yuanhui Xiao
We introduce a test statistic intended for use in nonparametric testing
of the two-sample hypothesis with the aid of resampling techniques. This
statistic is constructed as an empirical counterpart of a certain distance
measure N between the distributions F and G from which the samples under
study are drawn. The distance measure N can be shown to be a probability
metric. In two-sample comparisons, the null hypothesis F = G is formulated
as H0: N = 0. In a computer experiment, where gene expressions were
generated from a
log-normal distribution, while departures from the null hypothesis were
modeled via scale transformations, the permutation test based on the
distance N appeared to be more powerful than the one based on the
commonly used t-statistic. The proposed statistic is not
distribution free, so the two-sample hypothesis F = G is
composite, i.e., it is formulated as H0: F(x) = H(x), G(x) = H(x)
for all x and some H(x). The question of how the
null distribution H should be modeled arises naturally in this
situation. For the N-statistic, it can be shown that a specific
resampling procedure (resampling analog of permutations) provides a
rational way of modeling the null distribution. More specifically, this
procedure mimics the sampling from a null distribution H which is,
in some sense, the "least favorable" for rejection of the null hypothesis.
No statement of such generality can be made for the t-statistic.
The usefulness of the proposed statistic is illustrated with an
application to experimental data generated to identify genes involved in
the response of cultured cells to oncogenic mutations.
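A permutation test built on a distance between empirical distributions can be sketched as below. The abstract does not specify the kernel behind N, so this sketch assumes the kernel L(x, y) = |x - y|, which gives the quantity 2 E|X - Y| - E|X - X'| - E|Y - Y'|; the function names, the number of permutations, and the seed are illustrative choices, not the author's procedure.

```python
import random

def mean_abs_diff(a, b):
    """Average of |x - y| over all pairs (x in a, y in b)."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def n_distance(x, y):
    """Empirical distance with the ASSUMED kernel |x - y|."""
    return 2.0 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Permutation p-value: share of relabelings with a distance >= observed."""
    rng = random.Random(seed)
    observed = n_distance(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if n_distance(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator counts the observed labeling itself.
    return (hits + 1) / (n_perm + 1)
```

In a gene-by-gene analysis this would be applied to each gene's two groups of expression values, followed by a multiple-testing adjustment.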
The Effects of Normalization on the Correlation Structure of Microarray Data
Xing Qiu, Andrew I. Brooks, Lev Klebanov, and Andrei Yakovlev
Stochastic dependence between gene expression levels in microarray data
is of critical importance for the methods of statistical inference that
resort to pooling test statistics across genes. It is frequently assumed
that dependence between genes (or tests) is sufficiently weak to justify
the proposed methods of testing for differentially expressed genes. A
potential impact of between-gene correlations on the performance of such
methods has yet to be explored. We present a systematic study of
correlation between the t-statistics associated with different genes. We
report the effects of four different normalization methods using a large
set of microarray data on childhood leukemia in addition to several sets
of simulated data. Our findings help decipher the correlation structure of
microarray data before and after the application of normalization
procedures. A long-range correlation in microarray data manifests itself
in thousands of genes that are heavily correlated with a given gene in
terms of the associated t-statistics. The application of normalization
methods may significantly reduce correlation between the t-statistics
computed for different genes. However, such procedures are unable to
completely remove correlation between the test statistics. The long-range
correlation structure also persists in normalized data.
Estimating Complexity in Bayesian Networks
Peter Salzman
Bayesian networks are commonly used to model complex genetic
interaction graphs in which genes are represented by nodes and
interactions by directed edges. Although a likelihood function is usually
well defined, the maximum likelihood approach favors networks with high
model complexity. To overcome this we propose a two-step algorithm to
learn the network structure. First, we estimate model complexity. This
requires finding the MLE conditional on model complexity and then using
Bayesian updating, resulting in an informative prior density on
complexity. This is accomplished using simulated annealing to solve a
constrained optimization problem on the graph space. In the second step we
use an MCMC algorithm to construct a posterior density of gene graphs
which incorporates the information obtained in the first step. Our
approach is illustrated by an example.
A New Approach to Testing for Sufficient Follow-up in Cure-Rate Analysis
Lev Klebanov and Andrei Yakovlev
The problem of sufficient follow-up arises naturally in the context of
cure rate estimation. This problem was brought to the fore by Maller and
Zhou (1992, 1994) in an effort to develop nonparametric statistical
inference based on a binary mixture model. The authors proposed a
statistical test to help practitioners decide whether or not the period of
observation has been long enough for this inference to be theoretically
sound. The test is inextricably entwined with estimation of the cure
probability by the Kaplan-Meier estimator at the point of last
observation. While intuitively compelling, the test by Maller and Zhou
does not provide a satisfactory solution to the problem because of its
unstable and non-monotonic behavior when the duration of follow-up
increases. The present paper introduces an alternative concept of
sufficient follow-up allowing derivation of a lower bound for the expected
proportion of immune subjects in a wide class of cure models. By building
on the proposed bound, a new statistical test is designed to address the
issue of the presence of immunes in the study population. The usefulness
of the proposed approach is illustrated with an application to survival
data on breast cancer patients identified through the NCI Surveillance,
Epidemiology and End Results Database.
Assessment of Diagnostic Tests in the Presence of Verification Bias
Michael McDermott
Sensitivity and specificity are common measures of the accuracy of a
diagnostic test. The usual estimators of these quantities are unbiased if
data on the diagnostic test result and the true disease status are
obtained from all subjects in a random sample from the intended population
to which the test will be applied. In many studies, however, verification
of the true disease status is performed only for a subset of the sample.
This may be the case, for example, if ascertainment of the true disease
status is invasive or costly. Often, verification of the true disease
status depends on the result of the diagnostic test and possibly other
characteristics of the subject (e.g., only subjects judged to be at higher
risk of having the disease are verified). If sensitivity and specificity are estimated
using only the information from the subset of subjects for whom both the
test result and the true disease status have been ascertained, these
estimates will typically be biased. This talk will review some methods for
dealing with the problem of verification bias. Some new approaches to the
problem will also be introduced.
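One classical correction, usually attributed to Begg and Greenes, can be sketched as follows. It is shown only as an illustration of the kind of method in this area, not necessarily one covered in the talk, and it assumes that verification depends on the test result alone. The function and argument names are hypothetical.

```python
def begg_greenes(n_test_pos, n_test_neg,
                 ver_pos_dis, ver_pos_nodis,
                 ver_neg_dis, ver_neg_nodis):
    """Verification-bias-corrected sensitivity and specificity.

    n_test_pos, n_test_neg: test-positive / test-negative counts in the
        whole sample (verified or not).
    ver_*: counts among VERIFIED subjects by test result and disease status.
    Assumes verification depends only on the test result.
    """
    n = n_test_pos + n_test_neg
    p_pos = n_test_pos / n
    p_neg = n_test_neg / n
    # Disease probability given the test result, estimated from verified subjects.
    p_dis_pos = ver_pos_dis / (ver_pos_dis + ver_pos_nodis)
    p_dis_neg = ver_neg_dis / (ver_neg_dis + ver_neg_nodis)
    sens = p_pos * p_dis_pos / (p_pos * p_dis_pos + p_neg * p_dis_neg)
    spec = (p_neg * (1 - p_dis_neg)
            / (p_neg * (1 - p_dis_neg) + p_pos * (1 - p_dis_pos)))
    return sens, spec
```

With hypothetical counts of 200 test positives (all verified: 160 diseased, 40 not) and 800 test negatives (100 verified: 10 diseased, 90 not), the naive sensitivity from verified subjects is 160/170, about 0.94, while the corrected value is 2/3.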
Estimation of Causal Treatment Effects from Randomized Trials with Varying Levels of Non-Compliance
Jason Roy
Data from randomized trials with non-compliance are often analyzed with
an intention-to-treat (ITT) approach. However, while ITT estimates may be
of interest to policy-makers, estimates of causal treatment effects may be
of more interest to clinicians. For the simple situation where treatment
and compliance are binary (yes/no), instrumental variable (IV) methods can
be used to estimate the average causal effect of treatment among those
that would comply with treatment assignment. When there are more than two
compliance levels (e.g., non-compliance, partial compliance, full
compliance), however, these IV methods cannot identify the
compliance-level causal effects without strong assumptions. We consider
likelihood-based methods for dealing with this problem. The research was
motivated by a study of the effectiveness of a disease self-management
program in reducing health care utilization among older women with heart
disease. This is work-in-progress.
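For the simple binary case described above, the standard IV (Wald) estimator divides the ITT effect on the outcome by the ITT effect on treatment received. The sketch below shows only that binary case, under the usual IV assumptions; it is not the likelihood-based method for multiple compliance levels, and the names are illustrative.

```python
def _mean(values):
    return sum(values) / len(values)

def wald_iv_estimate(assigned, received, outcome):
    """Wald IV estimate of the average causal effect among compliers.

    assigned: 0/1 randomized assignment; received: 0/1 treatment actually
    taken; outcome: numeric response. All three are equal-length sequences.
    """
    y1 = [y for z, y in zip(assigned, outcome) if z == 1]
    y0 = [y for z, y in zip(assigned, outcome) if z == 0]
    d1 = [d for z, d in zip(assigned, received) if z == 1]
    d0 = [d for z, d in zip(assigned, received) if z == 0]
    itt_outcome = _mean(y1) - _mean(y0)      # intention-to-treat effect
    itt_received = _mean(d1) - _mean(d0)     # difference in uptake
    if itt_received == 0:
        raise ValueError("assignment does not change treatment received")
    return itt_outcome / itt_received
```

When only half of those assigned to treatment take it and nobody in the control arm does, the estimate is twice the ITT effect.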
Statistical Inference for Branching Processes
Nikolay Yanev
It is well known that branching processes have many applications in
biology. In this talk the asymptotic behavior of branching populations
having an increasing and random number of ancestors is investigated. An
estimation theory will be developed for the mean, variance and offspring
distributions of the process $\{Z_{t}(n)\}$ with random number of
ancestors $Z_{0}(n)$, as both $n$ (and thus $Z_{0}(n)$, in some sense) and
$t$ approach infinity. Nonparametric estimators are proposed and shown to
be consistent and asymptotically normal. Some censored estimators are also
considered. It is shown that all results can be transferred to branching
processes with immigration, under an appropriate sampling scheme. A system
for simulation and estimation of branching processes will be demonstrated.
No preliminary knowledge in this field is assumed.
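As a rough illustration of simulation and estimation for a branching process, the sketch below simulates a discrete-time process and applies a standard ratio (Harris-type) estimator of the offspring mean. The offspring law used here (each individual leaves 0 or 2 offspring) and all names are assumptions made for the example; this is not the system demonstrated in the talk.

```python
import random

def simulate_generations(z0, t_max, p_split, seed=0):
    """Generation sizes Z_0..Z_t_max; each individual leaves 2 offspring
    with probability p_split and 0 otherwise (ASSUMED offspring law)."""
    rng = random.Random(seed)
    sizes = [z0]
    for _ in range(t_max):
        sizes.append(sum(2 for _ in range(sizes[-1]) if rng.random() < p_split))
    return sizes

def harris_mean(sizes):
    """Ratio estimator of the offspring mean: total offspring / total parents."""
    parents = sum(sizes[:-1])
    if parents == 0:
        raise ValueError("no parents observed")
    return sum(sizes[1:]) / parents
```

Under this offspring law the true mean is `2 * p_split`, so `harris_mean(simulate_generations(1000, 5, 0.6))` should be close to 1.2 for a large number of ancestors.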
Modeling of Stochastic Periodicity: Renewal, Regenerative and Branching Processes
Nikolay Yanev
Department of Probability and Statistics, Chair, Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
In deterministic processes periodicity is usually well defined. However,
in the stochastic case there are many possible models. One way to study
stochastic periodicity is proposed in this lecture. The models are based
on Alternating Renewal and Regenerative Processes. The limiting behavior
is investigated, with special attention given to the case of periods of
regeneration with infinite mean. Two applications to branching
processes are considered: Bellman-Harris branching processes with
state-dependent immigration and discrete-time branching processes with
random migration.
The main purpose of the talk is to describe stochastic models which can
be applied in Biology, especially Epidemiology and Biotechnology.
No preliminary knowledge in this field is assumed.
Testing Approximate Statistical Hypotheses
Y. N. Tyurin, Moscow State University
Statistical hypotheses often take the form of statements about some
properties of functionals of probability distributions. Usually, according
to a hypothesis the functionals in question have certain exact values.
Many of the classical statistical hypotheses are of this form: the
hypothesis about mathematical expectation of a normal sample
(one-dimensional or multidimensional); the hypothesis about probabilities
of outcomes in independent trials (which should be tested based on
observed frequencies); the linear hypotheses in Gaussian linear models,
etc.
Stated as suppositions about exact values, those hypotheses do not
accurately express the thinking of natural scientists. In practice an
applied scientist would be satisfied if those or similar suppositions were
"correct" in some approximate sense (meaning their approximate agreement
with statistical data).
The above-mentioned discrepancy between the applied-science approach and
its mathematical expression leads to rejection of any statistical
hypothesis given a sufficiently large amount of sample data, a well-known
statistical phenomenon.
This talk will show how hypotheses about exact values can be re-stated
as rigorously formulated approximate hypotheses and how those can be
tested against sample data with special attention given to the hypotheses
mentioned above.
Fall 2004 Biostatistics Brown Bag Seminar Abstracts
A Bayesian Analysis of Multiple Hypothesis Tests
Anthony Almudevar
A Bayesian methodology is proposed for the problem of multiple
hypothesis tests for a given effect. The density of test statistics is
modelled as a mixture based on hypothesis status. A full posterior measure
is constructed for the mixture conditional on the observable total
density. Commonly used quantities such as false discovery rates and
posterior probabilities of hypothesis status can be directly calculated
from the mixture, and so full posterior measures for these quantities can
be directly obtained. The posterior measure is computed by Markov chain
Monte Carlo sampling. This approach proves to be very flexible,
allowing a model for the magnitude of the effects, as well as for
dependence structure, to be developed and incorporated into the posterior
measure. In addition, this approach is ideally suited to the situation in
which the presence of large numbers of marginal, or weak, effects
complicates any attempt to estimate the hypothesis mixture. In this case,
a simple redefinition of the null hypothesis is proposed which makes the
mixture estimation well defined and feasible.
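To make the mixture idea concrete, the sketch below computes the posterior probability that a hypothesis is null at a given test statistic for a two-component mixture with known components. The normal densities, the null proportion, and the function names are assumptions for illustration only; the talk's method places a full posterior on the mixture itself rather than treating it as known.

```python
import math

def normal_pdf(t, mean, sd):
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_null_probability(t, pi0, alt_mean, alt_sd=1.0):
    """P(null | statistic t) for the mixture pi0*N(0,1) + (1-pi0)*N(alt_mean, alt_sd).

    The mixture components are ASSUMED known here, purely for illustration.
    """
    null_part = pi0 * normal_pdf(t, 0.0, 1.0)
    alt_part = (1.0 - pi0) * normal_pdf(t, alt_mean, alt_sd)
    return null_part / (null_part + alt_part)
```

Averaging this quantity over the statistics in a rejection region gives the corresponding false discovery rate under the same assumed mixture.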
Analysis of Variance, Coefficient of
Determination, and Approximate F-tests for Local Polynomial Regression
Li-Shan Huang
In this paper, we develop analogous ANOVA inference tools for
nonparametric local polynomial regression in the simple case with
bivariate data. The results include: (i) a local exact ANOVA
decomposition, (ii) a local R-squared, (iii) a global ANOVA decomposition,
(iv) a global R-squared, (v) an asymptotically idempotent projection
matrix, (vi) degrees of freedom, and (vii) approximate $F$-tests. We also
provide some interesting geometric views of why a local exact ANOVA
decomposition holds. The work here is different from earlier developments
by other investigators. This is joint work with Jianwei Chen in the
department.
On the Role of Copula Models in Survival Analysis
David Oakes
An important class of models in bivariate survival analysis consists of
the so-called frailty models, which arise from the introduction of a
common unobserved random proportionality factor into the hazard functions
of the two related survival times. This assumption leads to a simple
copula representation of the joint survivor function. The class of such
models will be described and various characterization results presented.
Some extensions will be discussed. Methods for parametric and
semiparametric inference about the parameters governing the marginal
distributions and the association structure will be surveyed briefly.
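One familiar member of this class arises when the shared frailty is gamma distributed with mean 1 and variance theta: the joint survivor function is then a Clayton-type copula of the two marginal survivor functions. The sketch below evaluates that formula; it covers only this gamma case, chosen here as an example, and the function name is illustrative.

```python
def clayton_joint_survival(s1, s2, theta):
    """Joint survivor function under a gamma frailty with variance theta > 0.

    s1, s2: marginal survival probabilities S1(t1), S2(t2) in [0, 1].
    Returns (s1**-theta + s2**-theta - 1) ** (-1/theta).
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    if s1 == 0.0 or s2 == 0.0:
        return 0.0
    return (s1 ** -theta + s2 ** -theta - 1.0) ** (-1.0 / theta)
```

As theta approaches zero the value approaches the independence product `s1 * s2`, and larger theta gives stronger positive association between the two survival times.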
Cost-Effectiveness Studies Associated with
Clinical Trials - Projecting Effects Beyond the Range of the Data
Jack Hall
I start with a quick overview of the MADIT-II clinical trial and of the
associated cost-effectiveness study, including a general overview of
cost-effectiveness studies. I then review the need for projecting results
beyond the limited (3.5 years) span of the available data and the
associated difficulties [and fool-hardiness???]. Finally, I talk about the
life-table method we developed for use in the MADIT-II cost-effectiveness
study to project survival experience beyond the time span of the data.
[Hongwei Zhao, Hongkun Wang and Hongyue Wang contributed to the
MADIT-II cost-effectiveness study, as well as five people from the
Community and Preventive Medicine Department and two in the Heart Research
Group. A manuscript has just been submitted for publication (with these
eleven authors).]
Spring 2004 Biostatistics Brown Bag Seminar Abstracts
Quick and Easy Solutions for Dealing with Data: Part 1
Arthur Watts
- 1: Ask a Programmer.
- We have an excellent group of programmers with vast experience in
data management and analysis. Although our bread and butter has been
clinical trials, we have provided a variety of services over the years,
including safety monitoring, simulations, double imputation, and many
other customized statistical applications. We always check the data for
errors, to minimize the possibility of having to re-run the analysis, or
worse yet, re-publish the results. In addition to indexed notebooks,
final results are often delivered on a CD containing the database,
PowerPoint summary, Microsoft Word reports and a browsable html
document. Our web based services have included on-line surveys,
randomization, data collection and study monitoring. Before you start a
new analysis, stop by and see us. We can save you a lot of time.
- 2: Visit our Web site.
- In an effort to make clinical databases easier to analyze, we have
developed an extensive library of flexible procedures that create a
variety of statistical reports. Since investigators want to look at many
outcome measures, these procedures operate on lists of variables,
looping through each variable to run the analysis. Eliminating much of
the tedious programming usually required to analyze clinical databases,
these procedures can save hours of programming time. "Let the computer
do your work for you."
Conditional Inference Methods for Incomplete Poisson Data With Endogenous Time-Varying Covariates
Jason Roy
We investigate the effect of protease inhibitors (PIs) on the rate of
emergency room (ER) visits among HIV-infected women from a longitudinal
cohort study. One strategy to account for serial correlation in
longitudinal studies is to assume observations are independent,
conditional on unit-specific nuisance parameters. It is possible to
estimate these models using unconditional maximum likelihood, where the
nuisance parameters are assigned a parametric distribution and integrated
out of the likelihood. Alternatively, we can proceed using conditional
inference, where we eliminate the nuisance parameters from the likelihood
by conditioning on a sufficient statistic for these parameters. An
advantage of conditional inference methods over parametric random effects
models is that all patient-level time-invariant factors (both measured and
unmeasured) are accounted for in the analysis. A limitation is that standard
conditional inference methods assume missing data are missing completely
at random and do not allow endogenous time-varying covariates (i.e., ER
visits in the past cannot predict future PI use). Both assumptions are
unlikely to be met for these data, because one would expect `sicker'
patients would be more likely to receive treatment and/or drop out from
the study. We develop new estimation strategies that allow endogenous
time-varying covariates and missing at random dropouts. The analysis shows
that PI use reduces the rate of ER visits among patients whose CD4 cell
count was <200 cells/µL at baseline. The size of the effect is
substantially smaller than that estimated using a random effects
approach.
On the Density of the Solution to a Random System of Equations
Anthony Almudevar
Paradoxical Association of a Group of Atherosclerosis-related Genotypes with Reduced Rate of Coronary Events After Myocardial Infarction
David Oakes
Local Polynomial Density Estimation With Interval Censored Data
Derick R. Peterson and Mark J. van der Laan
A survival time is interval censored if only its current status, an
indicator of whether the event has occurred, is observed at a possibly
random number of monitoring times. We provide estimators with pointwise
confidence limits for all derivatives of the distribution of the time till
event, assuming that the observed monitoring times are independent of the
time of interest. Our estimator is a standard local polynomial regression
smoother applied to the pooled sample of dependent current status
observations. We show that the proposed estimator has a normal limiting
distribution identical to that of a smoother applied to independent
current status observations. Thus local bandwidth selection techniques and
pointwise confidence limit procedures for standard nonparametric
regression perform properly, despite the dependence in the pooled sample.
Pre-limit Theorems and Their Applications
Lev Klebanov
Finitely many empirical observations can never justify any tail
behavior, thus they cannot justify the applicability of classical limit
theorems in probability theory. In this paper we attempt to show that
instead of relying on limit theorems, one may use the so-called pre-limit
theorems explained later. The applicability of our pre-limit theorem
relies not on the tail but on the 'central section' ('body') of the
distributions and as a result, instead of a limiting behavior (when $n$,
the number of i.i.d. observations tends to infinity), the pre-limit
theorem should provide an approximation for distribution functions in case
$n$ is 'large' but not too 'large'. Our pre-limiting approach seems to be
more realistic for practical applications.
p-Values-Only-Based Stepwise Procedures for Multiple Testing and Their Optimality Properties
Alexander Gordon
Modeling Cancer Screening: Further Thoughts and Results
Andrei Yakovlev
Over the years, many large-scale randomized trials have been conducted
to evaluate the effects of breast cancer screening. These trials have
failed to provide conclusive evidence for significant survival benefits of
mammographic screening because of certain pitfalls in their design and
lack of statistical power. However, such studies represent a rich source
of information on the natural history of breast cancer, thereby opening up
the way to evaluate potential benefits of breast cancer screening through
using realistic mathematical models of cancer development and detection.
We propose a biologically motivated model of breast cancer development and
detection allowing for arbitrary screening schedules and the effects of
clinical covariates recorded at the time of diagnosis on post-treatment
survival. Biologically meaningful parameters of the model are estimated by
the method of maximum likelihood from the data on age and tumor size at
detection that resulted from two randomized trials known as the Canadian
National Breast Screening Studies. When properly calibrated, the model
provides a good description of the U.S. national trends in breast cancer
incidence and mortality. The model was validated by predicting (without
any further calibration or tuning) certain quantitative characteristics
obtained from the SEER data. In particular, the model provides an
excellent prediction of the size-specific age-adjusted incidence of
invasive breast cancer as a function of calendar time for the period
1975-1999. Predictive properties of the model are also illustrated with an
application to the dynamics of age-specific incidence and stage-specific
age-adjusted incidence over the period 1975-1999.
Iterated Birth and Death Markov Process and its Biological Applications
Leonid Hanin
We solve, under realistic biological assumptions, the following
long-standing problem in radiation biology: to find the distribution of
the number of clonogenic tumor cells surviving a given arbitrary schedule
of fractionated radiation. Mathematically, this leads to the problem of
computing the distribution of the state N(t) of an iterated birth and
death Markov process at any time t counted from the end of exposure. We
show that the distribution of the random variable N(t) belongs to the
class of generalized negative binomial distributions, find an explicit
computationally feasible formula for this distribution, and identify its
limiting forms. In particular, for t = 0, the limiting distribution turns
out to be Poisson, and an estimate of the rate of convergence in the total
variation metric that generalizes the classical Law of Rare Events is
obtained.
Statistical Methods of Translating Microarray
Data into Clinically Relevant Diagnostic Information in Colorectal Cancer
Byung Soo Kim
The aim of the study is twofold. First, we identify a set of
differentially expressed (DE) genes in colorectal cancer, compared with
normal colorectal tissues, to rank genes for the development of biomarkers
for population screening of colorectal cancer. Second, we detect a set of
DE genes for subtypes of colorectal cancer which can be classified with
respect to stage, location and carcino-embryonic antigen (CEA) level. The
cancer and normal tissues were obtained from 87 colorectal cancer patients
who underwent surgery at Severance Hospital, Yonsei Cancer Center, Yonsei
University College of Medicine, from May to December of 2002. We
originally attempted to extract total RNA from tumor and normal tissues
from all 87 patients. From 36 patients we had RNA specimens for both
tumor and normal tissues. However, from 19 patients only normal-tissue
RNA specimens were available, and from 32 patients only tumor specimens.
Thus, we have a matched pair sample of size 36 and two independent
samples of sizes 19 and 32. We conducted a cDNA microarray experiment
using a common reference design with 17K human cDNA microarrays. We
pooled eleven cancer cell lines from various origins and used the pool as
the common reference. We used $M = \log_2(R/G)$ for the evaluation of
relative intensity. As a
means of utilizing the whole data set we first use the matched pair data
set as a training set from which we detect a set of DE genes between the
normal tissue and the tumor. Then we use the pool of two independent data
sets of "tumor only" and "normal only" as the test set for the validation.
We employ four procedures for detecting a set of DE genes from the matched
pair sample of size 36: the paired t test and Dudoit et al.'s maxT procedure;
Tusher et al.'s SAM procedure; Lönnstedt and Speed's empirical Bayes
procedure; and Hotelling's $T^2$ statistic. We employ diagonal
quadratic discriminant analysis for the classification of the test set. We
modify standard methods for the data at hand and propose a t-based
statistic, say $t_3$, which combines three data types for the
detection of DE genes. We also extend Pepe et al.'s ROC approach of
ranking genes for the purpose of biomarker development to our mixed data
type (Pepe et al., 2003 Biometrics). We note that only a few genes are
required to achieve 0% test error in discriminating the normal tissue from
the colorectal cancer. For the subtype analyses, various approaches failed
to identify DE genes with respect to colon cancer versus rectal cancer and
stage B versus stage C. We employed a regression approach to detect a few
genes that correlated well with CEA.
Fall 2003 Biostatistics Brown Bag Seminar Abstracts
Biomarker Measurement Error: A Bayesian Approach with Application to Lung Cancer
Sally W. Thurston
Molecular biologists have identified specific cellular changes, called
biomarkers, which enable them to better understand the pathway from
chemical exposure to initiation of some cancers. In lung cancer, one such
biomarker is the number of DNA adducts in lung tissue. Adducts are formed
from the binding of cigarette carcinogens to DNA, and this adduct
formation plays a central role in lung cancer initiation from smoking.
The goal of this work is to incorporate knowledge of such underlying
biological mechanisms into a useful statistical framework to improve
cancer risk estimates. The model considers adducts in the blood to be a
surrogate measure of lung adducts. Lung adducts can never be measured in
controls. The model is developed on a subset of the data, a small portion
of which has biomarker measurements, and is used to predict cancer risk
for the remaining data which do not have biomarker measurements. These
predictions are compared to those from a traditional model, and to
observed case/control status. Although the biomarker model compares
favorably with the traditional approach, model diagnostics suggest that
better predictions could be made from an expanded model which allows for
measurement error in lung adducts.
Functional Response Models and their Applications
Xin M. Tu
I will discuss a new class of semi-parametric (distribution-free)
regression models with functional responses. This class of functional
response models (FRM) generalizes the traditional regression models by
defining the response variable as a function of several responses from
multiple subjects. By using such multiple-subjects-based responses, the
FRM integrates many popular non- and semi-parametric approaches within a
unified modeling framework. For example, under the proposed framework, we
can derive regression models to perform inferences for two-way contingency
tables and to estimate variance components by identifying them as model
parameters. The FRM also provides a theoretical platform for developing new
models that address limitations of existing non- and semi-parametric models. For
example, we can develop FRMs to generalize ANOVA so that we can not only
compare the means, but also the variances of the multiple groups, and to
derive and extend the Mann-Whitney-Wilcoxon (MWW) rank-based tests to more
than two groups. For inferences, we discuss a novel approach by
integrating the U-statistic theory with the generalized estimating
equations. The talk is illustrated with examples from biomedical and
psychosocial research.
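As a small illustration of a response defined on more than one subject, the sketch below writes the two-group Mann-Whitney-Wilcoxon statistic as an average of a pairwise function of responses. The function names are hypothetical, and this shows only the pairwise-response idea, not the FRM estimating equations discussed in the talk.

```python
def mww_pairwise(y1, y2):
    """Pairwise response: 1 if y1 < y2, 0.5 for a tie, 0 otherwise."""
    if y1 < y2:
        return 1.0
    if y1 == y2:
        return 0.5
    return 0.0


def mww_u(group1, group2):
    """U-statistic estimate of P(Y1 < Y2) + 0.5 * P(Y1 = Y2).

    Averages the pairwise response over all (subject in group1,
    subject in group2) pairs.
    """
    total = sum(mww_pairwise(a, b) for a in group1 for b in group2)
    return total / (len(group1) * len(group2))
```

A value near 0.5 is what one would expect when the two groups have the same distribution.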
Biomedical Modeling, Prediction and
Simulation Hulin Wu
In this brown-bag seminar, I am going to give a brief introduction to
several on-going projects in my research group. Our research projects
include:
1) Nonparametric smoothing/regression methods for longitudinal data with
applications to long-term HIV dynamic modeling
2) Mechanism-based modeling of longitudinal data with applications to AIDS
treatment response modeling
a) Hierarchical Bayesian approach
b) Mixed-effects state-space model approach
3) Nonlinear and time-varying coefficient state-space models and particle
filter techniques with applications to SARS epidemics
4) Clinical trial modeling and simulations
In summary, we are trying to combine the models and techniques from
biomathematics, engineering, computer science and statistics to solve
important biomedical problems. The multidisciplinary nature of these
projects will be further enhanced in the next several years. Currently the
research faculty and postdoc fellows who are involved in these projects
include Drs. Yangxin Huang, Jianwei Chen, Haihong Zhu and Dacheng Liu as
well as other external collaborators.
A Discussion on Intent-To-Treat Principle for
Blood Transfusion Trials Hongwei Zhao
Spring 2003 Biostatistics Brown Bag Seminar Abstracts
Statistical Analysis of Skewed
Data Hongkun Wang and Hongwei Zhao
This talk is motivated by an example where the dependent variable has a
lot of zero values and a very skewed distribution, and the interest is to
find a relationship between several covariates and this variable. We will
briefly examine some of the current literature dealing with this problem. We
will also discuss the interpretation of the parameters for some of those
proposed models. In the end we will present the results of the data
analysis of our example.
Inference on multi-type cell systems using
clonal data and application to oligodendrocytes development in cell
culture Ollivier Hyrien
Fall 2002 Biostatistics Brown Bag Seminar Abstracts
Designing and Analyzing a Small Bernoulli-Trial
Experiment, with Application to a Recent Cardiological Device Trial
Jack Hall
In the recent `WEARIT' trial, the success of a wearable defibrillator
in preventing death from a heart attack in patients awaiting a heart
transplant was evaluated. A trial design was called for that would meet
certain requirements on error probabilities, that would make a decision --
for or against the device -- within a specified maximum number (n = 15)
of heart attack incidents in a group of recruited patients, and hopefully
would terminate after many fewer incidents. We will use this setting to
review single and double sampling plans, curtailed sampling, and various
other sequential sampling plans that might be used for such a trial, along
with the associated methodology for inference about the implicit Bernoulli
parameter -- the success rate in resuscitating patients after a heart
attack. We present this in the context of the WEARIT trial. You may be
surprised how many statistical issues arise in an inference problem
associated with observation of a few Bernoulli trials!
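To make the idea of curtailed sampling concrete, here is a minimal sketch of a curtailed single sampling plan for Bernoulli trials. It assumes a plan that declares success once r successes are seen and stops for failure once r successes can no longer be reached within n trials. The threshold r and the function name are hypothetical and are not taken from the WEARIT design.

```python
from math import comb


def curtailed_plan(n, r, p):
    """Return (P(accept), expected number of trials) for a curtailed plan.

    Accept as soon as r successes are observed; reject as soon as
    n - r + 1 failures are observed, since r successes are then
    impossible within n trials. p is the per-trial success probability.
    """
    f = n - r + 1  # failures that force rejection
    q = 1.0 - p
    accept = 0.0
    expected = 0.0
    # Acceptance at trial k: the r-th success falls exactly on trial k.
    for k in range(r, n + 1):
        prob = comb(k - 1, r - 1) * p**r * q**(k - r)
        accept += prob
        expected += k * prob
    # Rejection at trial k: the f-th failure falls exactly on trial k.
    for k in range(f, n + 1):
        prob = comb(k - 1, f - 1) * q**f * p**(k - f)
        expected += k * prob
    return accept, expected
```

Curtailment leaves the acceptance probability equal to the binomial tail probability of the fixed-n plan, while the expected number of trials can be well below n.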
Using Local Correlation in Kernel-Based
Smoothers for Dependent Data Derick Peterson
This is joint work with Hongwei Zhao and Sara Eapen.
Informative Prior Specification for Linear
Regression Models using Parameter Decompositions Sally
Thurston
I will motivate this work by discussing a dataset for which the
intended Bayesian analysis requires an informative prior, due to
interactions for which the data likelihood has no direct information. I
will then present a method of obtaining informative priors for a linear
regression model, based on information elicited from a subject matter
expert. This method relies on a decomposition, novel in the multivariate
case, of regression coefficients, their covariance matrix, and the
residual variance of the regression. The only quantities which the expert
needs to specify are the population means, variances, and pairwise
correlations. Finally, I will discuss how I used the information elicited
from the expert to obtain a proper informative prior for this example.
This is joint work with Joe Ibrahim and Susan Korrick.
Topology, DNA Topology and Some Probabilistic
Models of Nucleic Acids Eva Culakova
This is an informative talk based on already known results. The
presentation was inspired by my effort to understand the book by A D Bates
and A Maxwell "DNA Topology". First I will introduce a classical result
about the "Hopf map" in order to convey my appreciation for the field of
topology. Next I will give an example of a situation where topology can
help to understand DNA recombination. At the end I will briefly introduce
a probabilistic model that is used to decide whether two nucleic acid or
protein sequences are related. This part is based on the book by
Durbin, Eddy, Krogh and Mitchison "Biological Sequence Analysis".
Li-Shan Huang
I will discuss the paper by Doksum, K., and Samarov, A. (1995),
"Nonparametric estimation of global functionals and a measure of the
explanatory power of covariates in regression," Annals of Statistics, 23,
1443-1473, and propose new ideas for a nonparametric coefficient of
determination.
An Overview of Multiple Imputation
Michael McDermott
Spring 2002 Biostatistics Brown Bag Seminar Abstracts
MADIT-II: A Recently Completed Sequential
Clinical Trial Jack Hall
The Multicenter Automatic Defibrillator Implantation Trial #2,
administered here at the UR Medical Center with 1232 heart-disease
patients enrolled through 76 hospital centers, came to a favorable
conclusion in November by reaching a pre-specified sequential stopping
criterion for efficacy. The statistical work was, and continues to be,
carried out here, including statistical design of the study, weekly
analyses of the survivorship data, chairing of the Monitoring Committee,
and final analyses of efficacy and side-effects data, with cost analyses
still to come. This talk will give an overview, focusing on the
statistical aspects of designing, monitoring and analyzing such trial
data.
Combining Stratified and Unstratified Log-Rank
Tests for Correlated Survival Data Changyong Feng
The log-rank test is the most widely used nonparametric method for
testing treatment differences in survival-analysis-based clinical trials
due to its efficiency under proportional hazards. Most previous work on
the log-rank test has assumed that the samples from the two treatment
groups are independent. However, in multicenter clinical trials, survival
times of patients in the same medical center may be correlated due to some
factors specific to each center; or studies may utilize pairing of
patients or response units, resulting in dependence. For such data we can
construct stratified and unstratified log-rank tests (call them SLRT and
ULRT respectively). These two tests address somewhat different features of
the data. An appropriate linear combination of these two tests may give a
more powerful test than either individual test. Under a matched-pair
frailty model, we obtain closed-form asymptotic local alternative
distributions and the correlation coefficient of SLRT and ULRT. Based on
these results we construct an optimal linear combination of the two test
statistics. Simulation studies with the Hougaard model confirm our
construction. Our approach is illustrated with data from the Diabetic
Retinopathy Study (Huster et al., 1989). We extend our work to the cases of
stratum size > 2 and of variable (but upper bounded) stratum sizes.
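The idea of an optimal linear combination can be sketched generically. The code below assumes two test statistics that are each standardized to unit variance, are jointly asymptotically normal with correlation rho, and have drifts mu1 and mu2 under a local alternative. These inputs and the function names are assumptions for illustration; the closed-form quantities derived in the talk under the frailty model are not reproduced here.

```python
from math import sqrt


def combination_sd(w1, w2, rho):
    """Standard deviation of w1*Z1 + w2*Z2 for unit-variance Z1, Z2."""
    return sqrt(w1 * w1 + 2.0 * rho * w1 * w2 + w2 * w2)


def optimal_weights(mu1, mu2, rho):
    """Weights maximizing drift per unit s.d.: proportional to Sigma^{-1} mu.

    Requires |rho| < 1 and (mu1, mu2) != (0, 0).
    """
    w1 = mu1 - rho * mu2
    w2 = mu2 - rho * mu1
    sd = combination_sd(w1, w2, rho)
    return w1 / sd, w2 / sd


def noncentrality(w1, w2, mu1, mu2, rho):
    """Drift of the combined statistic divided by its standard deviation."""
    return (w1 * mu1 + w2 * mu2) / combination_sd(w1, w2, rho)
```

With these weights the noncentrality is at least as large as that of either statistic alone, which is the sense in which the combination can be more powerful.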
Non-Sexual Household Transmission of HCV
Infection Fenyuan Xiao
Objective: This study was designed to determine the prevalence and the
incidence of HCV infection among non-sexual household contacts of
HCV-infected women and to describe the association between HCV infection
and potential household risk factors in order to examine whether
non-sexual household contact is a route of HCV transmission. Methods: A
baseline prevalence survey included 409 non-sexual household contacts of
241 HCV-infected index women in the Houston area from 1994 to 1997. A
total of 470 non-sexual household contacts with no evidence of HCV
infection at baseline investigation were re-assessed approximately three
years after baseline enrollment. Information on potential risk factors was
collected through face to face interviews and blood samples were tested
for anti-HCV with ELISA-2 and Matrix / RIBA-2. The relationships between
HCV infection and potential risk factors were examined by using univariate
and multivariate logistic regression analyses. Results: The overall
prevalence of anti-HCV positivity among 409 non-sexual household contacts
was 4.4%. The highest prevalence of anti-HCV was found in parents (19.5%),
followed by siblings (8.1%) and other relatives (5.6%); the children had
the lowest prevalence of anti-HCV (1.2%). The univariate analysis showed
that IDU, blood transfusion, tattoos, sexual contact with injecting drug
users, more than 3 sexual partners in a lifetime, history of an STD,
incarceration, previous hepatitis, and contact with hepatitis patients
were significantly associated with HCV infection; however, sharing razors,
nail clippers, toothbrushes, gum, food or beds with HCV-infected women,
and history of dialysis, health care job, body piercing, and homosexual
activities were not. Multivariate analysis found that IDU (OR = 221.7 with
95% CI of 22.8 to 2155.7) and history of an STD (OR = 11.7 with 95% CI of
1.2 to 113.1) were the only variables significantly associated with HCV
infection. No such associations remained for other risk factors. The
three-year cumulative incidence of anti-HCV among 352 non-sexual
household contacts of HCV-infected women was zero. Conclusion: This study
has provided no evidence that non-sexual household contact is a likely
route of transmission for HCV infection. The risk of sharing razors, nail
clippers, toothbrushes, gum, food and/or beds with HCV-infected women is
not evident and has not been shown to be the likely mode for HCV spread
among family members. This study does suggest that IDU is the likely route
of transmission for most HCV infection. Association also has been shown
independently with a history of STD. The prevalence of anti-HCV among
non-sexual household contacts was low. Exposure to common parenteral risk
factors and sexual transmission between sexual partners may account for
HCV spread among household members of HCV-infected persons.
Parameter Estimation in Bivariate Copula
Models Antai Wang
Many models have been proposed for multivariate failure-time data
(T_{1},T_{2}) arising in reliability and other applications. A bivariate
survivor function S(t_{1},t_{2}) is said to be generated by an archimedean
copula if it can be expressed in the form
S(t_{1},t_{2})=p[q{S_{1}(t_{1})}+q{S_{2}(t_{2})}] for some convex,
decreasing function q defined on (0,1]. Here p is the inverse function
of q. Usually, p is specified as some function of an unknown parameter.
Given a sample from S(t_{1},t_{2}), the distribution function of
V=S(T_{1},T_{2}), called the Kendall distribution, can be expressed simply
in terms of q. We use the score function from the log-likelihood of the
V's to estimate the unknown parameter. Although the V's are unknown, they
can be estimated empirically. Interestingly, our estimates based on the
empirical V's are much more precise than the estimates based on the true
and unknown V's. We also investigate an alternative procedure based on
iteratively estimating the V's using the assumed copula structure. We
discuss the asymptotic theory for both methods and present some
illustrative examples. I will also briefly cover the recent development of
a new method to estimate the parameter for bivariate data subject to random
right censoring.
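The construction can be sketched for one concrete family. The code below assumes the Clayton generator q(s) = (s^(-theta) - 1)/theta with theta > 0, which is chosen here only for illustration and is not necessarily the family used in the talk. It also uses the standard expression for the Kendall distribution of an archimedean copula, K(v) = v - q(v)/q'(v).

```python
def q(s, theta):
    """Clayton generator: convex and decreasing on (0, 1], with q(1) = 0."""
    return (s ** (-theta) - 1.0) / theta


def p(u, theta):
    """Inverse of the generator q."""
    return (1.0 + theta * u) ** (-1.0 / theta)


def joint_survivor(s1, s2, theta):
    """S(t1, t2) from the marginal values s1 = S1(t1) and s2 = S2(t2)."""
    return p(q(s1, theta) + q(s2, theta), theta)


def kendall_cdf(v, theta):
    """Kendall distribution K(v) = v - q(v)/q'(v) for 0 < v <= 1."""
    dq = -(v ** (-theta - 1.0))  # derivative of q at v
    return v - q(v, theta) / dq
```

For example, with theta = 2 the sketch gives K(0.5) = 0.6875, and setting one margin to 1 returns the other margin unchanged.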
Microarray Analyses Li-Shan
Huang
Bayesian inference of
phylogeny John Huelsenbeck
A Few Remarks on Partial Correlation
Heng Li
A Generalization of ROC
Curves Michael McDermott
Fall 2001 Biostatistics Brown Bag Seminar Abstracts
Use of placebo-controls vs. active-controls
in clinical trials evaluating new treatments Mike
McDermott
Using Measurement Error Models w/o and w/
Interactions to Assess Effects of Prenatal and Postnatal Methylmercury
Exposure in the Seychelles Child Development Study at age
66-months Li-Shan Huang
Overrunning in Sequential Clinical
Trials Jack Hall
Most large-scale clinical trials these days have sequential stopping
rules that permit early termination of the trial when clear superiority of
a treatment is firmly established early in the trial. Once a stopping
boundary has been reached, statistical methods allow computation of
p-values and estimates of treatment effects which recognize the sequential
stopping rule. Typically, however, additional `lagged' data become
available after the boundary has been reached. Earlier methods of
accommodating such `overrunning' have serious defects. Two new methods
(one joint with Aiyi Liu, the other joint with Keyue Ding) will be
described, and illustrated with data from the MADIT trial of an implanted
defibrillator (New England Journal of Medicine, 335:1933-40, 1996).
A Second Look at Some Statistical Ideas Via
Geometric Projection Heng Li
Geometric concepts have always been useful in statistics. Consider, for
example, the number of situations in which the idea of orthogonal
projection plays a crucial role. We will discuss a closely related
geometric operation, to be called orthogonal cross projection, and point
out some of its manifestations in statistics (e.g., covariance). PowerPoint
technology will be used in the presentation, provided that all the
equipment is functional and is not too sophisticated for the presenter
to operate.
Two-Period Designs: Part II David
Oakes
On Two Consistent Tests of Bivariate
Independence and Some Applications Greg Wilding
The use of the correlation coefficient for testing bivariate
independence, although most common, has serious limitations. In this talk
I will discuss Hoeffding's (1948) test of bivariate independence, and its
asymptotic equivalent due to Blum, Kiefer and Rosenblatt (1961), which are
well known to be consistent against all dependence alternatives.
Specifically, I will describe the status of its null distribution and
compare its power using a variety of copulas, including those due to
Morgenstern, Gumbel, Plackett, Marshall and Olkin, Raftery, Clayton, and
Frank. I will also show how the test of bivariate independence can be used
for constructing simple goodness-of-fit tests.
Smoothing Longitudinal Data: A Work in
Progress Derick R. Peterson, Hongwei Zhao, Sara Eapen
We consider the general problem of smoothing longitudinal data to
estimate the nonparametric marginal mean function, where a random but
bounded number of measurements are available for each independent subject.
In stark contrast to recent work in this area, we show that not only can
consistent estimators use the correlation structure of the data but that
ignoring this correlation structure necessarily results in inefficiency,
just as in the parametric setting. The class of local polynomial
kernel-based estimating equations considered by Lin & Carroll (JASA
2000) is shown to be too small, such that it cannot properly make use
of the correlation structure; this explains the problem with their general
message that it is best to assume working independence, while also
providing insight into why penalized likelihood-based correlated smoothing
splines can be expected to be efficient. We propose a class of simple,
explicit ad hoc estimators which although not efficient can improve upon
the working independence local polynomial modeling approach by making use
of the local correlation structure to dramatically improve the precision
even for moderate sample sizes.
Spring 2001 Biostatistics Brown Bag Seminar Abstracts
30th Anniversary of the Biplot Ruben
Gabriel
A Review of Nonparametric Survival Estimation
with Bivariate Right-Censored Data Derick Peterson
The problem of nonparametric estimation of the survival function with
censored data has an elegant and efficient solution in the one-dimensional
case: the Kaplan-Meier estimator. In higher dimensions, with multiple,
possibly correlated, survival times, however, the task is much more
formidable. Several authors have proposed ad hoc estimators in this model,
and in 1996 van der Laan proposed a theoretically efficient estimator,
while also analyzing inefficient estimators previously proposed by
Dabrowska, Prentice and Cai, and Pruitt. I will review these estimators
and explain why the NPMLE is not, in general, consistent for the bivariate
survival function. Unlike in the one-dimensional case, some sort of
smoothing is required for efficient estimation. Bandwidth selection
remains an open problem in this context, thus contributing to the slow
uptake of van der Laan's estimator.
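For reference, the one-dimensional Kaplan-Meier estimator mentioned above can be sketched in a few lines. This is a minimal illustration with hypothetical function and argument names; it does not address the bivariate estimators reviewed in the talk.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed failure time.

    times: observed times (failure or censoring).
    events: 1 if the time is a failure, 0 if it is right-censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve
```

Censored subjects contribute to the risk set up to their censoring time but never trigger a drop in the estimated curve.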
Confidence Intervals: Equal-Tail, Shortest or
Unbiased? Jack Hall
Various criteria for choosing confidence intervals have been considered
in the literature. We focus on three, named in the title. When based on a
pivot with a symmetric distribution, the three coincide, but in
`small-sample' applications this covers little more than confidence
intervals for normal population means, contrasts among such means, and
rank procedures about a center of symmetry. Of course, from a large-sample
perspective, a maximum likelihood estimate minus parameter, standardized
by a standard error estimate, is such a pivot, and this covers many
applications.
We review the pros and cons of the three competitors, largely in the
context of confidence intervals for the variance when sampling from a
normal population, and similarly for variance ratios of analysis of
variance. However, our motivation is for dealing with confidence intervals
for the hazard ratio after a sequential clinical trial: What kind of
interval should be preferred?
Your opinions will be invited....
Analysis of Chicago Ozone Data
1981-1991 Li-Shan Huang
Ozone concentrations are affected by precursor emissions and by
meteorological conditions. It is of interest to analyze trends in ozone
after adjusting for meteorological influences. We will discuss the
following 4 approaches to analyze Chicago Ozone data 1981-91:
- Nonlinear Regression, by Bloomfield, Royle, Steinberg and Yang
(1996)
- Logistic Models, by Smith and Huang (1993)
- Semi-parametric modeling, by Gao, Sacks and Welch (1996)
- Tree regression & empirical Bayes, by Huang and Smith (1999)
A Test for Equality of Ordered Inverse Gaussian
Means Lili Tian
The inverse Gaussian (IG) distribution, called the fraternal twin of
the Gaussian distribution, has been widely used in applied fields due to
the facts that it is ideally suited for modeling positively skewed data
and that its inference theory is well known to be analogous to that of the
Gaussian distribution in numerous ways. For example, Weiss (1982, 1983,
1984) demonstrated that the distribution of circulation times of drug
molecules through the body can be approximated by the IG distribution. We
propose a test procedure to assess trends in the IG response variable
(e.g., in animal toxicity studies). This approach, based on combining
independent tests using classical methods, can be easily extended to a
spectrum of order constraints. It is also shown that this procedure is
intriguingly analogous to that for the Gaussian distribution. The power
properties are examined by simulation.
Correlation Between Variables When Each Is Subject
to Sets of Exchangeable Measurements: An Approach Based on Group
Invariance Heng Li
An analytical procedure is developed for a type of data structure
suitable for modelling the situation in which multiple measurements are
made on each of a set of variables, and the measurements can be divided
into exchangeable subsets. The procedure is based on the pattern in
covariance matrix corresponding to the group invariance inherent in the
data structure, from which a closed-form expression of Gaussian likelihood
can be found. Sufficient statistics in the form of sums of squares and
cross products and their distributions are obtained, leading to methods of
statistical inference for a variety of practical purposes from correction
for attenuation to estimation of reliability coefficients. The closed-form
expression of the likelihood function is also helpful for implementing
likelihood-based computation, such as the EM algorithm for handling
missing data, and for Bayesian inference. The latter can be a very
effective tool in dealing with some inferential problems that do not have
standard solutions in the traditional framework. Examples include
guaranteeing the nonnegative definiteness of an estimated disattenuated
correlation matrix and combining information on association parameters
from a main study and a reliability, reproducibility, or repeatability
study. No originality is claimed and nothing presented will be beyond what
is intuitively obvious and/or what has already been in the literature,
although the procedure is readily adaptable for variations on the basic
structure. The main objective is to illustrate the application of group
invariance in modelling and analysis, which is the topic of almost all my
previous lunch presentations. The current presentation, however, involves
a data structure that has not been discussed in the previous
presentations.
On Kendall's Process and an Associated
Estimation Procedure David Oakes
If X is a continuous univariate random variable with distribution
function F(x) = pr(X <= x), then it is well known that F(X) and S(X) =
1 - F(X) are uniformly distributed. For a bivariate random variable (X,Y)
it is no longer true in general that F(X,Y) and the corresponding survivor
function S(X,Y) follow uniform distributions. For example if X and Y are
independent, S(X,Y) has the distribution of the product of two independent
uniform variables.
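The independent case can be checked numerically. The product of two independent uniform variables on (0,1) has distribution function v - v log v, which is not the uniform distribution function v. The sketch below compares this closed form with a simulation; the function names, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random


def product_uniform_cdf(v):
    """P(U1 * U2 <= v) for independent uniforms on (0, 1), with 0 < v <= 1."""
    return v - v * math.log(v)


def empirical_cdf_of_product(v, n=20000, seed=0):
    """Monte Carlo estimate of P(U1 * U2 <= v) from n simulated pairs."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() * rng.random() <= v)
    return hits / n
```

At v = 0.3 the closed form is roughly 0.66, well above the value 0.3 that a uniform variable would give.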
This talk will explore the use of a bivariate analog of the probability
integral transform in estimating the parameters governing the dependence
structure in a bivariate distribution. We will present and explain some
simulation results that at first sight seemed somewhat surprising.
(This is joint work with Antai Wang and will form the basis for his
upcoming qualifying paper)
Bootstrap variations: random
weighting Derick Peterson
A review of treatment allocation methods in
clinical trials Hongwei Zhao
Randomized-Withdrawal and Randomized-Start
Designs Jack Hall
Randomized-withdrawal and randomized-start designs have recently been
introduced in the neurological clinical trials literature as designs which
facilitate detection of long-term (`neuroprotective') effects as
distinguished from short-term (`symptomatic') effects of a treatment
relative to a placebo. Models and analyses for such designs will be
described, along with various advantages and limitations. Factorial
versions will also be considered.
Fall 2000 Biostatistics Brown Bag Seminar Abstracts
A Roughness-Penalty View of Kernel
Smoothing Li-Shan Huang
It has been shown that a smoothing spline estimate is an equivalent
kernel estimate. In this paper, we show that both the Nadaraya-Watson and
local linear kernel estimators are equivalent penalized estimators.
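For readers unfamiliar with the two estimators named above, here is a minimal sketch of each at a single point x0. It assumes a Gaussian kernel and a fixed bandwidth h, both chosen only for illustration, and it shows the estimators themselves rather than their penalized formulation.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the constant cancels in both estimators."""
    return math.exp(-0.5 * u * u)


def nadaraya_watson(x0, xs, ys, h):
    """Kernel-weighted average of ys around x0 (local constant fit)."""
    w = [gaussian_kernel((x - x0) / h) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def local_linear(x0, xs, ys, h):
    """Intercept of the kernel-weighted least-squares line centered at x0."""
    w = [gaussian_kernel((x - x0) / h) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    # Solve the 2x2 weighted normal equations for the intercept.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)
```

The local linear fit reproduces exactly linear data at any x0, whereas the Nadaraya-Watson fit is exact only for constant data.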
Algebraic Rationales for Some Statistical Procedures:
Possibilities for Unification and Generalization Heng
Li
Many common procedures in statistics have algebraic interpretations. We
will discuss a series of examples beginning with the most basic ones. It
will be shown how algebraic rules extracted from simple cases can be
applied to tackle some non-trivial problems. Possibilities for a general
framework will also be discussed.
A Simulation Study of Frailty Effects in Censored
Bivariate Survival Data Sara Eapen
Multivariate censored survival data typically have correlated failure
times. The correlation can be a consequence of the observational design,
for example with clustered sampling and matching, or it can be a focus of
interest as in genetic studies, longitudinal studies of recurrent events
and other studies involving multiple measurements. The correlation between
failure times can be accounted for by fixed or random effects. A
simulation study was designed to compare the performance of the mixture
likelihood approach to estimating the model with these frailty effects in
censored bivariate survival data. It is found that the mixture method is
surprisingly robust to misspecification of the frailty distribution.
Profile Likelihood and the EM-algorithm
David Oakes
A Review of the Case-Crossover Design &
Applications Jack Hall
The case-crossover design -- a case-control study in which the subject
serves as his own control -- was formally introduced by the epidemiologist
Malcolm Maclure in 1991. He described it as `a method for studying
transient effects on the risk of acute events'. The design will be
described and discussed in the context of several published applications
(including participation by Jamie Robins and Robert Tibshirani),
evaluating the questions: Are MI's more likely following (i) sexual
activity? (ii) coffee drinking? (iii) episodes of anger? Are auto
accidents more likely while using a cell phone?
Exploring Multivariate Data with Density Trees
Richard Raubertas
Classification trees are widely used as rules for assigning
observations to classes based on their attributes or features. A
classification tree is equivalent to a partition of the feature space into
rectangular regions, with a constant estimate of class probabilities in
each region. Density trees are proposed as a variation on this idea,
designed to examine the multivariate distribution of the features
themselves. A tree-structured approach is used to partition the feature
space into low- and high-density regions; that is, regions with especially
low or especially high numbers of observations relative to an arbitrary
reference distribution. This results in a nonparametric,
piecewise-constant estimate of the joint distribution of the features.
Because the regions are defined by simple inequalities on individual
features, density trees can provide a direct and interpretable description
of multivariate structure. In addition, they may be useful for identifying
regions where prediction models derived from the data are poorly supported
by observations.
Nonparametric regression for longitudinal data
Hongwei Zhao
My talk is motivated by an applied example where it is desirable to fit
a nonparametric regression model for data that were obtained
longitudinally. Even though the theory of nonparametric regression for
independent data has been well developed, there are still questions that
need to be answered before applying nonparametric methods to longitudinal
data. Simulations are conducted to compare some currently available methods
as well as some new ones. These methods are also applied to a real
example.
Generalized Nonlinear Regression
Christopher Cox
|