University of Rochester Medical Center

Spring 2009 Biostatistics Brown Bag Seminar Abstracts

Does Extirpation of the Primary Breast Tumor Give Boost To Growth of Metastases? Evidence Revealed By Mathematical Modeling
Leonid Hanin
Department of Mathematics, Idaho State University

A comprehensive mechanistic model of cancer natural history was developed to obtain an explicit formula for the distribution of volumes of detectable metastases in a given secondary site at any time post-diagnosis. This model provided a perfect fit to the volumes of n = 31 bone metastases observed in a breast cancer patient 8 years after primary diagnosis. Based on the model with optimal parameters the individual natural history of cancer for the patient was reconstructed. This gave definitive answers to the following three questions of major importance in clinical oncology: (1) How early an event is metastatic dissemination of breast cancer? (2) How long is the metastasis latency time? and (3) Does extirpation of the primary breast tumor accelerate the growth of metastases? Specifically, according to the model applied to the patient in question, (1) inception of the first metastasis occurred 29.5 years prior to the primary diagnosis; (2) the expected metastasis latency time was about 79.5 years; and (3) resection of the primary tumor was followed by a 32-fold increase in the rate of metastasis growth.

Friday, March 27, 2009

 

Likelihood Estimation of the Population Tree
Arindam RoyChoudhury, PhD
Postdoctoral Associate
Dept. of Biological Statistics and Computational Biology
Cornell University

The population tree, i.e., the evolutionary tree connecting various populations, has applications in various fields of biology and medical sciences. It can be estimated from genome-wide allele-count data. We will present a maximum-likelihood estimator of the tree based on a coalescent-theoretic setup.

Using coalescent theory, we keep track of the probability of the number of lineages at different time points in a given tree. We condition on the number of lineages to compute the probability of the observed allele counts. Computing these probabilities requires a sophisticated "pruning" algorithm. The algorithm computes arrays of probabilities at the root of the tree from the data at the tips of the tree. At the root, the arrays determine the likelihood. The arrays consist of probabilities related to the number of lineages and allele counts among those lineages. Our computation is exact and avoids time-consuming Monte Carlo methods.

Thursday, March 12, 2009

 

Trust me, I’m an academic statistician!
Professional ethics, conflict of interest, and JAMA policies in the reporting of randomized clinical trials
Michael P. McDermott, Ph.D.

The issue of conflict of interest and its potential impact on the integrity of scientific research has received an increasing amount of attention in the past two decades.  Conflicts of interest come in many forms, including financial, intellectual, and professional.  Policies have been adopted by academic institutions, governmental agencies, scientific journals, and other organizations to manage potential conflicts of interest.  Statisticians, being integral to the conduct of scientific research, are certainly not immune to these potential conflicts.

Much of the research in the discovery and development of pharmacological (and other) therapies is sponsored by the pharmaceutical industry.  It is clear that the trust that the public has in this research has eroded over time; there is a perceived lack of objectivity of the pharmaceutical industry in the conduct and reporting of this research.  One manifestation of this can be found in the following policy adopted by the Journal of the American Medical Association (JAMA) in 2005: For industry-sponsored studies, “an additional independent analysis of the data must be conducted by statisticians at an academic institution, such as a medical school, academic medical center, or government research institute” rather than only by statisticians employed by the company sponsoring the research.

This seminar will outline some of the background that led to the JAMA policy, the implications of the policy, and the reaction of members of the scientific community.  The broader issue of professional statistical ethics and the potential conflicts of interest that are faced by statisticians will also be discussed.

Thursday, February 26, 2009

 

Applications of Multivariate Hypothesis Testing in Gene Discovery
Anthony Almudevar, Ph.D.

A fundamental problem in genomic studies is the detection of differential expression among large sets of gene expression values produced by microarray data collected under varying experimental conditions. Although the problem is most naturally expressed as a sequence of hypothesis tests involving the expression distributions of individual genes, two points have recently been noted: 1) the individual expression distributions are characterized by statistical dependence induced by gene cooperation, and 2) information about which sets of genes form cooperative pathways is now widely available. This has led to an interest in multivariate tests involving vectors of gene expressions taken from gene sets known to have some form of functional relationship. While this leads to potentially greater power, as well as better interpretability, the development of suitable multivariate testing methods is still an important area of research. In this talk, I will discuss a number of approaches to this problem. In the first, statistical methods used to model gene pathways (primarily Bayesian networks) are adapted to hypothesis testing. In the second, we will consider how the theory of Neyman-Pearson tests can be adapted to the testing of complex hypotheses, following an earlier application to a problem in statistical genetics (Almudevar, 2001, Biometrics).

Thursday, February 5, 2009



Differential Equation Modeling of Infectious Diseases: Identifiability,
Parameter Estimation, Model Selection, and Computing Tools

Hongyu Miao, Ph.D.

Many biological processes and systems can be described by a set of differential equation (DE) models. However, the literature on statistical inference for DE models is very sparse. We propose identifiability analysis, statistical estimation, model selection, and multi-model averaging methods for biological problems such as HIV viral fitness and influenza infection that can be described by a set of nonlinear ordinary differential equations (ODEs). Related computing techniques have also been developed and are available as a few comprehensive software packages. We expect that the proposed modeling, inference approaches, and computing techniques for DE models can be widely used in a variety of biomedical studies.

Thursday, January 22, 2009


Spring 2008 Biostatistics Brown Bag Seminar Abstracts

In memory of Andrei: Multitype Branching Processes with Biological Applications
Nick Yanev, Ph.D.

The asymptotic behavior of multitype Markov branching processes with discrete or continuous time is investigated when both the initial number of ancestors and the time tend to infinity. Some limiting distributions are obtained and the asymptotic multivariate normality is proved in the positive regular and nonsingular case. The paper also considers the relative frequencies of distinct types of individuals (cells), a concept motivated by applications in the field of cell biology. We obtain non-random limits and multivariate asymptotic normality for the frequencies when the initial number of ancestors is large and the time is fixed or tends to infinity. When the time is fixed the results are valid for any branching process with a finite number of types; the only assumption required is that of independent individual evolutions. The reported limiting results are of special interest in cell kinetics studies where the relative frequencies, but not the absolute cell counts, are accessible to measurement. Relevant statistical applications are discussed in the context of asymptotic maximum likelihood inference for multitype branching processes.

Thursday, May 22, 2008
12:30 p.m.
Biostatistics Conference Room


Modeling Intrahost Sequence Evolution in HIV-1 Infection
Ha Youn Lee, Ph.D.

Quantifying the dynamics of intrahost HIV-1 sequence evolution is one means of uncovering information about the interaction between HIV-1 and the host immune system. In this talk, I will introduce a mathematical model and Monte Carlo simulation of viral evolution within an individual during HIV-1 infection that make it possible to explain the universal dynamics of sequence divergence and diversity, to classify new HIV-1 infections as originating from multiple versus single transmitted viral strains, and to estimate the time since the most recent common ancestor of a transmitted viral lineage.

In 13 out of 15 longitudinally followed patients (3-12 years), we found that the rate of intrahost HIV-1 evolution is not constant, but rather slows down at a rate correlated with the rate of CD4+ T cell count decline. We studied an HIV-1 sequence evolution model in which, for each sequence, we keep track of its distance from the founder strain and assign a fitness and a survival probability to mutations based on that distance.

The model suggests that the saturation of divergence and the decrease of diversity observed in the later stages of infection are attributable to a decrease in the probability that mutant strains survive as the distance from the founder strain increases, rather than to an increase in viral fitness. In the second part, I will talk about both synchronous and asynchronous models of the acute phase of HIV-1 evolution with a single-cycle reverse transcriptase error rate, average generation time, and basic reproductive ratio. These models were used to analyze 3,475 complete env sequences recently derived by single genome amplification from 102 subjects with acute HIV-1 (clade B) infection, distinguishing single-strain infections from multiple-variant infections and also identifying transmitted HIV-1 envelope genes.

Thursday, February 21, 2008
12:30 p.m.
Biostatistics Conference Room



Fall 2007 Biostatistics Brown Bag Seminar Abstracts

Creating an R Package Part I: It's EASY!
Gregory Warnes, Ph.D.

The open source statistical package R provides nice tools for bundling a set of functions and data together as an R package. Creating an R package from your R scripts helps to provide good documentation, and makes it much easier to share with others and to maintain your code for your own future use. This Brown Bag will demonstrate how to create an R package, and the advantages that come from doing so.
 

Adaptive Simon Two-Stage Design for Preliminary Test of Targeted Sub-Population
Qin Yu, Graduate Student

The trend towards specialized clinical development programs for targeted cancer therapies is growing fast, made possible by significant improvements in the molecular characterization of the biological pathways that foster tumor growth. The proposed two-stage design, an adaptation of Simon's two-stage design, allows for preliminary determination of efficacy in a particular sub-population defined by biomarker status. The advantage of adopting this two-stage design is shown via a real study.
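As background, the operating characteristics of the classical (non-adaptive) Simon two-stage design can be computed directly from binomial probabilities. The sketch below is not the adaptive design proposed in the talk; the function names and the design values in the usage lines are illustrative assumptions.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(i, n, p) for i in range(min(k, n) + 1))


def simon_two_stage(r1, n1, r, n, p):
    """Operating characteristics of a classical Simon two-stage design.

    Stage 1 enrolls n1 patients and stops for futility if responses <= r1.
    Otherwise n - n1 further patients are enrolled, and the treatment is
    declared promising if total responses exceed r.
    Returns (probability of early termination,
             probability of declaring the treatment promising,
             expected sample size) at true response rate p.
    """
    n2 = n - n1
    pet = binom_cdf(r1, n1, p)
    promising = sum(
        binom_pmf(x1, n1, p) * (1.0 - binom_cdf(r - x1, n2, p))
        for x1 in range(r1 + 1, n1 + 1)
    )
    expected_n = n1 + (1.0 - pet) * n2
    return pet, promising, expected_n


# Illustrative design values (assumed, not taken from the talk).
pet0, alpha, en0 = simon_two_stage(r1=1, n1=10, r=5, n=29, p=0.10)
_, power, _ = simon_two_stage(r1=1, n1=10, r=5, n=29, p=0.30)
print(round(pet0, 3), round(alpha, 3), round(power, 3), round(en0, 1))
```

Evaluating the function at the null response rate gives the type I error, and evaluating it at the alternative rate gives the power, so candidate designs can be screened by looping over `(r1, n1, r, n)`.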
 

Using Auxiliary Variables to Enhance Survival Analysis
Haiyan Su, Graduate Student

One of the primary problems facing statisticians who work with survival data is the loss of information that occurs with right-censored data. Markers, which are prognostic longitudinal variables, can be used to replace some of the information lost due to right-censoring because they are correlated with, and predictive of, the overall survival event. In oncology studies, disease progression status is measured at certain times and is correlated with survival. How to incorporate information on disease progression (the markers) in the analysis of survival to reduce the variance of the treatment effect estimator (e.g. the log hazard ratio in the Cox model) is an interesting and challenging question. In this work, we applied Mackenzie & Abrahamowicz's (MA) plug-in method, which writes the test statistic as a functional of the Kaplan-Meier estimators, and then replaced the latter with an efficient estimator of the survival curve that incorporates the information from markers. Possible choices of survival curve estimator are the Murray-Tsiatis (MT) method and the Finkelstein-Schoenfeld (FS) method. The resulting estimators can greatly improve efficiency provided that the marker is highly prognostic and that the frequency of censoring is high. MA's methodology is illustrated with an application to real time-to-event data using the MT survival curve estimator. We will also introduce the FS method with a real data example.


 

Approximate Iteration Algorithms
Anthony Almudevar, Ph.D.

In this talk I will summarize some work undertaken by the authors in the area of approximate iterations, ranging from basic theory to applications in control theory and numerical analysis. The relationship of these processes to some important medical applications will be reviewed. The talk will divide naturally into three sections.

1. Models of Approximate Iterative Processes. An iterative process is usually expressed as a normed space $V$ with some operator $T$, on which a sequence $v(k+1) = Tv(k), k \geq 0$ is generated, given some starting value $v(0)$. Ideally, this sequence converges to a fixed point $w = Tw$. In practice, the operator can only be evaluated approximately, so the iteration is more accurately written $v(k+1) = T_k v(k) = Tv(k) + u(k)$ where, alternatively, $T_k$ is the $k$th approximation of $T$, or $u(k)$ is the approximation error associated with the $k$th iteration. It is possible to show that if $T$ is contractive the approximate algorithm will converge to the fixed point, at a rate equivalent to $\max(r^k, |u(k)|)$, where $r$ is the contraction constant. The remaining work largely follows from this result.

2. Numerical Analysis. Many iterative algorithms rely on operators which may be difficult or impossible to evaluate exactly, but for which approximations are available. Furthermore, a graduated range of approximations may be constructed, inducing a functional relationship between computational complexity and approximation tolerance. In such a case, a reasonable strategy would be to vary tolerance over iterations, starting with a cruder approximation, then gradually decreasing tolerance as the solution is approached.

However, in such an algorithm, because the computational complexity increases over iterations, the convergence rate of the algorithm is more appropriately calculated with respect to cumulative computation time than to iteration number. This leaves open the problem of determining an optimal rate of change of approximation tolerance.

Our theory of approximate iterations may be used to show that, under general conditions, for linearly convergent algorithms the optimal choice of approximation tolerance convergence rate is the same linear convergence rate as the exact algorithm itself, regardless of the tolerance-complexity relationship. This result will be illustrated with several examples of Markov decision processes.

3. Adaptive Stochastic Decision Processes. A stochastic decision process is a random sequence whose distribution beyond a time $t$ can be determined by an action taken by an observer at time $t$, who has access to all process history up to that time. There is usually some reward criterion, so that the objective of the action is to maximize the expected value of the reward. If the process distribution under all possible action sequences is known then, at least in principle, the optimal action under any given history can be calculated, and so would be available to the observer as a control policy. Typically, these distributions are unknown, but may be estimated by the observer using process history. In this case, the observer needs to vary the actions sufficiently in order to estimate the model. This, however, conflicts with the goal of achieving the optimal expected reward, since this type of exploratory behavior will be suboptimal. An adaptive decision process is one which attempts to seek an optimal balance between exploratory behavior and seeking to maximize reward based on current model estimates. Our theory can be used to define, for Markov decision processes, an exploration rate, and then to show that the optimal exploration rate decreases in proportion to $t^{-1/3}$, resulting in a process in which regret (difference between optimal and achieved reward) converges to zero at a rate of $t^{-1/3}$, as distinct from a rate of $t^{-1/2}$ associated with estimation alone. The theory extends naturally to sequential clinical trials.
   This is joint work with Edilson F. Arruda and Jason LaCombe.
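The approximate iteration $v(k+1) = Tv(k) + u(k)$ described in item 1 above can be illustrated numerically. The following is a minimal sketch on the real line, assuming a simple affine contraction and a geometrically decaying error; the operator, the error sequence, and all names are illustrative choices, not taken from the talk.

```python
def approximate_iteration(T, v0, error, steps):
    """Run v(k+1) = T(v(k)) + u(k) for k = 0, ..., steps - 1.

    T     : the exact operator (assumed contractive)
    v0    : starting value v(0)
    error : function k -> u(k), the approximation error at iteration k
    Returns the list [v(0), v(1), ..., v(steps)].
    """
    path = [v0]
    v = v0
    for k in range(steps):
        v = T(v) + error(k)
        path.append(v)
    return path


# Illustrative contraction on the reals: T(v) = r*v + 1 with constant r = 0.5,
# whose fixed point is w = 1 / (1 - r) = 2.
r = 0.5
w = 1.0 / (1.0 - r)


def T(v):
    return r * v + 1.0


def u(k):
    # Assumed error sequence that shrinks faster than r**k.
    return 0.3 ** k


path = approximate_iteration(T, v0=0.0, error=u, steps=40)
for k in (0, 10, 20, 40):
    # Distance to the fixed point next to the rate max(r**k, |u(k)|).
    print(k, abs(path[k] - w), max(r ** k, abs(u(k))))
```

Replacing `u` with a slower-decaying sequence such as `0.9 ** k` makes the error term dominate the printed rate, which is the trade-off the tolerance schedule in item 2 is meant to balance.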

Thursday, October 4, 2007
12:30 PM
Biostatistics Conference Room



Spring 2007 Biostatistics Brown Bag Seminar Abstracts

Where Do We Stand in Microarray Data Analysis? Lessons of the Past and Hopes for the Future

Andrei Yakovlev, Ph.D.
University of Rochester

This presentation discusses numerous pitfalls in the analysis of microarray gene expression data. The current state of the art in this area is far from satisfactory. Many misconceptions still dominate the literature on microarray data analysis. An overview of the most common misconceptions will be given and some constructive alternatives will be proposed. In particular, I will present a new method designed to select differentially expressed genes in non-overlapping gene pairs. This method offers two distinct advantages: (1) it leads to dramatic gains in terms of the mean numbers of true and false discoveries, as well as in stability of the results of testing; (2) its outcomes are entirely free from the log-additive array-specific technical noise.

Thursday, May 17, 2007
11:30 AM
Room 2-6408 (K-207) Medical Center


Integrating Quantitative/Computational Sciences
for Biomedical Research
Hulin Wu, Ph.D.

Our Division (Division of Biomedical Modeling and Informatics) was formed two years ago. Since we moved to a remote location, our communication and interaction with our Department have not been as frequent as before. In this talk, I will give an overview of the research of our Division in order to promote more interactions and collaborations with other faculty and students in our Department. I will also share our experience of how to do 100% of our own research while also doing 100% collaboration and consulting. Some tips on how to find more time to do our "own" research will be given.

Our Division is formed to integrate quantitative (statistics, mathematics, engineering, physics etc.) and computational sciences (computer sciences and biomedical informatics) to do biomedical research. In this new era of high technologies, many new quantitative and computational sciences have evolved from various disciplines to become major tools for biomedical research. These include biostatistics, biomathematics, bioinformatics, biomedical informatics, computational biology, mathematical biology and theoretical biology, biophysics, bioengineering etc. This also brings a great opportunity for biometrical scientists to integrate the various quantitative/computational methodologies and techniques to support biomedical discoveries and research. Our Division, collaborating with biomedical investigators, is currently working on development of mathematical models, statistical methods, computer simulation systems, software packages, informatics tools and data management systems for HIV infections, AIDS clinical studies, influenza infections and immune response to infectious diseases. In this talk, I will discuss our experience of interactions and collaborations among biostatisticians, biomathematicians, biophysicists, bioengineers and biocomputing scientists as well as biomedical investigators. In particular, I will review the three components: (1) mathematical models for HIV viral fitness experiments, AIDS clinical biomarker data, immune response to influenza A virus infections; (2) statistical methods for biomedical dynamic (differential equation) models; (3) user-friendly computer simulation and estimation software. Finally I will discuss some challenges and opportunities for biometrical scientists in biomedical research.


Bayesian multiple outcomes models and the Seychelles data
Sally W. Thurston, Ph.D.

Understanding the relationship between prenatal mercury exposure and neurodevelopment in children is of great interest to many practitioners. Typically, analyses rely on separate models fit to each outcome. If the effect of exposure is very similar across outcomes, separate models lack power to detect a common exposure effect. Furthermore, the outcomes cluster into broad domains and domain-specific effects are also of interest. We fit a Bayesian model which allows the mercury effect to vary across outcomes, while allowing for shrinkage of these effects within domains, and to a lesser extent between domains. We will discuss the benefits and challenges of fitting this model within a Bayesian framework, and apply the model to multiple outcomes measured in children at 9 years of age in the Seychelles. This is work in progress, and is joint with David Ruppert at Cornell University.



An Introduction to Adobe Contribute and Blackboard Academic Suite

Chris Beck, Ph.D., and
Rebekka Cranmer, Senior Web Developer, Web Services Department

Learn how to create new web pages and edit existing ones with Adobe's Contribute. You will learn how to add images and text to a page, as well as edit images and create PDFs. Additionally, you will explore the page review and publishing features. Using Contribute, you will be able to easily create content and publish it to the URMC live Web server.

In the second half of this brown-bag seminar, the Blackboard Academic Suite will be introduced. Blackboard is a secure online course management tool that is used to facilitate learning objectives, assessment, and information exchange between instructors and students. It can also be used for secure information exchange within an organization or other group of people at the University of Rochester. A brief tutorial and demonstration of the software aimed at course instructors and organization leaders will be presented.

 

Fall 2006 Biostatistics Brown Bag Seminar Abstracts

Correlation Analysis for Longitudinal Data
Wan Tang, Ph.D.

Correlation analysis is widely used in biomedical and psychosocial research to evaluate quality of outcomes and to assess instrument and rater reliability.  For continuous outcomes, the product-moment correlation and the associated Pearson estimate are the most popular in applications.  Although asymptotic distributions of the Pearson estimates are available for multivariate outcomes, they only apply to complete data.  As longitudinal study designs become increasingly popular, missing data is commonplace in most trials and cohort studies.  In this talk, we propose new product-moment estimates to extend the Pearson estimates to address missing data within a longitudinal data setting.  We discuss non-parametric inference under both the missing completely at random (MCAR) and missing at random (MAR) assumptions.  Inference under MAR is quite complex in general and we consider several special cases that not only reduce the complexity but also apply to most real studies.  The approach is illustrated with real study data in psychosocial research. 

Bayesian Network as a Model of Biological Network
Peter Salzman, Ph.D.

A Bayesian network is a graphical representation of a multivariate distribution. This representation, applied to gene expression data, can be useful for understanding the direct and indirect interactions between genes and gene products (proteins). In this talk I'll address two issues related to Bayesian network models. The estimation/reconstruction of a network from data is a computationally intensive process, as the space of possible models is superexponential in the number of genes. In the first part of this talk I'll describe an algorithm that operates on the space of rankings, which is 'only' exponential in the number of genes.

In the second part of the talk I'll propose a procedure that tests whether a collection of genes loosely defined as a pathway is differentially expressed under two conditions. It is based on first reconstructing the network for each condition and then comparing the two networks. I'll present results for simulated and real biological data to demonstrate the applicability of the method.


Adverse Effects of Intergene Correlations in Microarray Data Analysis
Xing Qiu, Ph.D.

In the field of microarray data analysis, a common task is to find those genes that are differentially expressed in two groups of patients. Inter-gene stochastic dependence plays a critical role in the methods of such statistical inference. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify many methodologies that resort to pooling test statistics across genes. In this talk, I present two popular methods of this kind, namely the empirical Bayes methodology and a procedure introduced by Storey et al. which depends on the estimation of the false discovery rate. Then I provide some empirical evidence to demonstrate that these methods suffer considerably from such pooling, exhibiting high variability and lack of consistency.


Causal Comparisons in Randomized Trials of Two Active Treatments: The Effect of Supervised Exercise to Promote Smoking Cessation
Jason Roy, Ph.D.

In behavioral medicine trials, such as smoking cessation trials, two or more active treatments are often compared. Noncompliance by some subjects with their assigned treatment poses a challenge to the data analyst. Causal parameters of interest might include those defined by subpopulations based on their potential compliance status under each assignment, using the principal stratification framework (e.g., causal effect of new therapy compared to standard therapy among subjects that would comply with either intervention). Even if subjects in one arm do not have access to the other treatment(s), the causal effect of each treatment typically can only be identified from the outcome, randomization and compliance data within certain bounds. We propose to use additional information – compliance-predictive covariates – to help identify the causal effects. Our approach is to specify marginal compliance models conditional on covariates within each arm of the study. Parameters from these models can be identified from the data. We then link the two compliance models through an association model that depends on a parameter that is not identifiable, but has a meaningful interpretation; this parameter forms the basis for a sensitivity analysis. We demonstrate the benefit of utilizing covariate information in both a simulation study and in an analysis of data from a smoking cessation trial.

Spring 2006 Biostatistics Brown Bag Seminar Abstracts

 

A Nonparametric Model for Bivariate Distributions Based on Diagonal Copulas
Sungsub Choi, Ph.D.,
Department of Mathematics,
Pohang University of Science and Technology

A useful approach to constructing multivariate distributions is based on copula functions, and, in particular, Archimedean copulas have been in wide use. The talk will introduce a new class of copulas based on convex diagonal functions and explore their distributional properties. Several examples of parametric diagonal copulas will be given. We will then explore ways of extending them to construct multivariate proportional hazards models.


Motion Tracking in Wireless Networks Using Artificial Triangulation
Anthony Almudevar, Ph.D.

One important problem in the application of wireless networks is the location of a mobile node Tx based on the received signal strength (RSS) at a fixed configuration of receivers of a radio frequency signal transmitted by Tx. Because the RSS is inversely related to transmission distance, the distance of Tx from each receiver can be determined, and its location established by geometric triangulation, as long as at least three well spaced receivers are used.
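The geometric triangulation step can be sketched as follows for three receivers in the plane. The conversion from RSS to distance uses a log-distance path-loss model, which is an assumption made here for illustration (the abstract only says that RSS is inversely related to distance). All function names, parameter names, and default values are hypothetical.

```python
def rss_to_distance(rss_dbm, rss_at_1m=-40.0, path_loss_exponent=2.0):
    """Convert received signal strength (dBm) to distance (m).

    Assumes the log-distance path-loss model
        rss = rss_at_1m - 10 * n * log10(d),
    with illustrative defaults for rss_at_1m and the exponent n.
    """
    return 10.0 ** ((rss_at_1m - rss_dbm) / (10.0 * path_loss_exponent))


def trilaterate(receivers, distances):
    """Locate a transmitter from three receiver positions and distances.

    Subtracting the first circle equation from the other two removes the
    quadratic terms and leaves a 2x2 linear system, solved by Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = receivers
    d1, d2, d3 = distances
    a11, a12 = 2.0 * (x2 - x1), 2.0 * (y2 - y1)
    a21, a22 = 2.0 * (x3 - x1), 2.0 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 + y2**2 - x1**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 + y3**2 - x1**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("receivers are collinear; position is not identifiable")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Usage: receivers at three corners of a room, transmitter at about (3, 4).
receivers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
distances = [5.0, 65.0 ** 0.5, 45.0 ** 0.5]
print(trilaterate(receivers, distances))
```

The collinearity check reflects the "well spaced" requirement: when the three receivers lie on a line the linear system is singular and the position cannot be recovered.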

The use of such wireless networks provides a convenient method of collecting a longitudinal record of motion for patients susceptible to dementia. This can provide an objective method for the real-time monitoring of noncognitive symptoms of dementia such as restlessness, pacing, wandering, changes in sleep patterns, changes in circadian rhythm or specific changes in daily routine. However, the calibration of the RSS to transmission distance relationship is complicated by the presence of obstacles, particularly in an indoor setting. The relationship depends strongly on the geometric configuration of walls and other large obstacles, the proximity of high voltage devices such as microwave ovens and televisions, as well as the orientation of any person wearing such a transmitter.

I will present as an interim solution a method of mapping of RSS measurements onto a two dimensional plane which preserves the topological and directional properties of any trajectory of Tx without requiring precise knowledge of the receiver configuration or the RSS to transmission distance relationship. The method works by imposing an artificial triangulation on suitably transformed RSS measurements. Such a representation will suffice to capture the essential features of patient motion. In particular, locations which are frequently occupied (favorite chair, kitchen, etc) can be identified with sufficient data, leading to the construction of a ‘living space network’ through an unsupervised learning process. The network can be later validated or annotated.

The methodology will be illustrated using data collected under a study funded by an Everyday Technologies for Alzheimer Care (ETAC) research grant from the Alzheimer's Association, using monitoring equipment provided by Home Free Systems and GE Global Research. This is joint work with Dr. Adrian Leibovici and the Center for Future Health, University of Rochester.


Testing Equality of Ordered Means in the General Linear Model
Michael McDermott, Ph.D.

Hypothesis testing problems involving order constrained means arise frequently in practice. The standard approach to this problem in the one-way layout is the likelihood ratio test. In many practical settings, such as a randomized controlled trial, it is useful to include covariates in the primary statistical model. Likelihood ratio tests for equality of ordered means that incorporate covariate adjustment are quite complex and are rarely applied in practice because of difficulties in their implementation. In this paper, a test is proposed that is based on multiple contrasts among the adjusted group means. The p-values associated with these contrasts are, in general, dependent. An overall significance test is carried out using Fisher’s statistic to combine the dependent p-values arising from these contrasts; the null distribution of this statistic can be well approximated by that of a scaled chi-square random variable. The contrasts can be chosen to yield a test with high power, for alternatives at a fixed distance from the null hypothesis, throughout the restricted parameter space. The test is generally easy to implement for a variety of partial order restrictions. An example from a randomized clinical trial is used to illustrate the proposed test.
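For reference, Fisher's combination statistic is $-2\sum_i \log p_i$. The sketch below computes it together with the reference p-value that applies when the p-values are independent, in which case the statistic is chi-square with $2k$ degrees of freedom. This is only the independent baseline: the scaled chi-square approximation for dependent p-values used in the talk is not implemented, and the function names are illustrative.

```python
from math import exp, log


def fisher_statistic(p_values):
    """Fisher's combination statistic: -2 * sum(log(p_i))."""
    return -2.0 * sum(log(p) for p in p_values)


def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square variable with even df.

    For df = 2k the tail has the closed form
        exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total


def fisher_combined_p(p_values):
    """Return (statistic, combined p-value) assuming independent p-values."""
    stat = fisher_statistic(p_values)
    return stat, chi2_sf_even_df(stat, 2 * len(p_values))


# Usage with illustrative p-values from three contrasts.
stat, p_combined = fisher_combined_p([0.04, 0.10, 0.20])
print(stat, p_combined)
```

With dependent contrasts, the same statistic would be referred to a scaled chi-square distribution whose scale and degrees of freedom depend on the correlations among the p-values.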



Fall 2005 Biostatistics Brown Bag Seminar Abstracts



Rule-based Modeling of Signaling by Epidermal Growth Factor Receptor
Michael L. Blinov
Theoretical Biology and Biophysics Group,
Los Alamos National Laboratory, Los Alamos, NM

Signal transduction networks often exhibit combinatorial complexity: the number of protein complexes and modification states that potentially can be generated during the response to a signal is large, because signaling proteins contain multiple sites of modification and interact with multiple binding partners. The conventional approach of manually specifying each term of a mathematical model is impossible. To avoid this problem, modelers often make assumptions to limit the number of species, but these are usually poorly justified. As an alternative, we have developed an approach to represent biomolecular interactions as rules specifying activities, potential modifications and interactions of the domains of signaling molecules [Hlavacek et al. (2003) Biotech. Bioeng.] Rules are evaluated automatically to generate the reaction network. This approach is implemented in BioNetGen software [Blinov et al. (2004) Bioinformatics; Blinov et al. (in press) LNCS]. To illustrate this approach, we have developed a model of early events in signaling by the epidermal growth factor (EGF) receptor (EGFR), which includes EGF, EGFR, the adapter proteins Grb2 and Shc, and the guanine nucleotide exchange factor Sos [Blinov et al. (2005) BioSystems]. These events can potentially generate a diversity of protein complexes and phosphoforms; however, this diversity has been largely ignored in computational models of EGFR signaling. The model predicts the dynamics of 356 molecular species connected through 3,749 reactions. This model is compared with a previously developed model [Kholodenko et al. (1999) JBC] that incorporates the same protein-protein interactions but is based on several restrictive assumptions and thus includes only 18 molecular species involved in Sos activation. The new model is consistent with experimental data and yields new predictions without requiring new parameters. 
The model predicts distinct temporal patterns of phosphorylation for different tyrosines of EGFR, distinct reaction paths for Sos activation, a large number of distinct protein complexes at short times, and signaling by receptor monomers. Comparing the two models helps design experiments to test hypotheses; e.g., a genetic mutation that blocks Shc-dependent pathways helps to distinguish between competitive and non-competitive mechanisms of adapter protein binding.


Stochastic Curtailment in Multi-Armed Trials
Xiaomin He

Stochastically curtailed procedures in multi-armed trials are complicated due to repeated significance testing and multiple comparisons. From either the frequentist or the Bayesian viewpoint, there exists some dependence among pairwise test statistics. Investigators must consider such dependence when testing homogeneity of treatments. This paper studies a property of the canonical multivariate joint distribution of test statistics in multi-armed trials. Pairwise and global monitoring are suggested based on this property. In pairwise monitoring, the Hochberg step-up procedure is recommended to strongly control the overall significance level. In global monitoring, the conditional and predictive power are calculated based on current multivariate test statistics, which reflect the dependence among pairwise test statistics. Futility monitoring in multi-armed trials is also considered. Simulation results in multi-armed trials show that, compared with the traditional group sequential and non-sequential procedures, stochastic curtailment has advantages in sample size, time and cost. An example concerning a proposed study of Coenzyme Q$_{10}$ in early Parkinson's disease is given.
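
As a rough illustration of the Hochberg step-up procedure named in the abstract, the following minimal sketch applies it to a single set of pairwise p-values. It ignores the repeated (sequential) testing that the talk addresses, and the function name hochberg_reject and the example p-values are assumptions made here for illustration only.

```python
def hochberg_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Hochberg step-up: order the p-values p_(1) <= ... <= p_(m), find the
    largest rank k with p_(k) <= alpha / (m - k + 1), and reject the
    hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    # Step up from the largest p-value towards the smallest.
    for k in range(m, 0, -1):
        if p_values[order[k - 1]] <= alpha / (m - k + 1):
            n_reject = k
            break
    reject = [False] * m
    for i in order[:n_reject]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four pairwise comparisons.
    print(hochberg_reject([0.01, 0.04, 0.03, 0.5]))
```

Because the search starts at the largest p-value, a single comparison with p at most alpha at the top rank leads to rejection of every hypothesis, which is what distinguishes the step-up rule from the step-down Holm rule.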


Power Analysis for Correlations from Clustered Study Designs
Xin Tu

Power analysis constitutes an important component of modern clinical trials and research studies. Although a variety of methods and software packages are available, they are primarily focused on regression models, with little attention paid to correlation analysis. However, the latter is a simpler and more appropriate approach for modeling association between correlated variables that measure a common (latent) construct using different scales, different assessment methods and different raters, as arise in psychosocial and other health-care related research areas. A major difficulty in performing power analysis is how to deal with the excessive number of parameters in the distributions of the correlation estimates, many of which are nuisance parameters. In addition, as missing data patterns are unpredictable and dynamic before a study is realized, their effect must also be addressed when performing power analysis, which further complicates the analytic problems. With no real data to estimate the parameters and missing data patterns, as in most real study applications, it is difficult to proceed with estimation of power and sample size for correlation analysis for a real study. In this talk, we discuss how to eliminate nuisance parameters and model missing data patterns to effectively address these issues. We illustrate our approaches with both real and simulated data.

This is joint work with Paul Crits-Christoph (University of Pennsylvania), Changyong Feng (University of Rochester), Robert Gallop (University of Pennsylvania) and Jeanne Kowalski (Johns Hopkins University).


Branching Processes, Generation, and Applications
Ollivier Hyrien

I will first present results on the distribution of the generation in a Bellman-Harris branching process starting with a single cell. Approximate expressions for this distribution have been described in the literature, and I will present an exact expression. As an application, I will give an explicit expression for the distribution of the age in the considered setting. The results are illustrated using a Markov process.

The second part of my talk will focus on the statistical analysis of CFSE-labeling experiments, a bioassay frequently used by biologists to study cell proliferation. The data generated by this assay are dependent, a feature that has never been mentioned in the literature. The dependency structure is quite complex, making it impossible to use the method of maximum likelihood. I propose three estimation techniques, and present their asymptotic and finite sample properties. An application to T lymphocytes will also be given.


Similarity Searches in Genome-wide Numerical Data Sets
Galina Glazko
Stowers Institute for Medical Research

Many types of genomic data are naturally represented as multidimensional vectors. A frequent purpose of genome-scale data analysis is to uncover subsets of the data that are related by a similarity of some sort. One way to do this is by computing the distances between vectors. The major question here is how to choose the distance measure when several are available. First, we consider the problem of functional inference using phyletic patterns. Phyletic patterns denote presence and absence of orthologous genes in completely sequenced genomes, and are used to infer functional links, on the assumption that genes involved in the same pathway or functional system are co-inherited by the same set of genomes. I demonstrate that the use of an appropriate distance measure and clustering algorithm increases the sensitivity of the phyletic pattern method; however, the method itself has a limit of applicability caused by differential gains, losses, and displacements of orthologous genes. Second, we study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distance with different exponents. I show that descriptive statistics of the distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. In contrast, the more familiar distance properties, such as metric and additivity, appear to have much less effect on the performance of distances. Third, we discuss a new approach to local clustering based on iterative pattern matching and apply it to identify potential malaria vaccine candidates in the Plasmodium falciparum transcriptome.


Partially Linear Models and Related Topics
Hua Liang

In this brown-bag seminar I will present the state of the art of partially linear models, with a particular focus on several special topics such as error-prone covariates, missing observations, and checking of the nonlinear component. Extensions to more general models will be discussed. Applications of these projects in biology, economics, and nutrition will be mentioned. The talk covers a series of my publications in the Annals of Statistics, JASA, Statistica Sinica, and Statistical Methods in Medical Research, as well as more recent submissions.


 

Spring 2005 Biostatistics Brown Bag Seminar Abstracts

Estimating Incremental Cost-Effectiveness Ratios and Their Confidence Intervals with Differentially Censored Data
Hongkun Wang and Hongwei Zhao

With medical costs escalating over recent years, cost analysis is being conducted more and more often to assess the economic impact of new treatment options. An incremental cost-effectiveness ratio is a measure of the additional cost of a new treatment per year of life saved. In this talk, we consider cost-effectiveness analysis for new treatments evaluated in a randomized clinical trial setting with staggered entries. In particular, the censoring times are different for cost and survival data. We propose a method for estimating the incremental cost-effectiveness ratio and obtaining its confidence interval when differential censoring exists. Simulation experiments are conducted to evaluate our proposed method. We also apply our methods to a clinical trial example comparing the cost-effectiveness of implanted defibrillators with conventional therapy for individuals with reduced left ventricular function after myocardial infarction.
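
To make the definition of the incremental cost-effectiveness ratio concrete, here is a minimal sketch that computes it from complete (uncensored) data as the difference in mean cost divided by the difference in mean survival. It does not implement the censoring-adjusted estimator proposed in the talk; the function name icer and the toy numbers are assumptions for illustration.

```python
def icer(cost_new, cost_std, life_years_new, life_years_std):
    """Incremental cost-effectiveness ratio from complete data.

    ICER = (mean cost, new - mean cost, standard)
           / (mean life-years, new - mean life-years, standard).
    Censoring is ignored here, which is exactly the simplification
    the abstract says is not tenable in a real trial.
    """
    def mean(values):
        return sum(values) / len(values)

    delta_cost = mean(cost_new) - mean(cost_std)
    delta_effect = mean(life_years_new) - mean(life_years_std)
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the effect difference is zero")
    return delta_cost / delta_effect


if __name__ == "__main__":
    # Hypothetical per-patient costs (dollars) and survival times (years).
    ratio = icer([40000, 60000], [20000, 40000], [4.0, 6.0], [3.0, 5.0])
    print(ratio)  # additional cost per life-year gained
```

Since the estimate is a ratio of two differences, it becomes unstable when the survival difference is near zero, which is one reason confidence intervals for it need special care.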


Regression Analysis of ROC Curves and Surfaces
Christopher Beck

Receiver operating characteristic (ROC) curves are commonly used to describe the performance of a diagnostic test in terms of discriminating between healthy and diseased populations. A popular index of the discriminating ability or accuracy of the diagnostic test is the area under the ROC curve. When there are three or more populations, the concept of an ROC curve can be generalized to that of an ROC surface, with the volume under the ROC surface serving as an index of diagnostic accuracy. After introducing the basic concepts associated with ROC curves and surfaces, methods for assessing the effects of covariates on diagnostic test performance will be discussed. Examples from a recent study organized by the Agency for Toxic Substances and Disease Registry (and conducted here in Rochester) will be presented to illustrate these methods.
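
The two accuracy indices mentioned above have simple empirical versions, sketched below without any covariate adjustment. The area under the ROC curve is estimated by the proportion of (healthy, diseased) pairs that the test orders correctly, and the volume under the ROC surface by the proportion of correctly ordered triples. The function names and sample values are assumptions for illustration, and higher test values are assumed to indicate more severe disease.

```python
def empirical_auc(healthy, diseased):
    """Proportion of (healthy, diseased) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                total += 1.0
            elif d == h:
                total += 0.5
    return total / (len(healthy) * len(diseased))


def empirical_vus(group1, group2, group3):
    """Proportion of triples (one value per group) with x1 < x2 < x3."""
    correct = sum(
        1
        for a in group1
        for b in group2
        for c in group3
        if a < b < c
    )
    return correct / (len(group1) * len(group2) * len(group3))


if __name__ == "__main__":
    # Hypothetical test results.
    print(empirical_auc([1, 2, 3], [2, 4, 5]))
    print(empirical_vus([1, 2], [3, 4], [5, 6]))
```

A useless test gives an area near 1/2 but a volume near 1/6, since only one of the six possible orderings of a triple is correct.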


Constructing Prognostic Gene Signatures for Cancer Survival
Derick Peterson

Modern micro-array technologies allow us to simultaneously measure the expressions of a huge number of genes, some of which are likely to be associated with cancer survival. While such gene expressions are unlikely to ever completely replace important clinical covariates, evidence is already beginning to mount that they can provide significant additional predictive information. The difficult task is to search among an enormous number of potential predictors and to correctly identify most of the important ones, without mistakenly identifying too many spurious associations. Many commonly used screening procedures unfortunately over-fit the training data, leading to subsets of selected genes that are unrelated to survival in the target population, despite appearing associated with the outcome in the particular sample of data used for subset selection. And some genes might only be useful when used in concert with certain other genes and/or with clinical covariates, yet most available screening methods are inherently univariate in nature, based only on the marginal associations between each predictor and the outcome. While it is impossible to simultaneously adjust for a huge number of predictors in an unconstrained way, we propose a method that offers a middle ground where some partial adjustments can be made in an adaptive way, regardless of the number of candidate predictors.


A New Test Statistic for Testing Two-Sample Hypotheses in Microarray Data Analysis
Yuanhui Xiao

We introduce a test statistic intended for use in nonparametric testing of the two-sample hypothesis with the aid of resampling techniques. This statistic is constructed as an empirical counterpart of a certain distance measure N between the distributions F and G from which the samples under study are drawn. The distance measure N can be shown to be a probability metric. In two-sample comparisons, the null hypothesis F = G is formulated as H0 : N = 0. In a computer experiment, where gene expressions were generated from a log-normal distribution, while departures from the null hypothesis were modeled via scale transformations, the permutation test based on the distance N appeared to be more powerful than the one based on the commonly used t-statistic. The proposed statistic is not distribution free so that the two-sample hypothesis F = G is composite, i.e., it is formulated as H0 : F(x) = H(x), G(x) = H(x) for all x and some H(x). The question of how the null distribution H should be modeled arises naturally in this situation. For the N-statistic, it can be shown that a specific resampling procedure (resampling analog of permutations) provides a rational way of modeling the null distribution. More specifically, this procedure mimics the sampling from a null distribution H which is, in some sense, the "least favorable" for rejection of the null hypothesis. No statement of such generality can be made for the t-statistic. The usefulness of the proposed statistic is illustrated with an application to experimental data generated to identify genes involved in the response of cultured cells to oncogenic mutations.
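
The general shape of such a resampling test can be sketched as follows. The abstract does not give the form of N, so this sketch assumes an empirical distance built from mean absolute differences, 2 E|X - Y| - E|X - X'| - E|Y - Y'|, and uses plain random permutations rather than the specific resampling analog described in the talk. All function names, the seed, and the data are hypothetical.

```python
import random


def mean_abs_diff(a, b):
    """Average of |x - y| over all pairs (x in a, y in b)."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def distance_stat(x, y):
    """Assumed empirical distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|.

    Within-sample averages include the zero diagonal terms.
    """
    return 2 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)


def permutation_p_value(x, y, stat=distance_stat, n_perm=999, seed=0):
    """Permutation p-value: share of relabelings with stat >= observed."""
    rng = random.Random(seed)
    observed = stat(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the p-value above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical expression values for one gene in two groups.
    group_a = [0.1, 0.4, 0.2, 0.3, 0.5, 0.2]
    group_b = [0.9, 1.4, 0.2, 2.3, 0.1, 1.8]
    print(permutation_p_value(group_a, group_b))
```

Unlike the t-statistic, a statistic of this type responds to differences in scale as well as location, which matches the scale-transformation alternatives used in the computer experiment.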


The Effects of Normalization on the Correlation Structure of Microarray Data
Xing Qiu, Andrew I. Brooks, Lev Klebanov, and Andrei Yakovlev

Stochastic dependence between gene expression levels in microarray data is of critical importance for the methods of statistical inference that resort to pooling test statistics across genes. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify the proposed methods of testing for differentially expressed genes. A potential impact of between-gene correlations on the performance of such methods has yet to be explored. We present a systematic study of correlation between the t-statistics associated with different genes. We report the effects of four different normalization methods using a large set of microarray data on childhood leukemia in addition to several sets of simulated data. Our findings help decipher the correlation structure of microarray data before and after the application of normalization procedures. A long-range correlation in microarray data manifests itself in thousands of genes that are heavily correlated with a given gene in terms of the associated t-statistics. The application of normalization methods may significantly reduce correlation between the t-statistics computed for different genes. However, such procedures are unable to completely remove correlation between the test statistics. The long-range correlation structure also persists in normalized data.


Estimating Complexity in Bayesian Networks
Peter Salzman

Bayesian networks are commonly used to model complex genetic interaction graphs in which genes are represented by nodes and interactions by directed edges. Although a likelihood function is usually well defined, the maximum likelihood approach favors networks with high model complexity. To overcome this we propose a two-step algorithm to learn the network structure. First, we estimate model complexity. This requires finding the MLE conditional on model complexity and then using Bayesian updating, resulting in an informative prior density on complexity. This is accomplished using simulated annealing to solve a constrained optimization problem on the graph space. In the second step we use an MCMC algorithm to construct a posterior density of gene graphs which incorporates the information obtained in the first step. Our approach is illustrated by an example.


A New Approach to Testing for Sufficient Follow-up in Cure-Rate Analysis
Lev Klebanov and Andrei Yakovlev

The problem of sufficient follow-up arises naturally in the context of cure rate estimation. This problem was brought to the fore by Maller and Zhou (1992, 1994) in an effort to develop nonparametric statistical inference based on a binary mixture model. The authors proposed a statistical test to help practitioners decide whether or not the period of observation has been long enough for this inference to be theoretically sound. The test is inextricably entwined with estimation of the cure probability by the Kaplan-Meier estimator at the point of last observation. While intuitively compelling, the test by Maller and Zhou does not provide a satisfactory solution to the problem because of its unstable and non-monotonic behavior when the duration of follow-up increases. The present paper introduces an alternative concept of sufficient follow-up allowing derivation of a lower bound for the expected proportion of immune subjects in a wide class of cure models. By building on the proposed bound, a new statistical test is designed to address the issue of the presence of immunes in the study population. The usefulness of the proposed approach is illustrated with an application to survival data on breast cancer patients identified through the NCI Surveillance, Epidemiology and End Results Database.


Assessment of Diagnostic Tests in the Presence of Verification Bias
Michael McDermott

Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in a random sample from the intended population to which the test will be applied. In many studies, however, verification of the true disease status is performed only for a subset of the sample. This may be the case, for example, if ascertainment of the true disease status is invasive or costly. Often, verification of the true disease status depends on the result of the diagnostic test and possibly other characteristics of the subject (e.g., only subjects judged to be at higher risk of having the disease are verified). If sensitivity and specificity are estimated using only the information from the subset of subjects for whom both the test result and the true disease status have been ascertained, these estimates will typically be biased. This talk will review some methods for dealing with the problem of verification bias. Some new approaches to the problem will also be introduced.
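
The size of the bias can be seen in a small numerical sketch. It compares the naive estimates from verified subjects with one classical correction, usually attributed to Begg and Greenes (1983), which assumes that verification depends only on the test result. The abstract does not say which methods the talk reviews, so this choice, the function names, and the counts are assumptions for illustration.

```python
def naive_accuracy(v_pos_d, v_pos_nd, v_neg_d, v_neg_nd):
    """Sensitivity and specificity using verified subjects only.

    v_pos_d:  verified, test positive, diseased
    v_pos_nd: verified, test positive, not diseased
    v_neg_d:  verified, test negative, diseased
    v_neg_nd: verified, test negative, not diseased
    """
    sensitivity = v_pos_d / (v_pos_d + v_neg_d)
    specificity = v_neg_nd / (v_neg_nd + v_pos_nd)
    return sensitivity, specificity


def corrected_accuracy(n_pos, n_neg, v_pos_d, v_pos_nd, v_neg_d, v_neg_nd):
    """Correction assuming verification depends only on the test result.

    n_pos and n_neg are the numbers of test positives and negatives in the
    full sample (verified or not). P(D | T) is estimated from the verified
    subjects and combined with P(T) from everyone via Bayes' rule.
    """
    n = n_pos + n_neg
    p_pos, p_neg = n_pos / n, n_neg / n
    p_d_given_pos = v_pos_d / (v_pos_d + v_pos_nd)
    p_d_given_neg = v_neg_d / (v_neg_d + v_neg_nd)
    sensitivity = (p_d_given_pos * p_pos) / (
        p_d_given_pos * p_pos + p_d_given_neg * p_neg
    )
    specificity = ((1 - p_d_given_neg) * p_neg) / (
        (1 - p_d_given_neg) * p_neg + (1 - p_d_given_pos) * p_pos
    )
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical study: 200 test positives and 800 test negatives,
    # with 100 subjects verified in each group.
    print(naive_accuracy(80, 20, 10, 90))
    print(corrected_accuracy(200, 800, 80, 20, 10, 90))
```

In this hypothetical study, test positives are verified four times as often as test negatives, so the naive analysis overstates sensitivity and understates specificity.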


Estimation of Causal Treatment Effects from Randomized Trials with Varying Levels of Non-Compliance
Jason Roy

Data from randomized trials with non-compliance are often analyzed with an intention-to-treat (ITT) approach. However, while ITT estimates may be of interest to policy-makers, estimates of causal treatment effects may be of more interest to clinicians. For the simple situation where treatment and compliance are binary (yes/no), instrumental variable (IV) methods can be used to estimate the average causal effect of treatment among those that would comply with treatment assignment. When there are more than two compliance levels (e.g., non-compliance, partial compliance, full compliance), however, these IV methods cannot identify the compliance-level causal effects without strong assumptions. We consider likelihood-based methods for dealing with this problem. The research was motivated by a study of the effectiveness of a disease self-management program in reducing health care utilization among older women with heart disease. This is work-in-progress.
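
For the binary case described above, the standard instrumental variable (Wald) estimator divides the intention-to-treat effect on the outcome by the intention-to-treat effect on treatment receipt. The sketch below assumes randomized assignment, no effect of assignment on the outcome except through treatment, and no subjects who do the opposite of their assignment. It does not cover the multi-level compliance setting that the talk addresses, and the names and data are hypothetical.

```python
def wald_iv_estimate(y_assigned, y_control, d_assigned, d_control):
    """IV estimate of the average causal effect among compliers.

    y_*: outcomes in the arm assigned to treatment / control.
    d_*: 0/1 indicators of treatment actually received in each arm.
    """
    def mean(values):
        return sum(values) / len(values)

    itt_outcome = mean(y_assigned) - mean(y_control)   # ITT effect on outcome
    itt_receipt = mean(d_assigned) - mean(d_control)   # share of compliers
    if itt_receipt == 0:
        raise ValueError("assignment has no effect on treatment received")
    return itt_outcome / itt_receipt


if __name__ == "__main__":
    # Hypothetical trial: half of the assigned arm takes the treatment,
    # nobody in the control arm does.
    effect = wald_iv_estimate(
        y_assigned=[12, 8, 10, 10],
        y_control=[8, 8, 8, 8],
        d_assigned=[1, 1, 0, 0],
        d_control=[0, 0, 0, 0],
    )
    print(effect)
```

With only half of the assigned arm complying, the intention-to-treat difference of 2 is scaled up to a complier effect of 4.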


Statistical Inference for Branching Processes
Nikolay Yanev

It is well known that branching processes have many applications in biology. In this talk the asymptotic behavior of branching populations having an increasing and random number of ancestors is investigated. An estimation theory will be developed for the mean, variance and offspring distributions of the process $\{Z_{t}(n)\}$ with random number of ancestors $Z_{0}(n)$, as both $n$ (and thus $Z_{0}(n)$, in some sense) and $t$ approach infinity. Nonparametric estimators are proposed and shown to be consistent and asymptotically normal. Some censored estimators are also considered. It is shown that all results can be transferred to branching processes with immigration, under an appropriate sampling scheme. A system for simulation and estimation of branching processes will be demonstrated.

No preliminary knowledge in this field is assumed.
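
For readers new to the field, the following minimal sketch shows the simplest version of the estimation problem: a discrete-time Galton-Watson process is simulated from Z_0 ancestors, and the offspring mean is estimated by the classical ratio of total children to total parents. The talk's setting (a random number of ancestors, censoring, immigration) is more general; the function names, the offspring distribution, and the seed are assumptions for illustration.

```python
import random


def simulate_generation_sizes(z0, offspring_probs, n_generations, seed=0):
    """Simulate generation sizes Z_0, ..., Z_t of a Galton-Watson process.

    offspring_probs[k] is the probability that an individual has k children.
    """
    rng = random.Random(seed)
    counts = list(range(len(offspring_probs)))
    sizes = [z0]
    for _ in range(n_generations):
        current = sizes[-1]
        if current == 0:
            sizes.append(0)  # extinct: all later generations are empty
        else:
            children = rng.choices(counts, weights=offspring_probs, k=current)
            sizes.append(sum(children))
    return sizes


def offspring_mean_estimate(sizes):
    """Ratio estimator: (Z_1 + ... + Z_t) / (Z_0 + ... + Z_{t-1})."""
    parents = sum(sizes[:-1])
    children = sum(sizes[1:])
    if parents == 0:
        raise ValueError("no parents observed")
    return children / parents


if __name__ == "__main__":
    # Hypothetical offspring law with mean 0*0.25 + 1*0.25 + 2*0.5 = 1.25.
    sizes = simulate_generation_sizes(50, [0.25, 0.25, 0.5], 5)
    print(sizes, offspring_mean_estimate(sizes))
```

The estimate improves as either the number of ancestors or the number of observed generations grows, which mirrors the double limit in n and t studied in the talk.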


Modeling of Stochastic Periodicity: Renewal, Regenerative and Branching Processes
Nikolay Yanev
Department of Probability and Statistics, Chair,
Institute of Mathematics and Informatics,
Bulgarian Academy of Sciences,
SOFIA, BULGARIA

In deterministic processes, periodicity is usually well defined. In the stochastic case, however, there are many possible models. One way to study stochastic periodicity is proposed in this lecture. The models are based on alternating renewal and regenerative processes. The limiting behavior is investigated, with special attention given to the case of periods of regeneration with infinite mean. Two applications to branching processes are considered: Bellman-Harris branching processes with state-dependent immigration and discrete-time branching processes with random migration.

The main purpose of the talk is to describe stochastic models which can be applied in Biology, especially Epidemiology and Biotechnology.

No preliminary knowledge in this field is assumed.


Testing Approximate Statistical Hypotheses
Y. N. Tyurin
Moscow State University

Statistical hypotheses often take the form of statements about some properties of functionals of probability distributions. Usually, according to a hypothesis the functionals in question have certain exact values. Many of the classical statistical hypotheses are of this form: the hypothesis about the mathematical expectation of a normal sample (one-dimensional or multidimensional); the hypothesis about probabilities of outcomes in independent trials (which should be tested based on observed frequencies); the linear hypotheses in Gaussian linear models, etc.

Stated as suppositions about exact values those hypotheses do not express accurately the thinking of natural scientists. In practice an applied scientist would be satisfied if those or similar suppositions were "correct" in some approximate sense (meaning their approximate agreement with statistical data).

The above-mentioned discrepancy between the applied-science approach and its mathematical expression leads to rejection of any statistical hypothesis given a sufficiently large amount of sample data, a well-known statistical phenomenon.

This talk will show how hypotheses about exact values can be re-stated as rigorously formulated approximate hypotheses and how those can be tested against sample data with special attention given to the hypotheses mentioned above.


 

Fall 2004 Biostatistics Brown Bag Seminar Abstracts

A Bayesian Analysis of Multiple Hypothesis Tests
Anthony Almudevar

A Bayesian methodology is proposed for the problem of multiple hypothesis tests for a given effect. The density of test statistics is modelled as a mixture based on hypothesis status. A full posterior measure is constructed for the mixture conditional on the observable total density. Commonly used quantities such as false discovery rates and posterior probabilities of hypothesis status can be directly calculated from the mixture, and so full posterior measures for these quantities can be directly obtained. The posterior measure is computed by sampling from a Monte Carlo Markov chain. This approach proves to be very flexible, allowing a model for the magnitude of the effects, as well as for dependence structure, to be developed and incorporated into the posterior measure. In addition, this approach is ideally suited to the situation in which the presence of large numbers of marginal, or weak, effects complicates any attempt to estimate the hypothesis mixture. In this case, a simple redefinition of the null hypothesis is proposed which makes the mixture estimation well defined and feasible.


Analysis of Variance, Coefficient of Determination, and Approximate F-tests for Local Polynomial Regression
Li-Shan Huang

In this paper, we develop analogues of the ANOVA inference tools for nonparametric local polynomial regression in the simple case of bivariate data. The results include: (i) a local exact ANOVA decomposition, (ii) a local R-squared, (iii) a global ANOVA decomposition, (iv) a global R-squared, (v) an asymptotically idempotent projection matrix, (vi) degrees of freedom, and (vii) approximate $F$-tests. We also provide some interesting geometric views of why a local exact ANOVA decomposition holds. The work here is different from earlier developments by other investigators. This is joint work with Jianwei Chen in the department.


On the Role of Copula Models in Survival Analysis
David Oakes

An important class of models in bivariate survival analysis consists of the so-called frailty models, which arise from the introduction of a common unobserved random proportionality factor into the hazard functions of the two related survival times. This assumption leads to a simple copula representation of the joint survivor function. The class of such models will be described and various characterization results presented. Some extensions will be discussed. Methods for parametric and semiparametric inference about the parameters governing the marginal distributions and the association structure will be surveyed briefly.


Cost-Effectiveness Studies Associated with Clinical Trials - Projecting Effects Beyond the Range of the Data
Jack Hall

I start with a quick overview of the MADIT-II clinical trial and of the associated cost-effectiveness study, including a general overview of cost-effectiveness studies. I then review the need for projecting results beyond the limited (3.5 years) span of the available data and the associated difficulties [and fool-hardiness???]. Finally, I talk about the life-table method we developed for use in the MADIT-II cost-effectiveness study to project survival experience beyond the time span of the data.

[Hongwei Zhao, Hongkun Wang and Hongyue Wang contributed to the MADIT-II cost-effectiveness study, as well as five people from the Community and Preventive Medicine Department and two in the Heart Research Group. A manuscript has just been submitted for publication (with these eleven authors).]


Spring 2004 Biostatistics Brown Bag Seminar Abstracts

Quick and Easy Solutions for Dealing with Data: Part 1
Arthur Watts

1: Ask a Programmer.
We have an excellent group of programmers with vast experience in data management and analysis. Although our bread and butter has been clinical trials, we have provided a variety of services over the years, including safety monitoring, simulations, double imputation, and many other customized statistical applications. We always check the data for errors, to minimize the possibility of having to re-run the analysis, or worse yet, re-publish the results. In addition to indexed notebooks, final results are often delivered on a CD containing the database, PowerPoint summary, Microsoft Word reports and a browsable html document. Our web based services have included on-line surveys, randomization, data collection and study monitoring. Before you start a new analysis, stop by and see us. We can save you a lot of time.
2: Visit our Web site.
In an effort to make clinical databases easier to analyze, we have developed an extensive library of flexible procedures that create a variety of statistical reports. Since investigators want to look at many outcome measures, these procedures operate on lists of variables, looping through each variable to run the analysis. Eliminating much of the tedious programming usually required to analyze clinical databases, these procedures can save hours of programming time. "Let the computer do your work for you."


Conditional Inference Methods for Incomplete Poisson Data With Endogenous Time-Varying Covariates
Jason Roy

We investigate the effect of protease inhibitors (PIs) on the rate of emergency room (ER) visits among HIV-infected women from a longitudinal cohort study. One strategy to account for serial correlation in longitudinal studies is to assume observations are independent, conditional on unit-specific nuisance parameters. It is possible to estimate these models using unconditional maximum likelihood, where the nuisance parameters are assigned a parametric distribution and integrated out of the likelihood. Alternatively, we can proceed using conditional inference, where we eliminate the nuisance parameters from the likelihood by conditioning on a sufficient statistic for these parameters. An advantage of conditional inference methods over parametric random effects models is that all patient-level time-invariant factors (both measured and unmeasured) are accounted for in the analysis. A limitation is that standard conditional inference methods assume missing data are missing completely at random and do not allow endogenous time-varying covariates (i.e., ER visits in the past cannot predict future PI use). Both assumptions are unlikely to be met for these data, because one would expect 'sicker' patients to be more likely to receive treatment and/or drop out from the study. We develop new estimation strategies that allow endogenous time-varying covariates and missing at random dropouts. The analysis shows that PI use reduces the rate of ER visits among patients whose CD4 cell count was <200 cells/µL at baseline. The size of the effect is substantially smaller than that estimated using a random effects approach.


On the Density of the Solution to a Random System of Equations
Anthony Almudevar



Paradoxical Association of a Group of Atherosclerosis-related Genotypes with Reduced Rate of Coronary Events After Myocardial Infarction
David Oakes


Local Polynomial Density Estimation With Interval Censored Data
Derick R. Peterson and Mark J. van der Laan

A survival time is interval censored if only its current status, an indicator of whether the event has occurred, is observed at a possibly random number of monitoring times. We provide estimators with pointwise confidence limits for all derivatives of the distribution of the time till event, assuming that the observed monitoring times are independent of the time of interest. Our estimator is a standard local polynomial regression smoother applied to the pooled sample of dependent current status observations. We show that the proposed estimator has a normal limiting distribution identical to that of a smoother applied to independent current status observations. Thus local bandwidth selection techniques and pointwise confidence limit procedures for standard nonparametric regression perform properly, despite the dependence in the pooled sample.


Pre-limit Theorems and Their Applications
Lev Klebanov

Finitely many empirical observations can never justify any tail behavior, thus they cannot justify the applicability of classical limit theorems in probability theory. In this paper we attempt to show that instead of relying on limit theorems, one may use the so-called pre-limit theorems explained later. The applicability of our pre-limit theorem relies not on the tail but on the 'central section' ('body') of the distributions and as a result, instead of a limiting behavior (when $n$, the number of i.i.d. observations, tends to infinity), the pre-limit theorem should provide an approximation for distribution functions in the case where $n$ is 'large' but not too 'large'. Our pre-limiting approach seems to be more realistic for practical applications.


p-Values-Only-Based Stepwise Procedures for Multiple Testing and Their Optimality Properties
Alexander Gordon



Modeling Cancer Screening: Further Thoughts and Results
Andrei Yakovlev

Over the years, many large-scale randomized trials have been conducted to evaluate the effects of breast cancer screening. These trials have failed to provide conclusive evidence for significant survival benefits of mammographic screening because of certain pitfalls in their design and lack of statistical power. However, such studies represent a rich source of information on the natural history of breast cancer, thereby opening up the way to evaluate potential benefits of breast cancer screening through using realistic mathematical models of cancer development and detection. We propose a biologically motivated model of breast cancer development and detection allowing for arbitrary screening schedules and the effects of clinical covariates recorded at the time of diagnosis on post-treatment survival. Biologically meaningful parameters of the model are estimated by the method of maximum likelihood from the data on age and tumor size at detection that resulted from two randomized trials known as the Canadian National Breast Screening Studies. When properly calibrated, the model provides a good description of the U.S. national trends in breast cancer incidence and mortality. The model was validated by predicting (without any further calibration or tuning) certain quantitative characteristics obtained from the SEER data. In particular, the model provides an excellent prediction of the size-specific age-adjusted incidence of invasive breast cancer as a function of calendar time for the period 1975-1999. Predictive properties of the model are also illustrated with an application to the dynamics of age-specific incidence and stage-specific age-adjusted incidence over the period 1975-1999.


Iterated Birth and Death Markov Process and its Biological Applications
Leonid Hanin

We solve, under realistic biological assumptions, the following long-standing problem in radiation biology: to find the distribution of the number of clonogenic tumor cells surviving a given arbitrary schedule of fractionated radiation. Mathematically, this leads to the problem of computing the distribution of the state N(t) of an iterated birth and death Markov process at any time t counted from the end of exposure. We show that the distribution of the random variable N(t) belongs to the class of generalized negative binomial distributions, find an explicit computationally feasible formula for this distribution, and identify its limiting forms. In particular, for t = 0, the limiting distribution turns out to be Poisson, and an estimate of the rate of convergence in the total variation metric that generalizes the classical Law of Rare Events is obtained.

Statistical Methods of Translating Microarray Data into Clinically Relevant Diagnostic Information in Colorectal Cancer
Byung Soo Kim

The aim of the study is twofold. First, we identify a set of differentially expressed (DE) genes in colorectal cancer, compared with normal colorectal tissues, to rank genes for the development of biomarkers for population screening of colorectal cancer. Second, we detect a set of DE genes for subtypes of colorectal cancer which can be classified with respect to stage, location and carcino-embryonic antigen (CEA) level. The cancer and normal tissues were obtained from 87 colorectal cancer patients who underwent surgery at Severance Hospital, Yonsei Cancer Center, Yonsei University College of Medicine, from May to December of 2002. We originally attempted to extract total RNAs from tumor and normal tissues from 87 patients. From each of 36 patients we had RNA specimens both for tumor and normal tissues. However, from 19 (32) patients RNA specimens for normal tissues (tumor) only were available. Thus, we have a matched pair sample of size 36 and two independent samples of sizes 19 and 32. We conducted a cDNA microarray experiment using a common reference design with 17K human cDNA microarrays. We pooled eleven cancer cell lines from various origins and used the pool as the common reference. We used M=log2(R/G) for the evaluation of relative intensity. As a means of utilizing the whole data set we first use the matched pair data set as a training set from which we detect a set of DE genes between the normal tissue and the tumor. Then we use the pool of two independent data sets of "tumor only" and "normal only" as the test set for the validation. We employ four procedures for detecting a set of DE genes from the matched pair sample of size 36: the paired t test with Dudoit et al.'s maxT procedure; Tusher et al.'s SAM procedure; Lönnstedt and Speed's empirical Bayes procedure; and Hotelling's T^2 statistic. We employ diagonal quadratic discriminant analysis for the classification of the test set.
We modify standard methods for the data at hand and propose a t-based statistic, say t3, which combines the three data types for the detection of DE genes. We also extend Pepe et al.'s ROC approach of ranking genes for the purpose of biomarker development to our mixed data type (Pepe et al., 2003, Biometrics). We note that only a few genes are required to achieve 0% test error in discriminating the normal tissue from the colorectal cancer. For the subtype analyses, various approaches failed to identify DE genes with respect to colon cancer versus rectum cancer and stage B versus stage C. We employed a regression approach to detect a few genes which correlated well with CEA.


 

Fall 2003 Biostatistics Brown Bag Seminar Abstracts

Biomarker Measurement Error: A Bayesian Approach with Application to Lung Cancer
Sally W. Thurston

Molecular biologists have identified specific cellular changes, called biomarkers, which enable them to better understand the pathway from chemical exposure to initiation of some cancers. In lung cancer, one such biomarker is the number of DNA adducts in lung tissue. Adducts are formed from the binding of cigarette carcinogens to DNA, and this adduct formation plays a central role in lung cancer initiation from smoking.

The goal of this work is to incorporate knowledge of such underlying biological mechanisms into a useful statistical framework to improve cancer risk estimates. The model considers adducts in the blood to be a surrogate measure of lung adducts. Lung adducts can never be measured in controls. The model is developed on a subset of the data, a small portion of which has biomarker measurements, and is used to predict cancer risk for the remaining data which do not have biomarker measurements. These predictions are compared to those from a traditional model, and to observed case/control status. Although the biomarker model compares favorably with the traditional approach, model diagnostics suggest that better predictions could be made from an expanded model which allows for measurement error in lung adducts.

Functional Response Models and their Applications
Xin M. Tu

I will discuss a new class of semi-parametric (distribution-free) regression models with functional responses. This class of functional response models (FRM) generalizes the traditional regression models by defining the response variable as a function of several responses from multiple subjects. By using such multiple-subjects-based responses, the FRM integrates many popular non- and semi-parametric approaches within a unified modeling framework. For example, under the proposed framework, we can derive regression models to perform inferences for two-way contingency tables and to estimate variance components by identifying them as model parameters. The FRM also provides a theoretical platform for developing new models that address limitations of existing non- and semi-parametric models. For example, we can develop FRMs to generalize ANOVA so that we can compare not only the means but also the variances of multiple groups, and to derive and extend the Mann-Whitney-Wilcoxon (MWW) rank-based tests to more than two groups. For inferences, we discuss a novel approach that integrates U-statistic theory with generalized estimating equations. The talk is illustrated with examples from biomedical and psychosocial research.

Biomedical Modeling, Prediction and Simulation
Hulin Wu

In this brown-bag seminar, I am going to give a brief introduction to several on-going projects in my research group. Our research projects include
  1) Nonparametric smoothing/regression methods for longitudinal data with applications to long-term HIV dynamic modeling
  2) Mechanism-based modeling of longitudinal data with applications to AIDS treatment response modeling
      a) Hierarchical Bayesian approach
      b) Mixed-effects state-space model approach
  3) Nonlinear and time-varying coefficient state-space models and particle filter techniques with applications to SARS epidemics
  4) Clinical trial modeling and simulations

In summary, we are trying to combine the models and techniques from biomathematics, engineering, computer science and statistics to solve important biomedical problems. The multi-discipline feature of these projects will be further enhanced in the next several years. Currently the research faculty and postdoc fellows who are involved in these projects include Drs. Yangxin Huang, Jianwei Chen, Haihong Zhu and Dacheng Liu as well as other external collaborators.

A Discussion on Intent-To-Treat Principle for Blood Transfusion Trials
Hongwei Zhao



Spring 2003 Biostatistics Brown Bag Seminar Abstracts

Statistical Analysis of Skewed Data
Hongkun Wang and Hongwei Zhao

This talk is motivated by an example where the dependent variable has many zero values and a very skewed distribution, and the interest is in finding a relationship between several covariates and this variable. We will briefly examine some of the current literature dealing with this problem. We will also discuss the interpretation of the parameters for some of the proposed models. Finally, we will present the results of the data analysis of our example.

Inference on multi-type cell systems using clonal data and application to oligodendrocytes development in cell culture
Ollivier Hyrien


Fall 2002 Biostatistics Brown Bag Seminar Abstracts

Designing and Analyzing a Small Bernoulli-Trial Experiment, with Application to a Recent Cardiological Device Trial
Jack Hall

In the recent `WEARIT' trial, the success of a wearable defibrillator in preventing death from a heart attack in patients awaiting a heart transplant was evaluated. A trial design was called for that would meet certain requirements on error probabilities, that would make a decision -- for or against the device -- within a specified maximum number (n = 15) of heart attack incidents in a group of recruited patients, and hopefully would terminate after many fewer incidents. We will use this setting to review single and double sampling plans, curtailed sampling, and various other sequential sampling plans that might be used for such a trial, along with the associated methodology for inference about the implicit Bernoulli parameter -- the success rate in resuscitating patients after a heart attack. We present this in the context of the WEARIT trial. You may be surprised how many statistical issues arise in an inference problem associated with observation of a few Bernoulli trials!
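
As a hedged illustration of the kind of calculation such a design involves, the Python sketch below evaluates a single sampling plan with at most n = 15 trials, together with its curtailed version. The decision threshold (12 successes) and the success rates 0.5 and 0.9 are hypothetical values chosen for illustration only; they are not the WEARIT design parameters, and the function names are assumptions.

```python
from math import comb

def binomial_upper_tail(n, c, p):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def curtailed_plan(n, c, p):
    """Curtailed version of 'decide for the device if at least c of n succeed'.

    Sampling stops at the c-th success (decision for) or at the
    (n - c + 1)-th failure (decision against), whichever comes first.
    Returns (probability of deciding for, expected number of trials).
    """
    f = n - c + 1  # number of failures that rules out reaching c successes
    prob_for = 0.0
    expected_trials = 0.0
    for k in range(c, n + 1):  # the c-th success occurs on trial k
        prob = comb(k - 1, c - 1) * p**c * (1 - p)**(k - c)
        prob_for += prob
        expected_trials += k * prob
    for k in range(f, n + 1):  # the f-th failure occurs on trial k
        prob = comb(k - 1, f - 1) * (1 - p)**f * p**(k - f)
        expected_trials += k * prob
    return prob_for, expected_trials

# Hypothetical design values, for illustration only.
N_MAX, THRESHOLD = 15, 12
for rate in (0.5, 0.9):
    decide_for, mean_trials = curtailed_plan(N_MAX, THRESHOLD, rate)
    print(rate, round(binomial_upper_tail(N_MAX, THRESHOLD, rate), 4),
          round(decide_for, 4), round(mean_trials, 2))
```

In this sketch curtailment leaves the probability of each decision unchanged relative to the fixed-sample plan, while the expected number of trials falls below 15.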

Using Local Correlation in Kernel-Based Smoothers for Dependent Data
Derick Peterson

This is joint work with Hongwei Zhao and Sara Eapen.

Informative Prior Specification for Linear Regression Models using Parameter Decompositions
Sally Thurston

I will motivate this work by discussing a dataset for which the intended Bayesian analysis requires an informative prior, due to interactions for which the data likelihood has no direct information. I will then present a method of obtaining informative priors for a linear regression model, based on information elicited from a subject matter expert. This method relies on a decomposition, novel in the multivariate case, of regression coefficients, their covariance matrix, and the residual variance of the regression. The only quantities which the expert needs to specify are the population means, variances, and pairwise correlations. Finally, I will discuss how I used the information elicited from the expert to obtain a proper informative prior for this example. This is joint work with Joe Ibrahim and Susan Korrick.

Topology, DNA Topology and Some Probabilistic Models of Nucleic Acids
Eva Culakova

This is an informative talk based on already known results. The presentation was inspired by my effort to understand the book by A D Bates and A Maxwell "DNA Topology". First I will introduce a classical result about the "Hopf Map" in order to give my appreciation to the field of topology. Next I will give an example of a situation where topology can help to understand DNA recombination. At the end I will briefly introduce a probabilistic model that is used to determine whether two nucleic acid or protein sequences are related. This part is based on the book by Durbin, Eddy, Krogh and Mitchison "Biological Sequence Analysis".


Li-Shan Huang

I will discuss the paper Doksum, K., and Samarov, A. (1995), Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression, Annals of Statistics, 23, 1443-1473, and propose new ideas for a nonparametric coefficient of determination.

An Overview of Multiple Imputation
Michael McDermott


Spring 2002 Biostatistics Brown Bag Seminar Abstracts

MADIT-II: A Recently Completed Sequential Clinical Trial
Jack Hall

The Multicenter Automatic Defibrillator Implantation Trial #2, administered here at the UR Medical Center with 1232 heart-disease patients enrolled through 76 hospital centers, came to a favorable conclusion in November by reaching a pre-specified sequential stopping criterion for efficacy. The statistical work was, and continues to be, carried out here, including statistical design of the study, weekly analyses of the survivorship data, chairing of the Monitoring Committee, and final analyses of efficacy and side-effects data, with cost analyses still to come. This talk will give an overview, focusing on the statistical aspects of designing, monitoring and analyzing such trial data.

Combining Stratified and Unstratified Log-Rank Tests for Correlated Survival Data
Changyong Feng

The log-rank test is the most widely used nonparametric method for testing treatment differences in survival-analysis-based clinical trials due to its efficiency under proportional hazards. Most previous work on the log-rank test has assumed that the samples from the two treatment groups are independent. However, in multicenter clinical trials, survival times of patients in the same medical center may be correlated due to some factors specific to each center; or studies may utilize pairing of patients or response units, resulting in dependence. For such data we can construct stratified and unstratified log-rank tests (call them SLRT and ULRT respectively). These two tests address somewhat different features of the data. An appropriate linear combination of these two tests may give a more powerful test than either individual test. Under a matched-pair frailty model, we obtain closed-form asymptotic local alternative distributions and the correlation coefficient of SLRT and ULRT. Based on these results we construct an optimal linear combination of the two test statistics. Simulation studies with the Hougaard model confirm our construction. Our approach is illustrated with data from the Diabetic Retinopathy Study (Huster et al., 1989). We extend our work to the cases of stratum size > 2 and of variable (but upper bounded) stratum sizes.

Non-Sexual Household Transmission of HCV Infection
Fenyuan Xiao

Objective: This study was designed to determine the prevalence and the incidence of HCV infection among non-sexual household contacts of HCV-infected women and to describe the association between HCV infection and potential household risk factors in order to examine whether non-sexual household contact is a route of HCV transmission. Methods: A baseline prevalence survey included 409 non-sexual household contacts of 241 HCV-infected index women in the Houston area from 1994 to 1997. A total of 470 non-sexual household contacts with no evidence of HCV infection at baseline investigation were re-assessed approximately three years after baseline enrollment. Information on potential risk factors was collected through face-to-face interviews, and blood samples were tested for anti-HCV with ELISA-2 and Matrix / RIBA-2. The relationships between HCV infection and potential risk factors were examined by using univariate and multivariate logistic regression analyses. Results: The overall prevalence of anti-HCV positivity among 409 non-sexual household contacts was 4.4%. The highest prevalence of anti-HCV was found in parents (19.5%), followed by siblings (8.1%) and other relatives (5.6%); the children had the lowest prevalence of anti-HCV (1.2%). The univariate analysis showed that IDU, blood transfusion, tattoos, sexual contact with injecting drug users, more than 3 sexual partners in a lifetime, history of an STD, incarceration, previous hepatitis, and contact with hepatitis patients were significantly associated with HCV infection; however, sharing razors, nail clippers, toothbrushes, gum, food or beds with HCV-infected women, and history of dialysis, health care job, body piercing, and homosexual activities were not. Multivariate analysis found that IDU (OR = 221.7 with 95% CI of 22.8 to 2155.7) and history of an STD (OR = 11.7 with 95% CI of 1.2 to 113.1) were the only variables significantly associated with HCV infection.
No such associations remained for other risk factors. The three-year cumulative incidence of anti-HCV among 352 non-sexual household contacts of HCV-infected women was zero. Conclusion: This study has provided no evidence that non-sexual household contact is a likely route of transmission for HCV infection. The risk of sharing razors, nail clippers, toothbrushes, gum, food and/or beds with HCV-infected women is not evident and has not been shown to be the likely mode for HCV spread among family members. This study does suggest that IDU is the likely route of transmission for most HCV infection. Association also has been shown independently with a history of STD. The prevalence of anti-HCV among non-sexual household contacts was low. Exposure to common parenteral risk factors and sexual transmission between sexual partners may account for HCV spread among household members of HCV-infected persons.

Parameter Estimation in Bivariate Copula Models
Antai Wang

Many models have been proposed for multivariate failure-time data (T_{1},T_{2}) arising in reliability and other applications. A bivariate survivor function S(t_{1},t_{2}) is said to be generated by an Archimedean copula if it can be expressed in the form S(t_{1},t_{2})=p[q{S_{1}(t_{1})}+q{S_{2}(t_{2})}] for some convex, decreasing function q defined on (0,1]. Here p is the inverse function of q. Usually, p is specified as some function of an unknown parameter. Given a sample from S(t_{1},t_{2}), the distribution function of V=S(T_{1},T_{2}), called the Kendall distribution, can be expressed simply in terms of q. We use the score function from the log-likelihood of the V's to estimate the unknown parameter. Although the V's are unknown, they can be estimated empirically. Interestingly, our estimates based on the empirical V's are much more precise than the estimates based on the true and unknown V's. We also investigate an alternative procedure based on iteratively estimating the V's using the assumed copula structure. We discuss the asymptotic theory for both methods and present some illustrative examples. I will also briefly cover the recent development of a new method to estimate the parameter for bivariate data subject to random right censoring.
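
To make the notation concrete, the following Python sketch works through one standard member of this family, the Clayton copula, with generator q(t) = (t^(-theta) - 1)/theta. The choice of the Clayton family and all function names are assumptions made for illustration; the abstract does not say which families were studied. The sketch uses the standard result that the Kendall distribution of an Archimedean copula is K(v) = v - q(v)/q'(v).

```python
def clayton_q(t, theta):
    """Clayton generator q(t) = (t**(-theta) - 1) / theta, theta > 0, 0 < t <= 1."""
    return (t ** (-theta) - 1.0) / theta

def clayton_q_prime(t, theta):
    """Derivative of the generator: q'(t) = -t**(-theta - 1)."""
    return -(t ** (-theta - 1.0))

def clayton_p(y, theta):
    """Inverse of the generator: p(y) = (1 + theta * y)**(-1 / theta)."""
    return (1.0 + theta * y) ** (-1.0 / theta)

def joint_survivor(s1, s2, theta):
    """S(t1, t2) = p[q{S1(t1)} + q{S2(t2)}], given the marginal survivor values."""
    return clayton_p(clayton_q(s1, theta) + clayton_q(s2, theta), theta)

def kendall_distribution(v, theta):
    """K(v) = P(V <= v) = v - q(v) / q'(v) for V = S(T1, T2)."""
    return v - clayton_q(v, theta) / clayton_q_prime(v, theta)

# Illustrative values only.
print(joint_survivor(0.8, 0.6, theta=2.0))
print(kendall_distribution(0.5, theta=1.0))
```

Because K depends on the data only through q, a likelihood for the V's involves the copula parameter alone, which is what makes the score-based estimation described above possible.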

Microarray Analyses
Li-Shan Huang

Bayesian inference of phylogeny
John Huelsenbeck

A Few Remarks on Partial Correlation
Heng Li

A Generalization of ROC Curves
Michael McDermott


Fall 2001 Biostatistics Brown Bag Seminar Abstracts

Use of placebo-controls vs. active-controls in clinical trials evaluating new treatments
Mike McDermott

Using Measurement Error Models w/o and w/ Interactions to Assess Effects of Prenatal and Postnatal Methylmercury Exposure in the Seychelles Child Development Study at age 66-months
Li-Shan Huang

Overrunning in Sequential Clinical Trials
Jack Hall

Most large-scale clinical trials these days have sequential stopping rules that permit early termination of the trial when clear superiority of a treatment is firmly established early in the trial. Once a stopping boundary has been reached, statistical methods allow computation of p-values and estimates of treatment effects which recognize the sequential stopping rule. Typically, however, additional `lagged' data become available after the boundary has been reached. Earlier methods of accommodating such `overrunning' have serious defects. Two new methods (one joint with Aiyi Liu, the other joint with Keyue Ding) will be described, and illustrated with data from the MADIT trial of an implanted defibrillator (New England Journal of Medicine, 335:1933-40, 1996).

A Second Look at Some Statistical Ideas Via Geometric Projection
Heng Li

Geometric concepts have always been useful in statistics. Consider, for example, the number of situations in which the idea of orthogonal projection plays a crucial role. We will discuss a closely related geometric operation, to be called orthogonal cross projection, and point out some of its manifestations in statistics (e.g., covariance). PowerPoint will be used in the presentation, provided that all the equipment is functional and not too sophisticated for the presenter to operate.

Two-Period Designs: Part II
David Oakes

On Two Consistent Tests of Bivariate Independence and Some Applications
Greg Wilding

The use of the correlation coefficient for testing bivariate independence, although most common, has serious limitations. In this talk I will discuss Hoeffding's (1948) test of bivariate independence, and its asymptotic equivalent due to Blum, Kiefer and Rosenblatt (1961), which are well known to be consistent against all dependence alternatives. Specifically, I will describe the status of its null distribution and compare its power using a variety of copulas, including those due to Morgenstern, Gumbel, Plackett, Marshall and Olkin, Raftery, Clayton, and Frank. I will also show how the test of bivariate independence can be used for constructing simple goodness-of-fit tests.

Smoothing Longitudinal Data: A Work in Progress
Derick R. Peterson, Hongwei Zhao, Sara Eapen

We consider the general problem of smoothing longitudinal data to estimate the nonparametric marginal mean function, where a random but bounded number of measurements are available for each independent subject. In stark contrast to recent work in this area, we show that not only can consistent estimators use the correlation structure of the data but that ignoring this correlation structure necessarily results in inefficiency, just as in the parametric setting. The class of local polynomial kernel-based estimating equations considered by Lin & Carroll (JASA 2000) is shown to be too small to make proper use of the correlation structure; this explains the problem with their general message that it is best to assume working independence, while also providing insight into why penalized likelihood-based correlated smoothing splines can be expected to be efficient. We propose a class of simple, explicit ad hoc estimators which, although not efficient, can improve upon the working independence local polynomial modeling approach by making use of the local correlation structure to dramatically improve the precision even for moderate sample sizes.


Spring 2001 Biostatistics Brown Bag Seminar Abstracts

30th Anniversary of the Biplot
Ruben Gabriel

A Review of Nonparametric Survival Estimation with Bivariate Right-Censored Data
Derick Peterson

The problem of nonparametric estimation of the survival function with censored data has an elegant and efficient solution in the one-dimensional case: the Kaplan-Meier estimator. In higher dimensions, with multiple, possibly correlated, survival times, however, the task is much more formidable. Several authors have proposed ad hoc estimators in this model, and in 1996 van der Laan proposed a theoretically efficient estimator, while also analyzing inefficient estimators previously proposed by Dabrowska, Prentice and Cai, and Pruitt. I will review these estimators and explain why the NPMLE is not, in general, consistent for the bivariate survival function. Unlike in the one-dimensional case, some sort of smoothing is required for efficient estimation. Bandwidth selection remains an open problem in this context, thus contributing to the slow uptake of van der Laan's estimator.
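
For reference, the one-dimensional Kaplan-Meier estimator mentioned above can be sketched in a few lines of Python. This is a minimal illustration with made-up data, not code from the talk; the function name and the toy data are assumptions, and the bivariate estimators discussed in the abstract are considerably more involved.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times:  observed times (failure or censoring)
    events: 1 if the failure was observed, 0 if the time is right-censored
    Returns a list of (t, S(t)) at each distinct observed failure time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all subjects whose observed time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Toy data: the subject observed at time 2 is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# → [(1, 0.75), (3, 0.375), (4, 0.0)]
```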

Confidence Intervals: Equal-Tail, Shortest or Unbiased?
Jack Hall

Various criteria for choosing confidence intervals have been considered in the literature. We focus on three, named in the title. When based on a pivot with a symmetric distribution, the three coincide, but in `small-sample' applications this covers little more than confidence intervals for normal population means, contrasts among such means, and rank procedures about a center of symmetry. Of course, from a large-sample perspective, a maximum likelihood estimate minus parameter, standardized by a standard error estimate, is such a pivot, and this covers many applications.

We review the pros and cons of the three competitors, largely in the context of confidence intervals for the variance when sampling from a normal population, and similarly for variance ratios in the analysis of variance. However, our motivation is for dealing with confidence intervals for the hazard ratio after a sequential clinical trial: What kind of interval should be preferred?

Your opinions will be invited....

Analysis of Chicago Ozone Data 1981-1991
Li-Shan Huang

Ozone concentrations are affected by precursor emissions and by meteorological conditions. It is of interest to analyze trends in ozone after adjusting for meteorological influences. We will discuss the following 4 approaches to analyze Chicago Ozone data 1981-91:

  • Nonlinear Regression, by Bloomfield, Royle, Steinberg and Yang (1996)
  • Logistic Models, by Smith and Huang (1993)
  • Semi-parametric modeling, by Gao, Sacks and Welch (1996)
  • Tree regression & empirical Bayes, by Huang and Smith (1999)

A Test for Equality of Ordered Inverse Gaussian Means
Lili Tian

The inverse Gaussian (IG) distribution, called the fraternal twin of the Gaussian distribution, has been widely used in applied fields because it is ideally suited for modeling positively skewed data and because its inference theory is well known to be analogous to that of the Gaussian distribution in numerous ways. For example, Weiss (1982, 1983, 1984) demonstrated that the distribution of circulation times of drug molecules through the body can be approximated by the IG distribution. We propose a test procedure to assess trends in the IG response variable (e.g., in animal toxicity studies). This approach, based on combining independent tests using classical methods, can be easily extended to a spectrum of order constraints. It is also shown that this procedure is intriguingly analogous to that for the Gaussian distribution. The power properties are examined by simulation.

Correlation Between Variables When Each Is Subject to Sets of Exchangeable Measurements: An Approach Based on Group Invariance
Heng Li

An analytical procedure is developed for a type of data structure suitable for modelling the situation in which multiple measurements are made on each of a set of variables, and the measurements can be divided into exchangeable subsets. The procedure is based on the pattern in covariance matrix corresponding to the group invariance inherent in the data structure, from which a closed-form expression of Gaussian likelihood can be found. Sufficient statistics in the form of sums of squares and cross products and their distributions are obtained, leading to methods of statistical inference for a variety of practical purposes from correction for attenuation to estimation of reliability coefficients. The closed-form expression of the likelihood function is also helpful for implementing likelihood-based computation, such as the EM algorithm for handling missing data, and for Bayesian inference. The latter can be a very effective tool in dealing with some inferential problems that do not have standard solutions in the traditional framework. Examples include guaranteeing the nonnegative definiteness of an estimated disattenuated correlation matrix and combining information on association parameters from a main study and a reliability, reproducibility, or repeatability study. No originality is claimed and nothing presented will be beyond what is intuitively obvious and/or what has already been in the literature, although the procedure is readily adaptable for variations on the basic structure. The main objective is to illustrate the application of group invariance in modelling and analysis, which is the topic of almost all my previous lunch presentations. The current presentation, however, involves a data structure that has not been discussed in the previous presentations.

On Kendall's Process and an Associated Estimation Procedure
David Oakes

If X is a continuous univariate random variable with distribution function F(x) = pr (X <= x), then it is well known that F(X) and S(X) = 1 - F(X) are uniformly distributed. For a bivariate random variable (X,Y) it is no longer true in general that F(X,Y) and the corresponding survivor function S(X,Y) follow uniform distributions. For example, if X and Y are independent, S(X,Y) has the distribution of the product of two independent uniform variables.
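
The independence example can be checked numerically. The product of two independent uniform variables has distribution function v - v log v on (0,1], and the Python sketch below compares this with a simulation. The unit exponential margins, sample size, and seed are arbitrary illustrative choices, not taken from the talk.

```python
import math
import random

def product_uniform_cdf(v):
    """P(U1 * U2 <= v) = v - v * log(v) for independent uniforms, 0 < v <= 1."""
    return v - v * math.log(v)

def simulate_joint_survivor(n, seed=0):
    """Draw independent unit exponentials X, Y and return S(X, Y) = exp(-X) * exp(-Y)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        x = rng.expovariate(1.0)
        y = rng.expovariate(1.0)
        values.append(math.exp(-x) * math.exp(-y))
    return values

def empirical_cdf(values, v):
    """Fraction of simulated values that are <= v."""
    return sum(1 for s in values if s <= v) / len(values)

sample = simulate_joint_survivor(20000, seed=1)
for v in (0.1, 0.5, 0.9):
    print(v, round(empirical_cdf(sample, v), 3), round(product_uniform_cdf(v), 3))
```

The empirical and theoretical columns should agree closely, and both differ visibly from the uniform value v.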

This talk will explore the use of a bivariate analog of the probability integral transform in estimating the parameters governing the dependence structure in a bivariate distribution. We will present and explain some simulation results that at first sight seemed somewhat surprising.

(This is joint work with Antai Wang and will form the basis for his upcoming qualifying paper)

Bootstrap variations: random weighting
Derick Peterson

A review of treatment allocation methods in clinical trials
Hongwei Zhao

Randomized-Withdrawal and Randomized-Start Designs
Jack Hall

Randomized-withdrawal and randomized-start designs have recently been introduced in the neurological clinical trials literature as designs which facilitate detection of long-term (`neuroprotective') effects as distinguished from short-term (`symptomatic') effects of a treatment relative to a placebo. Models and analyses for such designs will be described, along with various advantages and limitations. Factorial versions will also be considered.

Fall 2000 Biostatistics Brown Bag Seminar Abstracts

A Roughness-Penalty View of Kernel Smoothing
Li-Shan Huang

It has been shown that a smoothing spline estimate is an equivalent kernel estimate. In this paper, we show that both the Nadaraya-Watson and local linear kernel estimators are equivalent penalized estimators.
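
For readers unfamiliar with the two estimators named above, here is a minimal Python sketch of both. It shows only the kernel estimators themselves, not the penalized formulation that is the subject of the talk; the Gaussian kernel, the bandwidth, and the function names are illustrative assumptions.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, xs, ys, bandwidth):
    """Locally weighted average of ys around x0 (local constant fit)."""
    weights = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

def local_linear(x0, xs, ys, bandwidth):
    """Intercept of the kernel-weighted least-squares line fitted around x0."""
    weights = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    s0 = sum(weights)
    s1 = sum(w * (x - x0) for w, x in zip(weights, xs))
    s2 = sum(w * (x - x0) ** 2 for w, x in zip(weights, xs))
    t0 = sum(w * y for w, y in zip(weights, ys))
    t1 = sum(w * (x - x0) * y for w, x, y in zip(weights, xs, ys))
    # Solve the 2x2 weighted normal equations for the intercept.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)

# Toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0 * x + 1.0 for x in xs]
print(nadaraya_watson(0.0, xs, ys, 0.5))
print(local_linear(0.0, xs, ys, 0.5))
```

On exactly linear data the local linear fit reproduces the line, including at the boundary point x = 0, whereas the local constant fit is pulled toward the interior values.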

Algebraic Rationales for Some Statistical Procedures: Possibilities for Unification and Generalization
Heng Li

Many common procedures in statistics have algebraic interpretations. We will discuss a series of examples beginning with the most basic ones. It will be shown how algebraic rules extracted from simple cases can be applied to tackle some non-trivial problems. Possibilities for a general framework will also be discussed.

A Simulation Study of Frailty Effects in Censored Bivariate Survival Data
Sara Eapen

Multivariate censored survival data typically have correlated failure times. The correlation can be a consequence of the observational design, for example with clustered sampling and matching, or it can be a focus of interest as in genetic studies, longitudinal studies of recurrent events and other studies involving multiple measurements. The correlation between failure times can be accounted for by fixed or random effects. A simulation study was designed to compare the performance of the mixture likelihood approach to estimating the model with these frailty effects in censored bivariate survival data. It is found that the mixture method is surprisingly robust to misspecification of the frailty distribution.

Profile Likelihood and the EM-algorithm
David Oakes

A Review of the Case-Crossover Design & Applications
Jack Hall

The case-crossover design -- a case-control study in which the subject serves as his own control -- was formally introduced by the epidemiologist Malcolm Maclure in 1991. He described it as `a method for studying transient effects on the risk of acute events'. The design will be described and discussed in the context of several published applications (including participation by Jamie Robins and Robert Tibshirani), evaluating the questions: Are MI's more likely following (i) sexual activity? (ii) coffee drinking? (iii) episodes of anger? Are auto accidents more likely while using a cell phone?

Exploring Multivariate Data with Density Trees
Richard Raubertas

Classification trees are widely used as rules for assigning observations to classes based on their attributes or features. A classification tree is equivalent to a partition of the feature space into rectangular regions, with a constant estimate of class probabilities in each region. Density trees are proposed as a variation on this idea, designed to examine the multivariate distribution of the features themselves. A tree-structured approach is used to partition the feature space into low- and high-density regions; that is, regions with especially low or especially high numbers of observations relative to an arbitrary reference distribution. This results in a nonparametric, piecewise-constant estimate of the joint distribution of the features. Because the regions are defined by simple inequalities on individual features, density trees can provide a direct and interpretable description of multivariate structure. In addition, they may be useful for identifying regions where prediction models derived from the data are poorly supported by observations.

Nonparametric regression for longitudinal data
Hongwei Zhao

My talk is motivated by an applied example where it is desirable to fit a nonparametric regression model to data that were obtained longitudinally. Even though the theory of nonparametric regression for independent data has been well developed, there are still questions that need to be answered before nonparametric methods can be applied to longitudinal data. Simulations are conducted to compare some currently available methods as well as some new ones. These methods are also applied to a real example.

Generalized Nonlinear Regression
Christopher Cox

 
