# Seminar Abstracts

### Spring 2012 Biostatistics Brown Bag Seminar Abstracts

**Generalized Additive Partial Linear Models: A Review**

*Hua Liang, PhD*

In this brown bag seminar, I will review various estimation methods for generalized additive partial linear models (GAPLM), with emphasis on the use of polynomial spline smoothing for estimation of nonparametric functions and quasi-likelihood based estimators for the linear parameters. I will also talk about asymptotic normality for the estimators of the parametric components and variable selection procedures for the linear parameters by employing a nonconcave penalized likelihood, which is shown to have an oracle property. I will present several numerical examples, including simulation and empirical studies, for illustration.

Thursday, March 22, 2012 at 12:30 pm, SRB 2420

**Effect Modification using Latent Mixture Analysis**

*Tanzy Love, PhD*

The Seychelles Child Development Study (SCDS) is examining associations between prenatal exposure to low doses of methylmercury (MeHg) from maternal fish consumption and children's developmental outcomes. Whether MeHg has neurotoxic effects at low doses remains unclear, and recommendations for pregnant women and children to reduce fish intake may prevent a substantial number of people from receiving sufficient nutrients that are abundant in fish. The primary findings of the SCDS are inconsistent with adverse associations between MeHg from fish consumption and neurodevelopmental outcomes. However, whether there are subpopulations of children who are particularly sensitive to this diet is an open question. Secondary analysis from this study found significant interactions between prenatal MeHg levels and both caregiver IQ and income on 19 month IQ (Davidson et al., 1999). These results are dependent on the categories chosen for these covariates and are difficult to interpret collectively.

In this paper, we estimate effect modification of the association between prenatal MeHg exposure and 19 month IQ using a general formulation of mixture regression. Our model creates a latent categorical group membership variable which interacts with MeHg in predicting the outcome. We also fit the same outcome model when in addition the latent variable is assumed to be a parametric function of three distinct socioeconomic measures.

Bayesian MCMC methods allow group membership and the regression coefficients to be estimated simultaneously and our approach yields a principled choice of the number of distinct subpopulations. The results show three different response patterns between prenatal MeHg exposure and 19 month IQ in this population.

This is joint work with Sally Thurston and Phil Davidson.

Thursday, March 1, 2012 at 12:30 pm, SRB 441

**Recent Advances in Numerical Methods for Nonlinear Equations and Nonlinear Least Squares**

*Ya-xiang Yuan, PhD*

Nonlinear equations and nonlinear least squares problems have many applications in physics, chemistry, engineering, biology, economics, finance and many other fields. In this paper, we will review some recent results on numerical methods for these two special problems, particularly on Levenberg-Marquardt type methods, quasi-Newton type methods, and trust region algorithms. Discussions on variable projection methods and subspace methods are also given.

Thursday, February 16, 2012 at 12:30 pm, SRB 4414

### Fall 2011 Biostatistics Brown Bag Seminar Abstracts

**Estimation and identifiability for discretely observed age-dependent branching processes**

*Rui Chen, PhD*

A classical problem in biology is to draw inference about cell kinetics from observations collected at discrete time points. The theory of age-dependent branching processes provides an appealing framework for the quantitative analysis of cell proliferation, differentiation, and death under various experimental settings. Likelihood inference being generally difficult with continuous-time processes, we have proposed quasi- and pseudo-likelihood estimators. The goal of this talk is to compare these estimators, and discuss the associated issue of parameter identifiability using moments of the process. Applications to real data and simulation studies will illustrate the talk.

Thursday, December 22, 2011 at 12:30 pm, SRB 4414

**RINO: A Robust Interchangeable Normalization Method for Testing Differential Expressions**

*Xing Qiu, PhD*

Modern microarray analyses depend on a sophisticated data pre-processing procedure called normalization, which is designed to reduce the technical noise level and/or render the arrays more comparable in one study. A formal statistical test such as the two-sample t-test is then applied to the normalized data to obtain p-values. Popular normalization procedures such as global and quantile normalizations are designed for largely homogeneous arrays, i.e., the proportion and/or magnitude of differentiation is negligible. In this talk I will show that if samples in a study are highly heterogeneous with unbalanced differential structure (many more up-regulated genes than down-regulated genes or vice versa), formal statistical inferences based on normalized data can have intolerably high type I error. The fundamental reason for this phenomenon is that normalization procedures are variance reduction transformations which also bring in bias, and this bias can be substantial when the data have unbalanced differential expression structure. RINO is a novel normalization procedure designed to work with highly heterogeneous arrays. In simulation studies it controls type I error for data with highly unbalanced differential expression structure, and it does this without sacrificing testing power. Finally, I apply RINO to several biological datasets. RINO is invariably superior to two competing methods in these comparisons.

Tuesday, December 13, 2011 at 12:30 pm, SRB 4414
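
For readers unfamiliar with the quantile normalization mentioned in the abstract above, the following is a minimal sketch of the standard procedure, not of RINO itself. The function name `quantile_normalize` and the toy data are illustrative assumptions, and ties are handled naively by sort order.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per sample).

    Illustrative sketch only: ties are broken by sort order rather than
    averaged, and this is not the RINO procedure from the talk.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")

    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]

    normalized = []
    for a in arrays:
        # Indices of a, ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            # Replace each value by the reference value of the same rank.
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


toy = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(toy)
```

After this transformation every array has exactly the same empirical distribution, which is the homogeneity assumption the abstract identifies as a source of bias when differential expression is unbalanced.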

**Discrete Network Analysis of Continuous Gene Expression Data**

*Nikolay Balov, PhD*

Using gene profiles for detecting phenotypic differences due to disease or other factors is a main problem of gene expression analysis. Graphical statistical models such as Bayesian networks, unlike many standard approaches including support vector machines and simple linear regression, provide classification power by identifying and quantifying significant gene interactions. In addition, discrete Bayesian networks have the attractive feature of being able to represent non-linear and non-Gaussian dependencies. Their application however depends on discretization of the original data which is inevitably accompanied by loss of information. As a result, discrete network models suffer from very unstable estimation in small sample settings, as is usually the case with microarray data. Can we lessen the impact of discretization and improve the estimation efficiency? We propose a modified MLE based on the so-called ‘soft’ discretization and provide some theoretical considerations of when and why the latter is better than the standard MLE employing ‘hard’ discretization. These two approaches are used for implementing discrete Bayesian network classifiers and then compared in a real data setting: predicting the p53 status of breast cancer subjects.

Thursday, December 8, 2011 at 12:30 pm, SRB 4414

**Molecular dynamics of amyloid fibrils**

*Ana Rojas, PhD*

Under certain conditions, a number of proteins can form aggregates known as amyloid fibrils. In the last two decades, these aggregates have gained great attention due to their role in diseases such as type 2 diabetes, Alzheimer's, Parkinson's and Huntington's disease, and more recently, in HIV infectivity.

Due to the nature of amyloids, it has been extremely difficult to elucidate their 3-dimensional structures as well as the pathway leading to fibril formation. In this regard, simulations have been a valuable tool. In particular, the molecular dynamics (MD) technique, which has the advantage of describing the evolution of a system as a function of time, is a useful tool to study this problem. I will present examples where we have successfully applied MD to understand different aspects of amyloid fibrils.

Thursday, December 1, 2011 at 12:30 pm, SRB 4414

**Order restricted analysis of covariance with unequal slopes**

*Jason Morrissette, M.A.*

Analysis of covariance models arise often in practice. We present methods for estimating the parameters of an analysis of covariance model under pre-specified order restrictions on the adjusted mean response across the levels of a grouping variable (e.g., treatment). The order restriction is assumed to hold across pre-specified ranges of the covariates. The assumption of equal slopes for each level of the grouping variable is relaxed, but the association between the response and the covariates is assumed to be linear. The estimation procedure involves solving a quadratic programming minimization problem with a carefully specified constraint matrix. The likelihood ratio test for equality of ordered mean responses is developed and the null distribution of the test statistic is described. A Johnson-Neyman type test for identifying regions of the covariates that correspond to significant group differences is also described. The proposed methods are demonstrated using data from a clinical trial of the dopamine agonist pramipexole for the treatment of early Parkinson’s disease performed by the Parkinson Study Group.

Thursday, October 27, 2011 at 12:30 pm, SRB 4414

**Efficient Inference for Differences Between Paired Observations**

*Jack Hall, Ph.D.*

Paired data differences come about as the difference between after- and before-treatment measurements on patients, or between left- and right-side measurements, or as differences between twins, for example, with symmetry at zero the natural representation of 'no effect'. The Wilcoxon signed-rank test *W* is a popular technique (alternative to a *t*-test) for testing such a null hypothesis, but methods for inferences after a rejection are limited.

We propose two semiparametric models for skewed alternatives to symmetry, and introduce a signed-rank test with greater power than *W* (or *t*) at these alternatives. We go on to estimate the departure-from-symmetry-at-0 parameter in these models, deriving an efficient estimate. The parameter has both skewness and hazard ratio interpretations.

The test and estimation methods are illustrated with a textbook example, and a simulation study is summarized. Extension to regression is suggested. [Part of this work was done jointly with Jon Wellner.]

[For presentation at Stanford University, November 2, 2011.]

Monday, October 17, 2011 at 12:30 pm, SRB 4414
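
As background for the abstract above, the classical Wilcoxon signed-rank statistic can be computed in a few lines. This sketch covers only the standard *W* that the talk uses as its baseline, not the new signed-rank test it proposes. The function name `signed_rank_statistic` and the sample differences are illustrative assumptions. Zero differences are dropped and tied absolute values receive average ranks, which is one common convention.

```python
def signed_rank_statistic(diffs):
    """Return (W_plus, W_minus) for a list of paired differences.

    Zero differences are discarded; tied absolute differences share the
    average of the ranks they occupy (one common convention).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    # Indices ordered by absolute difference, smallest first.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))

    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the end of the current run of tied absolute values.
        end = pos
        while end + 1 < n and abs(nonzero[order[end + 1]]) == abs(nonzero[order[pos]]):
            end += 1
        # Average of the 1-based ranks pos+1, ..., end+1.
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1

    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return w_plus, w_minus


# Hypothetical after-minus-before differences for six patients.
w_plus, w_minus = signed_rank_statistic([1.5, -0.5, 2.0, 0.0, -1.0, 3.0])
```

Under symmetry at zero, the two sums are expected to be roughly equal; a large imbalance is evidence against the null hypothesis.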

**When subtraction equals multiplication, the proportional odds model with an anchor**

*David Oakes, Ph.D.*

Motivated by the well-known tonsil-size data analyzed in McCullagh’s (1980) pioneering paper on the proportional odds model, we consider the behavior of this model when the distribution of responses at one level of the explanatory variables is known or can be estimated directly with very high precision.

Thursday, October 6, 2011 at 12:30 pm, SRB 4414

### Fall 2010 Biostatistics Brown Bag Seminar Abstracts

**Peak Detection as Multiple Testing for ChIP-Seq Data**

Armin Schwartzman, Ph.D.

Harvard School of Public Health

A multiple testing approach to peak detection is applied to the problem of detecting transcription factor binding sites in ChIP-Seq data. The proposed algorithm is a modified version of a more general peak detection algorithm, where after kernel smoothing, the presence of a peak is tested at each observed local maximum, followed by multiple testing correction via false discovery rate. The adaptation to ChIP-Seq data includes modeling of the data as a Poisson sequence, use of Monte Carlo simulations to estimate the distribution of the heights of local maxima under the null hypothesis for computing p-values of candidate peaks, and local estimation of the background Poisson rate from a Control sample.

Thursday, October 14, 2010 at 12:30 pm, Biostats Conference Room
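
The abstract above ends its pipeline with a false discovery rate correction over the p-values of candidate peaks. The abstract does not say which FDR procedure is used; the sketch below assumes the widely used Benjamini-Hochberg step-up procedure. The function name `benjamini_hochberg` and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    Assumed here for illustration; the talk's exact FDR method is not stated.
    """
    m = len(p_values)
    # Indices of hypotheses ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k such that p_(k) <= alpha * k / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k_max = rank

    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values for five candidate peaks (local maxima).
peak_p_values = [0.001, 0.045, 0.02, 0.2, 0.008]
significant = benjamini_hochberg(peak_p_values, alpha=0.05)
```

In the setting of the talk, each p-value would come from comparing the height of a smoothed local maximum with its Monte Carlo null distribution.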

**Targeting cancer cell-specific gene networks to achieve therapeutic efficacy and specificity**

Helene McMurray, Ph.D.

For cancer treatment, the ideal combination of therapeutic agents would produce highly effective inhibition of cancer growth, with minimal collateral damage to normal cells. To gain predictability of effect and specificity, it is necessary to identify, understand, and ultimately target the key features within cancer cell regulation. Because the cancer state has underlying robustness, it will be important to target multiple vulnerabilities in cancer cells, in order to maximize cancer ablation. Our strategy to accomplish this focuses on three key areas: I. Identifying key features in cancer cells: Can we find cancer regulatory genes/gene networks in genomic scale gene expression data?; II. Achieving cancer cell specificity: Are there genes/gene networks that specifically regulate cancer cell behavior?; and III. Targeting for cancer cell ablation: Can we find pharmacologic means to alter these genes/gene networks?

Thursday, September 23, 2010 at 12:30 pm, Biostats Conference Room

### Summer 2010 Biostatistics Brown Bag Seminar Abstracts

**Sufficiency, Completeness, and Sigma-Algebra**

Xing Qiu, Ph.D.

Most probability textbooks present sigma-algebra as a pure technical prerequisite for measure/probability without explaining its geometric representations. As a result, many statisticians do not have good intuition of this otherwise very powerful tool. In this talk, I will use atom/particle as a way to visualize sigma-algebras, and then illustrate the deep connection between ordered sigma-algebras (called a filtration) and important statistical notions such as sufficiency and completeness. From this point of view, the much celebrated Lehmann-Scheffé Theorem becomes quite apparent.

Thursday, August 12, 2010

**Correlation Between the True and False Discoveries in a Positively Dependent Multiple Comparison Problem**

Rui Hu, Ph.D.

Testing multiple hypotheses when observations are positively correlated is very common in practice. The dependence between observations can induce dependence between test statistics and distort the joint distribution of the true and false positives. It has a profound impact on the performance of common multiple testing procedures. While the marginal statistical properties of the true and false discoveries such as their means and variances have been extensively studied in the past, their correlation remains unknown.

By conducting a thorough simulation study, we are able to find that the true and false positives are likely to be positively correlated if testing power is low and *vice versa*. The fact that positive dependence between observations can induce negative correlation between the true and false discoveries may be used to design better multiple testing procedures in the future.

Thursday, August 5, 2010

**The Signed Log-Rank Test and Modeling Alternatives to Symmetry at Zero**

W. Jackson Hall, Ph.D.

We first re-visit the two-sample log-rank test, but removed from the context of censored-data survival-analysis. It is asymptotically efficient in each of two *Lehmann alternatives models* – with one survival function a power of the other, or one distribution function a power of the other –, but it is rarely applied outside survival analysis. The test should be considered as an alternative to the standard Wilcoxon test, which is aimed at shift alternatives.

We extend this directly to the one-sample problem, now testing symmetry-at-0, a hypothesis often natural for paired data differences. The Wilcoxon signed-rank test is commonly used and aimed at symmetric shift alternatives. We propose instead a signed log-rank test. We report on its power, and show it to be a locally most powerful rank test, and asymptotically efficient, in each of two *Lehmann-alternatives-to-symmetry-at-0 models*. Efficient estimation of the corresponding hazard-ratio-type parameter is briefly described, and extensions to regression models envisaged.

Some of this work is joint with Jon Wellner.

Thursday, July 22, 2010

### Spring 2010 Biostatistics Brown Bag Seminar Abstracts

**Statistical Design on Estimating the Duration of HIV Infection**

Ha Youn Lee, Ph.D. and Tanzy Love, Ph.D.

For practical and economic reasons, the HIV prevention field needs to be able to assess the duration of virus infection in individuals. We are studying the diversification of the HIV sequence population within an infected individual by combining high-throughput pyrosequencing, mathematical modeling, and statistical inference. Our main goal is to provide a reliable, simple method for quantitatively evaluating the duration of virus infection.

We start the talk by introducing the model of HIV sequence evolution, which illustrates detailed virological processes within the body such as production of newly infected cells and explicit random mutations in HIV genomes by reverse transcriptase errors. Then we focus on walking you through our statistical design for jointly estimating the time post infection and the number of (transmitted) founder strains from the intersequence Hamming distance distribution of sampled HIV genes. Two independent methods for parameter estimation, maximum likelihood estimation and Bayesian posterior estimation, will be discussed.

Thursday, April 8, 2010
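
The intersequence Hamming distance distribution that the abstract above uses as its data summary can be computed as follows. This is a minimal sketch that assumes the sequences are already aligned to equal length. The function names and the toy sequences are illustrative assumptions, and the sketch covers only the summary statistic, not the evolutionary model or the estimators from the talk.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_distribution(sequences):
    """Relative frequency of each pairwise Hamming distance."""
    counts = Counter(hamming(a, b) for a, b in combinations(sequences, 2))
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}


# Hypothetical aligned fragments sampled from one individual.
sampled = ["ACGT", "ACGA", "ACGT", "TCGA"]
distribution = hamming_distribution(sampled)
```

Intuitively, a longer time since infection allows more mutations to accumulate, which shifts this distribution toward larger distances.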

**Potential advantages (and disadvantages) of high-throughput RNA sequencing over microarrays for global gene expression profiling**

Stephen Welle, Ph.D.

I will briefly discuss the principle of the method and the specific implementation available at the URMC FGC (SOLiD 3 Plus system), and show how some of the output files are formatted. I will discuss theoretical advantages of this method and illustrate some of these with some of my own RNA-Seq data. I will solicit discussion from your group regarding whether such data require different statistical approaches than those used with microarrays for assessing differential expression between experimental conditions.

Thursday, February 25, 2010

**Sieve Estimation of Constant and Time-Varying Coefficients in Nonlinear Ordinary Differential Equation Models by Considering Both Numerical Error and Measurement Error**

Hongqi Xue, Ph.D.

This article considers estimation of constant and time-varying coefficients in nonlinear ordinary differential equation (ODE) models where analytic closed-form solutions are not available. The numerical solution-based nonlinear least squares (NLS) estimator is proposed. A numerical algorithm such as the Runge-Kutta algorithm is used to approximate the ODE solution. The asymptotic properties are established for the proposed estimators with consideration of both numerical error and measurement error. The B-spline approach is used to approximate the time-varying coefficients and the corresponding asymptotic theories in this case are investigated under the framework of the sieve approach. Our results show that if the maximum step size of the [pic]-order numerical algorithm goes to zero at a rate faster than [pic], then the numerical error is negligible compared to the measurement error. This provides a theoretical guidance in selection of the step size for numerical evaluations of ODEs.

Moreover, we have shown that the numerical solution-based NLS estimator and the sieve NLS estimator are strongly consistent. The sieve estimator of constant parameters is asymptotically normal with the same asymptotic co-variance as that of the case where the true ODE solution is exactly known, while the estimator of the time-varying parameter has the optimal convergence rate under some regularity conditions. We illustrate our approach with both simulation studies and clinical data on HIV viral dynamics.

*This is joint work with Hongyu Miao and Hulin Wu.*

Thursday, February 11, 2010
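
To make the idea of a numerical solution-based NLS estimator in the abstract above concrete, the sketch below fits a single constant coefficient in a toy ODE. Everything specific here is an assumption for illustration: the model dx/dt = -theta * x, the classical fourth-order Runge-Kutta scheme, the noise-free observations, and the grid search used in place of a proper optimizer. The time-varying coefficients and B-spline sieve from the talk are not covered.

```python
import math


def rk4_solve(f, x0, t_end, n_steps):
    """Approximate x(t_end) for dx/dt = f(t, x), x(0) = x0, using
    n_steps of the classical fourth-order Runge-Kutta method."""
    h = t_end / n_steps
    t, x = 0.0, x0
    for _ in range(n_steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, x + h * k1 / 2)
        k3 = f(t + h / 2, x + h * k2 / 2)
        k4 = f(t + h, x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return x


def sum_of_squares(theta, times, observations, x0=1.0, steps_per_unit=20):
    """Squared distance between the data and the numerical ODE solution."""
    total = 0.0
    for t, y in zip(times, observations):
        n_steps = max(1, round(t * steps_per_unit))
        x = rk4_solve(lambda s, v: -theta * v, x0, t, n_steps)
        total += (y - x) ** 2
    return total


def nls_grid_estimate(times, observations, grid):
    """Pick the candidate theta that minimizes the sum of squares."""
    return min(grid, key=lambda th: sum_of_squares(th, times, observations))


# Noise-free toy data generated from the exact solution with theta = 0.7.
times = [0.5, 1.0, 1.5, 2.0, 3.0]
observations = [math.exp(-0.7 * t) for t in times]
grid = [i / 100 for i in range(1, 201)]
theta_hat = nls_grid_estimate(times, observations, grid)
```

The estimator never uses the closed-form solution; it only compares the data with the numerical approximation, which is why the step size of the solver matters for the theory described in the abstract.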

### Fall 2009 Biostatistics Brown Bag Seminar Abstracts

**On the Impact of Parametric Assumptions and Robust Alternatives for Longitudinal Data Analysis**

Naiji Lu, Ph.D.

Models for longitudinal data are employed in a wide range of behavioral, biomedical, psychosocial, and health-care related research. One popular model for continuous response is the linear mixed-effects model (LMM). Although simulations in recent studies show that LMM provides reliable estimates under departures from the normality assumption for complete data, the invariable occurrence of missing data in practical studies renders such robustness results less useful when applied to real study data. We show with simulated study data that, in the presence of missing data, estimates of the fixed effects of LMM are biased under departures from normality. We discuss two robust alternatives, the weighted generalized estimating equations (WGEE) and the augmented WGEE (AWGEE), and compare their performance with LMM using real as well as simulated data. Our simulation results show that both WGEE and AWGEE provide valid inference for skewed non-normal data when missing data follow the missing at random (MAR) mechanism, the most popular missing data mechanism for real study data.

Thursday, December 17, 2009

**Statistical Mechanics of RNA Structure Prediction**

*David Mathews, M.D., Ph.D.*

RNA structure is hierarchical and therefore the secondary structure, the set of the canonical base pairs, can be predicted independently of the 3D structure. For decades, these predictions were based on predicting the lowest free energy structure, that with the highest probability of forming. In these calculations, free energy change is predicted using an empirical, nearest neighbor model.

The accuracy of secondary structure prediction can be significantly improved by determining the probabilities of formation of structures using partition functions. In this talk, I will show that structures composed of base pairs with high pairing probability are more accurate, on average, than lowest free energy structures. I will also discuss a new method for predicting structures that generalizes to more complex topologies than most methods.

Thursday, December 3, 2009

**Latent Variable Models for Discovering Effect Modification in the Seychelles Child Development Study**

*Tanzy Love, Ph.D.*

The Seychelles Child Development Study (SCDS) is testing the hypothesis that prenatal exposure to low doses of methyl-mercury, MeHg, from maternal consumption of fish is associated with the child's developmental outcomes. No consistent negative relationships between exposure to MeHg and cognitive functions have been identified in the primary analysis of the main cohort. However, secondary regression analysis of this cohort found a small effect modification of MeHg by both caregiver IQ and household income (Davidson et al. 1999). This analysis showed a significant positive relationship between MeHg and intelligence among the subset of children whose caregivers had high IQs and were of high socio-economic status.

Using a Bayesian MCMC approach, we fit a standard latent class model and also developed a new latent variable model to further explore these results. The new model allows interactions between an unobserved social/cultural status and mercury to be discovered. It also uncovers the relationship between the latent social/cultural status variable and the observed socio-economic variables.

Thursday, November 12, 2009

### Summer 2009 Biostatistics Brown Bag Seminar Abstracts

**Statistical issues in gene regulatory network reconstruction: How classical principles of experimental design, statistical significance and model fitting can be applied to the modeling of gene perturbation data**

*Anthony Almudevar, Ph.D.*

The reconstruction of gene regulatory networks based on experimental gene perturbation data is a promising field of research with the aim of developing treatments for cancer based on the direct molecular control of cancer cells. In this talk I will give an overview of the field, and describe some recent methodologies developed by our group. The governing principle is that ideas of classical statistical inference may be applied to this problem, despite the complex form of the models.

Thursday, July 23, 2009

**The Beauty of Symmetry: An Introduction to the Invariance Principle in Hypothesis Testing**

*Xing Qiu, Ph.D.*

The Neyman-Pearson lemma is the cornerstone of the theory of statistical inference. It solves the simple hypothesis testing problem perfectly: the likelihood ratio test is the most powerful test. In reality, most H-T problems are composite problems for which no uniformly most powerful test exists. Therefore, additional principles such as the unbiasedness principle, the mini-max principle, and the invariance principle are needed in order to construct a test that is "of good quality".

The invariance principle in statistical inference is a very powerful tool to reduce the dimensionality/complexity of the H-T problem when it exhibits a natural symmetry (usually it does). This symmetry can be formalized by the group actions on both the data space and the distribution/parameter space. In this seminar, I will demonstrate the beauty and the usefulness of the invariance principle through three concrete examples: 1. (a particular) Chi-square test; 2. Two-sample t-test; 3. Wilcoxon rank-sum test.

Thursday, July 9, 2009

### Spring 2009 Biostatistics Brown Bag Seminar Abstracts

**Statistics on Manifolds with Applications**

*Nikolay Balov, M.S.*

Department of Statistics, Florida State University

With ever increasing complexity of observational and theoretical data models, the sufficiency of the classical statistical techniques, designed to be applied only on vector quantities, is being challenged. Our work is an attempt to improve the understanding of random phenomena on non-Euclidean spaces.

Specifically, our goal is to generalize the notion of distribution covariance, which in standard settings is defined only in Euclidean spaces, to arbitrary manifolds with a metric, also known as Riemannian manifolds. We introduce a tensor field structure, named covariance field, that is consistent with the heterogeneous nature of manifolds. It not only describes the variability imposed by a probability distribution but also provides alternative distribution representations. The covariance field combines the distribution density with geometric characteristics of its domain and thus fills the gap between these two. We show how this new structure can provide a systematic approach for defining parametric families of distributions on manifolds, regression analysis and nonparametric statistical tests for comparing distributions.

We then present several application areas where this new theory may have potential impact. One of them is the branch of directional statistics, with domain of influence ranging from geosciences to medical image analysis and bioinformatics. The fundamental level at which the covariance-based structures are introduced also opens a new area for future research.

Wednesday, April 22, 2009

**Does Extirpation of the Primary Breast Tumor Give Boost To Growth of Metastases? Evidence Revealed By Mathematical Modeling**

*Leonid Hanin*

Department of Mathematics, Idaho State University

A comprehensive mechanistic model of cancer natural history was developed to obtain an explicit formula for the distribution of volumes of detectable metastases in a given secondary site at any time post-diagnosis. This model provided a perfect fit to the volumes of n = 31 bone metastases observed in a breast cancer patient 8 years after primary diagnosis. Based on the model with optimal parameters the individual natural history of cancer for the patient was reconstructed. This gave definitive answers to the following three questions of major importance in clinical oncology: (1) How early an event is metastatic dissemination of breast cancer? (2) How long is the metastasis latency time? and (3) Does extirpation of the primary breast tumor accelerate the growth of metastases? Specifically, according to the model applied to the patient in question, (1) inception of the first metastasis occurred 29.5 years prior to the primary diagnosis; (2) the expected metastasis latency time was about 79.5 years; and (3) resection of the primary tumor was followed by a 32-fold increase in the rate of metastasis growth.

Friday, March 27, 2009

**Likelihood Estimation of the Population Tree**

*Arindam RoyChoudhury, PhD*

Postdoctoral Associate

Dept. of Biological Statistics and Computational Biology

Cornell University

The population tree, i.e., the evolutionary tree connecting various populations, has applications in various fields of biology and medical sciences. It can be estimated from genome wide allele-count data. We will present a maximum-likelihood estimator of the tree based on a coalescent theoretic setup.

Using the coalescent theory we keep track of the probability of the number of lineages at different time-points in a given tree. We condition on the number of lineages to compute the probability of the observed allele-counts. Computing these probabilities requires a sophisticated "pruning" algorithm. The algorithm computes arrays of probabilities at the root of the tree from the data at the tips of the tree. At the root, the arrays determine the likelihood. The arrays consist of probabilities related to the number of lineages and allele-counts among those lineages. Our computation is exact, and avoids time consuming Monte-Carlo methods.

Thursday, March 12, 2009

**Trust me, I’m an academic statistician! Professional ethics, conflict of interest, and JAMA policies in the reporting of randomized clinical trials**

*Michael P. McDermott, Ph.D.*

The issue of conflict of interest and its potential impact on the integrity of scientific research has received an increasing amount of attention in the past two decades. Conflicts of interest come in many forms, including financial, intellectual, and professional. Policies have been adopted by academic institutions, governmental agencies, scientific journals, and other organizations to manage potential conflicts of interest. Statisticians, being integral to the conduct of scientific research, are certainly not immune to these potential conflicts.

Much of the research in the discovery and development of pharmacological (and other) therapies is sponsored by the pharmaceutical industry. It is clear that the trust that the public has in this research has eroded over time; there is a perceived lack of objectivity of the pharmaceutical industry in the conduct and reporting of this research. One manifestation of this can be found in the following policy adopted by the Journal of the American Medical Association (JAMA) in 2005: For industry-sponsored studies, “an additional independent analysis of the data must be conducted by statisticians at an academic institution, such as a medical school, academic medical center, or government research institute” rather than only by statisticians employed by the company sponsoring the research.

This seminar will outline some of the background that led to the JAMA policy, the implications of the policy, and the reaction of members of the scientific community. The broader issue of professional statistical ethics and the potential conflicts of interest that are faced by statisticians will also be discussed.

Thursday, February 26, 2009

**Applications of Multivariate Hypothesis Testing in Gene Discovery**

*Anthony Almudevar, Ph.D.*

A fundamental problem in genomic studies is the detection of differential expression among large sets of gene expression values produced by microarray data collected under varying experimental conditions. Although the problem is most naturally expressed as a sequence of hypothesis tests involving the expression distributions of individual genes, two points have recently been noted: 1) the individual expression distributions are characterized by statistical dependence induced by gene cooperation, and 2) information about which sets of genes form cooperative pathways is now widely available. This has led to an interest in multivariate tests involving vectors of gene expressions taken from gene sets known to have some form of functional relationship. While this leads to potentially greater power, as well as better interpretability, the development of suitable multivariate testing methods is still an important area of research. In this talk, I will discuss a number of approaches to this problem. In the first, statistical methods used to model gene pathways (primarily Bayesian networks) are adapted to hypothesis testing. In the second, we will consider how the theory of Neyman-Pearson tests can be adapted to the testing of complex hypotheses, following an earlier application to a problem in statistical genetics (Almudevar, 2001, Biometrics).

Thursday, February 5, 2009

**Differential Equation Modeling of Infectious Diseases: Identifiability, Parameter Estimation, Model Selection, and Computing Tools**

*Hongyu Miao, Ph.D.*

Many biological processes and systems can be described by a set of differential equation (DE) models. However, the literature on statistical inference for DE models is very sparse. We propose identifiability analysis, statistical estimation, model selection, and multi-model averaging methods for biological problems such as HIV viral fitness and influenza infection that can be described by a set of nonlinear ordinary differential equations (ODE). Related computing techniques have also been developed and are available as a few comprehensive software packages. We expect that the proposed modeling, inference approaches and computing techniques for the DE models can be widely used for a variety of biomedical studies.

Thursday, January 22, 2009

### Spring 2008 Biostatistics Brown Bag Seminar Abstracts

**In memory of Andrei: Multitype Branching Processes with Biological Applications**

*Nick Yanev, Ph.D.*

The asymptotic behavior of multitype Markov branching processes with discrete or continuous time is investigated when both the initial number of ancestors and the time tend to infinity. Some limiting distributions are obtained and the asymptotic multivariate normality is proved in the positive regular and nonsingular case. The paper also considers the relative frequencies of distinct types of individuals (cells), a concept motivated by applications in the field of cell biology. We obtain non-random limits and multivariate asymptotic normality for the frequencies when the initial number of ancestors is large and the time is fixed or tends to infinity. When the time is fixed the results are valid for any branching process with a finite number of types; the only assumption required is that of independent individual evolutions. The reported limiting results are of special interest in cell kinetics studies where the relative frequencies, but not the absolute cell counts, are accessible to measurement. Relevant statistical applications are discussed in the context of asymptotic maximum likelihood inference for multitype branching processes.

Thursday, May 22, 2008

12:30 p.m.

Biostatistics Conference Room

**Modeling Intrahost Sequence Evolution in HIV-1 Infection**

*Ha Youn Lee, Ph.D.*

Quantifying the dynamics of intrahost HIV-1 sequence evolution is one means of uncovering information about the interaction between HIV-1 and the host immune system. In this talk, I will introduce a mathematical model and Monte Carlo simulation of viral evolution within an individual during HIV-1 infection that make it possible to explain the universal dynamics of sequence divergence and diversity, to classify new HIV-1 infections as originating from multiple versus single transmitted viral strains, and to estimate the time since the most recent common ancestor of a transmitted viral lineage.

From 13 out of 15 longitudinally followed patients (3-12 years), we found that the rate of intrahost HIV-1 evolution is not constant, but rather slows down at a rate correlated with the rate of CD4+ T cell count decline. We studied an HIV-1 sequence evolution model where for each sequence we keep track of its distance from the founder strain and assign a fitness and survival probability of mutations based on the distance from the founder strain.

The model suggests that the saturation of divergence and the decrease of diversity observed in the later stages of infection are attributable to a decrease in the probability that mutant strains survive as the distance from the founder strain increases, rather than to an increase in viral fitness. In the second part, I will talk about both synchronous and asynchronous models of the acute phase of HIV-1 evolution with a single-cycle reverse transcriptase error rate, average generation time, and basic reproductive ratio. These models were used to analyze 3,475 complete env sequences recently derived by single genome amplification from 102 subjects with acute HIV-1 (clade B) infection, distinguishing single-strain infections from multiple-variant infections and also identifying transmitted HIV-1 envelope genes.

Thursday, February 21, 2008

12:30 p.m.

Biostatistics Conference Room

### Fall 2007 Biostatistics Brown Bag Seminar Abstracts

**Creating an R Package Part I: It's EASY!**

*Gregory Warnes, Ph.D.*

The open source statistical package R provides nice tools for bundling a set of functions and data together as an R package. Creating an R package from your R scripts helps to provide good documentation, and makes it much easier to share with others and to maintain your code for your own future use. This Brown Bag will demonstrate how to create an R package, and the advantages that come from doing so.

**Adaptive Simon Two-Stage Design for Preliminary Test of Targeted Sub-Population**

*Qin Yu, Graduate Student*

The trend towards specialized clinical development programs for targeted cancer therapies is growing fast, which was made possible by significant improvements in molecular characterization of biological pathways fostering the growth of tumors. The proposed phase two stage design, which is an adaptation of Simon's two-stage design, allows for preliminary determination of efficacy for a particular sub-population defined by biomarker status. The advantage of adopting this two-stage design is shown via a real study.
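As background for the design named above, the operating characteristics of the classic (non-adaptive) Simon two-stage design can be computed directly from binomial probabilities. The sketch below is only an illustration under stated assumptions: it covers the standard design, not the adaptive sub-population version proposed in the talk, and the function names and the design parameters in the usage line are hypothetical choices made for the example.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p); zero when x is negative."""
    if x < 0:
        return 0.0
    return sum(binom_pmf(k, n, p) for k in range(min(x, n) + 1))


def simon_two_stage(r1, n1, r, n, p):
    """Operating characteristics of a classic Simon two-stage design.

    Stage 1 enrolls n1 patients and stops for futility if responses <= r1.
    Otherwise enrollment continues to n patients in total, and the
    treatment is declared ineffective if total responses <= r.
    """
    n2 = n - n1
    # Probability of early termination after stage 1.
    pet = binom_cdf(r1, n1, p)
    # Probability of continuing to stage 2 but still ending with <= r responses.
    fail_after_stage2 = sum(
        binom_pmf(x, n1, p) * binom_cdf(r - x, n2, p)
        for x in range(r1 + 1, min(n1, r) + 1)
    )
    return {
        "pet": pet,
        "p_declare_effective": 1.0 - pet - fail_after_stage2,
        "expected_n": n1 + (1.0 - pet) * n2,
    }


# Illustrative design (assumed values, not from the talk): r1/n1 = 1/10, r/n = 5/29.
null_case = simon_two_stage(1, 10, 5, 29, p=0.10)
alt_case = simon_two_stage(1, 10, 5, 29, p=0.30)
```

Evaluating the function at the null response rate gives the type I error and the expected sample size; evaluating it at the alternative rate gives the power. For the illustrative design the early-termination probability at a 10% response rate is about 0.736.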

**Using Auxiliary Variables to Enhance Survival Analysis**

*Haiyan Su, Graduate Student*

One of the primary problems facing statisticians who work with survival data is the loss of information that occurs with right-censored data. Markers, which are prognostic longitudinal variables, can be used to replace some of the information lost due to right-censoring because they are correlated with, and predictive of, the overall survival event. In oncology studies, disease progression status is measured at certain times and is correlated with survival. How to incorporate information on disease progression (the markers) in the analysis of survival to reduce the variance of the treatment effect estimator (e.g., the log hazard ratio in the Cox model) is an interesting and challenging question. In this work, we applied Mackenzie & Abrahamowicz's (MA) plug-in method, which writes the test statistic as a functional of the Kaplan-Meier estimators, and then replaced the latter with an efficient estimator of the survival curve that incorporates the information from markers. Possible choices of survival curve estimator are the Murray-Tsiatis (MT) method and the Finkelstein-Schoenfeld (FS) method. The resulting estimators can greatly improve efficiency provided that the marker is highly prognostic and that the frequency of censoring is high. MA's methodology is illustrated with an application to real time-to-event data using the MT survival curve estimator. We will also introduce the FS method with a real data example.

**Approximate Iteration Algorithms**

*Anthony Almudevar, Ph.D.*

In this talk I will summarize some work undertaken by the authors in the area of approximate iterations, ranging from basic theory to applications in control theory and numerical analysis. The relationship of these processes to some important medical applications will be reviewed. The talk will divide naturally into three sections.

1. Models of Approximate Iterative Processes. An iterative process is usually expressed as a normed space $V$ with some operator $T$, on which a sequence $v(k+1) = Tv(k), k \geq 0$ is generated, given some starting value $v(0)$. Ideally, this sequence converges to a fixed point $w = Tw$. In practice, the operator can only be evaluated approximately, so the iteration is more accurately written $v(k+1) = T_k v(k) = Tv(k) + u(k)$ where, alternatively, $T_k$ is the $k$th approximation of $T$, or $u(k)$ is the approximation error associated with the $k$th iteration. It is possible to show that if $T$ is contractive the approximate algorithm will converge to the fixed point, at a rate equivalent to $\max(r^k, |u(k)|)$, where $r$ is the contraction constant. The remaining work largely follows from this result.

2. Numerical Analysis. Many iterative algorithms rely on operators which may be difficult or impossible to evaluate exactly, but for which approximations are available. Furthermore, a graduated range of approximations may be constructed, inducing a functional relationship between computational complexity and approximation tolerance. In such a case, a reasonable strategy would be to vary tolerance over iterations, starting with a cruder approximation, then gradually decreasing tolerance as the solution is approached.

However, in such an algorithm, because the computational complexity increases over iterations, the convergence rate of the algorithm is more appropriately calculated with respect to cumulative computation time than to iteration number. This leaves open the problem of determining an optimal rate of change of approximation tolerance.

Our theory of approximate iterations may be used to show that, under general conditions, for linearly convergent algorithms the optimal choice of approximation tolerance convergence rate is the same linear convergence rate as the exact algorithm itself, regardless of the tolerance-complexity relationship. This result will be illustrated with several examples of Markov decision processes.

3. Adaptive Stochastic Decision Processes. A stochastic decision process is a random sequence whose distribution beyond a time $t$ can be determined by an action taken by an observer at time $t$, who has access to all process history up to that time. There is usually some reward criterion, so that the objective of the action is to maximize the expected value of the reward. If the process distribution under all possible action sequences is known then, at least in principle, the optimal action under any given history can be calculated, and so would be available to the observer as a control policy. Typically, these distributions are unknown, but may be estimated by the observer using process history. In this case, the observer needs to vary the actions sufficiently in order to estimate the model. This, however, conflicts with the goal of achieving the optimal expected reward, since this type of exploratory behavior will be suboptimal. An adaptive decision process is one which attempts to seek an optimal balance between exploratory behavior and seeking to maximize reward based on current model estimates. Our theory can be used to define, for Markov decision processes, an exploration rate, and then to show that the optimal exploration rate decreases in proportion to $t^{-1/3}$, resulting in a process in which regret (difference between optimal and achieved reward) converges to zero at a rate of $t^{-1/3}$, as distinct from a rate of $t^{-1/2}$ associated with estimation alone. The theory extends naturally to sequential clinical trials.
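The approximate iteration $v(k+1) = Tv(k) + u(k)$ from the first section of this abstract can be tried numerically. The following is a minimal sketch under assumptions that are not part of the talk: a scalar affine contraction $T(v) = rv + 1$ with $r = 0.5$, and a geometrically shrinking error sequence $u(k) = 0.1 \cdot 0.8^k$. The function name is hypothetical.

```python
def approximate_iteration(T, v0, errors):
    """Run v(k+1) = T(v(k)) + u(k) and return the whole path v(0), v(1), ..."""
    v = v0
    path = [v]
    for u in errors:
        v = T(v) + u  # exact operator plus the k-th approximation error
        path.append(v)
    return path


# Assumed toy setting: contraction constant r = 0.5, fixed point w = 1 / (1 - r) = 2.
r = 0.5
T = lambda v: r * v + 1.0
w = 1.0 / (1.0 - r)

# Approximation errors that shrink geometrically, more slowly than r**k.
errors = [0.1 * 0.8**k for k in range(60)]

exact_path = approximate_iteration(T, 0.0, [0.0] * 60)
approx_path = approximate_iteration(T, 0.0, errors)

# Distance to the fixed point at each iteration, for comparison with
# max(r**k, |u(k)|).
gaps = [abs(v - w) for v in approx_path]
```

In this toy case the slower of the two geometric sequences, here the error sequence $u(k)$, governs how fast the gap to the fixed point closes, which is the behavior the stated $\max(r^k, |u(k)|)$ rate describes.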

*This is joint work with Edilson F. Arruda and Jason LaCombe.*

Thursday, October 4, 2007

12:30 PM

Biostatistics Conference Room

### Spring 2007 Biostatistics Brown Bag Seminar Abstracts

**Where Do We Stand in Microarray Data Analysis? Lessons of the Past and Hopes for the Future**

*Andrei Yakovlev, Ph.D.*

*University of Rochester*

This presentation discusses numerous pitfalls in the analysis of microarray gene expression data. The current state of the art in this area is far from satisfactory. Many misconceptions still dominate the literature on microarray data analysis. An overview of the most common misconceptions will be given and some constructive alternatives will be proposed. In particular, I will present a new method designed to select differentially expressed genes in non-overlapping gene pairs. This method offers two distinct advantages: (1) it leads to dramatic gains in terms of the mean numbers of true and false discoveries, as well as in stability of the results of testing; (2) its outcomes are entirely free from the log-additive array-specific technical noise.

Thursday, May 17, 2007

11:30 AM

Room 2-6408 (K-207) Medical Center

**Integrating Quantitative/Computational Sciences for Biomedical Research**

*Hulin Wu, Ph.D.*

Our Division (Division of Biomedical Modeling and Informatics) was formed two years ago. Since we moved to a remote location, our communication and interaction with our Department have not been as frequent as before. In this talk, I will give an overview of the research of our Division in order to promote more interactions and collaborations with other faculty and students in our Department. I will also share our experience of how to do 100% of our own research while doing 100% collaboration and consulting. Some tips on how to find more time to do our "own" research will be given.

Our Division is formed to integrate quantitative (statistics, mathematics, engineering, physics etc.) and computational sciences (computer sciences and biomedical informatics) to do biomedical research. In this new era of high technologies, many new quantitative and computational sciences have evolved from various disciplines to become major tools for biomedical research. These include biostatistics, biomathematics, bioinformatics, biomedical informatics, computational biology, mathematical biology and theoretical biology, biophysics, bioengineering etc. This also brings a great opportunity for biometrical scientists to integrate the various quantitative/computational methodologies and techniques to support biomedical discoveries and research. Our Division, collaborating with biomedical investigators, is currently working on development of mathematical models, statistical methods, computer simulation systems, software packages, informatics tools and data management systems for HIV infections, AIDS clinical studies, influenza infections and immune response to infectious diseases. In this talk, I will discuss our experience of interactions and collaborations among biostatisticians, biomathematicians, biophysicists, bioengineers and biocomputing scientists as well as biomedical investigators. In particular, I will review the three components: (1) mathematical models for HIV viral fitness experiments, AIDS clinical biomarker data, immune response to influenza A virus infections; (2) statistical methods for biomedical dynamic (differential equation) models; (3) user-friendly computer simulation and estimation software. Finally I will discuss some challenges and opportunities for biometrical scientists in biomedical research.

**Bayesian Multiple Outcomes Models and the Seychelles Data**

*Sally W. Thurston, Ph.D.*

Understanding the relationship between prenatal mercury exposure and neurodevelopment in children is of great interest to many practitioners. Typically, analyses rely on separate models fit to each outcome. If the effect of exposure is very similar across outcomes, separate models lack power to detect a common exposure effect. Furthermore, the outcomes cluster into broad domains and domain-specific effects are also of interest. We fit a Bayesian model which allows the mercury effect to vary across outcomes, while allowing for shrinkage of these effects within domains, and to a lesser extent between domains. We will discuss the benefits and challenges of fitting this model within a Bayesian framework, and apply the model to multiple outcomes measured in children at 9 years of age in the Seychelles. This is work in progress, and is joint with David Ruppert at Cornell University.

**An Introduction to Adobe Contribute and Blackboard Academic Suite**

*Chris Beck, Ph.D.*, and

*Rebekka Cranmer, Senior Web Developer, Web Services Department*

Learn how to create new web pages and edit existing ones with Adobe's Contribute. You will learn how to add images and text to a page, as well as edit images and create PDFs. Additionally, you will explore the page review and publishing features. Using Contribute you will be able to easily create content and publish the content to the URMC live Web server.

In the second half of this brown-bag seminar, the Blackboard Academic Suite will be introduced. Blackboard is a secure online course management tool that is used to facilitate learning objectives, assessment, and information exchange between instructors and students. It can also be used for secure information exchange within an organization or other group of people at the University of Rochester. A brief tutorial and demonstration of the software aimed at course instructors and organization leaders will be presented.

### Fall 2006 Biostatistics Brown Bag Seminar Abstracts

**Correlation Analysis for Longitudinal Data**

*Wan Tang, Ph.D.*

Correlation analysis is widely used in biomedical and psychosocial research to evaluate quality of outcomes and to assess instrument and rater reliability. For continuous outcomes, the product-moment correlation and the associated Pearson estimate are the most popular in applications. Although asymptotic distributions of the Pearson estimates are available for multivariate outcomes, they only apply to complete data. As longitudinal study designs become increasingly popular, missing data is commonplace in most trials and cohort studies. In this talk, we propose new product-moment estimates to extend the Pearson estimates to address missing data within a longitudinal data setting. We discuss non-parametric inference under both the missing completely at random (MCAR) and missing at random (MAR) assumptions. Inference under MAR is quite complex in general and we consider several special cases that not only reduce the complexity but also apply to most real studies. The approach is illustrated with real study data in psychosocial research.

**Bayesian Network as a Model of Biological Network**

*Peter Salzman, Ph.D.*

A Bayesian network is a graphical representation of a multivariate distribution. This representation, applied to gene expression data, can be useful for understanding the direct and indirect interactions between genes/gene products (proteins). In this talk I'll address two issues related to Bayesian network models. The estimation/reconstruction of a network from data is a computationally intensive process, as the space of possible models is superexponential in the number of genes. In the first part of this talk I'll describe an algorithm that operates on the space of rankings, which is 'only' exponential in the number of genes.

In the second part of the talk I'll propose a procedure that tests whether a collection of genes loosely defined as a pathway is differentially expressed under two conditions. It is based on first reconstructing the network for each condition and then comparing the two networks. I'll present results for simulated and real biological data to demonstrate the applicability of the method.

**Adverse Effects of Intergene Correlations in Microarray Data Analysis**

*Xing Qiu, Ph.D.*

In the field of microarray data analysis, a common task is to find those genes that are differentially expressed in two groups of patients. Inter-gene stochastic dependence plays a critical role in the methods of such statistical inference. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify many methodologies that resort to pooling test statistics across genes. In this talk, I present two popular methods of this kind, namely the empirical Bayes methodology and a procedure introduced by Storey et al. which depends on the estimation of the false discovery rate. Then I provide some empirical evidence to demonstrate that these methods suffer considerably from such pooling, exhibiting high variability and a lack of consistency.

**Causal Comparisons in Randomized Trials of Two Active Treatments: The Effect of Supervised Exercise to Promote Smoking Cessation**

*Jason Roy, Ph.D.*

In behavioral medicine trials, such as smoking cessation trials, two or more active treatments are often compared. Noncompliance by some subjects with their assigned treatment poses a challenge to the data analyst. Causal parameters of interest might include those defined by subpopulations based on their potential compliance status under each assignment, using the principal stratification framework (e.g., causal effect of new therapy compared to standard therapy among subjects that would comply with either intervention). Even if subjects in one arm do not have access to the other treatment(s), the causal effect of each treatment typically can only be identified from the outcome, randomization and compliance data within certain bounds. We propose to use additional information – compliance-predictive covariates – to help identify the causal effects. Our approach is to specify marginal compliance models conditional on covariates within each arm of the study. Parameters from these models can be identified from the data. We then link the two compliance models through an association model that depends on a parameter that is not identifiable, but has a meaningful interpretation; this parameter forms the basis for a sensitivity analysis. We demonstrate the benefit of utilizing covariate information in both a simulation study and in an analysis of data from a smoking cessation trial.

### Spring 2006 Biostatistics Brown Bag Seminar Abstracts

**A Nonparametric Model for Bivariate Distributions Based on Diagonal Copulas**

*Sungsub Choi, Ph.D.,*

*Department of Mathematics,*

*Pohang University of Science and Technology*

A useful approach in constructing multivariate distributions is based on copula functions, and, in particular, Archimedean copulas have been in wide use. The talk will introduce a new class of copulas based on convex diagonal functions, and explore their distributional properties. Several examples of parametric diagonal copulas will be given. We will then explore ways of extending the approach to construct multivariate proportional hazards models.

**Motion Tracking in Wireless Networks Using Artificial Triangulation**

*Anthony Almudevar, Ph.D.*

One important problem in the application of wireless networks is the location of a mobile node Tx based on the received signal strength (RSS) at a fixed configuration of receivers of a radio frequency signal transmitted by Tx. Because the RSS is inversely related to transmission distance, the distance of Tx from each receiver can be determined, and its location established by geometric triangulation, as long as at least three well-spaced receivers are used.
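The geometric step described above, recovering a position from distances to three receivers, can be sketched as follows. This is an idealized illustration and not the artificial triangulation method of the talk: it assumes the receiver coordinates are known and that the distances have already been derived from RSS without error. The function name and the coordinates are hypothetical.

```python
from math import hypot


def trilaterate(receivers, distances):
    """Locate a point in the plane from its distances to three known receivers.

    Subtracting the first circle equation from the other two removes the
    quadratic terms and leaves a 2x2 linear system in (x, y).
    """
    (x1, y1), (x2, y2), (x3, y3) = receivers
    d1, d2, d3 = distances

    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2

    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        # Collinear receivers do not determine a unique position.
        raise ValueError("receivers are collinear")

    # Cramer's rule for the 2x2 system.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Assumed layout: three receivers at the corners of a room, transmitter at (3, 4).
receivers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_position = (3.0, 4.0)
distances = [hypot(true_position[0] - rx, true_position[1] - ry) for rx, ry in receivers]
estimate = trilaterate(receivers, distances)
```

The collinearity check reflects the requirement that the receivers be well spaced: when they lie on a line, the linear system is singular and no unique location exists.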

The use of such wireless networks provides a convenient method of collecting a longitudinal record of motion for patients susceptible to dementia. This can provide an objective method for the real-time monitoring of noncognitive symptoms of dementia such as restlessness, pacing, wandering, changes in sleep patterns, changes in circadian rhythm or specific changes in daily routine. However, the calibration of the RSS to transmission distance relationship is complicated by the presence of obstacles, particularly in an indoor setting. The relationship depends strongly on the geometric configuration of walls and other large obstacles, the proximity of high voltage devices such as microwave ovens and televisions, as well as the orientation of any person wearing such a transmitter.

I will present as an interim solution a method of mapping RSS measurements onto a two-dimensional plane which preserves the topological and directional properties of any trajectory of Tx without requiring precise knowledge of the receiver configuration or the RSS to transmission distance relationship. The method works by imposing an artificial triangulation on suitably transformed RSS measurements. Such a representation will suffice to capture the essential features of patient motion. In particular, locations which are frequently occupied (favorite chair, kitchen, etc.) can be identified with sufficient data, leading to the construction of a ‘living space network’ through an unsupervised learning process. The network can be later validated or annotated.

The methodology will be illustrated using data collected under a study funded by an Everyday Technologies for Alzheimer Care (ETAC) research grant from the Alzheimer's Association, using monitoring equipment provided by Home Free Systems and GE Global Research. This is joint work with Dr. Adrian Leibovici and the Center for Future Health, University of Rochester.

**Testing Equality of Ordered Means in the General Linear Model**

*Michael McDermott, Ph.D.*

Hypothesis testing problems involving order constrained means arise frequently in practice. The standard approach to this problem in the one-way layout is the likelihood ratio test. In many practical settings, such as a randomized controlled trial, it is useful to include covariates in the primary statistical model. Likelihood ratio tests for equality of ordered means that incorporate covariate adjustment are quite complex and are rarely applied in practice because of difficulties in their implementation. In this paper, a test is proposed that is based on multiple contrasts among the adjusted group means. The p-values associated with these contrasts are, in general, dependent. An overall significance test is carried out using Fisher’s statistic to combine the dependent p-values arising from these contrasts; the null distribution of this statistic can be well approximated by that of a scaled chi-square random variable. The contrasts can be chosen to yield a test with high power, for alternatives at a fixed distance from the null hypothesis, throughout the restricted parameter space. The test is generally easy to implement for a variety of partial order restrictions. An example from a randomized clinical trial is used to illustrate the proposed test.
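For reference, Fisher's statistic for combining $k$ p-values is $-2\sum_i \log p_i$. When the p-values are independent, its null distribution is chi-square with $2k$ degrees of freedom. The sketch below covers only that independent baseline, which is an assumption made for illustration: the test described in the abstract deals with dependent p-values through a scaled chi-square approximation whose scaling is not given here. Function names are hypothetical.

```python
from math import exp, log


def fisher_statistic(p_values):
    """Fisher's combination statistic: -2 times the sum of log p-values."""
    return -2.0 * sum(log(p) for p in p_values)


def chi2_sf_even_df(x, k):
    """P(X > x) for a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom the survival function has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)**j / j!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return exp(-half) * total


def fisher_combined_p(p_values):
    """Combined p-value, valid only when the input p-values are independent."""
    return chi2_sf_even_df(fisher_statistic(p_values), len(p_values))


# Illustrative values only.
combined = fisher_combined_p([0.05, 0.05])
```

With dependent contrasts, as in the abstract, the same statistic is used but the reference distribution changes, so the independent-case p-value computed here would generally not be appropriate.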

### Fall 2005 Biostatistics Brown Bag Seminar Abstracts

**Rule-based Modeling of Signaling by Epidermal Growth Factor Receptor**

*Michael L. Blinov*

Theoretical Biology and Biophysics Group,

Los Alamos National Laboratory, Los Alamos, NM

Signal transduction networks often exhibit combinatorial complexity: the number of protein complexes and modification states that potentially can be generated during the response to a signal is large, because signaling proteins contain multiple sites of modification and interact with multiple binding partners. The conventional approach of manually specifying each term of a mathematical model is impossible. To avoid this problem, modelers often make assumptions to limit the number of species, but these are usually poorly justified. As an alternative, we have developed an approach to represent biomolecular interactions as rules specifying activities, potential modifications and interactions of the domains of signaling molecules [Hlavacek et al. (2003) Biotech. Bioeng.] Rules are evaluated automatically to generate the reaction network. This approach is implemented in BioNetGen software [Blinov et al. (2004) Bioinformatics; Blinov et al. (in press) LNCS]. To illustrate this approach, we have developed a model of early events in signaling by the epidermal growth factor (EGF) receptor (EGFR), which includes EGF, EGFR, the adapter proteins Grb2 and Shc, and the guanine nucleotide exchange factor Sos [Blinov et al. (2005) BioSystems]. These events can potentially generate a diversity of protein complexes and phosphoforms; however, this diversity has been largely ignored in computational models of EGFR signaling. The model predicts the dynamics of 356 molecular species connected through 3,749 reactions. This model is compared with a previously developed model [Kholodenko et al. (1999) JBC] that incorporates the same protein-protein interactions but is based on several restrictive assumptions and thus includes only 18 molecular species involved in Sos activation. The new model is consistent with experimental data and yields new predictions without requiring new parameters. 
The model predicts distinct temporal patterns of phosphorylation for different tyrosines of EGFR, distinct reaction paths for Sos activation, a large number of distinct protein complexes at short times, and signaling by receptor monomers. Comparing the two models helps design experiments to test hypotheses, e.g., genetic mutation blocking Shc-dependent pathways helps to distinguish between competitive and non-competitive mechanisms of adapter proteins binding.

**Stochastic Curtailment in Multi-Armed Trials**

*Xiaomin He*

Stochastically curtailed procedures in multi-armed trials are complicated due to repeated significance testing and multiple comparisons. From either frequentist or Bayesian viewpoints, there exists some dependence among pairwise test statistics. Investigators must consider such dependence when testing homogeneity of treatments. This paper studies the property of canonical multivariate joint distribution of test statistics in multi-armed trials. Pairwise and global monitoring are suggested based on this property. In pairwise monitoring, the Hochberg step-up procedure is recommended to strongly control the overall significance level. In global monitoring, the conditional and predictive power are calculated based on current multivariate test statistics, which reflect the dependence among pairwise test statistics. Futility monitoring in multi-armed trials is also considered. Simulation results in multi-armed trials show that, compared with the traditional group sequential and non-sequential procedures, stochastic curtailment has advantages in sample size, time and cost. An example concerning a proposed study of Coenzyme Q$_{10}$ in early Parkinson Disease is given.
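The Hochberg step-up procedure recommended above for pairwise monitoring can be written in a few lines. This is a generic sketch of the standard procedure for a single set of p-values, not the sequential monitoring scheme of the talk; the function name and the example p-values are illustrative assumptions.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up procedure.

    With sorted p-values p(1) <= ... <= p(m), find the largest rank k such
    that p(k) <= alpha / (m - k + 1) and reject the k hypotheses with the
    smallest p-values. Returns a list of booleans in the input order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    # Step up: start from the largest p-value and move toward the smallest.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        if p_values[idx] <= alpha / (m - rank + 1):
            for j in order[:rank]:
                reject[j] = True
            break
    return reject


# Illustrative pairwise p-values for three comparisons.
decisions = hochberg([0.01, 0.04, 0.03], alpha=0.05)
```

In this example the largest p-value, 0.04, is already below 0.05, so all three hypotheses are rejected at the first step; a step-down procedure such as Holm's would reject only the first hypothesis for the same input.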

**Power Analysis for Correlations from Clustered Study Designs**

*Xin Tu*

Power analysis constitutes an important component of modern clinical trials and research studies. Although a variety of methods and software packages are available, they are primarily focused on regression models, with little attention paid to correlation analysis. However, the latter is a simpler and more appropriate approach for modeling association between correlated variables that measure a common (latent) construct using different scales, different assessment methods and different raters, as arise in psychosocial and other health-care related research areas. A major difficulty for performing power analysis is how to deal with the excessive number of parameters in the distributions of the correlation estimates, many of which are nuisance parameters. In addition, as missing data patterns are unpredictable and dynamic before a study is realized, their effect must also be addressed when performing power analysis, which further complicates the analytic problems. With no real data to estimate the parameters and missing data patterns as in most real study applications, it is difficult to proceed with estimation of power and sample size for correlation analysis for a real study. In this talk, we discuss how to eliminate nuisance parameters and model missing data patterns to effectively address these issues. We illustrate our approaches with both real and simulated data.

This is joint work with Paul Crits-Christoph (University of Pennsylvania), Changyong Feng (University of Rochester), Robert Gallop (University of Pennsylvania) and Jeanne Kowalski (Johns Hopkins University).

**Branching Processes, Generation, and Applications**

*Ollivier Hyrien*

I will first present results on the distribution of the generation in a Bellman-Harris branching process starting with a single cell. Approximate expressions for this distribution have been described in the literature, and I will present an exact expression. As an application, I will give an explicit expression for the distribution of the age in the considered setting. The results are illustrated using a Markov process.

The second part of my talk will focus on the statistical analysis of CFSE-labeling experiments, a bioassay frequently used by biologists to study cell proliferation. The data generated by this assay are dependent, a feature that has never been mentioned in the literature. The dependency structure is quite complex, making it impossible to use the method of maximum likelihood. I propose three estimation techniques, and present their asymptotic and finite sample properties. An application to T lymphocytes will also be given.

**Similarity Searches in Genome-wide Numerical Data Sets**

*Galina Glazko*

Stowers Institute for Medical Research

Many types of genomic data are naturally represented as multidimensional vectors. A frequent purpose of genome-scale data analysis is to uncover subsets of the data that are related by a similarity of some sort. One way to do this is to compute the distances between vectors. The major question here is how to choose the distance measure when several are available. First, we consider the problem of functional inference using phyletic patterns. Phyletic patterns denote the presence and absence of orthologous genes in completely sequenced genomes, and are used to infer functional links, on the assumption that genes involved in the same pathway or functional system are co-inherited by the same set of genomes. I demonstrate that the use of an appropriate distance measure and clustering algorithm increases the sensitivity of the phyletic pattern method; however, the method itself has a limit of applicability caused by differential gains, losses, and displacements of orthologous genes. Second, we study the characteristic properties of various distance measures and their performance in several tasks of genome analysis. Most distance measures between binary vectors turn out to belong to a single parametric family, namely generalized average-based distances with different exponents. I show that descriptive statistics of the distance distribution, such as skewness and kurtosis, can guide the appropriate choice of the exponent. In contrast, the more familiar distance properties, such as being a metric and additivity, appear to have much less effect on the performance of distances. Third, we discuss a new approach to local clustering based on iterative pattern matching, and apply it to identify potential malaria vaccine candidates in the *Plasmodium falciparum* transcriptome.

**Partially Linear Models and Related Topics**

*Hua Liang*

In this brown-bag seminar, I will present the state of the art in partially linear models, with a particular focus on several special topics such as error-prone covariates, missing observations, and checking of the nonlinear component. Extensions to more general models will be discussed. Applications of this work in biology, economics, and nutrition will be mentioned. The talk covers a series of my publications in the Annals of Statistics, JASA, Statistica Sinica, and Statistical Methods in Medical Research, as well as more recent submissions.

### Spring 2005 Biostatistics Brown Bag Seminar Abstracts

**Estimating Incremental Cost-Effectiveness Ratios and Their Confidence Intervals with Differentially Censored Data**

*Hongkun Wang and Hongwei Zhao*

With medical costs escalating over recent years, cost analysis is being conducted more and more often to assess the economic impact of new treatment options. An incremental cost-effectiveness ratio is a measure of the additional cost of a new treatment per year of life saved. In this talk, we consider cost-effectiveness analysis for new treatments evaluated in a randomized clinical trial setting with staggered entries. In particular, the censoring times are different for cost and survival data. We propose a method for estimating the incremental cost-effectiveness ratio and obtaining its confidence interval when differential censoring exists. Simulation experiments are conducted to evaluate our proposed method. We also apply our methods to a clinical trial example comparing the cost-effectiveness of implanted defibrillators with conventional therapy for individuals with reduced left ventricular function after myocardial infarction.
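
For orientation, the ratio described above is usually written as follows. The symbols are illustrative assumptions, not notation from the talk: $\mu_{C,k}$ is the mean cost and $\mu_{E,k}$ the mean survival time (in life years) in arm $k$, with $k = 1$ for the new treatment and $k = 0$ for the comparator.

$$
\mathrm{ICER} = \frac{\mu_{C,1} - \mu_{C,0}}{\mu_{E,1} - \mu_{E,0}}
$$

With censored data, neither numerator nor denominator can be estimated by simple sample means, which is the difficulty the proposed method addresses.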

**Regression Analysis of ROC Curves and Surfaces**

*Christopher Beck*

Receiver operating characteristic (ROC) curves are commonly used to describe the performance of a diagnostic test in terms of discriminating between healthy and diseased populations. A popular index of the discriminating ability or accuracy of the diagnostic test is the area under the ROC curve. When there are three or more populations, the concept of an ROC curve can be generalized to that of an ROC surface, with the volume under the ROC surface serving as an index of diagnostic accuracy. After introducing the basic concepts associated with ROC curves and surfaces, methods for assessing the effects of covariates on diagnostic test performance will be discussed. Examples from a recent study organized by the Agency for Toxic Substances and Disease Registry (and conducted here in Rochester) will be presented to illustrate these methods.
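
As a minimal sketch of the area-under-the-curve index mentioned above, the empirical AUC can be computed as the Mann-Whitney proportion of (healthy, diseased) pairs in which the diseased subject has the higher test value, with ties counted as one half. The function name and the toy scores are hypothetical, and the sketch assumes higher values indicate disease; it does not cover covariate adjustment or ROC surfaces.

```python
from itertools import product


def empirical_auc(healthy, diseased):
    """Mann-Whitney estimate of P(diseased score > healthy score).

    Ties contribute 1/2. Assumes higher scores indicate disease.
    """
    if not healthy or not diseased:
        raise ValueError("both groups need at least one observation")
    wins = 0.0
    for h, d in product(healthy, diseased):
        if d > h:
            wins += 1.0
        elif d == h:
            wins += 0.5
    return wins / (len(healthy) * len(diseased))


# Hypothetical test values, for illustration only.
healthy_scores = [0.1, 0.4, 0.35, 0.8]
diseased_scores = [0.9, 0.6, 0.7, 0.3]
auc = empirical_auc(healthy_scores, diseased_scores)
print(f"empirical AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to a test with no discriminating ability, and 1.0 to perfect separation of the two populations.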

**Constructing Prognostic Gene Signatures for Cancer Survival**

*Derick Peterson*

Modern micro-array technologies allow us to simultaneously measure the expressions of a huge number of genes, some of which are likely to be associated with cancer survival. While such gene expressions are unlikely to ever completely replace important clinical covariates, evidence is already beginning to mount that they can provide significant additional predictive information. The difficult task is to search among an enormous number of potential predictors and to correctly identify most of the important ones, without mistakenly identifying too many spurious associations. Many commonly used screening procedures unfortunately over-fit the training data, leading to subsets of selected genes that are unrelated to survival in the target population, despite appearing associated with the outcome in the particular sample of data used for subset selection. And some genes might only be useful when used in concert with certain other genes and/or with clinical covariates, yet most available screening methods are inherently univariate in nature, based only on the marginal associations between each predictor and the outcome. While it is impossible to simultaneously adjust for a huge number of predictors in an unconstrained way, we propose a method that offers a middle ground where some partial adjustments can be made in an adaptive way, regardless of the number of candidate predictors.

**A New Test Statistic for Testing Two-Sample Hypotheses in Microarray Data Analysis**

*Yuanhui Xiao*

We introduce a test statistic intended for use in nonparametric testing of the two-sample hypothesis with the aid of resampling techniques. This statistic is constructed as an empirical counterpart of a certain distance measure *N* between the distributions *F* and *G* from which the samples under study are drawn. The distance measure *N* can be shown to be a probability metric. In two-sample comparisons, the null hypothesis *F* = *G* is formulated as H0 : *N* = 0. In a computer experiment, where gene expressions were generated from a log-normal distribution, while departures from the null hypothesis were modeled via scale transformations, the permutation test based on the distance *N* appeared to be more powerful than the one based on the commonly used *t*-statistic. The proposed statistic is not distribution free so that the two-sample hypothesis *F* = *G* is composite, i.e., it is formulated as H0 : *F(x) = H(x), G(x) = H(x)* for all *x* and some *H(x)*. The question of how the null distribution *H* should be modeled arises naturally in this situation. For the *N*-statistic, it can be shown that a specific resampling procedure (resampling analog of permutations) provides a rational way of modeling the null distribution. More specifically, this procedure mimics the sampling from a null distribution *H* which is, in some sense, the "least favorable" for rejection of the null hypothesis. No statement of such generality can be made for the *t*-statistic. The usefulness of the proposed statistic is illustrated with an application to experimental data generated to identify genes involved in the response of cultured cells to oncogenic mutations.

**The Effects of Normalization on the Correlation Structure of Microarray Data**

*Xing Qiu, Andrew I. Brooks, Lev Klebanov, and Andrei Yakovlev*

Stochastic dependence between gene expression levels in microarray data is of critical importance for the methods of statistical inference that resort to pooling test statistics across genes. It is frequently assumed that dependence between genes (or tests) is sufficiently weak to justify the proposed methods of testing for differentially expressed genes. A potential impact of between-gene correlations on the performance of such methods has yet to be explored. We present a systematic study of correlation between the t-statistics associated with different genes. We report the effects of four different normalization methods using a large set of microarray data on childhood leukemia in addition to several sets of simulated data. Our findings help decipher the correlation structure of microarray data before and after the application of normalization procedures. A long-range correlation in microarray data manifests itself in thousands of genes that are heavily correlated with a given gene in terms of the associated t-statistics. The application of normalization methods may significantly reduce correlation between the t-statistics computed for different genes. However, such procedures are unable to completely remove correlation between the test statistics. The long-range correlation structure also persists in normalized data.

**Estimating Complexity in Bayesian Networks**

*Peter Salzman*

Bayesian networks are commonly used to model complex genetic interaction graphs in which genes are represented by nodes and interactions by directed edges. Although a likelihood function is usually well defined, the maximum likelihood approach favors networks with high model complexity. To overcome this, we propose a two-step algorithm to learn the network structure. First, we estimate model complexity. This requires finding the MLE conditional on model complexity and then applying Bayesian updating, which results in an informative prior density on complexity. This is accomplished using simulated annealing to solve a constrained optimization problem on the graph space. In the second step, we use an MCMC algorithm to construct a posterior density of gene graphs that incorporates the information obtained in the first step. Our approach is illustrated by an example.

**A New Approach to Testing for Sufficient Follow-up in Cure-Rate Analysis**

*Lev Klebanov and Andrei Yakovlev*

The problem of sufficient follow-up arises naturally in the context of cure rate estimation. This problem was brought to the fore by Maller and Zhou (1992, 1994) in an effort to develop nonparametric statistical inference based on a binary mixture model. The authors proposed a statistical test to help practitioners decide whether or not the period of observation has been long enough for this inference to be theoretically sound. The test is inextricably entwined with estimation of the cure probability by the Kaplan-Meier estimator at the point of last observation. While intuitively compelling, the test by Maller and Zhou does not provide a satisfactory solution to the problem because of its unstable and non-monotonic behavior when the duration of follow-up increases. The present paper introduces an alternative concept of sufficient follow-up allowing derivation of a lower bound for the expected proportion of immune subjects in a wide class of cure models. By building on the proposed bound, a new statistical test is designed to address the issue of the presence of immunes in the study population. The usefulness of the proposed approach is illustrated with an application to survival data on breast cancer patients identified through the NCI Surveillance, Epidemiology and End Results Database.

**Assessment of Diagnostic Tests in the Presence of Verification Bias**

*Michael McDermott*

Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in a random sample from the intended population to which the test will be applied. In many studies, however, verification of the true disease status is performed only for a subset of the sample. This may be the case, for example, if ascertainment of the true disease status is invasive or costly. Often, verification of the true disease status depends on the result of the diagnostic test and possibly other characteristics of the subject (e.g., only subjects judged to be at higher risk of having the disease are verified). If sensitivity and specificity are estimated using only the information from the subset of subjects for whom both the test result and the true disease status have been ascertained, these estimates will typically be biased. This talk will review some methods for dealing with the problem of verification bias. Some new approaches to the problem will also be introduced.

**Estimation of Causal Treatment Effects from Randomized Trials with Varying Levels of Non-Compliance**

*Jason Roy*

Data from randomized trials with non-compliance are often analyzed with an intention-to-treat (ITT) approach. However, while ITT estimates may be of interest to policy-makers, estimates of causal treatment effects may be of more interest to clinicians. For the simple situation where treatment and compliance are binary (yes/no), instrumental variable (IV) methods can be used to estimate the average causal effect of treatment among those who would comply with treatment assignment. When there are more than two compliance levels (e.g., non-compliance, partial compliance, full compliance), however, these IV methods cannot identify the compliance-level causal effects without strong assumptions. We consider likelihood-based methods for dealing with this problem. The research was motivated by a study of the effectiveness of a disease self-management program in reducing health care utilization among older women with heart disease. This is work in progress.
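
For the binary case described above, a minimal sketch of the standard IV (Wald) estimator is the ITT effect on the outcome divided by the ITT effect on treatment receipt. This is the textbook estimator, not the likelihood-based method of the talk; the function name and the toy data are hypothetical, and the causal interpretation assumes randomization, the exclusion restriction, and monotonicity (no defiers).

```python
def wald_complier_effect(assigned, received, outcome):
    """Wald IV estimate of the average causal effect among compliers.

    assigned: 0/1 randomized assignment (the instrument)
    received: 0/1 treatment actually received
    outcome:  numeric outcome
    """
    def mean(values):
        return sum(values) / len(values)

    y1 = [y for z, y in zip(assigned, outcome) if z == 1]
    y0 = [y for z, y in zip(assigned, outcome) if z == 0]
    d1 = [d for z, d in zip(assigned, received) if z == 1]
    d0 = [d for z, d in zip(assigned, received) if z == 0]
    if not y1 or not y0:
        raise ValueError("both assignment arms must be non-empty")

    itt_outcome = mean(y1) - mean(y0)   # ITT effect on the outcome
    itt_receipt = mean(d1) - mean(d0)   # difference in treatment uptake
    if itt_receipt == 0:
        raise ValueError("assignment does not change treatment receipt")
    return itt_outcome / itt_receipt


# Hypothetical trial: one of four subjects assigned to treatment does not comply.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
received = [1, 1, 1, 0, 0, 0, 0, 0]
outcome = [5, 6, 7, 2, 2, 3, 2, 1]
effect = wald_complier_effect(assigned, received, outcome)
print(f"estimated complier average causal effect = {effect}")
```

In this toy example the ITT effect is 3.0 and the uptake difference is 0.75, so the estimate among compliers is larger than the ITT effect, as expected when some assigned subjects go untreated.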

**Statistical Inference for Branching Processes**

*Nikolay Yanev*

It is well known that branching processes have many applications in biology. In this talk the asymptotic behavior of branching populations having an increasing and random number of ancestors is investigated. An estimation theory will be developed for the mean, variance and offspring distributions of the process $\{Z_{t}(n)\}$ with random number of ancestors $Z_{0}(n)$, as both $n$ (and thus $Z_{0}(n)$, in some sense) and $t$ approach infinity. Nonparametric estimators are proposed and shown to be consistent and asymptotically normal. Some censored estimators are also considered. It is shown that all results can be transferred to branching processes with immigration, under an appropriate sampling scheme. A system for simulation and estimation of branching processes will be demonstrated.

No preliminary knowledge in this field is assumed.
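
To make the setting concrete, the following sketch simulates a discrete-time (Galton-Watson) branching process started from $Z_0$ ancestors and estimates the offspring mean with the classical Harris ratio estimator, total children divided by total parents. This is an illustrative assumption and not necessarily the estimator developed in the talk; the function names, offspring distribution, and seed are hypothetical.

```python
import random


def simulate_generations(n_ancestors, offspring_probs, n_generations, rng):
    """Return generation sizes [Z_0, ..., Z_T] of a Galton-Watson process.

    offspring_probs[k] is the probability that an individual has k children.
    """
    counts = list(range(len(offspring_probs)))
    sizes = [n_ancestors]
    for _ in range(n_generations):
        parents = sizes[-1]
        # Each parent reproduces independently; an extinct line stays at 0.
        children = sum(rng.choices(counts, weights=offspring_probs, k=parents))
        sizes.append(children)
    return sizes


def harris_mean(sizes):
    """Harris estimator of the offspring mean: (Z_1+...+Z_T) / (Z_0+...+Z_{T-1})."""
    total_parents = sum(sizes[:-1])
    if total_parents == 0:
        raise ValueError("no parents observed")
    return sum(sizes[1:]) / total_parents


rng = random.Random(2005)          # fixed seed so the run is reproducible
offspring_probs = [0.25, 0.25, 0.5]  # true offspring mean = 1.25
sizes = simulate_generations(2000, offspring_probs, 5, rng)
estimate = harris_mean(sizes)
print("generation sizes:", sizes)
print(f"estimated offspring mean = {estimate:.3f}")
```

With many ancestors the estimate should lie close to the true mean of 1.25, which mirrors the asymptotic regime in which the number of ancestors grows.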

**Modeling of Stochastic Periodicity: Renewal, Regenerative and Branching Processes**

*Nikolay Yanev*

Chair, Department of Probability and Statistics, Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria

In deterministic processes, periodicity is usually well defined. However, in the stochastic case there are many possible models. One way to study stochastic periodicity is proposed in this lecture. The models are based on alternating renewal and regenerative processes. The limiting behavior is investigated, with special attention given to the case of periods of regeneration with infinite mean. Two applications to branching processes are considered: Bellman-Harris branching processes with state-dependent immigration and discrete-time branching processes with random migration.

The main purpose of the talk is to describe stochastic models which can be applied in Biology, especially Epidemiology and Biotechnology.

No preliminary knowledge in this field is assumed.

**Testing Approximate Statistical Hypotheses**

*Y. N. Tyurin*

Moscow State University

Statistical hypotheses often take the form of statements about some properties of functionals of probability distributions. Usually, according to a hypothesis, the functionals in question have certain exact values. Many of the classical statistical hypotheses are of this form: the hypothesis about the mathematical expectation of a normal sample (one-dimensional or multidimensional); the hypothesis about probabilities of outcomes in independent trials (which should be tested based on observed frequencies); the linear hypotheses in Gaussian linear models, etc.

Stated as suppositions about exact values, those hypotheses do not express accurately the thinking of natural scientists. In practice an applied scientist would be satisfied if those or similar suppositions were "correct" in some approximate sense (meaning their approximate agreement with statistical data).

The above-mentioned discrepancy between the applied-science approach and its mathematical expression leads to rejection of any statistical hypothesis given a sufficiently large amount of sample data, a well-known statistical phenomenon.

This talk will show how hypotheses about exact values can be re-stated as rigorously formulated approximate hypotheses and how those can be tested against sample data with special attention given to the hypotheses mentioned above.