Seminar Archives
October 28, 2009
Special Seminar
Deborah Glueck
UColorado Denver
BIAS IN TRIALS COMPARING PAIRED CONTINUOUS TESTS CAN CAUSE RESEARCHERS TO CHOOSE THE WRONG SCREENING MODALITY
Background: To compare the diagnostic accuracy of two continuous screening tests, a common approach is to test the difference between the areas under the receiver operating characteristic (ROC) curves. After study participants are screened with both screening tests, the disease status is determined as accurately as possible, either by an invasive, sensitive and specific secondary test, or by a less invasive, but less sensitive approach. For most participants, disease status is approximated through the less sensitive approach. The invasive test must be limited to the fraction of the participants whose results on either or both screening tests exceed a threshold of suspicion, or who develop signs and symptoms of the disease after the initial screening tests. The limitations of this study design lead to a bias in the ROC curves we call paired screening trial bias. This bias reflects the synergistic effects of inappropriate reference standard bias, differential verification bias, and partial verification bias. The absence of a gold reference standard leads to inappropriate reference standard bias. When different reference standards are used to ascertain disease status, it creates differential verification bias. When only suspicious screening test scores trigger a sensitive and specific secondary test, the result is a form of partial verification bias.
Methods: For paired screening tests with bivariate normally distributed scores, we give formulae and programs to quantify the effect of paired screening trial bias on a paired comparison of area under the curves. We fix the prevalence of disease, and the chance a diseased subject manifests signs and symptoms. We derive the formulas for true sensitivity and specificity, and those for the sensitivity and specificity observed by the study investigator.
Results: The observed area under the ROC curves is quite different from the true area under the ROC curves. The typical direction of the bias is a strong inflation in sensitivity, paired with a concomitant slight deflation of specificity.
Conclusion: In paired trials of screening tests, when area under the ROC curve is used as the metric, bias may lead researchers to make the wrong decision as to which screening test is better.
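As a reminder of the quantity being compared, the area under the ROC curve can be estimated empirically as the probability that a diseased participant scores higher than a non-diseased one (the Mann-Whitney statistic). The sketch below is illustrative only: it is not the speaker's formulae or programs, the function names are hypothetical, and it takes the recorded disease status at face value, which is exactly what the biases described above call into question.

```python
def empirical_auc(diseased_scores, healthy_scores):
    """Empirical AUC: fraction of (diseased, healthy) pairs ranked correctly.

    Ties count as half a correct ranking (Mann-Whitney convention).
    """
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased_scores) * len(healthy_scores))


def paired_auc_difference(scores_a, scores_b, disease_status):
    """AUC of test A minus AUC of test B on the same participants.

    disease_status holds 1 for diseased and 0 for non-diseased, as recorded
    by whatever reference standard the study used (an assumption here).
    """
    def split(scores):
        diseased = [s for s, y in zip(scores, disease_status) if y == 1]
        healthy = [s for s, y in zip(scores, disease_status) if y == 0]
        return diseased, healthy

    return empirical_auc(*split(scores_a)) - empirical_auc(*split(scores_b))
```

If the recorded disease status is wrong for some participants, both empirical AUCs shift, and not necessarily by the same amount, so the sign of the difference can change.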
October 21, 2009
Lior Pachter
UC Berkeley
TRANSCRIPTOME ASSEMBLY AND EXPRESSION ANALYSIS WITH RNA-Seq
Scientists are now beginning to use high-throughput sequencing of cDNA (RNA-Seq) to accurately calculate the abundance of transcripts in a sample such as the total nuclear RNA population of a cell. We can compute gene-level expression by mapping "reads" to the genome and comparing the relative counts. Estimating transcript-level expression presents a much greater challenge, because a read may align to more than one isoform of the same gene, and therefore cannot be unambiguously mapped to a single transcript. In this talk, I will discuss Cufflinks, our recently developed algorithm to infer isoform-level abundances from alignments of RNA-Seq reads to a genome. Previous methods have required a reference annotation against which to compare aligned reads, while Cufflinks requires no annotation. The algorithm assembles a parsimonious set of transcripts that "explain" the observed reads in a sample, and then uses a linear statistical model of sequencing coupled with a numerical optimization procedure to produce a unique maximum likelihood estimate of the relative abundances of these transcripts. Cufflinks is the first software system that performs simultaneous ab initio discovery and quantitation of a transcriptome from RNA-Seq data. We explored the performance of the system with a case study of differentiating mouse myoblasts. Cufflinks assembled a time course of paired 75bp RNA-Seq reads taken at 0, 60, 120, and 168 hours into differentiation. Using a companion tool included with Cufflinks, we compared the assemblies for common isoforms and tracked their abundances across the time course. The comparison revealed thousands of alternatively spliced transcripts present at all time points, including novel genes and novel isoforms of known genes. This is work with Cole Trapnell, Geo Pertea, Ali Mortazavi, Brian Williams, Steven Salzberg, and Barbara Wold.
October 14, 2009
Eric Holmgren, PhD
Genentech, Inc.
A THEORY FOR THE EXTENDED ANALYSIS OF A CLINICAL TRIAL
Every clinical trial has the potential to produce a final result for the primary endpoint that is suggestive of a treatment effect but does not meet the formal criteria for statistical significance. When one is faced with this sort of result, the trial has to be considered negative. What's more, under the frequentist paradigm any further investigation of the primary hypothesis of the trial would require a new study since all of the type 1 error has been used.
In this paper we propose a framework for extending such a trial beyond its final analysis where the total accumulated one sided type 1 error is 0.025. Essentially an extension of a trial is justified as long as the additional risk of falsely concluding effectiveness is matched proportionally by the added benefit of correctly identifying an active drug. The incremental risk and benefit from the study extension must be in the same proportion as the risk and benefit for the original study.
October 13, 2009
Special Seminar
PLEASE NOTE
Time: 4-5pm
Location: Room 6704
Henrik Bengtsson
Department of Statistics, University of California at Berkeley
Enhanced detection of chromosomal events via single-sample preprocessing and integration of copy-number estimates from multiple technologies
Genomic alterations are hallmarks of gene deregulation and genome instability in cancers. High-resolution single-nucleotide polymorphism (SNP) and copy-number (CN) microarrays offer means to identify these. I will present a suite of single-sample statistical inference methods that improves the ability to detect alterations in total and parental CNs. Since these are truly single-sample methods, they are suitable for medical diagnostics of individuals where no reference samples are available. At the other extreme, they have a computational advantage when analyzing very large data sets that grow over time.
CRMA v2 is a standardized preprocessing method for all Affymetrix genotyping and CN arrays, including the recent Cytogenetic array. Using affine models and techniques such as robust smooth splines, CRMA v2 controls for common artifacts that otherwise diminish the signal or introduce false positives, e.g. allelic crosstalk, sequence effects, PCR conditions and waviness.
For paired study designs with matched tumor-normal hybridizations, we have developed a special method named TumorBoost that allows us to better detect loss of heterozygosity (LOH) and changes in parental CN states. TumorBoost adjusts for SNP effects in the allele-specific CNs of the tumor by estimating them using the matched normal. It applies to any SNP array technology, including Affymetrix and Illumina, and is robust against the choice of preprocessing method.
A third method drawing on principal curves was developed for integrating full-resolution CN estimates from a variety of technologies such as Affymetrix, Agilent and Illumina, as well as DNAseq. Multi-source CN normalization (MSCN) is currently used in a large-scale project where four labs use four different platforms to produce CN estimates of the same tissues.
Using standardized CN assessments, we demonstrate that each of these methods increases the power to detect PCN change points, which in turn improves the precision of the estimate of their locations. Combined, they will provide even greater specificity and sensitivity.
We have implemented analytical pipelines for CRMA v2, TumorBoost and MSCN in the widely used open-source aroma.affymetrix and aroma.cn framework.
August 25, 2009
Special Seminar
PLEASE NOTE VENUE CHANGE
Time: 4-5pm
Place: Rm 6702, Lobby 5, China Basin Landing
Thomas Wu
Genentech, Inc.
Detecting complex variants and splicing in next-generation sequencing data with applications to prostate cancer transcriptomes
Next-generation sequencing promises to expand our knowledge of genomes and transcriptomes by generating large volumes of reads from genomic DNA or transcribed RNA. To expand the range of biological phenomena that can be inferred from short reads, I have developed fast and memory-efficient methods for detecting multiple mismatches, long indels, splicing, and combinations thereof. The resulting program GSNAP builds upon prior work in genomic alignment algorithms, namely, the programs GMAP for aligning ESTs and mRNA sequences and PMAP for aligning protein sequences to a genome. Although GSNAP relies upon a well-established hash table scheme, it can nevertheless achieve speeds comparable with programs based on the Burrows-Wheeler Transform, by employing a constrained search methodology. GSNAP also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and alignment of reads from bisulfite-treated DNA for the study of methylation state.
We have used GSNAP and associated tools to analyze transcriptional sequencing data from human prostate adenocarcinomas and matched normal tissues. Expression analysis reveals overexpression of several protein-coding genes and small nucleolar RNAs in tumors. Spliced reads exhibit a variety of alternative and cryptic splicing events. In addition, we find evidence for numerous read-through fusion transcripts involving adjacent genes in both tumor and normal samples, as well as evidence for tumor-specific chromosomal rearrangements. We observe that genotyping using short reads is highly sensitive and accurate, although uneven transcriptional coverage constrains the discovery of single-nucleotide variations.
July 23, 2009
Special Seminar
PLEASE NOTE VENUE CHANGE
Time: 4-5pm
Place: Pottruck Auditorium Rm 102 Rock Hall
Ed Green
Max Planck Institute for Evolutionary Anthropology
Recent human evolution as revealed by the Neandertal genome
Recent technological advances have enabled large-scale retrieval and sequencing of DNA from our closest relatives, the extinct Neandertals. To detect regions of recent positive selection in humans, to better understand our relationship to Neandertals, and to eventually understand Neandertal-specific biology, we recently embarked on a project to sequence the complete Neandertal genome. To achieve this goal, several technological advances were required in recovery and identification of ancient DNA sequence from fossil bones. Having passed a milestone of 1-fold genome coverage, we have begun to analyze these data to address questions about recent human evolution. From these data, we estimate an average Neandertal-human genome divergence of about 800,000 years and a population split time of about 300,000 years. The latter estimate rules out one model of hominid evolution, namely that Neandertals are the descendants of H. heidelbergensis.
Because Neandertals share some of the genetic diversity still extant within human populations, they make an ideal genetic comparison to test for recent positive selection in humans. We are currently analyzing human diversity data to find regions where little or no variation is shared with Neandertals in the hopes of identifying the genetic basis of being fully modern humans.
June 24, 2009
Special Seminar
PLEASE NOTE VENUE CHANGE
Time: 4-5pm
Place: Pottruck Auditorium Rm 102 Rock Hall
Richard Bourgon
European Bioinformatics Institute
High-resolution mapping of meiotic crossovers and non-crossovers in yeast
Meiotic recombination plays a central role in the evolution of sexually reproducing organisms. The two recombination outcomes -- crossover (CO) and noncrossover (NCO) -- increase genetic diversity, but also have the potential to homogenize alleles by gene conversion. While CO rates are known to vary considerably across the genome, NCOs and gene conversions have only been identified at a handful of loci. To examine recombination genome-wide and at high spatial resolution, we generated maps of COs, CO-associated gene conversion and NCO gene conversion by hybridizing genomic DNA from all four products of 56 yeast meioses to high-density tiling microarrays.
The resulting data presented two significant computational challenges.
First, accurately identifying regions of gene conversion required high-confidence genotype calls. Microarray probe behavior, however, was seen to be sensitive to the varying genomic background of the segregants. To address this, we developed a semi-supervised algorithm which out-performed existing methods for array-based genotyping. Second, evaluation of how CO and NCO rates vary along the genome (including the identification of "hotspots") requires adjustment for differences in marker spacing. Our high-resolution data inverted the traditional relationship between markers and recombination events, making traditional adjustments inappropriate: there were typically multiple markers within each CO or NCO gene conversion event, rather than vice versa. As a consequence, marker spacing correction required a novel approach.
To our knowledge, the resulting maps represent the first high-resolution, genome-wide characterization of the multiple outcomes of recombination in any organism. They reveal differences in the distributions of COs and NCOs, provide evidence of interference between COs and NCOs, permit identification of instances of post-meiotic segregation, and suggest that up to 1% of each meiotic product's genome is subject to gene conversion in a single meiosis -- with detectable bias towards GC nucleotides.
May 20, 2009
Mark van der Laan, PhD
Departments of Biostatistics and Statistics University of California, Berkeley
TARGETED MAXIMUM LIKELIHOOD LEARNING OF CAUSAL EFFECTS AND VARIABLE IMPORTANCE PARAMETERS IN GENOMICS
Current statistical practice to assess an effect of an intervention or exposure on an outcome of interest often involves either maximum likelihood estimation for an a priori specified regression model, or manual and/or data adaptive interventions to fine tune a choice of model. In both cases, bias in the point estimates and in the estimate of the signal to noise ratio is rampant, causing an epidemic of false claims based on data analyses.
In this talk we present our efforts to construct machine learning algorithms for estimating a causal effect that take away the need for specifying regression models, while still providing maximum likelihood based estimators and inference. Two fundamental concepts underlying this methodology are the very aggressive use of cross-validation to select optimal combinations of many model fits, and subsequent targeted maximum likelihood estimation to target the fit towards the causal effect of interest.
We illustrate this method in observational studies for assessing the effect of an intervention on adherence to drug regimen in HIV infected patients, and for discovery of mutations in the HIV virus that cause resistance to a particular drug regimen.
We also illustrate the performance on FDA approved clinical trials, simulated data imitating postmarket safety analysis, and the analysis of single nucleotide polymorphisms.
April 15, 2009
Jane-Ling Wang
UC Davis
JOINT MODELING OF LONGITUDINAL AND SURVIVAL DATA
In clinical and other longitudinal studies, it has become increasingly common to observe an event time of interest, usually referred to as a survival time, along with baseline and longitudinal covariates. Both the survival and covariate processes are of interest, as is the relationship between them. Due to several complications, traditional approaches, including the partial likelihood approach for the Cox proportional hazards model and the rank based approach for the accelerated failure time model, encounter difficulties when longitudinal covariates are involved in the modeling of survival times. Moreover, the longitudinal processes are often subject to informative dropout. Jointly modeling the survival and longitudinal data emerges as an effective way to overcome these difficulties.
In this talk, we will discuss the challenges in this area and provide several solutions. One of the difficulties is that maximum likelihood estimates (MLE) often do not exist when the survival component is modeled semiparametrically as in the Cox or accelerated failure time models. Several alternatives will be illustrated, including nonparametric MLEs, the method of sieves, and pseudo-likelihood approaches. Another difficulty is related to the parametric modeling of the longitudinal component. Nonparametric alternatives will be considered to deal with this complication.
The talk is based on various joint work with Jimin Ding, Fushing Hsieh and Yi-Kuan Tseng.
March 18, 2009
Sandrine Dudoit, PhD
Division of Biostatistics and Department of Statistics, UC Berkeley
STATISTICAL INFERENCE IN mRNA-Seq
For the past decade, microarrays have been the assays of choice for high-throughput studies of gene expression. Recent improvements in the efficiency, quality, and cost of genome-wide sequencing are prompting biologists to rapidly abandon microarrays in favor of ultra high-throughput sequencing, a.k.a. second-generation or next-generation sequencing: e.g., Applied Biosystems' SOLiD, Helicos BioSciences' HeliScope, Illumina's Genome Analyzer, and Roche's 454 Life Sciences sequencing systems. These high-throughput sequencing technologies have already been applied to monitor genome-wide transcription levels (mRNA-Seq), DNA-protein interactions (ChIP-Seq), chromatin structure, and DNA methylation status. While sequencing-based gene expression studies have been touted as overcoming longstanding limitations of microarray-based studies, these new biotechnologies raise similar as well as novel statistical and computational challenges, in areas such as image analysis, base-calling, read-mapping, and (differential) expression inference.
This talk will report on our investigation of two mRNA-Seq datasets obtained using Illumina's Genome Analyzer platform to measure transcript levels in reference samples from the MicroArray Quality Control (MAQC) Project. We focus on the analysis of mapped read counts and the following three main issues: (1) exploratory data analysis (EDA); (2) assessment of biological effects of interest (e.g., expression levels in Brain vs. UHR RNA) and nuisance experimental effects (e.g., library preparation, flow-cell, and lane effects); (3) identification of differentially expressed genes.
This is joint work with James H. Bullard, Steffen Durinck, Kasper D. Hansen, and Elizabeth A. Purdom.
February 18, 2009
Weng Kee Wong, PhD
University of California, Los Angeles
A WEB-BASED TOOL FOR FINDING OPTIMAL DESIGNS IN BIOMEDICAL RESEARCH
Despite advances in optimal design theory, applications to find efficient designs in biomedical studies have been sporadic. Part of the reason may be that the theory can be complicated, and optimal design strategies and their benefits are not well advertised.
I review optimal design theory and recent advances in the field. To promote optimal design ideas, I present a website that allows practitioners to generate a variety of optimal designs easily and freely. After selecting a suitable model from the site and an optimality criterion, the user inputs design parameters for his or her problem. The site returns the optimal design and the efficiency of any user-selected design. I give illustrative applications in the biomedical sciences and hope that this website will help biomedical researchers implement a more informed design that provides improved statistical inference at minimal cost.
February 4, 2009
Annual Special Invited Seminar
PLEASE NOTE VENUE CHANGE: 185 Berry Street, Lobby 6, Room 331 (Radiology Classroom)
Judea Pearl, PhD
University of California, Los Angeles
Causal Inference In The Empirical Sciences
The questions that motivate most studies in the health, social and behavioral sciences are not statistical but causal in nature. For example, what is the efficacy of a given drug in a given population? Can data prove an employer guilty of hiring discrimination? What fraction of past crimes could have been prevented by a given policy? What was the cause of death of a given individual, in a specific incident?
Remarkably, although many of the conceptual and algorithmic tools needed for tackling such problems are now well established, they are hardly known to empirical researchers.
The barrier has been cultural; formulating causal problems mathematically requires certain extensions to the standard mathematical language of statistics, and these extensions are not generally emphasized in the mainstream literature and education. (Skeptics and traditionalists in the audience will be invited to write down a mathematical formula for the empirical claim: 'The rooster crow does not cause the sun to rise.')
In this talk, I will attempt to break this cultural barrier by introducing a few basic principles and simple mathematical tools that are sufficient for solving most (if not all) problems involving causal relationships. The principles are based on non-parametric structural equation models, a natural generalization of those used by econometricians in the 1950-60s, yet cast in new mathematical underpinnings. This framework, enriched with a few ideas from logic and graph theory, gives rise to a formal yet friendly calculus of counterfactuals that unifies all existing approaches to causation -- from econometric and Rubin's models to path-diagrams -- and resolves long-standing problems in several of the sciences. These include questions of confounding, causal effect estimation, covariate selection, policy analysis, legal responsibility, effect decomposition, instrumental variables, and the integration of data from diverse studies.
December 17, 2008
Herbie Lee
Applied Math & Statistics Department, UC Santa Cruz
Neural Networks as a Statistical Model
Neural networks are a popular tool for the modeling of complex relationships in data, particularly as a form of non-linear regression and classification. This talk will give an overview of the probability model behind a neural network, explain how it fits into the larger context of nonparametric modeling in statistics, explore choices of priors within the Bayesian paradigm, discuss model selection, and give several examples of data analysis with neural networks.
November 19, 2008
Nancy Zhang
Department of Statistics, Stanford University
CROSS SAMPLE DETECTION AND SUMMARY OF HIGH-DENSITY DNA COPY NUMBER DATA
DNA copy number variants (CNVs), i.e. gains and losses of chromosomal segments, are an important class of genetic variation. Various laboratory techniques have been developed for high-throughput measurement of DNA copy number. Currently, most CNV detection methods treat each sample separately, as if the statistical algorithm has never seen a similar sample before, and will never see another sample again. Empirically, however, CNVs do not just occur at random in our genome. Instead they often share the exact same breakpoints across samples (in the case of inherited CNVs) or appear in 'hot spots' distributed unevenly across the chromosomes (in the case of copy number aberrations in cancer). To the extent that some CNVs are recurrent, a joint analysis, one that incorporates information across all samples analyzed, is likely to be statistically more robust. Furthermore, a joint segmentation can lead to multi-sample summary statistics that serve naturally as the overall molecular signature of the cohort.
We formulate a statistical model for cross-sample analysis of DNA copy number. We discuss statistics for pooling information across samples to detect weak signals and refine boundaries, and show that they have better power properties and provide a more easily interpreted summary of the data than do procedures based on a separate analysis for each sequence.
October 15, 2008
Haiyan Huang
Department of Statistics, UC Berkeley
Computational Diagnosis by Querying Large-Scale Public Microarray Repositories
Predicting phenotypes from genotypes is one of the major challenges of functional genomics. In this talk, I will introduce a Bayesian method we developed for using microarray repositories to create a disease diagnosis database. Various statistical/computational issues involved will be discussed. This study provides an important application for the enormous amount of costly generated, yet freely available, genomics data.
September 17, 2008
Jeff Wall
Institute for Human Genetics and Department of Epidemiology and Biostatistics, UCSF
INFERENCE OF HUMAN DEMOGRAPHIC HISTORY FROM SEQUENCE POLYMORPHISM DATA
We describe new methods for estimating human demographic parameters from DNA sequence polymorphism data and apply these methods to sequence data from 222 genes (Environmental Genome Project) and 90 non-coding regions. While many of our results are similar to those of previous studies, one finding was quite unexpected: we find much more genetic variation on the X chromosome (compared to the autosomes) than expected. We review possible explanations for this pattern and suggest that the most likely one is some degree of polygyny throughout most of human history.
July 29, 2008
Adam Olshen
Memorial Sloan-Kettering Cancer Center
Segmentation of allele-specific DNA copy number data
Segmentation methods have gained popularity for identifying gains and losses in DNA copy number data. Traditionally segmentation has been applied to total copy number data, as that is all that is available for most copy number platforms. For SNP platforms, however, there are separate estimates of copy number for the two parental chromosomes, which necessitates new segmentation methodology. I will present our segmentation approach. I will demonstrate its utility on data from the Cancer Genome Atlas. I will also discuss other copy number research in which I am involved.
Joint work with Venkat Seshan and Richard Olshen.
June 10, 2008
Kingshuk Roy Chowdhury
Statistics Department, University College Cork
Morphometric analysis of cellular images
We consider the problem of identifying possible changes in the shape of cellular nuclei after the application of a toxin. In order to measure sufficient numbers of cells, we need automated methods of boundary identification. We present a semi-automated algorithm called constrained automated seeded region growing (CASRG) which performs well in meso-scale applications. Analysis of nuclei shapes shows that the application of toxin causes significant elongation over time in the nuclei, but no significant changes in average size or orientation.
May 6, 2008
Brian Leroux
Department of Biostatistics, University of Washington
Estimation of the Intraclass Correlation Coefficient
The Intraclass Correlation Coefficient (ICC) is a useful parameter for planning studies that involve clustered data, such as dental studies and group-randomized trials. The ICC is the correlation between two outcomes within the same cluster, and is used for performing power calculations. The ANOVA method is a convenient method for estimation of the ICC, which uses only simple closed-form expressions. Other methods, such as maximum-likelihood, may achieve greater precision than the ANOVA method, but require distributional assumptions and iterative computational procedures. These other methods require distributional assumptions for hypothesis testing and confidence intervals for the ICC. In this talk, I describe a new closed-form method of estimation and inference for the ICC that does not require distributional assumptions. Simulation studies show that the new method yields valid inferences for the ICC in a wide range of settings. The new method performs similarly to ANOVA when the data are normally distributed and cluster sizes are equal, but has greater precision than ANOVA if cluster sizes are unequal. Applications are made to data sets from a dental practice-based research network and a school-based smoking prevention trial.
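For orientation, the classical one-way ANOVA estimator mentioned above can be written in a few lines. This is a minimal sketch of the standard textbook ANOVA estimator, not the new closed-form method described in the talk; the function name and the list-of-lists input format are assumptions made for illustration.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the intraclass correlation coefficient.

    clusters: list of lists, one inner list of outcomes per cluster.
    Cluster sizes may be unequal; n0 below is the usual adjusted
    average cluster size used in that case.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [sum(c) / len(c) for c in clusters]

    # Between-cluster and within-cluster sums of squares.
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ssw = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)

    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    return (msb - msw) / (msb + (n0 - 1) * msw)
```

With identical outcomes inside every cluster the estimate is 1; when all variation is within clusters it reaches its lower bound of -1/(n0 - 1).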
April 24, 2008
Paul Scheet
Center for Statistical Genetics and Department of Biostatistics, University of Michigan
A Statistical Model for Patterns of Population Genetic Variation with Applications
Current high-throughput technologies have enabled large-scale surveys of population genetic data, such as those for genome-wide association (GWA) studies of complex traits. These data demand computationally tractable models for inference. In this talk I present a statistical model for patterns of linkage disequilibrium (the correlation of alleles at nearby loci; LD) among tightly-linked SNPs. I demonstrate how this model may be used to improve association mapping techniques by imputing genotypes from a dense reference panel of individuals and by directly modeling haplotype variation to detect associations with rare SNPs. I also present a new LD-based quality control tool for genotype data, which can detect, and in some cases correct, genotyping errors. Finally, I will present a new framework for incorporating haplotype information into traditional single-marker methods for analyzing population genetic data and apply the model to a recent survey of SNP data from the Human Genome Diversity Project to visualize global haplotype variation.
April 1, 2008
David Siegmund
Department of Statistics, Stanford University
Mapping Quantitative Traits
I describe a unified model for the statistical foundations of population based association mapping and family based linkage mapping of quantitative traits in humans. Analysis of the model involves the efficient score statistic for the conditional likelihood, given the phenotypes. Analytic expressions for noncentrality parameters give qualitative insight into the relative power of different statistics and the loss of power that occurs if the scientist's assumed genetic model differs from nature's "true" genetic model. The multiple comparisons problem of genome scans to search for anonymous genes is discussed.
Reference: Dupuis J, Siegmund D and Yakir B. (2007) PNAS, 104:20210-5.
March 4, 2008
David Draper
Department of Applied Mathematics & Statistics, UC Santa Cruz
BAYESIAN DECISION THEORY IN BIOSTATISTICS: THE UTILITY OF UTILITY
The discipline of statistics may be divided broadly into four activities: description (graphical and numerical summaries of a data set, without attempting to reason outward from it), inference (drawing probabilistic conclusions about the underlying process that gave rise to the data), prediction (summarizing uncertainty about future observables), and decision-making (looking for optimal behavioral choices in the face of uncertainty, by constructing appropriate utility functions and maximizing expected utility). The history of the discipline has tended to focus on description and inference at the expense of prediction and decision-making; in particular, problems that at first look inferential may profitably be reformulated as decisions, and people sometimes use inferential tools to suggest "optimal" behaviors that are not as optimal as they initially seem. In this talk I'll describe two case studies in biostatistics in which Bayesian decision theory gives new insight in settings that seem inferential: variable selection in generalized linear models (with application to the construction of a cost-effective scale for measuring sickness at admission to hospital) and determining the efficacy of a vaccine against HIV.
January 15, 2008
Fushing Hsieh
Department of Statistics, UC Davis
Nonparametric state-space decoding computations for non-autonomous dynamics
The hierarchical factor segmentation (HFS) algorithm is introduced as a nonparametric computation for decoding the state-space trajectory underlying various types of time series data generated from non-autonomous dynamics. We illustrate two applications of the HFS algorithm: one for decoding in hidden Markov models (HMMs) and the other for computing signature-phases in circadian rhythms. For HMMs, the efficiency of the HFS algorithm is compared with the popular dynamic-programming-based Viterbi and posterior-Viterbi algorithms on simulated as well as real CpG island genetics data. For circadian rhythms, we analyze event-time series (actogram) data generated from experiments possibly coupled with light pulse interruptions. A sequence of signature-phases is computed to mark a sequence of rhythmic cycles of variable cyclic lengths. The signature-phase also provides a rigorous foundation for phase-shift measurements, which are very important in biomedical hormone therapy. Our non-Fourier analysis for rhythmic dynamics is compared with Fourier analysis and periodogram-based methodologies.
December 18, 2007
Andrew Vickers
Sloan Kettering Memorial Hospital, New York
How do we know whether a predictive model is of clinical value? How do we know whether a molecular marker is worth measuring? A discussion of some simple decision analytic methods
There is increasing interest in and use of multivariable prediction models to aid clinical management. In oncology, it has been shown that such models are more accurate than the use of crude risk categories, such as those based on cancer stage. Accordingly, it has been suggested that multivariable models should be used to make decisions about patient care, such as whether a patient should receive chemotherapy after initial curative surgery. Research on molecular markers has mirrored the growth of prediction models: currently an enormous volume of papers are published examining whether a tissue or blood marker can predict the occurrence or course of disease.
Markers and models are currently evaluated in terms of accuracy using metrics such as the area-under-the-curve (AUC), sensitivity and specificity or the concordance index. A model is thought to be a good one if it is accurate; a marker is claimed to be of value if it increases the accuracy of a model. But how accurate is accurate enough? For instance, should we use a model with an AUC of 0.65, or only those with AUCs above 0.75? Similarly, if a marker improves AUC from, say, 0.65 to 0.68, is it worth using in the clinic?
This all depends, of course, on what the model or marker will be used for. Evaluating models and markers in terms of clinical consequences is the remit of a field known as "decision analysis". The problem with decision analysis, however, is that it requires additional information, for example, on the benefits, harms and costs of treatment, or on patient preferences for different health states. Perhaps as a result, the number of papers in the literature using decision analytic methods is dwarfed by those that report accuracy.
In this presentation, I will describe some simple decision analytic methods that can be directly applied to the data set of a model or marker, without the need for external information. These methods can therefore be used to tell us whether or not to use a model in the clinic, or whether a marker is a good one. To illustrate the use of the methods I will look at markers for the detection of prostate cancer, and also examine whether a statistical model is a better basis than cancer stage for determining use of chemotherapy after radical cystectomy.
References:
Vickers AJ and Elkin EB. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Medical Decision Making 2006;26(6):565-74.
Vickers AJ, Kramer BS, Baker SG. Selecting patients for randomized trials: a systematic approach based on risk group. Trials 2006;7:30.
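As a rough illustration of a decision analytic quantity that needs only the data set of the model itself, the net benefit used in decision curve analysis (Vickers and Elkin, 2006) can be computed from predicted risks, observed outcomes and a threshold probability. The function names and the toy data below are hypothetical and made up for illustration; this is a minimal sketch, not the speaker's software.

```python
def net_benefit(risks, outcomes, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    net benefit = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(risks)
    # True positives: treated patients who have the disease.
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    # False positives: treated patients who do not have the disease.
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of treating every patient, ignoring the model."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Made-up toy data: predicted risks and observed status (1 = diseased).
risks = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0]

for pt in (0.1, 0.25, 0.5):
    print(pt, net_benefit(risks, outcomes, pt), net_benefit_treat_all(outcomes, pt))
```

Comparing the model's net benefit with the treat-all value (and with zero, the value of treating no one) across a range of thresholds gives a decision curve; the threshold encodes how a false positive is weighed against a true positive.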
December 11, 2007
Diana Miglioretti
Center for Health Studies, Seattle
Modeling the dissemination of a screening test (and other interests)
Microsimulation modelers such as the Cancer Intervention and Surveillance Modeling Network (CISNET) rely on accurate models of screening dissemination to estimate the contribution of screening to observed changes in cancer incidence and mortality. We propose an approach for estimating the age at first screening test from current status data collected via two series of cross-sectional surveys. To model the national probability of ever having the screening test of interest, we incorporate birth cohort effects into a mixed-influence diffusion model. We link a state-specific model to the national-level diffusion model using a marginalized modeling approach. To simulate screening histories for our microsimulation model, we will link this model to a latent class survival model for modeling multiple gap times between screening examinations, which I will briefly describe as a work in progress. If there is time, I will also describe some of my other interests, including my work with the Breast Cancer Surveillance Consortium.
November 20, 2007
Raquel Prado
Department of Applied Mathematics & Statistics, UC Santa Cruz
ASSESSING THE EFFECT OF SELECTION IN DNA SEQUENCES ENCODING MALARIA ANTIGENS
A model-based approach for assessing the effect of natural selection at the amino acid level in protein-coding DNA sequences is presented. Bayesian generalized linear models are used to describe patterns of codon mutations in count data derived from sequence alignments. Such models provide a flexible framework that allows experts to simultaneously perform the following tasks: detecting residues with relatively large ratios of non-synonymous to synonymous mutation probabilities; comparing intra-specific and inter-specific mutation probabilities, as well as mutation probabilities across various protein domains; and determining if radical changes are being encouraged by natural selection. Key modeling features include the incorporation of biologically meaningful information via structured priors and model validation via posterior predictive checks and/or estimation of gene trees. The methodology is illustrated with analyses of polymorphic data obtained from isolates of the apical membrane antigen-1 in the human malaria parasite P. falciparum. Divergence data derived from a strain of the homologous gene in P. reichenowi are also analyzed.
October 2, 2007
Marc Coram
Department of Health Research and Policy, Stanford University
Allele Frequency Estimation by Borrowing Strength across Populations
In genetic studies, allele frequency at a genetic marker is routinely inferred, often using genotypes from a small set of individuals. Improving the accuracy of these estimates will benefit studies of human genetic variation or the genetic etiology of heritable traits. Here, we propose an empirical Bayes approach for estimating allele frequencies at single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. Applications of our method to data from recent genomic projects suggest that this empirical Bayes approach can substantially reduce the variability in the frequency estimates, while introducing little bias. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.
September 25, 2007
John Boscardin
Department of Biostatistics, UCLA
FLEXIBLE MODELING OF HETEROGENEOUS LONGITUDINAL DATA
We model multivariate longitudinal data on multiple subjects using a state space smoothing spline approach. The covariance parameters for the state space model are subject-specific so as to allow for heterogeneity, but are modeled hierarchically to facilitate borrowing of information across subjects to the extent supported by the data. The performance and applicability of this model are highlighted using intensive care unit data from a prospective, observational study of severe head trauma patients. In this setting, real-time inference is required for clinical utility. Extensions to within-subject heterogeneity will also be discussed. This is joint work with Hector Lemus.
May 18, 2007
Jason Fine
Department of Statistics and Department of Biostatistics & Medical Informatics, University of Wisconsin, Madison
NONPARAMETRIC ASSOCIATION ANALYSIS OF MULTIVARIATE COMPETING RISKS DATA, WITH APPLICATION TO DEMENTIA ONSET IN AN AGING POPULATION
While nonparametric association analyses of bivariate failure times have been widely studied, analogous analyses of bivariate competing risks data have not been investigated. Such analyses are important in familial association studies in genetic epidemiology and demography, where multiple interacting failure types may invalidate nonparametric analyses for independently censored clustered survival data. The scenario is common in population based studies where onset of certain chronic diseases, eg, psychiatric disorders, may be dependently censored by death. I first develop nonparametric estimators for the bivariate cause-specific hazards function and the bivariate cumulative incidence function, which are natural extensions of their univariate counterparts and make no assumptions about the dependence of the risks. The estimators are shown to be uniformly consistent and to converge weakly to Gaussian processes.
Time-dependent summary association measures are proposed and yield formal tests of independence in clusters. The practical utility of the methodology is illustrated in an analysis of dementia in the Cache County Aging Study, where dependent censoring by mortality is heavy and the onset associations are strongly time-varying.
May 8, 2007
Mary Lesperance
Department of Mathematics & Statistics, University of Victoria
GRAPHICAL TECHNIQUES FOR GENE EXPRESSION STUDIES
Correspondence analysis (CA) is a descriptive technique designed for investigating the association between row and column variables by graphically displaying the patterns in the data. It has been widely applied to categorical data. We explore and develop variations of CA techniques to identify differentially expressed genes and to assess the quality of replicate DNA arrays.
Multiple correspondence analysis (MCA) and a related technique called joint correspondence analysis (JCA) are methods for visualizing the joint features of 2 or more categorical variables. We have been working with the Genetic Pathology Evaluation Centre (GPEC) at UBC and the Breast Outcomes Unit (BCOU) at the B.C. Cancer Agency (BCCA) to study relationships between molecular markers and outcomes for breast cancer. Molecular markers and diagnostic variables are typically categorized as positive/negative by pathologists and oncologists, whereas outcome measures such as time to recurrence or breast cancer specific survival time are continuous and possibly censored. We consider fuzzy coding methods to display survival information in an MCA analysis of molecular markers.
May 3, 2007
Mitchell Gail
Chief of Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute
PROBABILITY OF DETECTING DISEASE-ASSOCIATED SINGLE NUCLEOTIDE POLYMORPHISMS IN CASE-CONTROL STUDIES WITH WHOLE GENOME SCANS
Some case-control genome-wide association studies (CCGWASs) select promising single nucleotide polymorphisms (SNPs) by ranking corresponding p-values, rather than by applying the same p-value threshold to each SNP. For such a study, we define the detection probability (DP) for a specific disease-associated SNP as the probability that the SNP will be "T-selected", namely have one of the top T largest chi-square values (or smallest p-values) for trend tests of association. The corresponding proportion positive (PP) is the fraction of selected SNPs that are true disease-associated SNPs. We study DP and PP analytically and via simulations, both for fixed and for random effects models of genetic risk, that allow for heterogeneity in genetic risk. DP increases with genetic effect size and case-control sample size, and decreases with the number of non-disease SNPs, mainly through the ratio of T to N, the total number of SNPs. We show that DP increases very slowly with T, and the increment in DP per unit increase in T declines rapidly with T. DP is also diminished if the number of true disease SNPs exceeds T. For a genetic odds ratio per minor allele of 1.2 or less, even a CCGWAS with 1000 cases and 1000 controls requires T to be impractically large to achieve an acceptable DP, leading to PP values so low as to make the study futile and misleading. We further calculate the sample size of the initial CCGWAS that is required to minimize the total cost of a research program that also includes follow-up studies to examine the T selected SNPs. A large initial CCGWAS is desirable if genetic effects are small or if the cost of a follow-up study is large.
Joint work with M. Pfeiffer, William Wheeler and David Pee.
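The detection probability defined above can be approximated by simulation. The sketch below rests on simplifying assumptions that are not in the abstract: SNPs are independent, there is a single disease SNP, null test statistics are chi-square with one degree of freedom, and the disease SNP's statistic is noncentral chi-square with a noncentrality parameter `ncp` that stands in for effect size and sample size. The function name and parameters are hypothetical; this is not the authors' analytic calculation.

```python
import math
import random


def detection_probability(n_snps, top_t, ncp, n_sims=500, seed=0):
    """Monte Carlo estimate of P(disease SNP is among the top_t statistics)."""
    rng = random.Random(seed)
    shift = math.sqrt(ncp)
    hits = 0
    for _sim in range(n_sims):
        # Noncentral chi-square (1 df) as a squared shifted normal.
        disease_stat = (rng.gauss(0.0, 1.0) + shift) ** 2
        # Count the null SNPs whose chi-square (1 df) statistic is larger.
        n_larger = 0
        for _snp in range(n_snps - 1):
            if rng.gauss(0.0, 1.0) ** 2 > disease_stat:
                n_larger += 1
        # The disease SNP is T-selected if fewer than top_t null SNPs beat it.
        if n_larger < top_t:
            hits += 1
    return hits / n_sims


for ncp in (2.0, 10.0, 30.0):
    print(ncp, detection_probability(n_snps=1000, top_t=10, ncp=ncp, n_sims=200))
```

Increasing `n_snps` while holding `top_t` fixed makes it harder for the disease SNP to be selected, which is the dependence on the ratio of T to N described in the abstract.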
April 10, 2007
Biao Xing
Senior Biostatistician, Genentech, Inc.
BLINDED SAMPLE SIZE REESTIMATION IN RANDOMIZED CLINICAL TRIALS WITH CONTINUOUS ENDPOINT
Blinded sample size reestimation allows for modifying the sample size of an ongoing trial to ensure sufficient statistical power without breaking the blind. One challenge is the blinded estimation of the within-group variance. Earlier proposed methods either make untenable assumptions or are only applicable to two-treatment trials. Moreover, these methods are often biased. We propose a simple unbiased method, which also makes minimal assumptions. The method uses the enrollment order of subjects and the randomization block size to estimate the variance and then reestimate the sample size. It can be applied to normal or non-normal data, to trials with two or more arms, equal or unequal allocation schemes, and fixed or random randomization block sizes. Results from simulations and data analysis suggest that the proposed blinded sample size estimation approach is practical.
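One way enrollment order and block size could yield a blinded variance estimate is sketched below, under assumptions not stated in the abstract: a fixed block size, the same allocation counts in every block, an additive treatment effect, and independent errors with a common variance. Every complete block then has the same expected sum, so the sample variance of the block sums divided by the block size estimates the within-group variance without unblinding. The function names and the simulated trial are hypothetical, and this is not necessarily the speaker's estimator.

```python
import random
import statistics


def blinded_variance(responses, block_size):
    """Estimate within-group variance from block sums, ignoring treatment labels.

    `responses` must be in enrollment order; incomplete trailing blocks are dropped.
    """
    n_blocks = len(responses) // block_size
    if n_blocks < 2:
        raise ValueError("need at least two complete blocks")
    block_sums = [
        sum(responses[b * block_size:(b + 1) * block_size])
        for b in range(n_blocks)
    ]
    # Var(block sum) = block_size * sigma^2 when every block has the same mix of arms.
    return statistics.variance(block_sums) / block_size


def simulate_trial(n_blocks, block_size, effect, sigma, seed=0):
    """Simulated two-arm trial with 1:1 allocation inside each block."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_blocks):
        arms = [0] * (block_size // 2) + [1] * (block_size // 2)
        rng.shuffle(arms)
        responses.extend(rng.gauss(effect * arm, sigma) for arm in arms)
    return responses


responses = simulate_trial(n_blocks=500, block_size=4, effect=3.0, sigma=2.0)
print(blinded_variance(responses, block_size=4))  # should be near sigma**2 = 4
print(statistics.variance(responses))  # naive pooled variance, inflated by the effect
```

The naive variance of all responses mixes the treatment effect into the estimate, while the block-sum version does not, because the treatment contribution is identical in every block sum.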
March 27, 2007
Alan Hubbard
Division of Biostatistics, UC Berkeley SPH
A NEW SCREENING ALGORITHM FOR MULTIPLE RISK FACTOR/DISEASE ASSOCIATION STUDIES: COMBINING VARIABLE IMPORTANCE, THE CONDITIONAL PERMUTATION DISTRIBUTION AND MULTIPLE TESTING PROCEDURES
A typical study design for investigating potential causes of disease can involve collecting a large number of potential risk factors and disease outcomes in a random sample of individuals. A new proposal is made, merging recent developments in causal inference and computational biology with existing methods, for an algorithm providing simultaneous ranking/testing of many potential risk factors. The procedure has three features: 1) The Parameter of Interest: a natural parameter of interest for certain study designs and risk factors is that inspired by the so-called population intervention model (Hubbard and van der Laan, 2005), which can also serve as a more general measure of variable importance. Under assumptions, this importance measure for a particular risk factor (variable) can be interpreted as the change in the mean disease outcome in a population if an intervention (for all subjects) set the variable to its "safest" level. 2) Marginal Inference: to provide inference (p-values) for estimated variable importances, conditional permutation methods (Rosenbaum, 1984) are available; these methods can potentially provide exact finite sample tests. 3) Experimentwise Inference: A recently proposed multiple testing procedure (using the quantile-function) inspired by problems in computational biology provides sharp control of experimentwise type I errors, while using the conditional permutation distribution for marginal control. The benefits of this new combined methodology are a parameter with public health significance, robust inference that automatically accounts for model selection, and a set of risk factors for which one has the most evidence (confidence) of having an impact on disease in the target population. This technique, which could replace typically ad hoc approaches, provides an automated procedure for analyzing studies of many candidate risk factors and disease outcomes.
March 2, 2007
Yuanyuan Xiao
Center for Bioinformatics & Molecular Biostatistics, UCSF
SNP GENOTYPING USING AFFYMETRIX GENECHIP ARRAYS
Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. Similar to the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, for example, use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls.
In this talk, I will discuss current genotyping methods using Affymetrix SNP arrays and will introduce a new algorithm (MAMS) we have developed, which combines single-array multi-SNP and multi-array single-SNP calls to improve the accuracy of genotype calls, without the need for training data or computation-intensive normalization procedures as in other multi-array methods. Using a set of publicly available HapMap arrays/samples with known genotypes (from other genotyping technologies) as benchmarks, we illustrate the performance of MAMS in comparison with existing genotyping algorithms.
February 22, 2007
Yu Shen
Department of Biostatistics, M. D. Anderson Cancer Center
Inference of Tamoxifen's Effects on Prevention of Breast Cancer from a Randomized Controlled Trial
Breast cancer is the most common non-skin cancer among women in the United States, and continues to be an important cause of morbidity and mortality for women at high risk of developing the disease. The advent of preventive intervention and early detection of cancer brings greater hope to the control of breast cancer, while also posing significant challenges to researchers and public health policy makers. To provide a quantitative framework for describing the natural history of breast cancer and for assessing the impact of the primary preventive intervention on the natural progression of the disease, we propose a flexible semiparametric model to assess the effects of a preventive agent on the incidence of breast cancer as well as time to the diagnosis of the disease, separately, in the framework of a cure-rate model. We used an estimating equation approach to estimate the unknown parameters, and assessed the semiparametric model assumption with a test based on the area between two survival curves. This is joint work with Qin and Costantino.
February 21, 2007
Eran Halperin
International Computer Science Institute, UC Berkeley
WHOLE-GENOME DISEASE ASSOCIATION STUDIES: CHALLENGES AND SOLUTIONS
The recent data release of the Haplotype Mapping project, and the rapid reduction in genotyping costs, open new directions and opportunities in the study of complex diseases via the analysis of single nucleotide polymorphisms (SNPs) data. At the same time, the increased size of the SNP datasets set new computational and statistical challenges.
In this talk I will discuss some of the challenges set by the large-scale of these studies, and the current solutions to these challenges. In particular, I will describe recent results on whole-genome haplotype analysis, including haplotype inference, and the incorporation of the HapMap data in haplotype analysis of case-control studies. I will also discuss potential drawbacks of these methods due to population substructure, and suggest solutions that are scalable to the coming large-scale studies.
February 16, 2007
Wei Li
Dana-Farber Cancer Institute, Harvard School of Public Health
ChIP-chip ON GENOME TILING ARRAYS: TOWARDS AN UNDERSTANDING OF THE GLOBAL TRANSCRIPTIONAL REGULATION
Identifying the regulatory targets of a transcription factor (TF) is crucial to understanding its biological function. Chromatin Immunoprecipitation coupled with DNA microarray analysis (ChIP-chip) has quickly evolved as a popular technique to study the in vivo targets of DNA-binding proteins at the genome level.
We developed a series of algorithms to reliably detect and annotate ChIP-enriched regions using Affymetrix whole-genome tiling arrays, including 1) Model-based Analysis of Tiling-arrays (MAT) for ChIP-region detection, 2) extreme MApping of OligoNucleotide (xMAN) for microarray probe mapping, and 3) Cis-regulatory Element Annotation System (CEAS) for ChIP-region annotation. Since their inception in early 2006, they have been adopted by hundreds of academic users and are now considered the ChIP-chip data analysis standard in many labs. We are also coordinating the ENCODE spike-in consortium, which consists of more than 10 transcriptional regulation groups worldwide, to systematically analyze the performance variability introduced in ChIP-chip protocols, array platforms, and analysis methods.
We applied those algorithms to the ChIP-chip data of Estrogen Receptor (ER) and Androgen Receptor (AR) on Affymetrix human genome tiling arrays, and successfully identified thousands of novel binding sites, most of which are far from the promoters of known genes. A screen for enriched motifs within those regions revealed both the typical and non-typical AR responsive elements (ARE) and several other co-factor motifs, including Forkhead and Ap1. Co-immunoprecipitation and re-ChIP assays confirmed the interaction with these co-factors in vivo. Specific targeted silencing of these various cofactors differentially affected hormone-induced gene expression and cell cycle progression.
February 14, 2007
Niko Beerenwinkel
Harvard University
EVOLUTIONARY ESCAPE ON FITNESS LANDSCAPES
The evolution of HIV within individual patients is associated with disease progression and failure of antiretroviral drug therapy. Using graphical models we describe the development of HIV drug resistance mutations and show how these models improve predictions of the clinical outcome of combination therapy. We present combinatorial algorithms for computing the risk of escape of an evolving population on a given fitness landscape. The geometry of fitness landscapes and the underlying gene interactions are analyzed in an attempt to generalize the notion of pairwise epistasis to higher-order genetic systems. Finally, we discuss the new and exciting prospects for analyzing viral genetic variation that arises from recent pyro-sequencing technology.
February 6, 2007
John Rice
Department of Statistics, UC Berkeley
Testing Many Hypotheses
Suppose that a very large number of independent null hypotheses are tested, almost all of which are true. How can the proportion of false null hypotheses be estimated? For motivation, I will briefly discuss the Taiwanese-American Occultation Survey, and will explain how this question arises. I will then present some results based on joint work with Nicolai Meinshausen.
January 9, 2007
Alan Dabney
Department of Statistics, Texas A&M
MODEL-BASED PROTEIN SUMMARIES AND DIFFERENTIAL LABEL-FREE QUANTITATIVE PROTEOMICS
An LC-MS experiment begins with the component peptides of a mixture of proteins. Peptides are first separated by liquid chromatography, then each peptide is characterized by mass and quantified by peak height using mass spectrometry. Differential label-free quantitative proteomics refers to the use of peak heights to compare peptide abundance between groups of interest. Statistical issues include: intensity- (peak-height-) dependent bias, widespread informative missingness, and the desire to make inference at the protein level on the basis of peptides. I will present a model-based approach to addressing these issues. The method will be illustrated on data from the Pacific Northwest National Laboratory.
November 14, 2006
Jay Bartroff
Department of Statistics, Stanford University
MODERN SEQUENTIAL ANALYSIS IN COMPUTERIZED ADAPTIVE TESTING
Sequential analysis of data is used in a variety of types of psychometric tests, including computerized adaptive testing (CAT), classroom interaction intervention, psychological studies with longitudinal data, depression diagnosis, and even crime-suspect identification tests. Focusing on CAT, we discuss designing efficient procedures using sequential generalized likelihood ratio tests, and show how these techniques can lead to substantial improvement over currently-used stopping rules and conventional fixed-length tests. We also extend the asymptotic optimality theory of these tests from the i.i.d. setting to the case of sequentially generated experiments, as in CAT. An example of these tests is given using a real math question pool provided by a subsidiary of the Educational Testing Service. Further practical issues like test security and content balancing will be discussed, and the interesting theoretical challenges they pose. This is joint work with T. L. Lai and Matthew Finkelman.
October 17, 2006
Ying Qing Chen
Statistical Center for HIV/AIDS Research & Prevention (SCHARP),
Fred Hutchinson Cancer Research Center
On Attributable Risk Functions
Time-to-event endpoints are often used in clinical and epidemiological studies to evaluate disease association with hazardous exposures. In the statistical literature of time-to-event analysis, such association is usually measured by the hazard ratio in the proportional hazards model. In public health, it is also of interest to assess the excess risk attributable to an exposure in a given population. In this talk, we discuss the notion of "population attributable fraction" for binary outcomes and extend it to the attributable risk function for event times in prospective studies. A simple estimator of the time-varying attributable risk function is proposed under the proportional hazards model. Our proposed methodology is motivated and demonstrated by the data collected in a multicenter acquired immunodeficiency syndrome (AIDS) cohort study to estimate the attributable risk of human immunodeficiency virus type 1 (HIV-1) infections due to several potential risk factors.
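For the binary-outcome case mentioned above, the classical population attributable fraction can be written with Levin's formula. The sketch below assumes a single binary exposure and no confounding; the function and parameter names are hypothetical. The time-varying attributable risk function proposed in the talk is not reproduced here.

```python
def population_attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's formula: PAF = p * (RR - 1) / (1 + p * (RR - 1)).

    exposure_prevalence: proportion of the population exposed (0 to 1).
    relative_risk: risk in the exposed divided by risk in the unexposed.
    """
    if not 0.0 <= exposure_prevalence <= 1.0:
        raise ValueError("exposure_prevalence must be between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Illustrative inputs only: half the population exposed, threefold risk.
print(population_attributable_fraction(0.5, 3.0))
```

The fraction grows with both the prevalence of the exposure and the size of the relative risk, which is why a modest hazard ratio can still matter for public health when the exposure is common.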
September 26, 2006
Hans Mueller
Department of Statistics, UC Davis
Functional Methods for Longitudinal Data
Functional methods are designed for the analysis of samples of time course data under minimal assumptions. Three useful functional concepts in the context of longitudinal biological data are warping, functional principal components and functional regression. Corresponding models, problems and examples will be discussed. Applications include gene time course expression data, sparse and irregular longitudinal data, and longitudinal data with time-to-event.
December 22, 2005 (joint local ASA meeting)
Wing Hung Wong
Department of Statistics, Stanford University
GLOBAL STUDY OF GENE REGULATION IN EMBRYONIC STEM CELLS
We are interested in the transcriptional programs underlying embryonic stem cells and their early differentiated lineages. Our approach uses gene expression profiling, cell sorting, chromatin immunoprecipitation, as well as multi-species cis-regulatory sequence analysis, to identify developmentally regulated genes and to characterize the sequence elements responsible for their regulation.
CBMB December 13, 2005
Zemin Zhang
Senior Scientist, Department of Bioinformatics, Genentech, Inc.
CANCER TARGET FINDING FROM DNA COPY NUMBER ANALYSIS
Aberrant DNA amplification is one of the most common mutations in cancer cells and it frequently leads to increased expression of encapsulated cancer-promoting genes. Such genetic changes also provide opportunities for cancer diagnostics and targeted therapies. For example, the anti-HER2 antibody drug Herceptin has been used for treating breast cancer patients diagnosed with HER2 amplification. To identify additional HER2-like targets, we took several approaches to study recurrently amplified regions in cancers. First, we developed a computational method for scanning an EST-based transcriptome to find genomic regions harboring clusters of genes with increased expression in cancer tissues. We demonstrated that these regions correlated with previously identified tumor amplicons. We then analyzed BAC clone-based array CGH data using a variety of methods to explore DNA copy number changes in hundreds of breast-, brain- and colon-tumor samples. Since the low resolution of BAC-clone array CGH (approximately 1-2 Mbp) impedes the localization of culprit cancer-causing genes, we incorporated the Affymetrix expression microarray data to pinpoint amplified genes with increased expression in cancer. The correlation between expression patterns of neighboring genes was found to be helpful in confirming DNA amplifications and localizing the culprit cancer genes. In addition, we developed a computational scanning method to search for genes frequently amplified across different types of tumor tissues. Preliminary applications of this method demonstrated great reductions in the noise level commonly seen with array CGH data, and implicated known general oncogenes as well as novel genes with potential tumor-promoting functions.
December 13, 2005
Chris Triggs
Department of Statistics, University of Auckland
ESTABLISHING IDENTITY USING DNA PROFILES
The widespread use of evidence from DNA profiles has transformed forensic science. Many questions of identity in both civil and criminal investigations can be reduced to the question: Does this person belong in this pedigree? We have a set of people whose genetic profiles and relationships we know. We wish to assess the weight of evidence as to whether another individual, whose profile may be known only in part, is related to the first group. Cases of disputed paternity, where we wish to assess whether a man is the biological father of a particular child, are examples of this.
After a general discussion of how to present probabilistic evidence, I will discuss two specific problems in this talk. The first is a case of disputed paternity with evidence from a large pedigree. The second arises from the Boxing Day tsunami, where there were a large number of unidentified bodies. Genetic information from family members was known about some of the missing people.
November 28, 2005
Xiao-Hua Andrew Zhou
Department of Biostatistics, University of Washington
SEMI-PARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION OF ROC CURVES
The diagnostic capability or accuracy of a medical test is often assessed using a receiver operating characteristic (ROC) curve. In this talk, I will discuss a new semi-parametric likelihood approach to estimate the ROC curve that satisfies the property of invariance of the ROC curve. I will show that our new estimator is asymptotically normal and report a simulation study, which demonstrates that the proposed estimator has the best performance among all the existing semi-parametric estimators considered here. Finally, I will outline a new semi-parametric estimation approach for ROC curve regression models.
This is joint work with Huazhen Lin.
November 15, 2005
Robert Tibshirani
Professor of Health Research and Policy, and Statistics, Stanford University
BIOMARKER DISCOVERY: FACT OR ARTEFACT?
The areas of genomics and proteomics present exciting challenges for Statistical Sciences. The main challenge is to extract interpretable and reproducible information from datasets with large numbers of features (genes, SNPs, proteins) and a relatively small number of observations (biological samples, patients). There has been a flurry of statistical work, some by statisticians and also by other quantitative researchers in biology, computer science, physics and engineering.
In my view, the quality of this work has been very mixed. The race to publish and obtain grant funding has produced a significant number of fragile, irreproducible analyses. I will discuss in detail a recent controversial study published in NEJM, and then I will suggest ways in which the field can move forward in a more productive way. I will briefly describe a new tool called "supervised principal components," an example of a promising method for tackling this kind of problem.
CBMB November 15, 2005
David M. Rocke
UC Davis Division of Biostatistics (School of Medicine),
Department of Applied Science (College of Engineering),
and Institute for Data Analysis and Visualization
VARIABILITY AND DATA TRANSFORMATION FOR GENE EXPRESSION, PROTEOMICS, AND METABOLOMICS DATA
Biologists now have the capacity to measure thousands of compounds simultaneously from a single biological sample using gene expression arrays, mass spectrometry, NMR spectroscopy or other methods. These methods can be used to measure mRNA transcripts, proteins, short peptides, lipids, and other biologically active compounds. In this talk, I will describe an important statistical challenge in the use of such data. Using raw data, logarithms, or ratios, the variability of the measurements is strongly dependent on the level of expression, causing a failure of the assumptions of most standard methods of statistical analysis. We present a solution to this problem via a specially tuned data transformation and show how it promotes the effectiveness of simple and sophisticated analyses of the data.
October 7, 2005
Victor De Gruttola
Department of Biostatistics, Harvard University
RELATING GENOTYPE TO PHENOTYPE: RESAMPLING-BASED MULTIPLE HYPOTHESIS TESTING USING ORDER STATISTICS
Development and spread of resistance to anti-retroviral drugs limit their utility. We present multiple testing methods relating HIV genotype to phenotype. A semi-parametric resampling approach identifies patterns of mutations at a set of relevant codons associated with changes in drug susceptibility with respect to wild-type. It compares observed, ordered, mean responses to expected order statistics from an unspecified error distribution, preserves the family-wise error rate asymptotically, and is approximately conservative in finite samples. Two applications use protease sequences and measures of in-vitro sensitivity from the Stanford HIV Drug Resistance Database. The first identifies patterns of mutations that enhance or decrease drug susceptibility; the second investigates interactions. The latter shows that while M46I/L mutations are associated with drug resistance, adding L88D/S mutations leads to hypersusceptible virus. Further addition of T90M/L mutations results in highly resistant virus. This allows the investigation of how mutations act in the presence of others and may suggest mechanisms by which resistance occurs or is reversed through the accumulation of mutations.
Joint work with Jennifer Schumi
October 7, 2005
Mark Segal
UCSF Department of Epidemiology & Biostatistics
CHESS, CHANCE AND CONSPIRACY
Chess and chance are seemingly strange bedfellows. Luck and/or randomness have no apparent role in move selection when the game is played at the highest levels. However, when competition is at the ultimate level, that of the World Chess Championship (WCC), chess and conspiracy are not strange bedfellows, there being a long and colorful history of accusations leveled between participants. One such accusation, frequently repeated, was that all the games in the 1985 WCC (Karpov vs Kasparov) were fixed and pre-arranged move-by-move. That this claim was advanced by a former World Champion, Bobby Fischer, argues that it at least be investigated. That the only published, concrete basis for this claim consists of an observed run of particular moves allows this investigation to be performed using probabilistic and statistical methods. In particular, we employ imbedded finite Markov chains to evaluate distributions of select runs statistics. Further, we demonstrate how both chess computers and game databases can be brought to bear on the problem.
No knowledge of chess is assumed -- we touch on poker, go, checkers, baseball, basketball, parapsychology and cosmology so hopefully there is something for everyone.
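The finite Markov chain imbedding idea mentioned in the abstract can be illustrated with a minimal sketch. It computes the probability that a run of at least k consecutive "successes" appears somewhere in n trials. The independence of trials, the function name run_probability, and its parameters are illustrative assumptions; the talk's actual runs statistics and move model are not specified in the abstract.

```python
def run_probability(n, k, p):
    """P(at least one run of k consecutive successes in n independent trials).

    Assumed setup (not from the talk): each trial succeeds with probability p.
    State i (0 <= i < k) is the current run length; state k is absorbing
    and means a run of length k has already occurred.
    """
    # state[i] = probability of being in state i after the trials so far
    state = [0.0] * (k + 1)
    state[0] = 1.0
    for _ in range(n):
        nxt = [0.0] * (k + 1)
        nxt[k] = state[k]                 # absorbing state keeps its mass
        for i in range(k):
            nxt[i + 1] += state[i] * p    # a success extends the run
            nxt[0] += state[i] * (1 - p)  # a failure resets the run
        state = nxt
    return state[k]
```

For a fair coin, run_probability(3, 2, 0.5) gives 3/8, matching a direct count of the sequences HHH, HHT and THH.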
September 26, 2005
Joanne Chapman
School of Physical, Environmental and Mathematical Sciences
University of New South Wales, Canberra, Australia
A CORRELATED GAMMA FRAILTY MODEL FOR BIVARIATE PROPORTIONAL HAZARDS SURVIVAL DATA
In this talk I'll give you a brief review of the basic definitions and functions used in survival analysis, and take you on a quick trip along the path of progress since the development of the famous, and well-referenced, Cox proportional hazards model in 1972.
I'll also talk about the concept of frailty (a measure of unknowns) and show how it is incorporated into standard survival models. In bivariate survival, as well as being a measure of unknown heterogeneity, frailty also measures association. I will introduce an extension to presently available models that allows us to easily model negative association; and show the particular importance of this when the heterogeneity present in the data is small.
CBMB August 18, 2005
Gordon Smyth
Walter & Eliza Hall Institute of Medical Research
PARAMETER SHRINKAGE AND SEPARATE CHANNEL ANALYSIS OF TWO-COLOUR MICROARRAY DATA
Analysis of two-colour microarray data is traditionally conducted by way of the log-ratios of red to green channel intensities for each spot on each array. A point of some controversy is whether more information can be obtained from the data by maintaining the two channels as separate observations. Pioneering work by Wolfinger et al (2001) advocated an approach using mixed linear models, treating each spot as a block with a random effect. Although flexible and useful, the mixed model framework greatly complicates the application of empirical Bayesian methods. This talk will describe an approach to separate channel analysis using random effects and heteroscedastic regression which applies different rates of shrinkage to different aspects of the covariance models. The aim is to gain information while preserving computational efficiency and simplicity of interpretation for the final models.
CBMB July 8, 2005
Natalie Thorne
University of Cambridge, Computational Biology Group
Department of Oncology, Hutchison/MRC Research Centre
ISSUES IN THE ANALYSIS OF DNA METHYLATION ARRAY DATA
DNA methylation plays an important role in regulation of gene transcription and is strongly implicated in cancer development. There are many limitations of the current methods to detect DNA methylation in a high-throughput genome-wide profiling manner. This is limiting the identification of potential new DNA methylation markers that predict or promote neoplastic progression.
We have been working on developing and comparing various methods for assessing genome-wide DNA methylation using an annotated 12K CpG island microarray. I will discuss some of the issues that need to be addressed in the low level analysis of such microarray data. In particular I will discuss the problem of normalisation.
June 8, 2005 (joint local ASA meeting)
Peter Bacchetti
UCSF Department of Epidemiology & Biostatistics
A COMPLETELY DIFFERENT APPROACH TO SAMPLE SIZE PLANNING
This talk will critique the usual power-based methods of determining sample size and propose an alternative. One difficulty is that the standard approach requires exact specification of inputs that generally are not known in advance, such as the standard deviations and size of the difference. A neglected but equally serious problem is that ignoring the cost implications of different sample size choices cannot be justified. We propose and justify a new approach for choosing sample size based on cost efficiency, the ratio of a study's scientific and/or practical value to its total cost. This can lead to very different answers than conventional power-based methods or Bayesian maximization of expected utility. By showing that a study's projected scientific or practical value exhibits diminishing marginal returns as a function of increasing sample size for a wide variety of definitions of study value, we are able to propose two simple methods that are justified as not falling short of the most cost-efficient sample size. The first is to choose the sample size that minimizes the average cost per subject. The second is to choose sample size to minimize total cost divided by the square root of sample size. This latter method is theoretically more justifiable for innovative studies, but also appears to perform well and has some justification in other cases. For example, if projected study value is assumed to be proportional to power at one specific alternative and total cost is a linear function of sample size, then this approach is more cost efficient than the sample size producing 90% power. In many situations, these methods are easier to implement, based on more reliable inputs, and better justified than current conventional approaches.
Most of the material is from joint work with Chuck McCulloch and Mark Segal.
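The two rules described in the abstract (minimize average cost per subject, or minimize total cost divided by the square root of sample size) can be sketched as a simple grid search. The cost curve example_cost, its coefficients, and the helper names are hypothetical and chosen only for illustration; they do not come from the talk.

```python
import math

def best_n(total_cost, criterion, n_max=10_000):
    """Grid-search the integer sample size in 1..n_max minimizing criterion(cost, n)."""
    return min(range(1, n_max + 1), key=lambda n: criterion(total_cost(n), n))

def cost_per_subject(cost, n):
    """First rule: average cost per subject."""
    return cost / n

def cost_per_root_n(cost, n):
    """Second rule: total cost divided by the square root of sample size."""
    return cost / math.sqrt(n)

def example_cost(n):
    """Hypothetical total cost: fixed start-up cost, a per-subject cost,
    and a quadratic term standing in for increasingly difficult recruitment."""
    return 20_000 + 500 * n + 0.5 * n ** 2
```

With this made-up cost curve, best_n(example_cost, cost_per_subject) selects n = 200 and best_n(example_cost, cost_per_root_n) selects n = 36. For a purely linear cost (fixed cost F plus c per subject), the second rule has its continuous minimum at n = F/c, while the first rule has no finite minimum.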
May 11, 2005
Joseph Hogan
Departments of Medical Science and Community Health, Brown University
SENSITIVITY ANALYSIS FOR ESTIMATES OF CAUSAL TREATMENT EFFECT IN LONGITUDINAL HIV COHORT STUDIES
This talk is intended to illustrate the use of instrumental variables and associated sensitivity analysis for estimating causal treatment effects of HAART from observational cohort studies. Our focus will be on transparent representation of underlying assumptions, and on the role of coherent sensitivity analyses to understand the effects of departures from those assumptions. Characteristics of an 'ideal' sensitivity analysis will be proposed.
As part of the talk, we highlight key differences between various approaches to causal inference (e.g. propensity scores versus instrumental variables); for the most part, they can be differentiated by underlying assumptions about whether all confounders have been observed. It is argued that this at least partially explains why (for example) economists tend to prefer instrumental variables while epidemiologists favor propensity scores and inverse weighting.
April 6, 2005
Bryan Shepherd
Department of Biostatistics, University of Washington
COMPARING OUTCOMES THAT ONLY EXIST IN A GROUP CHOSEN AFTER RANDOMIZATION
In many experiments researchers would like to compare between treatments an outcome that only exists in a subset of participants selected after randomization. For example, in preventive HIV vaccine efficacy trials it is of interest to determine whether randomization to vaccine causes lower HIV viral load, a quantity that only exists in participants who acquire HIV. I will talk about some of the challenges of making these comparisons and propose sensitivity analysis methods using causal inference techniques. These methods estimate the average causal effect of treatment assignment on a post-infection outcome among those who would be infected whether randomized to vaccine or placebo. Our key assumption is that subjects randomized to the vaccine arm who become infected would also have become infected if randomized to the placebo arm. It is not known which of those subjects infected in the placebo arm would have been infected if randomized to the vaccine, but this can be modeled conditional on baseline covariates, the observed viral load, and a specified sensitivity parameter. I apply these methods to the first Phase III preventative HIV vaccine trial (VaxGen's trial of AIDSVAX B/B).
February 23, 2005
Sophia Rabe-Hesketh
Educational Statistics & Interdepartmental Group in Biostatistics
University of California, Berkeley
GENERALIZED LINEAR LATENT AND MIXED MODELS FOR NOMINAL DATA
As the name implies, generalized linear latent and mixed models (GLLAMMs) are multilevel latent variable models. The latent variables may represent true variables measured with error or random coefficients. Alternatively, they may be used merely to induce dependence among different responses, possibly of mixed types. Latent variables can be regressed on other latent and observed variables varying at the same or higher levels.
I will begin by describing the GLLAMM framework and then consider models for nominal responses.
Two important examples of nominal data are unordered polytomous responses, such as treatment chosen by a physician, and rankings, such as the preference order of different beers. It is natural to formulate models for such responses in terms of the latent 'utility' or 'attractiveness' of the alternatives (treatments or beers), giving rise to the well-known multinomial logit model. When the data have a multilevel structure, dependence among the observed responses from the same cluster (given the covariates) can be thought of as arising from residual correlations among the underlying utilities. It is useful to structure these correlations using latent variables varying at different levels of the hierarchical dataset. The methodology will be applied to party choice and rankings from the 1987-1992 panel of the British Election Study. Three levels will be considered: elections, voters, and constituencies.
October 13, 2004
David Oakes
Department of Biostatistics and Computational Biology, University of Rochester
ON THE POTENTIAL FOR RISK REVERSAL DUE TO HETEROGENEITY SELECTION
Suppose that the total risk for experiencing an event is the sum of an observed portion (o) and an unobserved portion (u). The unobserved portion follows a population distribution dF(u) say. Consider now the risk of experiencing a second event, among people who have already experienced a first event. Even if the individual risks are unchanged (so that an individual who has risk o + u for the first event has the same risk o + u for a second event) the population level distribution of risks will change, following Bayes' theorem. We show how this can lead to risk reversal, whereby, among those who have suffered a first event, higher values of o are associated on average with lower risks for a second event. These musings were prompted by unexpected results from a study of the influence of coronary disease-related genotypes on recurrent cardiac events among patients who had experienced a myocardial infarction. The talk will describe the study, its unexpected results, and their possible explanation via this risk reversal phenomenon.
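The Bayes' theorem argument above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that "risk" is a per-period event probability o + u and that u follows a two-point distribution; the function name and the numbers are hypothetical and are not taken from the study described in the talk.

```python
def second_event_risk(o, u_values, u_probs):
    """Mean risk o + u among people who have already had a first event.

    Bayes' theorem reweights the distribution of u by the first-event
    probability (o + u), so the selected group is enriched for high u.
    """
    weights = [(o + u) * p for u, p in zip(u_values, u_probs)]
    total = sum(weights)  # marginal probability of a first event
    return sum((o + u) * w for u, w in zip(u_values, weights)) / total

# Hypothetical population: 1% carry a large unobserved risk of 0.9.
U_VALUES, U_PROBS = [0.0, 0.9], [0.99, 0.01]
low = second_event_risk(0.01, U_VALUES, U_PROBS)   # smaller observed risk o
high = second_event_risk(0.05, U_VALUES, U_PROBS)  # larger observed risk o
```

In this example low is about 0.44 and high is about 0.19, so the group with the larger observed risk o has the smaller average risk of a second event. Under these assumptions the reversal occurs when the variance of u exceeds the square of o plus the mean of u.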
September 22, 2004
Tianxi Cai
Department of Biostatistics, Harvard School of Public Health
SEMI-PARAMETRIC BOX-COX POWER TRANSFORMATION MODELS FOR CENSORED SURVIVAL OBSERVATIONS
The accelerated failure time model specifies that the logarithm of the failure time is linearly related to the covariate vector without assuming a parametric error distribution. In this article, we consider the semi-parametric Box-Cox transformation model, which includes the above regression model as a special case, to analyze possibly censored failure time observations. Inference procedures for the transformation and regression parameters are proposed via a resampling technique. Prediction of the survival function of future subjects with a specific covariate vector is also provided via point-wise and simultaneous interval estimates. All the proposals are illustrated with the data sets from two clinical studies.
January 28, 2004
Peter Gilbert
Statistical Center for HIV/AIDS Research & Prevention (SCHARP)
SENSITIVITY ANALYSES COMPARING OUTCOMES MEASURED ONLY IN A SUBSET SELECTED POST-RANDOMIZATION, WITH APPLICATION TO HIV VACCINE TRIALS
In many experiments researchers want to compare an outcome that is only measured in a subset of participants selected after randomization. For example, in HIV vaccine efficacy trials it is of interest to determine whether randomization to vaccine causes lower viral load, a quantity that only exists in infected subjects. To make a causal comparison and account for potential selection bias we propose a sensitivity analysis following the principal stratification framework set forth by Frangakis and Rubin (2002). Our goal is to obtain the average causal effect of treatment assignment on viral load at a given baseline covariate level in the always infected principal stratum (those who would have been infected whether they had been assigned to vaccine or placebo). We assume stable unit treatment values (SUTVA), randomization, and that subjects randomized to the vaccine arm who became infected would also have become infected if randomized to the placebo arm (monotonicity). Membership in the always infected stratum is unknown, but can be modeled conditional on randomization arm, infection status, covariates, the observed viral load, and a specified sensitivity parameter. The observed viral load is also modeled as a function of covariates and given treatment assignment. We can then obtain maximum likelihood estimates of the average causal effect conditional on covariates and the sensitivity parameter. This approach is extended to include censoring and non-continuous outcome variables. We apply our method to VaxGen's Phase III HIV vaccine trial, and conclude that vaccination has no significant effect on viral load.
November 5, 2003
John Kornak
UCSF/VA Medical Center Magnetic Resonance Unit
ISSUES IN THE STATISTICAL ANALYSIS OF fMRI DATA
Functional magnetic resonance imaging (fMRI) is a non-invasive imaging technique capable of detecting changes in cerebral activity. fMRI experiments typically focus on the detection of these changes by quantifying local hemodynamic responses to brain activity, and on estimating their magnitude and extent. The complex biological mechanisms underlying the phenomenon observed via fMRI are not fully understood, leading to many widely differing statistical approaches to data analysis.
This talk will describe several alternative statistical approaches for fMRI data and focus on issues relating to the compensation for temporal-smearing effects of the hemodynamic response and spatial analysis of response parameters/statistics. Evidence is presented of the need to consider local shape variation of the hemodynamic response function in order to optimally estimate brain activation levels. A fully Bayesian spatial model, taking estimated brain activation levels as input, is then constructed for the purpose of determining regions of activation. In contrast to the usual spatial thresholding approaches, this model inherently trades local hemodynamic response magnitude with spatial extent.
October 15, 2003
Michael LeBlanc
Department of Biostatistics, University of Washington /
Fred Hutchinson Cancer Research Center
ADAPTIVE RISK GROUP REFINEMENT
Combinations of univariate clinical decisions, such as {serum calcium = 3} and {age <60}, are often easier to interpret than smooth or additive decision boundaries obtained from fitting additive regression models. Tree-based or recursive partitioning methods, such as the Classification and Regression Tree (CART) algorithm due to Breiman, Friedman, Olshen and Stone (1984), are widely used for deriving such simple rules.
Tree based methods, while useful, do not directly allow for calibration of patient groups in terms of average patient outcome or the proportion of patients in the group. For instance, in developing a clinical trial for a new aggressive therapy, one must limit the study to only those patients with sufficiently poor prognosis appropriate for the toxicity associated with that therapy. However, the poor prognostic group must include a sufficient proportion of the patients with that disease to make patient accrual to the clinical trial feasible.
Motivated by the Patient Rule Induction Method (PRIM) of Friedman and Fisher (1999), we construct interpretable prognostic rules based on a sequence of "box shaped" regions in the predictor space indexed by the fraction of patients in the prognostic group. Simulations are used to study the properties of the method and compare it to constructing prognostic groups based on regression trees and linear proportional hazards models. We consider graphical methods for understanding constructed regions and also describe an analysis of several completed clinical trials for patients with multiple myeloma.
CBMB October 3, 2003
Alexander Schliep
Max Planck Institute
ANALYZING GENE EXPRESSION TIME-SERIES DATA
A number of microarray datasets provide some information about how cellular processes cause changes over time. Observing and measuring those changes over time allows insights into the how and why of regulation. However, the proper way of analyzing the resulting time-course data is still very much an issue under investigation. The inherent time dependencies in the data suggest that clustering techniques which reflect those dependencies might be more appropriate.
We propose an approach based on Hidden Markov Models (HMMs) to account for the horizontal dependencies along the time axis in time-course data and to cope with the prevalent errors and missing values. The HMMs are used within a model-based clustering or mixture-modeling framework.
The inherent robustness problems with clustering noisy data are circumvented by adding partial information about known groups of data in the complete data set. This is known as partly supervised learning.
GQL, a graphical user interface, allows us to interactively explore a time-course expression profile dataset.
October 2, 2003
Kelly H. Zou
Department of Radiology, Brigham and Women's Hospital
Department of Health Care Policy, Harvard Medical School
STATISTICAL VALIDATION OF IMAGING ANALYSIS
The validity of image segmentation is an important issue in image processing because it has a direct impact on surgical planning. We examined classification accuracy in imaging analysis based on three two-sample validation metrics against the estimated composite latent gold standard, which was derived from several experts' manual segmentations by an expectation-maximization (EM) algorithm called STAPLE. The distribution functions of the tumor and control pixel data were parametrically assumed to be a mixture of two beta distributions with different shape parameters. We estimated the corresponding receiver operating characteristic (ROC) curve, Dice similarity coefficient, and mutual information, over all possible decision thresholds. Based on each validation metric, an optimal threshold was then computed via maximization. We illustrated these methods using magnetic resonance (MR) imaging data on three radiologic examples: (1) accuracy of brain tumor segmentation, (2) reliability of elbow medial collateral ligament assessment, and (3) hidden gold standard in prostate peripheral zone segmentation for brachytherapy. The performances of these validation metrics were investigated via Monte-Carlo simulation. Extensions of incorporating spatial correlation structures were briefly considered under a Markov random fields model.
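The step of choosing an optimal threshold by maximizing a validation metric can be sketched for the Dice similarity coefficient. This is a simplified illustration: it treats the reference segmentation as a known binary mask instead of the STAPLE-estimated latent gold standard, it does not fit the beta mixture model, and the function names and toy data layout are assumptions.

```python
def dice(pred, truth):
    """Dice similarity coefficient 2*|A and B| / (|A| + |B|) for two binary masks."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(map(bool, pred)) + sum(map(bool, truth))
    return 2 * overlap / size if size else 1.0

def best_threshold(scores, truth):
    """Return (threshold, dice) maximizing Dice over the observed score values.

    A pixel is classified as positive when its score is >= the threshold.
    """
    candidates = sorted(set(scores))
    return max(((t, dice([s >= t for s in scores], truth)) for t in candidates),
               key=lambda pair: pair[1])
```

For example, with pixel scores [0.1, 0.4, 0.35, 0.8, 0.9] and reference labels [0, 0, 1, 1, 1], the search selects the threshold 0.35 with a Dice coefficient of 6/7.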
CBMB August 13, 2003
Eric Schadt
Rosetta Inpharmatics (A wholly owned subsidiary of Merck)
INFERRING CAUSALITY FROM MICROARRAY DATA IN SEGREGATING POPULATIONS: AN UNBIASED APPROACH TO THE IDENTIFICATION OF TARGETS FOR COMMON HUMAN DISEASES
A key goal of biomedical research is to identify the basis of common human diseases. Here I present a procedure for the identification of key drivers of common human diseases using gene expression data in a segregating mouse population. Central to this procedure is the integration of genetic and gene expression information with clinical trait data to infer causal patterns of association between key drivers and disease phenotypes. This procedure allows for the objective identification of druggable targets for common human diseases. Specific examples on the application of this method to obesity traits will be provided.
August 1, 2003
Jason P. Fine
Departments of Statistics and Biostatistics, University of Wisconsin, Madison
COMPARING NON-NESTED COX MODELS
This talk will focus on choosing between two, possibly non-nested proportional hazards models using the partial likelihood ratio test. I will present the limiting distribution of the test under general conditions. The multiplicative hazards models being fitted may be misspecified and the true model is not assumed to be contained by either of the fitted models. The null hypothesis is that the models are equidistant in Kullback-Leibler distance applied to the rank likelihood. The ratio statistic is consistent for the model which is closer to the truth. However, its distribution takes one of two forms and depends on the unknown data generating mechanism, which complicates inference. A two-step testing procedure is proposed which is valid regardless of the true model. The first step involves a novel test for the equality of the fitted models which is separate from the partial likelihood. The methodology has important applications in model assessment. A reanalysis of the well-known PBC data will be used to demonstrate its utility in selecting the functional forms of covariates and relative risks.
June 11, 2003
Marvin Zelen
Department of Biostatistics, Harvard University
THE EARLY DETECTION OF DISEASE AND STOCHASTIC MODELS
Early detection of disease presents opportunities for using existing technologies to significantly improve patient benefit. The possibility of diagnosing a chronic disease early, while it is asymptomatic, may result in treating the disease at an earlier stage, leading to better prognosis. For many cancers, diabetes, tuberculosis, cardiovascular disease, HIV related diseases, etc., early detection may lead to better prognosis when combined with an effective treatment. However, gathering scientific evidence to demonstrate benefit has proved to be difficult. Clinical trials have been arduous to carry out, because of the need to have large numbers of subjects, long follow-up periods and problems of non-compliance. Implementing public health early detection programs has proved to be costly and has not been based on analytic considerations. Many of these difficulties are a result of not understanding the early disease detection process and the disease natural histories. One way to approach these problems is to model the early detection process. This talk will discuss stochastic models for the early detection of disease. Breast cancer will be used to illustrate some of the ideas. The talk will discuss breast cancer randomized trials, stage shift and benefit, scheduling of examinations, issues of screening younger women and those at elevated risk and the planning of trials.
CBMB May 19, 2003
Laura Lazzeroni
Division of Biostatistics, Stanford University
ALLELE SHARING AND ALLELIC ASSOCIATION IN AFFECTED SIB PAIRS
The genotypes of affected siblings contain information about both allele sharing and allelic association, either of which can point to the presence of a disease-related gene. Allele sharing tests, also known as linkage or identity-by-descent tests, are designed to detect whether siblings who share the same disease also tend to inherit the same alleles at a genetic locus. Allelic association tests, such as the transmission-disequilibrium test, are designed to detect the association of a disease and a particular allele in the population at large. Whether allele sharing or allelic association is stronger and which type of test is more powerful depends on unknown factors, including the true genetic disease model at any linked risk-related loci, the strength of any other genetic and environmental risk factors and the population distribution of those factors. The difference in power can be substantial. I will discuss a test designed to detect both allele sharing and allelic association that is as powerful, or nearly as powerful, in any setting as the more powerful of the sharing and association tests. Underlying the test is a mixture model formulated in terms of family-specific relative risks. I will show how this model also yields interesting clues about which genetic and population models are most plausible in light of observed levels of allele sharing and association. This information can be used to decide whether an implicated locus provides a promising lead for further research.
May 9, 2003
Glen Satten
The Centers for Disease Control
will speak on two topics:
IS THERE EVIDENCE THAT SEROCONVERTING REPEAT BLOOD DONORS CHANGE THE PATTERN OF THEIR BLOOD DONATIONS AT SEROCONVERSION - OR, HOW SPECIAL IS A "SPECIAL" INTERVAL?
Glen A. Satten, George Schreiber, Simone Glynn, Michael P. Busch, David Wright, Fanhui Kong, Steve Kleinman and the REDS study group
Length-biased sampling occurs in renewal processes when the probability that an interval is selected is proportional to the length of the interval. This can occur when intervals are selected because they contain an event that is independent of the renewal process and occurs with constant hazard. For example, if the times between donations for repeat blood donors are independent and identically distributed, and if the donor seroconverts to HIV (develops antibodies that indicate infection with human immunodeficiency virus), then the interval between the last HIV seronegative and first HIV seropositive donation is expected to be longer than that donor's previous time intervals between donations. We develop hypothesis tests to determine if the relationship between the typical and length-biased intervals is as expected, or if there is departure from length-biased sampling. We further develop a regression method to determine if there are covariates that explain the departure from length-biased sampling. Our approach is motivated by the question of whether there is evidence that repeat blood donors who develop antibodies to HIV or other viral infections change their donation pattern in some way because of seroconversion.
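The length-bias effect described above can be illustrated with a short calculation. It assumes, as a simplification of the constant-hazard setting, that the event time is uniformly distributed over the observed donation history, so each interval is selected with probability proportional to its length. The function name and the donation gaps are hypothetical.

```python
def length_biased_mean(intervals):
    """Expected length of the interval that contains a uniformly timed event.

    Each interval is selected with probability x / sum(x), so the expected
    selected length is sum(x^2) / sum(x), which is at least the ordinary mean.
    """
    return sum(x * x for x in intervals) / sum(intervals)
```

For hypothetical donation gaps of 60, 90 and 120 days, the ordinary mean is 90 days, while the length-biased mean is about 96.7 days.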
CASE-CONTROL STUDIES WITH DIFFERENTIAL NUMBERS OF MEASUREMENTS
Glen A. Satten, W. Dana Flanders
By accident or by design, there may be a systematic difference in the number of exposure measurements that are available for case patients and for control patients. For example, in a recent study of Stachybotrys atra (Etzel et al. 1998 Arch Pediatr Adolesc Med. 152:757-62), the spore count in the homes of case patients was compared to the spore count in the homes of matched control probands. For case patients, an average of 6 environmental measurements were taken, while for control probands an average of 3 environmental measurements were taken. To account for the difference in the number of measurements, Etzel et al. used the average spore count for each study participant as a summary of their exposure. While this appears reasonable, we show that it may result in bias. We present a novel estimator that gives valid inference even when the number of measurements in cases is systematically different from the number in controls. We also consider analyses that use the maximum recorded exposure for each study participant.
March 13, 2003
Karl Broman
Department of Biostatistics, Johns Hopkins University
IDENTIFYING ESSENTIAL GENES IN M. TUBERCULOSIS BY RANDOM TRANSPOSON MUTAGENESIS
Mycobacterium tuberculosis (Mtb) is the organism which causes tuberculosis. Its circular genome of 4.4 Mbp has been completely sequenced and contains 4250 genes. In random transposon mutagenesis, one creates a library of mutants, each of which contains a single insertion of a transposon. Here we consider the Himar1 transposon, which inserts at random at a dinucleotide TA. The Mtb genome contains 74,403 such TA sites. We consider data on a library of 1425 transposon insertion mutants; for each mutant, the particular TA site at which insertion occurred has been determined. That a mutant with transposon insertion within a particular gene is viable indicates that the gene is not essential for the viability of the organism. Genes that are essential for the viability of the organism will never show up in such a library of insertion mutants.
We describe a Bayesian method for estimating the proportion of essential genes in the Mtb genome and for identifying genes likely to be essential, on the basis of such data. The prior distribution for the number of essential genes was taken to be uniform. A Gibbs sampler was used to estimate the posterior distribution.
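A rough back-of-the-envelope calculation shows why a gene's absence from the library is weak evidence of essentiality on its own. The sketch below is not the Bayesian model from the talk. It assumes that insertions are independent, that every TA site is equally likely to be hit, and that every insertion yields a viable mutant; the function name is hypothetical, while the site and mutant counts come from the abstract.

```python
TA_SITES = 74_403   # TA sites in the Mtb genome (from the abstract)
MUTANTS = 1_425     # insertion mutants in the library (from the abstract)

def prob_no_insertion(sites_in_gene, total_sites=TA_SITES, n_mutants=MUTANTS):
    """P(no mutant hits a given gene) under the simplifying assumptions that
    insertions are independent, uniform over TA sites, and always viable."""
    return (1 - sites_in_gene / total_sites) ** n_mutants
```

Under these assumptions, a gene with the average number of TA sites (74,403 / 4250, about 17.5) would be missed with probability of about 0.72, so most non-essential genes would also be absent from a library of this size.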
CBMB March 19, 2003
Serafim Batzoglou
Assistant Professor of Computer Science, Stanford University
ALIGNMENTS, MOTIFS, AND MICROARRAYS
High-throughput experimentation technologies such as whole-genome sequencing and gene microarrays are transforming the way we do biology. From the traditional one-organism, few-genes framework we are quickly moving to many-organism, whole-genome studies. These are powered by algorithms, systems, and paradigms from computer science. In this talk we will cover some of the computational techniques we develop towards high-throughput biology. We will talk about methods for whole-genome multiple alignment and application to the human/mouse/rat genomes, gene microarray expression analysis, and regulatory motif-finding based on cross-species conservation and microarray measurements.
February 27, 2003
Hongzhe Li
Rowe Program in Human Genetics, UC Davis School of Medicine
THE ADDITIVE GENETIC GAMMA FRAILTY MODELS FOR GENETIC LINKAGE AND ASSOCIATION ANALYSIS
Many complex human diseases are due to multiple disease genes and both genetic and environmental risk factors. These diseases often also show variable age of disease onset. In order to incorporate both covariates and age of onset information into genetic analysis, we define an additive genetic gamma frailty model constructed based on the inheritance vectors. Within this modeling framework, we derive a retrospective likelihood ratio test for linkage and a score test for testing genetic association in the linked region using sibships data. Such tests can incorporate both affected and unaffected sibs, environmental covariates and age at disease onset or censoring information, and therefore provide a practical solution to mapping genes for complex diseases with variable age of onset. Simulation studies indicate that the proposed methods have correct type 1 error rates and perform better than the commonly used methods for linkage or association analysis. We further demonstrate the methods using the simulated data set from GAW12 and a real data set of affected sib pairs of prostate cancer.
February 18, 2003
Florin Vaida
Assistant Professor of Biostatistics, Harvard School of Public Health
CONDITIONAL AKAIKE INFORMATION FOR MIXED EFFECTS MODELS
In this talk we show that for a mixed effects model where the focus is on cluster-specific inference, the commonly used definition for AIC is not appropriate. We propose a new definition for the Akaike information to be used in such conditional inference, and we show that for a linear mixed effects model this definition leads to an Akaike information criterion (AIC) where the penalty for the random effects is related to the effective number of parameters, rho, proposed by Hodges and Sargent (Biometrika 2001); rho reflects an interim level of complexity between a fixed-effects model with no cluster effects, and a corresponding model with fixed cluster-specific effects. We compare the conditional AIC with the marginal AIC (in current standard use), and we argue that the latter is only appropriate when the inference is focused on the marginal, population-level parameters. We discuss the relationship of the conditional AIC with the deviance information criterion and other related work. A pharmacokinetics data application is used to illuminate the distinction between the two inference settings, and the usefulness of the conditional AIC.
CBMB January 23, 2003
Tony Rossini
Assistant Professor of Biostatistics, University of Washington
STATISTICAL ANALYSIS OF THE GENETIC DIVERSITY OF PATHOGENS
The average pair-wise evolutionary distance between molecular sequences is a simple and approximate measure which describes genetic diversity. Unfortunately, computing the standard errors for this quantity is not straightforward. We describe two approaches for doing this. One is a simple and statistically valid approach for computing simple inferential statistics based on U-statistics. This method takes into account the correlation due to reuse of sequence clones. We assume that the resulting sequences are sampled independently within units such as people or bodily compartments being compared for genetic diversity. The second approach uses linear mixed effects models to accommodate variation, and allows for testing for the necessity of random effects. These approaches are examined using data from a study of HIV pathogenesis in children. We conclude with some general problems with this approach and suggest future research into computationally intensive methods using phylogenies which address those concerns.
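As a rough illustration of the quantity discussed in this abstract, the sketch below computes an average pair-wise distance over a set of aligned sequences. The abstract does not say which evolutionary distance was used, so the proportion of differing sites (a simple p-distance) is an assumption here, and the function names and toy sequences are hypothetical. The average over all unordered pairs is a U-statistic of degree two with the distance as its kernel; no standard error is computed.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ.

    Assumption: a plain p-distance stands in for whatever evolutionary
    distance the talk actually used.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_distance(seqs):
    """Average distance over all unordered pairs (a degree-2 U-statistic)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy clones, not data from the study.
clones = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
diversity = mean_pairwise_distance(clones)
print(round(diversity, 4))
```

Because every sequence appears in several pairs, the pair-wise distances are correlated, which is why a naive standard error that treats the pairs as independent would be misleading.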
January 15, 2003
Sandrine Dudoit
Assistant Professor of Biostatistics, University of California, Berkeley
STATISTICAL METHODS AND SOFTWARE FOR THE ANALYSIS OF DNA MICROARRAY EXPERIMENTS
DNA microarrays are part of a new class of biotechnologies that allow the monitoring of expression levels in cells for thousands of genes simultaneously. Microarray experiments are being performed increasingly in biological and medical research to address a wide range of problems. In cancer research, microarrays are used to study the molecular variations among tumors with the aim of developing better diagnosis and treatment strategies for the disease. Microarray experiments generate large and complex multivariate datasets. The application of sound statistical design and analysis principles can greatly improve the efficiency and reliability of these experiments throughout the data acquisition and analysis process. Efficient and well-designed statistical software is an essential link between the development of statistical methodology and its positive and timely impact on biology. I will present a survey of statistical methods and software for the analysis of DNA microarray data. I will discuss more specifically computing resources developed as part of the Bioconductor project. This collaborative effort aims to produce an open source and open development computing environment for the analysis of genomic data (www.bioconductor.org).
December 4, 2002
Patrick Heagerty
Associate Professor of Biostatistics, University of Washington
TIME-DEPENDENT ROC CURVES AND LONGITUDINAL DIAGNOSTIC ACCURACY
ROC curves are a popular method for displaying sensitivity and specificity of a continuous diagnostic marker, Y, for a binary disease variable, D. However, many disease outcomes are time-dependent, D(t), and ROC curves that vary as a function of time may be more appropriate. A common example of a time-dependent variable is vital status where D(t)=1 if a patient has died prior to time t and is 0 otherwise. In Heagerty, Lumley and Pepe (2000) we have proposed summarizing the discrimination potential of a marker Y, measured at baseline (t=0), by calculating ROC curves for cumulative disease or death by time t. In other study designs both the disease outcome, D(t), and the marker, Y(t), are measured longitudinally. For this situation there are alternative approaches to defining and estimating sensitivity and specificity. One approach directly estimates the distribution of the marker process conditional on the survival time using semi-parametric regression quantiles as described in Heagerty and Pepe (1999). A second approach uses "partly conditional" survival methods and more naturally handles censored onset times. The alternative definitions and estimation approaches will be illustrated using longitudinal pulmonary function measurements among cystic fibrosis subjects, and using the Multicenter AIDS Cohort Study (MACS) data.
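For readers who want the cumulative definitions in symbols, one common way to write them is sketched below. The threshold c, the false positive rate p, and the function names are notation assumed here rather than taken from the abstract; the marker Y is measured at baseline and D(t)=1 indicates disease or death by time t.

```latex
\begin{align*}
\mathrm{Sens}(c, t) &= P\{\, Y > c \mid D(t) = 1 \,\}, \\
\mathrm{Spec}(c, t) &= P\{\, Y \le c \mid D(t) = 0 \,\}, \\
\mathrm{ROC}_t(p)   &= \mathrm{Sens}(c_p, t)
  \quad \text{where } c_p \text{ satisfies } 1 - \mathrm{Spec}(c_p, t) = p, \\
\mathrm{AUC}(t)     &= \int_0^1 \mathrm{ROC}_t(p)\, dp .
\end{align*}
```

Each fixed t gives an ordinary ROC curve, so the family of curves indexed by t shows how the marker's discrimination changes over the follow-up period.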
November 15, 2002
John Nelder
Imperial College, London
EXTENDED LIKELIHOOD INFERENCE APPLIED TO A NEW CLASS OF MODELS
Random-effect models require an extension of Fisher likelihood. Extended likelihood (Pawitan) or, equivalently, h-likelihood (Lee & Nelder), provides a basis for likelihood inference applicable to random-effect models. The model class, called hierarchical generalized linear models (HGLMs), is derived from generalized linear models (GLMs). It supports (1) joint modelling of mean and dispersion; (2) GLM errors for the response; (3) random effects in the linear predictor for the mean, with distributions following any conjugate distribution of a GLM distribution; (4) structured dispersion components depending on covariates. Fitting of fixed and random effects, given dispersion components, reduces to fitting an augmented GLM, while fitting dispersion components, given fixed and random effects, uses an adjusted profile h-likelihood and reduces to a second interlinked GLM, which generalizes REML to all the GLM distributions. A single algorithm can fit all members of the class and does not require either prior distributions or the multiple quadrature needed for methods using marginal likelihood.
Model checking also generalizes from GLMs and allows the visual checking of all aspects of the model. The model class can be extended to cover correlated data expressed by random terms in the model, thus allowing fitting of spatial and temporal models with GLM errors. Correlations can be expressed by transformations of white noise, by structured covariance matrices, or by structured precision matrices. Finally the class can be extended to double HGLMs, which allow random effects in the dispersion model as well as in the mean. This leads, among other things, to a potentially large expansion of classes of models used in finance, the properties of which have still to be investigated.
October 23, 2002
Francesca Dominici
Assistant Professor of Biostatistics, Johns Hopkins University
ESTIMATING HEALTH EFFECTS OF AIR POLLUTION: STATISTICAL CHALLENGES, FINDINGS, AND POLICY IMPLICATIONS
Evidence from time series studies of air pollution and health is central to major policy decisions concerning the risk of death associated with air pollution exposure. The nature and characteristics of time series data make risk estimation challenging, requiring development of complex statistical methods able to detect effects that are very small relative to the combined effects of confounders and residual variation.
Using the National Mortality Morbidity Air Pollution Study, which includes time series data from the 90 largest US locations for the period 1987-1994, we discuss parametric versus semi-parametric approaches for estimating city-specific relative risks; hierarchical models for synthesizing city-specific estimates; and estimation of the exposure-response relation between air pollution and mortality.
We report national-level estimates of the health effects of air pollution, review their sensitivity to model choice and prior distributions and discuss policy implications.
Sources of model uncertainty call for a systematic assessment of model choice and for development of new methods. Importantly, the weight given to this scientific evidence in setting policy requires a level of confidence in findings that is difficult to attain in the small effects/many potential confounders context, regardless of the sophistication of the statistical approach.
May 20, 2002
Stephen Senn
Department of Statistical Science, University College London
TWO CHEERS FOR P-VALUES
P-values are a practical success but a critical failure. Scientists the world over use them but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous but even the modern frequentist has little time for them.
The invention of P-values is often mistakenly ascribed to RA Fisher but in fact they are far older, dating back at least as far as Daniel Bernoulli's significance test of 1734 regarding the inclinations of the planetary orbits. The Bayesian Karl Pearson also used them in his famous paper of 1900 on the chi-square goodness of fit test, some 25 years before the publication of Fisher's influential Statistical Methods for Research Workers.
Recently there has been a growing campaign against their use in medical statistics. The journal Epidemiology has even banned them. Bayesian critics have drawn attention to the fact that a just significant result has a moderate replication probability whilst failing to note that this is a desirable and necessary property shared by Bayesian statements. P-values have even been attacked in the popular press.
In this talk I shall consider whether there are any grounds for continuing to use this ubiquitous but despised device.
April 22, 2002
Ying Lu
Department of Radiology, UCSF
ON THE EQUIVALENCE OF TWO DIAGNOSTIC TESTS BASED ON PAIRED OBSERVATIONS
Equivalence of two diagnostic tests is a common problem in medical research. Often we want to determine if a new diagnostic test is as good as the standard reference test. Sometimes, we are interested in an inexpensive test that may have an acceptable inferiority in sensitivity or specificity. While hypothesis testing procedures and sample size formulas for the equivalence of sensitivity or specificity alone have been proposed, very few studies have discussed simultaneous comparisons for both indications. In this paper, we present three different hypothesis testing procedures and sample size formulas for simultaneous comparison of sensitivity and specificity based on paired observations and with known disease status. These statistical procedures are then used to compare two classification rules that identify women at risk of future osteoporotic fracture. Simulation experiments demonstrate that the new tests and sample size formulas give the appropriate type I and II error rates. Differences between our approach and the approach of Lui and Cumberland (2001) are discussed. This is joint work with Drs. H. Jin, S.T. Harris, and H.K. Genant. The research is supported by NIH grant R03 AR47104.
April 18, 2002
Su-Chun Cheng
Department of Statistics, Texas A&M University
APPLICATIONS OF SEMIPARAMETRIC TRANSFORMATION MODELS: CLUSTERED FAILURE TIME DATA AND COVARIATE MEASUREMENT ERROR
The Cox model has been used extensively to model univariate failure times as a function of covariates in the analysis of clinical trials. Nevertheless, its proportional hazards assumption may be questionable for some data. To this end, Dabrowska & Doksum (1988), Cheng, Wei & Ying (1995, 1997) and Scharfstein, Tsiatis & Gilbert (1998) studied a class of semiparametric transformation models, under which an unknown transformation of the event time is linearly related to the covariates with various completely specified error distributions. This class of regression models, which includes the proportional hazards and proportional odds models as special cases, provides useful alternatives to the Cox model for analyzing survival data. The methods for univariate event times, however, may be inappropriate if the data consist of a large number of small clusters of correlated failure times. Also, in the presence of covariate measurement error, survival analysis with the observed covariate may yield a biased estimate for the regression parameter. Existing research on these two topics has focused on using the Cox model with a frailty for clustered event times and on adapting the Cox model to account for covariates with measurement errors. In this talk, I will present separate applications of the transformation models that are generalized to handle clustered survival data and to accommodate covariate measurement error.
CBMB March 14, 2002
Xiaole Liu
Stanford Medical Informatics
DISCOVERY OF TRANSCRIPTION FACTOR BINDING SITES USING COMPUTATIONAL STATISTICS
The rapid development of sequencing technology has enabled the human and many other genomes to be sequenced and made publicly available. Microarray technology has also become considerably more robust and sensitive. The combination of the two allows biologists to study gene expression and transcription regulation at the genome level. Given a set of upstream DNA sequences whose downstream genes are clustered together based on similarity in gene expression profile, or a set of DNA sequences enriched in chromatin immunoprecipitation followed by microarray experiments (ChIP-array), it is desirable to conduct computational analysis to find common sequence motifs that are the potential transcription factor binding sites regulating transcription.
I will review the established approaches for discovering common DNA motifs in a set of sequences, and introduce two computational statistics approaches, BioProspector and MDscan.
BioProspector searches for common sequence motifs from any general cluster of DNA sequences, especially potential transcription factor binding sites from upstream sequences of genes clustered by expression profile similarity. BioProspector adopts a Gibbs sampling motif discovery strategy, but provides many improvements. Motifs can have one-block, two-block, or two-block palindromic patterns. BioProspector allows variable copies of a motif per sequence, and uses a background model with Markov dependency to improve the specificity of motifs. The statistical significance of a discovered motif can be calculated by Monte Carlo simulation. Current results for testing each BioProspector feature have been very encouraging. A BioProspector web site has been set up for biologists to load their sequences on the server for motif discovery. Another program, MatrixScan, has been developed to search the genome for more potential sites using a discovered motif matrix.
MDscan is a fast and novel algorithm that looks for motifs from a set of sequences when one has confidence that a subgroup of the sequences contains the motif more abundantly. It can be used to find protein-DNA interaction sites from sequences selected by ChIP-array experiments because the sequences highly enriched by ChIP-array are very likely to contain the real protein-DNA interaction sites, and with multiple copies per sequence. The comparison of MDscan with several other motif-finding programs shows the advantage of MDscan in both speed and accuracy. It also succeeds in identifying the correct motifs from all published ChIP-array experiments.
CBMB March 4, 2002
Ru-Fang Yeh
Department of Biology, Massachusetts Institute of Technology
PREDICTING HOMOLOGOUS GENE STRUCTURES AND EXONIC SPLICING ENHANCERS IN THE HUMAN GENOME
The sequence of the human genome provides the foundation for new approaches to study the organization and functions of human genes. In this talk, I will demonstrate the use of sequence analysis methods to address two different but closely related problems - identification of genes and exonic splicing enhancers.
A major challenge following the completion of the human genome project is to identify the locations and encoded protein sequences of all human genes. We have developed GenomeScan, a new gene identification program which combines the power of an ab initio gene-finding algorithm, as in Genscan, with database search results (such as blastX) in an integrated model. Accuracy from extensive testing and results of the application of GenomeScan to 2.7 billion bases of publicly available human genomic DNA will be discussed.
The vast amount of sequence data also allows us to study the association of sequence content with various biological processes. Our PROFILER method uses a statistical analysis of exon-intron and splice site composition to screen for short oligonucleotide sequence motifs in exons that enhance pre-mRNA splicing. Representatives of the predicted motifs were found to possess significant enhancer activity when tested in vivo, while point mutants exhibited sharply reduced activity as predicted. The experimental results verified the ability of PROFILER to predict the splicing phenotypes of exonic mutations in human genes.
December 13, 2001
David Strauss
The Life Expectancy Project
MORTALITY RESEARCH AT THE UC LIFE EXPECTANCY PROJECT
The Life Expectancy Project is a San Francisco-based research and consulting group, formerly housed at UC Riverside. The group, which works with a California database of 235,000 persons with mental disability, has published more than 100 articles, mostly on epidemiological and actuarial studies of mortality.
The talk will be an overview of a broad range of topics:
- Life expectancy, the life table, and their uses
- Comparative mortality of mentally retarded people in state facilities versus community care
- Biostatistical methods for longitudinal mortality studies
- Life expectancy in cerebral palsy, spinal cord injury, traumatic brain injury, etc.
- Life expectancy in the legal system
Details on most of this can be found on the group's web site, www.LifeExpectancy.com.
September 24, 2001
Jacqueline Law
Pharsight Corporation
THE JOINT MODELING OF A LONGITUDINAL DISEASE PROGRESSION MARKER AND THE FAILURE TIME PROCESS IN THE PRESENCE OF CURE
In this talk I will present a cure model which incorporates a longitudinal disease progression marker. The model is motivated by studies of patients with prostate cancer undergoing radiation therapy. The patients are followed until recurrence of the prostate cancer or censoring, with the PSA marker measured intermittently. Some patients are cured by the treatment and are immune from recurrence. A joint-cure model is developed for this type of data, in which the longitudinal marker and the failure time process are modeled jointly, with a fraction of patients assumed to be immune from the endpoint. A hierarchical nonlinear mixed effects model is assumed for the marker and a time-dependent Cox proportional hazards model is used to model the time to endpoint. The probability of cure is modeled by a logistic link. The parameters are estimated using a Monte Carlo EM algorithm. Importance sampling with an adaptively chosen t-distribution and variable Monte Carlo sample size is used. This model is fitted to a prostate cancer database. A simulation study is also performed. It is found that the parameter estimates have better statistical properties when the longitudinal disease progression marker is incorporated into the cure model. The classification of the censored patients into the cure group and the susceptible group based on the estimated conditional recurrence probability from the joint-cure model has a higher sensitivity and specificity, and a lower misclassification probability compared with the standard cure model. The addition of the longitudinal data has the effect of reducing the impact of the identifiability problems in a standard cure model and can help overcome biases due to informative censoring.
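The "standard cure model" that this abstract compares against is usually written as a two-component mixture. The sketch below gives that generic form only; the symbols (cure probability pi, covariates z, coefficients gamma, susceptible survival function S_u) are assumed notation and do not reproduce the speaker's joint model, which additionally links S_u to the longitudinal PSA process.

```latex
\begin{align*}
S(t \mid z) &= \pi(z) + \{1 - \pi(z)\}\, S_u(t \mid z), \\
\operatorname{logit} \pi(z) &= \gamma^{\top} z .
\end{align*}
```

Because S(t | z) levels off at pi(z) rather than falling to zero, a censored patient may be either cured or susceptible but not yet recurred, which is the source of the identifiability problems mentioned in the abstract.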
July 5, 2001
Alistair Scott
Department of Statistics, University of Auckland, New Zealand
Case Control Studies with Complex Sampling
The use of complex sampling designs in population-based case-control studies is relatively common, particularly for sampling the control population. This is prompted by the usual cost and logistical benefits conferred by multi-stage sampling. Complex sampling is typically ignored in the analysis, but with the advent of packages like SUDAAN, survey-weighted analyses that take account of the sample design can be carried out routinely. This talk explores some more efficient alternatives, which can also be implemented using readily available software. We also look at robustness of the procedures when the model is mis-specified.
April 10, 2001
Jack Kalbfleisch
Department of Statistics and Actuarial Science, University of Waterloo
Bootstrapping the Estimating Function
In the Estimating Function (EF) Bootstrap, the distribution of the estimating function is estimated by resampling its terms using bootstrap techniques. Studentized versions of the EF Bootstrap yield methods that are invariant under reparametrizations and yield higher order approximations to confidence regions. This approach often has substantial advantages, both in computation and accuracy, over more traditional bootstrap methods and it applies to a wide class of practical problems where the data are independent but not necessarily identically distributed. We will discuss applications in this context and extensions to estimating components of a vector parameter. Simulations are used to compare the EF bootstrap with competing methods in several examples including the common means problem and nonlinear regression. We will conclude with some discussion of extensions of this approach to autoregressive models and an outline of a number of problems for further study.
March 26, 2001
Heping Zhang
Associate Professor of Biostatistics, Yale University School of Medicine
Multivariate Adaptive Splines Models for the Analysis of Longitudinal Data (MASAL)
A mixed-effects multivariate adaptive splines model will be presented to analyze longitudinal or growth curve data that may or may not have been collected through a regular measurement schedule. The MASAL algorithm by Zhang (1994, 1997, 1999) will be described and applied to determine the nonparametric fixed-effects in the mixed-effects multivariate adaptive splines model. The potential of this procedure is illustrated with the analysis of a data set on the effect of cocaine use by pregnant women on the growth of their infants after birth. In addition, residual diagnostics are presented to validate the mixed-effects multivariate adaptive splines model.
March 19, 2001
Chengcheng Hu
Department of Biostatistics, University of Washington
Cox Regression with Mismeasured or Missing Covariates
This talk deals with the estimation of the Cox proportional hazards model when covariates are measured with error or missing. For the measurement error problem, the classical additive measurement error model is considered, as well as a more general model which represents the mismeasured version of the covariate as an arbitrary linear function of the true covariates plus random noise. No distributional form is imposed on the covariates or the error. Assuming that the covariates are measured precisely for a validation set, we develop consistent and asymptotically normal estimators for the regression parameters and the cumulative baseline hazard function. Simulation studies indicate that the proposed estimators work well for practical sample sizes, and a real example is provided. The method is also adapted to the situation when only replicate measurements are available for the covariates, instead of a validation set. A similar approach is taken to study the Cox model with missing covariates. Imputed covariates are used and a class of modified partial likelihood score functions is proposed to correct the bias in the ordinary imputation approach. The resulting estimators are shown to be consistent and asymptotically normal, and their finite sample properties are explored using simulation.
March 12, 2001
Jason Fine
Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison
Risk Assessment via a Robust Probit Model, with Application to Toxicology
A number of frameworks may be used to assess the risk associated with a continuous toxicity outcome. In a rat study of aconiazide, a drug under investigation for treatment of tuberculosis, animals receiving high doses tended to experience increased weight loss. The goal of our analysis is to identify a "safe" dose. One approach is to formulate the effect of the exposure on the adverse effect with a simple normal model and to compute the risk function using tail probabilities from the standard normal distribution. This risk function depends heavily on the assumed model and may be sensitive to misspecification. A semiparametric alternative based on another definition of risk has recently been studied. However, it is not clear whether the two approaches are related. We explore a semiparametric normal model, in which an unknown transformation of the adverse response satisfies the linear model. It is demonstrated that this formulation unifies the two approaches, allowing for a coherent risk analysis of the dose-response data. The methodology includes estimation and inference for the unknown transformation in the semiparametric model for the continuous response. Novel model-checking techniques are proposed for diagnosing lack-of-fit, including a formal sup-norm test of the simple normal model. The aconiazide data serves as a case study for the risk assessment procedure.
March 6, 2001
Maja Pavlic
Group in Biostatistics, University of California, Berkeley
Modeling response to treatment using normal mixtures
Repeat measurements of patient characteristics are often used to assess response to treatment. If some proportion of patients does not respond to treatment at all, they behave as if they have not been treated. This definition leads to a mixture model description of responders and non-responders to treatment.
However, mixture models can be used to closely approximate any distribution. For this reason, model assumptions, in particular mixture dimensions, need to be checked. Choosing the number of mixture components is a nonregular model selection problem since the likelihood ratio test statistic does not follow its usual asymptotic distribution. We propose minimizing the estimated distance between the fitted and true model densities to choose a mixture of optimal dimension. The distances we consider are Kullback-Leibler and L2. Simulation studies indicate that the method of minimizing distance performs well in comparison to other available model selection functionals.
The Fracture Intervention Trial (FIT) was a randomized clinical trial of the osteoporosis drug alendronate. The statistical methods discussed are applied to the bone mineral density (BMD) change data from FIT. Based on the observed changes in BMD, we challenge the existence of non-responders to alendronate using the mixture model selection methods.
December 12, 2000
Joerg Rahnenfuehrer
Group in Biostatistics, University of California, Berkeley
Data Compression and Statistical Inference: Multivariate Permutation Tests for Clustered Data
The talk deals with the choice of clustering algorithms for multivariate data sets. We make use of a wide class of algorithms containing as special cases both the well known k-means algorithm and the Kohonen (1985) algorithm. These algorithms define partitions of a type investigated in depth by Poetzelberger and Strasser (1999).
We compare the quality of the clustering procedures by first applying them to multivariate data sets and then treating a k-sample problem. For computing the test statistics the data points are replaced by their conditional expectations with respect to the MSP-partition. We present Monte Carlo simulations of power functions for tests that are carried out as multivariate permutation tests.
The results show a vital and decisive connection between the optimal choice of the clustering algorithm and the tails of the probability distribution of the data. Especially for distributions with heavy tails the performance of k-means type algorithms totally breaks down.
Finally we demonstrate the influence of the choice of the cluster algorithm on the quality of the compression of high dimensional real data sets of microarray experiments, where poorly performing algorithms are often applied.
October 26, 2000
Mark van der Laan
Department of Statistics and Group in Biostatistics, University of California, Berkeley
Statistical Inference with Microarray Data
Large-scale gene expression studies are becoming increasingly common as new microarray technology makes it possible to capture the gene expression profiles for thousands of genes at once. Statistical inference with such high dimensional data structures (and, all too often, relatively small samples) is a challenging analytical problem. In the current microbiology literature, (hierarchical) cluster analysis methods have been used to find groups of genes with similar patterns of expression. Such methods are purely exploratory and, thus, do not provide any type of significance level for features in the data or any opportunities for purposeful experimental design. We propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. We focus on rules that operate on mean vectors and covariance (i.e. correlation) matrices; we also employ the output of a standard cluster analysis methodology ("partitioning around medoids" or PAM) to further refine the subset by exploiting the dependence of certain subsets of genes. An estimate of the target subset is obtained by applying the procedure to the sample statistics (e.g. mean and covariance). The parametric bootstrap is used to estimate the distribution of these estimated subsets; relevant summary measures of this distribution are also proposed. We prove consistency of the subset estimates and asymptotic validity of this parametric bootstrap under the assumption that the sample size converges faster to infinity than the logarithm of the number of genes. The practical performance of the method is illustrated with a simulation study. The method has also been used to analyze cancer-patient data.
September 28, 2000
Mark Segal
Division of Biostatistics, UCSF
Clustering of Translocation Breakpoints
Translocation, the physical movement of genetic material from one chromosome to another, can juxtapose portions of two cellular genes to generate chimeric gene products and/or alter regulation of gene expression. This provides a putative oncogenic stimulus and, indeed, several gene fusions from translocations have been identified in leukemias, lymphomas, and sarcomas.
The statistical analysis of translocation breakpoints has focussed on the extent to which they cluster. Somewhat questionable methods have been employed in this regard. After highlighting these shortcomings, we introduce a variety of approaches including scan statistics, smoothed bootstrap, and gap statistics, that provide a comprehensive means for appraising clustering. We apply this battery to TEL-AML1 translocations, the most common translocation in childhood ALL. Results obtained indicate much weaker evidence for clustering than previously published.
April 10, 1999
Dan Scharfstein
Assistant Professor, Department of Biostatistics
Johns Hopkins School of Hygiene and Public Health
Methods for Conducting Sensitivity Analysis of Trials with Potentially Non-ignorable Competing Causes of Censoring
We consider inference for the treatment-arm mean difference of an outcome that would have been measured at the end of a randomized follow-up study if, during the course of the study, patients had not initiated a non-randomized therapy or dropped out. We argue that the treatment-arm mean difference is not identified unless unverifiable assumptions are made. We describe identifying assumptions that are tantamount to postulating relationships between the components of a pattern-mixture model, but can also be interpreted as imposing restrictions on the cause-specific censoring probabilities of a selection model. We then argue that although sufficient for identification, these assumptions are insufficient for inference due to the curse of dimensionality. We propose reducing dimensionality by specifying semiparametric cause-specific selection models. These models are useful for conducting a sensitivity analysis to examine how inference for the treatment-arm mean difference changes as one varies the magnitude of the cause-specific selection bias over a plausible range. We provide methodology for conducting such sensitivity analysis and illustrate our methods with an analysis of data from the AIDS Clinical Trial Group (ACTG) study 002.
This is joint work with Andrea Rotnitzky and James Robins (Harvard School of Public Health) and Ting-Li Su (Quintiles, Inc.).
March 13, 1999
Michael I. Jordan
Department of Statistics, University of California, Berkeley
Graphical models and variational approximation
Graphical models provide an elegant formalism for probabilistic computation that unifies much of the literature on complex probabilistic models in computer science, engineering, statistics, and physics. For sparse graphs (e.g., graphs in the form of chains or trees, such as Kalman filters, hidden Markov models, and probabilistic decision trees), there exist general algorithms for probabilistic inference that are exact, efficient and practical. For dense graphs, however, the exact algorithms are often (hopelessly) inefficient, and this fact has hindered the application of this richer class of models to real-life problems. I discuss variational methodology, which provides a general framework for approximate inference in graphical models. I illustrate variational methods with examples of applications to problems in prediction, diagnosis and control.
September 29, 1999
Hongzhe Li
University of California Davis School of Medicine
THE ADDITIVE GAMMA FRAILTY MODELS FOR LINKAGE ANALYSIS OF AGE-AT-ONSET VARIATION FOR COMPLEX DISEASES
Age-at-onset data arise frequently in mapping studies of complex traits, for which model-free methods have been widely used recently. Since these complex traits often lack simple inheritance patterns, the robust feature of model-free methods, i.e. not requiring specification of mode of inheritance, is especially desirable. However, current methods in nonparametric linkage analysis concentrate mainly on affected relative pairs or affected family members, with age-of-onset information either ignored or taken into account by specifying age-dependent penetrances for liability classes.
I will first demonstrate that the power of these methods can be greatly affected by ages at onset, and that naively combining affected subjects with different ages at onset can reduce the power to detect linkage. I will then present an additive gamma frailty model for linkage analysis of age-at-onset variation. For each individual, I define a frailty as the sum of the frailty due to the putative disease locus, based on the inheritance distribution, and the frailty due to the additive polygenic effect, and use the Cox proportional hazards model to model age at onset. I will show that the variance of the frailty, and therefore the variance of age at onset, can be written as the sum of the variance due to the putative disease gene and the variance due to polygenes, and that a test of linkage can be formulated as a test of zero variance due to the putative disease gene. I will derive the conditional hazard ratio parameter for sib pairs and define a likelihood-ratio-based Lod score statistic under the proposed model. Finally, I will present simulation studies to show that the proposed test has the correct type I error rate yet gains power compared to tests in which age-at-onset data are ignored.
July 15, 1999
Nancy Flournoy
Department of Mathematics and Statistics, American University
STATISTICAL SCIENCE: A CASE STUDY IN DISCOVERY
My current research in Adaptive Designs is motivated by my 15-year tenure with the team that pioneered bone marrow transplantation at the Fred Hutchinson Cancer Research Center. In some medical investigations I became extremely dissatisfied with the rate of learning using existing statistical design methodologies. I will briefly describe the experimental conditions that led to this frustration and motivate my research interests. I will outline progress that has been made.
Although my current research focuses on 'problems' that result from using existing design methodologies, existing statistical methodologies are also marvelous tools of discovery. I never think of a study in isolation, but rather as a stage in the process of knowledge acquisition. I will review a serial mixture of randomized experiments and observational studies by which we were able to establish that cytomegalovirus (CMV) infection could result from blood transfusions. At that time hepatitis was the only viral infection thought to transmit in this manner. Just as case studies started to appear reporting HIV infection following blood transfusions, our findings for CMV led blood banks to gear up for doing routine viral screenings.
My research, motivated by the medical sciences, and this series of investigations illustrate the way I would promote statistics as an integral part of the discovery and learning process in the medical sciences.
July 29, 1999
Mark Segal
UCSF Division of Biostatistics
Prediction of Binding Peptide Sequences: Application of Trees and Bump-Hunts.
Milik et al., (Nature Biotechnology, 16: 753-6, 1998) use artificial neural networks (ANNs) to predict the amino acid sequences of peptides that bind to the particular MHC class I molecule, K^b. Their motivation is that simple rules for such prediction, based solely on preferences for specific amino acids in certain (anchor) positions, are inadequate and that binding is influenced by the amino acids in all positions of the peptide. The purpose of the ANN application was to elucidate these more complex rules. While ANNs provide a powerful and flexible machinery they have some shortcomings with respect to this problem: (i) difficulty handling highly polymorphic positions in terms of the amino acid representation itself (as opposed to derived properties thereof); (ii) a "black-box" representation of the prediction rule that precludes interpretative insight; and (iii) mediocre performance in terms of sensitivity and specificity for the phage library analyzed.
We demonstrate that handling unordered categorical covariates with numerous levels and attendant interactions (shortcoming (i)) is, in fact, problematic for many regression methods. Further, this and the other difficulties can be effectively redressed using classification tree techniques. We illustrate this approach using the same data studied by Milik et al. Additionally, recently devised bump-hunting methods that also adeptly handle unordered categorical covariates are applied. Other interesting problem features including (a) position covariation, and (b) whether observed associations are attributable to amino acid properties, are addressed.
July 6, 1999
Charles McCulloch
Departments of Statistical Science and Biometrics, Cornell University
Latent Class Mixed Models
Linear mixed models are a well-known method for incorporating heterogeneity (e.g., subject-to-subject variation) into a statistical analysis for continuous responses. However heterogeneity cannot always be captured by the usual assumptions of normally distributed random effects. Latent class mixed models offer a way of incorporating additional heterogeneity which can be used to uncover distinct subpopulations, to incorporate correlated non-normally distributed outcomes, and to classify individuals. The methodology is illustrated with data from the Nutritional Prevention of Cancer trials: latent class models are used with longitudinal data on prostate specific antigen (PSA) as well as incidence of prostate cancer. Four subpopulations are identified which differ both with regard to their PSA trajectories and their incidence rate of prostate cancer.
June 10, 1999
Ying Qing Chen
Department of Biostatistics, Johns Hopkins University
Accelerated Hazards Model and Its Extensions
The proportional hazards model for survival time data assumes that the risk factors of interest exert their effect multiplicatively on an underlying unknown hazard function. Although this model has been studied widely in the statistical literature, it may not be applicable when the assumption of constant proportionality is violated. In a two-arm randomized clinical trial, for example, participants in the treatment group would have the same risk process through time as those in the control group, except that the treatment would speed up or slow down this process. Some alternatives such as the accelerated failure time model have been developed in the literature. In this talk, an accelerated hazards model is introduced to estimate such a treatment effect when there is a scale change relationship between hazard functions. The methodology and its estimation procedure are studied within a two-sample setting. Extensions of the model to other general settings are discussed. The proposed method is applied to a real data set to investigate its practical use.
This work is joint with Mei-Cheng Wang of the Department of Biostatistics, Johns Hopkins University.
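For orientation, the three two-sample models contrasted in the abstract above can be written side by side. This is a sketch in the usual textbook parameterization, not necessarily the notation used in the talk: $\lambda_0$ and $\lambda_1$ (control and treatment hazards) and $\beta$ (treatment effect) are symbols assumed here.

```latex
\begin{align*}
\text{proportional hazards:} \quad & \lambda_1(t) = e^{\beta}\,\lambda_0(t) \\
\text{accelerated failure time:} \quad & \lambda_1(t) = e^{\beta}\,\lambda_0(e^{\beta} t) \\
\text{accelerated hazards:} \quad & \lambda_1(t) = \lambda_0(e^{\beta} t)
\end{align*}
```

In the last form the treatment changes only the time scale of the hazard, which is the "scale change relationship between hazard functions" the abstract describes.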
April 15, 1999
Adam Olshen
Stanford Human Genome Center, Department of Genetics
SAMapper: A Maximum Likelihood Method for Constructing Radiation Hybrid Maps
In this talk I will discuss both the uses of radiation hybrid mapping and the mapping program SAMapper developed at the Stanford Human Genome Center. Radiation hybrid mapping is the most common method of making high resolution maps of the human genome. A human-hamster radiation hybrid is constructed by irradiating a human cell and fusing it with a hamster cell. A hybrid contains the whole hamster genome and random portions of the human genome. Maps are constructed from data on the retention of human markers in a panel of hybrids. The purpose of our new method is to adapt the Boehnke-Lange-Cox maximum likelihood techniques for radiation hybrid mapping so that it is possible to build good maps on the order of several thousand markers. This is accomplished through the use of reasonable plug-in estimators to speed up the likelihood calculation and simulated annealing to search through the many possible orders of markers. In addition, we have developed a novel method of bootstrapping to assess the uncertainty in our maps.
This work is joint with Laura Lazzeroni, Ying Luo and David Cox.
April 9, 1999
Anthony J. Lawrance
School of Mathematics and Statistics, University of Birmingham, Birmingham, England
Engine Mapping: Statistical Modelling as Auto Engineering
Engine mapping is the term used in the auto industry when modelling engine outputs, such as torque and emissions, in terms of engine inputs, such as load, air-fuel ratio and exhaust gas recycling ratio. Such relationships are required by electronic engine controllers to provide optimum fuel economy within legal restrictions on exhaust gas emissions and within the operational limits of the engine. The older method offered considerable scope for improvement and a new approach has been developed. The key idea is to consider the problem in two stages. The first stage is concerned with response or output sequences as functions of spark advance, which is the way the data are collected, from experiments designed in terms of the input variables. The second stage involves informed multivariate regression modelling of key engineering quantities of curves fitted to these sequences. This division of the problem allows both input from the engineering base and the effective use of statistical modelling and a variety of diagnostics; it produces models with much improved predictive performance. The approach is outlined and then illustrated on data from a designed experiment carried out earlier during the work. A number of novel statistical features are involved, but there are some parallels with repeated measures. The topic had not previously been subject to detailed study in either the statistics or engineering communities and is now being refined for implementation with future production engines. The work is being carried out as a closely collaborative project with the Ford Engineering Research Centre in the UK.
December 7, 1998
David Giltinan
Genentech
Sensitivity Analysis in Mixed Effects Models using the Weighted Bootstrap
Regression analysis of repeated measurement, or clustered, data is usually carried out within the framework of an appropriate mixed effects model. A question of practical interest is whether a particular observation, or small subset of observations, exerts undue influence on the final fitted model. Few options exist for deletion diagnostics that can be practically implemented (such as Cook's distance, in the case of a single level of random variation). The problem is intrinsically more complicated for repeated measurement data, as one may be interested in evaluating the influence of a single within-subject measurement, or that of a particular subject.
Bayesian analysis by Markov chain Monte Carlo (MCMC) techniques affords one possible approach to the development of deletion diagnostics, as sensitivity to assumptions can often be explored by appropriate resampling from the "converged" chain. I will present results illustrating the use of this approach to development of within-subject and full-subject deletion diagnostics, and discuss possible advantages and limitations.
October 27, 1998
Martin W McIntosh
University of Washington, Department of Biostatistics and Fred Hutchinson Cancer Research Center
INVESTIGATING EMPIRICAL BAYES RULES FOR OVARIAN CANCER SCREENING USING LONGITUDINAL BLOOD SERUM MEASURES
Substantial research is underway to identify blood tests for early detection of cancer; for example, yearly measures of PSA, for prostate cancer, and CA125, an antigen specific for ovarian cancer. Good candidates for use in screening are blood-borne antigens present in stochastically steady levels in normal subjects (subjects without cancer) that increase in concentration early in a cancer's development cycle. It is hoped that routine screening of serum may help detect cancer at an early stage. A natural screening strategy determines if serum levels have elevated at the most recent screen. But elevated compared to what? Currently used strategies make use of only the most recent serum measure - positive screen when $Y_{i}^+>c_1$ - or the two most recent measures - positive screen when $Y_i^+ - Y_i^- > c_2$ (for example, a doubling rule when $c_2=\log(2)$). It is commonly known that CA125 has substantial between-subject heterogeneity, and recently Crump, McIntosh, Karlan and Urban (1998) show that in normal women several other candidate markers for ovarian cancer have substantial heterogeneity. We show how Empirical Bayes motivated screening strategies can increase the sensitivity of the screen while maintaining the same specificity. We also investigate semi-parametric rules for screening using quantile regression that seek to mimic the Bayes rule of our parametric model, but without its parametric assumptions. Given time, we outline how these methods fit into our long term goals of using multiple markers simultaneously for cancer screening. We motivate our methods using data for ovarian cancer screening.
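The two currently used rules quoted above, plus one possible subject-specific alternative, can be sketched as follows. The first two functions follow the abstract's formulas directly, with marker values taken on the log scale. The third is only an illustration of the empirical Bayes idea: the normal-normal shrinkage estimator, the function names, and the parameters `pop_mean`, `between_var`, `within_var` and `c3` are assumptions of this example, not the rule derived in the talk.

```python
import math


def single_threshold_rule(y_latest, c1):
    """Positive screen when the most recent log marker level exceeds c1."""
    return y_latest > c1


def change_rule(y_latest, y_previous, c2=math.log(2)):
    """Positive screen when the rise between the two latest log levels exceeds c2.

    With c2 = log(2) this is the doubling rule mentioned in the abstract.
    """
    return (y_latest - y_previous) > c2


def shrunken_baseline(history, pop_mean, between_var, within_var):
    """Hypothetical normal-normal shrinkage estimate of a subject's own baseline.

    The subject's past mean is pulled toward the population mean; the more
    past screens are available, the less shrinkage is applied.
    """
    n = len(history)
    if n == 0:
        return pop_mean
    weight = between_var / (between_var + within_var / n)
    subject_mean = sum(history) / n
    return weight * subject_mean + (1.0 - weight) * pop_mean


def subject_specific_rule(y_latest, history, pop_mean, between_var, within_var, c3):
    """Positive screen when the latest level exceeds the shrunken baseline by c3."""
    baseline = shrunken_baseline(history, pop_mean, between_var, within_var)
    return (y_latest - baseline) > c3
```

The third rule answers "elevated compared to what?" with each subject's own estimated baseline, which is where between-subject heterogeneity enters.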
October 8, 1998
Geof Givens
Department of Statistics, Colorado State University, Fort Collins
ESTIMATING AND ADJUSTING FOR PUBLICATION BIAS USING DATA AUGMENTATION IN BAYESIAN META-ANALYSIS
Meta-analysis reviews, collects, and synthesizes individual sample surveys to estimate an overall effect size. `Publication bias' is a relatively new statistical phenomenon that only arises when one attempts through a meta-analysis to review all studies, significant or insignificant, in order to provide a total perspective on a particular issue. If the studies for a meta-analysis are chosen through a literature review, an inherent selection bias may arise, since for example, studies may tend to be published more readily if they are statistically significant, or deemed to be of higher quality. This has recently received some notoriety as an issue in the evaluation of the relative risk of lung cancer associated with passive smoking, following legal challenges to a 1992 EPA analysis which concluded that such exposure is associated with significant excess risk of lung cancer.
We introduce a Bayesian approach which estimates and adjusts for publication bias, correcting for both the number and outcome of missing studies. Estimation is based on a data augmentation principle within a hierarchical model, and the number and outcomes of unobserved studies are simulated using Gibbs sampling methods. This technique yields a quantitative adjustment for the passive smoking meta-analysis. We estimate that there may be both negative and positive but insignificant studies omitted, and that failing to allow for these would mean that the estimated excess risk may be overstated by around 30%, both in US studies and in the global collection of studies.
A further extension of this method introduces an additional hierarchy that permits the stratification of studies by sample characteristics, study design elements such as blinding and control, and many other objective and subjective factors. We apply this method to a meta-analysis of studies of cervical cancer rates associated with use of oral contraceptives.
September 29, 1998
Allan Donner
Professor and Chairman, Department of Epidemiology and Biostatistics, University of Western Ontario
THE MERITS OF MATCHING IN CLUSTER RANDOMIZATION TRIALS
Concern about potential imbalance on risk factors in cluster randomization trials often prompts researchers to adopt a pair-matched design in which one cluster in each matched pair is randomly assigned to the intervention group, with the other cluster receiving the control intervention. Because this design is both intuitively appealing and simple to analyze, it has become very popular among investigators planning a community intervention trial. Nonetheless it is known that for very small studies, involving fewer than about ten matched pairs, ineffective matching can lead to a loss in power. In this talk, I discuss some less well-recognized limitations to the pair-matched design that may arise in trials of any size. Depending on the overall objectives of the study, investigators are therefore encouraged to consider other designs which do not share these limitations.
August 25, 1998
Steve Self
Fred Hutchinson Cancer Research Center and Department of Biostatistics, University of Washington
An Intermediate-Size Trial Design for HIV Vaccine Trials
The design of an NIH-sponsored trial to evaluate protective efficacy of an HIV vaccine has been challenging due to competing priorities of investigators, NIH and industry. While industry would generally prefer an evaluation that is as quick and definitive as possible, NIH and many investigators are reluctant to invest substantial resources in evaluation of a vaccine for which plausibility of efficacy is considered low. A trial that is intermediate in size between classical Phase II and Phase III designs is proposed as a possible resolution to this conflict. Such a trial would provide preliminary but direct information about vaccine efficacy and serve as a triage of vaccine candidates for subsequent definitive evaluation. The design and operating characteristics of such a trial will be described with respect to primary analyses of efficacy and secondary analyses of potential correlates of immune protection. A description of how this trial design might fit into a larger program of HIV vaccine evaluation will be provided and some challenges for what studies might logically follow will be discussed.
June 4, 1998
Alexandre Tsodikov
Department of Biostatistics, Huntsman Cancer Institute, University of Utah
A Distribution of Tumor Size at Detection: An Application to Breast Cancer Data
A method of estimating numerical characteristics of initiation, promotion and progression stages of tumor latency from data on tumor size at detection is discussed. To this end, a stochastic model of spontaneous carcinogenesis has been developed to allow for a simple pattern of tumor growth kinetics. It is assumed that a tumor becomes detectable when its size attains some threshold level which is treated as a random variable. Initiation is modelled by a Poisson process of the formation of precancerous lesions (initiated cells). Competing promotion of initiated cells results in occurrence of the first malignant cell. Subsequent tumor growth (progression) is modelled by a pure birth process. The model yields a parametric family of joint distributions for tumor size and age at detection. Some estimation problems associated with the proposed model appear to be tractable. In particular use is made of the conditional likelihood based on the limiting distribution of size at detection for older patients that appears to be independent of the model parameters related to tumor promotion. This is illustrated with an application to the statistical analysis of data on primary breast cancer.
April 16, 1998
Chris Andrews
Group in Biostatistics, University of California, Berkeley
A Genomic Humpty Dumpty: Reconstructing Chromosomes from Random Fragments
A two-part Bayesian method for constructing a physical map of a chromosome is discussed and is applied to human chromosome 11. First, Markov chain Monte Carlo techniques are applied to DNA fragment vs Radiation Hybrid (RH) and RH vs Sequence Tag Sites (STS) data to localize the DNA fragments to STS content bins. Simulation studies suggest that this method is robust to model misspecification. The results of this method applied to data for chromosome 11 compare favorably with a previous map using these data (Perlin, 1995) and provide posterior probabilities of correct binning. Second, colocalized DNA fragments are ordered within STS bins using overlap data and then the orderings for consecutive bins are merged to form a map of the entire chromosome. Because the fragments have been grouped by part one, the computations required are still feasible. Criteria for determining the optimal experimental design are discussed.
March 25, 1998
Oernulf Borgan
Department of Statistics, University of Oslo, Norway (Visitor in the Stanford Department of Statistics)
Sampling from the Risk Sets in Cox Regression
Estimation in proportional hazards models, like Cox's regression model, is based on a partial likelihood which compares the covariate values of a failing individual to those of all individuals at risk at the time of the failure. In large epidemiological cohort studies of a rare disease, these methods require the collection of information on exposure variables and other covariates of interest for all individuals in the cohort even though only a small fraction of these actually get diseased. This may be very expensive, or even logistically impossible. Risk set sampling methods, in which each failing individual (case) is compared to a small sample of controls from those at risk at the case's failure time, may give a substantial reduction in the resources that need to be allocated to a study.
In the talk I will describe a general framework for such risk set sampling methods, incorporating classical nested case-control sampling (with simple random sampling of the controls) and counter-matched sampling (with stratified random sampling of the controls) as special cases. Further, I will review methods for estimation of the regression parameters and the integrated baseline hazard and indicate how one may estimate absolute risk for given time-dependent covariate histories. The methods will be illustrated by data on lung cancer deaths in a cohort of uranium miners from the Colorado Plateau.
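The classical nested case-control scheme mentioned above (simple random sampling of controls from each case's risk set) can be sketched in a few lines. The record layout `(id, entry_time, exit_time, failed)`, the function name, and the parameter `m` (controls per case) are assumptions made for this example; counter-matched sampling and the estimation steps are not shown.

```python
import random


def sample_risk_sets(subjects, m, rng):
    """Draw up to m controls for each case from those at risk at the failure time.

    subjects: list of (subject_id, entry_time, exit_time, failed) tuples.
    Returns a list of (case_id, failure_time, control_ids).
    """
    sampled = []
    for case_id, _entry, exit_time, failed in subjects:
        if not failed:
            continue
        t = exit_time
        # At risk at t: under observation just before t, excluding the case itself.
        # Subjects who fail later remain eligible as controls at this time.
        at_risk = [
            sid
            for sid, entry, leave, _ in subjects
            if sid != case_id and entry < t <= leave
        ]
        controls = rng.sample(at_risk, min(m, len(at_risk)))
        sampled.append((case_id, t, controls))
    return sampled
```

Covariates then need to be collected only for the cases and their sampled controls, which is the source of the resource saving described above.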
March 12, 1998
Ludwig Hothorn
Department of Bioinformatics, University of Hannover, Germany
(Visitor in the UCSF Division of Biostatistics)
Multiple Comparisons in the One-Way Layout Including a Control or Placebo Group
In clinical trials, preclinical studies, epidemiological studies and other bio-medical experiments the one-way layout including a control or placebo group occurs frequently. In the negative control or placebo group either no substance (with or without a vehicle) or a standard therapy ("administration on the top") is administered.
The commonly used statistical method is Dunnett's procedure (JASA, 1955). Although the degree of multiplicity is reduced with respect to all-pairs comparison procedures (e.g., Tukey, 1953), in this talk several approaches will be discussed for a further reduction of the number of comparisons (which increases the power but decreases the robustness):
- a-priori ordered hypotheses of interest (Mauer et al., 1995)
- assuming total order restriction of dose groups
- testing of fixed dose combinations versus monotherapies and placebo
- all-pairs comparisons between such treatment groups which are better than control
- estimation of highest effective dose step
Moreover, the inclusion of an international standard or a competitor's drug into this design will be discussed using an equivalence approach within an intersection-union test (Bauer et al., 1998, Stat.Med. in press).
Real data examples will be discussed for the above approaches.
March 10, 1998
Eric Vittinghoff
San Francisco Department of Public Health
Per-contact Risk of Transmission of HIV Associated with Four Types of Homosexual Contact
The risk of HIV transmission from various types of homosexual contact is of biologic, epidemiologic, and public health interest. A modified Bernoulli model was used to estimate the per-contact risk of four types of contact widely reported by high-risk HIV-uninfected homosexual men in the CDC Collaborative HIV Seroincidence Study. Per-contact risk of each type of contact was allowed to depend on covariates including age, concurrent STD, and the number of contacts reported. The last covariate directly approximates the effect of unmeasured between-subject heterogeneity in susceptibility commonly found in mixture and nonparametric models.
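A minimal sketch of the basic Bernoulli model may help fix ideas: if each contact of type j independently transmits with probability p_j, a subject reporting n_j contacts of each type is infected with probability 1 - prod_j (1 - p_j)^(n_j). The code below implements only this unmodified form with constant per-contact risks; the covariate dependence described in the abstract is not included, and the function names and data layout are assumptions of this example.

```python
import math


def infection_probability(contacts, per_contact_risk):
    """P(infection) = 1 - prod_j (1 - p_j) ** n_j for contact counts n_j."""
    log_escape = sum(
        n * math.log1p(-p) for n, p in zip(contacts, per_contact_risk)
    )
    return -math.expm1(log_escape)


def log_likelihood(records, per_contact_risk):
    """Bernoulli log-likelihood over (contacts, infected) records.

    Assumes every infected subject reports at least one contact; otherwise
    the infection probability is zero and its logarithm is undefined.
    """
    total = 0.0
    for contacts, infected in records:
        p = infection_probability(contacts, per_contact_risk)
        total += math.log(p) if infected else math.log1p(-p)
    return total
```

Maximizing `log_likelihood` over the risks (for example with a numerical optimizer) would give per-contact estimates under these simplifying assumptions.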
Simulations indicate that the model performs adequately in estimating relatively small per-contact risks (URO) in the presence of much larger risks (URA). Severe unmodeled heterogeneity in susceptibility (and by extension, infectiousness of partners with whom there are repeated contacts) appears to induce moderate downward bias in estimates of the relatively large per-contact risk.
These estimates of per-contact risk may help in interpreting animal models for HIV transmission and in designing interventions such as post-exposure prophylaxis and rectal microbicides. They also suggest that the lower risk practice of URO may become more important in propagating the epidemic, and should be addressed by interventions targeting high risk MSM.
November 24, 1997
Jerry Lawless
Department of Statistics, University of Waterloo
Semiparametric Methods for Missing Data and Response-selection in Regression
This talk will discuss the construction of estimating functions for parametric regression models in situations where information about responses or covariates may be missing, and where observation may be response-selective. Modelling of covariate distributions is avoided by the use of semiparametric likelihoods and pseudo likelihoods. Applications include multistage studies in epidemiology and medicine, field reliability studies, and broad classes of missing data and measurement error problems.

