Research
Mark R. Segal PhD
Professor, UCSF Division of Biostatistics, and CBMB Director
Microarrays: Dr Segal is involved in a variety of gene expression studies utilizing differing microarray platforms. These include HIV-1 associated dementia (Pulliam, Rempel), smoking/asthma (Fahy, Woodruff), airway remodeling (Arjomandi), antibiotic resistance and screening (Wiener-Kronish), multiple sclerosis (Baranzini, Sen), splicing (Xiao, Yang, Hartzog), and breast and colon cancer (Macina). From a methodologic standpoint he has devised techniques for handling expression data with linked continuous (Conklin, Dahlquist) and survival (Li) outcomes. With CBMB colleagues Yuanyuan Xiao and Jean Yang he has addressed some fundamental data analytic issues including normalization and differential expression.
Additionally, Dr Segal is engaged in both methodologic development for (Xiao, Yeh) and applications of (Baranzini, Sen) SNP array platforms, and similarly for phage-based antigen microarrays (Imboden).
Sequence Analysis: Since devising methods for formally evaluating putative clustering of translocation breakpoints, Dr Segal has applied these techniques in a variety of settings (Wiemels, Grimwade). Analogously, methods advanced for relating sequence to phenotype in the context of HIV-1 replication capacity (Barbour, Grant) have been extended (Xiao) and applied to other settings including rifampin resistance in TB (Cummings). He is actively engaged in reevaluating predictions of CpG island methylation based on sequence-derived features. Similar and upcoming involvements pertain to analyses arising under the pharmacogenomics of membrane transporters project (Giacomini, Witte, Sali). A recently devised economical, high-throughput assay (called AmpliCot) that uses DNA hybridization kinetics to measure the diversity of lymphocyte TCR genes (Baum, McCune) requires attendant development of data analytic methods, as well as experimental design planning to optimize performance. He has also undertaken comparative analyses of, and introduced new techniques for, the problem of predicting splice site signals.
Proteomics: Dr Segal is assisting with formulating approaches whereby UCSF-developed software (Protein Prospector; Burlingame, Baker, Chalkley) for performing protein identification based on database search following tandem mass spectrometry profiling is augmented with measures of the accuracy/confidence of the identifications. Other proteomic projects, using differing platforms, pertain to biomarker discovery in the context of Sjögren's syndrome (Hardt, Fisher, Shiboski), cardiovascular disease (Pletcher, Xiao, Hulley) and breast cancer (Tice, Xiao, Cummings).
Flow Cytometry: Dr Segal is about to embark on developing and applying methods for analyzing flow cytometry data obtained in investigations of T-cell activation in HIV-1 disease (Barbour, Sinclair).
Sandrine Dudoit PhD
Associate Professor of Biostatistics, UC Berkeley School of Public Health
Sandrine Dudoit's research activities concern the development and application of statistical and computational methods for the analysis of biomedical and genomic data.
Specific areas of interest include:
- the design and analysis of gene expression experiments using DNA microarrays (e.g., microarray experiments monitoring transcript levels, ChIP-Chip experiments for identifying transcription factor binding sites, alternative splicing microarray experiments);
- nucleotide and protein sequence analysis (e.g., identification of regulatory motifs in DNA sequences);
- the genetic mapping of complex human traits (e.g., linkage analysis, linkage disequilibrium analysis, SNP-based association studies);
- the analysis of biological metadata (e.g., Gene Ontology (GO) annotation);
- the development of statistical software for the analysis of biological data (Bioconductor Project).
Dr Dudoit's methodological research interests include:
- loss-based estimation with cross-validation: classification, density estimation, model selection, prediction;
- multiple hypothesis testing.
Jane Fridlyand PhD
Genentech, Inc.
The main focus of Dr Fridlyand's research has been on the development of statistical methodology and applications in biological and medical problems. More specifically, her work has been centered on developing and applying data mining techniques to the high-dimensional datasets arising in gene expression and copy number microarray, methylation and genotyping studies. She is also interested in developing methods for combining different types of data across genomic platforms (e.g., gene expression, array CGH and methylation) and tumor imaging (MRS) data. Ongoing research includes:
Development of the methods for variable selection. Identifying the loci responsible for variation in quantitative or binary traits such as tumor size, cancer subtype or survival status is a problem of great importance to biologists. One of the main features of genomic datasets is their unfavorable "p" (number of variables) to "n" (number of samples) ratio. Many important genetic variables affect the trait of interest via epistasis rather than on their own, and identification of such genes is notoriously hard when the p/n ratio is large. In her PhD thesis, Dr Fridlyand developed a novel approach for discovering interacting loci in the context of mouse linkage studies. There, she utilized binary decision trees in combination with powerful aggregation approaches by shifting the focus of the analysis from prediction to variable selection. This approach has been successfully used as an exploratory tool in the study of plasmacytoma-related morbidity in Emu-v-abl transgenic mice (Symon et al, 2002). Dr Fridlyand is interested in applying these ideas for determining variable importance to variable selection in the even more complicated setting of microarray studies, and in generalizing them for pathway discovery.
Development of the methods for accurate class prediction. Accurate class prediction is a problem of the utmost importance in cancer classification. There, biologists are often interested in developing genetic methods for tumor subtype identification and prognosis. Given the complexity of microarray tumor data, which involve an unfavorable ratio of the number of variables to the number of samples, a large number of sets of highly correlated variables (e.g., co-regulated genes), and high between-patient heterogeneity, the question arises as to whether classic statistical methods for discrimination can be used for this new type of data. Together with PhD adviser Dr Terry Speed and collaborator Dr Sandrine Dudoit, she has conducted a thorough comparison study with several publicly available gene expression datasets, each containing known cancer subtypes. They compared a number of traditional discrimination methods, such as K-Nearest-Neighbors (k-NN) and Linear Discriminant Analysis (LDA), with state-of-the-art machine learning approaches including bagging and boosting. They have demonstrated that in a typical gene expression dataset, the otherwise successful machine learning classifiers do not have an advantage over traditional statistical methods noted for their high bias and low variance, e.g. k-NN or LDA with the assumption of uncorrelated variables (Dudoit, Fridlyand et al, 2002). It is very likely that as the number of samples in a typical microarray dataset increases, the machine learning methods capable of exploring the space of interactions will gain an edge over standard discrimination approaches. In particular, Dr Fridlyand plans to evaluate the performance of the highly publicized classification method, Random Forests, in terms of its relative prediction accuracy and variable selection ability.
Together with Drs Spellman and Gray, Dr Fridlyand is currently involved in the cancer genome profiling project which, if funded, is going to produce data on many thousands of tumors which would allow us to explore these issues.
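As an illustration of the "LDA with the assumption of uncorrelated variables" rule mentioned above, the following is a minimal sketch of diagonal LDA on made-up data. It is not the code used in the cited comparison study: the function names, the toy expression values, the equal class priors and the small variance floor are all assumptions made for this example.

```python
from math import fsum

def fit_dlda(X, y):
    """Fit diagonal LDA: class means plus pooled per-gene variances.

    X is a list of samples (each a list of expression values) and y the
    class labels. Covariances between genes are ignored by construction.
    """
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    means = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [fsum(r[j] for r in rows) / len(rows) for j in range(p)]
    pooled = []
    for j in range(p):
        ss = fsum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y))
        # Small floor (an assumption) avoids division by zero for flat genes.
        pooled.append(max(ss / (n - len(classes)), 1e-8))
    return classes, means, pooled

def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled distance."""
    classes, means, pooled = model

    def score(c):
        return fsum((x[j] - means[c][j]) ** 2 / pooled[j] for j in range(len(x)))

    return min(classes, key=score)

# Hypothetical toy data: 6 tumors, 3 genes, two subtypes "A" and "B".
X = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.1], [0.9, 5.2, 0.3],
     [3.0, 1.0, 0.2], [3.2, 0.9, 0.1], [2.9, 1.1, 0.3]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_dlda(X, y)
print(predict_dlda(model, [1.1, 5.0, 0.2]))
print(predict_dlda(model, [3.1, 1.0, 0.2]))
```

Because only p variances are estimated rather than a full p-by-p covariance matrix, the rule remains usable when the number of genes far exceeds the number of samples, which is the setting described above.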
Development of the methods for novel class discovery. Cluster analysis involves the search through data for observations that are similar enough to each other to be grouped together. When a clustering algorithm is applied to a set of observations, a partition of the data is obtained whether or not the data exhibit a true or "natural" grouping structure. This fact causes no problems if clustering is done for obtaining a practical grouping of the given set of objects, for instance for organizational purposes. However, if interest lies more in the recognition of an unknown classification of the data, an artificial clustering is not acceptable, and therefore clusters resulting from the algorithm must be investigated for their relevance. Apart from descriptive, graphical or exploratory methods, this task can be performed by using probabilistic models and suitable statistical significance tests. Discovery of novel tumor classes using gene expression data is one example where the need to reliably estimate the number of clusters and accurately allocate observations arises. With collaborator Dr Sandrine Dudoit, she proposed to apply resampling methods to (i) estimate the number of clusters in a dataset and (ii) improve accuracy of the cluster assignment. The approach to (i) uses ideas from discriminant analysis. Since the clusters obtained from cluster analysis are eventually used for prediction purposes, it is natural to apply discrimination techniques in clustering. For (ii), bootstrap aggregation is used to improve cluster accuracy and to assign confidence to the labels of the individual observations. They have successfully demonstrated the utility of both approaches on simulated data and real microarray datasets by conducting careful comparison studies of their methods with the available methods. This work is presented in Dudoit and Fridlyand (2002, 2003).
Dr Fridlyand is interested in continuing to build upon the ideas proposed in these manuscripts and applying them to the tumor datasets of collaborators at the UCSF Cancer Center (Drs Gray, Bastian and Albertson), where one of the objectives is to identify homogeneous patient subsets to improve targeted drug development for breast, ovarian and melanoma cancers. In particular, some of these ideas find an interesting application to Magnetic Resonance Spectroscopy (MRS) data, where objectives include characterizing within-patient tumor heterogeneity based on the metabolite measurements as well as refining tumor margins for better targeted treatment. This work is joint with Drs Lu, T. McKnight and J. Hwang.
Development of the methods for the analysis of the array CGH data. The development of solid tumors is associated with acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic alterations seen in tumors reflect underlying failures in maintenance of genetic stability, as well as selection for changes that provide growth advantage. Microarray-based comparative genomic hybridization (array CGH) can be used to investigate genomic alterations. The computational task is to map and characterize the number and types of copy number alterations present in the tumors, and so define copy number phenotypes as well as to associate them with known biological markers. To utilize the spatial coherence between nearby clones, Dr Fridlyand proposed to use an unsupervised hidden Markov model (HMM) approach. The clones are partitioned into states which represent the underlying copy number of the group of clones. The structural changes in a tumor genome may be recorded and characterized computationally using the above methodology. The method is described in Fridlyand et al, 2004 and has been successfully applied to a number of cell line and primary tumor datasets. This research has greatly benefited from the continuing input of biological collaborators Drs Albertson, Pinkel and Snijders. Current effort is focused on refining the methodology and incorporating it in the aCGH package of R/Bioconductor developed with P. Dimitrov, a PhD student at UC Berkeley. Additionally, Dr Fridlyand is very interested in using the HMM approach for data reduction, so that the resulting lower-dimensional dataset is used as an input to classification and variable selection procedures as well as procedures for combining different data types.
Together with the visiting PhD student H. Willenbrock, she has also investigated approaches for identifying discrete levels of the copy number across the entire genome and their use in the downstream analysis (Willenbrock and Fridlyand, 2005). They have demonstrated that this leads to an increase in power and better detection of copy number alteration events. They also hope to utilize this approach for identification of early versus late events in tumor development. With biological collaborators, she has successfully applied the above methodology to efficiently identify regions of homozygosity and heterozygosity in backcross mice using array CGH data (Snijders et al, 2005). Currently, together with P. Dimitrov, she is extending previous work by developing a variable duration HMM for copy number profile segmentation (HsMM). Written as a graphical model, this framework allows for flexible incorporation of focal aberrations and subclonal events, and for natural extensions to allele-specific copy number analyses. One of the issues is computational, and she is working on efficient approximations to the full solution to make the algorithm applicable to high density arrays.
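To make the idea of partitioning clones into copy number states concrete, here is a minimal sketch of Viterbi decoding for a Gaussian HMM over log2 ratios ordered along a chromosome. It is not the published method or the aCGH package: the approach described above is unsupervised and estimates its parameters from the data, whereas this example assumes the state means, the shared standard deviation and the probability of staying in a state are all fixed in advance. The function name and the toy values are hypothetical.

```python
from math import log

def viterbi_segment(log2_ratios, state_means, sd=0.2, stay_prob=0.9):
    """Return the most likely state index for each clone.

    Assumes Gaussian emissions with a common sd, a uniform initial
    distribution, and equal probability of moving to any other state.
    Requires at least two states.
    """
    k = len(state_means)
    log_stay = log(stay_prob)
    log_move = log((1.0 - stay_prob) / (k - 1))

    def log_emit(x, mean):
        # Gaussian log density up to a constant shared by all states.
        return -0.5 * ((x - mean) / sd) ** 2

    def log_trans(i, j):
        return log_stay if i == j else log_move

    score = [log_emit(log2_ratios[0], m) for m in state_means]
    back = []
    for x in log2_ratios[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j)
                             + log_emit(x, state_means[j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical states: 0 = loss, 1 = normal, 2 = gain.
means = [-0.5, 0.0, 0.5]
ratios = [0.05, -0.02, 0.01, 0.55, 0.48, 0.60, 0.52, -0.03, 0.02]
print(viterbi_segment(ratios, means))
```

The transition penalty is what exploits spatial coherence: a single noisy clone is absorbed into its neighbors' state, while a run of elevated clones is called as a gained segment.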
Development of the methods for combining clinical, copy number and expression data for identification of driver genes and discovery of novel pathways. More cancer datasets are becoming available containing copy number, gene expression and methylation measurements as well as clinical information on the tumor samples. The approaches discussed in the current literature for combining different types of data focus almost exclusively on identifying the dosage effect of an individual gene by computing the correlation coefficient between copy number and mRNA levels of that gene. Clinical information has been largely ignored in this context. They have used their methodology for quantifying genomic instability together with available transcriptional and clinical information to identify groups of high-risk patients and to show the functional groups of genes whose transcriptional activity is associated with an increase in instability (Fridlyand et al, 2006; Chin et al, 2006 (accepted); Neve et al, 2006 (accepted)). This work is continuing in collaboration with Drs Albertson, Gray and Waldman using breast and ovarian cancer datasets.
Another interesting question is whether one can use different types of genomic data to improve understanding of pathway activation, identify novel gene connections in the known pathways, and group samples according to whether they are involved in a given pathway and their specific mechanism of involvement. Initially they proposed to focus on the known KEGG pathways, with a view to augmenting them with new genes. The approach involves building a prediction model for the mRNA level of each pathway gene using all the available information on that gene as well as transcriptional data on the remaining pathway genes. Currently they are investigating the question of model selection for this type of data, since existing approaches tend to overfit in the context of an equal number of samples and variables, i.e. identify many false connections. The number of connections among genes serves as a measure of the pathway activation status. They also build the shortest path among pairs of genes and compare it to the known pathway structure. The same concept makes it possible to test whether a gene not known to be in the pathway actually belongs there. Using the regression model, one can ask which samples are potential outliers and thus may not be involved in a given pathway, and whether these samples have particular clinical characteristics. Finally, all the approaches are also implemented with transcriptional data only, thus making it possible to assess the contribution of the DNA level data (copy number and methylation) when RNA level data are available. This work is done in collaboration with Drs Yeh, Costello, Gray, Albertson and Pinkel.
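The per-gene dosage-effect calculation described above as the standard approach in the literature can be sketched as follows. This is only an illustration of that baseline (a Pearson correlation between copy number and mRNA level for each gene across tumors), not the instability-based methodology of the cited papers; the gene names and measurements are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def dosage_effects(copy_number, expression):
    """Correlate copy number with expression gene by gene.

    Both arguments map a gene name to measurements over the same tumors,
    listed in the same order.
    """
    return {gene: pearson(copy_number[gene], expression[gene])
            for gene in copy_number if gene in expression}

# Hypothetical log2 copy number ratios and expression values for 4 tumors.
copy_number = {"GENE_A": [-0.5, 0.0, 0.5, 1.0], "GENE_B": [0.1, 0.0, -0.1, 0.0]}
expression = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [2.0, 3.0, 2.0, 3.0]}
effects = dosage_effects(copy_number, expression)
for gene, r in sorted(effects.items()):
    print(gene, round(r, 3))
```

In this toy input GENE_A's expression tracks its copy number while GENE_B's does not, so only GENE_A would be flagged as showing a dosage effect.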
David Glidden PhD
Professor, UCSF Division of Biostatistics
David Glidden collaborates with Kung-Yee Liang (Johns Hopkins) on two distinct topics: ascertainment adjustment in complex diseases, and multipoint mapping based on genetic linkage. The former area deals with problems posed by simultaneously having to oversample families enriched for disease (to minimize genotyping) and accommodate (via a random effect specification) residual association deriving from unmeasured genetic and environmental factors influencing disease risk. A paper detailing estimation approaches and sensitivity to the assumed random effects distribution is in press (Glidden, 2002; Glidden and Liang, 2002a).
Techniques employing IBD-allele sharing among affected siblings to localize putative disease genes are widely used. Dr Glidden is devising improvements of such methods by allowing for the inclusion of covariates and extending to general pedigrees. Incorporation of age-at-(disease) onset as a covariate has been shown to yield increased efficiency (Glidden and Liang, 2002b).
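For readers unfamiliar with allele-sharing methods, the classical "means test" for affected sib pairs can be sketched as below. This is the basic covariate-free version, not Dr Glidden's extension; it assumes that the number of alleles shared identical by descent (0, 1 or 2) is known without ambiguity for every pair at the marker, and the function name and data are hypothetical. Under no linkage a sib pair shares 0, 1 or 2 alleles with probabilities 1/4, 1/2 and 1/4, so the sharing proportion has mean 1/2 and variance 1/8.

```python
from math import sqrt

def mean_ibd_z(ibd_counts):
    """Z statistic for excess IBD sharing among affected sib pairs.

    ibd_counts holds one value per sib pair: the number of alleles (0, 1
    or 2) shared identical by descent at the marker. Large positive values
    suggest linkage between the marker and a disease locus.
    """
    n = len(ibd_counts)
    if n == 0:
        raise ValueError("at least one sib pair is required")
    mean_prop = sum(ibd_counts) / (2.0 * n)
    # Null mean 1/2 and null variance 1/8 per pair, hence 1/(8n) for the mean.
    return (mean_prop - 0.5) / sqrt(0.125 / n)

# Hypothetical IBD counts for 8 affected sib pairs.
pairs = [2, 2, 1, 2, 1, 2, 1, 2]
print(mean_ibd_z(pairs))
```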
Wen-Chi Hsueh PhD
Assistant Professor, UCSF Department of Medicine
Wen-Chi Hsueh's research group focuses on genetic epidemiological studies of complex phenotypes, in particular aging- and metabolism-related traits in humans. Through the use of extended families and large cohorts of unrelated individuals, the goal of their research is to identify susceptibility genes, environmental risk factors, and their patterns of interaction for traits of interest. They employ epidemiological principles, statistical tools and bioinformatics to perform linkage analysis and genetic association studies. Study populations include extended families from the Old Order Amish, and large cohorts from the US and Mexico with extensive information from prospective long-term longitudinal follow-ups or national surveys. A new study was initiated to study centenarians in Okinawa, Japan.
Our current projects and collaborations include the following:
- Genetic epidemiology of diurnal preference and its relationship with aging
- Genetic epidemiology of telomere length and its relationship with aging
- Candidate gene studies of obesity, type 2 diabetes, and longevity
- Genome-wide association study for aging and longevity
The hope is that findings from these studies will contribute to a better understanding of disease etiology. Furthermore, they may have significant implications for disease treatment and prevention strategies, such as identifying populations at high risk and providing "tailored" preventive or therapeutic strategies to patients with specific genetic profiles.
Katerina Kechris PhD
Assistant Professor, Preventive Medicine & Biometrics, Computational Bioscience Program, University of Colorado at Denver
Katerina Kechris focuses on the development and application of statistical methods for analyzing molecular sequences and high throughput genomic data. In particular, she is studying the information encoded in the genome that specifies patterns of gene expression and post-transcriptional processing. To understand how these patterns have evolved over time, she is also interested in comparative genomics.
Hongzhe Li PhD
Professor of Biostatistics, University of Pennsylvania
Hongzhe Li's work falls into two distinct domains: statistical genetics and microarray analysis. With regard to the former, he has completed a project on using additive gamma frailty models for genetic linkage analysis, which includes a retrospective likelihood ratio approach (Li and Zhong, 2002) and a prospective likelihood ratio approach using the EM algorithm (Li, 2002). This model is further extended to two-locus disease models (Zhong and Li, 2002) with the aim of detecting additional linkage conditioning on allele sharing in the known linkage region. His current work further extends to testing for genetic association in the linked region using nuclear family data.
With regard to microarray data analysis, Dr Li has developed models to link gene expression data to different phenotypes, including binary or continuous phenotypes, using the cluster-Rasch model by discretizing the gene expression data (Li and Hong, 2001). He has also developed kernel Cox regression models for relating gene expression profiles to censored survival data by using penalization methods (Li and Luan, 2002).
Katie Pollard PhD
Assistant Professor, UC Davis Genome Center & Department of Statistics
Areas of Investigation: Our group develops statistical and computational methods for the analysis of massive genomic datasets. We are interested in genome evolution, in particular identifying genome sequences that differ significantly between or within species and their relationship to biomedical traits of interest. Many of these sequences are non-coding, such as regulatory signals, structural sites, and RNA genes. One of our aims is to identify specific DNA alterations that are responsible for variation in gene expression. Current projects focus on two major areas: (1) fast evolving regions of the human genome and (2) adaptive evolution in microbial communities.
Approaches: Our human research makes use of the increasing number of vertebrate whole genome sequences as well as human population genomic data. Our microbial work utilizes whole genomes and metagenomic data - DNA sequenced directly from the ecosystem, representing a pool of genome fragments from multiple species. In both cases, we develop and apply probabilistic models of molecular evolution to detect sequences that evolve uniquely in one lineage (clade, species or sub-population). We then use statistical modeling, bioinformatics, and experimental validation to associate these changes in the mode or tempo of evolution with changes in biological function.
Saunak Sen PhD
Associate Professor, UCSF Division of Biostatistics
Saunak Sen's research centers on statistical design and methods for the genetic analysis of complex traits using genomic resources, especially in model organisms.
Complex traits are characterized by a small but discernible genetic component and display significant individual and environmental variation. Examples include diabetes, hypertension, and many cancers. Some features of the interplay between genetic, individual, and environmental factors are easier to resolve in experimentally manipulable model organisms such as mouse and rat.
Dr Sen specializes in statistical methods for model organism populations used to find genomic regions contributing to variation in quantitative traits (called quantitative trait loci, QTL). Most work has been focussed on mouse, rat, and Arabidopsis populations. He has developed software for QTL studies: R/qtlDesign for design of QTL studies, and Pseudomarker for the analysis of QTL studies.
Currently Dr Sen is interested in statistical methods for the genetic analysis of genome-wide expression phenotypes, selective genotyping and phenotyping strategies, distributed statistical computing, and causal inference.
Yuanyuan Xiao PhD
Assistant Adjunct Professor, UCSF Division of Biostatistics
Yuanyuan Xiao's research interests and efforts have been focused on, but not limited to, the following research areas:
- developing robust and integrated analysis solutions related to the processing of high throughput genomic and epigenomic data, including expression, ChIP-chip, SNP and methylation microarrays and DNA sequences (Xiao et al. 2004; Yang et al. 2004; Xiao et al. 2001; Xiao et al. 2007; Xiao et al. 2008);
- developing computational strategies for high-throughput proteomic data produced by Mass Spectrometry in studies that investigate biological events in cardiovascular disease and breast cancer (in collaboration with Drs Segal, Pletcher and Tice);
- advancing and applying integrative statistical approaches to reveal the genetic and epigenetic determinants of childhood leukemia (in collaboration with Drs Yeh and Wiemels).
Jean Yee Hwa Yang PhD
Lecturer, School of Mathematics & Statistics, University of Sydney
Jean Yang's work has centered on the development of statistical methodology and attendant software for the design and analysis of gene expression experiments using cDNA microarrays (Yang et al., 2001a,b; Yang and Speed, 2002; Yang et al., 2002). She has close and ongoing collaborations with Professor John Ngai's Laboratory at UCB (Lin, Yang, et al., 2001; Diaz, Yang, et al., 2001).
Ru-Fang Yeh PhD
Assistant Professor, UCSF Division of Biostatistics
Ru-Fang Yeh's work focuses on methodology development and collaborative research involving analysis of large-scale genomic data. She is particularly interested in utilizing sequence analysis and statistical modeling on data arising from genome projects and high-throughput molecular experiments to understand the regulation and interaction of genes. Selected topics of ongoing research include: (i) Statistical issues in the analysis of non-canonical microarrays, including SNP arrays for genotype calling (Xiao et al 2006, submitted) and copy number estimation (with Drs Xiao and Wiemels); tiling arrays for histone modification site detection via ChIP-chip (with Drs Wiencke and Zheng); and splice-junction or exon arrays for identifying alternative splices (Kechris et al 2006, submitted). (ii) Computational approaches for understanding gene regulation, including motif finding for pre-mRNA splicing (Yeh and Fairbrother et al 2002; Fairbrother et al 2004; Kechris et al 2006) and post-transcriptional processing by microRNAs (with BMI student A. Adai and Dr McManus). (iii) Statistical methods to integrate phenotypic and genomic data at different molecular levels to identify candidate gene modules (Fernald et al 2005; with Drs Fridlyand and Feuerstein).

