Sergei L Kosakovsky Pond
Brief Academic Biography
Following formal undergraduate training in computer science (at Kiev State University, Ukraine), I received a PhD from the interdisciplinary program in Applied Mathematics at the University of Arizona. My theoretical graduate research into statistical methodology for evolutionary analyses of coding sequence alignments found an application in an HIV research group at UCSD, which I joined as a postdoctoral fellow in 2003. Until 2016, I was an associate professor in the Divisions of Infectious Diseases and Biomedical Informatics in the UCSD Department of Medicine. In addition, I am the director of the Bioinformatics and Information Technologies Core at the UCSD Center for AIDS Research. In 2016, I joined the Institute for Genomics and Evolutionary Medicine at Temple University.
My research interests include developing models and computational approaches for comparative analysis of sequence data, especially large and rich data sets from measurably evolving pathogens, such as HIV-1, Influenza A virus, and Hepatitis C virus. My group has published a number of methodological and applied papers that use evolutionary algorithms and machine learning techniques to address complex problems in sequence evolution, especially in the context of HIV population history, adaptation to new hosts, transmission, immune escape, and the development of drug resistance.
My current research interests can be loosely partitioned as follows.
Molecular signatures of natural selection.
Accomplishments. Natural selection acting on point mutations and recombination are two fundamental evolutionary forces shaping genetic diversity. Increasing volumes of data, taxonomic sampling, and biomedical research drive the unrelenting interest of the research community in faster, more accurate, and more powerful methods for quantifying evolutionary processes based on comparative sequence analysis, and for interpreting them to generate testable hypotheses, elucidate mechanisms of disease progression or pathogen evolution, or answer fundamental questions of evolutionary biology. Over the past decade, I have developed and published numerous statistical and computational methods to detect pervasive and episodic selection at the level of genes, sites, or branches. Together with Spencer Muse, I was the first (2005) to show the importance of systematically modeling synonymous rate variation. I spearheaded the development of the first computationally tractable approach to efficiently model the variation in selective pressures over sites and lineages, spawning a class of powerful and accurate Branch-Site Random Effects models, including the Mixed Effects Model of Evolution (MEME) approach, which is 3-5x more powerful than previous state-of-the-art methods at detecting individual sites subject to episodic selection. I have also led the efforts to devise purpose-built models for studying directional selection, especially in the context of drug resistance development. To mitigate the computational complexity of the models, and bootstrap them into the NGS era, we have developed a robust statistical approximation (FUBAR) to accelerate the estimation of site-level selection by 2-3 orders of magnitude.
Current models largely ignore functional, structural, or phenotypic information. We are developing the statistical and computational underpinnings needed to augment our current codon-level models so that they can (i) model biases in amino-acid substitutions using biochemical properties or structural information; (ii) directly model the effect that pathogen or disease alleles or genotypes may have on measurable phenotypes in the full phylogenetic framework. We are also pursuing further algorithmic improvements and statistical approximations to accelerate the performance of all selection detection techniques (essential for Big Data and genomics applications).
Machine learning for biomedical discovery.
Because biology is replete with apparently computationally intractable problems, I have been interested for quite some time now in how methods in statistical learning can be creatively applied in this domain. I have applied genetic algorithms (GA) to solve random combinatorial optimization problems in sequence analysis, including recombination detection (GARD), accurate subtyping and new genotype discovery in rapidly evolving pathogens (SCUEAL), and automatic discovery of amino-acid substitution preferences (CodonTest). Our group has also pioneered the application of Bayesian Graphical Models (BGM) to study co-evolution among sites and infer networks of site and phenotype dependencies in viral genes (e.g. SpiderMonkey). Working together with a graduate student (Lance Hepler), I have developed and released a flexible and powerful machine learning framework (IDEPI) which delivers best-in-class performance on a variety of “genotype-to-phenotype” prediction and classification problems, including drug resistance, cell tropism (HIV-1), genetic signatures predictive of disease outcomes, and automatic discovery of antibody epitopes in vaccine research.
Future directions. There is tremendous interest in, and opportunity for, developing new methods for making sense of large volumes of sequencing data using “data-mining” and “pattern-discovery” techniques. For example, we are developing GA-based high-throughput algorithms for accurate and robust classification and interpretation of immunomics data, i.e. NGS of B-cell repertoires. Automated epitope discovery holds significant promise for vaccine research, especially for rapidly evolving RNA viruses.
Analytics and computational methods for genetically diverse NGS data.
I have led the development of a bioinformatics pipeline, tailored specifically for rapid and user-friendly analysis of HIV and HCV NGS samples, now used in over 20 publications from UCSD investigators. I developed new error-correction approaches, both for systematic (homopolymer length miscalls) and stochastic errors, and implemented numerous downstream analyses (phylogenetic, compartmentalization, selection analyses, etc.). I have been able to adapt this pipeline to also characterize B-cell clonotypic diversity from NGS transcript data, and am currently applying it to several HIV vaccine projects, which need these data to understand the development, maturation, and reproducibility of broadly neutralizing antibody elicitation. Further, I have been instrumental in NGS data analysis, interpretation, and hypothesis testing in collaborative efforts with other investigators, in the context of the transmission or development of drug-resistant variants, genetic diversity in Przewalski's horses, clinical, virologic, and immunologic correlates of HIV-1 intraclade B dual infection, tracing the origin of a CXCR4-tropic transmitted HIV virus to seminal cells, co-receptor usage in primary and dual HIV-1 infection, and the dynamics of viral rebound in different anatomical compartments.
NGS analysis is a workhorse of modern biomedical and translational research, and our plans include releasing a user-friendly version of the pipeline and adapting it to work on new sequencing platforms (e.g. we are the only ones currently using PacBio instruments for full-length gene HIV sequencing). There is also much to be done in incorporating population genetics and phylodynamics methods into the study of NGS data in order to reveal the underlying population dynamics of pathogens.
Molecular epidemiology of HIV-1.
The inference and use of molecular transmission networks for HIV-1 and other pathogens has become a very active area of world-wide research in the past three years. Because of my close collaborations with Dr Little (UCSD, San Diego Primary HIV Infection Cohort), Dr Haubrich (UCSD, CNICS), Dr Leigh Brown (U of Edinburgh, the UK Drug Resistance Database), and Dr de Gruttola (Harvard, Biostatistics), I was ideally positioned to make substantive contributions to both theoretical and empirical aspects of HIV molecular epidemiology. These encompass the first paper to propose a formal statistical test for using molecular network information to evaluate the efficacy of treatment as prevention and to develop a clinically applicable metric of network connectivity, as well as analyses of local, national, and global HIV-1 molecular transmission networks, to understand, for example, the various biomedical and demographic correlates of clustering in the network.
I have recently been awarded a 5-year collaborative U01 grant (with Dr Leigh Brown and Dr Volz as site PIs in the UK), under the auspices of the Models of Infectious Disease Agent Study (MIDAS). We will develop more innovative models of pathogen transmission combining population genetics, sequence evolution, and network theory, provide efficient method implementations and fast approximate algorithms scalable to global-scale datasets, evaluate the effect of prevention and treatment approaches on epidemic dynamics in five localized epidemics of HIV and HCV, and model generalized epidemics for these and other pathogens. By developing computational and statistical methods that incorporate and analyze pathogen sequence and other epidemiologic data, we will be able to infer and characterize transmission networks to best identify targets for the most effective and parsimonious use of prevention interventions.
Evolutionary rate estimation.
Largely due to the incisive research focus of my former postdoctoral trainee, Dr Wertheim, I have become interested in the statistical performance of very popular (>2,000 citations) relaxed molecular clock techniques in the emerging field of paleovirology, the use of molecular data to infer when various human pathogens were introduced into the population. Our investigations revealed fundamental flaws in the application of such methods to date both ancient and recent events in viral history. For example, we demonstrated how purifying natural selection can lead to severe underestimates of when viruses like measles, Ebola, and Influenza A virus were introduced into the human population, and of when coronaviruses were introduced into their current hosts, resolving many puzzling inconsistencies in the field, where historical or other genomic (e.g. endogenous retroviruses) data contradicted molecular estimates. This paper has been highlighted by the Faculty of 1000:
- Approaches such as those taken here are necessary to understand the evolutionary process in the most general sense, to gain insight into the tempo and mode of viral evolution and crucially also to devise appropriate public health responses to new viral outbreaks in humans.
Our recent study applying these methods to understand the evolutionary origin of Herpes Simplex Virus in humans has received significant press attention.
Fundamentally, we are interested in resolving the confounding effects of natural selection and substitution saturation on divergence time estimates, especially when the organisms in question are evolving at measurable rates (e.g. RNA viruses). Using a combination of sophisticated selection-aware models, coupled with rapid molecular clock inference techniques (e.g. penalized likelihood or de novo models which permit rapid approximations of the rate-evolution component), we should be able to push the veil of time further back into the past than currently available methods can.
Scientific Software Development
In addition to scholarly work, I have a keen interest in developing and popularizing open source software for biological data analysis. I am the primary developer of the HyPhy software package, which is an open source platform for comparative sequence analysis. HyPhy is a mature package with approximately 9,000 users in academia, government, and industry that has been cited in over 1,700 peer-reviewed publications. Essentially all the methods and procedures developed in the viral evolution group are incorporated into the program, and the HyPhy engine has been adopted as part of other bioinformatics frameworks, including novel next-generation sequencing and network-inference methods. The companion Datamonkey web server, providing a free high performance computing implementation of most of our techniques for the world-wide community, receives over 10,000 page views/day and has processed nearly 300,000 jobs since it was launched in 2004, conservatively contributing $500,000 of compute time to the global research community.
Five most cited publications
List last updated in Nov, 2014
- HyPhy: hypothesis testing using phylogenies (784 citations)
- Datamonkey: rapid detection of selective pressure on individual sites of codon alignments (436 citations)
- Not so different after all: A comparison of methods for detecting amino acid sites under selection (429 citations)
- Automated phylogenetic detection of recombination using a genetic algorithm (238 citations)
- Datamonkey 2010: a suite of phylogenetic analysis tools for evolutionary biology (229 citations)