Dr. Fengzhu Sun is a professor of Computational Biology and Bioinformatics in the Department of Biological Sciences at USC, with a joint appointment in the Department of Mathematics. He has over 20 years of research experience using mathematical, statistical, and computational tools to solve biological problems, including protein interaction networks, single nucleotide polymorphisms (SNPs) and linkage disequilibrium, and metagenomics. He developed a widely used algorithm for haplotype block partitioning and tag SNP selection related to the International HapMap Project. He also developed widely used tools for integrative studies of genotype-to-phenotype mapping that combine information from Gene Ontology, protein interaction networks, pathways, gene expression, and SNPs. In metagenomics, he developed a widely used tool, local similarity analysis (LSA), for studying associations among operational taxonomic units. Recently, he developed new statistics for alignment-free genome and metagenome comparison using counts of word patterns. He is an elected fellow of the American Association for the Advancement of Science (AAAS) and the American Statistical Association (ASA), an elected member of the International Statistical Institute (ISI), an Astor visiting lecturer in statistics at Oxford University, and a program chair of RECOMB 2013. He received the USC Provost’s Mellon Mentoring Award in 2012. He has published over 150 papers, and his work has been cited over 6000 times according to Google Scholar.
Next-generation sequencing (NGS) technologies have generated enormous amounts of shotgun read data, and assembling the reads is challenging, especially for metagenomes and for organisms without reference sequences. We develop novel alignment-free and assembly-free statistics for genome and metagenome comparison. The key idea is to remove the background word counts from the observed counts when comparing genomes and metagenomes. Markov chains (MCs) are usually used to model background molecular sequences, and we develop a new statistical method to estimate the order of an MC from short read data. The alignment-free sequence comparison statistics are used to study the relationships among species, to assign viruses to their hosts, and to classify metagenomes and metatranscriptomes. In all applications, our novel methods yield results that are consistent with biological knowledge. Thus, our statistics provide powerful alternative approaches for genome and metagenome comparison based on NGS short reads.
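As a rough illustration of the key idea in the abstract (subtracting background word counts from observed counts), the Python sketch below compares two sequences by their centered k-mer counts. It is not the speaker's actual statistic. It assumes the simplest possible background, an order-0 Markov chain (independent bases), and sequences containing only A, C, G, and T. The function names and the cosine-style normalization are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def kmer_counts(seq, k):
    """Count overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def base_freqs(seq):
    """Order-0 background model: the frequency of each base in seq."""
    counts = Counter(seq)
    total = sum(counts[b] for b in ALPHABET)
    return {b: counts[b] / total for b in ALPHABET}


def centered_counts(seq, k):
    """Observed count minus expected background count for every k-mer."""
    counts = kmer_counts(seq, k)
    freqs = base_freqs(seq)
    n_words = len(seq) - k + 1
    centered = {}
    for word in map("".join, product(ALPHABET, repeat=k)):
        # Under independent bases, a word's probability is the
        # product of the frequencies of its letters.
        p = 1.0
        for b in word:
            p *= freqs[b]
        centered[word] = counts[word] - n_words * p
    return centered


def centered_similarity(seq_a, seq_b, k=3):
    """Cosine-style similarity of the two centered count vectors."""
    a = centered_counts(seq_a, k)
    b = centered_counts(seq_b, k)
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

For example, `centered_similarity(genome_a, genome_b, k=3)` returns a value between -1 and 1, with higher values meaning the two sequences deviate from their own backgrounds in similar ways. A higher-order Markov background, as mentioned in the abstract, would replace the product of base frequencies with the word probability under that chain.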