## Informal description of research projects

All work is a joint effort with Sorin Istrail and co-authors noted under titles.

### Ph.D. projects

HapCompass-diploid: A fast cycle basis algorithm for accurate haplotype assembly of sequence data from diploid genomes

In this work, I presented a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occuring SNP alleles in a haplotype. In the graph model, haplotype phasings correspond to spanning trees and each spanning tree uniquely defines a cycle basis. I define the minimum weighted edge removal global optimization on this graph and develop an algorithm based on local optimizations of the cycle basis for resolving conflicting evidence. I compared HapCompass with leading combinatorial methods on both simulated and real data showing improved accuracy and runtime. The software is available for download from the Istrail Lab website.

HapCompass-poly: Haplotype assembly in polyploid genomes, identical by descent shared tracts, and cancer genomes

Joint with Wendy Wong

In this work, I presented a number of results, extensions, and generalizations of compass graphs and our HapCompass framework. I proved the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, I presented graph theory-based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (1) novel implementations of haplotype assembly optimizations (minimum error correction), (2) assembly of a pair of individuals sharing a haplotype tract identical by descent, and (3) assembly of polyploid genomes. I then describe extensions to the polyploid haplotype assembly model and present novel algorithms for (1) complex variations, including copy number changes, as varying numbers of disjoint paths in an associated graph, (2) variable haplotype frequencies and contamination, and (3) computation of tumor haplotypes using simple cycles of the compass graph which constrain the space of haplotype assembly solutions. I evaluated our methods on 1000 Genomes Project, Pacific Biosciences, and simulated sequence data. The software is available for download from the Istrail Lab website.

Tractatus: an exact and subquadratic algorithm for inferring IBD multi-shared haplotypes tracts

In this work I present graph theoretic algorithms for the identification of all identical-by-descent (IBD) multi-shared haplotype tracts for an m by n haplotype matrix.

I introduce Tractatus, an exact algorithm for computing all IBD haplotype tracts in time linear in the size of the input, O(mn). Tractatus resolves a long standing open problem, breaking optimally the (worst-case) quadratic time barrier of O(m2n) of previous methods often cited as a bottleneck in haplotype analysis of genome-wide association study-sized data. This advance in algorithm efficiency makes an impact in a number of areas of population genomics rooted in the seminal Li-Stephens framework for modeling multi-loci linkage disequilibrium (LD) patterns with applications to the estimation of recombination rates, imputation, haplotype-based LD mapping, and haplotype phasing. I extend the Tractatus algorithm to include computation of haplotype tracts with allele mismatches and shared homozygous haplotypes in a set of genotypes. Lastly, I present a comparison of algorithmic runtime, power to infer IBD tracts, and false positive rates for simulated data and computations of homozygous haplotypes in genome-wide association study data of autism. The software is available for download from the Istrail Lab website.

DELISHUS: an efficient and exact algorithm for genome-wide detection of deletion polymorphism in autism and multiple sclerosis data

Joint with Eric Morrow, Bjarni Halldorsson

Much of recent autism genomics research has focused on large, highly penetrant, de novo deletions. I developed an algorithmic framework, which we termed DELISHUS, that implements exact algorithms for inferring regions of hemizygosity containing genomic deletions of all sizes and frequencies in SNP genotype data. I implement an efficient backtracking algorithm — that processes a 1 billion-entry genome-wide association study SNP matrix in a few minutes — to compute all inherited deletions in a GWAS dataset. I further extended the model to give an efficient algorithm for detecting de novo deletions. I also developed algorithms to, given a set of called deletions, compute the critical regions of recurrent deletions. We worked closely with Dr. Eric Morrow to compute a set of putative deletions in autism using DELISHUS and intensity-based methods and testing for association using sibling TDT tests and FBAT. This set is being experimentally validated. The software is available for download from the Istrail Lab website.

Quantitative reference transcriptomes for Nematostella vectensis early embryonic development and Crassostrea virginica in response to bacterial challenge

Joint with Ian McDowell, Marta Gomez-Chirarri, Joel Smith, Sarah Tulin

I worked closely with experimental biologists to implement a robust RNA-Seq transcriptome assembly workflow for the Eastern Oyster and Nematostella.

Long range phasing and the Clark phaseable sample size problem

Joint with Ryan Tarpine, Bjarni Halldorsson

The premise of this work is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts which may be identical by state (IBS). We estimated for populations, using the US as a model, what sample size of genotyped individuals would be necessary to have sufficiently long shared haplotype regions (tracts) that are IBD, at a statistically significant level. I helped write and evaluate our long-range haplotype phasing method.

Exact algorithms to optimize probe selection for HIV allele-specific PCR

Joint with Rami Kantor, Mia Coetzer, Anubhav Tripathi

I modified Tractatus to compute the minimum number of probes to cover HIV viral haplotype sequences in specific subtypes. I then modeled computation of the minimum set of probes for various objectives: (1) covering x% of the viral haplotypes, (2) covering at least 1 of the IUPAC-code expanded sequences (computed exactly when modeled as set cover), and (3) allowing y mismatches between probe and sequence (computed exactly when modeled as maximum independent set).

Immunogenomics – alternative interpretations of the genetic code

Joint with Austin Huang, Jonathan Yewdell

In response to infection or stress, mammalian cells increase aminoacylation of methionine to non-cognate tRNAs resulting in non-canonical translation of the genetic code. We considered the following questions: (1) Is there a bias for misacylated codons in solvent exposed regions of proteins? (2) Do proteins that are up-regulated during exposure to a virus or associated with specific biomolecular processes have codons that are more misacylated than randomly selected protein coding genes? For (1), I wrote a tool to compile a human protein structure and mRNA database, computed solvent accessibility, and a set of statistical programs to investigate association between misacylation, codons, and protein structure. For (2), I wrote a tool that extracted mRNA sequences, computed codon counts, compared the distribution of particular classes of genes to random sets, fit the observations to a binomial distribution (where codons are either high or low/no misacylation), and tested for association.

MetaGWAS and the variation-to-genes-to-pathways problem

Joint work with Nir Atias, Roded Sharan

Genome-wide association studies have successfully identified thousands of variants associated to common disease but rarely do the implicated variants immediately suggest underlying biology. I performed a meta-analysis of two Multiple Sclerosis genome-wide association studies, associated variants to genes, and a collaborator used a novel algorithm to infer a joint protein-protein interaction subnetwork that connections the studies’ associated variants.

Ariadne – A specialized genome browser for genome-wide association study workflow processing and population genomics

I wrote Ariadne, an extensive software environment and library of bioinformatics tools to process genome-wide association study (GWAS) workflows. Currently Ariadne can extract and display chromosomes from H. sapiens records in NCBI, align annotations (e.g. SNPs, genes, etc.), and computationally predict genomic entities (e.g. splices sites, start/stop codons, etc). Ariadne is an extension of the Celera Genome Browser, thus it retains all of its features including real-time, pan and zoom from chromosome to nucleotide level, interactive sequence analysis, drag-and-drop annotation capabilities, etc. Ariadne is a standalone, Java client featuring high-performance, interactive graphics, client-side data analysis, and data integration using XML data models.

Genomathica: A suite of computational biology algorithms implemented in Mathematica for educational and training settings

Building on the work of Hsin-Ta Wu, Jeff Erickson, Ryan Tarpine, Lian Garton, Yongxing Guo, Long X. Nguyen, Kaveh Boghraty, Sanghoon Cha, Crystal Kahn

Throughout the course of six years of teaching CSCI1820 and CSCI2820, I, along with Sorin Istrail, Ryan Tarpine and several other students, developed a set of algorithms in Mathematica to integrate with class lectures and homework assignments. Mathematica is (1) easy for students to learn, (2) interactive, (3) programmable, and (4) provides immense functionality and visualization to teach algorithms. Algorithms and models implemented in the Genomathica package include sequence alignment, hidden Markov models, genome assembly, suffix trees, linkage disequilibrium, and protein folding.

Cyrene – a cis-Lexicon and cis-Browser for regulatory genomics

Joint with Ryan Tarpine

My work on the Cyrene project involved integrating the Cyrene regulatory genomics browser with a transcription factor binding site prediction algorithm. Working with the lead developer on the Cyrene team, I implemented a solution using position weight matrices (PWM) to iterate through the specified genomic regions and computationally infer binding sites depending on the converted PWM score. The algorithm is currently available for use in the Cyrene cis-Browser.

The Virtual Sea Urchin/Virtual Organism

Joint with Russell Turner, David Moskowitz, William Turtle, Aimee Lucido

The Virtual Sea Urchin is a software tool that facilitates analysis of the four dimensional sea urchin embryo allowing researchers to probe and interrogate the developmental atlas. I worked on a team with a advanced undergraduate students to build the current system which is able to display the 4D model of the sea urchin embryo along with cell specific, gene expression data using the Ogre 3D graphics engine and a C++ viewer. The construction of this project is done in collaboration with Professor Eric Davidson’s Laboratory at CalTech.

I reimplemented The Virtual Sea Urchin with JOGL and generalized to support time series and spacial gene expression data for any organism.

### M.S. projects

Giardia

I computed genome assemblies for Giardia genome using the Celera assembler and 454 sequence reads for comparison with a Newbler assembly.

Rabbit transcriptome

Joint with Jean Wu

Rabbit is an organism that is often used in the study of cardiac disease. I ran statistical analyses on experimental Oryctolagus cuniculus (rabbit) gene arrays including normalization and quality control of probe signal across biological replicates. Additionally, I analyzed the quality of the gene arrays using BLAST and an updated rabbit genome.