Projects

Phylogenetic inference based on simulated annealing

As part of his research in the Thornton Lab, Victor Hanson-Smith is developing a software system named PHYESTA (pronounced "fiesta"). The acronym stands for "PhyML extended for simulated thermal annealing." The system is a set of modules that extend the PhyML phylogenetic inference package, incorporating a new optimization algorithm based on simulated annealing and mixed branch lengths for heterotachy.
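For readers unfamiliar with the technique, the following is a minimal, generic sketch of simulated annealing. It is not PHYESTA or PhyML code: the `anneal` function, its parameters, the geometric cooling schedule, and the toy scoring function are all assumptions chosen for illustration. In a phylogenetic setting the score would presumably be a tree's log-likelihood and the proposal a small change to the tree; here both are replaced by a one-dimensional toy problem.

```python
import math
import random


def anneal(score, propose, start, t_start=1.0, t_end=1e-3,
           cooling=0.95, steps_per_temp=50, rng=None):
    """Maximize score() by simulated annealing (illustrative sketch only)."""
    rng = rng or random.Random(0)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            candidate = propose(current, rng)
            candidate_score = score(candidate)
            delta = candidate_score - current_score
            # Always accept improvements; accept a worse state with
            # probability exp(delta / t), which shrinks as t cools.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
        t *= cooling  # geometric cooling schedule (an assumption)
    return best, best_score


# Toy stand-in for a likelihood surface: a single peak at x = 3.
def toy_score(x):
    return -(x - 3) ** 2


# Toy stand-in for a tree rearrangement: step one unit left or right.
def toy_propose(x, rng):
    return x + rng.choice((-1, 1))


best, best_score = anneal(toy_score, toy_propose, start=20)
```

Accepting some worse states at high temperature is what lets the search leave local optima, which is the usual motivation for annealing over strict hill climbing.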

Stacks

Julian Catchen is developing a software pipeline named Stacks for identifying a set of homologous loci from short-read sequence samples.

Sequence alignment with minimum description length

An application named realign uses minimum description length, a concept from information theory, to generate pairwise global alignments. The idea behind this application is that the alignment that can be encoded in the fewest bits is the most likely description of the homologous sites in the sequences. The next steps for this project are to continue testing it on collections of real sequences with known homology and to extend it to multiple sequence alignment.
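The idea of choosing the alignment with the shortest encoding can be sketched with a toy cost model. The code below is not realign's encoding scheme: the function name `mdl_align` and the per-column bit costs are hypothetical. The assumed scheme charges 2 bits for an operation code plus 2 bits per DNA base that must be written out, so a match column costs 4 bits (one shared base), a mismatch 6 bits (two bases), and a gap 4 bits (one base). Dynamic programming then finds the global alignment with the smallest total.

```python
def mdl_align(a, b, match_bits=4.0, mismatch_bits=6.0, gap_bits=4.0):
    """Global alignment minimizing a toy description length in bits."""
    n, m = len(a), len(b)
    # cost[i][j] = fewest bits needed to encode an alignment of a[:i], b[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_bits
    for j in range(1, m + 1):
        cost[0][j] = j * gap_bits
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match_bits if a[i - 1] == b[j - 1] else mismatch_bits
            cost[i][j] = min(cost[i - 1][j - 1] + sub,
                             cost[i - 1][j] + gap_bits,
                             cost[i][j - 1] + gap_bits)

    # Trace back from the corner to recover one minimum-cost alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match_bits if a[i - 1] == b[j - 1] else mismatch_bits
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_bits:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


bits, top, bottom = mdl_align("ACGT", "AGT")
# bits == 16.0, top == "ACGT", bottom == "A-GT"
```

Under this cost model the three matching columns and one gap take 16 bits, fewer than any alignment that pairs mismatched bases. A realistic encoding would also account for sequence composition and gap lengths, which this sketch ignores.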

Data-driven workflow management for bioinformatics

The Pipeline Interface Program uses a high-level, rule-based programming notation to describe the stages of a complex analysis pipeline. The system automatically schedules the steps in the workflow based on data dependences between stages. The current system assumes data is stored in a relational database, but an interesting potential extension would be to manage very large data sets in Amazon S3 or Hadoop-based storage clouds.
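Scheduling from data dependences can be illustrated with a short sketch. This is not the Pipeline Interface Program's notation or implementation: the rule names, table names, and the `schedule` function are hypothetical. Each rule is assumed to declare the tables it reads and the tables it writes, and a rule must run after whichever rule produces its inputs. The ordering is computed with the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def schedule(rules):
    """Order rules so each runs after the rules that produce its inputs.

    rules maps a rule name to a pair (inputs, outputs), each a set of
    table names. Inputs that no rule produces are treated as already
    present in the database.
    """
    producer = {}
    for name, (_, outputs) in rules.items():
        for table in outputs:
            producer[table] = name
    # A rule depends on the producer of every table it reads.
    graph = {
        name: {producer[table] for table in inputs if table in producer}
        for name, (inputs, _) in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-stage pipeline, declared in arbitrary order.
rules = {
    "cluster": ({"hits"}, {"clusters"}),
    "blast": ({"sequences"}, {"hits"}),
    "load": (set(), {"sequences"}),
}
order = schedule(rules)
# order == ["load", "blast", "cluster"]
```

Because the order is derived from the declared inputs and outputs, stages can be written in any sequence, and a cyclic set of rules raises `graphlib.CycleError` rather than running in an arbitrary order.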

Past Projects

Synteny Database: As part of his Ph.D. dissertation project, Julian Catchen created a set of analysis tools and a software pipeline for analyzing conserved synteny. The system is able to detect regions of co-orthologs, in which two sets of genes in teleost fish are orthologous to genes in other vertebrates, dating to the whole-genome duplication early in the establishment of the fish lineage.

Lazarus: Victor Hanson-Smith developed a set of scripts for using PAML applications to help reconstruct ancestral molecular sequences.

tRNA Datamart: The tRNA Datamart is a repository of bacterial tRNA sequences. Visitors can select sequences by a variety of attributes (taxonomic classification, length, specificity, etc.) and download the selected sequences in several formats, including FASTA sequence files, tab-separated text for loading into spreadsheets, or even a PDF-formatted "book" of 2D cloverleaf diagrams.

Gene Duplication: This project, with Mike Lynch, was based on all-vs-all comparisons of genes within the completely sequenced genomes that were available at the time. The results showed that tandem duplication is a continuing process, with most new duplications being silenced soon after the duplication event.

Genetic Simulation Library: The GSL is a set of C++ classes (Individual, Generation, Population, etc.) and a rich set of random number generators designed for individual-based modeling in population biology. No work has been done on this software since 1997, but the classes may be a useful starting point for anyone who wants to incorporate genomic information into individual-based models.