Bench philosophy: Phylogenetic analysis programme iPhy

As Simple as Possible but not Simpler
by Steven D. Buckingham, Labtimes 03/2011

We have had iPhones and iPads, now we have iPhy – a new web-based phylogeny server that takes the sweat out of bioinformatics.

You keep hearing about it and you know it is there. That wealth of data hidden in the data pile. To get it, you need specialised computing skills — but you’re a biologist, not a computer programmer. If yesterday’s challenge was getting data, today’s challenge is being able to use it effectively.

Take phylogeny, for instance. In the old days, you would build the phylogeny of even well-studied groups on a single locus, typically a ribosomal DNA. But when you start doing this sort of analysis over larger groups of taxa, things start going wrong. For starters, a single locus might just not tell you enough about the complex relationships within the various subgroups. That means you have to look at datasets where the taxa you are interested in are represented by many species, each of which is covered by multiple genes. This “supermatrix” approach has lots to offer but it takes quite a bit of bioinformatics skill to cash in on it. While no sane phylogenist would complain about the increasing mass of sequence data, he might understandably grumble about having to learn programming to be able to use it.
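To make the supermatrix idea concrete, here is a minimal Python sketch. It is not taken from iPhy; the function name `build_supermatrix` and the toy sequences are invented for illustration. It assumes each locus has already been aligned, joins the loci end to end for every taxon, and pads taxa that lack a locus with "?" as a missing-data character.

```python
def build_supermatrix(alignments, missing="?"):
    """Concatenate per-locus alignments into one supermatrix.

    alignments: dict mapping locus name -> dict of taxon -> aligned sequence.
    Taxa absent from a locus are padded with the `missing` character.
    (Hypothetical helper for illustration; not part of iPhy.)
    """
    # Every taxon that appears in at least one locus gets a row.
    taxa = sorted({taxon for locus in alignments.values() for taxon in locus})
    supermatrix = {taxon: "" for taxon in taxa}

    # Process loci in a fixed (alphabetical) order so columns are reproducible.
    for locus_name in sorted(alignments):
        locus = alignments[locus_name]
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError(
                f"locus {locus_name!r} is not aligned: lengths {sorted(lengths)}"
            )
        width = lengths.pop()
        for taxon in taxa:
            # Use the real sequence if present, otherwise a block of missing data.
            supermatrix[taxon] += locus.get(taxon, missing * width)
    return supermatrix


# Toy data, invented for illustration only.
toy = {
    "18S": {"snail": "ACGT", "squid": "ACGA", "clam": "ACGG"},
    "COI": {"snail": "TTGACA", "squid": "TTGACC"},
}
matrix = build_supermatrix(toy)
for taxon, row in matrix.items():
    print(taxon, row)
```

In this toy case the clam has no COI sequence, so its row ends in a run of "?" characters. Real supermatrices have the same shape, only with far more taxa, loci and gaps, which is why assembling them by hand quickly becomes impractical.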

Enter iPhy

iPhy is a web-based package announced this January, which is aimed at making it easier to do phylogenetics in the multigenome era (Jones et al., BMC Bioinformatics 2011, 12:30). It is essentially a pipeline that is designed to assemble sequences from several different sources into one place, and then do your analysis. It has a set of modules that help with building a consensus sequence and, importantly, selecting appropriate subsets of loci and taxa. iPhy’s creator, Martin Jones of Edinburgh University, hopes it will relieve phylogeneticists of some of the bioinformatic burden. “iPhy is mostly targeted at researchers doing phylogenetic research,” says Jones, “and by making it easy to use, we hope we will attract the type of users who don’t have the bioinformatics background necessary to build their own tools for large-scale phylogenetics.”

But iPhy is not just for phylogeneticists – it has potential for anyone who wants to assemble a set of aligned sequences for a particular gene for a large taxonomic group, using public sequence data.

For instance, imagine you are interested in mollusc globin genes. With iPhy you could search GenBank records for sequences annotated with “globin”. Sure, that can easily be done in any of the databases but why stop there? Let’s include sequences with significant similarity to known globins. Once you have run your search, iPhy assembles the results into a single, large dataset. You might then decide to generate multiple sequence alignments of globins for different mollusc classes and start looking for conserved regions, building motifs, constructing Hidden Markov Models and so on.

iPhy rewards a bit of imagination. For instance, some people have used iPhy as a kind of surveying tool to assess the sequence availability of various genes in a large taxonomic group; an example is shown in Jones’ recent paper on iPhy in BMC Bioinformatics. As Jones argues, “This could be useful if you were picking loci to use for a study (whether for straightforward phylogenetics or something else) and wanted to see which loci already had high coverage in your group of interest, thus minimising the amount of sequencing you would have to do.”
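The surveying idea boils down to a simple count: for each locus, what fraction of the taxa in your group already has a sequence? The Python sketch below illustrates that calculation only. It does not use iPhy or its data, and the function name `locus_coverage`, the record format and the example taxa are all assumptions made for the sake of the example.

```python
from collections import defaultdict


def locus_coverage(records, group_taxa):
    """Return {locus: fraction of group_taxa with at least one sequence}.

    records: iterable of (taxon, locus) pairs, one per available sequence.
    group_taxa: the taxa that make up the group of interest.
    (Hypothetical helper for illustration; not part of iPhy.)
    """
    group = set(group_taxa)
    if not group:
        raise ValueError("group_taxa must not be empty")

    # Collect the distinct taxa seen for each locus, ignoring taxa outside the group.
    seen = defaultdict(set)
    for taxon, locus in records:
        if taxon in group:
            seen[locus].add(taxon)
    return {locus: len(taxa) / len(group) for locus, taxa in seen.items()}


# Toy records, invented for illustration only.
records = [
    ("snail", "COI"), ("squid", "COI"), ("clam", "COI"),
    ("snail", "18S"), ("snail", "COI"), ("octopus", "globin"),
]
group = ["snail", "squid", "clam", "octopus"]

coverage = locus_coverage(records, group)
# Best-covered loci first; ties broken alphabetically.
ranked = sorted(coverage.items(), key=lambda item: (-item[1], item[0]))
for locus, fraction in ranked:
    print(f"{locus}: {fraction:.0%}")
```

Loci near the top of such a ranking are the ones for which the least new sequencing would be needed.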

Everything in one package

Jones, a postdoc in Mark Blaxter’s group at Edinburgh University’s Institute of Evolutionary Biology, is not new to this sort of problem. A lot of the ideas behind iPhy were born as he worked on a previous project called TaxMan (BMC Bioinformatics 2006 Dec 18;7:536), which addressed a similar set of problems, albeit in a very different way. Indeed, there are existing services that do the same kind of tasks as iPhy. For example, SCaFoS (BMC Evol Biol 2007 Feb 8;7 Suppl 1:S2) attempts to build large concatenated data matrices from pre-defined sequence sets, TreeBase stores data matrices, while AMiGA lets you download concatenated mitochondrial sequences.

Low starting-barrier

But iPhy differs from existing tools in two respects. Firstly, it implements the entire phylogenetic workflow in one package: you can mine public sequence data, build and align consensus sequences, pick sets of genes and taxa for analysis, and generate annotated figures from the resulting trees. It is important to note that the one thing iPhy doesn’t do is the actual tree-building step. There are many excellent tools that do that job and, for large datasets, phylogenetic reconstruction is best carried out on a dedicated machine.

Secondly, iPhy aims to be more user-friendly, with a much lower barrier to entry than existing tools. Users don’t have to set up databases or download scripts; they can just create an account on the server and start using it. This is an important factor, as more researchers who are not bioinformatics specialists are finding themselves having to use bioinformatics tools. If a particular tool requires a lot of effort just to start using it (e.g. wrestling with database accounts and passwords and installing multiple dependencies), then the non-specialist user is likely to be put off. This is a phenomenon that we’re going to see not just in phylogenetics but in all types of sequence-based analysis.

Despite iPhy only being announced at the end of January, after a six-month round of beta testing, Jones is pleased with the uptake, with several datasets already created. “In January, ten datasets had been uploaded by users. At the start of April that had gone up to 50,” beams Jones.

Web-based application

iPhy is a website that’s made up of several parts. There’s a part that can read and identify DNA sequences, a part that can manipulate taxonomic trees, and a part that can choose which species to analyse in order to solve a particular genetic problem. It also incorporates many existing tools to do particular jobs (e.g. BLAST for similarity searching; CAP3 for consensus building; MUSCLE for multiple sequence alignment) and a user-friendly interface to tie everything together.

It gets its basic taxonomic information from the NCBI taxonomy, so in a sense it ‘knows’ about taxonomic relationships between sequences and species. If you happen to be one of those biologists who actually does have a computing background, you may wish to know that the web interface is built using the Grails web framework, that the code is written in Groovy (a dynamic language that runs on the Java platform), and that it uses BioJava to read and understand GenBank/FASTA sequence files, as well as PostgreSQL to store all the data.

While working on his PhD, Jones got involved in looking at bioinformatics approaches to large-scale phylogenetics. That was when he spotted some recurring issues in the field. Firstly, taxon selection and model selection are going to become even more critical in an age of vastly accelerated sequencing. If we want to do a phylogenetic study on a particular group of organisms, it’s no longer feasible to attempt to include all available sequence data – there’s simply too much of it. So we have to pick a subset of available genes and species to analyse.

Picking the right subset

Picking the right subset is crucial if we are to avoid systematic biases creeping in. Secondly, this increase in scale means that there is a pressing need for tools that are suitable for use by researchers who are not bioinformatics experts – tools that do the job properly, without you having to retrain. Thirdly, there is the lure of exploiting the enormous potential of some of the very large datasets that have already been assembled for phylogenetics.

Jones originally started developing iPhy as a desktop application but was never entirely happy with the large number of dependencies (libraries, external tools, databases, etc.) that the end-user would have to install. So, he started thinking about frameworks for web applications. Writing iPhy as a web application meant that the user wouldn’t need to install any new dependencies, and that the facilities for sharing datasets were built-in.

Indeed, the computational burden continues to shift towards the cloud. Web services, when well-crafted, can encapsulate best practices through a careful design. This includes careful selection of default options that guide the user through the optimal-use trajectory. The sheer size of the datasets means that it is not efficient to store them in too many places, certainly not on your desktop PC. Web services also mean that analyses can be standardised and recorded through saving of options files and pipeline design.

Postdocs aren’t exactly under-stressed for time, so what motivated Jones to go to all that trouble? Jones’ answer is disarmingly simple. “Our motivation is the hope that other researchers will find it useful.” Is that all? “Well, I’ve recently become very interested in the idea of a low ‘barrier to entry’ for scientific software; this is something that has not been a priority in the past but I think that it will become much more important in the future.”

Simplified bioinformatics

The accelerating scale of DNA sequencing means that, increasingly, the people who need access to bioinformatics tools will not be trained bioinformaticians. So Jones and others feel that those who are trained have a responsibility to make the tools more accessible. “I don’t mean by this that bioinformatics tools should be dumbed down,” says Jones. “After all, DNA sequence analysis (of all types, not just phylogenetics) is a complicated endeavour and so, to a certain extent, the tools must be too. I simply mean that it must be possible to try out new bioinformatics tools without spending a day configuring your computer.” So, part of the motivation was to explore the possibility of providing a complex tool that could be accessed using only a web browser. iPhy is one more step in that growing trend.

More to come

iPhy is already looking to the future. Jones is slowly but steadily adding new features, as more people use the site and report bugs and feature requests. He is working on an API (Application Programming Interface), which will let users write a script in their favourite programming language that can access data from the iPhy website. For example, imagine you have created an iPhy dataset with a hundred loci and you want to build a phylogenetic tree for each locus individually and compare each tree to a reference taxonomy. Obviously, this would be very tedious to do via the website, even using iPhy’s features, so with the new API users will be able to write a script that does everything automatically. So keep an eye on iPhy.

Last Changed: 10.11.2012