Bench philosophy: Proteogenomics

Finding the Right Rules
by Steven Buckingham, Labtimes 05/2014

Image: SciLifeLab/ Tobias Frygar

First described six years ago, proteogenomics is now an established discipline that is not only upgrading the annotation of existing genomes but is also proving its worth in interpreting new genomes as they emerge.

Once we figured out how to sequence a genome, the rest seemed straightforward. You feed your long sequence of A’s, T’s, C’s and G’s into the computer algorithms and let them figure out which bits of the four-letter puzzle are the genes. After all, we know the triplets that say ‘start’ and the ones that say ‘stop’. Then, once we’ve figured out which bits get read off into RNA, all we need is to finish off by working out what the proteins will look like at the end of the pipeline. Job done! Now it’s over to the physiologists to put it all together.

But, of course, in practice it’s a bit more complicated than that. For one thing, it can be tricky sometimes finding the start and stop codons, especially in short genes, but we can get around that. And then there is the issue of splice variants, which means we need well-developed, subtle algorithms to spot them. Oh, and yes, there is the impossibility of estimating abundance...

It turns out, then, that working your way upwards from a sequence of nucleotides to a fully annotated genome is fraught with difficulties.

This is where proteogenomics comes to the rescue. Proteogenomics is an umbrella term for a collection of approaches that bring together proteomics and genomics, and so it has been applied to a wide range of research programmes. But at its heart it is really all about using proteomics to get better genome annotation.

The problem that proteogenomics is out to rectify, then, is that the automated annotation of genomes isn’t perfect. It sometimes gets things wrong. In fact, proteogenomics has brought home the message that automated genome annotation gets it wrong more often than many people realise. If genome annotation is our exam paper (“Discuss, in your own words, how people are made out of genomes”), proteogenomics is the examiner.

When genomes are sequenced, automated algorithms are brought into play to interpret the meaning of the genome. For the most part, this means identifying genes and the regulatory elements that govern their expression. An early step in gene-spotting is identifying the start and stop codons – the nucleic acid triplets that mark the beginning and end of the gene.

This, in itself, looks fairly straightforward but it can be deceptive, especially where the coded protein is very small. For example, although most people think of the triplet encoding methionine as the standard start codon, there are in fact a number of variations, especially in prokaryotes.
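To make the gene-spotting step concrete, here is a minimal sketch in Python of a naive open-reading-frame scan. It is an illustration only, not how production gene finders work. The function name `find_orfs`, the `min_codons` cut-off and the choice of GTG and TTG as alternative bacterial starts are assumptions made for this example, and only the forward strand is scanned.

```python
# ATG is the canonical start; GTG and TTG are alternatives often seen in prokaryotes.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3, starts=START_CODONS):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the first base of the start codon and
    end is the index one past the last base of the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        open_at = None  # position of the first start codon seen in this frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if open_at is None and codon in starts:
                open_at = i
            elif open_at is not None and codon in STOP_CODONS:
                n_codons = (i - open_at) // 3  # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((open_at, i + 3, frame))
                open_at = None
    return orfs


# The same short sequence is a gene or not, depending on the start rule.
print(find_orfs("GTGAAACCCGGGTAA"))                  # [(0, 15, 0)]
print(find_orfs("GTGAAACCCGGGTAA", starts={"ATG"}))  # []
```

The second call shows the trap described above: a scanner that only accepts ATG silently misses the gene.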

Error-prone annotation

On top of that, genes are also broken up into exons separated by introns. Figuring out which bits are exons and which are introns is quite tricky. When the DNA is read off as messenger RNA, at certain places there is a choice between which of two or more exons is to be used. This process is called alternative splicing. Again, it is a challenge for automated algorithms to spot, just from the long stretch of nucleotides, where there is the potential for alternative splicing.

So it is not actually as easy as it sounds to tell, just by looking at a DNA sequence, what genes are actually in there and, by extension, what is going to end up in the cell. In other words, automated annotation is error-prone.

Other techniques, such as looking at the transcriptome with technologies like RNAseq, have indeed been used to guide annotation. The transcriptome gives us the sequences of RNA that have been produced from the genome. Comparing RNA sequences with the corresponding DNA sequence tells us things like where transcription started and stopped, where the exon/intron boundaries are and which exons were included in alternative splicing. This sort of information provides the ‘right answers’ used to train automated computer algorithms and improve their performance.

But transcriptomics still doesn’t tell us everything. For instance, some RNA doesn’t even get read off as proteins. Clamp et al. listed as many as 4,000 redundant RNA transcripts in humans (PNAS 104: 19428-33). Even if a protein is made, it may nonetheless be marked for instant degradation. Identifying splice variants in RNA sequences is not without problems either. No-one would doubt that deep sequencing of RNA has indeed provided the material for considerable improvement of genome annotation. It is just that proteomics gives us a picture of a vital later step in the story – the sequence and abundance of peptides produced from the RNA, along with their post-translational and post-transcriptional modifications.

How does proteogenomics work? Finding out how to predict the proteome sequences just from looking at the DNA sequence alone is the goal of automated annotation algorithms. The basic idea behind proteogenomics is to get the sequence of all the proteins present in a biological sample, then match them up with the corresponding sequences in the genome. After all, annotation is all about figuring out the rules that define the relationship between these two sequences. Get those rules 100% right and you have perfect annotation.
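The matching idea can be sketched in a few lines of Python: translate the DNA in all six reading frames and look for each observed peptide in the result. This is a toy under stated assumptions, not a real proteogenomics search engine. It uses the standard genetic code, ignores introns entirely, and the names `six_frame_translations` and `locate_peptide` are invented for this example.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Yield (strand, frame, protein) for the three frames on each strand."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse)):
        for frame in range(3):
            protein = "".join(
                CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
                for i in range(frame, len(seq) - 2, 3)
            )
            yield strand, frame, protein


def locate_peptide(peptide, dna):
    """Return (strand, frame, residue_offset) for every exact match."""
    hits = []
    for strand, frame, protein in six_frame_translations(dna):
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, frame, pos))
            pos = protein.find(peptide, pos + 1)
    return hits


print(locate_peptide("MAK", "ATGGCCAAATAG"))  # [('+', 0, 0)]
print(locate_peptide("MAK", "CTATTTGGCCAT"))  # [('-', 0, 0)]
```

A peptide that maps to a stretch of genome with no annotated gene is exactly the kind of evidence that prompts a re-annotation.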

There are two parts to the big proteogenomics plan – two challenges to be met. The first is getting the protein sequences in the first place. This is where the expanding family of mass spectrometry techniques comes in. But the initial step in a proteogenomics pipeline is to extract the proteins from the tissue. Rather than use whole organisms, it is better to perform parallel runs on different tissues. This can yield the added benefit of detecting tissue-specific expression, as well as provide an in-built quality control based on ubiquitous proteins.

At this stage, we have to take into account that mass spectrometers have a limited mass range – the range of molecule sizes that they can handle from the smallest to the largest. To overcome this limitation, the proteins to be analysed are digested with enzymes such as trypsin and Lys-C. This provides a soup of peptide fragments that covers a fairly narrow range of sizes.
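What digestion does to a protein sequence can be sketched in Python. The rules used here are the commonly quoted simplifications and should be read as assumptions: trypsin cuts after lysine (K) or arginine (R) unless the next residue is proline (P), and Lys-C cuts after K regardless. The function `digest` and its parameters are hypothetical names for this example, and missed cleavages are ignored.

```python
def digest(protein, cleave_after="KR", blocked_by="P", min_len=1):
    """Cut a protein after each residue in cleave_after.

    A cut is skipped when the following residue is in blocked_by.
    Peptides shorter than min_len are dropped.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in cleave_after and not is_last and protein[i + 1] not in blocked_by:
            peptides.append(protein[start:i + 1])
            start = i + 1
    peptides.append(protein[start:])  # whatever is left after the final cut
    return [p for p in peptides if len(p) >= min_len]


protein = "MAKRGLKSR"  # made-up sequence for illustration
print(digest(protein))                                   # trypsin-like rule
print(digest(protein, cleave_after="K", blocked_by=""))  # Lys-C-like rule
```

With the made-up sequence above, the trypsin-like rule gives four peptides and the Lys-C-like rule gives three, which is why the choice of enzyme changes the peptide soup you end up analysing.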

Working backwards

The peptides are then separated into different pools in a step called prefractionation. There are several ways of doing this but one principle is to separate them according to their pI by isoelectric focusing. These fractions then serve as the working material for tandem mass spectrometry, the first stage of which is typically a liquid chromatography step to further fractionate the sample. The final output of the mass spectrometer is, as the name implies, a mass spectrum. Deducing the peptide sequence from the spectrum looks like chemistry and maths magic. One of these ‘magical’ methods is peptide mass fingerprinting. Because the masses of the fragments produced by digestion can be predicted, you are able to work backwards and figure out the combination of fragments that would have given you this spectrum.
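The 'working backwards' step can be illustrated with a toy version of mass fingerprinting in Python: predict the mass of every peptide each candidate protein would give, then count how many observed masses each candidate explains. The residue masses are the standard monoisotopic values. Everything else is an assumption for this sketch, including the function names, the 0.01 Da tolerance, the candidate peptides and the 'observed' masses, which are invented. Real search engines score fragment-ion spectra far more carefully.

```python
# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[residue] for residue in peptide) + WATER


def count_matches(observed, peptides, tol=0.01):
    """Count observed masses that lie within tol of any predicted mass."""
    predicted = [peptide_mass(p) for p in peptides]
    return sum(any(abs(obs - m) <= tol for m in predicted) for obs in observed)


def best_candidate(observed, candidates, tol=0.01):
    """Return the candidate explaining most observed masses, plus all scores."""
    scores = {name: count_matches(observed, peps, tol)
              for name, peps in candidates.items()}
    return max(scores, key=scores.get), scores


# Hypothetical digests of two candidate proteins and three invented masses.
candidates = {"protA": ["MAK", "GLK", "SR"], "protB": ["MAK", "VEK"]}
observed = [348.183, 316.211, 261.144]
print(best_candidate(observed, candidates))
```

Note that leucine and isoleucine have identical masses, so no mass-based method of this kind can tell them apart.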

The mass spectrometer churns out a whole pile of spectra, each of which corresponds to the sequence of a peptide that was found floating around in your soup. I have spared you (and me) the maths and chemistry details of this technique but, believe it or not, it is actually the easy part. The hard part is matching these sequences to the databases. Why is this so complicated? Simply because there is just so much of it.

We have a classic data deluge situation here – a modern proteomics station can churn out over a million spectra a day. But that’s not all. The size of the databases we have to search is even more problematic.

To get an insight into the scale of the problem, look at the statistics pages of the SwissProt and TrEMBL databases. You’ll notice that the number of SwissProt entries was growing steeply a few years ago – the size of the database doubled between 2006 and 2009. But after 2010 it tailed off. Why? Because we have got most of the sequences now? Not a bit of it.

SwissProt is a manually annotated database. If you take a look at the stats for the automatically annotated database TrEMBL, you will realise that the line on the chart goes straight up! It currently stands at 80 million sequences, up from a mere 50 million or so at the beginning of the year. Yes, this year.

Okay, I admit the TrEMBL database includes more than one species but my point still stands. Searching millions of sequences against millions of database entries is certainly beyond the capabilities of anyone’s laptop. How about a cluster of servers, then? Ask Stephen Callister what he thinks. His group at the Pacific Northwest National Laboratory, USA, ran spectra from Shewanella bacterial species against genome databases, using a 32-node cluster of computers (Turse et al., 2010, PLoS One 5, e13968). It took 260 days of CPU time.

Here lies a serious challenge for proteogenomics and it is a recurring theme when talking about big data – the desperate need for faster search algorithms. But they are coming through. Some of them include database indexing. How does this work? The protein database is digested in silico; in other words, you predict the results of digesting all the entries in your protein database with a given protease. Many of the resulting peptides will be identical, and database indexing both removes this redundancy and indexes the peptides by their mass.
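A minimal sketch of such an index in Python might look like the following. It assumes a simplified trypsin rule (cut after K or R unless followed by P), standard monoisotopic residue masses and a made-up two-protein database. The names `build_index` and `lookup` are hypothetical, and the regular-expression split on an empty match needs Python 3.7 or later.

```python
import bisect
import re

# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")  # simplified cleavage rule


def build_index(proteins):
    """Digest every protein, drop duplicate peptides, sort by mass.

    Returns two parallel lists: masses (ascending) and peptides.
    """
    unique = set()
    for sequence in proteins.values():
        unique.update(p for p in TRYPSIN_SITE.split(sequence) if p)
    pairs = sorted((sum(MONO[r] for r in p) + WATER, p) for p in unique)
    return [m for m, _ in pairs], [p for _, p in pairs]


def lookup(masses, peptides, query, tol=0.01):
    """Binary-search the index for peptides within tol of the query mass."""
    lo = bisect.bisect_left(masses, query - tol)
    hi = bisect.bisect_right(masses, query + tol)
    return peptides[lo:hi]


database = {"P1": "MAKGLK", "P2": "MAKSR"}  # invented entries
masses, peptides = build_index(database)
print(peptides)                            # MAK appears once, not twice
print(lookup(masses, peptides, 348.183))   # ['MAK']
```

The point of the design is that each spectrum is compared only against the handful of peptides in its mass window, found by binary search, rather than against every entry in the database.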

Other new algorithmic approaches work by deploying smarter ways of summarising the spectra. Inevitably, the algorithms coming through are being developed to work on existing computer hardware, so they will always lag behind hardware evolution. But rest assured, there is a cavalry of new search algorithms coming over the hill to our rescue.

The size of the datasets that proteogenomics has to match brings problems of another kind, though. With searches of this magnitude, hordes of hungry significant type 1 errors (‘false positives’) are out to get you. These are chance matches that are actually meaningless but come up anyway, purely because we have rolled the dice so many times. Rare, admittedly, but that very rarity makes them easy to mistake for an interesting, low-abundance protein – one of the very features that sells proteogenomics in the first place.
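The dice-rolling arithmetic is easy to check in Python. The sketch below assumes that every comparison is independent and has the same small chance of matching by accident. Both are simplifications, and the numbers fed in are purely illustrative, not measurements from any real search.

```python
import math


def expected_false_positives(n_comparisons, p_chance):
    """Expected number of chance matches over n independent comparisons."""
    return n_comparisons * p_chance


def prob_at_least_one(n_comparisons, p_chance):
    """Probability of one or more chance matches: 1 - (1 - p)^n.

    Computed with log1p/expm1 so it stays accurate for tiny p and huge n.
    """
    return -math.expm1(n_comparisons * math.log1p(-p_chance))


# Illustrative only: a million comparisons, each with a 1-in-10,000
# or 1-in-a-million chance of a meaningless match.
print(expected_false_positives(1_000_000, 1e-4))
print(prob_at_least_one(1_000_000, 1e-6))
```

Under these assumptions, a one-in-a-million fluke becomes more likely than not once you have made a million comparisons, which is why large searches need stricter significance thresholds than small ones.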

Other barriers to the development of proteogenomics spring from limitations in the mass spectrometry. It needs relatively large amounts of clean starting material and it isn’t so good at detecting low-abundance, short, very hydrophobic or very hydrophilic peptides.

The single best answer

So what do we get from it? If proteins are, indeed, the end-point of what genes are for, then getting proteogenomics right would give us the gold-standard: a direct comparison of what the genome says with what it makes. In other words, it is the nearest we get to the ‘single best answer’ in functional gene annotation.

This year, the laboratory of Vineet Bafna at the University of California, San Diego, announced that they have combined proteogenomics with aggregated RNA data on the nematode Caenorhabditis elegans, one of the best-annotated genomes available (Woo et al., J Proteome Res 13: 21-8). They revealed a shocking 4,044 novel ‘events’, including 215 new genes, 808 new exons, 12 new alternative splicings, 618 gene-boundary corrections, 245 exon boundary changes and 938 frame shifts!

That isn’t a one-off. Another group found the same sort of thing when they re-annotated the zebrafish genome, finding 157 new genes, as well as exon border corrections, frameshifts and so on (Kelkar et al., Mol. Cell. Proteomics 2014, advance publication). Manda et al. found 12 novel proteins on human chromosome 12, all of which had erroneously been annotated as pseudogenes, as well as alternative start points for several proteins (J Proteome Res 13: 3166-77).

Proteogenomics comes with caveats and gotchas – but what technique doesn’t? The most serious challenge is bringing the computation into a practical time-frame. But this issue plagues all big-data biology and there is room for confidence that this problem will be eroded over time, giving this latest ‘omics’ the potential for making a big impact on the quality of genome annotation.

Last Changed: 16.09.2014