Bench philosophy: Digging Deeper

by Steven Buckingham, Labtimes 03/2013

RNA sequencing is opening up next generation sequencing to a host of new disciplines. But when you mix NGS and transcriptomics together, amazing things happen. It’s easy, it’s cheap and it’s better.

Remember the excitement when the first genomes came out? For the first time we got complete gene families, we got piles of new gene sequences and we could compare genomes across different species. Remember poring over the annotated sequences, looking for mutations that would explain our pet diseases? But there were problems, weren’t there? For one, sequencing wasn’t exactly fast. It took several years for the first eukaryotic genome (the one millimetre worm, C. elegans), and it seemed ages waiting for the human genome to come out. Sure, as we got better at doing it, things sped up. But the methods used at the time placed a limit on how fast they could ever become.

Then it dawned on us that, even if we could sequence a genome in days, there are limits to what we can get from a sequence. Of course, it goes back to a basic principle we all learnt back in school – DNA makes RNA and it is the RNA that makes proteins. So, how an organism will look and behave depends more on the RNA than on the DNA. For one, there is a whole lot of information in RNA that doesn’t show up in the genome.

Don’t be deceived

Forget what your teachers told you: even the messenger RNA is not a faithful copy of the gene. It will have been spliced, edited and modified in a number of ways that cannot be predicted just from looking at the gene. Then there is the quantity of RNA – more RNA obviously means more of the protein – which itself is a function of how rapidly it is made and turned over. On top of that, messenger RNA isn’t the whole story. One of the most exciting insights into genomics this decade is that there are other forms of RNA, called “non-coding RNA”, that control protein expression.

But two things have happened since those antediluvian days. The problem of sequencing speeds stopped being a problem when deep sequencing came along. Sometimes called “next generation sequencing”, deep sequencing is a catch-all for a number of approaches that solve the problem of sequencing in radically different ways to the “traditional” Sanger method. No matter for the moment how they do it, what does matter is that they do it fast. A genome a day is not far away.

The second thing that happened is RNA sequencing (RNAseq): a set of procedures that apply deep sequencing to the RNA present in the cell. It not only tells you what RNA molecules are present in your sample but also how much of each one there is. In other words, RNAseq opens the door to quantitative transcriptomics.

Lots of details to consider

How is it done? RNAseq is a three-step process: build a short-length library, deep sequence it, reassemble it. A cDNA or RNA library is extracted from the sample in the usual way and broken into small fragments. But don’t be deceived: there are lots of things you have to get right here if you want good results. For one, let’s assume you take the common step of including poly-adenylate selection. You will select for transcripts that bear a string of adenylate bases, a signature feature of messenger RNA. Granted, this gets rid of ribosomal RNA, which is good, but there is a problem: you will also have filtered out short non-coding RNAs, so your transcriptome will be missing out on a whole exciting class of RNAs.

Well, a bit of planning at that step will fix those issues. Let’s move on to the next part: sequencing the fragments. Although deep sequencing has already become something of a standard method, there are several details to consider to get the best results. Take two to start with: which sequencing platform should we use? And what read length should we go for? Okay, let’s imagine we have got our sequences back. The next job is to sort out all the artefacts in the results. I’m afraid there will be no shortage of these. We’ll have to get rid of the sequencing adapters, of course – that goes without saying. But there are some more subtle issues. For one, we’ll hit problems if we leave in the low complexity reads and the near-identical reads that are just artefacts of the PCR process. Then there are, of course, the sequencing errors. But don’t worry; there are established methods for detecting most of these artefacts.
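To make the clean-up step concrete, here is a minimal Python sketch of the three filters just mentioned: adapter trimming, low-complexity removal and duplicate removal. Everything specific in it is an assumption chosen for illustration: the adapter string, the length and entropy cut-offs and the function names. It only catches exact duplicates, not the near-identical reads described above, and real pipelines rely on dedicated, far more careful tools.

```python
from collections import Counter
import math

# Assumed adapter sequence, for illustration only; use the one from your own kit.
ADAPTER = "AGATCGGAAGAGC"

def trim_adapter(read, adapter=ADAPTER):
    """Cut the read at the first exact occurrence of the adapter."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]

def shannon_entropy(read):
    """Base-composition entropy in bits: 0 for a single repeated base, 2 for equal A/C/G/T."""
    counts = Counter(read)
    total = len(read)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def clean_reads(reads, min_length=20, min_entropy=1.0):
    """Trim adapters, then drop short, low-complexity and exact-duplicate reads."""
    seen = set()
    kept = []
    for read in reads:
        read = trim_adapter(read)
        if len(read) < min_length:
            continue  # too short to place reliably after trimming
        if shannon_entropy(read) < min_entropy:
            continue  # low complexity, e.g. a long run of one base
        if read in seen:
            continue  # exact duplicate, possibly a PCR artefact
        seen.add(read)
        kept.append(read)
    return kept

raw = [
    "ACGTACGTTGCAAGCTTGCATGCA" + ADAPTER + "TTTT",  # good read with adapter read-through
    "A" * 30,                                        # low complexity
    "ACGTACGTTGCAAGCTTGCATGCA",                      # duplicate of the first read
    "ACGT",                                          # too short
]
print(clean_reads(raw))
```

One caveat on the design: throwing away every duplicate is a blunt instrument for RNAseq, because a very abundant transcript will legitimately produce identical reads.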

Computer power needed

Now we get to the exciting part: assembling the reads into a transcriptome. There are basically two ways to do this: either by comparing it to a reference genome, or by assembling it de novo. Which of these strategies we go for depends on what our project is trying to do. But we also have to think about what computing resources we’ve got – de novo assembly needs a lot more computing power, for instance.
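A toy Python sketch shows why the two routes differ so much in computing cost. With a reference, placing a read is a single lookup. Without one, every read has to be compared with every other read to find overlaps. The sequences, function names and the greedy merging strategy are all assumptions made for illustration; real aligners and assemblers are far more sophisticated and tolerate sequencing errors, which this sketch does not.

```python
def map_to_reference(read, reference):
    """Reference-based route: return the position of an exact match, or -1."""
    return reference.find(read)

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """De novo route: repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        # All-against-all comparison: this is where the computing time goes.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left, stop with several contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reference = "ATGGCGTACGTTAGC"                   # hypothetical transcript
reads = ["ATGGCGTAC", "GTACGTTAG", "GTTAGC"]    # hypothetical error-free reads

print(map_to_reference("GTACGTTAG", reference))
print(greedy_assemble(reads))
```

In this example the de novo route rebuilds the hypothetical transcript from three reads, but the nested loops make the cost grow with the square of the number of reads, which is the intuition behind the heavier computing demands.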

So what do we get for our efforts? RNAseq is a major step forward, not least because we now have a fast, convenient route to a quantitative profile of the transcriptome. But to get an idea of why it is stirring so much excitement, just hold it up against what it is replacing – microarray tiling and cDNA sequencing. The resolution of RNAseq is down to individual bases, so no advantage over cDNA sequencing there, although you won’t get that with microarrays.

The same goes for the need for a reference genome: here both RNAseq and cDNA sequencing steal a march on microarrays. But RNAseq does share some significant plusses with microarrays compared to cDNA sequencing. For one, think of the speed. Because it is based on high-throughput sequencing rather than Sanger methods, it leaves cDNA sequencing coughing in the dust. And, of course, RNAseq and microarrays give you expression information, which is not there in cDNA sequencing. So, RNAseq gives you the best of its two older rivals. But it also has some of its own distinctive advantages that push it in front of them both. First, it has an enormous dynamic range. That means very rare transcripts will be picked up, while very abundant ones won’t saturate the signal. Compare its over 8,000-fold dynamic range with the 100-fold range you can expect from microarrays. On top of that, it is fast. With RNAseq there is no amplification step, so you don’t need so much RNA to start with. That opens a wide door for single-cell sequencing.
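To put the dynamic range figures in perspective, a short Python sketch can express them on the log2 scale that expression data are usually plotted on. The fold values are the ones quoted above; the read counts in the example and the function names are hypothetical.

```python
import math

def fold_range(lowest_signal, highest_signal):
    """Ratio between the strongest and the weakest signal that can still be measured."""
    return highest_signal / lowest_signal

def log2_range(fold):
    """The same range expressed as a number of doublings."""
    return math.log2(fold)

# Hypothetical example: the rarest transcript gives 2 reads, the most abundant 16,000.
print(fold_range(2, 16000))

# Figures quoted in the text, converted to doublings.
print(round(log2_range(8000), 2))  # RNAseq
print(round(log2_range(100), 2))   # microarrays
```

Seen this way, the quoted figures correspond to roughly 13 doublings for RNAseq against fewer than 7 for microarrays.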

No free lunch

But you don’t get these advantages for nothing. Indeed, RNAseq faces some serious challenges. First, there is the issue that is all too familiar with high-volume biology: computing power and storage space. The sheer volume of data spewed out by RNAseq – up to 50 gigabases per run – means you will need a regular supply of hard drives. Don’t forget, having large datasets doesn’t just eat up disk space: you need to be able to read and process that data. But the bioinformatics challenge goes deeper than just an IT issue. RNAseq, and the deep sequencing that undergirds it, generates short reads that can be a statistical nightmare to match up.
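A back-of-the-envelope calculation in Python gives a feel for the disk space involved. It assumes the reads are stored as uncompressed FASTQ, a format with four lines per read (header, bases, a "+" separator and one quality character per base). The read length of 100 bases and the 50-byte header are assumptions for illustration, as is the function name.

```python
def fastq_bytes(total_bases, read_length=100, header_bytes=50):
    """Rough size in bytes of an uncompressed FASTQ file for one run."""
    n_reads = total_bases // read_length
    record = (
        (header_bytes + 1)   # header line plus newline
        + (read_length + 1)  # base calls plus newline
        + 2                  # "+" separator plus newline
        + (read_length + 1)  # quality string plus newline
    )
    return n_reads * record

run_bases = 50 * 10**9  # the 50 gigabases per run quoted above
size = fastq_bytes(run_bases)
print(size / 10**9, "GB")
```

Under these assumptions a single run comes to well over 100 gigabytes before any alignment files or intermediate results are written. Compression helps, but the data still have to be read and processed.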

Then there are issues at the bench level. For instance, the details of how you fragment the library are known to introduce biases and we just don’t know enough about those biases yet. Last year, a group from the Mount Sinai School of Medicine in New York tried to measure them directly and found that the RNA ligases used in the library preparation have their own biases, which can skew the results (Jayaprakash et al., Nucleic Acids Res. 2011 Nov; 39(21):e141. doi: 10.1093/nar/gkr693).

Avoid the pitfalls

Again, although counting transcripts is thought to be very accurate with RNAseq, all the same nasty traps lie waiting here, too. For one, counting can be complicated when there is a lot of alternative splicing. That is because transcripts can share a lot of exons, so when you have a read from one of these exons it can be hard to tell which transcript it belongs to. And once again, the counts can be influenced by differences in the cDNA library construction.
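The splicing problem is easy to see in a small Python sketch. Two hypothetical isoforms share two of their exons, so only reads from the exon unique to one isoform can be counted without ambiguity. The isoform names, exon labels and functions are invented for illustration; real quantification tools resolve the ambiguous reads with statistical models rather than setting them aside.

```python
# Hypothetical gene with two isoforms; isoform_B skips exon2.
TRANSCRIPTS = {
    "isoform_A": {"exon1", "exon2", "exon3"},
    "isoform_B": {"exon1", "exon3"},
}

def compatible_transcripts(exon, transcripts=TRANSCRIPTS):
    """Every transcript that contains the exon a read came from."""
    return sorted(name for name, exons in transcripts.items() if exon in exons)

def count_reads(read_exons, transcripts=TRANSCRIPTS):
    """Count reads that point to exactly one transcript; tally the rest as ambiguous."""
    unique = {name: 0 for name in transcripts}
    ambiguous = 0
    for exon in read_exons:
        hits = compatible_transcripts(exon, transcripts)
        if len(hits) == 1:
            unique[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1
    return unique, ambiguous

# Each entry is the exon that one read maps to.
reads = ["exon1", "exon2", "exon3", "exon1"]
print(count_reads(reads))
```

Here three of the four reads could have come from either isoform, and isoform_B has no exon of its own at all, so at the exon level no read can ever prove it is there.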

Sequencing depth is another issue that can affect the results. A recent paper by Ana Conesa’s group from the Centro de Investigación Príncipe Felipe, Valencia, Spain, showed that differential expression calls – a measure comparing the level of RNA expression in different samples – can be alarmingly dependent on how deep you sequence (Tarazona et al., Genome Res. 2011 Dec; 21(12):2213-23).

But don’t get me wrong – these are details to be ironed out and they have no chance of stopping the explosion of new applications for RNAseq. To take just one instance, labs are racing to leverage the method in forward genetic screens. Hunting for mutations that cause a disease usually means mapping to a site on a chromosome, followed by some very tedious sequence analysis. The speed and low cost of RNAseq mean you can identify single mutations very quickly. But it works the other way too: when you know the mutation but don’t know what it does. In this case, having an accurate snapshot of the transcriptome can shed light on gene function very quickly.

Hunting for mutations

Take FUS, for example. FUS has been known to play a role in amyotrophic lateral sclerosis; however, it wasn’t clear exactly what role. Earlier this year, John Landers’ group at the University of Massachusetts Medical Center, together with Leonard H. van den Berg’s group at the University Medical Centre Utrecht, used RNAseq to tackle this issue (van Blitterswijk et al., PLoS ONE 8(4): e60788. doi:10.1371/journal.pone.0060788). They compared the expression profiles in cells expressing mutant forms of the gene, or cells in which expression had been knocked down, and came up with evidence that FUS works through the splicing machinery.

We need to know more about the biases that can undermine the interpretation of RNAseq datasets. But with the time and cost of sequencing falling rapidly, and an increasing interest in rare and non-coding RNAs, RNAseq is likely to replace cDNA sequencing and microarrays in the future. The vendors of computer storage solutions can look forward to a wealthy future.

Last Changed: 09.05.2013