RNA-seq data processing and analysis tutorial
RNA-seq has become a powerful approach to study the continually changing cellular transcriptome. Here, one of the most common questions is to identify genes that are differentially expressed between two conditions, e.g. controls and treatment. The main exercise in this tutorial will take you through a basic bioinformatic analysis pipeline to answer just that, it will show you how to find differentially expressed (DE) genes. Briefly,
- in the main exercise, we will,
- check the quality of the raw reads with FastQC
- map the reads to the reference genome using Star
- convert between SAM and BAM files format using Samtools
- assess the post-alignment reads quality using QualiMap
- count the reads overlapping with genes regions using featureCounts
- build statistical model to find DE genes using edgeR from a prepared R script
As discussed during the lecture, RNA-seq experiment does not end with a list of DE genes. If you have time after completing the main exercise, try one (or more) of the bonus exercises. The bonus exercises can be run independently of each other, so choose the one that matches your interest.
- In the bonus section you can find additional exercises
- BEx. 01 Functional annotation how to put DE genes in the biological context of functional annotations
- BEx. 02 Exon usage how to perform analysis of differential exon usage and study alternative splicing
- BEx. 03 Visualisation how to view RNA-seq bam files and present DE results with graphics
- BEx. 04 De novo transcriptome assembly how to assembly transcriptome if no reference is present
Data description
The data you will be using in this exercise is from the recent paper YAP and TAZ control peripheral myelination and the expression of laminin receptors in Schwann cells. Poitelon et al. Nature Neurosci. 2016. In the experiments performed in this study, YAP and TAZ were knocked-down in Schwann cells to study myelination, using the sciatic nerve in mice as a model.
Myelination is essential for nervous system function. Schwann cells interact with neurons and the basal lamina to myelinate axons using receptors, signals and transcription factors. Hippo pathway is an evolutionary conserved pathway involved in cell contact inhibition, and it acts to promote cell proliferation and inhibits apoptosis. The pathway integrates mechanical signals (cell polarity, mechanotransduction, membrane tension) and gene expression response. In addition to its role in organ size control, the Hippo pathway has been implicated in tumorigenesis, for example its deregulation occurs in a broad range of human carcinomas. Transcription co-activators YAP and TAZ are two major downstream effectors of the Hippo pathway, and have redundant roles in transcriptional activation.
The material for RNA-seq was collected from 2 conditions (wt and YAP(kd)TAZ(kd)), each in 3 biological replicates (see table below).
Accession | Condition | Replicate |
---|---|---|
SRR3222409 | KO | 1 |
SRR3222410 | KO | 2 |
SRR3222411 | KO | 3 |
SRR3222412 | WT | 1 |
SRR3222413 | WT | 2 |
SRR3222414 | WT | 3 |
For the purpose of this tutorial, that is to shorten the time needed to run various bioinformatics steps, we have down-sampled the original files. We randomly sampled, without replacement, 25% reads from each sample, using fastq-sample from the fastq-tools tools.
Bioinformatics: processing raw sequencing files
Reading manuals, trying different tools/options, finding solutions to problems are daily routine work for bioinformaticians. By now you should have some experience with using command line and various bioinformatic tools, so in order to feel like a pro we encourage you to try your own solutions to the problems below, before checking the solution key. Click to see the suggested answers to compare them with your own solutions. Discuss with person next to you and ask us when in doubt. Remember that there is more than one way to skin a cat. Have fun!
Preparing a working directory
To get going, let’s book a node, create a working directory named transcriptome in the _/proj/g2018002/nobackup/
Book a node. As for other tutorials in this course we have reserved half a node per person. If you have not done it yet today book a node now as otherwise you will take away resources from your fellow course participants.
Click to see how to book a node on Thursday
Click to see how to book a node on Friday
Create a folder named transcriptome for your project in your _/proj/g2018002/nobackup/
Click to see suggested commands
Sym-link the .fastq.gz files located in /sw/courses/ngsintro/rnaseq/DATA/p25. A great chance to practice your bash loop skills.
Click to see suggested commands
Check if you linked the files correctly. You now should be able to see 12 links to the .fastq.gz files.
FastQC: quality check of the raw sequencing reads
After receiving raw reads from a high throughput sequencing centre it is essential to check their quality. Why waste your time on data analyses of the poor quality data? FastQC provide a simple way to do some quality control check on raw sequence data. It provides a modular set of analyses which you can use to get a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
Read more on FastQC. Can you figure out how to run it on Uppmax?
Create fastqc folder in your transcriptome directory. Navigate to fastqc folder.
Click to see suggested commands
Load bioinfo-tools and FastQC modules
Click to see suggested commands
Run FastQC on all the .fastq.gz files located in the transcriptome/DATA. Direct the output to the fastqc folder. Check the FastQC option for input and output files. The bash loop comes handy again.
Click to see suggested commands
Download the FastQC for the proceeded sample from Uppmax to your compute and have a look at it. Go back to the FastQC website and compare your report with Example Report for the Good Illumina Data and Example Report for the Bad Illumina Data data.
Discuss whether you’d be happy when receiving this very data from the sequencing facility.
Jump to the top
STAR: aligning reads to a reference genome
After verifying that the quality of the raw sequencing reads is acceptable we can map the reads to the reference genome. There are many mappers/aligners available, so it may be good to choose one that is adequate for your type of data. Here, we will use a software called STAR (Spliced Transcripts Alignment to a Reference) as it is good for generic purposes, fast, easy to use and has been shown to outperform many of the other tools when aligning 2x76bp paired-end data (2012). Before we begin mapping, we need to obtain genome reference sequence (.fasta file) and a corresponding annotation file (.gtf) and build a STAR index. Due to time constrains, we will practice on chromosome 11 only. Then we will use the pre-prepared index for the entire genome to do the actual mapping.
Accessing reference genome and genome annotation file
It is best if the reference genome (.fasta) and annotation (.gtf) files come from the same source to avoid potential naming conventions problems. It is also good to check in the manual of the aligner you use for hints on what type of files are needed to do the mapping.
Check STAR manual what files are needed for the mapping.
What is the idea behind building STAR index? What files are needed to build one? Where do we take them from? Could one use a STAR index that was generated before?
Create the reference sub-folder in transcriptome directory
Click to see how to create the directory
Download the reference genome .fasta file for chromosome 11, mouse and the corresponding genome annotation .gtf file from Ensmeble webite.
Click for the link to the Ensemble website
http://www.ensembl.org/info/data/ftp/index.htmlClick to see file names to be downloaded
Click to see how to transfer files from Ensembl website to Uppmax
The files you have just downloaded are compressed using gzip; you need to decompress them before use.
Click to see how
You should now have Mus_musculus.GRCm38.dna.chromosome.11.fa and Mus_musculus.GRCm38.85.gtf in the sub-folder reference
preparing index
Create indexChr11 sub-folder in the transcriptome directory
Click to see how to create directory
Load STAR module on Uppmax. Use module spider star to check which version of STAR are available and load the latest one.
Click to see how to load module
Build STAR index for chromosome 11 using the downloaded reference .fasta and gene annotation .gtf files. Check STAR manual for details
Click again to see suggested commands
Check if building the index worked
Sym-link STAR index to for the entire genome into the transcriptome directory. The index for the whole genome was prepared for us before class in the very same way as for the chromosome 11 in steps above. It just requires more time (ca. 4h) to run. The index can be found here: /sw/courses/ngsintro/rnaseq/index
Click again to see how to link the index
mapping
Now we are ready to map our reads to the reference genome, via STAR index.
Create star sub-folder in the transcriptome directory. Create sub-sub-folder named SRR3222409 to save the mapping results for the sample SRR3222409.
Click to see how to create folders
Map reads to the reference genome for SRR3222409 sample. Do not forget that we are working with paired-end reads so each sample has two matching reads file. Check the STAR manual for the parameters to:
- use index for the entire genome
- to read in zipped .fastq.gz files for both forward and reverse reads
- to run the job on the 8 allocated cores
- to direct the mapping results to the SRR3222409 sub-sub folder
- to give the results prefix SRR3222409
Click to see how to write the mapping command with the above parameters
You should now have .sam file in the SRR3222409 as well as a series of log files. Have a look how the mapping went.
Map or copy over. Map the remaining samples in the analogous way. Running short of time? Copy over the results that we have prepared for you before the class. They are here: /sw/courses/ngsintro/rnaseq_2016/main/Star
Click to see how to copy results, sample by sample
Click to see how to copy results using bash loop
Samtools: converting between SAM and BAM
Before we proceed further with our data processing, let’s convert our mapped reads from STAR, saved in the default .SAM text format, into the binary .BAM format. Why? BAM files take less space so it is easier to store them and they are the most commonly required file format for many of the down-stream bioinformatics tools. In addition, they can be sorted and indexed shortening the time needed to proceed them in comparison with .SAM format. Also, then they will be ready for exploration in IGV, the Integrative Genomic Viewer.
Read through Samtools documentation and see if you can figure it out how to:
- convert SAM into BAM
- sort BAM files
- index BAM files
Create bams sub-folder in transcriptome, navigate to bams sub-folder and load samtools module
Click to see the suggested commands, file by file
Sym-link in bams sub-folder all the SAM files containing the mapped reads, as created during the Star mapping step. You can use the bash loop if you apply wild cards (slightly more advanced but try first before looking at the answer key)
Click to see the suggested commands, sample by sample
Click to see the suggested commands, using bash loop
Check to see the linked files
Convert SAM to BAM: for the first sample SRR3222409_Aligned.out.sam into SRR3222409_Aligned.out.bam
Click to see the suggested commands
Convert SAM to BAM for the remaining samples or copy them over. You can find the BAMs prepared for your in /sw/courses/ngsintro/rnaseq/main/bams/out
Click to see how to copy over sample by sample
Click to see how to copy over using a bit more advanced bash loop
Check to see the converted BAM files
Sort BAM file sort SRR3222409_Aligned.out.bam file and save it as SRR3222409_Aligned.out.sorted.bam in the bams sub-folder
Click to see how to sort BAM file
Sort BAM files for the remaining samples or copy them over. You can find the BAMs prepared for your in /sw/courses/ngsintro/rnaseq/main/bams/out, ending with sorted.bam
Click to see how to copy over sample by sample
Click to see how to copy over using a bit more advanced bash loop. It is the same as above, just the file extension is different.
Check to see the sorted BAM files
Index the sorted BAM files
Click to see how to index BAM file, sample by sample
Click to see how to index BAM file, using bash loop
Check to see the BAM indexes
QualiMap: post-alignment quality control
Some important quality aspects, such as saturation of sequencing depth, read distribution between different genomic features or coverage uniformity along transcripts, can be measured only after mapping reads to the reference genome. One of the tools to perform this post-alignment quality control is QualiMap. QualiMap examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads and provides an overall view of the data that helps to the detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.
Read through QuliMap documentation and see if you can figure it out how to run it to assess post-alignment quality on the RNA-seq mapped samples. The tool is already installed on Uppmax and available as QuliMap module
Create QualiMap sub-folder in transcriptome directory, navigate to qualimap sub-folder and load QualiMap/2.2 module
Click to see the suggested commands
Run QualiMap for the fist sample on sorted BAM file SRR3222409_Aligned.out.sorted.bam directing the results to */proj/g2018002/nobackup/
Click to see the suggested commands
Check if the QualiMap run correctly
Run QualiMap for the remaining sorted BAM files or copy the results over. These can be found /sw/courses/ngsintro/rnaseq_2016/main/qualimap
Click to see how to copy over the results, sample by sample
Click to see how to copy over the results, using bash loop
Check the QualiMap results. What do you think? Are the samples of good quality? How can you tell?
Jump to the top
featureCounts: counting reads
After ensuring mapping quality we can count the reads to obtain a raw count table. We could count the reads by hand, opening the BAM in the IGV along the genome annotation file, and counting the reads overlapping with the regions of interest. This of course would take forever for the entire genome but it is never a bad idea to see how the data look like for the selected few genes of interest. For get the counts for the entire genome one can use many of the already available tools doing just that. Here we will use featureCounts, an ultrafast and accurate read summarization program, that can count mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations.
Read featureCounts documentation and see if you can figure it out how to run summarize paired-end reads and count fragments overlapping with exonic regions.
Create featurecounts sub-folder in the transcriptome directory and navigate there.
Click to see how...
Load featureCounts module. featureCounts is available on Uppmax as part of the subread package
Click to see how to load featureCounts
Sym-link SAM files generated by Star and located in */proj/g2018002/nobackup/
Click to see sym-link files, sample by sample
Click to see sym-link files, using bash loop
Run featureCounts on the SAM files, counting fragments overlapping exon regions and saving the count tables as tableCounts. The libraries are un-stranded and you can proceed all the samples in one go.
Click to see how to run featureCounts on all samples
Check if featureCounts run. You should have two files now:
Have a look at the tableCounts and tableCounts.summary. Can you figure out what these files contain? Do you think that counting work? How can you tell?
Jump to the top
MultiQC: combining QC measures across all the samples
Read more on MultiQC. Can you figure out why this tool has become very popular? Can you figure out how to combine FastQC, Star, QualiMap and featureCounts results for all the samples into interactive report?
Navigate to transcriptome directory and load module MultiQC/1.3
Click to see how
Run MultiQC
Click to see how to run MultiQC
Transfer the MultiQC report to your computer and have a look at it. What can you notice?
Jump to the top
Differential expression
As mentioned during the lecture, the best way to perform differential expression is to use one of the statistical packages, within R environment, that were specifically designed for analyses of read counts arising from RNA-seq, SAGE and similar technologies. Here, we will one of such packages called edgeR. Learning R is beyond the scope of this course so we prepared basic ready to run scripts from a command line scripts to find DE genes between KO and Wt.
Create DE sub-folder in the transcriptome directory and navigate there
Click to see how
Load R module and R packages
Click to see how
Sym-link tableCounts as created by featureCounts and sym-link the prepared gene annotation file tableCounts_annotation.txt prepared by us before the class and located: /sw/courses/ngsintro/rnaseq/main/DE as well as sym-link R script located in the same directory
Click to see how
Run diffExp.R script
Click to see how
A file results_DE.txt should be created in the DE sub-folder
Copy over to your computer open the results_DE.txt
file in Excel or alike. Given FDR value of 0.05, how many DE genes are there? How many up and down-regulated? What are the top changes? How does it change when we only look at the DE that have minimum log-fold-change 1?
Jump to the top
Bonus exercise: from differential expression to biological knowledge
Introduction to functional annotation
In this part of the exercise we will address the question which biological processes are affected in the experiment; in other words we will functionally annotate the results of the analysis of differential gene expression (performed in the main part of the exercise). We will use Gene Ontology (GO) term and reactome pathway annotations.
When performing this type of analysis one has to keep in mind that the analysis is only as accurate as the annotation available for your organism, so if working with non-mainstream model organisms which do have most of the annotations only electronically inferred (as opposed to direct evidence), the results may not be fully reflecting the actual situation.
There are many methods to approach the question which biological processes and pathways are overrepresented amongst the differentially expressed genes, compared to all genes included in the analysis. They use several types of statistical tests (e.g. hypergeometric test, Fisher’s exact test), and many have been developed with microarray data in mind. Not all of these methods are appropriate for RNA-seq data, which as you remember from the lecture, exhibit length bias in power of detection of differentially expressed genes (i.e. longer genes, which tend to gather more reads, are more likely to be detected as “differentially expressed” than shorter genes, solely because of the length).
We will use the R / Bioconductor package goseq, specifically designed to work with RNA-seq data. This package provides methods for performing Gene Ontology and pathway analysis of RNA-seq data, taking length bias into account.
In this part, we will use the same data as in the main workflow. The starting point of the exercise is the file with results of the differential expression produced in the main part of the exercise.
Running functional annotation is typically not computationaly-heavy and it may be easier to run it on your local computer. Therefore this module can be performed on Uppmax or on your local computer. For the latter you need have R statistical programming language installed. An optional graphical interface to R such as RStudio is also recommended.
Follow icon for running the module on Uppmax. Follow to run things on your own computer.
Setting-up workspace
Install packages used in the exercise. You can do it by pasting the following two commands in R session:
source("http://bioconductor.org/biocLite.R")
biocLite(c("goseq","GO.db","reactome.db","org.Mm.eg.db"))
To perform the exercise you will need data included in the following location at Uppmax:
/sw/courses/ngsintro/rnaseq/bonus/funannot
Copy over the directory to your working directory on your local computer
Click to see an example of a command
Copy over the directory to your working space transcriptome
Click to see an example of a command
Workflow
Load R and R modules required in the exercise:
Click to see an example of a command
Enter the exercise working directory:
Click to see how
To perform the functional annotation you can use a wrapper script annotate_de_results.R, which is executed as in the main exercise:
Click to see how
The results will be saved in the directory GO_react_results.
Enter the exercise working directory:
Click to see how
To perform the functional annotation you can use a wrapper script annotate_de_results.R, which is executed as in the main exercise:
Click to see how
The results will be saved in the directory GO_react_results.
Alternatively, you can open the script in RStudio (or a text editor such as Atom or Sublime) and execute each step of the script in a live R session. This way you will be able to “see inside” the script and you can try to follow the individual steps.
Interpretation
The results are saved as tables in the directory GO_react_results.
The columns of the results tables are:
| category | over_represented_pvalue | under_represented_pvalue | numDEInCat | numInCat | term | ontology |
You can view the tables in a text editor, and try to find GO terms and pathways relevant to the experiment using a word search functionality.
The results are also collected in the following objects:
You can explore the results either by printing them all on the screen
Click to see example
or by performing a string search using grep
Click to see example
Click to see example
Do you think the functional annotation reflects the biology of the experiments we have just analysed?
Jump to the top
Bonus exercise: exon usage
Introduction to differential exon usage
Multi-exon genes are affected by alternative splicing and thus can express a variety of different transcripts from the same genomic sequence. Differences in the relative expression of these isoforms between tissues and species occur naturally between cell types and allow cells to adapt to the environment.
It is important to distinguish differential transcript usage (DTU) from gene-level differential expression (which was covered in the main part of the exercise) and from transcript-level differential expression. DTU considers changes in the proportions of the isoforms of a gene that are expressed as opposed to changes of the individual transcript levels.
Although the main units of interest are the transcripts, it has been difficult to obtain accurate and precise transcript-level expression estimates due to the extensive overlap between different transcripts. To circumvent that number of methods have been developed that, instead of estimating expression levels of transcripts, analyse levels of transcript building blocks (exons). Hence, differential exon usage (DEU) can be viewed as a proxy to the differential transcript usage.
Note that DEU is a more general concept than alternative splicing since it also includes changes in the usage of alternative transcript start sites and polyadenylation sites, which can cause differential usage of exons at the 5’ and 3’ boundary of transcripts.
The Bioconductor package DEXSeq implements a method to test for differential exon usage in comparative RNA-Seq experiments. By differential exon usage we mean changes in the relative usage of exons caused by the experimental condition. The relative usage of an exon is defined as number of transcripts from the gene that contain this exon vs. number of all transcripts from the gene.
In this exercise we will use reads mapped to chromosome 11 only as performing this analysis on entire data set is quite time consuming and requires considerable computing power. The starting point are files with reads summarised per each annotated exon, prepared beforehand by us.
This module can be performed on Uppmax, or on your local computer if you have installed R statistical programming language and (optionally) a graphical interface to R such as RStudio.
Follow icon for running the module on Uppmax. Follow to run things on your own computer.
Setting-up workspace
Libraries to install and load if exercise is performed locally
Install packages used in the exercise. You can do it by pasting the following two commands in R session:
source("http://bioconductor.org/biocLite.R")
biocLite("DEXSeq")
In addition you may need to install X11 app (XQuartz on MacOS) to produce the html report.
To perform the exercise you will need data included in the following location at Uppmax:
/sw/courses/ngsintro/rnaseq/bonus/exon
Copy over the directory transcriptome folder
Click to see an example of a command
Copy over the directory to your workspace on your local computer
Click to see an example of a command
Workflow
Load R and R modules required in the exercise:
Click to see how
Enter the exercise working directory:
Click to see how
Perform the functional annotation using a wrapper script deu.R
Click to see how
Enter the exercise working directory:
Click to see how
Perform the functional annotation using a wrapper script deu.R
Click to see how
Alternatively, you can open the script in RStudio (or a text editor such as Atom or Sublime) and execute each step of the script in a live R session. This way you will be able to “see inside” the script and try to follow the individual steps.
The results in html format are saved in the directory DEXSeqReport. For detailed description of the individual report sections please consult DEXSeq user manual. Additionally, a table with significant exons and statistics relevant to their differential usage deu_signif_exons.txt is saved in the working directory.
How many differentially used exons were identified in the data? Do you think this result makes sense?
Jump to the top
Bonus exercise: RNA-seq visualisation
Introduction
Data visualisation is important to be able to clearly convey results, but can also be very helpful as tool for identifying issues and noteworthy patterns in the data. In this part you will use the bam files you created earlier in the RNA-seq lab and use IGV (Integrated Genomic Viewer) to visualize the mapped reads and genome annotations. In addition we will produce high quality plots of both the mapped read data and the results from differential gene expression.
IGV
If you are already familiar with IGV you can load the mouse genome and at least one bam file from each of the treatments that you created earlier. The functionality of IGV is the same as if you look at genomic data, but there are a few of the features that are more interesting to use for RNA-seq data.
Integrated genomics viewer from Broad Institute is a nice graphical interface to view bam files and genome annotations. It also has tools to export data and some functionality to look at splicing patterns in RNA-seq data sets. Even though it allows for some basic types of analysis it should be used more as a nice way to look at your mapped data. Looking at data in this way might seem like a daunting approach as you can not check more than a few regions, but in in many cases it can reveal mapping patterns that are hard to catch with just summary statistics.
For this tutorial you can chose to run IGV directly on your own computer (follow ) or on Uppmax (follow ). If you chose to run it on your own computer you will have to download some of the bam files (and the corresponding index files) from Upppmax. If you have not yet installed IGV you also have to get a copy of the program.
Copy bam files to your local computer and run IGV
Click to see how to transfer files from Uppmax
NB! Use the sorted bam files and also copy over the .bai files Log in to Uppmax in a way so that the generated graphics are exported via the network to your screen
Method 1. Login in to Uppmax with X-forwarding enabled
Click to see how
This will allow any graphical interface that you start on your compute node to be exported to your computer. However, as the graphics is exported over the network it can be fairly slow in redrawing windows and the experience can be fairly poor.
Method 2. Go to Rackham-gui
Once you log into this interface you will have a linux desktop interface in a browser window. This interface is running on the login node so if you want to do any heavy lifting you need to login to your reserved compute node also here. This is done by opening a terminal in the running linux environment and log on to your compute node as before NB! If you have no active reservation you have to do that first.
Load necessary modules and start IGV
This should start the IGV so that it is visible on your screen. If not please try to reconnect to Uppmax or consider running IGV locally as that is often the fastest and most convinient solution.
Once we have the program running you select the genome that you would
like to load. As seen in the image below.
Note that if you are working with a genome that are not part of the available genomes in IGV, one can create genome files from within IGV. Please check the manual of IGV for more information on that.
To open your bam files click on File and chose the option “Load from file” select your bam file and make sure that you have a .bai index for that bam file in the same folder. You can repeat this and open multiple bam files in the same window, which makes it easy to compare samples. For every file you open a number of panels are opened that visualize the data in different ways. The first panel named “Coverage” summarize the coverage of basepairs in the window you have zoomed to. The second that ends with the name “Junctions” show how reads were spliced to map, eg. reads that stretch over multiple exons are split and mapped one part in one exon and the next in another exon. The third panel shows the reads as they are mapped to the genome. If one right click with the mouse on the read panel there many options to group and color reads.
To see actual reads you have to zoom in until the reads are drawn on screen. If you have a gene of interest you can also use the search box to directly go to that gene.
If you for example search for the gene “Mocs2” you should see a decent amount of reads mapping to this region. For more detailed information on the splice reads you can instead of just looking at the splice panel right click on the read panel and select “Sashimi plots” This will open a new window showing in an easy readable fashion how reads are spliced in mapping and you will also be able to see that there are differences in between what locations reads are spliced. This hence gives some indication on the isoform usage of the gene.
To try some of the features available in IGV you can try to address the following questions.
Are the reads you mapped from a stranded or unstranded library?
Pick a gene from the toplist of most significant genes from the DE analysis and search for it using the search box in IGV. Would you say that the pattern you see here confirms the gene as differentially expressed between treatments?
One can visualize all genes in a given pathway using the gene list option under “Regions” in the menu. Would you agree with what they state in the paper about certain pathways being down-regulated. If you need hints for how to proceed see Gene List tutorial at Broad.
Create publication ready plots from RNA-seq data
Creating high quality plots of RNA-seq analysis are most easily done using R. Depending on your profiency in reading r-code and using R you can in this section either just call scripts from the command lines with a set of arguments or you can open the r-script in a text editor and run the code step by step from an interactive r-session. Irrespective of the method you choose make sure you load the same R modules as before and do all steps on a compute node.
Load R module and R packages
Click to see how
Some of the example plots we generate here are based on the results from the DE analysis so perform all of these steps from within the DE folder that you created earlier.
Start by copying the scripts from the course folder to your DE directory.
Move to DE and copy R-scripts
Click to see how to copy the files to your DE folder
You should now have four files in your DE folder.
We
start off by creating similar plots to how data is visualised in IGV,
but using R means that we could add other types of information that
are not implemented in IGV.
To look at read coverage in our bam files for a gene of interest (pick one that was reported to be differentially expressed) and go to the Ensembl to identify genomic coordinates and chromosome location for this gene.
Run the script named genePlot.R
Click to see a an real example
This will generate a plot named coverage.pdf that show annotations and
read coverage for the 6 bam files we use in the analysis for
chromosome 14 from postion 31217860 to 31230350.
To view the file copy it from Uppmax to your own computer and open it in a
pdf reader.
Click to see command to copy files
Make sure you run this command from your own computerBesides this type of plot that sort of mimics what can be done in IGV, R makes it possible to visualise patterns of gene expression in many different ways. Here we will create a few different plots that is often seen and used in RNA-seq expression analysis.
MDS plot
A popular way to visualise general patterns of gene expression in your data is to produce either PCA (Principal Component Analysis) or MDS (MultiDimensional Scaling) plots. These methods aim at summarizing the main patterns of expression in the data and display them on a two-dimensional space and still retain as much ingformation as possible. To properly evaluate these kind of results is non-trivial, but in the case of RNA-seq data we often use them to get an idea of the difference in expression between treatments and also to get an idea of the similarity among replicates. If the plots shows clear clusters of samples that corresponds to treatment it is an indication of treatment actually having an effect on gene expression. If the distance between replicates from a single treatment is very large it suggests large variance within the treatment, something that will influence the detection of differentially expressed genes between treatments.
To generate a MDS plot showing that show the the two leading fold changes in gene expression between samples run the MAplot.R script as this.
Run the script named MDSplot.R
Click to see how to do this
This generates another pdf file named MAplot.pdf in the DE folder. To
view it copy it to your local disk as before.
Based on these results are you surprised that your DE analysis detected a fairly large number of significant genes?
MA-plot and Volcanoplot
One can also visualize the actual results of a differential gene expresssion analysis in many different ways. One way is by generating a MA-plot that plots the mean expression and estimated log ratios for all genes in an analysis.
To create an MA plot for your analysis you can run the script called MAplot.R from your DE folder. It will read in the results from the DE analysis that you did earlier and produce the plot saved as MA-plot.pdf.
Run the script MAplot.R
Click to see how to do this
What do you think the red dots represent?
A related type of figure will instead plot fold change (on log2 scale) on the x-axis and -log10 p-value on the y-axis. Scaling like this means that genes with lowest p-value will be found at the top of the plot. In this example we will highligt (in red) the genes that are significant at the 0.05 level after correction for multiple testing and that have an estimated fold change larger than 4 (log2 (4) = 2).
Run the script named Volcano.R
Click to see how to do this
Anything noteworthy about the patterns in the plot?
Other type of popular plots for genome-wide expression patterns is to create heatmaps for sets of genes. If you run the script called heatmap.R from the folder where your DE is saved it will extract the 50 genes that have the lowest p-value in the experiment and create a heatmap from these. In addition to colorcoding the expression levels over samples for the genes it also clusters the samples and genes based on inferred distance between them.
Run the script named heatmap.R
Click to see how to do this
Compare this plot to a similar plot in the paper behind the data.
Most of these plots can be done with a limited set of code. In many
cases these “standard” plots can be created with two to three lines of
code as the packages that has been written to handle RNA-seq
expression data often contains easy to use functions for generating
them.
Bonus exercise: transcriptome assembly
Transcriptome De Novo Assembly
Trinity
Trinity is one of several de novo transcriptome assemblers. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. This approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
Grabherr MG, Haas BW, Yassour M et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology. 2011 May 15;29(7):644-52.
Getting started
Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:
Inchworm assembles the RNA-Seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes.
A basic recommendation is to have 1G of RAM per 1M pairs of Illumina reads in order to run the Inchworm and Chrysalis steps. Simpler transcriptomes require less memory than complex transcriptomes. Butterfly requires less memory and can also be spread across multiple processors.
The entire process can require ~1 hour per million pairs of reads in the current implementation. There are various things that can be done to modify performance. Please review the guidelines in the official Trinity documentation for more advice on this topic. Typical Trinity usage is as follows:
Exercise 1: Running Trinity
In the following exercise you will have chance to run trinity on a data set that is suitable to be finished within a short lab session. Note that for many larger data sets and/or complex transcriptomes running times and memory requirements might be much larger than in this example. The actual commands to run trinity is very easy and the manual at https://github.com/trinityrnaseq/trinityrnaseq/wiki answers most questions related to running the program. The major challenge with running de-novo assembly projects is not to get the programs to run. but rather to evaluate the results after the run. In many cases a very large number of potential transcripts are generated and often try to use sequence properties to filter the initial data. In addition one often tries to compare the obtained sequences to closely related species, try to predict open reading frames to get a feeling for how the experiment has turned out. In order to get a feeling for this we will in the exercise assemble two data sets and use simple unix tool to calculate basics stats on the assembled sequences. For this to be a meaningful exercise you should not look at the hints prior to trying some commands your self. The key to get going with these types of analysis is to realize that one does not need a specialised program to collect basic summary statistics from text files (note that fasta files are simple text files of a specified structure).
Have a look at the example data used in this exercise.
The data is obtained from mouse dendritic cells (mouse_left.fasta and mouse_right.fasta and) and a whitefly (whitefly_both.fasta), and the files are located in /sw/courses/ngsintro/rnaseq/bonus/assembly
.
The mouse data is strand-specific (RF), the whitefly data is unstranded.
For strand-specific data, specify the library type.
There are four library types:
Paired reads: RF: first read (/1) of fragment pair is sequenced as anti-sense (reverse(R)), and second read (/2) is in the sense strand (forward(F)); typical of the dUTP/UDG sequencing method. FR: first read (/1) of fragment pair is sequenced as sense (forward), and second read (/2) is in the antisense strand (reverse)
Unpaired (single) reads: F: the single read is in the sense (forward) orientation R: the single read is in the antisense (reverse) orientation
By setting the –SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.
Load modules and copy data
Click to see how to do this
Check the manual of Trinity again and try to figure out what parameters and settings that are needed to start trinity on the test data. Remember to try and use all 8 cores.
Run Trinity command
Click for a complete trinity command using 8 cores
NB! -It is recommended to use fully specified paths for sequence files with Trinity. -Depending on version of Trinity used –max_memory is sometime given by the command –JM
Exercise 2: Assess the data
Explore the Trinity output file Trinity.fasta located in the trinity_out_dir/output directory (or output directory you specify). Transcripts are grouped as follows: * components: the set of all sequences that share at least one k-mer (including paralogs) * contigs: transcripts that share a number of k-mers (the set of isoforms of a gene) * sequences (isoforms and allelic variation)
Count the number of sequences in the Trinity.fasta file (hint: try using the unix commands ‘grep’ and ‘wc’)
Click to see how one can count sequences
What is the -c switch doing?
Get basic information about the assembly with TrinityStats.
- How many “genes” did Trinity assemble?
- How many transcripts?
- How large is the assembly? (nr of bases)
- What is N50?
Filter out sequences shorter than 1000 nucleotides hint: do a web search for appropriate tools. Someone else must have had the exact same problem. Count the number of sequences again.
Click to a solution
What is the fasta_formatter step doing?
Align some sequences to a protein database and assess full-lengthness using NCBI blast database. Also try to see if you can find instances of spliced genes in your data by using the UCSC genome browser (do a web search to find it)
- Select BLAT from the menu at the top of the page and paste in a mouse transcript sequence from Trinity.fasta.
- Select the mouse/mm10 genome and click “submit”.
- Click on the top scoring hit.
Examine the alignments by clicking “details” on the resulting page.
- Your sequences will be displayed in the browser.
- Enable the mouse annotations (ENSEMBL gene build, UCSC genes, human proteins etc.).
Optional: Do a new transcriptome assembly of whitefly RNAseq data using above code as help.
Closing remarks and where to go next
It is not possible to learn RNA-seq data processing and analysis in one day… The good news is that there are many available tools and well-written tutorials with examples to learn from. In this tutorial we have covered the most important data processing steps that may be enough when the libraries are good. If not, there is plenty of trouble-shooting that one can try before discarding the data. And once the count table are in place, the biostatistical and data mining begins. There are no well-defined solutions here, all depends on the experiment and questions to be asked, but we strongly advise learning R. Not only to use the specifically designed statistical packages to analyze NGS count data, but also to be able to handle the data and results as well as to generate high-quality plots. There is no better way of learning than to try…
For those interested in RNA-seq analysis Scilifelab offer a more advanced course in RNA-sequnence analysis each semester. If you also have in interest in learning R we do for the first time this year offer a one-week introduction course in R programming. For more information on both of of these courses see Courses offered by Scilifelab.
Jump to the top
More reading
- Robinson, MD, and Oshlack, A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25.
About authors
Thomas Källman , Agata Smialowska , Olga Dethlefsen @ NBIS, National Bioinformatics Infrastructure Sweden
Jump to the top