Transcriptome Mapping First

**Please read everything carefully!**

Reference-genome based: Tophat and Cufflinks

A common problem in the analysis of RNA-seq data is to relate it to a known genome sequence and use that information to study the expression of genes - within a sample or across multiple conditions, such as tissues or time points. A popular pipeline to perform such an analysis is the Tuxedo protocol, which consist of set of programs that can be used to go from mapping of short reads to reference genomes all the way to detection of differential gene expression. The two main programs included in the package are Tophat - a short read mapper and Cufflinks that performs analysis of the mapped reads. In this exercise you will learn how to use some of these tools to study gene expression differences between different human tissues.

Note: Do not simply copy the various unix commands quoted throughout this tutorial. Many include placeholders (folder names etc), so make sure you alter the command to actually reflect whatever file names you have created.

Illumina Bodymap2.0 data consists of 16 human tissues that were sequenced using both single-end and pair-end technologies. The mapped reads can be visualised at the Ensembl genome browser

In this tutorial we will due to time constraints focus on a limited set of tissues and only do the analysis for chromosome 1 of the human genome. In addition, the files we use are just subsets of the original files as the analysis of the full files undoable within the time frame of a course lab.

The main goal goal with this tutorial is to detect differential gene expression between two different tissues (pick any two tissues that you want to compare). For all included tissues there is one single-end library and one pair-end library available. In order to test for significance in gene expression more than one replicate from each tissue is needed and we will in this lab use the two different library types as replicates in the detection of differential gene expression. So if you choose to compare brain and kidney, you will have to the analysis for both single-end (ERR030890) and pair-end (ERR030882) libraries for brain and compare to single-end (ERR030893) and pair-end (ERR030885) libraries for kidney.

Below is a summary of the data and tissues available.

**Please start by testing Brain vs Kidney!**

Afterwards you can go back and test more tissues against each other.

Tophat

Tophat is a script pipeline built on-top of the popular short-read aligner Bowtie. It is used for aligning RNA-Seq reads to a reference genome and can identify exon-exon splice junctions. More specifically, it produces data that we can use to not only study the expression of genes, but also their different isoforms. You will have a bit of waiting time during the exercises as the more complex analyses are running, so please check out some of the details of tophat when waiting.

Cufflinks is a collection of programs that perform different steps in the analysis of aligned RNA-seq reads (Details). The output is usually a list of transcribed loci (primarily ‘genes’) and their expression levels within and/or between samples. For the analysis of multiple data sets, the general workflow in cufflinks consists of the following steps:

In the following, we will be going through the necessary steps to accomplish this.

Step-by-Step

  1. Prepare your data
  2. Load software
  3. Run Tophat on individual samples
  4. Run Cufflinks on individual samples
  5. Run Cuffmerge merge detected transcript over all samples
  6. Run Cuffdiff

1) Book a node

We have reserved half a node for each student during this course. By now, you are probably already familiar with the procedure:

**NB! Do this only once and make sure you do not have multiple reservations running at the same time, otherwise you will take away resources from the other course participants!**
$ salloc -A g2016001 -t 08:00:00 -p core -n 8 --no-shell --reservation=g2016001_28 &

2) Prepare your data

**Note: It is completely up to your how you organize your data - what follows below is merely a suggestion:** **Remember that you have to run the analysis on SE and PE for two tissues! Examples below shows how to do this for one tissue...**
$ cd ~/glob
$ mkdir transcriptome
$ cd transcriptome
$ mkdir results
ln -s /proj/g2016001/labs/transcriptome_map/reads/PE
ln -s /proj/g2016001/labs/transcriptome_map/reads/SE
ln -s /proj/g2016001/labs/transcriptome_map/reference

Your directory structure should look like this:

We are skipping a few steps here, namely obtaining a reference annotation and genome sequence and preparing the latter for use with the Bowtie2 aligner. We have taken care of that for you (located in the subfolder /reference). This is due to two main factors. First, it takes a lot of CPU hours to convert a genome sequence into a Bowtie index. Second, finding the latest release of a genome sequence free of unmapped fragments and haplotype data as well as a fully Tophat/Cufflinks-compatible annotation of that sequence is an exercise in frustration for beginners. One useful resource here is the FTP server of Illumina here. Finally, also note that we are providing you with the outputs of the different steps. This is to make sure that if you run into some trouble, like software crashing half-way through analysis, you can still continue with your exercises.

3) Load software

You have done this before, but here is a quick reminder:

$ module load bioinfo-tools samtools/0.1.19 bowtie2/2.2.3 tophat/2.0.12 cufflinks/2.2.1

If any of these packages does not load as expected, you can check that module names are correct using the command

$ module avail

It may be that you need to load a different version.

4) Run Tophat

What goes in and what comes out?

In:

Out:

NOTE: The /reads folder contains a small subset of an actual FASTQ, limited to a specific chromosome. For the next step, please chose two tissues that you want to analyse and make sure to align both a pair-end and a single-end library from you tissues of choice. We do this since the time needed to align an actual read file with data from all human chromosomes can take several hours. With these sub-sampled data sets, it should be possible to align them with tophat in a reasonable time frame. However, if this should still take too long (>20mins), you may wish to abort this step. You already have the corresponding output in the subfolder results/tophat/.

Tophat will take one or multiple FASTQ files and align the reads therein to a genomic reference. A common command may look like this:

$ tophat -o results/tophat_outputSE30888 --solexa-quals -p 8 --library-type=fr-unstranded reference/rm.chr.1 SE/ERR030888.fq.gz

$ tophat -o results/tophat_outputPE30880 --solexa-quals -p 8 -r 200 --mate-std-dev 90 --library-type=fr-unstranded reference/rm.chr.1 PE/ERR030880_1.fq.gz PE/ERR030880_2.fq.gz

We specify the output location (-o), the number of CPUs to use (-p), which type of sequencing library was used to produce the data (here ‘fr-unstranded’), in which format the quality information was stored (here ‘solexa’, pre 1.3), the location of the reference annotation, the location of the Bowtie2-formatted index file for the genome sequence and finally a FASTQ file. For pair-end data we have two FASTQ files and also define the expected size and variation in size of the fragments sequenced.

While this is running, you may want to head over to the tophat manual and have a look at the available options and technical details.

The aligned reads are found in the output directory you have chosen, and named accepted_hits.bam. For convenience, you may want to sym-link this file into your main project folder. This links the accepted_hits.bam in the folder results/tophat_outputSE30888 to results/SE30888.bam making it easier to refer to the correct bam file in later steps.

$ cd results
$ ln -s tophat_outputSE30888/accepted_hits.bam SE30888.bam
$ ln -s tophat_outputPE30880/accepted_hits.bam PE30880.bam
$ cd ..

What goes in and what comes out?

In: A read alignment in BAM format (SAM is also an option, but should not be used due to it being uncrompressed)

Out: A number of files, including a transcriptome annotation reconstructed from the read distribution

Please adjust the name of the bam file as well as the output folder name to the appropriate names for your analysis. **Note: Remember the path to the output folder that you choose below, you will need it later!!

** Run cufflinks on all tophat results

General command format:

$ cufflinks -o my_output_folder -p 8 -g reference/Homo_sapiens.GRCh38_Chr1.77.gtf my_infile.bam

Note: The Cufflinks step can take a while - so this is now a good time to get a coffee, or read through the available documentation on the Cufflinks website. While we have tried to cover the technical details of these tools to some degree in the lecture, there are a lot of details that will help you use these programs to their greatest effect.

Here we specify where to store the output, how many CPUs to use as well as where to find the reference files (genome sequence and annotation). Depending on the size of the BAM/SAM file, this step may require several hours to complete. To make this analysis feasible within the time limits of the course, we have created the chromosome-limited files you have been using.

The command line output will read something like:

> Processed 48858 loci. [* * * * *] 100%

> Map Properties:

> Total Map Mass: 1893657.20

> Fragment Length Distribution: Truncated Gaussian (default)

> Default Mean: 200

> Default Std Dev: 80

One important thing that can be noted here:

genes.fpkm_tracking

isoforms.fpkm_tracking

skipped.gtf

transcripts.gtf

The first two files contain basic information about expressed genes and transcripts, respectively - those known from the annotation file as well as novel loci identified by cufflinks -and the strength of their expression, given in FPKM. FPKM stands for ‘Fragments Per Kilobase of exon per Million fragments mapped’ and is a normalized measure of transcript abundance. That is the short explanation. The longer version for the more mathematically inclined among us can be found at Mortazavi et al. 2008.

These output files are tab-delimited and can e.g. be opened in e.g. Microsoft Excel (or similiar) to be analyzed and/or visualized.

6) Cuffmerge: Reconciling different transcript models

What goes in, what comes out:

In: An optional reference annotation and a list of transcript annotations to merge

Out: A consensus annotation, taking into account all input annotations

It is important to keep in mind that reference annotations are very likely incomplete. This is because some genes or individual exons may be expressed at very low levels or under specific conditions, thus having evaded prior detection. Moreover, many vertebrate genomes have only been annotated by reference to other genomes, which themselves may only be poorly characterized. Using the expression data obtained through cufflinks may hence allow us to improve existing annotations. Cuffmerge is a tool that takes cufflinks-derived annotation files (known & ‘novel’ loci) and reconciles them into a consensus annotation, discarding e.g. spuriously transcribed loci and merging overlapping loci into larger transcription units where possible.

Again, the commands below are just examples, your files and folder will be called differently.

$ cd ~/glob
$ mkdir cuffmerge
$ cd cuffmerge
$ ln -s ../cufflinks.brainSE/transcripts.gtf brainSE.gtf
$ ln -s ../cufflinks.brainPE/transcripts.gtf brainPE.gtf

Note: If this didn’t work (check that the linked files actually exist and have content), then you probably chose a different way of organizing your folders and will have to figure out where the cufflink files your generated earlier are located ;)

Now that we have both transcript model files in one location, we can attempt to merge them. For this, we first have to create a text file that contains a list of GTF files to merge (quite inconvenient, I know). Use whichever tool you feel comfortable with and write the name of each gtf file line by line, then save it as transcripts.txt. If you would like to be fancy, you could try to use unix commands to do this (tip: use ls and “>”)

$ cuffmerge -o merged -g reference/Homo_sapiens.GRCh38_Chr1.77.gtf -p 8 -s reference/rm.chr.1.fa transcripts.txt

This will save the reconciled annotation file as merged/merged.gtf. Symlink this file into your main project folder.

$ cd ~/glob/transcriptome/
$ ln -s cuffmerge/merged/merged.gtf

Now we are ready to check for differential expression in our read data from chromosome 1.

7) Cuffdiff: Differential expression analysis

What goes in, what comes out:

In: A consensus annotation (or just the reference annotation), read alignments for all samples that are to be compared and quantified

Out: A number of files, including tests for differential expression for all pairwise comparisons (gene_exp.diff)

Cuffdiff takes aligned reads from two or more samples, estimates comparable expression values and performs statistical analysis of the resulting data to determine which genes exhibit significantly different activity profiles between any two samples. The nitty-gritty details of the underlying mathematics can be found here: cuffdiff.

For running Cuffdiff, we type something like this (being in the main directory of our project):

$ cuffdiff -o cuffdiff.brain_vs_kidney -L brain,kidney -p 8 -u merged.gtf brainPE.bam,brainSE.bam kidneyPE.bam,kidneySE.bam

Adopt this to your data - if uncertain, run cuffdiff -h to learn more about the options. Note that the labels you give them are arbitrarily chosen, pick names that make sense to you.

This will write the output of the analysis into the subfolder cuffdiff.brain_vs_kidney (or whatever folder name you chose) and do a pairwise comparison of the samples (Note: the order in which you list the labels needs to be the same as the order of SAM/BAM files!). NB! To reduce the computing time we do not use the flag -b that correct for genome sequence (see manual for details), but we resolve issues arising from reads mapping to multiple loci in the genome using the flag -u.

The main file of interest to us is gene_exp.diff. It includes the analysis of differential expression. A quick way to find the cases of interest is to filter the file for genes that show evidence of differential expression (identified by the tag ‘yes’ in the ‘significant’ column).

$ head -n1 gene_exp.diff > results.txt
$ grep yes gene_exp.diff >> results.txt

(This copies the header of the output file as well as all rows tagged as significant into a new text file - open this file in a text editor or spread sheet program).

Where to go next - Visualize the results

So now you have analyzed the expression of genes between two samples. However, usually the work does not end here. For example, you may want to perform a thorough analysis of your output, visualize distributions and obtain statictics. This can, for example, be done either through clever scripting in R, or by use of a recently developed software suite called CummeRbund. cummeRbund reads the native output from Cuffdiff, parses it into a database and provide ample options for in-depth analysis of the data.

For this part you need graphics, so you need to log in to Uppmax using thinlinc (as you have done before) or by using the -Y parameter when using ssh.

$ module load RStudio
$ rstudio /proj/g2016001/labs/transcriptome_cummeRbund/cummeRbund_course_code.R

Installing cummeRbund in R or Rstudio takes a bit of time, so a good idea is to start the installation before taking a break. In the R Studio environment, you need to make two changes in the R script.

  1. Change the working directory to your own transcriptome directory
  2. Change the cufflinks directory to the one you generated earlier.

Optional: Using the EnsEMBL accession numbers of the significant genes, you can go to http://www.ensembl.org to retrieve information on the function of these genes and see whether you can draw any conclusions as to why these genes would be differentially expressed in the two tissues.

As an alternative to CummeRbund I have prepared a simple, dependency-free R script that you may run from the command line to summarize the Cuffdiff results and create a Volcano plot.

$ cd ~glob/transcriptome
$ cp /proj/g2016001/extra/summarizeDE.R ~glob/transcriptome
$ Rscript summarizeDE.R ~glob/transcriptome/cuffdiff.brain_vs_kidney/gene_exp.diff 0.1 1

where,

‘0.1’ is the FDR threshold and ‘1’ is the log2FC

The script should print the number of differentially expressed, down- and up-regulated genes given the FDR and log2FC threshold. It also prints the entries for the top 10 down-up and up-regulated genes. Finally, it creates a .pdf file with a Volcano plot (google Volcano plot for explanation if not clear)

To open the Volcano plot from Uppmax

$ firefox Volcano_0.1_1.pdf 

You can experiment with summarizing your results using different cut-offs for FDR and log2FC. Does the number of differentially expressed genes change? Does the Volcano change?

After obtaining the differential expression results, it it common to transfer the files to the local computer. One can get far by just looking at the results in e.g. Excel, running R scripts or using online tools.

Closing remarks

This tutorial has introduced you to a very straight-forward, but somewhat simplified pipeline for the analysis of RNA-seq data by use of a reference genome to study transcription. Both Cufflinks and Tophat come with additional parameters that we have not touched upon to avoid unnecessary confusion. Likewise, the read data we have used was strand-unspecific. This has some drawbacks, specifically with respect to accuracy in the isoform analysis. Or perhaps you are not interested in comparing expression between pairs of samples but in a time series. For this reason as well as others, you may need to adjust one or several parameters to get the best results - depending on the nature of your data. We therefore highly recommend you to carefully read the manuals (and the original publications) so as to familiarize yourself with these additional options.

Other alternatives

There are several different tools available that have the ability to do many of the steps described above with the Tuxedo pipeline. Here are a few options for the different steps

Short read mappers that are suitable for RNA-seq data

Counting reads from mapped data (bam files) and a annotation file (GTF/GFF file)

Mapping & counting RNASeq

Detect differential gene expression

Most software suitable for detection of differential gene expression are developed in R and available as add-on packages to R at bioconductor