18 May 2017

Outline

  1. Why sequence the transcriptome

  2. Overview of RNA-seq (from RNA to sequence)

  3. RNA-seq analysis (from sequence to RNA and expression)

  4. Quality control (QC) of sequence data

  5. Reference based analysis

  6. De-novo assembly of transcriptomes

  7. Gene/Transcript expression analysis

  8. Differential gene expression

NGS technologies

  • Roche 454 (seq by synthesis)
  • Illumina HiSeq (seq by synthesis)
  • Illumina MiSeq (seq by synthesis)
  • Life Technologies SOLiD (seq by ligation)
  • PacBio SMRT cell (seq by synthesis)
  • Complete Genomics (only human)
  • Life Technologies IonTorrent
  • Oxford Nanopore
  • etc

Why sequence the transcriptome?

Applications for RNA-seq

  • Identify gene sequences in genomes
  • Investigate patterns of gene expression (differential expression analysis)
  • Study isoforms and allelic expression
  • Learn about gene functions
  • Identify co-expressed gene networks
  • Identify SNPs in genes
  • etc

What makes RNA-seq different to genome sequencing

  • Dynamic expression, which means that each tissue, time-point and cell will be different
  • Protein coding genes, non-coding RNA-seq, etc
  • Smaller sequence space as it is a subset of the complete genome

Overview of RNA-seq

- from RNA to sequence

What is in the sequences

Exome sequencing
poly-A selection
rRNA depletion

From RNA to sequence

High-level RNA-seq workflow

  • Experimental design (biology, medicine, statistics)
  • RNA Extraction (biology, biotechnology)
  • Library preparation (biology, biotechnology)
  • Sequencing (engineering, chemistry, physics, biotechnology and bioinformatics)
  • Data processing (bioinformatics)
  • Data analysis (bioinformatics, biostatistics, biology, medicine)

Type of data

  • The actual sequence data is the same as for genome sequence data
  • Both single- and pair-end data
  • Major difference is that transcripts often comes from one strand and most data retain this structure eg. the sequence data is from only one strand.

RNA-seq workflow

QC of sequence files

Quality control of fastq files (code)

fls <- dir(dataFolder, "SRR3222409_1.*", full=TRUE)
qaSummary <- ShortRead:::qa(fls, type="fastq")
perCycle <- qaSummary[["perCycle"]]

To make the next two slides then simply use the commands:

ShortRead:::.plotCycleBaseCall(perCycle$baseCall)
ShortRead:::.plotCycleQuality(perCycle$quality)

Quality control of fastq files (1)

Quality control of fastq files (2)

Quality values

The FASTQ format consists of 4 lines per sequence:

  1. identifier,2) raw sequence, 3) optionally raw sequence and description again, and 4) encoding the quality value
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The character '!' represents the lowest quality while '~' is the highest (ASCII).

The quality value Q is an integer mapping of p, the probability that the corresponding base call is incorrect.

Phred quality score:
Q = -log10(p)

Filter and trim fastq data

Based on the QC results there are at times a need to filter "bad" data. For example:

  • filter reads based on quality score, e.g. with avg. quality below 20 within 4-basepair wide sliding windows
  • filter reads that after this are shorter than a given cut-off
  • remove unwanted sequences, e.g. adapter sequences or contaminants
  • trimming can rescue coverage and reduce noise

  • decide on deduplication

Tools for QC and trimming/filtering of fastq data

Reference based analysis

The required files

  • Read data (fastq files)
    From the sequencing machines

  • Reference genome (fasta file)
    The genome sequence of the species.

  • Reference annotation (GTF/GFF files)
    Where the genes located on the genome (gene transfer format)

Mapping reads to genomes

Mapping reads to genomes - complete alignment

Mapping reads to genomes - splice sites

Mapping reads

Lots of different aligners exists which are based on different algorithms, usually there is a tradeoff between speed versus accuracy and sensitivity.

  • Brute force comparison
  • Smith-Waterman
  • Suffix tree
  • Burrows-Wheeler Transform

In general the "largest difference" is on the default settings, as most mappers will allow you to overrule these.

Aligners

STAR: ultrafast universal RNA-seq aligner (bioinformatics 2013)
Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure

Fast and accurate short read alignment with Burrows-Wheeler transform (bioinformatics 2009)
backward search with Burrows-Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

Hisat2
graph-based alignment of next generation sequencing reads to a population of genomes (Hierarchical Graph FM index)

TopHat: discovering splice junctions with RNA-Seq (bioinformatics 2009)
TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

The BAM/SAM format

SAM = Sequence Alignment/Map format

You want them stored as BAM files (!). To look at a BAM file, use

samtools view name_of_file
samtools view name_of_file | head -1

Look at the mapped reads (code)

library(Gviz);
edb <- EnsDb.Mmusculus.v79
gen <- "mm10"
bamFiles <- list.files(paste0(dataFolder,"bams/out"), pattern = "*sorted.bam$", full.names = TRUE, recursive = TRUE)
options(ucscChromosomeNames=FALSE)
gat <- GenomeAxisTrack()
gr <- getGeneRegionTrackForGviz(edb, chromosome = 2, start = 66673425, end = 66784914, featureIs = "tx_biotype")
genome(gr) <- "mm10"
aTrack2 <- AnnotationTrack(range = bamFiles[4], genome = gen, name = "Reads", chromosome = 2)

toPlot <- append(list(gat, GeneRegionTrack(gr)), aTrack2)

To make the next slide simply use the command:

plotTracks(toPlot, from = 66670000, to = 66700000, sizes = c(10,10,200))

Look at the mapped reads

QC of mapped data

Group Total bases Tag counts Tags/Kb
CDS Exons 33302033 20002271 600.63
5'UTR 21717577 4408991 203.01
3'UTR 15347845 3643326 237.38
Introns 1132597354 6325392 5.58
TSS up 1kb 17957047 215331 11.99
TES dn 1kb 18298543 266161 14.55

QC of mapped data

De-novo assembly of transriptomes

Why?

  • No reference genome available
  • Identify novel transcripts (genes or splice variants)
  • Identify fusion genes or transcription from unknown origin

Can be likened to a 3D jigsaw puzzle of reads

Challenges include sequence errors, repeats, polyploidy, GC content/complexity, large amount of data, and contaminations.

How?

  • Assemble the transriptome from the generated short reads
  • All useful methods rely on generating a structure called a de-bruijn graph that in this context aims at extenting k-mers in the short reads to longer unambiguous sequences corresponding to transcripts

From short reads to isoforms

Yellow bases =
overlap start

Trinity

Trinity --seqType fq --left reads_1.fq --right reads_2.fq --CPU 6 --max_memory 20G 

Other possibilities

QC of de-novo assemblies

  • How large fraction of reads can be mapped back to the assembly
  • Total sequence length of all assembled sequences
  • Number of trancripts generated
  • Compare to related species with genome information
  • Predict open reading frames and protein domains
  • etc

GAGE: a critical evaluation of genome assemblies and assembly algorithms (Genome Res 2012)

Gene/Transcript expression analysis

From sequence to expression

  • Read counts is a measure of expression level
  • Many reads mapping to a genomic feature = high expression
  • The expression can be quantified on any level eg. gene, transcript (isoforms), or exons.

Counting reads

Reference based approach

In the last example multi-mapping reads were disregarded, would it be a difference if instead of gene_A and gene_B it said transcript_A and transcript_B?

Quantifying abundance

Near-optimal probabilistic RNA-seq quantification
(Nature biotec 2016) two orders of magnitude faster than previous approaches and achieves similar accuracy. Kallisto pseudoaligns reads to a reference, producing a list of transcripts that are compatible with each read while avoiding alignment of individual bases.

HTSeq a Python framework to work with high-throughput sequencing data
(Bioinformatics 2015) once a project deviates from standard workflows, custom scripts are needed. We present HTSeq a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
(Nature biotec 2010) aka Cufflinks algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
(BMC bioinf 2011) A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms does not rely on the existence of a reference genome

Counting reads

de-novo assembly

  • Map reads to the assembled transcriptome
  • Count reads mapping to the different transcripts that have been assembled.
  • It is not as easy as it sound as transcripts can be very redundant and a read might map to multiple transcripts equally well.
  • Tools available RSEM, (kallisto and salmon)

Count matrix


Note the large fraction of 0

Counts to expression

  • The counts depend on both the number of sequences generated from a samples and on the sequence length of the gene/transcript
  • Counts per million (CPM) is used to take out the effect of sequence depth
  • Reads per kilobasepair and million (RPKM, FPKM) also takes the lengh of genes into account
  • Several other normalisations are available and is used to optimize power in detection of differential expression

  • When should a gene be considered "present"/"expressed"?

Normalisation

Normalisation

Normalisation

Reads per KiloBase Million (RPKM)

  • Count up the total reads in a sample and divide that number by 1,000,000, per million scaling factor.
  • Divide the read counts by the per million scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  • Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

Fragments per KiloBase Million (FPKM)

  • The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn't count this fragment twice).

Transcripts per KiloBase Million (TPM)

  • Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
  • Count up all the RPK values in a sample and divide this number by 1,000,000. This is your per million scaling factor.
  • Divide the RPK values by the per million scaling factor. This gives you TPM.

"Biological QC" - PCA

The "purpose" of a PCA is to determine what is the best way to re-express X?

 

"Biological QC" - PCA

The "purpose" of a PCA is to determine what is the best way to re-express X?

 

"Biological QC" - PCA

The "purpose" of a PCA is to determine what is the best way to re-express X?

"Biological QC" - PCA

Scaling does not matter (top), whereas log2 transformation of the data results in the same conclusion but different picture (bottom).
PCA tutorial

"Biological QC" - single gene analysis

Would you be happy with the below results?
  SumFPKM = 357 941 648

"Biological QC" - single gene analysis

Would you be happy with the below results?
SumFPKM = 357 941 648

The sum of the FPKM for the 4.572 snoRNAs, yRNAs, miRs, rna5, rna2, scaRNA is 357 331 046, which is 99.8% of the total

Differential gene expression

  • Read counts (and CPM etc) does not follow a normal distribution
  • In most experiments few replicates are used, limited power unless we use model based analysis for detection of DE
  • It is recommended to use softwares designed specifically for RNA-seq DE analysis
  • Some useful R Bioconductor tools

Differential gene expression

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data (Bioinformatics 2010)
An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated


Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 (Genome Biol 2014)
small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression


CummeRbund
takes the various output files from a cuffdiff run and creates a SQLite database of the results describing appropriate relationships betweeen genes, transcripts, transcription start sites, and CDS regions. Once stored and indexed, data for these features, even across multiple samples or conditions, can be retrieved very efficiently numerous plotting functions as well for commonly used visualizations

Differential gene expression (code)

countFiles = list.files(paste0(dataFolder,"countFiles"), pattern=".txt$", full.names=TRUE); # assume count files are .txt files
gffFile=paste0(dataFolder,"Mus_musculus.GRCm38.85.chr11.flat.dexseq.gff")

read in the information on the experiment - Note the order of the samples must be the same
exp=read.table(paste0(dataFolder,"experiment.txt"), sep="\t", header=T)

read in the data and create the DEXSeq object - the model to test is specified at this step
dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=exp,
    design= ~ sample + exon + condition:exon, flattenedfile=gffFile )

dxd = estimateSizeFactors( dxd )   # normalise to sequencing depth

dispersion estimation to distinguish technical and biological variation (noise) from real effects on exon usage due to the different conditions
dxd = estimateDispersions( dxd )  # takes time!!!
  
dxd = testForDEU( dxd )           # testing for differential exon usage
dxd = estimateExonFoldChanges( dxd, fitExpToVar="condition")

dxr = DEXSeqResults( dxd )        # create the DEXSeq object

Dispersion plot

Exon use

Summary

  • Many different expertises needed for RNA-seq experiments
  • Think ahead, plan wisely, ask for help
  • If your experimental design is wrong, nothing will help
  • Assess and try to improve the quality of raw reads use QC tools and talk to sequencing centre
  • Remove low quality samples
  • Dont be afraid of trying multiple approaches
  • These is a reason why there is no consensus in the field which programs are the best for the different steps…

  • Check the validity of your samples on both technical and biological level

Summary

Reference based

  • Map to genome and use available annotations
  • Count reads and check that the majority map according to expectations
  • Note this it is rarely as simple as it is made out here(!)
  • Does count summaries make sense?
  • Use appropriate tools and software for detection of differentially expressed (DE) genes
  • Note that this could depend on your experimental setup

Summary

De-novo

  • Use reference genome to guide it, if available
  • Spend lots of time in assessing the results e.g. by comparing related species, looking at ORFs etc
  • Consider merging with other data sources
  • Try multiple assemblers

Gene expresssion

  • Ensure that your experimental design allows addressing the question of interest.
  • More replicates translates into more power for differential gene expression and easier publication process

Exercises for today and tomorrow

Main exercise

Reference based expression analysis

  • checking the quality of the raw reads with FastQC
  • map reads to the reference genome using STAR
  • converting between SAM and BAM format using Samtools
  • assess the post-alignment read quality using QualiMap
  • count reads overlapping with gene regions using featureCounts
  • Identify differentially expressed genes using a prepared r-script relying on edgeR

Bonus exercises

Only if times allow

  • functional annotation
  • putting DE genes in the biological context
  • exon usage, studying the alternative splicing
  • data visualisation and graphics
  • de novo transcriptome assembly

Tips and Trixs

You will re-use the same commands over and over again

Avoid running things directly from the command line as this is not reproducible!
- for command line tools at the very least "log" in a bash script
FILES=*bam
for currentFile in $FILES
do
  echo "loop around many files in the same folder at the same time"
  echo "the name of the current file can be found via $currentFile"
done

Manuals for some of the tools