RNA-seq data processing and analyses

Linköping, May 2018

Outline

Why sequence the transcriptome
Overview of RNA-seq (From RNA to sequence)
RNA-seq analysis (From sequence to RNA and expression)
- Quality control of sequence data
- Mapping based approach
- Transcriptome assembly
- Estimate gene expression

1. Why sequence the transcriptome

Applications for RNA-seq

Identify gene sequences in genomes
Investigate patterns of gene expression
Study isoforms and allelic expression
Learn about gene functions
Identify co-expressed gene networks
Identify SNPs in genes

What makes RNA-seq different to genome sequencing

Dynamic expression, which means that each tissue, time-point and cell will be different
Most data comes from "functional units" eg. protein coding genes
Smaller sequence space as it is a subset of the complete genome

2. Overview of RNA-seq

- from RNA to sequence

What is in the sequences

From RNA to sequence

High-level RNA-seq workflow

Experimental design (biology, medicine, statistics)
RNA Extraction (biology, biotechnology)
Library preparation (biology, biotechnology)
Sequencing (engineering, chemistry, physics, biotechnology and bioinformatics)
Data processing (bioinformatics)
Data analysis (bioinformatics, biostatistics, biology, medicine)

Type of data

The actual sequence data is the same as for genome sequence data
Both single- and pair-end data
Major difference is that transcripts often comes from one strand and most data retain this structure eg. the sequence data is from only one strand.

RNA-seq workflow

QC of sequence files

Quality control of fastq files

Filter and trim fastq data

Based on the QC results there are at times a need to filter "bad" data. For example:

filter reads based on quality score, e.g. with avg. quality below 20 within 4-basepair wide sliding windows
filter reads that after this are shorter than a given cut-off
remove unwanted sequences, e.g. adapter sequences or contaminants

Tools for QC and trimming/filtering of fastq data

fastQC (QC)
Prinseq (QC and filtering)
FastX-toolkit (QC and trimming/filtering)
R, via ShortRead package (QC and trimming/filtering)
Trimmomatic (Trimming/filtering)
Cutadapt (Trimming/filtering)

Reference based expression analysis

What do need for this

Read data (fastq files)
From the sequencing machines
Reference genome (fasta file)
The genome sequence of the species.
Reference annotation (GTF/GFF files)
Where the genes located on the genome

Mapping reads to genomes

Mapping reads to genomes

Mapping reads to genomes

Mapping reads

QC of mapped data

Group	Total bases	Tag counts	Tags/Kb
CDS Exons	33302033	20002271	600.63
5'UTR	21717577	4408991	203.01
3'UTR	15347845	3643326	237.38
Introns	1132597354	6325392	5.58
TSS up 1kb	17957047	215331	11.99
TES dn 1kb	18298543	266161	14.55

QC of mapped data

Available tools: Qualimap, RseQC, Picard, RNA-SeQC

De-novo assembly of transcriptomes

Why?

No reference genome available
Identify novel transcripts (genes or splice variants)
Identify fusion genes or transcription from unknown origin

How?

Assemble the transcriptome from the generated short reads
All useful methods rely on generating a structure called a de-bruijn graph that in this context aims at extenting k-mers in the short reads to longer unambiguous sequences corresponding to transcripts

From short reads to isoforms

Trinity

Other useful tools

SOAP-denovo
TRANS Oases
Trans-ABYSS

QC of de-novo assemblies

How large fractino of reads can be mapped back to the assembly
Total sequence length of all assembled sequences
Number of trancripts generated
Compare to related species with genome information
Predict open reading frames and protein domains
etc etc

RNA-seq for expression analysis

From sequence to expression

Read counts is a measure of expression level
Many reads mapping to a genomic feature = high expression
The expression can be quantified on any level eg. gene, transcript (isoforms), or exons.

Counting reads

Reference based approach

Counting reads

de-novo assembly

Map reads to the assembled transcriptome
Count reads mapping to the different transcripts that have been assembled.
It is not as easy as it sound as transcripts can be very redundant and a read might map to multiple transcripts equally well.
Tools available RSEM, (kallisto and salmon)

Count matrix

Counts to expression

The counts depend on both the number of sequences generated from a samples and on the sequence length of the gene/transcript
Counts per million (CPM) is used to take out the effect of sequence depth
Reads per kilobasepair and million (RPKM, FPKM) also takes the lengh of genes into account
Several other normalisations are available and is used to optimize power in detection of differential expression

Detection of differential gene expression

Read counts (and CPM etc) does not follow a normal distribution
In most experiments few replicates are used, limited power unless we use model based analysis for detection of DE
It is hence recommended to use softwares designed specifically for RNA-seq DE analysis
Some useful tools: Cuffdiff (avoid, use Ballgown), edgeR, DESeq2, limma

Exon use

Summary

Many different expertises needed for RNA-seq experiments
Think ahead, plan wisely, ask for help
If your experimental design is wrong nothing will help
Assess and try to improve the quality of raw reads use QC tools and talk to sequencing centre

Summary

Reference based

Map to genome and use available annotations
Count reads and check that the majority map according to expectiations
Does count summaries make sense?
Use appropriate tools and software for detection of DE

Summary

De-novo

Use reference genome to guide it, if available
Spend lots of time in assessing the results e.g. by comparing related species, looking at ORFs etc
consider merging with other data sources
Try multiple assemblers

Gene expresssion

Ensure that your experimental design allows addressing the question of interest. More replicates translates into more power for differential gene expression and easier publication process

Exercises for today and tomorrow

Main exercise

Reference based expression analysis

checking the quality of the raw reads with FastQC
map reads to the reference genome using STAR
converting between SAM and BAM format using Samtools
assess the post-alignment read quality using QualiMap
count reads overlapping with gene regions using featureCounts
Identify differentially expressed genes using a prepared r-script relying on edgeR

Bonus exercises

Only if times allow

functional annotation, putting DE genes in the biological context
exon usage, studying the alternative splicing
data visualisation and graphics
de novo transcriptome assembly

Outline

1. Why sequence the transcriptome

Applications for RNA-seq

What makes RNA-seq different to genome sequencing

2. Overview of RNA-seq

- from RNA to sequence

What is in the sequences

From RNA to sequence

High-level RNA-seq workflow

Type of data

RNA-seq workflow

QC of sequence files

Quality control of fastq files

Quality control of fastq files

Filter and trim fastq data

Tools for QC and trimming/filtering of fastq data

Reference based expression analysis

What do need for this

Mapping reads to genomes

Mapping reads to genomes

Mapping reads to genomes

Mapping reads

QC of mapped data

QC of mapped data

De-novo assembly of transcriptomes

Why?

How?

From short reads to isoforms

Trinity

Other useful tools

QC of de-novo assemblies

RNA-seq for expression analysis

From sequence to expression

Counting reads

Reference based approach

Counting reads

de-novo assembly

Count matrix

Counts to expression

Detection of differential gene expression

Exon use

Summary

Summary

Reference based

Summary

De-novo

Gene expresssion

Exercises for today and tomorrow

Main exercise

Reference based expression analysis￼

Bonus exercises

Only if times allow

Reference based expression analysis