Mapping reads to a reference and converting the results to BAM format

Data available for exercise

You can carry out this exercise using the RNA-seq data we provide or your own data, if you have any.

In order to make the steps run quickly during the lab, we have extracted only those RNA-seq reads that mapped to one gene (RAB11FIP5) that we will examine in more detail in later exercises.

All FASTQ files that you will need for this exercise can be found in

/proj/b2013006/webexport/downloads/courses/RNAseqWorkshop/isoform/referenceBased/data

on UPPMAX and through this URL.

If you want to map more files for practice, you can continue with files for additional samples, found in

/proj/b2013006/webexport/downloads/courses/RNAseqWorkshop/isoform/RAB11FIP5_fastqFiles

on UPPMAX and through this URL.

A pre-built human genome index for HISAT2 is found here

/proj/b2013006/webexport/downloads/courses/RNAseqWorkshop/reference/hg19_hisat2

on UPPMAX and through this URL. Note that hg19 indicates the version of the human genome assembly that was indexed.

A pre-built human genome index for STAR is found here

/proj/b2013006/webexport/downloads/courses/RNAseqWorkshop/reference/hg19_Gencode14.overhang75

on UPPMAX and through this URL.

Mapping short reads to a reference using HISAT2

Here, you will map the reads to the hg19 reference genome using the RNA-seq aligner HISAT2. Note that if you are using your own non-human data, you need to use a reference genome for the corresponding species.

There are many features that can be tweaked using HISAT2. For more information on all flags that can be used go here.

Read below for the flags we use for this exercise. Remember to change filenames accordingly so that you can run your program properly and know which files you have used.

To load the HISAT2 module on UPPMAX, execute:

 module load bioinfo-tools      # This is to get access to all bioinformatics tools available on UPPMAX
 module load HISAT2/2.1.0       # The specific mapping program

Now you can map the reads from one of the samples (or several; it’s up to you which ones) using a command such as the one below.

mkdir outDir

hisat2 -p N --dta-cufflinks -x path/to/index/fileName \
  -1 path/to/reads/sample_1.fastq -2 path/to/reads/sample_2.fastq \
  -S outDir/hisat2.sam --summary-file outDir/hisat2_summary.txt

The flags used are:

This should run fairly quickly and create the files you specified with -S and --summary-file.

If everything worked, HISAT2 should report some statistics about how many reads were mapped, on your terminal and in the summary file.

Try to answer the following:

To answer these questions, you should look at the input to and output from HISAT2. You may also need to consult the HISAT2 manual, information about the FASTQ format and the SAM format specification.

Mapping short reads to a reference using STAR

Here we will map the reads to the hg19 reference genome using a popular RNA-seq aligner, STAR. There are many many features that can be tweaked using STAR. For more information concerning different features that can be used see the manual.

Read below for the flags we use for this exercise. Remember to change filenames accordingly so that you can run your program properly and know which files you have used.

To load the STAR module on UPPMAX, execute

 module load bioinfo-tools
 module load star

Now you can map the reads from one of the samples (or several; it’s up to you which ones) using a command such as the one below.

mkdir outDir

STAR  --runThreadN N --outSAMstrandField intronMotif --genomeDir /path/to/index \
  --readFilesIn /path/to/reads/sample_1.fastq /path/to/reads/sample_2.fastq \
  --outFileNamePrefix outDir/

flags used are

This should run fairly quickly and put a file called Aligned.out.sam in the directory that you specified with --outFileNamePrefix.

Look at the output files from STAR, and try to answer the same questions as for HISAT2 above.

Converting SAM files to BAM files

If you were able to run HISAT2 and STAR sucessfully, this should have produced files with mapped reads in SAM format. These files need to be converted to sorted and indexed BAM files for efficient downstream analysis.

You should try to give the BAM files representable names, in order to make it easier to manage your files. A good naming scheme for BAM files is to use names that indicate what you mapped and how. As an example, if you mapped sample 12 using HISAT2 you could create a file named sample12_RAB11FIP5.hg19.HISAT2.bam.

The most commonly used tool for converting from SAM to BAM is Samtools (follow the link for more information about Samtools).

To load the Samtools module on UPPMAX, execute:

module load bioinfo-tools
module load samtools

The Samtools command to convert from SAM to a sorted BAM file is:

samtools sort -o output.bam input.sam

Remember to use an appropriate filename instead of output.bam!

Next, we need to index the BAM file.

samtools index properName.bam

The indexing step should create an index file with the suffix .bai.

You can also get a report on your mapped reads using the samtools command flagstat:

samtools flagstat properName.sorted.bam > properName.sorted.flagstat

Since the BAM file contains all the information from the original SAM file, remember to remove the SAM file once you are finished, in order to free up disk space.

The sorted, indexed bam file can be viewed in the Integrative Genomics Viewer (IGV). Instructions are here.

Try to answer the following:

When you are done…

…compare your answers to the solutions here.