Exercise: Illumina Assembly
In this exercise you will assemble genomes de novo using commonly used assembly software. You will work with Illumina data of Rhodobacter sphaerioides, data that was used in the GAGE-B comparison of assemblers. All settings used for the different programs are the ones used by the GAGE-B project. Due to time-restrictions we are forced to stick to assemblers that run quickly, but if you have an assembly project of your own you are encouraged to try other assemblers too. Remember that no assembler is best for all projects; if possible you need to try several. One of the assemblers, MaSuRCA, will take a bit longer to run (about an hour) so be prepared to start this to run over lunch!
Setup
Make a copy of the folder which includes the read data using cp -vr /sw/courses/assembly/illumina_assembly/data .
Rhodobacter HiSeq data
Use the files Rhodo_Hiseq_trimmed_read{1,2}.fastq
for the assemblies.
Spades
In your working directory make a folder called spades, and load the spades module:
mkdir spades
module load bioinfo-tools
module load spades/3.10.1
Spades is very easy to run in the basic configuration. Usually you want to run with the “–careful” flag, but this will take too long for this exercise. It will take around 16 mins anyway, grab a coffee.
spades.py -t 8 --pe1-1 ../data/Rhodo_Hiseq_trimmed_read1.fastq --pe1-2 ../data/Rhodo_Hiseq_trimmed_read2.fastq -o Spades_Rhodobacter
In Spades_Rhodobacter you will now have a number of files, including contigs.fasta and scaffolds.fasta.
Take a look at the files using less
. Can you see any regions where contigs have been scaffolded together?
Abyss
Make a new folder for the Abyss assembly and load the necessary modules:
mkdir abyss
module load abyss/2.0.2
module load bowtie
Start Abyss using the abyss-pe script:
abyss-pe k=31 l=1 n=5 s=100 np=8 name=asm lib='reads' reads='../data/Rhodo_Hiseq_trimmed_read1.fastq ../data/Rhodo_Hiseq_trimmed_read2.fastq' aligner=bowtie
Once done you will have two files called asm-contigs.fa and asm.scaffolds.fa.
SOAP denovo
Soap de novo is one of very few program that can assemble large genomes using high coverage Illumina data.
Make a directory for the Soap de novo assembly and load the module:
mkdir soapdenovo
module load soapdenovo/2.04-r240
SoapDeNovo requires a config file that you need to create yourself using a text editor like nano:
nano soap.config
Enter this information:
[LIB]
avg_ins=220
reverse_seq=0
asm_flags=3
rank=1
q1=../data/Rhodo_Hiseq_trimmed_read1.fastq
q2=../data/Rhodo_Hiseq_trimmed_read2.fastq
Exit and save the file by ctrl-x (if using nano) and answer yes when asked to save.
Start SoapDeNovo by:
SOAPdenovo-63mer all -s soap.config -o asm -F -R -E -w -u -K 55 -p 8 | tee SOAPdenovo.log
# tee saves the output to a file as well as printing to screen
Examine the contigs properties.
SoapDeNovo also comes with a GapCloser utlity that tries to improve the assemblies by closing gaps in the scaffolds. Try it out using:
GapCloser -b soap.config -a asm.scafSeq -o asm.new.scafSeq -t 8 | tee -a SOAPdenovo.log
Are there any improvements?
MaSuRCA
MaSuRCA takes quite some time to run, so start it over lunch! (…or if you’re not into waiting,
there are finished results in /sw/courses/assembly/illumina_assembly/assemblies
).
Make an directory called masurca and load the necessary modules:
mkdir masurca
module load MaSuRCA/3.2.1
Like SOAP, MaSuRCA also needs a configuration file. Luckily, it can make a template for you! Run
masurca -g masurca_config
to make a template, and open it in nano for editing.
Find the line called PE
and add the FULL paths to your Rhodo_Hiseq_trimmed_read1.fastq
and Rhodo_Hiseq_trimmed_read2.fastq
files. Generally, you should NOT give pre-trimmed data to MaSuRCA, as this will deteriorate the assembly (it does it’s own pre-processing), but in this case we do it to get a slightly shorter run time. Next change the prefix to pe 220 20
to set the proper type, insert length, and standard deviation. Also find the JUMP, PACBIO, and OTHER lines, and add a #
to remove them from the assembly.
Finally change the number of threads to 8, save the file, and exit nano.
Now run the command:
masurca masurca_config
To generate an assembly script, and FINALLY run:
./assemble.sh
to start the assembly!
Once done you will have two files called CA/10-gapclose/genome.ctg.fasta and CA/10-gapclose/genome.scf.fasta.
Quast
module load bioinfo-tools quast/4.5.4
Now load the masurca files into Quast together with the earlier spades, abyss, and soap-denovo contigs. Are there any major differences? How do the assemblies compare to each other?
Since the Rhodobacter genome is available, we will cheat and use it for comparison, by including it as a reference.
quast.py -R R_sphaeroides.fasta -o assembly_metrics -l label1[,label2,label3,...] -t 1 <draft_genome1.fasta> [<draft_genome2.fasta> <draft_genome3.fasta> ...]
Pilon
module load Pilon/1.22
module load bwa/0.7.17
module load samtools/1.5
Use Pilon to polish the assemblies. Do they improve? Does running Pilon again on the polished assemblies make any improvements?
bwa index <assembly>
bwa mem -t 4 <assembly> <read1> <read2> | samtools sort -@ 4 -T <temp_file_name> -O BAM -o <output_bwa_alignment.bam> -
samtools index <output_bwa_alignment.bam>
java -jar $PILON_HOME/pilon.jar --genome <assembly> --frags <output_bwa_alignment.bam> --threads 4 --outdir <polished_assembly_dir> --output <polished_assembly_prefix> --changes
Finding the best k
Use KmerGenie to find the optimal k.
module load KmerGenie/1.7039
kmergenie list_of_readfiles.fofn
Also examine the assembly graphs from different values of k to see which gives the best layout. Do they agree?
The tool Bandage can be used to make an image of the graph.
module load Bandage/0.8.0
Bandage load assembly.gfa --draw