Training ab-initio predictor
From this maker run evidence based, we can train our ab-initio predictors and then use them for the second run of annotation. You will need a set of genomic sequences with gene structures (sequence coordinates of starts and ends of exons and genes) and the most important part is selected the right set of genes. In many cases, or as a first step towards modeling complete genes, it is sufficient to have only the coding parts of the gene structure (CDS). We will only train augustus today as it is one the best ab-initio predictor and one of the hardest to train. Maker also support SNAP (Works good, easy to train, not as good as others ab-initio especially on longer intron genomes), GeneMark (Self training, no hints, buggy, not good for fragmented genomes or long introns). FGENESH (Works great, costs money even for training) and now EVM.
Training Augustus
First you need to write the libraries path you will need in .bash_profile to perform the following analyses.
/home/login/annotation_course/course_material/lib/install_perllib_missing.sh
source ~/.bash_profile
Then load all modules that we will need to train Augustus
module load bioinfo-tools
module load perl
module load perl_modules
module load BioPerl/1.6.924_Perl5.18.4
module load cufflinks/2.2.1
Create project folder
We create a new folder in which we will store all the configuration files and input files we will need/create. To do so, type:
mkdir train_augustus
cd train_augustus
You will need to symlink all data you will need such as the gff files from the first run of maker and the chromosome 4 fasta sequence.
ln -s ~/annotation_course/practical3/maker_no_abinitio/annotationByType/maker.gff dmel_results_noAbinitio.gff
Compile a set of training and test genes
First step is to select only the coding genes from the maker.gff file and remove all tRNA
~/annotation_course/course_material/git/GAAS/annotation/Tools/Maker/maker_gff3manager_JD_v8.pl -f dmel_results_noAbinitio.gff -o dmel_results_noAbinitio_clean
cd dmel_results_noAbinitio_clean
In this folder you will need to create different folders
mkdir filter
mkdir protein
mkdir nonredundant
mkdir blast_recursive
mkdir gff2genbank
Next step, we need to filter the best genes we will use for the training, we need complete genes, models with a distance with an other model (distance genes to genes) over 500 nt and a AED score under 0.3 (those are our default parameters).
~/annotation_course/course_material/scripts/filter_sort.pl -file codingGeneFeatures.gff -F 4.fa -o filter/codingGeneFeatures.filter.gff -c -r -d 500 -a 0.3
We also need to select the longest ORF
~/annotation_course/course_material/scripts/find_longest_CDS.pl -f filter/codingGeneFeatures.filter.gff -o codingGeneFeatures.filter.longest_cds.gff
There are different ways of proceeding after the first selection and we are using “the approached of spliced alignments of protein sequences of the same or a very closely related species” against the assembled genomic sequence. In order to do so, we translate our coding genes into proteins, format the protein fasta file to be able to run a recursive blast and then select the best ones. Indeed, each sequence can contain one or more genes; the genes can be on either strand. However, the genes must not overlap, and only one transcript per gene is allowed.
gffread -y codingGeneFeatures.filter.longest_cds.tmp -g 4.fa codingGeneFeatures.filter.longest_cds.gff
~/annotation_course/course_material/scripts/fix_fasta.sh codingGeneFeatures.filter.longest_cds.tmp > protein/codingGeneFeatures.filter.longest_cds.proteins.fa
rm codingGeneFeatures.filter.longest_cds.tmp
module load blast
makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.proteins.fa -dbtype prot
blastp -query protein/codingGeneFeatures.filter.longest_cds.proteins.fa -db protein/codingGeneFeatures.filter.longest_cds.proteins.fa -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.proteins.fa.blast_recursive
~/annotation_course/course_material/git/GAAS/annotation/Tools/Util/gff/gff_filter_by_mrna_id.pl --gff codingGeneFeatures.filter.longest_cds.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.proteins.fa.blast_recursive --outfile nonredundant/codingGeneFeatures.nr.gff
module load augustus/3.2.3
Sequences need to be converted in a simple genbank format.
gff2gbSmallDNA.pl nonredundant/codingGeneFeatures.nr.gff 4.fa 500 gff2genbank/codingGeneFeatures.nr.gbk
In order for the test accuracy to be statistically meaningful the test set should also be large enough (100-200 genes). You should split the set of gene structures randomly.
randomSplit.pl gff2genbank/codingGeneFeatures.nr.gbk 100
- What happened? how can you solve it? what might be the consequences of it?
Train Augustus
Now that you have created a set of gene to train augustus, let’s train it!
Augustus need a set of parameters that are provided :
please use the path where you copied augustus_path in the Busco exercise yesterday.
new_species.pl --AUGUSTUS_CONFIG_PATH=augustus_path --species=dmel_login
AUGUSTUS_CONFIG_PATH=augustus_path
etraining --species=dmel_login gff2genbank/codingGeneFeatures.nr.gbk.train
augustus --species=dmel_login gff2genbank/codingGeneFeatures.nr.gbk.test | tee run.log
- Look at the accuracy report, what does it mean? why? see Training Augustus