Part 3: Single cell genome assembly

Info! If you get disconnected from Uppmax, click here to find out how to reconnect.

In this part of the course, you will start assembling ‘real’ (but reduced) single cell genome datasets. We will compare two single cell specific assemblers, namely SPAdes and IDBA-UD, and one ‘general-purpose’ assembler called Ray (all of which were introduced during the morning). The idea is that you will be able to compare the results of these assemblers on two kinds of datasets (HiSeq and MiSeq), as well as with different pre-treatments (‘trimming’). You will also have a chance to explore how to decide which assembly is the best (‘assembly metrics’), as there is no simple answer to this question.
We would like you to compare 12 assemblies in total (3 assemblers on each combination of dataset and pre-treatment). To organize this, we suggest that you:

Talk to each other to form a group of maximum 4 persons
Pick a group number and write your name in the corresponding spreadsheet (see below)
Split the work: ideally, each person runs only the 3 assemblies for one combination of dataset and pre-treatment (1 table in the spreadsheet)

This way you can focus on a few assemblies and avoid handling too many folders and files. Assembly is also relatively time-consuming (although we have prepared reduced datasets for the tutorial to keep the run times reasonable).
The actual tables to be filled in are provided in Google Docs, and the links can be found below.
Do not worry if you miss something: we will collect the results from all groups and discuss them together.
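
To see where the total of 12 assemblies comes from, the short sketch below enumerates every combination of dataset, pre-treatment and assembler and groups them by spreadsheet table. The assembler and dataset names come from the text above; the pre-treatment labels "untrimmed" and "trimmed" are an assumption made for illustration, and the function name `plan_tables` is hypothetical.

```python
from itertools import product

ASSEMBLERS = ["SPAdes", "IDBA-UD", "Ray"]
DATASETS = ["HiSeq", "MiSeq"]
# Assumption: the two pre-treatments are untrimmed and trimmed reads.
PRETREATMENTS = ["untrimmed", "trimmed"]


def plan_tables():
    """Map each (dataset, pre-treatment) table to the assemblers to run on it."""
    return {
        (dataset, pretreatment): list(ASSEMBLERS)
        for dataset, pretreatment in product(DATASETS, PRETREATMENTS)
    }


tables = plan_tables()
total = sum(len(assemblers) for assemblers in tables.values())

# One line per spreadsheet table, i.e. one line per person in a group of four.
for (dataset, pretreatment), assemblers in tables.items():
    print(f"{dataset}, {pretreatment}: {', '.join(assemblers)}")
print(f"{len(tables)} tables, {total} assemblies in total")
```

Each of the four tables corresponds to the share of work suggested for one group member.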

Group 1:
Group 2:
Group 3:
Group 4:
Group 5:
Group 6:
Group 7:
Group 8:

3.1. Organize working folder
3.2. Pre-processing
3.3. Assembly
3.4. Assessing assembly quality using Quast
3.5. Gene prediction using Prodigal
3.6. Running completeness estimates
3.7. Identifying ribosomal RNAs

These steps will help you obtain results to think about the following questions:

Questions:

Q3.1: Did you notice how many reads were discarded in the pre-processing? Do the numbers differ between the MiSeq and HiSeq datasets? You can use the group Google spreadsheet to see the results for all the datasets.

Q3.2: Did you notice any differences between the HiSeq and MiSeq data, or between the different assemblers?

Q3.3: Did you notice the impact of trimming? Is it the same for all assemblers?

Q3.4: What do you think is the best way to assess the ‘quality’ of an assembly? (e.g. total size, N50, number of predicted ORFs, completeness)

Q3.5: What do you think is the best way to assemble this particular SC dataset? Why?

Q3.6: What is the identity of the organism based on the analyses you have performed? What phylum does it belong to, and are there any closely related organisms in the databases?

Q3.7: Try to find out in what type of environment you might find similar organisms.
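
As background for Q3.4, here is a minimal sketch of how N50 is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly size. The function name `n50` and the contig lengths are made up for illustration and are not taken from the course datasets; Quast reports this metric for you in step 3.4.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all bases in the assembly.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in bp), purely for illustration.
example_lengths = [100, 80, 50, 30, 20, 10]
print("Total size:", sum(example_lengths))
print("N50:", n50(example_lengths))
```

Note that N50 only describes contiguity: a misassembled genome can have a high N50, which is one reason to look at several metrics together.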
