Exercise: Data Quality Assessment
Exercises:
- Use
md5sum
to calculate the checksums of the data files in the folder/sw/courses/assembly/QC_Data/
. Redirect (>
operator) the output into a file calledchecksums.txt
in your workspace.Solution: (click here)
The correct solution requires you to change to the directory the data is in. The file paths must reflect the directory structure you want to check. ```bash cd /sw/courses/assembly/QC_Data/ md5sum */* > /proj/g2017025/nobackup/$USER/checksums.txt cd /proj/g2017025/nobackup/$USER cat checksums.txt b3acf269e095e93120cf7ee96126e3e4 Bacteria/bacteria_R1.fastq.gz 93a06cff64ecf32c4f56ebe98cf24d82 Bacteria/bacteria_R2.fastq.gz 9a5a3305130fe3e53785ac4fbf0e0584 Ecoli/E01_1_135x.fastq.gz ``` Incorrect Solution: ```bash md5sum /sw/courses/assembly/QC_Data/*/* > checksums.txt cat checksums.txt b3acf269e095e93120cf7ee96126e3e4 /sw/courses/assembly/QC_Data/Bacteria/bacteria_R1.fastq.gz 93a06cff64ecf32c4f56ebe98cf24d82 /sw/courses/assembly/QC_Data/Bacteria/bacteria_R2.fastq.gz 9a5a3305130fe3e53785ac4fbf0e0584 /sw/courses/assembly/QC_Data/Ecoli/E01_1_135x.fastq.gz ``` When these checksums are checked, they check the files in `/sw/courses/assembly/QC_Data/` rather than the files in your directory. -
Make a copy of the data in your workspace (note the
.
at the end of the command):cp -vr /sw/courses/assembly/QC_Data/* .
Use
md5sum -c
to check the checksums are complete.solution:
When you check the checksums of the transferred files, make sure the files you check are correct.
cp -vr /sw/courses/assembly/QC_Data/* . chmod u+w -R Bacteria Ecoli md5sum -c checksums.txt Bacteria/bacteria_R1.fastq.gz: OK Bacteria/bacteria_R2.fastq.gz: OK Ecoli/E01_1_135x.fastq.gz: OK
-
Use
file
to get the properties of the data files. In which format are they compressed?solution:
file */* Bacteria/bacteria_R1.fastq.gz: gzip compressed data, was "bacteria_R1.fastq", from Unix, last modified: Thu Nov 10 15:16:50 2016 Bacteria/bacteria_R2.fastq.gz: gzip compressed data, was "bacteria_R2.fastq", from Unix, last modified: Thu Nov 10 15:16:52 2016 Ecoli/E01_1_135x.fastq.gz: gzip compressed data, was "E01_1_135x.fastq", from Unix, last modified: Wed Oct 5 14:15:05 2016
This tells you all the files are gzip compressed.
-
Use
zcat
andless
to inspect the contents of the data files. From which sequencing technology are the following files and do you notice anything else?a.
Bacteria/bacteria_R{1,2}.fastq.gz
b.
Ecoli/E01_1_135x.fastq.gz
solution:
zcat Bacteria/bacteria_R1.fastq.gz | less - result omitted - zcat Bacteria/bacteria_R2.fastq.gz | less - result omitted - zcat Ecoli/E01_1_135x.fastq.gz | less -S -- result omitted -
a. Inspecting the headers of
Bacteria/bacteria_R{1,2}.fastq.gz
indicates the sequences come from the Illumina platform. One can also see the sequences are of different lengths, indicating the files have been filtered in some way already.b. Inspecting the headers of
Ecoli/E01_1_135x.fastq.gz
indicates the sequences come from the PacBio platform. This is further supported by the long sequences in the file. - Identify the different parts of the Illumina header information:
@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 1:N:0:ATTCCT
solution:
Each part is:
- Machine number
- Run number
- Flowcell ID
- Lane
- Tile
- X-coordinate on the Tile
-
Y-coordinate on the Tile
- First/Second in pair
- If it has failed Illumina QC (Chastity)
- Control bit
- Barcode index sequence/ID
- Identify the different parts of the Pacific Biosciences header information:
@m151121_235646_42237_c100926872550000001823210705121647_s1_p0/81/22917_25263
solution:
Each part is:
- m = movie
- time of run (yymmdd_hhmmss)
- Instrument Serial Number
- SMRT Cell Barcode
- Set Number (deprecated)
- Part Number
- ZMW hole number
- Subread region (start _ end)
- What does each tool in this compound command do, and what is the purpose of this command?
zcat *.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m
solution:
zcat *.fastq.gz # concatenates compressed files to one output stream seqtk seq -A - # seqtk is a toolkit for manipulating sequence data. The -A converts input to fasta output. grep -v "^>" # grep searches for lines beginning (^) with the string > and excludes them (-v). tr -dc "ACGTNacgtn" # tr translates characters from one set to another. The -dc deletes characters not in the "ACGTNacgtn" set. wc -m # wc is the word count tool. wc -m counts characters.
The purpose of the command is to count the total number of bases in your files.
-
How many bases are in:
a.
Bacteria/bacteria_R{1,2}.fastq.gz
?b.
Ecoli/E01_1_135x.fastq.gz
?solution:
a. The number of bases in
Bacteria/bacteria_R{1,2}.fastq.gz
is:zcat Bacteria/bacteria_R{1,2}.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m 225890464
b. The number of bases in
Ecoli/E01_1_135x.fastq.gz
is:zcat Ecoli/E01_1_135x.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m 748508257
-
In the data set
Ecoli/E01_1_135x.fastq.gz
, how many bases are in reads of size 10kb or longer?solution:
zcat Ecoli/E01_1_135x.fastq.gz | seqtk seq -A -L 10000 - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m 510546313
-
Run FastQC on the data sets. How many sequences are in each file?
solution:
fastqc -t 6 */*.fastq.gz firefox */*.html
- There are 766616 sequences in
Bacteria/bacteria_R{1,2}.fastq.gz
. - There are 87217 sequences in
Ecoli/E01_1_135x.fastq.gz
.
- There are 766616 sequences in
-
What is the average GC% in each data set?
solution:
- Bacteria/bacteria_R{1,2}.fastq.gz: 40%
- E01/E01_1_135x.fastq.gz: 49%
-
Which quality score encoding is used?
solution:
All files use the Sanger / Illumina 1.9 quality score encoding.
-
What does a quality score of 20 mean?
solution:
An expectation of 1 error in 100bp.
-
What does a quality score of 40 mean?
solution:
An expectation of 1 error in 10000bp.
-
Which distribution should the per base sequence plot be similar to in the FastQC output for Illumina data?
solution:
A Uniform distribution.
-
Which distribution should the per sequence GC plot be similar to in the FastQC output for Illumina data?
solution:
A Normal/Gaussian distribution.
-
Which value should the per sequence GC distribution be centered on?
solution:
The average GC content.
-
How much duplication is present in
Bacteria/bacteria_R{1,2}.fastq.gz
?solution:
24% in R1 and 15% in R2.
-
What is adapter read-through?
solution:
When the read sequence continues past the end of the DNA insert/fragment into the adapter sequence on the other end.
-
Use
trimmomatic
to trim adapters from the data setBacteria/bacteria_R{1,2}.fastq.gz
. Thetrimmomatic
jar file can be found in$TRIMMOMATIC_HOME
, and the adapter files can be found in$TRIMMOMATIC_HOME/adapters/
.a. Trim only the adapters. How much is filtered out?
b. Quality trim the reads as well. How much is filtered out?
solution:
a.
cd Bacteria java -jar $TRIMMOMATIC_HOME/trimmomatic-0.36.jar PE bacteria_R1.fastq.gz bacteria_R2.fastq.gz \ bacteria_R1.clean_pair.fastq.gz bacteria_R1.clean_unpair.fastq.gz \ bacteria_R2.clean_pair.fastq.gz bacteria_R2.clean_unpair.fastq.gz \ ILLUMINACLIP:$TRIMMOMATIC_HOME/adapters/TruSeq3-PE.fa:2:30:10 TrimmomaticPE: Started with arguments: bacteria_R1.fastq.gz bacteria_R2.fastq.gz bacteria_R1.clean_pair.fastq.gz bacteria_R1.clean_unpair.fastq.gz bacteria_R2.clean_pair.fastq.gz bacteria_R2.clean_unpair.fastq.gz ILLUMINACLIP:/sw/apps/bioinfo/trimmomatic/0.36/milou/adapters/TruSeq3-PE.fa:2:30:10 Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Quality encoding detected as phred33 Input Read Pairs: 766616 Both Surviving: 764633 (99.74%) Forward Only Surviving: 1983 (0.26%) Reverse Only Surviving: 0 (0.00%) Dropped: 0 (0.00%) TrimmomaticPE: Completed successfully
b.
java -jar $TRIMMOMATIC_HOME/trimmomatic-0.36.jar PE bacteria_R1.fastq.gz bacteria_R2.fastq.gz \ bacteria_R1.clean_qc_pair.fastq.gz bacteria_R1.clean_qc_unpair.fastq.gz \ bacteria_R2.clean_qc_pair.fastq.gz bacteria_R2.clean_qc_unpair.fastq.gz \ ILLUMINACLIP:$TRIMMOMATIC_HOME/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 \ TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 TrimmomaticPE: Started with arguments: bacteria_R1.fastq.gz bacteria_R2.fastq.gz bacteria_R1.clean_qc_pair.fastq.gz bacteria_R1.clean_qc_unpair.fastq.gz bacteria_R2.clean_qc_pair.fastq.gz bacteria_R2.clean_qc_unpair.fastq.gz ILLUMINACLIP:/sw/apps/bioinfo/trimmomatic/0.36/milou/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Quality encoding detected as phred33 Input Read Pairs: 766616 Both Surviving: 573596 (74.82%) Forward Only Surviving: 170485 (22.24%) Reverse Only Surviving: 5253 (0.69%) Dropped: 17282 (2.25%) TrimmomaticPE: Completed successfully