Part 2: Familiarizing yourself with the data
It is important to recognize what sequence data look like and to know how to convert them between formats, so that you can use a range of bioinformatics tools for data analysis.
The files provided for your analysis are in Fastq format and contain so-called raw reads.
Look into the directory where the data files are located, for example dataset 1:
ls /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/
Take a look at what a Fastq file looks like.
less /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R1_001.fastq
The less command allows you to take a look at a text file without changing its contents. You can use the arrow keys to scroll up and down the window. Type q to exit.
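If you only want a quick, non-interactive look at the top of a file, head prints its first lines and returns to the prompt. The sketch below runs on a small made-up file so that it is self-contained; the file and the read names in it are hypothetical and are not taken from the course data.

```shell
# Build a tiny, made-up Fastq file in a temporary location (hypothetical data).
demo=$(mktemp)
printf '%s\n' \
    '@HWI-DEMO:1:1101:1000:2000 1:N:0:ACGT' \
    'ACGTACGT' \
    '+' \
    'IIIIIIII' \
    '@HWI-DEMO:1:1101:1001:2001 1:N:0:ACGT' \
    'TTGCAAGC' \
    '+' \
    'IIIIIIII' > "$demo"

# Print only the first 4 lines of the file, without opening a pager.
first_lines=$(head -n 4 "$demo")
printf '%s\n' "$first_lines"

# Remove the temporary file again.
rm -f "$demo"
```

On the course data you would pass the path of the real Fastq file to head instead of the demo file.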
Now, we will do a few exercises to see what we have in these raw reads.
In this exercise, there are two data sets of reads, generated by the HiSeq 2000 and MiSeq sequencing instruments.
- HiSeq 2000 read pairs are named G5_Hiseq_R1_001.fastq and G5_Hiseq_R2_001.fastq.
- MiSeq read pairs are named G5_Miseq_R1_001.fastq and G5_Miseq_R2_001.fastq.
For example, to count the forward reads in the HiSeq data set, type:
grep -c -e "^@HWI" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R1_001.fastq
grep searches a file for lines that match a pattern. In this example, the pattern is ^@HWI.
The -c flag makes grep print the number of matching lines instead of the lines themselves.
The -e flag marks the argument that follows as the pattern to search for. grep interprets the pattern as a basic regular expression by default.
In this pattern, the ^ character anchors the match to the start of the line, so only lines that begin with @HWI are counted.
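To see why the ^ anchor matters, the sketch below counts matches with and without it. It uses a small made-up file in which one quality line happens to contain the characters @HWI in the middle; the file, the read names and the quality strings are hypothetical and only serve to illustrate the flags.

```shell
# Build a tiny, made-up Fastq file (hypothetical data).
# The last quality line contains "@HWI" in the middle on purpose.
demo=$(mktemp)
printf '%s\n' \
    '@HWI-DEMO:1:1101:1000:2000 1:N:0:ACGT' \
    'ACGTACGT' \
    '+' \
    'IIIIIIII' \
    '@HWI-DEMO:1:1101:1001:2001 1:N:0:ACGT' \
    'TTGCAAGC' \
    '+' \
    'II@HWIII' > "$demo"

# Anchored pattern: counts only lines that start with @HWI.
anchored=$(grep -c -e "^@HWI" "$demo")

# Unanchored pattern: also counts lines with @HWI anywhere in them.
unanchored=$(grep -c -e "@HWI" "$demo")

echo "anchored:   $anchored"
echo "unanchored: $unanchored"

# Remove the temporary file again.
rm -f "$demo"
```

In this demo file the anchored count equals the number of reads, while the unanchored count is inflated by the quality line.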
Count how many reads are in the reverse Fastq file (the one ending in _R2_001.fastq). Write down the number of reads for each file in a separate file.
Type the following commands to see the first 10 read IDs of both the forward and reverse files:
grep -e "^@HWI" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R1_001.fastq | head
grep -e "^@HWI" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R2_001.fastq | head
Try to identify the similarities and differences between the _R1_001.fastq and _R2_001.fastq files
Next, count the number of sequences in the MiSeq data. Type:
grep -c -e "^@MISEQ" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset2/G5_Miseq_R1_001.fastq
The MiSeq instrument produces Fastq headers that differ from those of the HiSeq instrument; here the headers start with @MISEQ.
Do the same for the second MiSeq Fastq file.
Keep those numbers in a file; we will use them later to compare some results.
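One possible way to record the counts is to loop over the files and append each file name and its count to a text file. This is only a sketch under stated assumptions: the demo files, their read names and the output file name read_counts.txt are hypothetical, and you would substitute the real Fastq paths and header prefix.

```shell
# Work in a temporary directory with two tiny, made-up Fastq files (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' \
    '@MISEQ-DEMO:1:1101:1000:2000 1:N:0:1' 'ACGTACGT' '+' 'IIIIIIII' \
    '@MISEQ-DEMO:1:1101:1001:2001 1:N:0:1' 'TTGCAAGC' '+' 'IIIIIIII' \
    > "$workdir/demo_R1.fastq"
printf '%s\n' \
    '@MISEQ-DEMO:1:1101:1000:2000 2:N:0:1' 'GGCATTAC' '+' 'IIIIIIII' \
    '@MISEQ-DEMO:1:1101:1001:2001 2:N:0:1' 'CCATGGTA' '+' 'IIIIIIII' \
    > "$workdir/demo_R2.fastq"

# Start with an empty counts file (hypothetical name).
counts="$workdir/read_counts.txt"
: > "$counts"

# Append one tab-separated line per file: file name, then number of header lines.
for f in "$workdir/demo_R1.fastq" "$workdir/demo_R2.fastq"; do
    n=$(grep -c -e "^@MISEQ" "$f")
    printf '%s\t%s\n' "$(basename "$f")" "$n" >> "$counts"
done

# Show what was recorded.
cat "$counts"
```

The temporary directory stored in workdir can be deleted once you have looked at the result.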
Questions:
Q2.1: Looking at a Fastq file, can you see a repeating pattern in the file?
Do you notice any unique characters or strings that would help you identify where a read begins and ends?
Q2.2: Did you notice anything similar/different in the Fastq headers of the two read pairs?
How can you identify paired-end reads?