Part 2: Familiarizing yourself with the data


It is important to recognize what sequence data looks like and to know how to convert it between formats, so that you can use the various bioinformatics tools for data analysis. The files provided for your analysis are in Fastq format and contain so-called raw reads. Look into the directory where the data files are located.
For example, for dataset 1:

ls /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/

Take a look at what a Fastq file looks like.

less /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R1_001.fastq

The less command lets you view a text file without changing its contents. Use the arrow keys to scroll up and down, and type q to exit. Now we will do a few exercises to see what these raw reads contain.
In this exercise there are two data sets of reads, generated by HiSeq 2000 and MiSeq sequencing instruments.

HiSeq 2000 read pairs are named G5_Hiseq_R1_001.fastq and G5_Hiseq_R2_001.fastq.
MiSeq read pairs are named G5_Miseq_R1_001.fastq and G5_Miseq_R2_001.fastq.

Count reads

For example, to count the reads in the forward file of the HiSeq data set, type:

grep -c -e "^@HWI" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R1_001.fastq

grep searches for lines that contain the pattern you are looking for. In this example, the pattern is @HWI.
The -c flag counts the matching lines instead of printing them.
The -e flag marks the next argument as the pattern to search for; grep treats the pattern as a regular expression by default.
In this pattern, the ^ character anchors the match to the start of the line, so only lines that begin with @HWI are counted.
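As a cross-check on the grep count, the reads can also be counted from the number of lines in the file. The sketch below assumes that every record occupies exactly four lines, which is the usual layout for Illumina Fastq files but is not verified here for the course data. The function name count_reads_by_lines and the tiny demo file are made up for illustration. The demo also shows why the pattern includes HWI and not just @: a quality line may itself begin with @.

```shell
# Count reads by dividing the number of lines by 4.
# Assumption: each Fastq record is exactly four lines long.
count_reads_by_lines() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Tiny made-up file with two records (not the course data).
# Note that the second quality line starts with '@'.
demo=$(mktemp)
printf '%s\n' \
    '@HWI-DEMO:1:1101:1000:2000' 'ACGT' '+' 'IIII' \
    '@HWI-DEMO:1:1101:1001:2001' 'TTGA' '+' '@HII' \
    > "$demo"

count_reads_by_lines "$demo"    # line-count method
grep -c -e "^@HWI" "$demo"      # header-pattern method, same answer
grep -c -e "^@" "$demo"         # too loose: also counts the quality line
```

If the two methods disagree on a real file, either the pattern does not match every header or the file does not follow the four-line layout.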

Count how many reads are in the reverse Fastq file (the one ending in _R2_001.fastq) and write down the number of reads for each file in a separate file.
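One possible way to keep the counts is to let the shell write them to a file for you. The following is a minimal sketch; the function name record_read_counts, the output file name read_counts.txt, and the demo files are all hypothetical and only serve as an illustration.

```shell
# Append one "<file name><TAB><read count>" line per Fastq file to OUTFILE.
# Usage: record_read_counts PATTERN OUTFILE FASTQ...
record_read_counts() {
    pattern="$1"
    outfile="$2"
    shift 2
    for fastq in "$@"; do
        count=$(grep -c -e "$pattern" "$fastq")
        printf '%s\t%s\n' "$(basename "$fastq")" "$count" >> "$outfile"
    done
}

# Demonstration on two tiny made-up files (not the course data).
workdir=$(mktemp -d)
printf '%s\n' '@HWI-DEMO:1' 'ACGT' '+' 'IIII' '@HWI-DEMO:2' 'TTGA' '+' 'IIII' \
    > "$workdir/demo_R1_001.fastq"
printf '%s\n' '@HWI-DEMO:1' 'ACGT' '+' 'IIII' '@HWI-DEMO:2' 'TCAA' '+' 'IIII' \
    > "$workdir/demo_R2_001.fastq"

record_read_counts "^@HWI" "$workdir/read_counts.txt" \
    "$workdir/demo_R1_001.fastq" "$workdir/demo_R2_001.fastq"
cat "$workdir/read_counts.txt"
```

For the exercise, the same function could be called with the two G5_Hiseq Fastq paths from dataset1 and an output file in your own working directory.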

Understanding Fastq format

Type the following commands to see the first 10 read IDs of both the forward and the reverse file:

grep -e "^@HWI" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R1_001.fastq | head
grep -e "^@HWI" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset1/G5_Hiseq_R2_001.fastq | head

Try to identify the similarities and differences between the _R1_001.fastq and _R2_001.fastq files.
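Comparing two separate listings by eye can be tedious, so it may help to print the headers of both files next to each other. The sketch below does this with paste. The function name headers_side_by_side and the demo headers are invented for illustration and do not reflect the header format of the course data.

```shell
# Print the first N matching headers of two Fastq files side by side,
# separated by a tab.
# Usage: headers_side_by_side PATTERN R1_FASTQ R2_FASTQ [N]
headers_side_by_side() {
    pattern="$1"
    n="${4:-10}"
    left=$(mktemp)
    right=$(mktemp)
    grep -e "$pattern" "$2" | head -n "$n" > "$left"
    grep -e "$pattern" "$3" | head -n "$n" > "$right"
    paste "$left" "$right"
    rm -f "$left" "$right"
}

# Demonstration on two tiny made-up files (not the course data).
pairdir=$(mktemp -d)
printf '%s\n' '@HWI-DEMO:7:fwd' 'ACGT' '+' 'IIII' '@HWI-DEMO:8:fwd' 'TTGA' '+' 'IIII' \
    > "$pairdir/demo_R1.fastq"
printf '%s\n' '@HWI-DEMO:7:rev' 'ACGT' '+' 'IIII' '@HWI-DEMO:8:rev' 'TCAA' '+' 'IIII' \
    > "$pairdir/demo_R2.fastq"

headers_side_by_side "^@HWI" "$pairdir/demo_R1.fastq" "$pairdir/demo_R2.fastq" 5
```

Each output line then holds the header of one read from the first file followed by the header at the same position in the second file.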

HiSeq vs. MiSeq

Now count the number of sequences in the MiSeq data. Type:

grep -c -e "^@MISEQ" /proj/g2015028/nobackup/single_cell_exercises/sequences/dataset2/G5_Miseq_R1_001.fastq

The MiSeq instrument produces different Fastq headers from the HiSeq instrument: its headers start with @MISEQ.
Do the same for the second MiSeq Fastq file.
Keep these numbers in a file; we will use them later to compare some results.
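Because the header prefix differs between instruments, one option is to read the prefix from the file itself instead of typing it. The sketch below takes the text before the first colon on the first line as the prefix and counts the lines that start with it. This assumes colon-delimited headers and that the first line of the file is a header. The function names header_prefix and count_reads_auto, and the demo file, are hypothetical.

```shell
# Return the part of the first header line before the first ':'.
# Assumption: headers are colon-delimited and line 1 is a header.
header_prefix() {
    head -n 1 "$1" | cut -d ':' -f 1
}

# Count the lines that start with that prefix (literal match, no regex).
count_reads_auto() {
    prefix=$(header_prefix "$1")
    awk -v p="$prefix" 'index($0, p) == 1 { n++ } END { print n + 0 }' "$1"
}

# Tiny made-up file with two records (not the course data).
miseq_demo=$(mktemp)
printf '%s\n' \
    '@MISEQ-DEMO:12:1101:100:200' 'ACGT' '+' 'IIII' \
    '@MISEQ-DEMO:12:1101:101:201' 'TTGA' '+' '@MII' \
    > "$miseq_demo"

header_prefix "$miseq_demo"
count_reads_auto "$miseq_demo"
```

The literal prefix match in awk avoids problems if the instrument name contains characters that have a special meaning in regular expressions.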

Questions:

Q2.1: Looking at a Fastq file, can you see a repeating pattern in the file?
Do you notice any unique characters or strings that would help you identify where a read begins and ends?

Q2.2: Did you notice anything similar/different in the Fastq headers of the two read pairs?
How can you identify paired-end reads?
