Quality assessment and quality control of sequence data
expected learning outcomes
Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amounts of error. Therefore, it is necessary to understand, identify and exclude error-types that may impact the interpretation of downstream analysis.
The objective of this activity is to understand some relevant properties of raw sequence data. We will focus on properties such as length, quality scores and base and k-mer distribution in order to assess the quality of the data and discard low quality or uninformative reads.
We will use basic UNIX commands, FastQC, and the FASTX-Toolkit applied to two small data sets (Illumina and 454). For some background on error-types generated by these technologies, we recommend the following references:
- Make sure you have the following programs installed (if you used the customized USB flash drive for software installation, you already have them):
- The data you are going to use have been copied from your USB flash drive during the software installation to the ~/wg_jan2013/activities/QC folder. Go to that folder and you should find:
- bartonella_illumina.fastq: a small subset of an Illumina run (v.1.5) in a FASTQ file
- bartonella_454.fasta and bartonella_454.qual: sequences and qualities for a small subset of a 454 Titanium run
- mRNA_small.fq: a small subset of an Illumina mRNA-seq run (v.1.5) in a FASTQ file
- microRNA_small.fastq.gz: a small subset of an Illumina smallRNA-seq run (v.1.9) in a FASTQ file
- barcodes.txt: a file with information about the barcodes used in the file mRNA_small.fq
- The Perl scripts that are going to be used in this activity have also been installed:
- Have a look at the documentation for FastQC and FASTX-Toolkit.
exercise 1: checking Illumina data with FastQC
The goal of this exercise is to inspect the sequence data of an Illumina run. The sequence data belongs to a bovine isolate of the intracellular bacterium Bartonella, which is transmitted by insect vectors. Although FASTQ is the main file received from a sequence provider, some users want to perform the base calling step themselves, using a package other than the proprietary Illumina software. This is not covered in this exercise, and we start directly from the FASTQ file.
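For orientation before opening FastQC: in the common four-line layout, a FASTQ record consists of an identifier line starting with @, the sequence, a separator line starting with +, and a quality string with one character per base. The Python sketch below is not part of the course material; it only illustrates how the sequence count, length range and GC content shown on the “Basic Statistics” page could be computed. The function names and the three example reads are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (or a blank line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], seq, qual


def basic_statistics(records):
    """Return read count, shortest/longest read length and overall GC%."""
    count = total_bases = gc_bases = 0
    min_len = max_len = None
    for _identifier, seq, _qual in records:
        length = len(seq)
        count += 1
        total_bases += length
        gc_bases += sum(seq.upper().count(base) for base in "GC")
        min_len = length if min_len is None else min(min_len, length)
        max_len = length if max_len is None else max(max_len, length)
    gc_percent = 100 * gc_bases / total_bases if total_bases else 0.0
    return {
        "count": count,
        "min_length": min_len,
        "max_length": max_len,
        "gc_percent": gc_percent,
    }


# Invented example reads (not taken from bartonella_illumina.fastq).
EXAMPLE_FASTQ = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGCC\n+\nIIII\n"
    "@read3\nATATATATATAT\n+\nIIIIIIIIIIII\n"
)

stats = basic_statistics(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
print(stats)
```

To try the same functions on the exercise data, the in-memory example could be replaced by an open file handle, for instance open("bartonella_illumina.fastq").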
- Launch FastQC by typing fastqc in the terminal window.
- Load the bartonella_illumina.fastq file into FastQC (File->Open). You can view the results either within the FastQC application or the exported report.
- Inspect the data contained in the sequence file, bartonella_illumina.fastq. Have a look at the numbers output on the “Basic Statistics” page. How many sequences do we have? What is the sequence length? And the GC content? Examine the “Per base sequence quality” and “Per sequence quality scores” pages. Roughly, how many incorrect base calls are expected at most positions? Do you think this run gave good quality sequences? There are 10000 sequences, each 38 nucleotides long. The total GC content is 37%. The expected rate of incorrect base calls ranges from 1 in ~8000 (quality score = 39) to 1 in ~5000 (quality score = 37). On average, the sequences are of good quality.
- Examine the “Per base sequence content”, “Per base GC content” and “Per sequence GC content” pages. FastQC points out a “potential problem” with an orange exclamation mark. Do you think we should worry about it in this particular case?
- Examine the “Overrepresented sequences” page. Why does FastQC give a warning message? Hint: It identified a sequence that is repeated 17 times and that could be an adaptor contamination.
(Note that there is an error message in the “kmer content” page. In this case, this error is probably due to the small size of the dataset, but there are other factors that can also make this test fail)
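The error rates quoted in the answer above follow from the Phred definition Q = -10 · log10(p), where p is the probability that a base call is wrong. The short Python sketch below shows the arithmetic, together with the decoding of quality characters. It assumes an ASCII offset of 64 for Illumina 1.5 files and 33 for Illumina 1.8 and later; the function names are illustrative only.

```python
def error_probability(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=64):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=64 is assumed for Illumina 1.3-1.7 files, offset=33 for
    Sanger-style files and Illumina 1.8+.
    """
    return [ord(char) - offset for char in quality_string]


for q in (37, 39):
    one_in = round(1 / error_probability(q))
    print(f"Q{q}: about 1 incorrect call in {one_in}")
# Q37: about 1 incorrect call in 5012
# Q39: about 1 incorrect call in 7943

print(decode_qualities("hgf"))            # [40, 39, 38]
print(decode_qualities("IH", offset=33))  # [40, 39]
```

Using the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before interpreting its quality values.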
exercise 2: checking 454 data with FastQC
The goal of this exercise is to do the same kind of quality checks as in exercise 1, but this time on 454 data. The primary data from 454 is stored in an SFF file, but in general, FASTA and qual files are also provided. It is possible to parse the SFF files with different parameters using the proprietary software provided by 454 (sfffile/sffinfo) or a free, open-source tool, sff_extract. SFF extraction is not covered in this exercise, and we start from the bartonella_454.fasta and bartonella_454.qual files.
- Examine the FASTA and qual files.
- Since FastQC and some FASTX-Toolkit programs can only take FASTQ as an input, you need to convert the FASTA and qual files to a FASTQ file. This can be done using a short Bioperl script. Type the following command into your terminal window (make sure you are located in the directory where your files are):
fastaQual2fastq.pl bartonella_454.fasta bartonella_454.qual > bartonella_454.fastq
- Inspect the resulting FASTQ file
- Repeat the steps in exercise 1 for this dataset. What would you say of:
- The sequence length range and GC content? The read length distribution is bimodal, with one mode having a mean of ~65 and the other of about 550. Sequence reads are GC rich.
- The quality scores? The quality of the sequence reads is high but it drops after ~400bp.
- What does the “Per base sequence content” graph tell us? The sequence reads are GC rich. The distribution of nucleotides is not uniform at the 5′ end of the reads.
- Can you explain what the peaks at the first 9 positions represent in the “Per base sequence content”? What should we do with this dataset before using it for further analyses? The first 9 bases of almost all sequences are ATATCGCGA. This represents an adaptor sequence at the 5′ end that would need to be trimmed.
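To make the FASTA and qual to FASTQ conversion step less of a black box, here is a minimal Python sketch of the idea. It is not the fastaQual2fastq.pl script, whose internals are not shown here. It assumes that both files list the same reads in the same order, that the qual file holds space-separated integer scores, and that the output should use an ASCII offset of 33; the function names and the two example reads are invented.

```python
import io


def read_fasta_like(handle):
    """Yield (identifier, body_lines) for each '>' record in a FASTA or qual file."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            name, lines = line[1:].split()[0], []
        else:
            lines.append(line)
    if name is not None:
        yield name, lines


def fasta_qual_to_fastq(fasta_handle, qual_handle, offset=33):
    """Pair sequences with their scores and yield FASTQ records as strings.

    Note: zip() stops at the shorter input, so a truncated file is not detected.
    """
    pairs = zip(read_fasta_like(fasta_handle), read_fasta_like(qual_handle))
    for (seq_id, seq_lines), (qual_id, qual_lines) in pairs:
        seq = "".join(seq_lines)
        scores = [int(value) for value in " ".join(qual_lines).split()]
        if seq_id != qual_id or len(seq) != len(scores):
            raise ValueError(f"sequence and quality do not match for {seq_id!r}")
        quality = "".join(chr(score + offset) for score in scores)
        yield f"@{seq_id}\n{seq}\n+\n{quality}\n"


# Invented example input (not taken from the bartonella_454 files).
fasta = io.StringIO(">r1\nACGT\n>r2\nGG\n")
qual = io.StringIO(">r1\n40 40 30 20\n>r2\n10 0\n")

records = list(fasta_qual_to_fastq(fasta, qual))
print("".join(records), end="")
```

Each integer score becomes a single character, which is why the quality line of a FASTQ record always has the same length as its sequence line.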
exercise 3: trimming adaptor sequences, demultiplexing data and detecting artifacts using FASTX-Toolkit
- Trim the adaptor sequence from the bartonella_454.fastq file using the FASTX-Toolkit. Open the resulting file in FastQC and compare the results with those for the untrimmed file. Hints: use
- Examine the mRNA_small.fq file in FastQC. What would you say about the quality of this run? What about nucleotide frequencies: what do the patterns at the beginning and end of the sequence represent? Hints: 1. There are four samples multiplexed into the same lane, labeled with four different barcodes (look at the barcodes file); 2. It's mRNA
- Split the sequences in the file mRNA_small.fq by barcode
Hint: run fastx_barcode_splitter.pl -h to see the available options
- Examine the outputs in FastQC and note the patterns at the start of the sequences. How can you explain them? Look at the barcodes file
- Examine the microRNA_small.fastq.gz file using FastQC. What can you conclude from the nucleotide distribution graph? There is a sequence that is more abundant than any other and whose nucleotide sequence can almost be “guessed” from the graph
- Examine the Overrepresented sequences tab. What can you conclude? What could you say about the sequences that have “no hit”? Do you think it is expected to have overrepresented sequences in a microRNA sequencing experiment? Hint: use blast
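The barcode-splitting step above can be pictured as a simple lookup: each read is compared against the known barcodes and assigned to the matching sample. The Python sketch below illustrates that idea only; it does not describe how fastx_barcode_splitter is implemented. It assumes barcodes sit at the 5′ end and must match exactly, and it removes the barcode from each assigned read, which is a choice made for this sketch. The sample names, barcodes and reads are invented and do not come from barcodes.txt.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group (identifier, sequence) reads by exact 5' barcode match.

    barcodes maps sample name -> barcode sequence. Reads that match no
    barcode are collected under the key 'unmatched' and left untrimmed.
    """
    bins = defaultdict(list)
    for identifier, seq in reads:
        for sample, barcode in barcodes.items():
            if seq.startswith(barcode):
                # Strip the barcode so it does not bias downstream statistics.
                bins[sample].append((identifier, seq[len(barcode):]))
                break
        else:
            bins["unmatched"].append((identifier, seq))
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
example_barcodes = {"sampleA": "ACGT", "sampleB": "TGCA"}
example_reads = [
    ("r1", "ACGTTTTTGG"),
    ("r2", "TGCAGGGCCC"),
    ("r3", "ACGTAAAA"),
    ("r4", "GGGGGGGG"),
]

split = demultiplex(example_reads, example_barcodes)
for sample, assigned in split.items():
    print(sample, assigned)
```

If the barcodes are left in place after splitting, every read in a given output file starts with the same few bases, which is one way such a fixed pattern can show up at the start of a “Per base sequence content” plot.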
see also
Other tools to verify the quality of second-generation sequencing results are available:
- TRIMMOMATIC, a flexible read trimming tool for Illumina sequencing data.
- Galaxy, a web-based genomics pipeline, in which FASTX-Toolkit and FastQC are integrated.
- PRINSEQ, available either as a standalone package or through a web interface, can generate summary statistics of sequence and quality data, which can subsequently be used to filter, reformat and trim next-generation sequence data.
- Perl and Bioperl, to write small scripts. A very large number of Bioperl packages devoted to genomics already exist.
- R and Bioconductor, another solution for importing and verifying data. There are many contributed packages and modules available.