Quality assessment and quality control of sequence data
table of contents
expected learning outcomes
Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amounts of error. Therefore, it is necessary to understand, identify and exclude error-types that may impact the interpretation of downstream analysis.
The objective of this activity is to understand some relevant properties of raw sequence data. We will focus on properties such as length, quality scores and base and k-mer distribution in order to assess the quality of the data and discard low quality or uninformative reads.
- The data you are going to use are already in your instance. In the qc folder, you should find:
- DATASET 1: bartonella_illumina.fastq: a small subset of an Illumina run (v.1.5) belonging to a bovine isolate of the intracellular bacterium Bartonella
- DATASET 2: EMP_Misc_16v4EMP_NoIndex_L005_R1_001-sample.fastq.gz: a small subset of an Illumina run (v.1.9) belonging to one Earth Microbiome Project environmental sample sequenced for 16S amplicons (demultiplexed from a set of pooled samples).
- DATASET 3: SRR026762-sample.fastq: a subset of an Illumina run (v.1.9) belonging to a microRNA profiling experiment in human embryonic stem cells
- Have a look at the documentation for FastQC and FASTX-Toolkit.
exercise 1: checking Illumina data with FastQC
The goal of this exercise is to inspect the sequence data of an Illumina run.
- Launch FastQC by typing fastqc in the terminal window.
- Load the bartonella_illumina.fastq file into FastQC (File->Open). You can view the results either within the FastQC application or the exported report.
- Inspect the data contained in the sequence file, bartonella_illumina.fastq. Have a look at the numbers output on the “Basic Statistics” page. How many sequences do we have? What is the sequence length? And the GC content? Examine the “Per base sequence quality” and ”Per sequence quality scores” pages. Roughly, how many incorrect base calls are expected at most positions? Do you think this run gave good quality sequences?There are 10000 sequences of 38 nucleotides length. The total GC content is 37%.The expected incorrect base calls range from 1 in ~8000 (quality score = 39) to 1 in ~5000 (quality score = 37).The sequences are average good quality
- Examine the “Per base sequence content”, “Per base GC content” and “Per sequence GC content” pages. FastQC points out a “potential problem” with an orange exclamation mark. Do you think we should worry about it in this particular case?
- Examine the “Overrepresented sequences” page. Why does FastQC give a warning message? Hint: It identified a sequence that is repeated 17 times and that could be an adaptor contamination.
(Note that there is an error message in the “kmer content” page. In this case, this error is probably due to the small size of the dataset, but there are other factors that can also make this test fail)
- Load the data into FASTQC (note that there is no need to unzip it first).
- What can explain the nucleotide frequency pattern observed in the 8 first bases?
- What can explain the nucleotide frequency pattern observed in the rest of the sequence?
- Whould you be worried in this case about the sequence duplications levels, overrepresented sequences and kmer content? Why?
- Load the data into FASTQC.
- Look at the per base sequence quality and the per base sequence content and make a diagnosis of the data.
- Look at the overrepresented sequences. What is their source? Hint 1: There are two types (excluding the only “N” sequences); Hint 2: Use Blast and look at the description of the dataset.
exercise 2: quality filtering and trimming adaptor sequences using FASTX-Toolkit
Depending on the downstream program that will be used, we may need to quality filter and/or trim the adaptor sequences. We will work with DATASET 3.
- Remove the sequences that have more than 25% of the bases with Q < 30 using the appropiate fastx-toolkit program (once you find it, use “-h” option for usage). Note that you need to add “-Q 33″ to the command as the sequences have a different encoding (Phred + 33) than the default of fastx-toolkit (Phred + 64). How many reads are kept at the end of the process? Hint: use
fastq_quality_filter53832 reads are retained
- Look at the resulting file in FASTQC and make a new diagnosis of the data.
- Chekc if the adaptor sequence used to generate this library (SmallRNA3pAdapter_1.5 ATCTCGTATGCCGTCTTCTGCTTG) is on the sequences. Hint 1: use
- Clip the adaptor sequence of the quality filtered file using the appropiate fastx-toolkit program (once you find it, use “-h” option for usage). Set the options so that you remove the sequences that are shorter than 8 nucleotides and that you see the number of kept and discarded sequences on screen. Note that you need to add “-Q 33″ to the command as the sequences have a different encoding (Phred + 33) than the default of fastx-toolkit (Phred + 64). How many reads are kept at the end of the process? Hint: use
fastx_clipper52701 reads are retained
- Open the resulting file in FastQC and compare the results with the unclipped file. What do you observe? Hint: There is still some undesired sequences resulting from the library preparation/sequencing process remaining, problably due to “primer-dimers” created during the library preparation and sequencing.
- If time allows, go through another clipping step to remove the remaining contaminant sequences Hint1: The sequencing primer used to sequence this library and which is probably causingthe “primer-dimers” is SmallRNASequencingPrimer (CGACAGGTTCAGAGTTCTACAGTCCGACGATC); Hint 2: Concatenate clipping steps using pipes
- Examine the resulting file in FastQC once more. Are you satisfied with the outcome?
see also Other tools to verify quality of second-generation sequencing results are available:
- TRIMMOMATIC, a flexible read trimming tool for Illumina sequencing data.
- Galaxy, a web-based genomics pipeline, in which FASTX-Toolkit and FastQC are integrated.
- PRINSEQ, either as a standalone package or through a web-interface can generate summary statistics of sequence and quality data, which can subsequently be used to filter, reformat and trim next-generation sequence data.
- Perl and Bioperl, to write small scripts. There already exist a very large number of packages devoted to genomics in Bioperl.
- R and Bioconductor, are other solution to import and verify data. There are many contributed packages and modules available.