Quality assessment and quality control of sequence data
expected learning outcomes
Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amounts of error. Therefore, it is necessary to understand, identify and exclude error-types that may impact the interpretation of downstream analysis.
The objective of this activity is to understand some relevant properties of raw sequence data. We will focus on properties such as length, quality scores and base and k-mer distribution in order to assess the quality of the data and discard low quality or uninformative reads.
We will use basic UNIX commands, FastQC, and Trimmomatic applied to several datasets. We will use datasets from several source experiments (genome shotgun, amplicons, RAD-tags, metagenome shotgun and microRNA) so that you can see the particularities of the different data types.
exercise 1: checking Illumina data with FastQC The goal of this exercise is to learn how to inspect sequence data using FASTQC, both using the user interface and command line. Keep at hand the documentation for FastQC.
DATASET 1: bartonella_illumina.fastq: a small subset of an Illumina run (v.1.5) belonging to a bovine isolate of the intracellular bacterium Bartonella
- Have a look at the bartonella_illumina.fastq file.
Show hintUse less or more.
- How many reads are there?
Show HintUse grep or wc.
- Launch FastQC by typing fastqc in the terminal window. Note that if you want to be able to use the terminal after you launch fastqc, you need to add an “&” at the end of the command. Also, note that you can have several terminal windows open at the same time.
- Load the bartonella_illumina.fastq file into FastQC (File->Open). You can view the results either within the FastQC application or the exported report.
- Inspect the data contained in the sequence file, bartonella_illumina.fastq. Have a look at the numbers output on the “Basic Statistics” page. How many sequences do we have? What is the sequence length? And the GC content? Examine the “Per base sequence quality” and “Per sequence quality scores” pages. Roughly, how many incorrect base calls are expected at most positions? Do you think this run gave good quality sequences?
Show AnswerThere are 10000 sequences of 38 nucleotides length. The total GC content is 37%.
Show answerThe expected incorrect base calls range from 1 in ~8000 (quality score = 39) to 1 in ~5000 (quality score = 37).The sequences are average good quality
- Examine the “Per base sequence content”, “Per base GC content” and “Per sequence GC content” pages. FastQC points out a “potential problem” with an orange exclamation mark. Do you think we should worry about it in this particular case?
”ShowHint: Look at the Evaluating Results page of FastQC.
- Examine the “Overrepresented sequences” page. Why does FastQC give a warning message?
Show hintHint: It identified a sequence that is repeated 17 times and that could be an adaptor contamination.
DATASET 2: EMP_Misc_16v4EMP_NoIndex_L005_R1_001-sample.fastq.gz: a small subset of an Illumina run (v.1.9) belonging to one Earth Microbiome Project environmental sample sequenced for 16S amplicons (demultiplexed from a set of pooled samples).
- Load the data into FASTQC (note that there is no need to unzip it first).
- What can explain the pattern observed in the “Per base sequence content”? What can explain the particular pattern observed at the first 8 bases?
Show hintHint: Look at the description of the dataset.
- Which of the warnings given by fastqc we should worry about? Why?
Show hintHint: Look at the description of the dataset.
DATASET 3: CAN.tgz: A compressed folder that contains 12 fastq files, each corresponding to a RAD sequenced sample.
- Run fastqc on all the files using the command line version of the program.
Show hintHint1: You need to untar/unzip the folder first. Hint2: Do “fastqc -h” for options.
- Examine some of the output text files and look at the graphical results on your web browser. How can you explain the pattern observed in the first six bases?
Show hintHint: the key is in the RAD-seq library preparation protocol that makes all reads start by the same few nucleotides, corresponding to the restriction site of the enzyme used to create the library.
exercise 2: quality filtering and trimming adaptor sequences using Trimmomatic Depending on the downstream program that will be used, we may need to quality filter and/or trim the adaptor sequences. We will work with datasets 4 and 5.
DATASET 4: sample_1_P1.fq and sample_1_P2.fq: a subset of an Illumina MiSeq Paired End run corresponding to 20 multiplexed shotgun metagenomes.
- Load the file into FASTQC and make a diagnosis of the data.
- Let´s assume that for our downstream analyses, we need very high quality paired long reads. Use trimmomatic in paired end mode to improve the overall quality of the data (Phred scores > 35) by keeping only long reads (length >80 bp). Try several sliding window sizes to achieve this.
Show hintHave a look at the documentation for Trimmomatic; note that paired end data produces four output files, whose names need to be provided when calling the program.
- Can you explain what each of the output files represent?
- Think about cases where a so strict quality control might not be necessary.
DATASET 5: SRR026762-sample.fastq: a subset of an Illumina run belonging to a microRNA profiling experiment in human embryonic stem cells
- Load the data into FASTQC and make a diagnosis of the data. What is the source of the overrepresented sequences?
Show hintHint 1: There are two types (excluding the only “N” sequences); Hint 2: Use Blast and look at the description of the dataset.
- Use Trimmomatic in single end mode to remove low quality bases and trim adaptor sequence. The adaptor used to generate this library was SmallRNA3pAdapter_1.5 ATCTCGTATGCCGTCTTCTGCTTG.
Show hintNote that Trimmomatic provides adaptor sequences, whcih we have placed in your working directory; however, the adaptor sequence you need here may not be included.
- Look at the resulting file in FASTQC and make a new diagnosis of the data. What do you observe?
Show hintHint: There is still some undesired sequences resulting from the library preparation/sequencing process remaining, problably due to “primer-dimers” created during the library preparation and sequencing.
- Try to identify those sequences by increasing the k-mer length in fastqc (command line).
- If time allows, go through another clipping step to remove the remaining contaminant sequences
Show hintHint1: The sequencing primer used to sequence this library and which is probably causingthe “primer-dimers” is SmallRNASequencingPrimer (CGACAGGTTCAGAGTTCTACAGTCCGACGATC)
- Examine the resulting file in FastQC once more. Are you satisfied with the outcome?
see also Other tools to verify quality of second-generation sequencing results are available:
- Fastx-toolkit:a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
- Galaxy, a web-based genomics pipeline, in which FASTX-Toolkit and FastQC are integrated.
- PRINSEQ, either as a standalone package or through a web-interface can generate summary statistics of sequence and quality data, which can subsequently be used to filter, reformat and trim next-generation sequence data.
- Perl and Bioperl, to write small scripts. There already exist a very large number of packages devoted to genomics in Bioperl.
- R and Bioconductor, are other solution to import and verify data. There are many contributed packages and modules available.