First steps in genomic data analysis
Compiled by Julia M.I. Barth, 21 January 2020
Table of contents
- The Variant Call Format (VCF)
- Hard quality filtering of variants
- Further quality filtering
- Filter on minimum read depth (DP) and genotype quality (GQ)
- Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels
- Remove individuals with a high amount of missing data
- Remove variants with a high amount of missing genotypes and filter on minor allele frequency
- Performing a principal component analysis (PCA) to inspect the data structure
Introduction
Background and Aim
This activity will guide you though some of the the initial steps in population genomic data analyses – that is, the first steps after the sequencing reads were mapped against a reference genome assembly and variants were called (if you’re interested in mapping and variant calling, check out the “Workshop on Genomics” exercises).
The aim is to get some hands-on experience working with VCF files, practice some of the considerations in producing a subset of high quality SNPs, and to explore the data structure using a principal component analysis (PCA). The aim is also to apply and expand on the UNIX and R skills that you learned in the previous activities: UNIX and R.
Learning goals
After this activity, you should be able to perform first quality checks and filtering steps, preparing a population genomic dataset file to explore e.g., population structure.
- Apply the UNIX and R skills you learned in the previous exercises.
- Learn about the VCF format and how to handle and manipulate VCF files.
- Plot different parameters to check the overall quality of the data set.
- Perform hard filtering, removing low quality variants.
- Perform further filtering to prepare the data file for analyses.
- Explore the data structure using a PCA.
Requirements
BCFtools v1.9
PLINK v1.9
EIGENSOFT smartpca v7.2.1
R / Rstudio
The data set
We will work with a data set containing genomic data of 204 individuals of Atlantic cod (Gadus morhua) sampled at seven sites in the Atlantic, North Sea, and Baltic Sea (Barth JMI et al. (2019) Disentangling structural genomic and behavioural barriers in a sea of connectivity. Mol Ecol 28, 1394–1411).
The individuals in this data set were sequenced using Illumina HiSeq paired-end sequencing to ~10x coverage, reads were mapped using BWA-mem to the Atlantic cod reference assembly, and variants were called jointly for a total number of about 860 individuals using the GATK Haplotypecaller pipeline. All filtering steps were performed per chromosome (i.e., per linkage group) for parallelized processing, followed by merging of high quality variants of all linkage groups for follow-up analyses.
In this activity, we will perform similar filtering steps, starting with the unfiltered raw VCF, but we will work with a reduced data file that contains only the 204 individuals in the study mentioned above, and only 10% randomly subsampled variants of one linkage group (LG 5) to reduce time and computer resources (the original compressed LG 5 VCF file contained 3,281,615 variants and had a file size for 11 GB). By randomly subsampling the variants, physically very close variant sites that may share the same information (they are in linkage disequilibrium) have already been excluded, for which reason this tutorial does not contain a step on filtering for linked sites. If you are interested in those steps, have a look at the tutorials from 2016 and 2018.
How to do this activity
- Connect to your Amazon instance via the terminal using SSH.
- Questions or activities are indicated with
Q . - Switch to Rstudio in the browser (http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787) if you see this:
R task . - All text in
red on gray background
indicates commands or file names that you can type or copy to the terminal. Text inblack on gray background
refers to terminal outputs. - Try to work independently using the provided manuals and information, discuss questions with your neighbor. If you get stuck, check the answer-box:
Show me the answer!
Oh no, only if you get really stuck!! First try to find the answer yourself!
There are often alternative options and other solutions to the tasks. E.g., whether you prefer cat
over less
doesn’t matter in most cases – go ahead and use your solution! Don’t hesitate to ask the workshop team if you have questions!
1) The Variant Call Format (VCF)
VCF (Variant Call Format) is a standardized text file format that is used to store genetic variation calls such as SNPs or insertions/deletions. The full format specifications and valuable information about the different tags can be found here.
In the following first part of the exercise, we will explore how the information in a VCF is stored, and how we can inspect it.
/home/popgen/workshop_materials/21_first_steps_genomics
.
What is the file size of cod204.lg05.1.vcf.gz
? How many bytes and how many MB does it have?
Show me the answer!
cd /home/popgen/workshop_materials/21_first_steps_genomics
ls -l
“long listing format” to see a list of the folder content including the file size: the VCF file is 320855142
bytes.ls -lh
to see the content list with the file sizes printed in “human-readable” format: it’s 306 M
(MiB, to see the the size in actuall megabyte (MB) use: ls -l --block-size=MB
.
Tip: for more functions of ls
type ls --help
or check the manual with man ls
.
The file cod204.lg05.1.vcf.gz
is a very small VCF file with reduced file size for the purpose of this activity. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.
cod204.lg05.1.vcf.gz
?
Show me a hint!
If you used head cod204.lg05.1.vcf.gz
you will only see gibberish. Why?
It’s a compressed file (indicated by the ending “.gz”).
Compressing a VCF file with gzip or bgzip and indexing it with tabix is the common way in which huge VCF files are stored.
To see the first ten lines, you could decompress the entire file. However, this is not advisable as it would take several minutes (with larger files even hours) and it would inflate the file size to several gigabyte (GB).
Instead, you can make use of a program that allows you to view a part of the decompressed the VCF file without decompressing the whole file. There are several different options and programs to do this – can you think of one?
Show me the answer!
zcat cod204.lg05.1.vcf.gz | head
(zcat
is decompressing the file, but head
tells the commands to stop decompressing any further after ten lines are printed)Or you could use:
zless cod204.lg05.1.vcf.gz
(similarly, zless
only decompresses the part you are actually displaying; exit zless just like less by pressing q
)Or you could use:
bcftools view cod204.lg05.1.vcf.gz | head
(BCFtools ia a software for variant calling and manipulation of VCF files. We will extensively work with BCFtools in the remaining exercise.)
Tip: Check out what these programs are doing with --help
or man
.
The first line is already telling you that this is indeed a VCF file:
##fileformat=VCFv4.2
These first lines, starting with ##
, are part of the “VCF header”, which stores meta-information about the data. If you use zless
you can scroll down to see more of the header and also the genotype data, which make up the majority of the file.
There can be a lot of information in the header. Different meta-information lines, e.g. “INFO”, “FILTER”, “FORMAT”, “ALT” describe types of information that are given for every variant in the VCF file. A few examples:
##INFO<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
: The number of of often the alternative allele is occuring.##FORMAT<ID=GT,Number=1,Type=String,Description="Genotype">
: The genotypes coded as allele values separated by/
(or|
if the data is phased).##contig<ID=LG01,length=28303952>
: Part of a list of chromosomes, contigs, or scaffolds of the reference assembly.
The allele values are 0
for the reference allele and 1
for the alternative allele. If there are more alternative alleles the second alternative allele will have the value 2
, the third 3
, etc. Missing data is denoted by dots .
.
For more information about the different tags, have a look at the VCF file format description.
The very last line of the header specifies the column names for the genotype data. The first eight columns are mandatory, follow by “FORMAT” and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AVE1409002 AVE1409003 AVE1409004 [...] TVE130512 TVE130513 TVE130514
You may want to edit these header lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to see what they did.
cod204.lg05.1.header.txt
.
To find the command, have a look at the BCFtools manual, or just type bcftools
on the command line to see the available options.
Give me a hint!
bcftools view
option. Either type bcftools view
on the command line, or have a look the manual.
Show me the answer!
bcftools view -h cod204.lg05.1.vcf.gz > cod204.lg05.1.header.txt
With
view -h
only the VCF header is selected (the header is small, therefore the small “h”). By using the capital -H
the header is suppressed and only the “large” body of genotype data is considered.
cod204.lg05.1.header.txt
, compared to the whole VCF file?
Show me the answer!
ls -l
The header is 7476 bytes (7 KB) small. This is only about 0.002% of the entire VCF, and a size we can safely open with a text editor.
##fileformat=VCFv4.2
stating the date of today: ##fileDate=20200121
.
Save the new header as: cod204.lg05.1.header.ed.txt
. Use BCFtools to replace (or “rehead”) the VCF file cod204.lg05.1.vcf.gz
with the new edited header and save the re-headed VCF as cod204.lg05.1.ed.vcf.gz
.
Make sure your changes took effect and your file is still valid by visually inspecting the file and the file size of cod204.lg05.1.ed.vcf.gz
.
Tip: Type bcftools
to see general options for the program; then use bcftools reheader
to see the more specific options for the reheader function.
Show me the answer!
nano cod204.lg05.1.header.txt
or vi cod204.lg05.1.header.txt
Add
##fileDate=20200121
, save the file (nano: “Ctrl + o”, vi: “:w”; don’t forget to save under the new name), and close the text editor (nano: Ctrl + x, vi: “:q”)bcftools reheader -h cod204.lg05.1.header.ed.txt cod204.lg05.1.vcf.gz -o cod204.lg05.1.ed.vcf.gz
(with -h
the header file is specified, with -o
the new output file name)With
zless cod204.lg05.1.ed.vcf.gz
you can check the header lines, and by using ls -lh
you can see if the file has still roughly the same size.
In the genotype part of the VCF file, one line corresponds to one variant site. Each line starts with the chromosome ID “CHROM”, the position “POS”, and the variant “ID”, and is followed by the bases of the two alleles (“REF” for the reference allele, and “ALT” for the alternative allele(s)). The 6th column (“QUAL”) is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, “PASS” tells us that the variant passed all filters.
In the “INFO” column, information related to the header info tags for each variant site is given (e.g., AC is the allele count in genotypes for the “ALT” allele(s), AF the allele frequency of the “ALT” allele, and AN is the total number of alleles, etc. We will use some of the “INFO” field specifications later on for hard quality filtering).
The “FORMAT” column specifies the fields and order of the data that is given with each individual for each genotype and the remaining columns list each individual with the values corresponding to the “FORMAT” field. The first “FORMAT” field is always the genotype (GT), the remaining fields are variable, but in our data file, GT is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood):
This is how the first variant of the genotype part of the VCF file looks like:
LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248
Show me the answer!
wc --help
to see the options.We know that the genotype part of the VCF file contains one line per variant site, meaning that we can use BCFtools to suppress the header and forward (pipe;
|
) the output to wc -l
. Using the flag -l
means “count the lines”:
bcftools view -H cod204.lg05.1.vcf.gz | wc -l
(takes ~1 min)
328253
A) What kind of variants do you see (e.g., SNP, insertion/deletion)?
Show me the answer!
zless cod204.lg05.1.vcf.gz
The first variant site (“338”) is a SNP. Columns four and five for “REF” (the reference allele) and “ALT” (the alternative allele) show A and G.
The second variant site (“357”) is a deletion. Column four contains multiple bases, while column five shows a G.
LG05 338 . A G
LG05 357 . GTGGGCGCTACTCCAGGCAAAGGAGGACTCTGTACCGGAGGCTTGGATCAGCT G
Sometimes you may also see an asterisk (*), which represents a spanning deletion.
B) What is the genotype (GT) of the first individual (AVE1409002) at the first variant site (LG05 338)?
Show me the answer!
0/0
, which means twice the reference allele A (A/A). The allele values in the GT field are 0 for the reference allele (the “REF” column) and 1 for the first alternative allele (“ALT”). GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855
C) What is the genotype read depth (DP, specified as “FMT/DP” to distinguish it from the mean DP “INFO/DP” per variant site) for the first three individuals (AVE1409002, AVE1409003, AVE1409004) at the first variant site (LG05 338)?
Show me the answer!
GT:AD:DP:GQ:PL
0/0:20,0:20:57:0,57,855
0/0:9,0:9:27:0,27,345
0/0:10,0:10:30:0,30,381
2) Hard quality filtering of variants
To remove low-quality variants (e.g., variants with a low read depth or variants only supported by poorly aligned reads) from our data set, we will apply hard quality filtering. In the terminology of the authors of GATK, “hard filtering” refers to using specific filtering cutoffs for specific annotations. This stands opposed to using the so-called Variant Quality Score Recalibration (VQSR) method, which does not use any specific “hard cutoffs” but combines information from multiple annotations from a known, highly validated variant training set in a machine learning algorithm. Variants that are not passing the chosen thresholds or the VQSR filtering can either be completely removed or they can be kept in the VCF, but a flag is then added to them noting that they did not pass the filter. Since often such a training set is not available, we will apply “hard quality filtering” here.
The following part of the activity is adjusted from this “GATK tutorial“.
2.1) Inspection of quality measurements
Before we choose the measurements and thresholds for hard quality filtering, let’s start by having a look at some of these measurements and the distributions of their values.
We will consider the following seven measurements, which are given for each variant site in the “INFO” column of the VCF (more information about each of these measurements is given further below):
Fisher Strand (FS), Strand Odds Ratio (SOR), RMS Mapping Quality (MQ), Mapping Quality Rank Sum Test (MQRankSum), Quality by depth (QD), Read Position Rank Sum Test (ReadPosRankSum), and the combined depth across samples (INFO/DP).
Below, the measurements of the first variant site are highlighted:
LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248
cod204.lg05.1_FS.txt
.
Tip: Look at the examples given in the manual. The FS values should be saved as one column, with one line for the value of each variant.
Show me the answer!
bcftools query cod204.lg05.1.vcf.gz -f'%FS\n' > cod204.lg05.1_FS.txt
(takes ~1 min)With the
-f
option, you can define the output format. The %FS
indicates that for each variant, the FS value should be printed, and \n
tells the program that after each FS value there should be a new line.
Show me the answer!
head cod204.lg05.1_FS.txt
will show you the first lines. There is just one column.Typing
wc -l cod204.lg05.1_FS.txt
will show you the number of lines, which is the same as the number of variant sites we have ( 328253
)With
bcftools view -H cod204.lg05.1.vcf.gz | less
you can visually check if the first variants in the file and compare them to the extracted FS values.
http://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8787
Open the provided script “gda_plotting.R” in R and follow the points 2) and 2.1) in that R script.
Show me the plot!
Fisher Strand (FS), as well as Strand Odds Ratio (SOR), are measurements for strand bias – that is a sequencing bias, in which the reads of one DNA strand (e.g., the forward strand) support one genotype (e.g., a heterozygous call), while the reads for the reverse strand support e.g., a homozygous genotype. In cases of extreme strand bias, SNPs are not correctly called.
- Fisher Strand (FS) is the Phred-scaled probability that there is strand bias at that site. It tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. A value close to 0 indicates little or no strand bias. GATK recommends hard filtering of variants with FS greater than 60, but this is a very conservative value that will only filter out cases of extreme bias and leave many false positives in the data set.
Show me the answer!
Setting such thresholds depends on the quality of the data set, as well as the planned downstream analyses. Here, we plan to inspect population structure, so we could risk removing some false negatives for the sake of overall quality improvement.
cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
.
Tip: Each measurement should be saved in one column separated by tabs, with each variant on a single line.
Show me the answer!
bcftools query cod204.lg05.1.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\n' > cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
Again, -f
defines the output format. The %FS\t%SOR\t
… indicates that for each variant first the FS value should be printed, then a tab \t
should be printed, followed by the SOR value, followed by a tab etc… . At the end, the \n
tells the program that after all six measurements, there should be a new line.
Show me the plot!
- Strand Odds Ratio (SOR) is also a measurement for strand bias, but it takes into account the ratios of reads that cover both alleles and is therefore a better measurement for sites that have more reads in one direction than the other. SOR values range from 0 to greater than 9, a value close to 0 indicates little or no strand bias. GATK recommends hard filtering of variants with SOR greater than 3.
- RMS Mapping Quality (MQ) and Mapping Quality Rank Sum Test (MQRankSum) are both measures for mapping quality. The Root Mean Square (RMS) mapping quality (MQ) of reads across all samples provides an estimation of the overall mapping quality to support a variant call. It includes the standard deviation (SD) of the mapping qualities, where a low SD indicates values close to the mean. Good mapping qualities are around MQ 60. GATK recommends hard filtering of variants with MQ less than 40.
- The Mapping Quality Rank Sum Test (MQRankSum) compares the mapping qualities of reads supporting the reference and the alternate allele. If all reads map correctly, the expected MQRankSum is 0. Negative values indicate a lower MQ of the reads supporting the alternate allele, compared to reads supporting the reference allele. MQRankSum values usually range from about -10.5 to 6.5. GATK recommends hard filtering of extreme outlier variants with MQRankSum less than -12.5.
- Quality by depth (QD) is the “confidence in a call” (i.e., the QUAL score), normalized by the read depth at that site. GATK recommends hard filtering of variants with QD below 2.
- The Read Position Rank Sum Test (ReadPosRankSum) measures the relative position of the reference versus the alternative allele within reads. E.g., SNPs towards the ends of reads may be more likely a sequencing error. A value around 0 means there is little to no difference in where the alleles are found relative to the ends of reads. GATK recommends hard filtering of variants with ReadPosRankSum less than -8.0.
- The combined depth across samples (INFO/DP) is simply the sum of the read depth (i.e., the genotype depth (FORMAT/DP) fields) over all samples. Variants that have a high mean depth across all samples are indicative of either repetitive regions or paralogs. A good rule of thumb is to remove variants that show twice the mean read depth. You can calculate the mean read depth in R:
mean(cod204.lg05.1_hf.values$DP)
. Now, don’t be shocked about the high read depth. This field has not been updated after reducing the vcf from 860 individuals to 204 individuals, so the information is given for all 860 individauls. We can use this field nevertheless for excluding variants with read depth larger than twice the mean.
Show me the answer!
Again, choosing filtering thresholds is highly depended on the quality of the data and the planned downstream analyses. Here are some possible suggestions for variants to exclude:
- FS (GATK recommendation FS > 60): FS > 40 (we are more strict here)
- SOR (GATK recommendation SOR > 3): SOR > 3 (this seems just fine)
- MQ (GATK recommendation MQ < 40): MQ < 40 (we would loose many variants with a lower threshold, so we stick to the recommended)
- MQRankSum (GATK recommendation MQRankSum > -12.5): MQRankSum > -5.0 and MQRankSum < 5.0 (a value around 0 is preferred and we don’t loose many variants if we set strict thresholds)
- QD (GATK recommendation QD > 2): QD > 2 (we would loose many variants with a lower threshold, so we stick with the recommendation)
- ReadPosRankSum (GATK recommendation ReadPosRankSum > -8.0): ReadPosRankSum > -4.0 (not many variants are below -8, so we can set it more strict)
- INFO/DP (no GATK recommendation, dependend on sample number and sequencing depth): INFO/DP > 16,000 (the mean DP is 8400.935, we exclude sites that show about twice the mean DP)
2.2) Application of hard filtering thresholds
We will now apply the filtering thresholds and remove variants that don’t match the chosen thresholds.
Don’t forget to specify the input file cod204.lg05.1.vcf.gz
, the output format (we’ll use -O z
since we would like to save the data file again as compressed VCF), and the output filename (-o cod204.lg05.1.hf.vcf.gz
). Type bcftools filter
in the command line to see the options.
Tip: It is possible to either use -e
and the logical operator “or” ||
to exclude low-quality variants from the file, or use -i
and the logical operator “and” &&
to include high-quality variants.
Running the filtering will take about five minutes, so time for a break and some coffee!
Show me the code!
A possible solution choosing the “exclude” option:
bcftools filter -e 'FS>40.0 || SOR>3 || MQ<40 || MQRankSum<-5.0 || MQRankSum>5.0 || QD<2.0 || ReadPosRankSum<-4.0 || INFO/DP>16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.1.vcf.gz
A possible solution choosing the “include” option:
bcftools filter -i 'FS<40.0 && SOR<3 && MQ>40.0 && MQRankSum>-5.0 && MQRankSum<5 && QD>2.0 && ReadPosRankSum>-4.0 && INFO/DP<16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.1.vcf.gz
! Be aware that the commands will not produce the exact same results. The reason for this is that some measurements will not be available for all sites, because there are, for example, too many missing genotypes to calculate this measurement. In such cases, these sites would not be excluded (because there is no measurement to exclude), so they would remain in the filtered file, while with the include option, they would not be included (since there is no measurement based on which they could be included).
- What is the file size?
- How many variant sites have been removed? Tip: see Q6
- Has the FS filter been applied correctly? What is the largest FS value?
Tip: usebcftools query
as you did above (Q8) but instead of saving the output, pipe it directly tosort
. Use the correct flags with “sort” (see the options withsort --help
and only check the first values by piping the output again tohead
.
Show me the answer!
ls -lh
213M
– the file is about 30% smaller than the unfiltered one.bcftools view -H cod204.lg05.1.hf.vcf.gz | wc -l
231014
– the file also has about 30% fewer variant sites.bcftools query cod204.lg05.1.hf.vcf.gz -f'%FS\n' | sort -gr | head
– we need the sort flag-g
since we have floating point numbers, and since we are interested in the largest value, we need to reverse the sorting using the flag-r
. Usingsort -g | tail
would work just the same way.
39.994
39.989
39.913
39.91bcftools query cod204.lg05.1.hf.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\n' > cod204.lg05.1.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
cod204.lg05.1.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt
, and plot the distributions using Rstudio and the script “gda_plotting.R” (section 2.1.2).
Show me the plot!
3) Further quality filtering
After hard quality filtering, we have a data set containing only variant sites that we trust with high confidence. However, so far we have only removed variants that had skewed values across all samples – we haven’t yet looked at the individual genotypes at each variant site. For example, we may have kept a variant that has a high quality across all 204 individuals, but there are three individuals where the read depth (FMT/DP) or the genotype quality (GQ) for the individual genotype is very low, so we don’t trust the calls for these individuals.
LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 0/0:5,0:5:15:0,15,209 [...] 0/0:7,0:7:21:0,21,248
Furthermore, we will exclude SNPs that are in the close proximity of indels. That is, because the proper alignment of sequences around indels can often be problematic and produce false positives.
For example, dependent on the “gap opening penalty”, a region can be aligned to have four indels (‐) and a SNP.
CTTTGAACCAAATTT----ACTGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACT--AGGT-TGTTGCC
Or it can be aligned to have three indels (‐) and four SNPs.
CTTTGAACCAAATTT--TCAGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACTAGGT-TGTTGCC
In addition, we will also remove monomorphic SNPs (i.e., SNPs that were called because they differ from the reference, but that are not polymorphic in our sample).
Then, since we are only interested in biallelic SNPs for the downstream analyses in this activity, we will remove indels and multiallelic SNPs (for data sets were indels are also going to be analysed, it is recommended to subset indels form SNPs as the very first step and adjust the above hard filtering thresholds separately).
And last but not least, we will remove individuals with high amounts of missing data, as well as variants with lots of missing genotypes, and we will filter on the minor allele frequency (MAF).
3.1) Filter on minimum read depth (DP) and genotype quality (GQ)
We will again use bcftools filter
and filter reads that have a read depth below 3 or a genotype quality below 20. The difference is, that instead of the whole variant site, we are now considering single genotypes (the information in the “FORMAT” fields”) using -e 'FMT/DP<3 | FMT/GQ<20'
and we will not remove the whole variant site, but only set the respective genotypes to missing (./.
) by using the bcftools filter -S .
command.
Use the already “hard-filtered” data file as input cod204.lg05.1.hf.vcf.gz
, specify a compressed VCF as the output format -O z
, and name the new output file (-o cod204.lg05.1.hf.DP3.GQ20.vcf.gz
).
bcftools filter
, and the manual. It takes ~3 min to run the command.
Show me the command!
bcftools filter -S . -e 'FMT/DP<3 | FMT/GQ<20' -O z -o cod204.lg05.1.hf.DP3.GQ20.vcf.gz cod204.lg05.1.hf.vcf.gz
Here we use the logical operator “|” instead of “||”, since using “||” would mean that every genotype at a variant site is set to missing, even if only one genotype doesn’t fulfill the threshold. More information about the use of these operators in BCFtools can be found here.
You may wonder whether a minimum read depth of three is sufficient to confidently call the variant. In our case it is, since the GATK Haplotypecaller pipeline calls genotypes across all 860 samples simultaneously and includes information from all reads from all individuals to adjust the genotype quality. A call with DP=3 may thus have a high GQ and can be confidently kept.
With the following command we extract genotypes of samples that have a read depth lower than three -i 'FMT/DP<3'
and in the output we would like to see the genotype (GT), followed by the read depth (DP) and the genotype quality (GQ) for that sample -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n'
.
Before you execute the following command, have a brief thought about the expected output regarding the genotypes.
bcftools query -i 'FMT/DP<3' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n' cod204.lg05.1.hf.DP3.GQ20.vcf.gz | less -S
Show me the answer!
./.
. Thus, when only extracting genotypes of samples that have a read depth lower than three, we would expect them to be all set to missing.
GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6
GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 [...]
You may also check the genotypes with a read depth of exactly 3. bcftools query -i 'FMT/DP=3' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n' cod204.lg05.1.hf.DP3.GQ20.vcf.gz | less -S
.
You will see that there are only few genotypes with a high genotype quality (GQ) that are not set to missing.
3.2) Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels
- To remove monomorphic SNPs, we will use
bcftools filter
as before to exclude-e
all sites at which no alternative alleles are called for any of the samplesAC==0
and all sites at which only alternative alleles are calledAC==AN
. In addition, to remove SNPs that are within 10 bp of indels, we add the flag--SnpGap 10
. - To reduce the data file to contain only biallelic SNPs, we use the
bcftools view
command using the flags-m2 -M2 -v snps
, which set both the minimum-m
and maximum-M
number of alleles to 2, and keeps only snps via-v snps
.
You can execute both commands in one command line by piping |
the output of the first command (bcftools filter …) to the second command (bcftools view …). This way you only have to specify the input file cod204.lg05.1.hf.DP3.GQ20.vcf.gz
with the first command, and the output format -O z
and output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
with the second command.
bcftools filter
and bcftools view
, and the manual. It will take ~2 min to run.
Show me the command!
bcftools filter -e 'AC==0 || AC==AN' --SnpGap 10 cod204.lg05.1.hf.DP3.GQ20.vcf.gz | bcftools view -m2 -M2 -v snps -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
zless -S
).
Show me how!
zless -S cod204.lg05.1.hf.DP3.GQ20.vcf.gz
to see the original file,and compare it with the output file:
zless -S cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
Or: bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | less -S
By using less -S
we “chop” the long lines and only display the beginning of each line, making it easier to compare the reference and alternative alleles and the allele numbers. You can scroll horizontally across the lines using the arrow keys.
Can you identify what has changed?
Show me the answer!
bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | wc -l
82594
Remember, we had 328,253 variants at the beginning and 231,014 variants after the hard quality filtering steps, now after the last filtering steps we’re down to 82,588.
3.3) Remove individuals with a high amount of missing data
Sometimes, sequencing doesn’t work as well for some individuals as it does for others. Such individuals will be marked by low quality genotypes and low read depth, resulting in high amounts of missing data after the above filtering steps.
We can check the amount of missing data by using the bcftools stats
command. Have a look at the options by typing bcftools stats
in the terminal or check the manual for what it can do. With -s -
we can request stats for all samples.
Run bcftools stats -s - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
and quickly scroll through the large output.
Find the section PSC, Per-sample counts
, which displays some summary statistics per sample. The last column named [14]nMissing
displays the number of missing genotypes per sample.
There is also other valuable information that allows us for example to assess the effectiveness of the filtering steps above (e.g., in clolumn [9]nIndels
, also under PSC
the number should be 0 since we removed all indels).
Also have a quick look at the number of [11]nSingletons
, listed in column 11. We will get back to that in part 3.4) of the activity.
cod204.lg05.1.hf.DP3.GQ20.allele.imiss
.
Visually check that the file is as expected.
Show me the code!
bcftools stats -s - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | grep -E ^PSC | cut -f3,14 > cod204.lg05.1.hf.DP3.GQ20.allele.imiss
Do the visual check e.g., using
less cod204.lg05.1.hf.DP3.GQ20.allele.imiss
Show me the plot!
bcftools view -s
command to remove the three individuals with more than 70% missing data from the data file cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
.
Save the output as compressed VCF by using option -O z
and specify the output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
with option -o
. It takes ~1 min to run this command.
Show me the code
bcftools view -s ^ORE1203009,ORE1203008,ORE1203013 -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz
cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
. Have the three individuals been successfully removed?
Show me the code
bcftools query -l cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz | wc -l
201
Yes, our data set had 204 individuals before.
3.4) Remove variants with a high amount of missing genotypes and filter on minor allele frequency
In the missingness plot above, we saw that the data set in general has a lot of missing data. This is because we set genotypes with low quality and low read depth to missing. As the penultimate step, we will now remove all variants that have more than 20% missing data.
As the last step, we will remove the singletons that you saw above in the bcftools stats
command. Singletons – alleles that are observed only once – and other alleles that have a low minor allele frequency (MAF) frequently present false calls, which can be obstructive in downstream analyses by adding noise to the data (note however, that you may also need them e.g., for demographic analyses).
If the data will be used for a genome-wide association study (GWAS), usually a rather high MAF threshold is applied, as it requires very strong statistical power to make meaningful statements about very rare alleles. When interested in population structure with many samples, alleles present in only one individual do not add information and can therefore be removed. Thus, the optimal MAF filtering threshold depends on the type of data you have, the analysis you are planning to do, the data quality, and the sample size.
Show me the answer!
A slightly higher MAF threshold of 0.05 would exclude all sites that are called less than 20.1 times (5% of 402). Given that we know that we have samples with 20 individuals (40 alleles), such number of minor alleles could very well be true variants and should rather not be removed.
Note: for the above calculations, I assumed that we have 402 allels per site. However, in BCFtools, MAF is calculated as the minor allele count (MAC) divided by the total number of alleles (AN), meaning that missing data is taken into account. Since we filter simultaneously for missing data, AN will be between 322 and 402.
If missing data should not be taken into account, filtering can also be done based on MAC.
Taken together, probably something between 0.01 and 0.03 could be a good MAF threshold for our data set.
bcftools filter -e
to exclude all variants that have more than 20% missing genotypes or have a minor allele frequency smaller or equal to 0.02 from the data file cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
. Save the output as compressed VCF -O z
and use the output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz
.
Show me the code!
bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.02' -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz
What would be the lowest alternative allele count to expect?
Use again bcftools query
and extract the INFO field “AC”, pipe the output into sort and print only the first lines.
Show me the code!
bcftools query cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz -f'%AC\n' | sort -g | head
7
7
7
[…]
OPTIONAL R task: Switch to Rstudio, adjust the code of point 3) in the “gda_plotting.R” script and plot the missingness per sample after the above filtering steps. Has it changed?
Show me the answer!
bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz | wc -l
From 328,253 variants, 11,864 SNPs are left. This means we removed 96% percent of the data.
4) Performing a principal component analysis (PCA) to inspect the data structure
The “smartpca” program implemented in the software EIGENSOFT uses a combination of principal component analysis (PCA; a technique used to identify and illustrate the strongest components of variation in a dataset) and new statistics designed specifically for genomic data. The methods are described in Patterson et al.(2006) PLOS Genetics 2(12): e190.
The EIGENSOFT smartpca needs four input files: The “.ped” (containing the genotype information) and “.map” (containing the chromosome/position information) files that we will produce using the software PLINK1.9, a “.pedind” file containing individual information, and a ‘.par’ file specifying the analysis settings.
cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz
to the PLINK “.ped/.map” format by running the following command:
plink --vcf cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz --aec --recode --out cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02
If you would like to know more about this format have a look here.
Use sed
to replace the “LG” in the cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map
file with “” (nothing) and save the output in a new file called tmp
.
Verify that the new file is correct. If so, rename tmp
again to cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map
.
Show me the command!
cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map | sed s'/LG//'g > tmp
The “stream editor”
sed
will substitute s/
the found expression “LG” with “” (nothing) within the “global” /g
environment, meaning the whole line/file.
less tmp
mv tmp cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map
cut
the first five columns of the “.ped” file and save them in a new file called “tmp”.
Show me the command!
cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cut -d ' ' -f1-5 > tmp
The default delimiter whitespace in
cut
are tabs, but the .ped file has spaces. We can define this with -d ' '
. With -f
we specify columns 1 to 5.
Have a look at the “tmp” file. It contains the family IDs and the individual IDs in the first two columns (in our case this is the same), followed by zeros that indicate “no information” for the remaining columns (fatherID, motherID, sex). In addition, we need a column to assign individuals to populations.
To do this, “cut” the first three characters (see cut --help
) from cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped
and “paste” them to the tmp
file. Save the output as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind
.
Remove the “tmp” file.
Show me the command!
cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cut -c 1-3 | paste tmp - > cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind
“cut -c 1-3” selects the first three characters, which are combined via the standard output (stdout, represented by “
-
) with the saved “tmp” file.rm -i tmp
template.par
to include the respective input file names, save the par file as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par
, and run the smartpca with the syntax smartpca -p cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par > pca01.out
.
Show me the plot!
The first principal component axis explains almost 2% of the total genomic variation and separates the eastern Baltic Sea (dark blue) from the remaining samples. The second axis explains 1% of the total genomic variation and separates the North Sea populations from the fjord-type cod. Although we used just a tiny fraction of the data, the population structure is comparable with Fig. 1b in Barth JMI et al. (2019) Mol Ecol 28, 1394–1411.
Well done!! Now it is time for socializing. My suggestion for todays pub is the “Zapa bar”.