First steps in genomic data analysis

Compiled by Julia M.I. Barth, 21 January 2020





Table of contents


  1. The Variant Call Format (VCF)
  2. Hard quality filtering of variants
  3. Further quality filtering
  4. Performing a principal component analysis (PCA) to inspect the data structure



Introduction


Background and Aim
This activity will guide you though some of the the initial steps in population genomic data analyses – that is, the first steps after the sequencing reads were mapped against a reference genome assembly and variants were called (if you’re interested in mapping and variant calling, check out the “Workshop on Genomics” exercises).
The aim is to get some hands-on experience working with VCF files, practice some of the considerations in producing a subset of high quality SNPs, and to explore the data structure using a principal component analysis (PCA). The aim is also to apply and expand on the UNIX and R skills that you learned in the previous activities: UNIX and R.


Learning goals

After this activity, you should be able to perform first quality checks and filtering steps, preparing a population genomic dataset file to explore e.g., population structure.

  • Apply the UNIX and R skills you learned in the previous exercises.
  • Learn about the VCF format and how to handle and manipulate VCF files.
  • Plot different parameters to check the overall quality of the data set.
  • Perform hard filtering, removing low quality variants.
  • Perform further filtering to prepare the data file for analyses.
  • Explore the data structure using a PCA.



Requirements
BCFtools v1.9
PLINK v1.9
EIGENSOFT smartpca v7.2.1
R / Rstudio


The data set

We will work with a data set containing genomic data of 204 individuals of Atlantic cod (Gadus morhua) sampled at seven sites in the Atlantic, North Sea, and Baltic Sea (Barth JMI et al. (2019) Disentangling structural genomic and behavioural barriers in a sea of connectivity. Mol Ecol 28, 1394–1411).

Sample map and picture of Atlantico cod

The individuals in this data set were sequenced using Illumina HiSeq paired-end sequencing to ~10x coverage, reads were mapped using BWA-mem to the Atlantic cod reference assembly, and variants were called jointly for a total number of about 860 individuals using the GATK Haplotypecaller pipeline. All filtering steps were performed per chromosome (i.e., per linkage group) for parallelized processing, followed by merging of high quality variants of all linkage groups for follow-up analyses.

In this activity, we will perform similar filtering steps, starting with the unfiltered raw VCF, but we will work with a reduced data file that contains only the 204 individuals in the study mentioned above, and only 10% randomly subsampled variants of one linkage group (LG 5) to reduce time and computer resources (the original compressed LG 5 VCF file contained 3,281,615 variants and had a file size for 11 GB). By randomly subsampling the variants, physically very close variant sites that may share the same information (they are in linkage disequilibrium) have already been excluded, for which reason this tutorial does not contain a step on filtering for linked sites. If you are interested in those steps, have a look at the tutorials from 2016 and 2018.



How to do this activity

  • Connect to your Amazon instance via the terminal using SSH.
  • Questions or activities are indicated with Q.
  • Switch to Rstudio in the browser (http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787) if you see this: R task .
  • All text in red on gray background indicates commands or file names that you can type or copy to the terminal. Text in black on gray background refers to terminal outputs.
  • Try to work independently using the provided manuals and information, discuss questions with your neighbor. If you get stuck, check the answer-box:

    Show me the answer!

    Oh no, only if you get really stuck!! First try to find the answer yourself!

There are often alternative options and other solutions to the tasks. E.g., whether you prefer cat over less doesn’t matter in most cases – go ahead and use your solution! Don’t hesitate to ask the workshop team if you have questions!



1) The Variant Call Format (VCF)


VCF (Variant Call Format) is a standardized text file format that is used to store genetic variation calls such as SNPs or insertions/deletions. The full format specifications and valuable information about the different tags can be found here.
In the following first part of the exercise, we will explore how the information in a VCF is stored, and how we can inspect it.

Q1 Open the terminal and navigate to the analysis folder including the data file for this activity: /home/popgen/workshop_materials/21_first_steps_genomics.
What is the file size of cod204.lg05.1.vcf.gz? How many bytes and how many MB does it have?

Show me the answer!

cd /home/popgen/workshop_materials/21_first_steps_genomics
ls -l “long listing format” to see a list of the folder content including the file size: the VCF file is 320855142 bytes.
ls -lh to see the content list with the file sizes printed in “human-readable” format: it’s 306 M (MiB, to see the the size in actuall megabyte (MB) use: ls -l --block-size=MB.

Tip: for more functions of ls type ls --help or check the manual with man ls.

The file cod204.lg05.1.vcf.gz is a very small VCF file with reduced file size for the purpose of this activity. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.

Q2 What is written in the first ten lines of the VCF file cod204.lg05.1.vcf.gz?

Show me a hint!

If you used head cod204.lg05.1.vcf.gz you will only see gibberish. Why?
It’s a compressed file (indicated by the ending “.gz”).
Compressing a VCF file with gzip or bgzip and indexing it with tabix is the common way in which huge VCF files are stored.

To see the first ten lines, you could decompress the entire file. However, this is not advisable as it would take several minutes (with larger files even hours) and it would inflate the file size to several gigabyte (GB).

Instead, you can make use of a program that allows you to view a part of the decompressed the VCF file without decompressing the whole file. There are several different options and programs to do this – can you think of one?

Show me the answer!

You could, for example, use either:
zcat cod204.lg05.1.vcf.gz | head (zcat is decompressing the file, but head tells the commands to stop decompressing any further after ten lines are printed)
Or you could use:
zless cod204.lg05.1.vcf.gz (similarly, zless only decompresses the part you are actually displaying; exit zless just like less by pressing q)
Or you could use:
bcftools view cod204.lg05.1.vcf.gz | head (BCFtools ia a software for variant calling and manipulation of VCF files. We will extensively work with BCFtools in the remaining exercise.)

Tip: Check out what these programs are doing with --help or man.

The first line is already telling you that this is indeed a VCF file:
##fileformat=VCFv4.2

These first lines, starting with ##, are part of the “VCF header”, which stores meta-information about the data. If you use zless you can scroll down to see more of the header and also the genotype data, which make up the majority of the file.

There can be a lot of information in the header. Different meta-information lines, e.g. “INFO”, “FILTER”, “FORMAT”, “ALT” describe types of information that are given for every variant in the VCF file. A few examples:

  • ##INFO<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">: The number of of often the alternative allele is occuring.
  • ##FORMAT<ID=GT,Number=1,Type=String,Description="Genotype">: The genotypes coded as allele values separated by / (or | if the data is phased).
  • The allele values are 0 for the reference allele and 1 for the alternative allele. If there are more alternative alleles the second alternative allele will have the value 2, the third 3, etc. Missing data is denoted by dots ..

  • ##contig<ID=LG01,length=28303952> : Part of a list of chromosomes, contigs, or scaffolds of the reference assembly.

For more information about the different tags, have a look at the VCF file format description.

The very last line of the header specifies the column names for the genotype data. The first eight columns are mandatory, follow by “FORMAT” and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AVE1409002 AVE1409003 AVE1409004 [...] TVE130512 TVE130513 TVE130514

You may want to edit these header lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to see what they did.

Q3 Find a command in BCFtools that allows you to separate the header from the large body of genotype data. Save the header in a separate text file named cod204.lg05.1.header.txt.
To find the command, have a look at the BCFtools manual, or just type bcftools on the command line to see the available options.

Give me a hint!

Have a look at the bcftools view option. Either type bcftools view on the command line, or have a look the manual.

Show me the answer!

bcftools view -h cod204.lg05.1.vcf.gz > cod204.lg05.1.header.txt
With view -h only the VCF header is selected (the header is small, therefore the small “h”). By using the capital -H the header is suppressed and only the “large” body of genotype data is considered.

Q4 What is the file size of cod204.lg05.1.header.txt, compared to the whole VCF file?

Show me the answer!

ls -l
The header is 7476 bytes (7 KB) small. This is only about 0.002% of the entire VCF, and a size we can safely open with a text editor.

Q5 Open the header on the command line in a text editor. Add a new line below the line ##fileformat=VCFv4.2 stating the date of today: ##fileDate=20200121 .
Save the new header as: cod204.lg05.1.header.ed.txt. Use BCFtools to replace (or “rehead”) the VCF file cod204.lg05.1.vcf.gz with the new edited header and save the re-headed VCF as cod204.lg05.1.ed.vcf.gz.
Make sure your changes took effect and your file is still valid by visually inspecting the file and the file size of cod204.lg05.1.ed.vcf.gz.

Tip: Type bcftools to see general options for the program; then use bcftools reheader to see the more specific options for the reheader function.

Show me the answer!

Make use of a text editor: nano cod204.lg05.1.header.txt or vi cod204.lg05.1.header.txt
Add ##fileDate=20200121, save the file (nano: “Ctrl + o”, vi: “:w”; don’t forget to save under the new name), and close the text editor (nano: Ctrl + x, vi: “:q”)
bcftools reheader -h cod204.lg05.1.header.ed.txt cod204.lg05.1.vcf.gz -o cod204.lg05.1.ed.vcf.gz (with -h the header file is specified, with -o the new output file name)
With zless cod204.lg05.1.ed.vcf.gz you can check the header lines, and by using ls -lh you can see if the file has still roughly the same size.

In the genotype part of the VCF file, one line corresponds to one variant site. Each line starts with the chromosome ID “CHROM”, the position “POS”, and the variant “ID”, and is followed by the bases of the two alleles (“REF” for the reference allele, and “ALT” for the alternative allele(s)). The 6th column (“QUAL”) is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, “PASS” tells us that the variant passed all filters.

In the “INFO” column, information related to the header info tags for each variant site is given (e.g., AC is the allele count in genotypes for the “ALT” allele(s), AF the allele frequency of the “ALT” allele, and AN is the total number of alleles, etc. We will use some of the “INFO” field specifications later on for hard quality filtering).

The “FORMAT” column specifies the fields and order of the data that is given with each individual for each genotype and the remaining columns list each individual with the values corresponding to the “FORMAT” field. The first “FORMAT” field is always the genotype (GT), the remaining fields are variable, but in our data file, GT is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood):

This is how the first variant of the genotype part of the VCF file looks like:
LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248

Q6 How many variant sites are stored in the VCF?

Show me the answer!

“Word count” (wc) can be used to count words, characters, or lines. Use wc --help to see the options.
We know that the genotype part of the VCF file contains one line per variant site, meaning that we can use BCFtools to suppress the header and forward (pipe; |) the output to wc -l. Using the flag -l means “count the lines”:

bcftools view -H cod204.lg05.1.vcf.gz | wc -l (takes ~1 min)
328253

Q7 Examine the first two variant sites in the VCF file and answer the following questions:

A) What kind of variants do you see (e.g., SNP, insertion/deletion)?

Show me the answer!

You can inapect the variants using zless cod204.lg05.1.vcf.gz

The first variant site (“338”) is a SNP. Columns four and five for “REF” (the reference allele) and “ALT” (the alternative allele) show A and G.
The second variant site (“357”) is a deletion. Column four contains multiple bases, while column five shows a G.

LG05 338 . A G
LG05 357 . GTGGGCGCTACTCCAGGCAAAGGAGGACTCTGTACCGGAGGCTTGGATCAGCT G

Sometimes you may also see an asterisk (*), which represents a spanning deletion.

B) What is the genotype (GT) of the first individual (AVE1409002) at the first variant site (LG05 338)?

Show me the answer!

The first individual has the genotype 0/0, which means twice the reference allele A (A/A). The allele values in the GT field are 0 for the reference allele (the “REF” column) and 1 for the first alternative allele (“ALT”).
GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855

C) What is the genotype read depth (DP, specified as “FMT/DP” to distinguish it from the mean DP “INFO/DP” per variant site) for the first three individuals (AVE1409002, AVE1409003, AVE1409004) at the first variant site (LG05 338)?

Show me the answer!

The read depths (DP) are 20, 9, and 10.
GT:AD:DP:GQ:PL
0/0:20,0:20:57:0,57,855
0/0:9,0:9:27:0,27,345
0/0:10,0:10:30:0,30,381



2) Hard quality filtering of variants


To remove low-quality variants (e.g., variants with a low read depth or variants only supported by poorly aligned reads) from our data set, we will apply hard quality filtering. In the terminology of the authors of GATK, “hard filtering” refers to using specific filtering cutoffs for specific annotations. This stands opposed to using the so-called Variant Quality Score Recalibration (VQSR) method, which does not use any specific “hard cutoffs” but combines information from multiple annotations from a known, highly validated variant training set in a machine learning algorithm. Variants that are not passing the chosen thresholds or the VQSR filtering can either be completely removed or they can be kept in the VCF, but a flag is then added to them noting that they did not pass the filter. Since often such a training set is not available, we will apply “hard quality filtering” here.
The following part of the activity is adjusted from this “GATK tutorial“.


2.1) Inspection of quality measurements

Before we choose the measurements and thresholds for hard quality filtering, let’s start by having a look at some of these measurements and the distributions of their values.

We will consider the following seven measurements, which are given for each variant site in the “INFO” column of the VCF (more information about each of these measurements is given further below):
Fisher Strand (FS), Strand Odds Ratio (SOR), RMS Mapping Quality (MQ), Mapping Quality Rank Sum Test (MQRankSum), Quality by depth (QD), Read Position Rank Sum Test (ReadPosRankSum), and the combined depth across samples (INFO/DP).

Below, the measurements of the first variant site are highlighted:
LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248

Q8 Use the BCFtools “query” option to extract the Fisher Strand (FS) values of all variants and save the output in a separate text file named cod204.lg05.1_FS.txt.

Tip: Look at the examples given in the manual. The FS values should be saved as one column, with one line for the value of each variant.

Show me the answer!

bcftools query cod204.lg05.1.vcf.gz -f'%FS\n' > cod204.lg05.1_FS.txt (takes ~1 min)
With the -f option, you can define the output format. The %FS indicates that for each variant, the FS value should be printed, and \n tells the program that after each FS value there should be a new line.

Q9 Inspect the file. How many columns are in the file? How many lines? Does the number correspond to the number of variant sites in the VCF? Do the first values correspond to the FS values of the first variants in the VCF file? Apply your UNIX skills!

Show me the answer!

Using head cod204.lg05.1_FS.txt will show you the first lines. There is just one column.
Typing wc -l cod204.lg05.1_FS.txt will show you the number of lines, which is the same as the number of variant sites we have ( 328253 )
With bcftools view -H cod204.lg05.1.vcf.gz | less you can visually check if the first variants in the file and compare them to the extracted FS values.

R task1 Now lets switch to R studio to plot the FS values. Open Rstudio in your internet browser:
http://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8787

Open the provided script “gda_plotting.R” in R and follow the points 2) and 2.1) in that R script.

Show me the plot!


Fisher Strand (FS), as well as Strand Odds Ratio (SOR), are measurements for strand bias – that is a sequencing bias, in which the reads of one DNA strand (e.g., the forward strand) support one genotype (e.g., a heterozygous call), while the reads for the reverse strand support e.g., a homozygous genotype. In cases of extreme strand bias, SNPs are not correctly called.

  • Fisher Strand (FS) is the Phred-scaled probability that there is strand bias at that site. It tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. A value close to 0 indicates little or no strand bias. GATK recommends hard filtering of variants with FS greater than 60, but this is a very conservative value that will only filter out cases of extreme bias and leave many false positives in the data set.

Q10 Based on the above information and the plotted distribution of FS, choose a FS filtering threshold to remove low quality variants.

Show me the answer!

We could stick with the GATK recommendation and remove variants with FS > 60. However, since this is a conservative value and our distribution tells us, that we would not loose many more variants if we lower the threshold to, let’s say, FS > 40 (0.64% of variants have FS > 60; 1.11% have FS > 40), we could as well set the threshold to FS > 40.
Setting such thresholds depends on the quality of the data set, as well as the planned downstream analyses. Here, we plan to inspect population structure, so we could risk removing some false negatives for the sake of overall quality improvement.

Q11 Use the BCFtools “query” option to jointly extract the Fisher Strand (FS), Strand Odds Ratio (SOR), Mapping Quality Rank Sum Test (MQRankSum), Read Position Rank Sum Test (ReadPosRankSum), Quality by depth (QD), RMS Mapping Quality (MQ),and the combined depth across samples (INFO/DP) and save the values separated by the tab character in a separate text file named cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt.

Tip: Each measurement should be saved in one column separated by tabs, with each variant on a single line.

Show me the answer!

bcftools query cod204.lg05.1.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\n' > cod204.lg05.1_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt

Again, -f defines the output format. The %FS\t%SOR\t… indicates that for each variant first the FS value should be printed, then a tab \t should be printed, followed by the SOR value, followed by a tab etc… . At the end, the \n tells the program that after all six measurements, there should be a new line.

R task2 Now lets switch again to R studio to plot the distributions of all measurements. Use the provided script “gda_plotting.R” and follow the point 2.1.1).

Show me the plot!

Q12 As you did for the FS values above, use the distribution plots and the information about each quality measurement given below to come up with a meaningful filtering threshold.

  • Strand Odds Ratio (SOR) is also a measurement for strand bias, but it takes into account the ratios of reads that cover both alleles and is therefore a better measurement for sites that have more reads in one direction than the other. SOR values range from 0 to greater than 9, a value close to 0 indicates little or no strand bias. GATK recommends hard filtering of variants with SOR greater than 3.
  • RMS Mapping Quality (MQ) and Mapping Quality Rank Sum Test (MQRankSum) are both measures for mapping quality. The Root Mean Square (RMS) mapping quality (MQ) of reads across all samples provides an estimation of the overall mapping quality to support a variant call. It includes the standard deviation (SD) of the mapping qualities, where a low SD indicates values close to the mean. Good mapping qualities are around MQ 60. GATK recommends hard filtering of variants with MQ less than 40.
  • The Mapping Quality Rank Sum Test (MQRankSum) compares the mapping qualities of reads supporting the reference and the alternate allele. If all reads map correctly, the expected MQRankSum is 0. Negative values indicate a lower MQ of the reads supporting the alternate allele, compared to reads supporting the reference allele. MQRankSum values usually range from about -10.5 to 6.5. GATK recommends hard filtering of extreme outlier variants with MQRankSum less than -12.5.
  • Quality by depth (QD) is the “confidence in a call” (i.e., the QUAL score), normalized by the read depth at that site. GATK recommends hard filtering of variants with QD below 2.
  • The Read Position Rank Sum Test (ReadPosRankSum) measures the relative position of the reference versus the alternative allele within reads. E.g., SNPs towards the ends of reads may be more likely a sequencing error. A value around 0 means there is little to no difference in where the alleles are found relative to the ends of reads. GATK recommends hard filtering of variants with ReadPosRankSum less than -8.0.
  • The combined depth across samples (INFO/DP) is simply the sum of the read depth (i.e., the genotype depth (FORMAT/DP) fields) over all samples. Variants that have a high mean depth across all samples are indicative of either repetitive regions or paralogs. A good rule of thumb is to remove variants that show twice the mean read depth. You can calculate the mean read depth in R: mean(cod204.lg05.1_hf.values$DP). Now, don’t be shocked about the high read depth. This field has not been updated after reducing the vcf from 860 individuals to 204 individuals, so the information is given for all 860 individauls. We can use this field nevertheless for excluding variants with read depth larger than twice the mean.

Show me the answer!

Again, choosing filtering thresholds is highly depended on the quality of the data and the planned downstream analyses. Here are some possible suggestions for variants to exclude:

  • FS (GATK recommendation FS > 60): FS > 40 (we are more strict here)
  • SOR (GATK recommendation SOR > 3): SOR > 3 (this seems just fine)
  • MQ (GATK recommendation MQ < 40): MQ < 40 (we would loose many variants with a lower threshold, so we stick to the recommended)
  • MQRankSum (GATK recommendation MQRankSum > -12.5): MQRankSum > -5.0 and MQRankSum < 5.0 (a value around 0 is preferred and we don’t loose many variants if we set strict thresholds)
  • QD (GATK recommendation QD > 2): QD > 2 (we would loose many variants with a lower threshold, so we stick with the recommendation)
  • ReadPosRankSum (GATK recommendation ReadPosRankSum > -8.0): ReadPosRankSum > -4.0 (not many variants are below -8, so we can set it more strict)
  • INFO/DP (no GATK recommendation, dependend on sample number and sequencing depth): INFO/DP > 16,000 (the mean DP is 8400.935, we exclude sites that show about twice the mean DP)


2.2) Application of hard filtering thresholds

We will now apply the filtering thresholds and remove variants that don’t match the chosen thresholds.

Q13 Using the BCFtools “filter” option, come up with a command that will produce a high-quality data file. You may use the thresholds you chose, or adapt them according to the suggestions above.
Don’t forget to specify the input file cod204.lg05.1.vcf.gz, the output format (we’ll use -O z since we would like to save the data file again as compressed VCF), and the output filename (-o cod204.lg05.1.hf.vcf.gz). Type bcftools filter in the command line to see the options.

Tip: It is possible to either use -e and the logical operator “or” || to exclude low-quality variants from the file, or use -i and the logical operator “and” && to include high-quality variants.

Running the filtering will take about five minutes, so time for a break and some coffee!

Show me the code!

A possible solution choosing the “exclude” option:
bcftools filter -e 'FS>40.0 || SOR>3 || MQ<40 || MQRankSum<-5.0 || MQRankSum>5.0 || QD<2.0 || ReadPosRankSum<-4.0 || INFO/DP>16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.1.vcf.gz

A possible solution choosing the “include” option:
bcftools filter -i 'FS<40.0 && SOR<3 && MQ>40.0 && MQRankSum>-5.0 && MQRankSum<5 && QD>2.0 && ReadPosRankSum>-4.0 && INFO/DP<16000' -O z -o cod204.lg05.1.hf.vcf.gz cod204.lg05.1.vcf.gz

! Be aware that the commands will not produce the exact same results. The reason for this is that some measurements will not be available for all sites, because there are, for example, too many missing genotypes to calculate this measurement. In such cases, these sites would not be excluded (because there is no measurement to exclude), so they would remain in the filtered file, while with the include option, they would not be included (since there is no measurement based on which they could be included).

Q14 Check the filtered file:

  • What is the file size?
  • How many variant sites have been removed? Tip: see Q6
  • Has the FS filter been applied correctly? What is the largest FS value?
    Tip: use bcftools query as you did above (Q8) but instead of saving the output, pipe it directly to sort. Use the correct flags with “sort” (see the options with sort --help and only check the first values by piping the output again to head.

Show me the answer!

! Note, the values below may be different depending on the filter thresholds you have chosen.

  • ls -lh
    213M – the file is about 30% smaller than the unfiltered one.
  • bcftools view -H cod204.lg05.1.hf.vcf.gz | wc -l
    231014 – the file also has about 30% fewer variant sites.
  • bcftools query cod204.lg05.1.hf.vcf.gz -f'%FS\n' | sort -gr | head – we need the sort flag -g since we have floating point numbers, and since we are interested in the largest value, we need to reverse the sorting using the flag -r. Using sort -g | tail would work just the same way.
    39.994
    39.989
    39.913
    39.91
  • bcftools query cod204.lg05.1.hf.vcf.gz -f '%FS\t%SOR\t%MQRankSum\t%ReadPosRankSum\t%QD\t%MQ\t%DP\n' > cod204.lg05.1.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt

R task3 Extract again all quality measurements (FS, SOR, MQ, MQRS, QD, RPRS, DP) as you did above (see Q11), save the output in a new text file named cod204.lg05.1.hf_FS.SOR.MQRS.RPRS.QD.MQ.DP.txt, and plot the distributions using Rstudio and the script “gda_plotting.R” (section 2.1.2).

Show me the plot!



3) Further quality filtering


After hard quality filtering, we have a data set containing only variant sites that we trust with high confidence. However, so far we have only removed variants that had skewed values across all samples – we haven’t yet looked at the individual genotypes at each variant site. For example, we may have kept a variant that has a high quality across all 204 individuals, but there are three individuals where the read depth (FMT/DP) or the genotype quality (GQ) for the individual genotype is very low, so we don’t trust the calls for these individuals.

LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 0/0:5,0:5:15:0,15,209 [...] 0/0:7,0:7:21:0,21,248

Furthermore, we will exclude SNPs that are in the close proximity of indels. That is, because the proper alignment of sequences around indels can often be problematic and produce false positives.

For example, dependent on the “gap opening penalty”, a region can be aligned to have four indels (‐) and a SNP.
CTTTGAACCAAATTT----ACTGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACT--AGGT-TGTTGCC

Or it can be aligned to have three indels (‐) and four SNPs.
CTTTGAACCAAATTT--TCAGGAGTTCTGTTGCC
CTTTGAACC---TTTGCTAACTAGGT-TGTTGCC

In addition, we will also remove monomorphic SNPs (i.e., SNPs that were called because they differ from the reference, but that are not polymorphic in our sample).

Then, since we are only interested in biallelic SNPs for the downstream analyses in this activity, we will remove indels and multiallelic SNPs (for data sets were indels are also going to be analysed, it is recommended to subset indels form SNPs as the very first step and adjust the above hard filtering thresholds separately).

And last but not least, we will remove individuals with high amounts of missing data, as well as variants with lots of missing genotypes, and we will filter on the minor allele frequency (MAF).


3.1) Filter on minimum read depth (DP) and genotype quality (GQ)

We will again use bcftools filter and filter reads that have a read depth below 3 or a genotype quality below 20. The difference is, that instead of the whole variant site, we are now considering single genotypes (the information in the “FORMAT” fields”) using -e 'FMT/DP<3 | FMT/GQ<20' and we will not remove the whole variant site, but only set the respective genotypes to missing (./.) by using the bcftools filter -S . command.
Use the already “hard-filtered” data file as input cod204.lg05.1.hf.vcf.gz, specify a compressed VCF as the output format -O z, and name the new output file (-o cod204.lg05.1.hf.DP3.GQ20.vcf.gz).

Q15 Puzzle the code together with the help of the above information, bcftools filter, and the manual. It takes ~3 min to run the command.

Show me the command!

bcftools filter -S . -e 'FMT/DP<3 | FMT/GQ<20' -O z -o cod204.lg05.1.hf.DP3.GQ20.vcf.gz cod204.lg05.1.hf.vcf.gz

Here we use the logical operator “|” instead of “||”, since using “||” would mean that every genotype at a variant site is set to missing, even if only one genotype doesn’t fulfill the threshold. More information about the use of these operators in BCFtools can be found here.

You may wonder whether a minimum read depth of three is sufficient to confidently call the variant. In our case it is, since the GATK Haplotypecaller pipeline calls genotypes across all 860 samples simultaneously and includes information from all reads from all individuals to adjust the genotype quality. A call with DP=3 may thus have a high GQ and can be confidently kept.

Q16 Let’s have a brief visual check whether our filter has been applied correctly:

With the following command we extract genotypes of samples that have a read depth lower than three -i 'FMT/DP<3' and in the output we would like to see the genotype (GT), followed by the read depth (DP) and the genotype quality (GQ) for that sample -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n'.

Before you execute the following command, have a brief thought about the expected output regarding the genotypes.

bcftools query -i 'FMT/DP<3' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n' cod204.lg05.1.hf.DP3.GQ20.vcf.gz | less -S

Show me the answer!

With our filtering command above, we specified to set genotypes to missing ./.. Thus, when only extracting genotypes of samples that have a read depth lower than three, we would expect them to be all set to missing.

GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6
GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=3 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=2 GQ=6
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=2 GQ=6 GT=./. DP=0 GQ=. GT=./. DP=1 GQ=3
GT=./. DP=2 GQ=6 GT=./. DP=1 GQ=3 GT=./. DP=2 GQ=6 [...]

You may also check the genotypes with a read depth of exactly 3. bcftools query -i 'FMT/DP=3' -f '[GT=%GT\tDP=%DP\tGQ=%GQ\t]\n' cod204.lg05.1.hf.DP3.GQ20.vcf.gz | less -S.
You will see that there are only few genotypes with a high genotype quality (GQ) that are not set to missing.


3.2) Remove multiallelic SNPs and indels, monomorphic SNPs, and SNPs in the close proximity of indels

  • To remove monomorphic SNPs, we will use bcftools filter as before to exclude -e all sites at which no alternative alleles are called for any of the samples AC==0 and all sites at which only alternative alleles are called AC==AN. In addition, to remove SNPs that are within 10 bp of indels, we add the flag --SnpGap 10.
  • To reduce the data file to contain only biallelic SNPs, we use the bcftools view command using the flags -m2 -M2 -v snps, which set both the minimum -m and maximum -M number of alleles to 2, and keeps only snps via -v snps.

You can execute both commands in one command line by piping | the output of the first command (bcftools filter …) to the second command (bcftools view …). This way you only have to specify the input file cod204.lg05.1.hf.DP3.GQ20.vcf.gz with the first command, and the output format -O z and output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz with the second command.

Q17 Apply the command by using the information above, bcftools filter and bcftools view, and the manual. It will take ~2 min to run.

Show me the command!

bcftools filter -e 'AC==0 || AC==AN' --SnpGap 10 cod204.lg05.1.hf.DP3.GQ20.vcf.gz | bcftools view -m2 -M2 -v snps -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz

Q18 Verify that the filters have been applied by visually inspecting the file (e.g., using zless -S).

Show me how!

You may use:
zless -S cod204.lg05.1.hf.DP3.GQ20.vcf.gz to see the original file,
and compare it with the output file:
zless -S cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz

Or: bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | less -S

By using less -S we “chop” the long lines and only display the beginning of each line, making it easier to compare the reference and alternative alleles and the allele numbers. You can scroll horizontally across the lines using the arrow keys.

Can you identify what has changed?

Q19 How many variants are left in our data file after all the above filtering steps have been applied?

Show me the answer!

bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | wc -l
82594
Remember, we had 328,253 variants at the beginning and 231,014 variants after the hard quality filtering steps, now after the last filtering steps we’re down to 82,588.



3.3) Remove individuals with a high amount of missing data

Sometimes, sequencing doesn’t work as well for some individuals as it does for others. Such individuals will be marked by low quality genotypes and low read depth, resulting in high amounts of missing data after the above filtering steps.

We can check the amount of missing data by using the bcftools stats command. Have a look at the options by typing bcftools stats in the terminal or check the manual for what it can do. With -s - we can request stats for all samples.

Run bcftools stats -s - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz and quickly scroll through the large output.

Find the section PSC, Per-sample counts, which displays some summary statistics per sample. The last column named [14]nMissing displays the number of missing genotypes per sample.
There is also other valuable information that allows us for example to assess the effectiveness of the filtering steps above (e.g., in clolumn [9]nIndels, also under PSC the number should be 0 since we removed all indels).

Also have a quick look at the number of [11]nSingletons, listed in column 11. We will get back to that in part 3.4) of the activity.

Q20 Use your UNIX (and regex) skills to extract only the 3rd and 14th column of the PSC output to get a list of sample names and numbers of missing genotypes. Save the file as: cod204.lg05.1.hf.DP3.GQ20.allele.imiss.
Visually check that the file is as expected.

Show me the code!

bcftools stats -s - cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz | grep -E ^PSC | cut -f3,14 > cod204.lg05.1.hf.DP3.GQ20.allele.imiss
Do the visual check e.g., using less cod204.lg05.1.hf.DP3.GQ20.allele.imiss

R task4 Lets switch to R studio to plot the missingness per sample. Use the provided script “gda_plotting.R” and follow the point 3) of the R script.

Q21 Apply the bcftools view -s command to remove the three individuals with more than 70% missing data from the data file cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz .
Save the output as compressed VCF by using option -O z and specify the output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz with option -o. It takes ~1 min to run this command.

Show me the code

bcftools view -s ^ORE1203009,ORE1203008,ORE1203013 -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz cod204.lg05.1.hf.DP3.GQ20.allele.vcf.gz

Q22 Count the number of samples in cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz. Have the three individuals been successfully removed?

Show me the code

bcftools query -l cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz | wc -l
201
Yes, our data set had 204 individuals before.



3.4) Remove variants with a high amount of missing genotypes and filter on minor allele frequency

In the missingness plot above, we saw that the data set in general has a lot of missing data. This is because we set genotypes with low quality and low read depth to missing. As the penultimate step, we will now remove all variants that have more than 20% missing data.

As the last step, we will remove the singletons that you saw above in the bcftools stats command. Singletons – alleles that are observed only once – and other alleles that have a low minor allele frequency (MAF) frequently present false calls, which can be obstructive in downstream analyses by adding noise to the data (note however, that you may also need them e.g., for demographic analyses).
If the data will be used for a genome-wide association study (GWAS), usually a rather high MAF threshold is applied, as it requires very strong statistical power to make meaningful statements about very rare alleles. When interested in population structure with many samples, alleles present in only one individual do not add information and can therefore be removed. Thus, the optimal MAF filtering threshold depends on the type of data you have, the analysis you are planning to do, the data quality, and the sample size.

Q23 Knowing that our data set contains 201 individuals with a minimum of 20 individuals per sampling site, and that we will use it to dissect population structure, what could be an appropriate MAF filtering threshold?

Show me the answer!

The default MAF threshold in some programs is 0.01. Filtering for MAF 0.01 means that remaining sites in the VCF have a minimum minor allele frequency of 1%. Our data set contains 201 individuals (thus maximally 402 alleles per site). If we filter for MAF 0.01, a variant is excluded if its minor allele is found in less than 4.02 genotypes (1% of 402). This would remove alleles where the minor allele occurs four times or less, including the singletons.

A slightly higher MAF threshold of 0.05 would exclude all sites that are called less than 20.1 times (5% of 402). Given that we know that we have samples with 20 individuals (40 alleles), such number of minor alleles could very well be true variants and should rather not be removed.

Note: for the above calculations, I assumed that we have 402 allels per site. However, in BCFtools, MAF is calculated as the minor allele count (MAC) divided by the total number of alleles (AN), meaning that missing data is taken into account. Since we filter simultaneously for missing data, AN will be between 322 and 402.
If missing data should not be taken into account, filtering can also be done based on MAC.

Taken together, probably something between 0.01 and 0.03 could be a good MAF threshold for our data set.

Q24 Use bcftools filter -e to exclude all variants that have more than 20% missing genotypes or have a minor allele frequency smaller or equal to 0.02 from the data file cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz. Save the output as compressed VCF -O z and use the output file name -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz.

Show me the code!

bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.02' -O z -o cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz cod204.lg05.1.hf.DP3.GQ20.allele.missi.vcf.gz

Q25 Often the alternative allele is also the minor allel, thus we can inspect the alternative allele count (AC) to see whether our MAF filter took effect.
What would be the lowest alternative allele count to expect?

Use again bcftools query and extract the INFO field “AC”, pipe the output into sort and print only the first lines.

Show me the code!

We filtered for MAF 0.02, so we would expect a minimum AC of 7 (because 2% of 322 = 6.44).

bcftools query cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz -f'%AC\n' | sort -g | head
7
7
7
[…]

OPTIONAL R task: Switch to Rstudio, adjust the code of point 3) in the “gda_plotting.R” script and plot the missingness per sample after the above filtering steps. Has it changed?

Q26 How many variants (in percent) have been removed by all our filtering steps?

Show me the answer!

bcftools view -H cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz | wc -l
From 328,253 variants, 11,864 SNPs are left. This means we removed 96% percent of the data.



4) Performing a principal component analysis (PCA) to inspect the data structure


The “smartpca” program implemented in the software EIGENSOFT uses a combination of principal component analysis (PCA; a technique used to identify and illustrate the strongest components of variation in a dataset) and new statistics designed specifically for genomic data. The methods are described in Patterson et al.(2006) PLOS Genetics 2(12): e190.

The EIGENSOFT smartpca needs four input files: The “.ped” (containing the genotype information) and “.map” (containing the chromosome/position information) files that we will produce using the software PLINK1.9, a “.pedind” file containing individual information, and a ‘.par’ file specifying the analysis settings.

Q27 Convert the filterd VCF file cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz to the PLINK “.ped/.map” format by running the following command:
plink --vcf cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.vcf.gz --aec --recode --out cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02

If you would like to know more about this format have a look here.

Q28 The EIGENSOFT smartpca is often not happy if the chromosome IDs in the “.map” file contain anything else than numbers.
Use sed to replace the “LG” in the cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map file with “” (nothing) and save the output in a new file called tmp.
Verify that the new file is correct. If so, rename tmp again to cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map.

Show me the command!

cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map | sed s'/LG//'g > tmp
The “stream editor” sed will substitute s/ the found expression “LG” with “” (nothing) within the “global” /g environment, meaning the whole line/file.

less tmp
mv tmp cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.map

Q29 To generate the “.pedind” file, apply again your UNIX skills and cut the first five columns of the “.ped” file and save them in a new file called “tmp”.

Show me the command!

cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cut -d ' ' -f1-5 > tmp
The default delimiter whitespace in cut are tabs, but the .ped file has spaces. We can define this with -d ' '. With -f we specify columns 1 to 5.

Have a look at the “tmp” file. It contains the family IDs and the individual IDs in the first two columns (in our case this is the same), followed by zeros that indicate “no information” for the remaining columns (fatherID, motherID, sex). In addition, we need a column to assign individuals to populations.

Q30 Add a tab-separated 6th column specifying that AVE1409002-AVE1409024 belong to the population “AVE”, BOR1104001-BOR1205019 belong to the population “BOR”, etc.. .
To do this, “cut” the first three characters (see cut --help) from cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped and “paste” them to the tmp file. Save the output as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind.
Remove the “tmp” file.

Show me the command!

cat cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.ped | cut -c 1-3 | paste tmp - > cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.pedind
“cut -c 1-3” selects the first three characters, which are combined via the standard output (stdout, represented by “-) with the saved “tmp” file.
rm -i tmp

Q31 A template parameter file (template.par) and a smartpca manual page (eigensoft_smartpca_readme.txt) can be found in the analysis folder. Using the text editor of your choice, modify the template.par to include the respective input file names, save the par file as cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par, and run the smartpca with the syntax smartpca -p cod204.lg05.1.hf.DP3.GQ20.allele.missi.miss20.maf0.02.par > pca01.out.

R task5 Lets switch a last time to R studio to plot the PCA. Use the provided script “gda_plotting.R” and follow point 4) in the R script.

Show me the plot!

The first principal component axis explains almost 2% of the total genomic variation and separates the eastern Baltic Sea (dark blue) from the remaining samples. The second axis explains 1% of the total genomic variation and separates the North Sea populations from the fjord-type cod. Although we used just a tiny fraction of the data, the population structure is comparable with Fig. 1b in Barth JMI et al. (2019) Mol Ecol 28, 1394–1411.

Well done!! Now it is time for socializing. My suggestion for todays pub is the “Zapa bar”.