The VCF format, filtering, and first analyses
Julia M.I. Barth, 22 January 2018
Background and Aim
With the advances of sequencing technology, reading whole genomes and re-sequencing population samples has become incredibly cheap over the last years. This has led to a shift from single-marker analyses to whole-genome studies, which at the same time caused a change from graphical user interfaces (GUI) software to command-line based programs. Such programs are much faster and more suitable for handling very large data files, but it might take some time to get used to ‘not seeing your data and analyses’.
Learning goals
- Apply the UNIX skills you learned in the previous exercises.
- Learn about the VCF format and how to handle and manipulate VCF files.
- Perform a PCA with genomic data.
Requirements
bcftools v1.6
VCFtools v0.1.14
EIGENSOFT smartpca v6.0.1
R / Rstudio (for plotting)
Example data
The example data is a VCF file containing biallelic SNP data of several individuals in two populations. The data was produced using Illumina sequencing, mapped against a reference genome, and already filtered to contain only high quality SNPs (e.g. high read depth and genotype quality).
How to do this activity
Questions or tasks are indicated with
All text in red underlined with gray color
indicates commands or file names that you can type or copy to the terminal. Text in black underlined with gray color
refers to terminal outputs.
Try to work independently, use the provided manuals, discuss with your neighbor. If you get stuck, check the answer-box:
Show me the answer!
There are always alternative options and solutions. Whether you prefer cat
over less
, or whether you found another solution to the task, go ahead and use your way of doing it!
Table of contents
A VCF (Variant Call Format) file is a standardized text file format that is used to store genetic variation calls such as SNPs or insertions/deletions of multiple individuals. The full format specifications can be found here.
In the following first part of the exercise, you will learn how the information in a VCF is stored, and how you can inspect it.
~/workshop_materials/01_unix_intro/vcf_activity
What is the file size of data.vcf.gz
? How many bytes and how many MB does it have?
Show me the answer!
cd ~/workshop_materials/01_unix_intro/vcf_activity
ls -l
(long listing format) to see a list of the folder content including the file size: the VCF file is 184547450 bytes.ls -lh
to see the content list with the file sizes printed in human-readable format: it’s 176 MB.Tip: for more functions of
ls
type ls --help
or man ls
.
The file ‘data.vcf.gz’ is a small VCF file with reduced file size for the purpose of this exercise. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.
Show me a hint!
head data.vcf.gz
you will only see gibberish! Why?It’s a compressed file (indicated by the ending ‘.gz’.
Compressing VCF files with gzip (or bgzip and indexing it with tabix) is the common way in which huge VCF files are stored.
To see the first ten lines, you could decompress the file and subsequently use head
. However, this is not advisable as it would take several minutes (with larger files even hours) and it would inflate the file size to several gigabyte (GB).
Instead, you can make use of a program that allows you to see the first lines of the compressed VCF file. There are several different options and programs to do this – just come up with one!
Show me the answer!
zcat data.vcf.gz | head
zless data.vcf.gz
bcftools view data.vcf.gz | head
‘bcftools’ ia a software for variant calling and manipulation of VCF files.Tip: Check out what these programs are doing with
--help
or man
.
The first line is already telling you that this is indeed a VCF file:
##fileformat=VCFv4.1
You can also see some more information about the VCF file, e.g., when it was created and which reference file was used:
##fileDate=20160104
##reference=file:///work/jobs/11187331/nuc_reference.fasta
These lines, starting with ##
, are part of the ‘VCF header’, which stores useful meta-information about the data. If you use zless
you can scroll down to see more of the header and also the genotype data, which make up the majority of the file.
There can be a lot of information in the header. Different meta-information lines, e.g. ‘INFO’, ‘FILTER’, ‘FORMAT’, ‘ALT’ describe types of information that are given for every variant in the VCF file. A few examples:
##INFO<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered">
:The read depth.
##FORMAT<ID=GT,Number=1,Type=String,Description="Genotype">
:The genotypes coded as allele values separated by /
. The allele values are 0
for the reference allele and 1
for the alternative allele. Missing data is denoted as .
.
##contig<ID=LG01,length=28303952>
: A list of chromosomes, contigs, or scaffolds.
The very last line of the header specifies the column names for the genotype data. The first 8 columns are mandatory, follow by ‘FORMAT’ and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BS001 BS002 BS003 BS004 BS005 BS006 BS007 BS008 BS009 BS010 NS001 NS002 NS003 NS004 NS005 NS006 NS007 NS008 NS009 NS010
You may want to edit these header-lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to see what they did.
data.header.txt
.
To find the command have a look at the bcftools manual, or just type bcftools --help
on the command line to see available options.
Show me the answer!
bcftools view -h data.vcf.gz > data.header.txt
With ‘view -h’ the VCF header only is selected, with ‘-H’ the header is suppressed (and only the genotype data considered).
data.header.txt
, compared to the whole VCF file?
Show me the answer!
ls -l
The header is 5330 bytes (5.3 KB) small. That is only 0.003% of the entire VCF, and a size we can safely open with a text editor.
data.header.ed.txt
. Replace (‘re-head’) the VCF file ‘data.vcf.gz’ with the new edited header and save the re-headed VCF as data.ed.vcf.gz
. Make sure your changes took effect and your file is still valid by inspecting the individual names and the file size of ‘data.ed.vcf.gz’.
Show me the answer!
nano data.header.txt
or vi data.header.txt
, edit and save.bcftools reheader -h data.header.ed.txt data.vcf.gz -o data.ed.vcf.gz
(takes 1 min)bcftools query -l data.ed.vcf.gz
ls -lh
In the genotype part of the VCF file, one line corresponds to one variant. Each line starts with the chromosome ID ‘CHROM’, the position ‘POS’, and the variant ‘ID’, and is followed by the bases of the two alleles (‘REF’ for the reference allele, and ‘ALT’ for the alternative allele). The 6th column (‘QUAL’) is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, ‘PASS’ tells us that the variant passed all filters. In the ‘INFO’ column, information related to the header info tags is given (e.g., AC is the allele count in genotypes for the ‘ALT’ allele, AF the allele frequency of the ‘ALT’ allele, and AN is the total number of alleles, etc.). The ‘FORMAT’ column specifies the fields and order of the data that is given with each individual, and the remaining columns list each individual with the values corresponding to the ‘FORMAT’ field. The first ‘FORMAT’ field is always the genotype (GT), which in our example is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood):
LG03 4837 . C A 559.22 PASS AC=11;AF=0.009016;AN=1220;BaseQRankSum=0.431;ClippingRankSum=0;DP=7189;ExcessHet=4.6637;FS=43.708;InbreedingCoeff=-0.0285;MLEAC=11;MLEAF=0.009016;MQ=55.93;MQRankSum=-2.093;QD=3.37;ReadPosRankSum=0.889;SOR=3.89 GT:AD:DP:GQ:PGT:PID:PL 0/1:12,3:15:90:0|1:4829_C_T:90,0,495 0/0:18,0:18:54:.:.:0,54,702 0/0:19,0:19:54:.:.:0,54,810 0/0:13,0:13:39:.:.:0,39,533 ./.:15,0:15:3:.:.:0,3,523 ./.:17,0:17:9:.:.:0,9,603 0/0:17,0:17:51:.:.:0,51,649 0/0:13,0:13:39:.:.:0,39,501 0/0:9,0:9:27:.:.:0,27,343 0/0:10,0:10:30:.:.:0,30,395 0/0:7,0:7:21:.:.:0,21,288 ./.:8,0:8:0:.:.:0,0,255 0/0:12,0:12:33:.:.:0,33,495 0/0:14,0:14:42:.:.:0,42,479 0/0:14,0:14:39:.:.:0,39,585 ./.:6,0:6:18:.:.:0,18,247 ./.:6,0:6:15:.:.:0,15,225 0/0:21,0:21:60:.:.:0,60,900 0/0:15,0:15:42:.:.:0,42,630 0/0:11,0:11:30:.:.:0,30,450
A) What kind of variant is it (e.g. SNP, insertion/deletion (indel))? What would you expect if it was an indel?
Show me the answer!
LG03 4837 . GT GTTT
LG03 4837 . CAGAA CA
B) What genotype (GT) has the first individual BS001 at the second SNP (LG03 8159)?
Show me the answer!
GT:AD:DP:GQ:PL 0/0:15,0:15:39:0,39,585
The first individual has the genotype 0/0, which means twice the reference allele G (G/G). The allele values in the GT field are 0 for the reference allele (the REF column), 1 for the alternative allele (ALT).
C) What is the read depth (DP) for the first three individuals BS001, BS002, and BS003 at the second SNP (LG03 8159)?
Show me the answer!
GT:AD:DP:GQ:PL
0/0:15,0:15:39:0,39,585
0/1:10,6:16:99:202,0,356
0/0:12,0:12:30:0,30,450
Now that we know the VCF format, we can start to adjust our file for certain analyses. As mentioned before, our VCF file is already filtered for quality, read depth, etc. (find more on variant calling and quality filtering on the Genomics workshop page).
However, you may want to use different thresholds for minor allele frequency or the proportion of missing data for specific analyses. Or you might want to filter out certain variants or chromosomes. The software VCFtools has various functions to manipulate, inspect, filter, and merge VCF files. VCFtools can also output statistics such as heterozygosity, allele frequencies, or Fst.
Individuals and/or variants with a lot of missing data (often due to low sequence coverage) are not very informative and may interfere with certain analyses. Excluding individuals and/or variants with too much missing data as a first step also reduces file size and speeds up analyses.
Tip: Have a look at the vcftools manual website or man vcftools
.
Show me the answer!
vcftools --gzvcf data.ed.vcf.gz --missing-indv --out data.ed
The input format and the file to be processed is defined by ‘––gzvcf’. The file reporting the missingness on a per-individual basis is requested with ‘––missing-indv’. The file has the suffix ‘.imiss’. Using ‘––out’ the output filename prefix is defined. If none was specified, the ouput filename prefix is simply ‘out’.
Use less
or head
to see the format of the data.ed.imiss
file:
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
BS001 1184094 0 53664 0.0453207
BS002 1184094 0 124941 0.105516
BS003 1184094 0 84133 0.0710526
...
Show me the answer!
cat data.ed.imiss | sort -rk5
Sort the 5th column ‘k5’ from large to small (reverse) ‘r’.Two individuals (NS009, BS008) have 25% missing data.
Along with excluding individuals with few data, we also want to exclude sites for which 20% or more of the genotypes are missing.
Show me the answer!
--max-missing
must be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed.--remove-indv
followed by the individual name. This option can be used multiple times to specify more than one individual.
data.ed.imiss25.miss20.vcf.gz
) directly as a compressed VCF file using bgzip
(output to standard out then pipe to bgzip). Include all VCF header info (--recode-INFO-all
).
Tip: Get inspired by the examples at the beginning of the vcftools manual.
(takes about 2 minutes)
Show me the answer!
vcftools --gzvcf data.ed.vcf.gz --max-missing 0.8 --remove-indv NS009 --remove-indv BS008 --out data.ed.imiss25.miss20 --recode --recode-INFO-all --stdout | bgzip > data.ed.imiss25.miss20.vcf.gz
Here, ‘––out’ specifies the name/location for the log file. If no name is specified, the log file will be named “out.log”. With ‘––recode’ it is specified that a new VCF should be generated. To keep all meta information in the genotype data we should specify ‘––recode-INFO-all’. The output is directed to standard out (stdout) using ‘––stdout’ and can thus be piped to another program or directly written to a new file. With ‘bgzip’ the piped output is compressed.
Verify that your command worked by checking the individual IDs, having a look at the new file size and the log-file entry, and by comparing the number of variants in the two VCF files:
How many variants were excluded due to the applied missingness filters?
Do the numbers given in the log file and your calculations on the original file and the new file match?
Tip: Use a cobination of bcftools (to suppress the header) and word-count (wc
) to count the variants.
Show me the answer!
ls -lh
: The new file is about 40% smaller!bcftools view -H data.ed.vcf.gz | wc -l
View the VCF file without the header (-H) and count the lines (wc -l) of the genotype part. Remember, the VCF contains one variant per line.bcftools view -H data.ed.imiss25.miss20.vcf.gz | wc -l
Compare the number above with the filtered VCF.Tip: You can start the commands above in
screen
or in a seperate terminal tab, so you can continue working until they’re finished.bcftools query -l data.ed.imiss25.miss20.vcf.gz
Check the individual names. Are BS008 and NS009 still listed?less data.ed.imiss25.miss20.log
: Compare the numbers above with the log-file: [...] After filtering, kept 18 out of 20 Individuals [...] After filtering, kept 760,108 out of a possible 1,184,094 Sites [...]
2.2) Filter on minor allele frequency
Filtering on minor allele frequency (MAF; the frequency of the allele that appears less often) can be done for multiple reasons. One reason is that in samples sequenced with low coverage, variants with very low MAF frequently present false calls, and can be obstructive by adding noise to the data. If the data will be used for a genome-wide association study (GWAS), usually a rather high MAF threshold is applied, as it requires very strong statistical power to make meaningful statements about very rare alleles. When interested in population structure with many samples, alleles present in only one individual do not add information for the inference of structure.
Therefore, the optimal MAF filtering threshold depends on the type of data you have, the analysis you are planning to do, the data quality, and the sample size.
If you have no idea what an appropriate threshold could be, read hint 1:
Show me hint 1!
It may help to output the allele frequencies using the
--freq
flag and plot them in a histogram using e.g., RStudio in the browser http://YOUR_IP.compute-1.amazonaws.com:8787
. If you’re not so familiar with R yet, use the R-cheat-sheet (20180122_vcf_plot.R) for the plotting. vcftools --gzvcf data.ed.imiss25.miss20.vcf.gz --freq --out data.ed.imiss25.miss20
If you’re still unsure as of which threshold to choose, read hint 2.
Show me hint 2!
If you don’t know if MAF 0.01 would filter anything or not, just filter for MAF 0.01 to see what happens. Make sure to save the output file (named
data.ed.imiss25.miss20.maf0.01.vcf.gz
) directly as a compressed VCF file using bgzip and include all VCF header info as above. Have a look at the log file. Have any variants been removed? Why?
Show me the answer to hint 2!
vcftools --gzvcf data.ed.imiss25.miss20.vcf.gz --maf 0.01 --out data.ed.imiss25.miss20.maf0.01 --recode --recode-INFO-all --stdout | bgzip > data.ed.imiss25.miss20.maf0.01.vcf.gz
From the resulting log file:
After filtering, kept 730,193 out of a possible 760,108 Sites
This means 29,915 variants have a MAF below 0.01 and have indeed been removed. But how is this possible and what does it mean?
If we filter the dataset with 18 individuals (36 alleles) for MAF 0.01, a variant is excluded if its minor allele is found less than 0.36 times (1% of 36). We would not expect to have alleles that exist 0.36 times (they exist either not at all (0 times) or once (1 time) or more often). So, a MAF threshold of 0.01 is obviously too low for this data set and would only exclude completely invariant sites.
Usually we would not expect to have invariant variants in our data file. However, the two removed individuals might have been the only carriers of a variant allele at a number of sites, and after excluding these individuals, these sites might therefore be invariant.
Compare the genotypes at the site ‘LG03 10020068’ in the original VCF and the filtered one (note, between LG03 and the position is a ‘tab’, not a space):
zcat data.ed.vcf.gz | grep 'LG03 10020068'
zcat data.ed.imiss25.miss20.vcf.gz | grep 'LG03 10020068'
The second last individual (NS009) is the only one with the reference allele. We removed it, so we now expect a MAF of 0 for this site. If you haven’t done so before, you may now want to have a look at the unfiltered allele frequencies:
vcftools --gzvcf data.ed.imiss25.miss20.vcf.gz --freq --out data.ed.imiss25.miss20
Check the allele frequencies for the site LG03 10020068:
cat data.ed.imiss25.miss20.frq | grep 'LG03 10020068'
LG03 10020068 2 30 T:0 A:1
The allele frequency for ‘T’ is indeed 0!
Count the number of sites with MAF 0 by checking how often ‘0’ occurs in column 6 and 8 of the .frq file:
cat data.ed.imiss25.miss20.frq | sed 's/:/ /g' | sed 1d | cut -f6 | grep -cx 0
cat data.ed.imiss25.miss20.frq | sed 's/:/ /g' | sed 1d | cut -f8 | grep -cx 0
667 + 29,248 = 29,915
The command ‘sed’ is used to replace the colons with tabs in order to list the frequency in a seperate column. ‘sed’ is also used to remove the first line ‘1d’. ‘cut -f6’ cuts out the 6th column, and with ‘grep -cx’ the occurence of exactly (‘-x’) the value 0 is counted (‘-c’).
Tip: To exclude only invariant sites, it is more convenient to filter for a minor allele count (MAC) of 1 instead of MAF.
OK, so the MAF threshold of 0.01 only removed invariant sites – back to the original task: What would be an appropriate MAF threshold to filter our data?
An appropriate MAF threshold is given in the answer below:
Show me the answer!
A MAF threshold of 0.01 would exclude all alleles that are called less than 0.36 times (1% of 36), and thus would only remove invariant sites.
A MAF threshold of 0.05 would exclude all alleles that are called less than 1.8 times (5% of 36), and thus would remove singleton sites.
A MAF threshold of 0.1 would exclude all alleles that are called less than 3.6 times (10% of 36), and thus would remove alleles where the minor allele occurs three times or less. These could still be false calls in low coverage data, but given that there are only 36 alleles in total, three alternative alleles are a good indication for a true variant.
However, to analyse population structure in this dataset, filtering for MAF 0.1 will work well.
data.ed.imiss25.miss20.maf0.1.vcf.gz
) directly as a compressed VCF file using bgzip and include all VCF header info as above.
Show me the answer!
vcftools --gzvcf data.ed.imiss25.miss20.vcf.gz --maf 0.1 --out data.ed.imiss25.miss20.maf0.1 --recode --recode-INFO-all --stdout | bgzip > data.ed.imiss25.miss20.maf0.1.vcf.gz
Show me the answer!
2.3) Exclude chromosome 7 and linked sites
VCFtools offers various options to exclude or include certain parts of your file in the analysis. Have a look at the ‘Site filtering” section in the manual.
For population structure analysis we want to exclude sites that are very close to each other and thus likely share the same information (they are in linkage disequilibrium). In addition, we also exclude chromosome 7 as we know that a large inversion on this chromosome causes many SNPs to be linked.
data.ed.imiss25.miss20.maf0.1.LD.vcf.gz
) directly as a compressed VCF file using bgzip and include all VCF header info (takes about 1 min).
Show me the answer!
vcftools --gzvcf data.ed.imiss25.miss20.maf0.1.vcf.gz --not-chr LG07 --thin 10 --out data.ed.imiss25.miss20.maf0.1.LD --recode --recode-INFO-all --stdout | bgzip > data.ed.imiss25.miss20.maf0.1.LD.vcf.gz
The command ‘––thin’ takes the first site it reads and excludes any sites that are within the specified distance from that first site, then it keeps the next site, etc. A scaffold or chromosome can be excluded using ‘––not-chr’. This flag can also be used multiple times to exclude further chromosomes.
Tip: Linkage disequilibrium-based pruning is normally more useful than thinning and possible with the software PLINK. For a tutorial on how to filter on linkage disequilibrium (LD) you can follow the WPSG2016 vcftools/PLINK exercise.
Have a look at the log-file; how many variants were excluded this time? If you like you can again compare the numbers to the output file using wc
like above. Check whether chromosome 7 was really excluded from the file.
Show me the answer!
less data.ed.imiss25.miss20.maf0.1.LD.log
After filtering, kept 173823 out of a possible 251607 Sites
bcftools view -H data.ed.imiss25.miss20.maf0.1.LD.vcf.gz | grep LG07 | head
No hit should be found.
For a comparison try: bcftools view -H data.ed.imiss25.miss20.maf0.1.vcf.gz | grep LG07 | head
.
Tip: The chromosome data is excluded from the genotype part of the VCF file, but it is still present in the VCF header!
2.4) Export the VCF to PLINK format
As a last step, we’ll use VCFtools to export the VCF into a format that we can use to analyze population structure by performing a principal component analysis (PCA).
The software PLINK is a comprehensive genome analysis toolset with an extensive list of functions. It was originally developed for human data (hence it uses all those human terms like “family” and “father”), but the new PLINK 1.9 can also be used with genomic data of non-model organisms.
Due to time constraints PLINK is not included in this activity, but if you’re interested to learn more about PLINK and how to filter on linkage disequilibrium (LD) you can follow the WPSG2016 vcftools/PLINK activity.
chrom_map.txt
to PLINK format. Use data.ed.imiss25.miss20.maf0.1.LD
as the file prefix for the new file.
Show me the answer!
vcftools --gzvcf data.ed.imiss25.miss20.maf0.1.LD.vcf.gz --plink --chrom-map chrom_map.txt --out data.ed.imiss25.miss20.maf0.1.LD
Type ls -l
to see the new files. You should have two new data files (‘.ped’ and ‘.map’, and a log file).
The smartpca program implemented in the software EIGENSOFT uses a combination of Principal Component Analysis (PCA; a technique used to emphasize variation and bring out strong patterns in a dataset) and new statistics designed specifically for genomic data. The methods are described in detail in an excellent and accessible manuscript by the authors of the program.
The EIGENSOFT smartpca needs four input files: The ‘.ped’ (containing the genotype information) and ‘.map’ (containing the chromosome/position information) files that we exported above, a ‘.pedind’ file containing individual information, and a ‘.par’ file with the specified parameters for analysis.
Show me the answer!
cat data.ed.imiss25.miss20.maf0.1.LD.ped | cut -f1-6 > tmp
Have a look at the ‘tmp’ file. It contains the family IDs and the individual IDs in the first two columns (in our case this is the same), followed by zeros that indicate ‘no information’ for the remaining columns (fatherID, motherID, sex, phenotype). In addition, we need a column to specify which individual belongs to which population.
data.ed.imiss25.miss20.maf0.1.LD.pedind
. Remove the ‘tmp’ file.
Show me the answer!
cat data.ed.imiss25.miss20.maf0.1.LD.ped | cut -c 1,2 | paste tmp - > data.ed.imiss25.miss20.maf0.1.LD.pedind
‘cut -c 1,2’ selects the first two ‘characters’, which are combined via the stdout (‘-‘) with the saved ‘tmp’ file. rm -i tmp
A template parameter file (template.par
) and a smartpca manual page (eigensoft_smartpca_readme.txt
) can be found in the analysis folder.
data.ed.imiss25.miss20.maf0.1.LD.par
and run the smartpca with the syntax ‘smartpca -p parfile > pca01_out’. Plot the PCA using RStudio in the browser using your own code or the provided R cheat-sheet 20180122_vcf_plot.R
.
Show me the answer!
smartpca -p data.ed.imiss25.miss20.maf0.1.LD.par > pca01_out
Connect to Rstudio in the browser:
http://YOUR_IP.compute-1.amazonaws.com:8787/
After plotting, you should see a plot similar to this one here.
The first principal component axis explains almost 13% of the total genomic variation and seperates the individuals of the two populations (BS and NS). The second axis explains 6.17% of the total genomic variation and shows variation within the BS population.
Another smartpca example on blue and yellow crater lake cichlids can be found here.