VCF overview
Compiled by Julia M.I. Barth, 10 January 2025
Background and Aim
In this exercise, we will explore how the information is stored in a VCF file, and some simple ways of how we can inspect it. This will also serve as a way to practice basic UNIX skills.
The data set
We will work with a data set containing genomic data of 204 individuals of Atlantic cod (Gadus morhua) sampled at seven sites in the Atlantic, North Sea, and Baltic Sea (Barth JMI et al. (2019) Disentangling structural genomic and behavioural barriers in a sea of connectivity. Mol Ecol 28, 1394–1411).
How to do this activity
- Connect to your Amazon instance via the terminal using SSH.
- Questions or activities are indicated with
Q . - All text in
red on gray background
indicates commands or file names that you can type or copy to the terminal. Text inblack on gray background
refers to terminal outputs. - Try to work independently using the provided manuals and information, discuss questions with your neighbor. If you get stuck, check the answer-box:
Show me the answer!
Oh no, only if you get really stuck!! First try to find the answer yourself!
There are often alternative options and other solutions to the tasks. E.g., whether you prefer cat
over less
doesn’t matter in most cases – go ahead and use your solution! Don’t hesitate to ask the workshop team if you have questions!
1) The Variant Call Format (VCF)
VCF (Variant Call Format) is a standardized text file format that is used to store genetic variation calls such as SNPs or insertions/deletions. The full format specifications and valuable information about the different tags can be found here.
/home/wpsg/workshop_materials/21_vcf_overview
.
What is the file size of cod204.lg05.1.vcf.gz
? How many bytes and how many MB does it have?
Show me the answer!
cd /home/popgen/workshop_materials/21_first_steps_genomics
ls -l
“long listing format” to see a list of the folder content including the file size: the VCF file is 320855142
bytes.ls -lh
to see the content list with the file sizes printed in “human-readable” format: it’s 306 M
(MiB, to see the the size in actuall megabyte (MB) use: ls -l --block-size=MB
.
Tip: for more functions of ls
type ls --help
or check the manual with man ls
.
The file cod204.lg05.1.vcf.gz
is a very small VCF file with reduced file size for the purpose of this activity. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.
cod204.lg05.1.vcf.gz
?
Show me a hint!
If you used head cod204.lg05.1.vcf.gz
you will only see gibberish. Why?
It’s a compressed file (indicated by the ending “.gz”).
Compressing a VCF file with gzip or bgzip and indexing it with tabix is the common way in which huge VCF files are stored.
To see the first ten lines, you could decompress the entire file. However, this is not advisable as it would take several minutes (with larger files even hours) and it would inflate the file size to several gigabyte (GB).
Instead, you can make use of a program that allows you to view a part of the decompressed the VCF file without decompressing the whole file. There are several different options and programs to do this – can you think of one?
Show me the answer!
zcat cod204.lg05.1.vcf.gz | head
(zcat
is decompressing the file, but head
tells the commands to stop decompressing any further after ten lines are printed)Or you could use:
zless cod204.lg05.1.vcf.gz
(similarly, zless
only decompresses the part you are actually displaying; exit zless just like less by pressing q
)Or you could use:
bcftools view cod204.lg05.1.vcf.gz | head
(BCFtools is a software for variant calling and manipulation of VCF files. We will extensively work with BCFtools in the remaining exercise.)
Tip: Check out what these programs are doing with --help
or man
.
The first line is already telling you that this is indeed a VCF file:
##fileformat=VCFv4.2
These first lines, starting with ##
, are part of the “VCF header”, which stores meta-information about the data. If you use zless
you can scroll down to see more of the header and also the genotype data, which make up the majority of the file.
There can be a lot of information in the header. Different meta-information lines, e.g. “INFO”, “FILTER”, “FORMAT”, “ALT” describe types of information that are given for every variant in the VCF file. A few examples:
##INFO<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
: The number of of often the alternative allele is occuring.##FORMAT<ID=GT,Number=1,Type=String,Description="Genotype">
: The genotypes coded as allele values separated by/
(or|
if the data is phased).##contig<ID=LG01,length=28303952>
: Part of a list of chromosomes, contigs, or scaffolds of the reference assembly.
The allele values are 0
for the reference allele and 1
for the alternative allele. If there are more alternative alleles the second alternative allele will have the value 2
, the third 3
, etc. Missing data is denoted by dots .
.
For more information about the different tags, have a look at the VCF file format description.
The very last line of the header specifies the column names for the genotype data. The first eight columns are mandatory, follow by “FORMAT” and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AVE1409002 AVE1409003 AVE1409004 [...] TVE130512 TVE130513 TVE130514
You may want to edit these header lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to see what they did.
cod204.lg05.1.header.txt
.
To find the command, have a look at the BCFtools manual, or just type bcftools
on the command line to see the available options.
Give me a hint!
bcftools view
option. Either type bcftools view
on the command line, or have a look the manual.
Show me the answer!
bcftools view -h cod204.lg05.1.vcf.gz > cod204.lg05.1.header.txt
With
view -h
only the VCF header is selected (the header is small, therefore the small “h”). By using the capital -H
the header is suppressed and only the “large” body of genotype data is considered.
cod204.lg05.1.header.txt
, compared to the whole VCF file?
Show me the answer!
ls -l
The header is 7476 bytes (7 KB) small. This is only about 0.002% of the entire VCF, and a size we can safely open with a text editor.
##fileformat=VCFv4.2
stating the date of today: ##fileDate=20200121
.
Save the new header as: cod204.lg05.1.header.ed.txt
. Use BCFtools to replace (or “rehead”) the VCF file cod204.lg05.1.vcf.gz
with the new edited header and save the re-headed VCF as cod204.lg05.1.ed.vcf.gz
.
Make sure your changes took effect and your file is still valid by visually inspecting the file and the file size of cod204.lg05.1.ed.vcf.gz
.
Tip: Type bcftools
to see general options for the program; then use bcftools reheader
to see the more specific options for the reheader function.
Show me the answer!
nano cod204.lg05.1.header.txt
or vi cod204.lg05.1.header.txt
Add
##fileDate=20200121
, save the file (nano: “Ctrl + o”, vi: “:w”; don’t forget to save under the new name), and close the text editor (nano: Ctrl + x, vi: “:q”)bcftools reheader -h cod204.lg05.1.header.ed.txt cod204.lg05.1.vcf.gz -o cod204.lg05.1.ed.vcf.gz
(with -h
the header file is specified, with -o
the new output file name)With
zless cod204.lg05.1.ed.vcf.gz
you can check the header lines, and by using ls -lh
you can see if the file has still roughly the same size.
In the genotype part of the VCF file, one line corresponds to one variant site. Each line starts with the chromosome ID “CHROM”, the position “POS”, and the variant “ID”, and is followed by the bases of the two alleles (“REF” for the reference allele, and “ALT” for the alternative allele(s)). The 6th column (“QUAL”) is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, “PASS” tells us that the variant passed all filters.
In the “INFO” column, information related to the header info tags for each variant site is given (e.g., AC is the allele count in genotypes for the “ALT” allele(s), AF the allele frequency of the “ALT” allele, and AN is the total number of alleles, etc. We will use some of the “INFO” field specifications later on for hard quality filtering).
The “FORMAT” column specifies the fields and order of the data that is given with each individual for each genotype and the remaining columns list each individual with the values corresponding to the “FORMAT” field. The first “FORMAT” field is always the genotype (GT), the remaining fields are variable, but in our data file, GT is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood):
This is how the first variant of the genotype part of the VCF file looks like:
LG05 338 . A G 93.01 . AC=1;AF=0.0005862;AN=408;BaseQRankSum=0.674;ClippingRankSum=0;DP=8975;ExcessHet=3.0357;FS=0;InbreedingCoeff=-0.0104;MLEAC=1;MLEAF=0.0005862;MQ=56.57;MQRankSum=0;QD=11.63;ReadPosRankSum=1.24;SOR=0.148;AC_Het=1;AC_Hom=0;AC_Hemi=0 GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855 0/0:9,0:9:27:0,27,345 0/0:10,0:10:30:0,30,381 [...] 0/0:7,0:7:21:0,21,248
Show me the answer!
wc --help
to see the options.We know that the genotype part of the VCF file contains one line per variant site, meaning that we can use BCFtools to suppress the header and forward (pipe;
|
) the output to wc -l
. Using the flag -l
means “count the lines”:
bcftools view -H cod204.lg05.1.vcf.gz | wc -l
(takes ~1 min)
328253
A faster way is to use the BCFtools stats command, and either save and display the output, or grep
the line with the “number of records”:
bcftools stats cod204.lg05.1.vcf.gz | grep "number of records:"
SN 0 number of records: 328253
A) What kind of variants do you see (e.g., SNP, insertion/deletion)?
Show me the answer!
zless -S cod204.lg05.1.vcf.gz
The first variant site (“338”) is a SNP. Columns four and five for “REF” (the reference allele) and “ALT” (the alternative allele) show A and G.
The second variant site (“357”) is a deletion. Column four contains multiple bases, while column five shows a G.
LG05 338 . A G
LG05 357 . GTGGGCGCTACTCCAGGCAAAGGAGGACTCTGTACCGGAGGCTTGGATCAGCT G
Sometimes you may also see an asterisk (*), which represents a spanning deletion.
B) What is the genotype (GT) of the first individual (AVE1409002) at the first variant site (LG05 338)?
Show me the answer!
0/0
, which means twice the reference allele A (A/A). The allele values in the GT field are 0 for the reference allele (the “REF” column) and 1 for the first alternative allele (“ALT”). GT:AD:DP:GQ:PL 0/0:20,0:20:57:0,57,855
C) What is the genotype read depth (DP, specified as “FMT/DP” to distinguish it from the mean DP “INFO/DP” per variant site) for the first three individuals (AVE1409002, AVE1409003, AVE1409004) at the first variant site (LG05 338)?
Show me the answer!
GT:AD:DP:GQ:PL
0/0:20,0:20:57:0,57,855
0/0:9,0:9:27:0,27,345
0/0:10,0:10:30:0,30,381