(Compiled for the WPSG2016 by Julia M.I. Barth)

Table of contents

Background and Aim

Please find the intro-slides here.

With the advances of sequencing technology, reading whole genomes and re-sequencing population samples has become incredibly cheap over the last years. In the field of population genetics, this progress has led to a shift from single-marker analyses to whole-genome studies, which at the same time caused a change from analysis software using graphical user interfaces to command-line based programs. Such programs are much faster and more suitable for handling very large data files, but it might take some time to get used to “not seeing your data and analyses”.

During this workshop we will focus on the analysis of genomic data sets, rather than on their generation. However, in order to prepare your data for any biological interpretation, some initial adjustments are almost always necessary. This can be a simple change of file format, or it can be that you want to remove that one individual that has a very low sequencing depth. In addition, you might also want to include different filtering steps for different analyses, to specify varying thresholds for the proportion of missing data, or for the minor-allele frequency.

During this tutorial we will primarily use the command-line based software VCFtools and PLINK, as well as some of the basic UNIX commands you learned yesterday, and we will filter and convert population data from a raw VCF file so we can use it for analyses of e.g. population structure.

Instead of working with a data set that is mapped against a well-assembled genome (like human data), we will work with data from a non-model organism where a high-quality genome is not yet available and which is thus arranged in thousands of smaller scaffolds instead of a few chromosomes. This adds a bit of complexity to handling of the data, but is likely similar to what most of you will be doing in your own work. The example data has already gone through quality filtering (e.g. it was filtered to contain only variants that have a read depth (DP) between 3 and 30, and a minimum genotype quality (GQ) of 20) and filtering for bi-allelic markers (SNPs) only. These SNPs have been called using the variant calling software GATK. If you’re interested in how to actually get to such a data set, you might have a look at the exercises of the Genomics workshop from last week.

Getting Started

All computer programs needed for this activity are installed on the Amazon server. These are:
bcftools v1.3
VCFtools v0.1.14
PLINK v1.9p – special build of PLINK v1.9 for high contig size (request from authors, or ask Julia)
R / Rstudio / gnuplot (for plotting only, whichever you prefer)

The example data file can be found here: ~/wpsg_2016/activities/vcftools_plink
Or can be downloaded here.

How to do the exercise:
The main purpose of this activity is to get you familiar with the software listed above, to work on the command line, and to do so independently.
You could easily go through the exercise and just copy and paste the code, but that way you would most likely only wear out your Ctrl, C, and V keys without getting the most out of it. Instead, use the help options and the manual if you don’t know how to do the task.
With every question or task a link to the software manual (and very often the required section of the manual) is provided. Use those links and try to find the solution BEFORE looking into the answer box!
If you get stuck, check the answer and try to understand it. If your command was not correct, you can (and should) of course copy and paste long commands rather then type them letter-by-letter. All text underlined with gray color indicates commands that you can type or copy to the terminal.
There are always alternative options and solutions for UNIX code. E.g., if you prefer cat to less, go ahead and use cat. If you’ve found another solution that works as well: Great! Just go ahead and use your solution!

If you have questions, bother the instructors! They are here to help you! 😉

And now, let’s start!

Exercise 1 – The VCF format

A VCF (Variant Call Format) file is a standardized text file format that is used to store genetic variation calls such as SNPs or insertions/deletions of multiple individuals. The full format specifications can be found here.
In the following first part of the exercise, we will learn how the information in a VCF is stored, and how we can inspect it.

Open the terminal and use change directory cd to navigate to the analysis folder including the example data file.
What is the file size of the example data file data.vcf.gz? How many bytes, or MB does it have?

Show me the answer!

cd ~/wpsg_2016/activities/vcftools_plink

In the analysis folder, type ls -l (long listing format) to see a list of the folder content including the file size:
the VCF file is 166399415 bytes.

Use ls -lh to see the content list with the file sizes printed in human-readable format: it’s 159 MB.

Tip: for more functions of ls type ls –help or man ls. This help-text concept is valid for most programs, but sometimes you have to type -h or –h instead of help.

The file data.vcf.gz is a small VCF file with reduced file size for the purpose of this exercise. Typical VCF files including full-genome sequencing data and many individuals are often several Gigabytes (GB) in size.

What is in the file?

Use head data.vcf.gz to see the first ten lines of the file.

Oups, only gibberish!
It’s a compressed file (indicated by the ending .gz). Compressing VCF files with gzip (or bgzip and indexing it with tabix) is the standard way in which VCF files are stored.
We could decompress the file, which would take some time and inflate the file size to several GB, so we’re not going to do that. Instead, we can use another tool to display the file content: “bcftools”, a software for variant calling and manipulating of files in VCF format.
Have a look at the manual here: bcftools manual, or just type bcftools on the command line to see available options.

Do you find a command that would allow us to have a look into (“view”) the compressed VCF?

Show me the answer!

From the manual:
“bcftools view [OPTIONS] file.vcf.gz [REGION […]]”

Ok, lets test that one:
bcftools view data.vcf.gz

Ah…. what’s happening?? It’s printing the whole file on the screen!

Press the “panic-button” ctrl-c (control c) to abort the command.

Can you come up with something that we can combine with bcftools view data.vcf.gz to have a sneak-peek into the file without printing everything on the screen? Try something you learned yesterday!

Show me the answer!

bcftools view data.vcf.gz | head

Tip: On some systems less or zless can also decompress files on the fly and display the content.

With these commands we can already see that we indeed are dealing with a VCF file:
##fileformat=VCFv4.1

We also see some more information about the VCF file, e.g. when it was created, which reference file was used, etc.
##fileDate=20140121
##reference=file:///work/jobs/11187331/reference.scf.fasta

These lines are part of the “VCF header”, which stores useful meta-information about the data and always starts with ##. You may want to edit these header-lines to add changes that you did to your VCF file, or you received a VCF file from colleagues and wish to inspect the header in more detail to see what they did.
Can you find a command in “bcftools” on how to separate the header from the large genotype data in order to open and edit it with a text editor?
Type bcftools view to see the options.

Found something? Try it!

Show me the answer!

bcftools view -h data.vcf.gz

Uh – it’s only the header, but still a lot of information! Save the header to a file called data.header.txt instead of printing it on the screen so you can have a look at it with a text editor.

Show me the answer!

bcftools view -h data.vcf.gz > data.header.txt

What’s the file size of data.header.txt that we extracted from the VCF, compared to the whole VCF file?

Show me the answer!

ls -lh
The header is 330972 bytes (324 KB) small. That is only 0.2% of the entire VCF, and a size we can safely open with a text editor.

If you like you can download the header to look at it with a text editor, or you can examine it right here on the command line. How would you do that?

Show me the answer!

less data.header.txt
– to go one window forward: space (or f)
– to go one window backward: b
– depending on which console program you’re using, scrolling might also work!
– to go to the end of the file: shift g
– to exit press q
less –help or man help to see more options

There can be a lot of information in the header. Different meta-information lines, e.g. “INFO”, “FILTER”, “FORMAT”, “ALT” describe types of information that are given for every variant in the VCF file. A few examples:
##INFO<ID=DP,Number=1,Type=Integer,Description=”Approximate read depth; some reads may have been filtered”> – The read depth.
##FORMAT<ID=GT,Number=1,Type=String,Description=”Genotype”> The genotypes, which, as we will see below are encoded as allele values separated by /. The allele values are 0 for the reference allele and 1 for the alternative allele. Missing data is denoted as “.”.
##contig<ID=scf7180003948298,length=1427> A list of contigs or scaffolds.
The very last line of the header specifies the column names for the genotype data. The first 8 columns are mandatory, follow by “FORMAT” and the sample IDs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT IND01 IND02 IND03 IND04 IND05 AFR01 AFR02 AFR03 AFR04 AFR05

After this line the genotypes are specified, but this is not part of the header anymore.
Any idea how we could have a sneak-peek into the genotype data without printing everything on the screen?

Tip: We know that the last line of the header starts with #CHROM, so we could grab CHROM and the following line by specifying grep -A1.

Show me the answer!

bcftools view data.vcf.gz | grep CHROM -A1
We use the bcftools view command as before, but instead of printing only the head, we read the whole file and “pipe” the output without seeing it to another program called “grep”, which searches the piped input for lines containing a match to a given pattern (here “CHROM”). For the program to also show the following line, we need to specify -A1

The program will continue the search until the end of the file, which can take several minutes. We know that CHROM occurs only in the header line, so you can abort the search after seeing the first lines with the panic-button ctrl-c (control c).

On some systems, less or zless will also work:
less data.vcf.gz | grep CHROM -A1

Tip: To see more than one line after the match, specify more line numbers with -A (e.g. -A2 for two lines), to see lines before the match, use -B.

Alternatively, you could use bcftools view -H data.vcf.gz | head, which would show the first ten lines of the VCF file after the header (so without the column names, as those are in the header!). Using bcftools view -H data.vcf.gz alone would show the entire VCF file without the header.

This is the last header line, and the first genotype line:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT IND01 IND02 IND03 IND04 IND05 AFR01 AFR02 AFR03 AFR04 AFR05
scf7180003948298 203 . T C 2899.83 PASS AC=3;AF=0.793;AN=4;BaseQRankSum=0.731;ClippingRankSum=0.731;DP=151;FS=6.8;GQ_MEAN=16;GQ_STDDEV=21.53;InbreedingCoeff=0.2421;MLEAC=47;MLEAF=0.81;MQ=60;MQ0=0;MQRankSum=0.731;NCC=11;QD=34.24;ReadPosRankSum=0.804 GT:AD:DP:GQ:PL 1/1:0,8:8:24:327,24,0 ./.:0,4:4:12:167,12,0 ./.:0,0:0:.:. ./.:0,0:0:.:. ./.:0,5:5:15:199,15,0 ./.:0,4:4:12:167,12,0 0/1:4,1:5:27:27,0,103 ./.:0,2:2:6:84,6,0 ./.:0,0:0:.:. ./.:5,0:5:15:0,15,186

In the genotype part of the VCF file, one line corresponds to one variant. It starts with the chromosome ID, the position, and an ID (missing here (.)) of the variant, and is followed by the bases of the two alleles (REF for the reference allele, and ALT for the alternative allele). The 6th column is the phred-scaled quality score, a measure for how likely it is that this site is indeed a variant. In the next column, “PASS” tells us that the variant passed all filters. In the INFO column, information related to the header INFO tags is given. E.g., AC is the allele count in genotypes for the ALT allele, AF the allele frequency of the ALT allele, and AN is the total number of alleles, etc. The FORMAT column specifies the fields and order of the data that is given with each individual, and the remaining columns list each individual with the values corresponding to the FORMAT field. The first FORMAT field is always the genotype (GT), which in our example is followed by AD (allelic depths for the ref and alt alleles), DP (read depth), GQ (genotype quality), and PL (phred-scaled genotype likelihood).

Examine the first variant in our VCF file and answer the following questions:
A) What kind of variant is it (e.g. SNP, indel (insertion/deletion))? What would you expect if it was an indel?

Show me the answer!

It’s a SNP. Columns 4 and 5 for REF (the reference allele) and ALT (the alternative allele) show T and C. If it was an indel, it could e.g. look like this:
scf7180003948298 203 . GT GTTT
scf7180003948298 203 . CAGAA CA

B) What genotype (GT) does the first individual IND01 have?

Show me the answer!

GT:AD:DP:GQ:PL 1/1:0,8:8:24:327,24,0
The first individual has the genotype 1/1, which means twice the alternative allele C (so it has C/C). The allele values in the GT field are 0 for the reference allele (what is in the REF column), 1 for the alternative allele (ALT).

C) I mentioned the field “read depth” (DP) above. Can you tell me the read depth for the first three individuals IND01-IND03?

Show me the answer!

The read depths (DP) are 8, 4, and 0.
GT:AD:DP:GQ:PL
1/1:0,8:8:24:327,24,0
./.:0,4:4:12:167,12,0
./.:0,0:0:.:.

IND02 and IND03 both have missing data at the first variant (GT ./.). While the genotype for IND03 was not called at all, the genotype for IND02 was replaced with missing data during quality filtering due to low read depth and genotype quality (GQ).

______________________________________________________________________________________________

Exercise 2 – VCFtools

Now that we know the VCF format, we can start to adjust our file for certain analyses. As I mentioned before, our VCF file is already filtered for quality, read depth, etc. (find more on variant calling and quality filtering on the Genomics workshop page).
However, you may want to use different thresholds for minor allele frequency or the proportion of missing data for specific analyses. Or you might want to filter out certain variants or chromosomes. The software VCFtools is a package that has various functions to manipulate, inspect, filter, and merge VCF files. VCFtools can also output statistics such as heterozygosity, allele frequencies, or Fst.

Go to the terminal and type man vcftools. Have a look at the “SYNOPSIS” to get to know the general commands needed to run VCFtools. If this appears cryptic, have a look at the “EXAMPLES”. Once finished, exit the man pages by typing q.

Then, let us begin to work on the data.vcf.gz file using VCFtools!

2.1) Filtering for missing data

Individuals and/or variants with a lot of missing data (often due to low ‘sequence coverage’) are not very informative and usually are not included in further analyses. In order to exclude such non-informative data and at the same time also reduce file size and speed-up analyses, the first thing we’ll do is to exclude all individuals and variants that have too much missing data.

Let’s start with checking how much missing data our individuals have.
Have a look at the vcftools manual website or the man vcftools and find a command to be used to get an estimate of the missingness per individual.
Try it!

Show me the answer!

vcftools –gzvcf data.vcf.gz –missing-indv –out data

–gzvcf defines the input format and the file to be processed. Other possible formats are –vcf and –bcf.
–missing-indv generates a file reporting the missingness on a per-individual basis. The file has the suffix “.imiss”.
–out defines the output filename prefix. If none was specified, the ouput filename prefix is simply “out”.

Have a look at the text file output data.imiss! Can you spot the odd one out?

Show me the answer!

less data.imiss

INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
IND01 1753496 0 854766 0.487464
IND02 1753496 0 739475 0.421715
IND03 1753496 0 771782 0.440139
IND04 1753496 0 1252843 0.714483
IND05 1753496 0 760947 0.43396
AFR01 1753496 0 808858 0.461283
AFR02 1753496 0 851414 0.485552
AFR03 1753496 0 772084 0.440311
AFR04 1753496 0 942482 0.537487
AFR05 1753496 0 618201 0.352553

IND04 has much more missing data then all other individuals, so we’ll exclude it from further analysis.
Can you find a command that would allow us to do so? Have a look into the vcftools manual or the man vcftools again!

Show me the answer!

–remove-indv <string>
Specifies an individual to be removed from the analysis. This option can be used multiple times to specify multiple individuals.

Keep this command in mind, we’ll use it at the next step!!

Along with filtering individuals with few data, we also want to exclude variants with too much missing data. How much missing data we allow will depend on the analyses that we’re planing to do, but for a lot of analyses (such as population structure) we don’t want variants for which half or more of the data is missing. By excluding such variants, we can fairly reduce the file size and increase the speed of follow-up analyses!

Have another look into the VCFtools manual or the man vcftools to see if there is a command that would allow us to exclude all variants with 50% or more missing data.

Show me the answer!

–max-missing <float>
This command excludes sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates that no missing data is allowed).

Now, combine the two commands to remove the one individual with a lot of missing data and to exclude sites with more than 50% missing data, and write the data to a new compressed VCF file called data.noIND04.miss0.5.vcf.gz. Make sure to also include all the INFO fields and values in the new VCF!

Tip: Look at the examples at the beginning of the VCFtools manual to find a solution on how to output a new VCF file to standard out (stdout) and then compress it with gzip.

Show me the answer!

vcftools –gzvcf data.vcf.gz –remove-indv IND04 –max-missing 0.5 –recode –recode-INFO-all –out data.noIND04.miss0.5 –stdout | gzip -c > data.noIND04.miss0.5.vcf.gz

– We use –recode to specify that we want to generate a new VCF.
–recode-INFO-all is used to keep all meta information in the genotype data. You can also just keep certain or none of the info tags.
–out specifies the name/location for the log file. If no name is specified, the log file will be named “out.log”.
–stdout is used to direct the output to standard out (stdout) so it can be piped or directly written to a new file
gzip takes the piped output and compresses it. The -c flag tells the program to not compress the original file, but write a new file (the -c flag is used in the example on the VCFtools webpage, but does actually not have an effect when the input is piped into gzip and could thus be skipped). See more options of gzip with gzip –help.

NB! Takes about 1 minute!

Verify that your command worked by checking the individual IDs, having a look at the new file size and the log-file entry, and by comparing the number of variants in the new reduced data.noIND04.miss0.5.vcf.gz to the original data.vcf.gz VCF file.

Answer the following questions:
How much smaller is the new file?
How many variants were excluded due to the applied missingness filters?
Do the numbers given in the log file and your calculations on the original file and the new file match?

Tip: You can use bcftools to check the individual IDs! Use view and grep as in Exercise 1, or have a look at the bcftools manual to see whether there is a more direct command to output the sample IDs. Check the file size like you did in Exercise 1, and use word-count (wc) to count the variants.

Show me the answer!

Get sample names:

bcftools view data.noIND04.miss0.5.vcf.gz | grep CHROM
bcftools query -l data.noIND04.miss0.5.vcf.gz
Is IND04 still listed?

Check the file size:

ls -lh
The new file is about half the size!

Check the log-file:
less data.noIND04.miss0.5.log
[…] After filtering, kept 9 out of 10 Individuals […] After filtering, kept 1005919 out of a possible 1753496 Sites […]

Compare the number of variants:

bcftools view data.vcf.gz -H | wc -l
bcftools view data.noIND04.miss0.5.vcf.gz -H | wc -l
As every line is a variant, we can simply count the lines wc -l of our VCF file without the header -H.
The numbers match the ones given in the log file.
747577 variants had more then 50% missing data.

 

2.2) Exclude the mitochondrial scaffold and very close sites

VCFtools offers various options to exclude or include certain parts of your file in the analysis. Have a look at the “Site filtering” section in the manual.

For our downstream analysis we want to exclude the mitochondrial scaffold from the VCF. We also want to exclude sites that are very close to each other and thus likely share the same information.

Make a new VCF without the mitochondrial scaffold “scf7180003949713”, and in which sites closer then 10 bp from each other are excluded. Use the file data.noIND04.miss0.5.vcf.gz as input and name the new file data.noIND04.miss0.5.noMT.thin10.vcf.gz. Compress the output and include all INFO fields as before. Check the manual for the necessary commands!

Show me the answer!

vcftools –gzvcf data.noIND04.miss0.5.vcf.gz –not-chr scf7180003949713 –thin 10 –recode –recode-INFO-all –out data.noIND04.miss0.5.noMT.thin10 –stdout | gzip -c > data.noIND04.miss0.5.noMT.thin10.vcf.gz

–thin takes the first site it reads and excludes any sites that are within the specified distance from that first site, then it keeps the next site, etc.
–not-chr is used to exclude the specified scaffold or chromosome. It can also be used multiple times to exclude further scaffolds.

Tip: Linkage disequilibrium-based pruning is normally more useful than thinning and possible with PLINK (we’ll learn about this option later).

Have a look at the log-file; how many variants were excluded this time? If you like you can compare the numbers again to the output file using word-count like above. Check whether scf7180003949713 is really excluded from the file.

Show me the answer!

less data.noIND04.miss0.5.noMT.thin10.log
141976 sites were excluded due to our filters.

bcftools view -H data.noIND04.miss0.5.noMT.thin10.vcf.gz | grep scf7180003949713
No hit should be found. Don’t interrupt – this time you need to search the file until the end (takes less than 1 minute).

Tip: The scaffold data will be excluded from the genotype part of the VCF file, but the scaffold ID will still be present in the VCF header!

 

2.3) Export the VCF to PLINK format

As a last step, we’ll use VCFtools to export the VCF into a format that can be read by the software PLINK.

Since we have thousands of scaffolds rather than only a few chromosomes, we should specify a map that defines every scaffold as one “chromosome”, otherwise the scaffold ID information will be lost during the conversion to the PLINK format. This is optional, but needed if you would like to perform analysis per scaffold as e.g. filtering for Linkage disequilibrium. We can use any alphanumeric name, so for simplicity we just use the scaffold IDs also as chromosome IDs. The chromosome map must contain every scaffold ID found in the file, so we just extract the scaffold IDs from the VCF file (not the header, see reason above!) with:

We use cut to “cut” the first column (-f 1) of the file (check cut –help for more info):
bcftools view -H data.noIND04.miss0.5.noMT.thin10.vcf.gz | cut -f 1

(don’t type this yet, see final command below)

As we have multiple variants per scaffold, we also receive every scaffold ID multiple times with the grep command. However, we only want the scaffold ID once, so we pipe the output to uniq, which reduces the list to unique IDs.
| uniq

We need the list in two columns – the original scaffold ID and the new identifier (for which we just use the scaffold ID again). We do this with the language awk to “print” the piped output twice:
| awk ‘{print $0″\t”$0}’

Using only awk ‘{print}’ would print the entire file (once). With the $ variable, we specify a certain part of the file, separated by whitespace. $0 means the entire file, $1 would be the first column, $2 the second etc. As both columns should be separated by a tab, we add “\t” in-between.

Finally, we save the output in a new text file: > data.noIND04.miss0.5.noMT.thin10.scf.chrom-map.txt

Put the parts above together and run the following command on your terminal to get the chromosome map:
bcftools view -H data.noIND04.miss0.5.noMT.thin10.vcf.gz | cut -f 1 | uniq | awk ‘{print $0″\t”$0}’ > data.noIND04.miss0.5.noMT.thin10.scf.chrom-map.txt

Have a look at the output. Does it look like the chromosome map that we need?

Show me the answer!

less data.noIND04.miss0.5.noMT.thin10.scf.chrom-map.txt

scf7180003948298 scf7180003948298
scf7180003948300 scf7180003948300
scf7180003948301 scf7180003948301

Yes, this is the format needed to use scaffold names as “chromosomes”.

Use VCFtools to export our final VCF (data.noIND04.miss0.5.noMT.thin10.vcf.gz) including the chromosome information (data.noIND04.miss0.5.noMT.thin10.scf.chrom-map.txt) to PLINK format. Use data.noIND04.miss0.5.noMT.thin10 as the file prefix for the new file. Check the manual again for the commands you need.

Show me the answer!

vcftools –gzvcf data.noIND04.miss0.5.noMT.thin10.vcf.gz –plink –chrom-map data.noIND04.miss0.5.noMT.thin10.scf.chrom-map.txt –out data.noIND04.miss0.5.noMT.thin10

Type ls -l to see the new files.

Show me the answer!

data.noIND04.miss0.5.noMT.thin10.map
data.noIND04.miss0.5.noMT.thin10.ped

______________________________________________________________________________________________

Exercise 3 – PLINK

PLINK is a comprehensive genome analysis toolset with an extensive list of functions. It was originally developed for human data (hence has all those human terms like “family” and “father”), but the new PLINK 1.9 can also be used with genomic data of non-model organisms. The standard version will allow a maximum of 59 chromosomes, so here we use a special build that supports the use of up to ~64000 scaffolds or contigs.

3.1) Input formats

PLINK accepts VCF files as input, but the preferred format to work with in PLINK are the PLINK text files with endings .ped (and .map), and the smaller binary PLINK files with endings .bed (+ .bim + .fam). Do not confuse this .bed with the UCSC Genome Browser’s BED format, which is totally different. In PLINK, the input data file is split in 2 (or 3) files, each containing solely information on either genotypes and sample information (.ped), only genotypes (.bed), variant positions and information (.map / .bim), or sample information (.fam).

Have a look at the PLINK .ped and .map files you exported with VCFtools!
Tip: Turn off line-wrap when using less with the flag -S

Show me the answer!

less -S data.noIND04.miss0.5.noMT.thin10.ped

IND01 IND01 0 0 0 0 C C 0 0
IND02 IND02 0 0 0 0 0 0 0 0
IND03 IND03 0 0 0 0 0 0 A G
IND05 IND05 0 0 0 0 C C A A
AFR01 AFR01 0 0 0 0 C T A G
AFR02 AFR02 0 0 0 0 C C A G
AFR03 AFR03 0 0 0 0 0 0 A G
AFR04 AFR04 0 0 0 0 0 0 0 0
AFR05 AFR05 0 0 0 0 C C A G …

less data.noIND04.miss0.5.noMT.thin10.map

scf7180003948298 scf7180003948298:263 0 263
scf7180003948298 scf7180003948298:313 0 313
scf7180003948298 scf7180003948298:1268 0 1268

The .ped file contains the variant information with one allele per column (meaning 2 columns are needed for the two alleles of one variant). It has one line per sample with the first six columns being:
1 Family ID
2 Within-family ID (‘IID’; cannot be ‘0’)
3 Within-family ID of father (‘0’ if missing)
4 Within-family ID of mother (‘0’ if missing)
5 Sex code (‘1’ = male, ‘2’ = female, ‘0’ = unknown)
6 Phenotype value (‘1’ = control, ‘2’ = case, ‘-9’/’0’/non-numeric = missing data)

7-n The seventh and eighth columns are the alleles for the first variant, the second variant, etc. Missing data is coded as 0 (or -9).

For our data we do not have information about the father, mother, the sex, or phenotype, so those columns are all set to missing data (0 or -9).

The .map file contains the variant positions. It has 4 columns:
1 Chromosome code. PLINK 1.9 also permits contig or scaffold names here, but most older versions do not.
2 Variant identifier
3 Position in morgans or centimorgans (optional; ‘0’ = unknown)
4 Chromosome position in bp

A detailed description of the file formats is available here.

We went through quite a few steps to obtain our file in PLINK format. Quickly verify that our data is still OK by comparing the genotypes of the first variant (scf7180003948298:263) in the new .ped/.map files with the genotypes in the original VCF file (data.vcf.gz).

Show me the answer!

less -S data.noIND04.miss0.5.noMT.thin10.ped
The alleles for the first variant are in the 7th and 8th columns:

IND01 IND01 0 0 0 0 C C
IND02 IND02 0 0 0 0 0 0
IND03 IND03 0 0 0 0 0 0
IND05 IND05 0 0 0 0 C C
AFR01 AFR01 0 0 0 0 C T
AFR02 AFR02 0 0 0 0 C C
AFR03 AFR03 0 0 0 0 0 0
AFR04 AFR04 0 0 0 0 0 0
AFR05 AFR05 0 0 0 0 C C …

bcftools view -H data.vcf.gz | grep ‘scf7180003948298.*263’

GT:AD:DP:GQ:PL
0/0:7,0:7:21:0,21,269
./.:3,0:3:6:0,6,90
./.:3,0:3:9:0,9,115
./.:0,0:0:.:.
0/0:7,0:7:21:0,21,246
0/1:1,5:6:24:181,0,24
0/0:7,0:7:21:0,21,231
./.:3,0:3:9:0,9,117
./.:2,0:2:6:0,6,84
0/0:7,0:7:21:0,21,269

Yes, the information is identical (the line in gray indicates the genotype data for the removed individual IND04).

Tip: If you want to look at any other variant in the .ped file, which is not at the beginning, you can extract one or more variants by specifying their variant ID in a text file and using PLINK with the flag –extract (see the manual here: Input filtering)

By looking at the .ped file, you might have noticed that the “Family IDs” in column 1 (i.e. population/sampling site IDs) are the same as the “Within-family IDs” in column 2 (i.e. individual IDs). However, for analyses of population structure, these should contain information about the population from which the individual was sampled. Let us change that!

We will update the IDs using PLINK. Have a look at the general usage info in the PLINK manual or just type plink in the terminal to get information about how to use PLINK.

Once you’re familiar with the command structure, change the “Family IDs” to “IND” for the samples IND01-05 and to “AFR” for the samples AFR01-AFR05, and save the output in a new file in text format (.ped) with the file prefix data.noIND04.miss0.5.noMT.thin10.newID.
Have a look at the “Update sample information” section of the PLINK manual for how to change the IDs (see Tip below!), and at the “Data management” section for how to save as a new .ped file.

Tip: You have to make a text file with the updated IDs first.

Show me the answer!

– Make a text file with the old IDs by “cutting” (cut) the first two columns (-f 1,2) of the file and saving it as a new file (>).
less data.noIND04.miss0.5.noMT.thin10.ped | cut -f 1,2 > updateID.txt

– Edit the text file and add two new columns with the new “Family” and “Within-family” IDs using your favorite unix text editor:
nano updateID.txt

IND01 IND01 IND IND01
IND02 IND02 IND IND02
IND03 IND03 IND IND03
IND05 IND05 IND IND05
AFR01 AFR01 AFR AFR01
AFR02 AFR02 AFR AFR02
AFR03 AFR03 AFR AFR03
AFR04 AFR04 AFR AFR04
AFR05 AFR05 AFR AFR05

Save the file by pressing: Ctrl o enter, and exit nano with: Ctrl x

– Update the IDs using PLINK:
plink –file data.noIND04.miss0.5.noMT.thin10 –update-ids updateID.txt –recode –out data.noIND04.miss0.5.noMT.thin10.newID

With –file the input .ped file is specified, with –update-ids the updated ID file, –recode tells PLINK to save the output as .ped file, and –out defines the output prefix.

Error: Invalid chromosome code ‘scf7180003948298’ on line 1 of .map file.
(Use –allow-extra-chr to force it to be accepted.)

We get an error message because our chromosome code doesn’t correspond to the expected code of “chr 1”, “chr 2”, “chr 3”, etc, max. 59). We can fix this by allowing extra chromosomes with the –aec flag. See the manual here.

Try again with the –aec flag!

Show me the answer!

plink –file data.noIND04.miss0.5.noMT.thin10 –aec –update-ids updateID.txt –recode –out data.noIND04.miss0.5.noMT.thin10.newID

Inspect the new files and the log file.
You may notice the following lines in the log file:

Ambiguous sex IDs written to data.noIND04.miss0.5.noMT.thin10.newID.nosex
By default, samples with ambiguous sex have their phenotypes set to missing when analysis commands are run, and the IDs of those sampes are written to a file with the ending “.nosex”. You can add the flag –allow-no-sex to prevent this, but since we do not have phenotypes anyway, we do not bother.

[…] Performing single-pass .bed write (863943 variants, 9 people).
–file: data.noIND04.miss0.5.noMT.thin10.newID-temporary.bed +
data.noIND04.miss0.5.noMT.thin10.newID-temporary.bim +
data.noIND04.miss0.5.noMT.thin10.newID-temporary.fam written. […]

PLINK 1.9 automatically converts most non-binary formats (such as .ped, .vcf) to a temporary binary .bed format before starting analysis. With our command above, it re-codes the temporary binary file after changing the family IDs back to the text format and saves it in .ped format.
For large genomic data files it is more convenient to continue working with the smaller binary format, especially if you’re planing on multiple analyses.

Change the command above so that your file is saved in binary .bed/.bim/.fam instead of .ped/.map. Use this section of the manual for help.

Show me the answer!


plink –file data.noIND04.miss0.5.noMT.thin10 –aec –update-ids updateID.txt –make-bed –out data.noIND04.miss0.5.noMT.thin10.newID

Tip: You may also just use –keep-autoconv; however, this flag does not work in combination with regular filtering operations like –update-ids.

Check the new files in your folder (ls -l). You’ll find the binary .bed, and the accompanying position and sample text files .bim and .fam, as well as a log file.
Have a look at the .bim and .fam files to see how they are formatted!

Show me the answer!

data.noIND04.miss0.5.noMT.thin10.newID.bim

scf7180003948298 scf7180003948298:263 0 263 T C
scf7180003948298 scf7180003948298:313 0 313 G A
scf7180003948298 scf7180003948298:1268 0 1268 A G

The first four columns of the .bim file are the same as in the .map file. In addition, column 5 and 6 show the alleles at the variant.

less data.noIND04.miss0.5.noMT.thin10.newID.fam

IND IND01 0 0 0 -9
IND IND02 0 0 0 -9
IND IND03 0 0 0 -9
IND IND05 0 0 0 -9
AFR AFR01 0 0 0 -9
AFR AFR02 0 0 0 -9
AFR AFR03 0 0 0 -9
AFR AFR04 0 0 0 -9
AFR AFR05 0 0 0 -9

The .fam file has the same columns as the first six columns in the .ped file.

As all the meta-information is now stored in .bim and .fam, the binary .bed file contains only the genotypes. As it is a binary format, we cannot easily look at it, but if you’re interested in the coding and how to see the file, please see the manual here.

 

3.2) Filtering

3.2.1) Minor allele frequency (MAF)

We already used VCFtools to filter our file for missing data, the mitochondrial scaffold, and very close sites (within 10bp).
Now we also want to filter on minor allele frequency (MAF), the frequency of the allele that appears less often. Filtering on MAF can be done for multiple reasons. One of them is that in samples sequenced with low coverage variants with very low MAF frequently present false calls, and can be obstructive by adding noise to the data. If the data will be used for a genome-wide association study (GWAS), usually a rather high MAF threshold is applied, as it requires very strong statistical power to make meaningful statements about very rare alleles. When interested in population structure, alleles present in only one individual do not add information to the inference of structure.
Therefore, the optimal MAF filtering threshold depends on the type of data you have, the analysis you are planning to do, the data quality, and the sample size.
Ignoring the quality, what would you think would be an appropriate threshold to filter on for population structure analysis?

If you have no idea what an appropriate threshold could be, read hint 1:

Show me hint 1!

The default MAF threshold in PLINK is 0.01. Filtering for MAF 0.01 means that the minor allele appears with a frequency of 1%.
It may help to output the allele frequencies using the –freq flag and then plot them in a histogram using R or your favorite plotting program.
Note, the file is ~62 MB, so downloading may take a while and it is recommended to use the plotting options available on your Amazon instance (R, RStudio, RStudio in the Browser (http://YOUR_PUBLIC_DNS:8787))
plink –bfile data.noIND04.miss0.5.noMT.thin10.newID –aec –freq –out data.noIND04.miss0.5.noMT.thin10.newID

Note: as we converted the .ped file to .bed, we specify –bfile to load the binary file as input instead of –file.
The allele frequencies are written to a file with the name data.noIND04.miss0.5.noMT.thin10.newID.frq

If you’re still lost as to which threshold to choose, read hint 2.
If you’ve decided on a threshold, go ahead and apply it. Check this section of the manual on how to filter for MAF. Save the new file in binary .bed format with a new prefix “data.noIND04.miss0.5.noMT.thin10.newID.maf.[your threshold]” and proceed to look up the answer below.

Show me hint 2!

After excluding IND04, we have only 9 individuals left in our data set (thus maximally 18 alleles per site). If we filter for MAF 0.01, a variant is excluded if its minor allele is found less than 0.18 times (1% of 18). Would this filter anything at all?

If you’ve read hint 2 and you don’t know if MAF 0.01 would filter anything or not, you may check the manual on how to filter for MAF and just filter for MAF 0.01 to see what happens! Save the output in binary .bed format and name the file test.maf0.01. Have a look at the log file. Have any variants been removed?

Show me hint 3!

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID –aec –maf 0.01 –out test.maf0.01 –make-bed

From the resulting log file:
… 24645 variants removed due to minor allele threshold(s) …

Oh, 24645 variants have a MAF below 0.01 and have been removed? How is this possible? What does it mean?

Usually we would not expect to have non-variant variants 😉 in our data file. However, the individual (IND04) that we removed before might have carried the only variant allele at a number of sites, and after excluding this individual, these sites might therefore be invariant.
Have a look at the genotypes at the site scf7180003985401:19789 in the original VCF:
bcftools view -H data.vcf.gz | grep ‘scf7180003985401.*19789’

GT:AD:DP:GQ:PL
0/0:9,0:9:24:0,24,314
0/0:8,0:8:21:0,21,307
0/0:10,0:10:30:0,30,391
0/1:2,4:6:65:149,0,65
./.:9,0:9:0:0,0,253
./.:7,0:7:0:0,0,205
0/0:11,0:11:30:0,30,450
0/0:9,0:9:24:0,24,360
0/0:9,0:9:21:0,21,315
0/0:7,0:7:21:0,21,236

The fourth individual is the only one with the alternative allele. We removed it, so we now expect a MAF of 0 for site scf7180003985401:19789. If you haven’t done so before, you may now want to have a look at the unfiltered allele frequencies:

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID –aec –freq –out data.noIND04.miss0.5.noMT.thin10.newID

Have a look at the content of the frequency file ending with .frq:
less data.noIND04.miss0.5.noMT.thin10.newID.frq

CHR SNP A1 A2 MAF NCHROBS
scf7180003948298 scf7180003948298:263 T C 0.1 10
scf7180003948298 scf7180003948298:313 G A 0.4167 12
scf7180003948298 scf7180003948298:1268 A G 0.5 12

Then check the MAF for the site scf7180003985401:19789:
less data.noIND04.miss0.5.noMT.thin10.newID.frq | grep ‘scf7180003985401.*19789’

CHR SNP A1 A2 MAF NCHROBS
scf7180003985401 scf7180003985401:19789 0 G 0 14

The MAF is indeed 0!

Count the number of sites with MAF 0 by checking how often “0” occurs in column 5 of the .frq file:
less data.noIND04.miss0.5.noMT.thin10.newID.frq | tr -s ‘ ‘ | cut -d ‘ ‘ -f 5 | grep -cx 0
24645

The columns in the .frq frequency file are divided by multiple spaces, so simply using cut on the 5th column does not work. Instead, we pipe the less output to tr (check tr –help for more info) and “squeeze” multiple spaces into one with the -s flag. The output from tr is then piped into cut, but here we have to specify that we have spaces instead of the default delimiter tab. We do this with the -d flag (check cut –help for more info). Then we can use -f for the fifth column, pipe the output to grep, and count (-c) exactly (-x) how often 0 occurs (-x means that the match to ‘0’ is counted by grep only when the whole line contains only ‘0’ and nothing else; e.g., the ‘0’ in the number 10 is not counted with -x but would be counted otherwise).

To exclude only completely “invariant” variants, it is actually more convenient to filter for minor allele count (MAC) instead of MAF. Using MAC 1 as a threshold will only include variants for which the minor allele is found at least once.

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID –aec –mac 1 –out test.mac1 –make-bed
Check the log file; how many variants have been removed?
24645 variants removed due to minor allele threshold(s)

Great, the number of removed variants is equal to the number of sites with a MAF of 0.
So, back to the original task. What would be an appropriate MAF threshold to filter our data?

An appropriate MAF threshold is given in the answer below:

Show me the answer!

After excluding IND04, we have only 9 individuals left in our data set (thus maximally 18 alleles per site). If we filter for MAF 0.01, a variant is excluded if its minor allele is found less than 0.18 times (1% of 18). We would not expect to have alleles that exist 0.18 times (they exist either not at all (0 times) or once (1 time) or more often). So the PLINK default of 0.01 is too low a threshold for this data set and would only exclude completely invariant sites.

In contrast, the MAF threshold of 0.01 might be more appropriate if we had 200 individuals (thus 400 alleles), as it would then exclude all alleles that are called less than four times (1% of 400).

For only 9 individuals (thus 18 alleles), one could choose a MAF threshold of 0.1, which would exclude all alleles that are called less than 1.8 times (10% of 18), and thus would remove singleton sites. Sites with a MAF of 2 could still be false calls for data with low coverage, but out of only 9 alleles, it is certainly possible to sample a true allele only twice.

Go ahead and apply the proposed MAF threshold. See this section of the manual for the commands you need. Save a new .bed file with the prefix data.noIND04.miss0.5.noMT.thin10.newID.maf0.1

Show me the answer!

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID –aec –maf 0.1 –out data.noIND04.miss0.5.noMT.thin10.newID.maf0.1 –make-bed

For reading in the binary .bed file instead of the text .ped file you have to specify –bfile instead of just –file.

 

3.2.2) Linkage disequilibrium

PLINK has several options to estimate linkage disequilibrium (LD), the non-independent segregation of two loci, and filter variants according to certain distances and thresholds.

For biallelic markers, one of the most commonly used measures for LD is the squared coefficient of correlation (r2) (Hill & Robertson, 1968). It ranges between 0 and 1, where 0 indicates no correlation (= no linkage) and 1 perfect correlation (= perfect linkage disequilibrium).

3.2.1.A) R2 statistics

Estimate the r2 values for the scaffold scf7180003983566. We want to compare only variants that are within 1000 bp of each other, but within this distance, all variants should be compared with all others.

NB! We will only use one scaffold for this exercise since calculating r2 performs and saves pairwise comparisons between loci, which can easily lead to an unmanageable huge data file!! Make sure to specify that you only want to analyze one scaffold with the –chr flag (–chr scf7180003983566).

Check the manual to see the options for analyzing individual scaffolds, and estimating r2.
Save the output with the prefix scf7180003983566.1kb.r0

Tip: Read the 9th and 10th bullet point of the LD statistics reports section to see how to specify a window size and variant distance, and how to request an output listing all r2 values.

Show me the answer!

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID.maf0.1 –aec –chr scf7180003983566 –r2 –ld-window-kb 1 –ld-window 1000 –ld-window-r2 0 –out scf7180003983566.1kb.r0

– select the scaffold with –chr
– specify that we want to calculate r2 –r2
– specify that we want to look at all variants within 1000 bp (1kb) with –ld-window-kb
– choose a high-enough number for –ld-window so that all variants within the 1kb window will be compared
– specify that we want to see all r2 values with –ld-window-r2

NB! If the –ld-window-kb, –ld-window, and –ld-window-r2 flags are not specified, PLINK will apply the default thresholds of 1000 kb for the window size, 10 variants as minimum distance, and it will not report r2 values with an r2 less than 0.2.

Have a look at the output and try to understand the different columns!

Show me the answer!

less scf7180003983566.1kb.r0.ld

CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
scf7180003983566 450 scf7180003983566:450 scf7180003983566 541 scf7180003983566:541 0.166667
scf7180003983566 450 scf7180003983566:450 scf7180003983566 1260 scf7180003983566:1260 0.166667
scf7180003983566 450 scf7180003983566:450 scf7180003983566 1271 scf7180003983566:1271 0

The .ld output file has 7 columns:
1 Scaffold ID of variant A
2 Position of variant A
3 ID of variant A
4 Scaffold ID of variant B
5 Position of variant B
6 ID of variant B
7 r2 between variant A and B

Roughly, how many variants are linked according to the r2 values?
Do the r2 values between two sites change with increasing distance?

Plot a histogram of the distribution of r2 values to answer the first question and plot the distance between the compared variants against the r2 values to answer the second question.
Use the plotting options available on your Amazon instance (R, RStudio, RStudio in the Browser (http://YOUR_PUBLIC_DNS:8787)), or download the file to your computer.

Show me the answer!

Plot it in R:
setwd(“PATH”) #set the working directory
ld head(ld) #display the first lines of the file
hist(ld$R2) #get a histogram of the distribution of r2 values

distance ld_dist plot(ld_dist$distance,ld_dist$R2) #make a scatterplot of the distance vs r2
abline(lm(ld_dist$R2~ld_dist$distance), col=”red”) #add a regression line

In the histogram you can see that most of the SNPs are not, or only weakly correlated.
When we check the scatterplot, we see that LD (in terms of r2) decreases as a function of distance.
The main force influencing the level of LD in certain genomic regions is recombination, and the frequency of recombination between two locations depends on their distance on the chromosome. Therefore, for sites sufficiently distant from each other the frequency of crossovers between them is high enough to remove the correlation between alleles.

This method can also be applied to a whole genome to obtain a genome-wide estimate of linkage decay.

3.2.1.B) R2 pruning

We already filtered our data set for variants that are in close proximity (and thus possible LD) with VCFtools. However, with LD-based pruning we are able to not arbitrarily delete close variants, but restrict our filtering to only those SNPs that are next to each other AND genetically linked based on r2 values.

Can you come up with a command to produce a new data set where all variants within a window of 1 kb and an r2 above 0.8 are removed? Set the step-size (the number of SNPs to shift the window at each step) to 1, so all variants withing 1 kb will be compared with each other.
This is a two-command task. Use the files with prefix data.noIND04.miss0.5.noMT.thin10.newID.maf0.1 as input, and name the output with prefix data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8.

Check the Variant pruning section of the manual for a hint on how to do this.

Show me the answer!

The threshold to filter the linked and close variants is set with –indep-pairwise, and needs three parameters. The first one is the window size in variant counts or in kilobases if the ‘kb’ modifier is present (1 kb). The second is the step size (in variant counts) that shifts the window (here 1). And the last parameter is the pairwise r2 threshold (0.8).
This command produces two text files: The file ending with .prune.in contains the IDs of all variants in linkage equilibrium according to our thresholds (“good guys”), and the file ending with .prune.out contains the IDs of variants in linkage disequilibrium, exceeding our threshold (“bad guys”).
The stdout and stderr are redirected to &> /dev/null, since PLINK will output a long record of how many variants are pruned per chromosome. Try what happens without &> /dev/null if you’re interested.

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID.maf0.1 –aec –indep-pairwise 1 kb 1 0.8 –out data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8 &> /dev/null

With the following 2nd command, we exclude the variants exceeding our threshold from the data file and save it with a new name:

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID.maf0.1 –aec –exclude data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8.prune.out –out data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8 –make-bed

Bonus task:
If you like, you can check again how many variants have been removed, and if the numbers are consistent with the variants in the files.
You can also produce another r2 report for the filtered file (like we did above) and plot the values again in a histogram and distance plot. In which way are the plots of the LD pruned file different from the plots we produced before?

Finally, can you find out how many variants we filtered out in total compared to the raw data file data.vcf.gz?

Show me the answer!

less data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8.bim | wc -l
351264

bcftools view -H data.vcf.gz | wc -l
1753496

In total, we excluded 1,402,232 variants – almost 80%.

Congratulations! You produced a nicely filtered variant file, which can now be exported to different formats and analyzed!

3.3) Export formats

PLINK has various options to export the data into different formats.
Find a list of these formats here.

We already know how to save the data in PLINK .ped format (–recode) and in binary .bed format (–make-bed).
In the same way we can save our file in a format which is suitable to import to R and analyze for example with the R package adegenet (see the next exercise tonight!). Or you can export it in STRUCTURE format, or again in VCF format, or in several other formats.

Export the final data file data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8 in the format suitable for the R package adegenet and you have the possibility to perform, e.g., a principal component analysis (PCA) with it tonight!

Tip: We need the one “without the dominance component”.

Show me the answer!

plink –bfile data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8 –aec –out data.noIND04.miss0.5.noMT.thin10.newID.maf0.1.ldkb1r0.8 –recode A

 

Survey

Please let us know how you got along with the exercise in this very short opinion poll.