SmartPCA exercise - Evolution and Genomics

Milan Malinsky, 22 January 2018

Background and aim

In this exercise, you are going to learn how to determine whether your samples come from one homogenous population or whether there is some population structure. The smartpca program uses a combination of Principal Component Analysis (PCA; a standard statistical method) and new statistics designed specifically for genomic data to address this question. The methods are described in detail in an excellent and accessible manuscript by the authors of the program.

We are going to use a dataset that has been prepared very similarly to that in the file formats exercise (minor allele frequency > 5%, converted into plink format, LD filtering to remove linked sites).

Lake Itamba exercise introduction

The dataset includes 70,818 genome-wide SNPs from 30 individual cichlid fish from the crater lake Itamba in southern Tanzania. We want to know whether there is some population structure; we think there may be some, because some fish have a more blueish colour while other fish have a more yellowish colour. Maybe they are incipient species? Or maybe colour is insignificant. But, maybe there are still two or more genetically distinct populations separated by something we can’t see? (e.g. the slender long fish are genetically a different population from the thick short ones? or maybe the size of eyes matters? or the sound the fish make when they swim? or whatever else…).

We are going to work in the ~/workshop_materials/01_unix_intro/smartPCA_activity folder. Have a look at the file Itamba_plink_LD0.2.ped (remember use less -S). The ‘Family ID’ and ‘Within-family ID’ are the same, the samples are unrelated (no families as far as we know), sex is set to unknown, and phenotype is also set to unknown. In fact, none of this matters, since smartpca uses this file only to read the SNP values (column 7 onwards). The Itamba.pedind file is where smartpca reads individual sample IDs (column 2) and pre-assigned population labels (column 6; in our case that’s the ‘blue’ and ‘yellow’ labels, as we think these may be the populations). Having this separate small file is convenient, because it is much easier to edit a small file than to make changes to a large .ped file.

All the parameters that control a smartpca run need to be defined in a two-column text file. The name of the parameter is in the first column (followed by a colon “:”); the second column must contain the parameter value. Open the Itamba.par file. You should now be able to understand the genotypename, snpname, and indivname parameters. Parameters evecoutname and evaloutname define the names of the files where the results of PCA (position of each individual along eigenvectors, and eigenvalues associated with the eigenvectors) are going to be written. These five parameters are always required to run the program. Many other parameters are available and are described in the README file that comes with the software. Have a look. What does the numchrom parameter mean? Unfortunately smartpca does not have manual pages nor a help function, so typing man smartpca or smartpca --help will not work.

Is there genetic structure in Itamba cichlids?

We are now ready to run the program. Simply type smartpca -p Itamba.par. The analysis should only take a few seconds. A lot of text output is printed on the screen; if you want, you can capture this into a file: smartpca -p Itamba.par > Itamba.out.

First have a look at the PCA output files:

Itamba_PCA.evec – the position of each individual along eigenvectors 1-10 (columns 2-11)
Itamba_PCA.eval – the ordered eigenvalues corresponding to the eigenvectors

The following R code can be used to plot the position of each individual along the first two (the most significant) eigenvectors:
>eval <- read.table("~/workshop_materials/01_unix_intro/smartPCA_activity/Itamba_PCA.eval") >evec1.pc <- round(eval[1,1]/sum(eval)*100,digits=2) >evec2.pc <- round(eval[2,1]/sum(eval)*100,digits=2) >evec <- read.table("~/workshop_materials/01_unix_intro/smartPCA_activity/Itamba_PCA.evec") >par(mar = c(6.1, 6.1, 2.1, 2.1)) >plot(evec[,2], evec[,3], xlab=paste("eigenvector1\n",evec1.pc, "% of observed genetic variation", sep=""), ylab=paste("eigenvector2\n",evec2.pc, "% of observed genetic variation", sep=""), col=c(rep("dark blue",14),c(rep("gold",17))), pch=c(rep(15,14), rep(17,17)), cex=2.5, cex.axis=1.5,cex.lab=1.5)

You should see a plot like this. Points are coloured according to our preconceived idea of populations (blue/yellow). Does it seem that the blueish and yellowish fish are different populations? Could the two individuals at the right be from a different population? Or is there a genetic separation between the individuals towards the top and towards the bottom of the screen? What do you think?

To to get a rigorous answer to these questions we can use the statistics that were output to the screen (and/or the Itamba.out file). The Tracy-Widom statistic provides a formal test (based around PCA) to answer the question: “Does the data provide statistical evidence for a population structure?”. Find the Tracy-Widom statistics in the output (start on line 20) and check the p-value for eigenvector 1. The p-value is 0.10523. You may be familiar with the idea that p-value of less than 0.05 could be considered statistically significant. That cutoff is somewhat arbitrary, but seeing a p-value higher than 0.1 suggests that there is little evidence of any population structure whatsoever. How about the next p-value, for eigenvector 2? This is 0.044, so perhaps significant. But we should not get too excited. We should not look at the p-value for eigenvector 2 if the one for eigenvector 1 was not significant. In the words of the authors (from the manuscript): “If an eigenvalue is not significant, then further testing of smaller eigenvalues should not be done.”

The “Anova statistics for population differences” provide a formal test for genetic differentiation between our pre-labelled populations (yellow; blue). We already “know” these are not really populations by looking at the PCA plot. The p-value is 0.19, in agreement with our visual observation.

Conclusions for Lake Itamba

We can conclude that the Lake Itamba data does not provide significant evidence of any population structure with eigenanalysis (PCA analysis). It is also obvious, even without formal statistical tests, that blueish and yellowish fish are not distinct populations. Strictly speaking, these conclusions are limited to the data we have. Perhaps sampling additional individuals could reveal some population structure after all. However, arguably we could be justified in proceeding to further analysis by treating these Lake Itamba cichlids as one homogeneous population.