Milan Malinsky and Emiliano Trucchi, 10th June 2022
Background
Understanding of shared ancestry in genetic datasets is almost always key to their interpretation. A powerful approach that utilises haplotype linkage information and focusses on the most recent coalescence (common ancestry) among the sampled individuals was introduced by Lawson et al., 2012 and is implemented in the software fineSTRUCTURE. It offers especially high resolution in inference of recent shared ancestry, as evidenced for example in the study of genetic structure of the British population (Leslie et al., 2015). Further advantages when compared with older methods such as STRUCTURE (Pritchard et al. 2000) and ADMIXTURE (Alexander et al. 2009) include the ability to deal with a very large number of populations and explore the relationships between them.
However, the fineSTRUCTURE pipeline was designed for large scale human SNP datasets, where chromosomal location of the markers are known, haplotypes are typically assumed to be correctly phased, and missing data also needs to have been imputed. Therefore, these methods have so far been generally inaccessible for investigations beyond model organisms (this is changing, with a number of participants having genome-wide re-sequencing data!).
At the 2016 edition of this workshop, a number of the organisers decided to bring this method to non-model organism genomics, specifically to users of RAD-seq data: the fineRADstructure software is the result.
In the first activity you are going to run fineRADstructure, get familiar with the software, and learn about basic interpretation of the results. Then, the second exercise provides additional details about data preparation, especially useful for people who are practically interested in using the software on their own data in the future.
Activity 1 – Alpine Plants
In the first activity, we use a single-digestion RAD dataset including 120 individuals from 12 populations of the alpine plant species complex Heliosperma pusillum (see Trucchi et al. 2017 for details). The dataset comprises 1,097 loci which have been assembled through the Stacks pipeline without a reference genome. While being small, easy and quick-to-run, the results of the analysis will illustrate well many of the benefits of the fineRADstructure approach.
How to run this exercise
To run the commands, two options: i) connect to your AMI instance using Guacamole (ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:8080/guacamole/) and open a terminal or ii) open a terminal on your laptop and connect via ssh [email protected]
to your instance.
To plot and view the pdf with the plots, open a Rstudio connection in your browser (ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:8787) and navigate to the right folder in ~/workshop_materials/a11_fineRADstructure
Running fineRADstructure
The data for this exercise are in: ~/workshop_materials/a11_fineRADstructure. You should ssh
into your instance and cd
into this directory. You can have a look at the input data file: e.g. less -S AlpinePlants.data
. The columns correspond to individuals and each row defines the alleles at a particular RAD locus.
First we calculate the co-ancestry matrix. For all pairs of individuals, the co-ancestry matrix counts the number of RAD loci where one has the most similar allele to another. Thus it summarises the nearest-neighbour haplotype relationships between individuals across the RAD loci. The command to run is:
RADpainter paint AlpinePlants.data
From here on, the steps are identical as if we were working on whole-genome data. Having obtained the co-ancestry matrix (AlpinePlants_chunks.out), we feed it to the fineSTRUCTURE Markov chain Monte Carlo (MCMC) clustering algorithm.
finestructure -x 100000 -y 100000 -z 1000 AlpinePlants_chunks.out AlpinePlants_chunks.mcmc.xml
Details about the MCMC algorithm and parameters
The population partitions are written into AlpinePlants_chunks.mcmc.xml
The parameters we use here are the standard basic parameters that need to be specified for any MCMC analysis:
-x 100000 the number of burn in iterations
-y 100000 the number of sample iterations (run after the burn-in)
-z 1000 thins the output file. It specifies that the population configuration should be written to the output file every 1000th iteration (instead of every iteration).
Finally, to be able to better explore the relationships between inferred populations, we run a simple tree-building algorithm:
finestructure -m T -x 10000 AlpinePlants_chunks.out AlpinePlants_chunks.mcmc.xml AlpinePlants_chunks.mcmcTree.xml
Visualising fineRADstructure results
There is a fineStructure GUI that can be used for visualising the results. However, for people who know basic R programming (which all of you should aim to be), by far the easiest and most flexible way to get publication-quality figures is by adapting the R scripts we provide.
In R-studio open the file ~/workshop_materials/a11_fineRADstructure/fineRADstructurePlot.R
Reminder on how to connect to R-studio
e.g.: http://ecX-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787
R-studio should then open directly in your browser
Then go to File->Open file (top left), navigate into the ~/workshop_materials/a11_fineRADstructure/ folder and select the
fineRADstructurePlot.R file.
In the script edit the following lines:
- Edit lines 30 to 32 to provide the names of the files you generated by running the analysis:
chunkfile<-"AlpinePlants_chunks.out" ## RADpainter output file
mcmcfile<-"AlpinePlants_chunks.mcmc.xml" ## finestructure mcmc file
treefile<-"AlpinePlants_chunks.mcmcTree.xml" ## finestructure tree file - Edit also line 36 to provide the prefix (between quotes) for the pdf files generated by this script
analysisName <- "AlpinePlants"; maxIndv <- 1000; maxPop<-1000
Then execute all the R code. It generates two PDF files:
1) Clustered co-ancestry matrix with co-ancestry values for all pairs of individuals.
Like: AlpinePlants-SimpleCoancestry.pdf
2) Clustered co-ancestry matrix with co-ancestry values averaged across the inferred populations.
Like: AlpinePlants-PopAveragedCoancestry.pdf
Interpreting the results
Look at the PDF results. The first letter in individual ids (V or P) stand for the species (P for Heliosperma pusillum and V for H. veselskyi) and the second letter corresponds to the six sampled localities (A to F). You can see that individuals within the twelve predefined populations (2 species x 6 localities = 12 a priori populations) clearly share more co-ancestry with each other than between populations. Hierarchical structure among and between localities is clearly inferred – populations at localities B and C cluster by species, whereas populations at localities A, D, E, and F cluster by locality. You can also see that some sub-structure is inferred within a number of the a priori population.
Above the co-ancestry matrix is the population tree you constructed earlier. Note that the tree is not a phylogeny. It reflects the data in the co-ancestry matrix, thus focusing on recent coalescence. The tree building approach is simple and intended to be illustrative of the relationships between the populations, rather than to be interpreted as representing the true population history. The values next to the tree branches are posterior population assignment probabilities: they reflect of the proportion of MCMC samples in which the individuals in question formed a population.
Another fineRADstructure analysis of Emperor penguin data: look at the plot and try to answer the questions below
- Briefly, how does the population structure of the penguins compare with that of the alpine plants?
- What can explain the high coancestry values for particular pairs or triplets of individuals (such as the 3 HAL) individuals. How many examples of elevated coancestry can you see in the dataset? Do all of them impact the populations inferred by fineSTRUCTURE?
- Do the populations that are observed correlate at all with geography? If so, how?
- what scenarios can explain the observed population structure? Describe “histories” for the 5 groups that are marked by lines.
Activity 2 – Dataset preparation
This activity focuses on preparing a dataset for use with fineRADstructure. RAD-seq processing pipelines commonly use the Stacks toolkit (Catchen et al. 2013). Therefore, we wrote a data conversion and filtering script (Stacks2fineRAD.py) for processing the output from the core Stacks population program (a file usually called “batch_1.haplotypes.tsv”).
Recently, the authors of Stacks added the: populations --radpainter
option to make suitable input files. This is probably the easiest option if you have your data in Stacks. Also the RADpainter hapsFromVCF command may work if you only have a VCF file for your RAD data; feel free to get in touch if it doesn’t work for your specific VCF.
However, the Stacks2fineRAD.py script from Emiliano has some additional features that are illustrated below.
An example of Stacks populations output is in the file Tortoises_Stacks_output.tsv in the same folder you used for Activity 1. We are going to convert it to the fineRADstructure format using the Stacks2fineRAD.py script.
Now you can execute the script we introduced a few lines above, using the following command:
python Stacks2fineRAD.py -i Tortoises_Stacks_output.tsv -n 5 -m 20
While converting, the script also applied various filters to your data. First, the script automatically discards invariant loci because these are not informative about population structure. We also removed data that could be in some way problematic. One problem we worry about is that of misassembled paralogs (when similar sequences from two or more genomic loci are lumped together by mistake and mistakenly analysed as if they came from a single locus). Therefore, we filter out loci that have more alleles than the ploidy of the organism (in this case we are analysing data from diploid Tortoises, therefore we filtered loci with more than two alleles, i.e. ‘triallelic’ etc.). Further, unusually too many SNPs at a locus could also hint at misassembled paralogs; using the -n
option, you can specify the maximum number of SNPs allowed at a locus. Finally, we discarded individuals who had unusually high levels of missing data (using option -m 20
we set the cutoff to 20%).
The script also makes useful plots that can help inform you about your dataset and especially about any required filtering.
The plots show:
- The distribution of the number of alleles across loci
- The distribution of the number of SNPs across loci
- Missing data levels across samples
- a PCA of the samples using missingness (presence/absence data)
In the latter analysis, missing data are converted to “0” whereas any genotype is converted to “1”. This presence/absence matrix is then used in a PCA analysis. Such PCA should help you understand if data missingness may be affecting the fineRADstructure results. The labels in the PCA plot are difficult to read, because they are generated automatically, and we have not yet found a way how to do this better. You could edit this plot with a vector graphic software in order to better explore the sample ordination. We have prepared an edited version of the PCA plot for you so you don’t have to do that. The first part of the sample label indicate the library it was sequenced in (i.e. l1, l2, l3, l4, l5).
Here is an edited version of the PCA plot
As hinted at in the figure labels, the colours correspond to the library batch:
– purple = l1+l5
– green = l2
– orange = l3
– dark red = l4 (only one individual passed the filtering)
- Can you recognize any batch effect due to the sequencing library?
- Is there any other clear pattern driven by missingness (look at the edited plot to answer this)? What could be the cause? If you work with RAD-seq data, have you ever heard about phylogenetic allele drop-out? (if not we recommend you look for example at this article)
- Is the amount of missing data homogeneous among libraries?
- What is the distribution of the numbers of alleles and SNPs across loci?
Running fineRADstructure on the converted datafile
You can use the final filtered file Tortoises_Stacks_output.tsv.fineRADpainter.lociFilt.samples20missFilt.txt directly in fineRADstructure. To shorten the file names, we suggest you rename it to Tortoise_20miss.data. Then please run it through the fineRADstructure pipelines as we did in Activity 1 – Alpine Plants.
Looking at the results, can you answer the following questions:
- Do you think the coancestry matrix is biased by the missing data issue?
- Which signal that was visible in the missingness PCA is also present in the coancestry matrix?
- Can you identify any hybrid individuals (hint: there are three)?
EXTRA
If you are still curious this will allow you to investigate the effect of different missingness cutoffs:
Run Stacks2fineRAD.py allowing for 70% missing data per sample. Check the new plots (note that all files will be overwritten!). Then rename Tortoises_Stacks_output.tsv.fineRADpainter.lociFilt.samples70missFilt.txt to Tortoise_70miss.data and run fineRADstructure again. Finally compare the two coancestry matrices.