Daniel Falush, 23rd January 2018
This example is based on a recent Science paper, and the figures here are reproduced from the supplement. Please do not read (or reread) the paper until after you have finished the exercise.
There are 46 H. pylori genomes in the sample, one of which is an ancient genome from “Otzi”, the Tirolean iceman. The aim of the practical is to use the SNPs in the genomes to establish how his H. pylori is related to modern isolates, in particular those from Europeans (hpEurope) and Indians (hpAsia2).
Helicobacter pylori is a bacterium with a 1.67 Megabase haploid genome, and the sample of 46 genomes is variable at 172,419 SNPs. Here are the results from three commonly used methods applied to these SNPs.
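To get a feel for how dense the variation is, the two figures above can be turned into an average spacing between SNPs. This is only a back-of-the-envelope sketch: it assumes the SNPs are spread over the full 1.67 Mb genome, and the helper name snp_spacing is made up for this illustration.

```shell
# Average spacing between SNPs, in base pairs, from the figures quoted above.
snp_spacing() {  # $1 = genome length in bp, $2 = number of SNPs
  awk -v g="$1" -v s="$2" 'BEGIN { printf "%.1f\n", g / s }'
}

snp_spacing 1670000 172419   # prints 9.7
```

On average there is a variable site roughly every 10 bp, so neighbouring SNPs sit very close together on the chromosome.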
Neighbor joining tree
(1) What do the results of the three methods tell us about relationships between hpEurope and hpAsia2 strains? What evolutionary scenarios might have accounted for this? Please note that for this and other questions, the hypotheses you suggest do NOT need to be consistent with history or geography, only with the genetic data.
(2) Suggest TWO evolutionary/historical hypotheses for the relationships between the hpAsia2 and hpEAsia populations.
(3) Please describe at least TWO evolutionary/historical hypotheses for the relationship of Otzi to the other samples that are consistent with the results of each of the three methods. Try to make the hypotheses as distinct as possible, given the data. Then describe TWO evolutionary/historical hypotheses that are consistent with the results of all three methods.
Now we’ll begin working with fineSTRUCTURE version 2. On the cloud, you can run this program from the command line by typing “fs”. There is also a launcher for the fineSTRUCTURE GUI on the desktop. The data files for this activity can be found in ~/workshop_materials/02_population_structure/fineSTRUCTURE
fineSTRUCTURE performs three steps.
(i) It estimates parameters of a “chromosome painting” Hidden Markov Model.
(ii) It paints each of the chromosomes as a mosaic of each of the others.
(iii) It uses a summary of the painting, the “coancestry matrix”, in order to divide the sample into populations.
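As a toy illustration of step (iii), the sketch below summarises a painting by counting how many contiguous chunks a recipient copies from each donor. It is a simplification under stated assumptions: each SNP is given a single hard donor label, whereas the real painting works with expectations under the HMM, and the function name count_chunks and the donor labels A, B, C are invented for the example.

```shell
# Count contiguous chunks per donor in a toy "painting"
# (one donor label per SNP, separated by spaces).
count_chunks() {
  printf '%s\n' "$1" | awk '{
    for (i = 1; i <= NF; i++)
      if (i == 1 || $i != $(i - 1)) n[$i]++   # a new chunk starts when the donor changes
    for (d in n) print d, n[d]
  }' | sort
}

# Recipient copies from A, then B, then A again, then C:
count_chunks "A A A B B A C C"
```

Each recipient yields one such row of counts; stacking the rows for all recipients gives a matrix of the same shape as the coancestry matrix.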
fineSTRUCTURE takes three inputs.
46hpyloriHaplotypes.phase is the genetic data from the 46 strains in PHASE format. Note that H. pylori is haploid.
hpyloriUnifRecMap.rcomb is a file giving the genetic distance between the SNPs in the genome. Since there is no genetic map for H. pylori, this is simply the physical distance between SNPs in the reference genome.
46hpyloristrains.id is a file giving the names of each strain. It can also be used to set a flag as to whether the strain should be used in the analysis.
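Before running anything, it can be worth checking that the id file lists the expected 46 strains. The sketch below assumes the id file has one strain per non-empty line; count_strains is a hypothetical helper, not part of fineSTRUCTURE.

```shell
# Count non-empty lines in an id file (assumption: one strain per line).
count_strains() {
  grep -c . "$1"
}

# Only run the check if the file is present in the current directory.
if [ -f 46hpyloristrains.id ]; then
  count_strains 46hpyloristrains.id
fi
```

If the number printed differs from 46, the id file and the PHASE file are probably out of step.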
The command we will use to run version 2 of fineSTRUCTURE (fs-2.0.7) is:
fs hpyloriproject1.cp -n -phasefiles 46hpyloriHaplotypes.phase -recombfiles hpyloriUnifRecMap.rcomb -idfile 46hpyloristrains.id -ploidy 1 -go
hpyloriproject1.cp is the file that records the details of the fineSTRUCTURE runs, with intermediate files being written into an accompanying directory.
“-n” indicates that a previous settings file (if there is one) should be overwritten.
“-ploidy 1” indicates the data is haploid. The default is for the data to be diploid.
“-go” indicates to run through the entire fineSTRUCTURE pipeline.
The program's built-in help gives an overview of fineSTRUCTURE and the most basic parameters, while
fs -help parameters
gives the parameters and their default values, which we use in this run.
An alternative is to assume all the SNPs are unlinked. This is performed by running the data in the same way, but with the recombination map file omitted. In this case, the data used by the algorithm is very similar to that used by STRUCTURE, PCA and neighbour joining.
fs hpyloriproject.cp -n -phasefiles 46hpyloriHaplotypes.phase -idfile 46hpyloristrains.id -ploidy 1 -go
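The two runs can be wrapped in a small script that first checks the input files are present. This is a minimal sketch: require_files is a hypothetical helper written for this example, and the fs commands are copied unchanged from the text above.

```shell
# Return non-zero and name any missing file (hypothetical helper, not part of fs).
require_files() {
  missing=0
  for f in "$@"; do
    [ -f "$f" ] || { echo "missing: $f" >&2; missing=1; }
  done
  return "$missing"
}

# Only attempt the runs when fs is on the PATH and all inputs are present.
if command -v fs >/dev/null 2>&1 &&
   require_files 46hpyloriHaplotypes.phase hpyloriUnifRecMap.rcomb 46hpyloristrains.id; then
  # Linked model (with recombination map), as given above.
  fs hpyloriproject1.cp -n -phasefiles 46hpyloriHaplotypes.phase -recombfiles hpyloriUnifRecMap.rcomb -idfile 46hpyloristrains.id -ploidy 1 -go
  # Unlinked model (recombination map omitted).
  fs hpyloriproject.cp -n -phasefiles 46hpyloriHaplotypes.phase -idfile 46hpyloristrains.id -ploidy 1 -go
fi
```

Because the two commands use different project files, the linked and unlinked results are recorded separately.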
Visualising fineSTRUCTURE results
There is a fineSTRUCTURE GUI that can be used for visualising the results. However, for anyone who knows basic R programming (which all of you should aim to learn), by far the easiest and most flexible way to get publication-quality figures is to adapt the R scripts we provide (the same script will be used in the following fineRADstructure exercise).
In R-studio open the file ~/workshop_materials/02_population_structure/fineRADstructure/fineRADstructurePlot.R
Reminder on how to connect to R-studio
R-studio should then open directly in your browser
Then go to File->Open file (top left), navigate into the ~/workshop_materials/02_population_structure/fineRADstructure/ folder and select the script named above.
In the script, edit line 28 and line 34 to set the correct working directory for reading the files and saving the plots: /home/wpsg/workshop_materials/02_population_structure/fineSTRUCTURE/
Edit also lines 30 to 32 to provide the names of the files you generated by running the analysis (pay attention to the different file names!):
chunkfile<-"YOUR_INPUT_FILE_chunks.out" ## painter output file
mcmcfile<-"YOUR_INPUT_FILE_chunks.mcmc.xml" ## finestructure mcmc file
treefile<-"YOUR_INPUT_FILE_chunks.mcmcTree.xml" ## finestructure tree file
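Since the file names produced by your run differ from the placeholders above, one way to find them is to list the candidate output files in the fineSTRUCTURE folder. The sketch below makes no assumption about the exact names and simply shows every .out and .xml file in a directory; list_fs_outputs is a hypothetical helper.

```shell
# List .out and .xml files directly inside a directory (default: current directory).
list_fs_outputs() {
  find "${1:-.}" -maxdepth 1 -type f \( -name '*.out' -o -name '*.xml' \) | sort
}

list_fs_outputs .
```

Run it from the fineSTRUCTURE folder, then copy the matching names into chunkfile, mcmcfile and treefile.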
Finally edit line 36 to provide the prefix (between quotes) for the pdf files generated by this script
analysisName <- "YOUR_NAME_PREFIX"; maxIndv <- 1000; maxPop<-1000
Then execute all the R code. It generates two PDF files:
1) Clustered co-ancestry matrix with co-ancestry values for all pairs of individuals.
2) Clustered co-ancestry matrix with co-ancestry values averaged across the inferred populations.
In Rstudio, you can use the bottom-right panel to open the pdf files you just created: using the “Files” tab, navigate to the fineSTRUCTURE folder and then open the pdf files (you need to allow your browser to open pop-up tabs!).
(4) Which population consists of individuals that are not particularly related to each other? How do results like this arise?
(5) Intransitive relationships between populations are a possible signature of admixture; e.g., if pop1 and pop2 have high coancestry with each other and pop1 and pop3 have high coancestry with each other, but pop2 and pop3 do not. One scenario that can explain this pattern is that pop1 and pop2 were sister taxa before pop1 and pop3 admixed with each other. Find three such sets of population triplets in the data.
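The pattern described in question (5) can also be searched for mechanically. The sketch below is only an aid to reading the plots, under several assumptions: the population-averaged coancestry values have been written by hand into a whitespace-separated table (a header row of population names, then one labelled row per population), coancestry between two populations is taken as the mean of the two directions, and "high" means at or above a threshold you choose. The function name find_intransitive, the table layout and the toy numbers are all invented for the example.

```shell
# Print triplets "p1 p2 p3" where p1-p2 and p1-p3 are high but p2-p3 is not.
find_intransitive() {  # $1 = matrix file, $2 = threshold for "high"
  awk -v t="$2" '
    function high(x, y) { return (m[x, y] + m[y, x]) / 2 >= t }
    NR == 1 { n = NF; for (i = 1; i <= n; i++) name[i] = $i; next }
    { r = NR - 1; for (j = 1; j <= n; j++) m[r, j] = $(j + 1) }
    END {
      for (a = 1; a <= n; a++)
        for (b = 1; b <= n; b++)
          for (c = b + 1; c <= n; c++)
            if (a != b && a != c && high(a, b) && high(a, c) && !high(b, c))
              print name[a], name[b], name[c]
    }' "$1"
}

# Toy matrix (made-up values): pop1 is close to both pop2 and pop3.
toy=$(mktemp)
cat > "$toy" <<'EOF'
pop1 pop2 pop3
pop1 0 9 8
pop2 9 0 2
pop3 8 2 0
EOF
find_intransitive "$toy" 5   # prints: pop1 pop2 pop3
rm -f "$toy"
```

The first name on each output line is the population playing the role of pop1. The threshold is arbitrary, so any triplet found this way should be checked against the plotted matrix.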
(6) fineSTRUCTURE infers a large number of populations under both models; why do you think this is?
(7) Now onto the relationship between Otzi and the other samples. What do the coancestry values of Otzi with hpEurope and hpAsia2 isolates tell us about Otzi’s relationships with these samples?
(8) How can your answers to question 7 be reconciled with the results from the three other methods above? Describe a single evolutionary hypothesis that explains all the data. Are there any alternatives?
(9) Based on this hypothesis, how might the subtle differences between the linked and unlinked models in the results for Otzi be explained?