fineSTRUCTURE activity - Evolution and Genomics

Daniel Falush and Milan Malinsky, 27th January 2020

Introduction

This example is based on a Science paper from 2018, and the figures here are reproduced from the supplement. Please do not read (or reread) the paper until after you have finished the exercise.

There are 46 H. pylori genomes in the sample, one of which is an ancient genome from “Otzi”, the Tirolean iceman. The aim of the practical is to establish how his H. pylori is related to modern isolates based on the SNPs in the genomes.

Helicobacter pylori is a bacterium with a 1.67 Megabase haploid genome, and the sample of 46 genomes is variable at 172,419 SNPs.

Here are the results from three commonly used methods applied to these SNPs:

Neighbor joining tree

STRUCTURE

PCA
PCA

Questions 1-3:
(1) What do the results of the three methods tell us about relationships between hpEurope and HpAsia2 strains? What evolutionary scenarios might have accounted for this? Please note that for this and other questions, it may be interesting to suggest hypotheses that are based purely on the genetic data, and then, as a second step, consider whether they are consistent with anything we know about history or geography.

(2) Suggest TWO evolutionary/historical hypotheses for the relationships between the hpAsia2 and hpEAsia populations.

(3) Please describe at least TWO evolutionary/historical hypotheses for the relationship of Otzi to the other samples that are consistent with the results of each of the three methods. Try to make the hypotheses as distinct as possible, given the data. Then describe at least one evolutionary/historical hypothesis that is consistent with the results of all three methods.

Running fineSTRUCTURE

How to do this activity

Connect to your Amazon instance via the terminal using SSH.
Switch to Rstudio in the browser (http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787) for the visualization step below.

Running fineSTRUCTURE
Now we’ll begin working with fineSTRUCTURE version 4. On your instance, you can run this program from the command line by typing fs. The data files for this activity can be found in ~/workshop_materials/27_population_structure/fineSTRUCTURE. Please navigate to this directory (cd).

fineSTRUCTURE performs three steps:

It estimates parameters of a “chromosome painting” Hidden Markov Model.
It paints each of the chromosomes as a mosaic of each of the others.
It uses a summary of the painting, the “coancestry matrix”, and then performs MCMC based clustering of this matrix in order to divide the sample into populations.

fineSTRUCTURE takes three inputs:

46hpyloriHaplotypes.phase is the genetic data from the 46 strains in PHASE format. Note that H. pylori is haploid.
hpyloriUnifRecMap.rcomb is a file giving the genetic distance between the SNPs in the genome. Since there is no genetic map for H. pylori, this is simply the physical distance between SNPs in the reference genome.
46hpyloristrains.id is a file giving the names of each strain. It can also be used to set a flag as to whether the strain should be used in the analysis.

With the following command you run version 4 of fineSTRUCTURE (fs-4.0.1):
fs hpyloriproject1.cp -n -phasefiles 46hpyloriHaplotypes.phase -recombfiles hpyloriUnifRecMap.rcomb -idfile 46hpyloristrains.id -ploidy 1 -go

hyploriproject1.cp is the file that records the details of the fineSTRUCTURE runs, with intermediate files being written into the directory hpyloriproject1.
“-n” indicates that a previous settings file (if there is one) should be overwritten.
“-ploidy 1” indicates the data is haploid. The default is for the data to be diploid.
“-go” indicates to run through the entire fineSTRUCTURE pipeline.

fs -help
gives an overview of fineSTRUCTURE help and the most basic parameters.

fs -help parameters
gives the parameters and their default values, which we use in this run.

An alternative is to assume all the SNPs are unlinked. This is performed by running the data in the same way, but with the recombination map file omitted. In this case, the data used by the algorithm is very similar to that used by STRUCTURE, PCA and neighbour joining.

fs hpyloriproject.cp -n -phasefiles 46hpyloriHaplotypes.phase -idfile 46hpyloristrains.id -ploidy 1 -go

Visualising fineSTRUCTURE results

There is a fineStructure GUI that can be used for visualising the results. However, for people who know basic R programming, by far the easiest and most flexible way to get publication-quality figures is by adapting the R scripts we provide (it is the same script that will be used in the following fineRADstructure exercise, because the output formats are the same for both softwares).

In R-studio open the file ~/workshop_materials/27_population_structure/fineRADstructure/fineRADstructurePlot.R

Reminder on how to connect to R-studio

In your favourite browser window (Firefox, Safari, etc.) enter the address of your Amazon instance, followed by a colon and the port number 8787
e.g.: http://ecX-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8787
R-studio should then open directly in your browser
Then go to File->Open file (top left), navigate into the ~/workshop_materials/27_population_structure/fineRADstructure/ folder and select the
fineRADstructurePlot.R file.

Before you run the script, edit the following lines:

In the script, edit line 28 and line 34 to set the correct working directory to read the files and save the plots /home/popgen/workshop_materials/27_population_structure/fineSTRUCTURE/
Edit also lines 30 to 32 to provide the names of the files you generated by running the analysis (pay attention to the different file names!):
chunkfile<-"YOUR_INPUT_FILE_chunks.out" ## painter output file, in this case should be hpyloriproject1_linked_hap.chunkcounts.out
mcmcfile<-"YOUR_INPUT_FILE_chunks.mcmc.xml" ## finestructure mcmc file (hpyloriproject1_linked_hap_mcmc.xml)
treefile<-"YOUR_INPUT_FILE_chunks.mcmcTree.xml" ## finestructure tree file (hpyloriproject1_linked_hap_tree.xml)
Finally edit line 36 to provide the prefix (between quotes) for the pdf files generated by this script
analysisName <- "YOUR_NAME_PREFIX"; maxIndv <- 1000; maxPop<-1000

Then execute all the R code. It generates two PDF files:

Clustered co-ancestry matrix with co-ancestry values for all pairs of individuals.
Like: hPyloriLinked-SimpleCoancestry
Clustered co-ancestry matrix with co-ancestry values averaged across the inferred populations.
Like: hPyloriLinked-PopAveragedCoancestry

In Rstudio, you can use the bottom-right panel to open the pdf files you just created: using the “Files” tab, navigate to the fineSTRUCTURE folder and then open the pdf files (you need to allow your browser to open pop-up tabs!).

Then you can start thinking about the following questions.
Note: To answer question 9, you need to plot also the results from running the unlinked model. We leave it up to you to figure out how to modify the relevant lines of the R function. One hint: the maxIndv <- 1000; maxPop<-1000 values will need to be increased.

Questions 4-9:

(4) Which population consists of strains that are not particularly related to each other? How do results like this arise?

(5) Intransitive relationships between populations are a possible signature of admixture; e.g., if pop1 and pop2 have high coancestry with each other and pop1 and pop3 have high coancestry with each other, but pop2 and pop3 do not. One scenario that can explain this pattern is that pop1 and pop2 were sister taxa before pop2 and pop3 admixed with each other. Find three such sets of population triplets in the data.

(6) fineSTRUCTURE infers a large number of populations under both models; why do you think this is?

(7) Now onto the relationship between Otzi and the other samples. What do the coancestry values of Otzi with hpEurope and hpAsia2 isolates tell us about Otzi’s relationships with these samples?

(8) How can your answers to question 7 be reconciled with results from the three other methods above? Describe a single evolutionary hypothesis that explains all the data. Are there any alternatives?

(9) Based on this hypothesis, how might the subtle differences between the linked and unlinked models in the results for Otzi be explained?