Activity by Richard Durbin, 27th January 2020
How to run this activity
- This activity can be run through the terminal; thus, connect to the Amazon instance via SSH.
- Navigate to the
- You will find the so called “MasterVarBeta” files for six individuals for chromosome 1 in the human_cg_data subdirectory.
This activity is based closely, with permission, on the tutorial by Stephan Schiffels, the author of the MSMC and MSMC2 software, which can be found at
Prerequisites - all this is already done for you
Download a suitable binary for MSMC2 (Version 2.0.0) from here:
Alternatively, download the source and install yourself.
Download the msmc-tools repository, e.g. via git clone https://github.com/stschiff/msmc-tools.git if you have git installed, or simply as a zip file here.
This tutorial, assumes that the following executables are in the user’s path:
The first three are part of the msmc-tools repository, and you should copy those scripts into a location within the path, or create symbolic links to it. The msmc2 executable should be a renamed binary (Version 2.0.0), or a compiled executable, also somewhere in the path. Note that combineCrossCoal.py also requires the scipy package to be installed (for python3).
Finally, for plotting we will use [R](https://www.r-project.org).
You will get input data and intermediate files for this tutorial at the end of the workshop.
For this lesson, we will use two trios from the publicly available “69 genomes” data set published by Complete Genomics. It is openly available and relatively efficient to process for this tutorial. We point to instructions for Illumina data below.
We are going to work in the
~/workshop_materials/27_demography_PSMC_MSMC. You will find the so called “MasterVarBeta” files for six individuals for chromosome 1 in the
human_cg_data subdirectory. Some information on the six samples: The first three form a father-mother-child trio from the West-African Yoruba, people living in Nigeria. Here, NA19240 is the offspring, and NA19238 and NA19239 are the two parents. The second three samples form a father-mother-child trio from Utah (USA), with unspecified European ancestry. Here, NA12878 is the offspring, and NA12891 and NA12892 are the parents. These are very widely used open use samples originally collected as part of the HapMap project, and subsequently by the 1000 Genomes Project.
.bz2 is a compressed file format similar to .gz. You can examine these files like this:
bzip2 -dc masterVarBeta-NA12878-200-37-ASM.tsv.chr1.bz2 | less -S
Generating consensus sequences and mask files for each sample
In order to run MSMC2, you need to types of files:
VCFfiles that contain your variation data.
.bedfiles that contain the genomic intervals in which variation could have been detected. These files should exclude sites in the genome with for example low mapping quality, or were SNPs were filtered.
The data we are using does not come as VCF files (but “MasterVarBeta” files, which are not standard), so there are some steps involved in producing .vcf and .bed files . These steps are given in the box below, but in general you will likely start from a .vcf file.
Converting MasterVarBeta to .vcf + .bed - all this is already done for you
We will use the MasterVarBeta files for each of the 6 samples, and use the
cgCaller.py script (from the msmc-tools repository) to generate a VCF and a mask file for each individual from the `masterVar` file. To do this, we have provided a little shell script
runCGcaller.sh that loops over all individuals:
The script looks like this:
#!/usr/bin/env bash CGDATADIR=human_cg_data OUTDIR=human_vcf_bed CHR=chr1 for IND in NA19238 NA19239 NA19240 NA12878 NA12891 NA12892; do CGFILE=$(ls $CGDATADIR/masterVarBeta-$IND-*.tsv.chr1.bz2) OUT_MASK=$OUTDIR/$IND.$CHR.mask.bed.gz OUT_VCF=$OUTDIR/$IND.$CHR.vcf.gz cgCaller.py $CHR $IND $OUT_MASK $CGFILE | gzip -c > $OUT_VCF done
Here, we restrict analysis to chromosome 1 (which is called `chr1` in the Complete Genomics data sets). Normally, you would also loop over chromosomes 1-22 in this script. The line `CGFILE=$(ls …)` uses bash command substitution to look for the masterVar file and store the result in the variable `$CGFILE`.
Run this script by entering
./runCGcaller.sh. You should see log messages indicating the currently processed position in the chromosome. Chromosome 1 has about 250 million sites, so you can estimate the waiting time.
When finished (should take around 10 minutes for all 6 samples), you should now have one *.mask.bed.gz and one *.vcf.gz file for each individual in the ~/workshop_materials/27_demography_PSMC_MSMC/human_vcf_bed directory.
You will probably have Illumina data (not Complete Genomics data); in this case you can make the corresponding files from a BAM file generated for example by the bwa-mem aligner, using the bamCaller.py from the msmc-tools repository. This uses samtools and bcftools to call variants and calculate callable regions, so you need to have those installed to run it. Alternatively you can use GATK to call every site (variant or not; there is a –includeNonVariantSites in the GenotypeGVCF tool – except for the new GATK4 version) and then use the vcfAllSiteParser.py script from msmc-tools to generate the required VCF and mask.bed files.
Some explanation on the generated files: The VCF file in each sample contains all sites at which at least one of the two chromosomes differs from the reference genome. Here is a sample:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238 chr1 38232 . A G . PASS . GT 1/1 chr1 41981 . A G . PASS . GT 1/1 chr1 47108 . G C . PASS . GT 1/0 chr1 47292 . T G . PASS . GT 1/0 chr1 49272 . G A . PASS . GT 1/1 chr1 51673 . T C . PASS . GT 1/0 chr1 52058 . G C . PASS . GT 1/0
This alone would not be enough information. MSMC is a Hidden Markov Model which uses the density of heterozygous sites (`1/0` genotypes) to estimate the time to the most recent common ancestor. However, for a density you need not only a numerator but also a denominator, which in this case is the number of non-heterozygous sites, so typically homozygous reference alleles. Those are not part of this VCF file, for efficiency reasons. That’s why we have a Mask-file for each sample, that gives information in which regions in the genome could be called. Regions with not enough coverage or too low quality will be excluded. The first lines of such a mask look like this:
chr1 11093 11101 chr1 11137 11154 chr1 11203 11235 chr1 11276 11288 chr1 11319 11371 chr1 11378 11387 chr1 11437 11453 chr1 11481 11504 chr1 11511 11527 chr1 11568 11637
which gives a very detailed view on which regions could be called (2nd and 3rd column specify the beginning and the end).
There is one more mask that we need, which is the mappability mask. This mask defines regions in the reference genome in which we trust the mapping to be of high quality because the reference sequence is unique in that area (i.e. it masks repeats). The mappability mask for chromosome 1 for the human reference GRCh37 is included in your folder: ~/workshop_materials/27_demography_PSMC_MSMC/hs37d5_chr1.mask.bed. For all other human chromosomes, this README includes a link to download them (but we won’t need them in this tutorial). If you don’t have such mask for your (non-human) genome, you can use Heng Li’s SNPable toolkit as a starting point.
To process these two trios, we will use the two offspring samples only to phase the four parental chromosomes. You can do this with the trio option, as in the file
makePhasedInput.sh which we provide, copied below. For generating the input files for MSMC, it uses a program called
generate_multihetsep.py, which merges VCF and mask files together, and also performs simple trio-phasing.
#!/usr/bin/env bash INDIR=/home/popgen/workshop_materials/27_demography_PSMC_MSMC/human_vcf_bed OUTDIR=/home/popgen/workshop_materials/27_demography_PSMC_MSMC MAPDIR=/home/popgen/workshop_materials/27_demography_PSMC_MSMC generate_multihetsep.py --chr 1 \ --mask $INDIR/NA12878.chr1.mask.bed.gz --mask $INDIR/NA12891.chr1.mask.bed.gz --mask $INDIR/NA12892.chr1.mask.bed.gz \ --mask $INDIR/NA19240.chr1.mask.bed.gz --mask $INDIR/NA19238.chr1.mask.bed.gz --mask $INDIR/NA19239.chr1.mask.bed.gz \ --mask $MAPDIR/hs37d5_chr1.mask.bed --trio 0,1,2 --trio 3,4,5 \ $INDIR/NA12878.chr1.vcf.gz $INDIR/NA12891.chr1.vcf.gz $INDIR/NA12892.chr1.vcf.gz \ $INDIR/NA19240.chr1.vcf.gz $INDIR/NA19238.chr1.vcf.gz $INDIR/NA19239.chr1.vcf.gz \ > $OUTDIR/EUR_AFR.chr1.multihetsep.txt
Here we have first input all 6 calling masks, plus one mappability mask, then the two trio specifications (see
generate_multihetsep.py -h for details), and then the 6 VCF files.
generate_multihetsep.py and other msmc-tools .py scripts are in your
$PATH. You can invoke these scripts from any directory by just typing their name.
If you try to run this file, you are going to get an error like this:
./makePhasedInput.sh ./makePhasedInput.sh: line 7: ./EUR_AFR.chr1.multihetsep.txt: Permission denied
This is the because the output file EUR_AFR.chr1.multihetsep.txt already exists, and the script is telling you it is not allowed to overwrite the pre-existing file. Running the
generate_multihetsep.py script on this dataset takes >20minutes and we would all just have to wait. Therefore we ran it in advance and pre-prepared the output file for you.
Alternative - creating single sample input files
Here we show an example command line that generates and MSMC input file for a single diploid sample
#!/usr/bin/env bash INDIR=/path/to/VCF/and/mask/files OUTDIR=/path/to/output_files MAPDIR=/path/to/mappability/mask generate_multihetsep.py --chr 1 --mask $INDIR/NA12878.mask.bed.gz \ --mask $MAPDIR/hs37d5_chr1.mask.bed $INDIR/NA12878.vcf.gz > $OUTDIR/NA12878.chr1.multihetsep.txt
Here we have added the mask and VCF file of the NA12878 sample, and the mappability mask. I suggest you don't actually run this because we won't need for the tutorial.
The first lines of the resulting "multihetsep" file should look like this:
1 68306 44 TTTCTCCT,TTTCCTTC 1 68316 10 CCCTTCCT,CCCTCTTC 1 87563 13 CCTTTTTT 1 570089 259 TTTTCCCC 1 752566 1058 AAAAAGAA 1 752721 83 GGGGGAGA 1 756781 596 GGGGGGGA 1 756912 113 AGAAAAAA 1 757103 26 CCCCCCCT 1 757734 84 TTTTTCTT
This is the input file for MSMC. The first two columns denote chromosome and position of a segregating site within the samples. The fourth column contains the 8 alleles in the 8 parental haplotypes of the four parents we put in. When there are multiple patterns separated by a comma, it means that phasing information is ambiguous, so there are multiple possible phasings. This can happen if all three members of a trio are heterozygous, which makes it impossible to separate the paternal and maternal allele.
The third column is special and I get a lot of questions about that column, so let me explain it as clearly as possible: The third column contains the number of called sites since the previous segregating site, including the current site. So for example, in the first row above, the first segregating site is at position 68306, but not all 68306 sites up to that site were called homozygous reference, but only 44. This is very important for MSMC, because it would otherwise assume that there was a huge homozygous segment spanning from 1 through 68306. Note that the very definition given above also means that the third column is always greater or equal to 1 (which is actually enforced by MSMC)!
Estimating the effective population size
MSMC's purpose is to estimate coalescence rates between haplotypes through time. This can be interpreted as the inverse effective population size through time. If the coalescence rate is estimated between subpopulations, another interpretation would be how separated the two populations became through time. In this tutorial, we will use both interpretations.
As a first step, we will use MSMC2 to estimate coalescence rates within the African haplotypes alone, and within the European haplotypes alone. Here is a short script for running it (
#!/usr/bin/env bash msmc2 -t 1 -p 1*2+15*1+1*2 -o EUR1.msmc2 -I 0-1 EUR_AFR.chr1.multihetsep.txt msmc2 -t 1 -p 1*2+15*1+1*2 -o EUR2.msmc2 -I 2-3 EUR_AFR.chr1.multihetsep.txt msmc2 -t 1 -p 1*2+15*1+1*2 -o AFR1.msmc2 -I 4-5 EUR_AFR.chr1.multihetsep.txt msmc2 -t 1 -p 1*2+15*1+1*2 -o AFR2.msmc2 -I 6-7 EUR_AFR.chr1.multihetsep.txt
Let's go through the parameters one by one. First, the
-t 1 option tells MSMC2 to use 1 CPUs. By default, it will use as many CPUs as there are on your computer. So if you are running this on a high performance compute cluster, it will use all the CPUs on that machine, which is rarely needed.
-p 1*2+15*1+1*2 option defines the time segment patterning. By default, MSMC uses 32 time segments, grouped as
1*2+25*1+1*2+1*3, which means that the first 2 segments are joined (forcing the coalescence rate to be the same in both segments), then 25 segments each with their own rate, and then again two groups of 2 and 3, respectively. MSMC2 run time and memory usage scales quadratically with the number of time segments. Here, since we are only analysing a single chromosome, you should reduce the number of segments to avoid overfitting. That's why I set 18 segments, with two groups in the front and back. Grouping helps avoiding overfitting, as it reduces the number of free parameters.
-o option denotes an output prefix. The three files generated by msmc will be called like this prefix with endings
-I option denotes the 0-based indices of the haplotypes analysed. In our case we have 8 haplotypes (in the
EUR_AFR.chr1.multihetsep.txt file), the first four being of European ancestry, the latter four of African ancestry. In the first run we estimate coalescence rates within the first two European haplotypes (index 0 to index 1; i.e. the individual NA12891), the second run is within individual NA12892, and the third is for the first parent of African ancestry (index 4 to index 5) and finally the fourth run is for the second parent of African ancestry. The last argument to
msmc2 is the multihetsep file. Normally you would run it on all 22 chromosomes, and in that case you would simply give all those 22 files in a row.
Estimating population separation history
Above we have run MSMC on each population individually. In order to better understand when and how the two ancestral populations separated, we will use MSMC to estimate the coalescence rate across populations. Here is a script for this run (
#!/usr/bin/env bash msmc2 -t 1 -s -I 0-4 -p 1*2+15*1+1*2 -o EUR1_AFR1.msmc2 EUR_AFR.chr1.multihetsep.txt
Again, here we are using 1 CPU (
-t 1), and we are running on the first parental haplotype in each subpopulation. The
-I 0-4 tells MSMC to calculate the coalescence rate between the first European haplotype (index 0) and the first African haplotype (index 4).
-s flag tells MSMC to skip sites with ambiguous phasing. As a rule of thumb: For population size estimates, we have found that unphased sites are not so much of a problem, but for cross-population analysis we typically remove those.
Plotting in R
The result files from MSMC2 look like this:
time_index left_time_boundary right_time_boundary lambda 0 0 2.61132e-06 2.93162 1 2.61132e-06 6.42208e-06 3043.06 2 6.42208e-06 1.19832e-05 3000.32 3 1.19832e-05 2.00987e-05 8353.98 4 2.00987e-05 3.19418e-05 12250.1 5 3.19418e-05 4.92247e-05 8982.41 ...
Here, the first column denotes a simple index of all time segments, the second and third indicate the scaled start and end time for each time interval. The last column contains the scaled coalescence rate estimate in that interval.
Let's first plot the effective population sizes with the following R code:
mu <- 1.25e-8 gen <- 30 afrDat<-read.table("~/workshop_materials/27_demography_PSMC_MSMC/human_results/AFR1.msmc2.final.txt", header=TRUE) eurDat<-read.table("~/workshop_materials/27_demography_PSMC_MSMC/human_results/EUR1.msmc2.final.txt", header=TRUE) plot(afrDat$left_time_boundary/mu*gen, (1/afrDat$lambda)/mu, log="x",ylim=c(0,100000), type="n", xlab="Years ago", ylab="effective population size") lines(afrDat$left_time_boundary/mu*gen, (1/afrDat$lambda)/mu, type="s", col="red") lines(eurDat$left_time_boundary/mu*gen, (1/eurDat$lambda)/mu, type="s", col="blue") legend("topright",legend=c("African", "European"), col=c("red", "blue"), lty=c(1,1))
The code produces this plot:
You can also try to add curves for the other two individuals (the file names start with EUR2 and AFR2).
You can see that both ancestral populations had similar effective population sizes before 200,000 years ago, after which the European ancestors experienced a severe population bottleneck. Of course, this is relatively low resolution because we are only analysing one chromosome, but the basic signal is already visible. Note that here we have scaled times and rates using a generation time of 30 years and a mutation rate of 1.25e-8, which are the same values as used in the initial publication on MSMC.
For the cross-population results, we would like to plot the coalescence rate across populations relative to the values within the populations. However, since we have obtained these three rates indepdenently, we have allowed MSMC2 to choose different time interval boundaries in each case, depending on the observed heterozygosity within and across populations. We therefore first have to use a script
combinedCrossCoal.py available in the msmc-tools repository, to combine the three rates:
combineCrossCoal.py ./EUR1_AFR1.msmc2.final.txt ./EUR1.msmc2.final.txt \ ./AFR1.msmc2.final.txt > ./EUR_AFR.combined.msmc2.final.txt
The resulting file looks like this:
time_index left_time_boundary right_time_boundary lambda_00 lambda_01 lambda_11 0 1.1893075e-06 4.75723e-06 1284.0425703 2.24322 2650.59574175 1 4.75723e-06 1.15451e-05 3247.01877925 2.24322 2940.90417746 2 1.15451e-05 2.12306e-05 7798.2270432 99.0725 2526.98957475 3 2.12306e-05 3.50503e-05 11261.3153077 2271.31 2860.21608183 4 3.50503e-05 5.47692e-05 8074.85679367 4313.17 3075.15793155
Here, instead of just one columns with coalescence rates, as before, we now have three. The first is the rate within population 0, the second across populations, the third within population 1.
OK, so we can now plot the relative cross-coalescence rate as a function of time:
mu <- 1.25e-8 gen <- 30 crossPopDat<-read.table("~/workshop_materials/27_demography_PSMC_MSMC/human_results/EUR_AFR.combined.msmc2.final.txt", header=TRUE) plot(crossPopDat$left_time_boundary/mu*gen, 2 * crossPopDat$lambda_01 / (crossPopDat$lambda_00 + crossPopDat$lambda_11), xlim=c(1000,500000),ylim=c(0,1), type="n", xlab="Years ago", ylab="relative cross-coalescence rate") lines(crossPopDat$left_time_boundary/mu*gen, 2 * crossPopDat$lambda_01 / (crossPopDat$lambda_00 + crossPopDat$lambda_11), type="s")
which produces this plot:
where you can see that the separation of (West-African) and European ancestors began already 200,000 years ago. The two populations then became progressively more separated over time, reaching a mid-point of 0.5 around 80,000 years ago. Since about 45,000 years, the two population seem fully separated on this plot. Note that even in simulations with a sharp separation, MSMC would not produce an infinitely sharp separation curve, but introduces a "smear" around the true separation time, so this plot is compatible also with the assumption that the two populations where already fully separated around 60,000 years ago, even though the relative cross-coalescence rate is not zero at that point yet.
Additional output plots from other analyses
Above we only ran MSMC with a single pair of sequences, so effectively PSMC', because of compute limitations, although we had available four sequences for each populations (two each from the mother and the father of the trio). Here are the results from running with all four sequences, obtained with -I 0,1,2,3 and -I 4,5,6,7 respectively for the single populations analyses, and -I 0,1,4,5 -P 0,0,1,1. Also we plot the cross-coalescence plot directly.
What differences do you see between the two-sequence and four-sequence versions? How much more data are we using with 4-sequences rather than two, for within population and between population?
Here are corresponding plots for two Malawi cichlid species, also obtained for trios, mapped to scaffolds of typical length 1-2Mb.
What is this telling us about species separation between A. calliptera and A. stuartgranti?
These plots have a spike at the left hand side. What might explain that?
Plot and interpret more MSMC results on cichlid species
There are more results for other cichlid species in the folder
~/workshop_materials/27_demography_PSMC_MSMC/cichlid_results_hannes. Try to modify the R code above to plot these results and try to interpret them. Note that the mutation rate is estimated at 3e-9 and generation time at 3 years.