Inferring population recombination rate from population data
Marion Talbi and Milan Malinsky, 08 January 2025
Background and Objectives
Every generation, recombination shuffles together regions of the genome with different patterns of genetic diversity, generating new haplotypes that natural selection can act on. Thus, both the rate and the distribution of recombination affect the strength of selection, and can influence the speed and dynamics of adaptation and speciation. It is therefore a valuable parameter to estimate in population genomic studies.
To infer recombination, different methods are available (see the review by Peñalba and Wolf (2020) for more background).
With the increasing availability of population genomic data, LD-based methods have demonstrated a great ability to infer fine-scale recombination rates. These methods use the patterns of distribution and association of polymorphisms present in population genetic data, within the theoretical framework of the coalescent, to infer the population recombination rate, ρ. We refer to the population recombination rate because the patterns of LD that we observe result from the product of the per-generation recombination rate r and the effective population size Ne.
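More precisely, for a diploid population the standard relationship is ρ = 4 × Ne × r; the same LD patterns can therefore be produced by a small population with frequent recombination or by a large population with rare recombination.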
In this activity, we will use the software pyrho, together with smc++ to account for demographic history.
Table of contents
- Inference of the demographic history for each population
- Inference of fine-scale recombination rates
- Visualisation with RStudio
Material
All material files are available in ~/workshop_materials/21_recombination_estimation.
Workflow
I. Inference of the demographic history
As mentioned earlier, ρ encapsulates both r and Ne; variation in Ne will therefore affect our estimates of ρ. We thus first need to infer past changes in demography.
In this activity, we will use smc++ (other methods are available, e.g. MSMC or fsc2 – activity to come soon).
smc++ uses the classical SMC algorithm (from PSMC), with the site frequency spectrum of all other individuals as the observed states. (See Terhorst et al. 2017, Robust and scalable inference of population history from hundreds of unphased whole genomes, for more background.)
Inferring changes in demography with smc++ is a straightforward three-step procedure, but first, use the following command to activate smc++ on your instance:
source /home/wpsg/smcpp/bin/activate
And this one to move to the right directory:
cd ./smcpp/
Once you have done this, you can continue with the next steps.
- Convert VCF to smc++ format:
- Fit a model of population demography changes to the data:
- Plot data and output a csv file:
First, we want to prepare the input file using the following commands:
Show me the commands
vcf=../input_files/Astac_LS420024.2.vcf.gz
zcat ${vcf} | sed 's/\t\.:/\t\.\/\.:/g' | bgzip > Astac_LS420024.2.smc.vcf.gz
This sed command changes how missing genotypes are encoded in our VCF (from . to ./.).
bcftools index Astac_LS420024.2.smc.vcf.gz
This command creates an index for our VCF.
vcf_smc=./Astac_LS420024.2.smc.vcf.gz
This is the input VCF that you will need to run smc++.
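If you want to check that the recoding worked, here is an optional quick look at the first records (missing genotypes should now appear as ./. rather than .):
zcat ${vcf_smc} | grep -v '^##' | head -n 3
#Print the sample header line and the first two variant records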
Then, you can try to run the following commands (all the commands you'll need are also provided in the scripts folder):
vcf_smc=./Astac_LS420024.2.smc.vcf.gz
#We save the name of a VCF into a variable
chr=LS420024.2
#Same for the chromosome name
##Benthic pop
smc++ vcf2smc ${vcf_smc} ${chr}_BEN.smc.gz ${chr} BEN:D29A01,D29A02,D29A03,D29A04,D29A05,D29A08,D29A09,D29B01,D29B02,D29B03,D29B05,D29B06,D29B07,D29B08,D29B10,D29C01,D29C04,D29C05,D29C07,D29C08,D29C09,D29C10,D29D01,D29D02,D29D03,D29D04,D29D06,D29D07,D29D08,D29D09,D29D10,D29E01,D29E02,D29E03,D29E04,D29E05,D29E06,D29E07,D29E08,D29F01,D29F02,D29F03,D29F04,D29F05,D29F06,D29F07,D29E09,D29E10,D29F08,D29F09,D29F10,D29G02,D29G03,D29G04,D29G05,D29G06,D29G07,D29G08,D29G09,D29G10,D29H01,D29H03,D29H04,G03I01,G08A03,G08A05,G08B05,G08B07,G08C03,G08D01
##Littoral pop
smc++ vcf2smc ${vcf_smc} ${chr}_LIT.smc.gz ${chr} LIT:D31B06,D31B07,D31B08,D31B09,D31B10,D31E04,D31E05,D31E06,D31E09,D31E10,D31F01,D31F02,D31F03,D31F04,D31F05,D31F06,D31F07,D31F08,D31F09,D31F10,D31G01,D31G02,D31G03,D31G04,D31G05,D31G06,D31G07,D31G08,D31G10,D31H01,D31H02,D31H03,D31H04,D31H05,D31H06,D31H07,D31H09,D31H10,D31I06,D31I07,D31I08,D31I09,D31J01,D31J02,D31J03,G01B03,G02D08,G03F07,G04C05,G04D04,G04H05,G05D06,G05H07,G05I01,G05I04,G05J08,G08I05,G08J04,G09A05,G09B09,G09C02,G10G06,G10G07,G10G10,G10H03,G10H04,G10H05,G10H06,G11A07
This creates the smc++ input files for both populations.
The --mask option is not required, but it can be used, for example, to focus only on non-coding regions of the genome. In real life, you might want to exclude some regions of the genome from your analysis (e.g. low-quality regions or regions undergoing selection). In the folder, you will find the file Astac_allMasksandCodingregions.bed.gz. This is a bed file including regions with low mappability, missing individuals, regions considered as low quality by GATK, and regions with extreme depths (either too high or too low). The mask also includes the coding regions of the genome, because these regions might be under positive selection. If you have time, you can rerun vcf2smc with the --mask option and compare the inferences obtained with and without the mask; a sketch follows.
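For reference, here is a minimal sketch of such a rerun for the benthic population; the mask location and the output name are assumptions, and smc++ expects the mask to be a bgzipped, tabix-indexed bed file (the sample list is shortened here; use the full BEN list from above):
mask=../input_files/Astac_allMasksandCodingregions.bed.gz
#Index the mask if it is not indexed already
tabix -p bed ${mask}
#Same vcf2smc call as above, with the mask added
smc++ vcf2smc --mask ${mask} ${vcf_smc} ${chr}_BEN_masked.smc.gz ${chr} BEN:D29A01,D29A02,D29A03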
Show me the command
mu=3.5e-9
#The mutation rate per base pair per generation
smc++ estimate --base LS420024.2_estimate_BEN -o ./ ${mu} LS420024.2_BEN.smc.gz
smc++ estimate --base LS420024.2_estimate_LIT -o ./ ${mu} LS420024.2_LIT.smc.gz
This step takes a while to run. You can try it of course, but the output files are already in the output_files folder.
Show me the command
Move to the output folder (cd ./output_files) and run the following command:
chr=LS420024.2
smc++ plot --csv ${chr}_BENLIT_smcpp.pdf ${chr}_estimate_BEN.final.json ${chr}_estimate_LIT.final.json
Show me the answer!
We now should have a csv file and a pdf file with population sizes through time.
The csv file displays the name of the population, the time in generations, and the corresponding effective population size for each time interval.
In the folder, you will find two other output files from smc++:
- One run on chromosome 4 (i.e. LS420022.2): ./smcpp/output_files/LS420022.2_BENLIT_smcpp.csv
- One run on all the chromosomes: ./smcpp/output_files/All_chromosome/All_chromosome_BENLIT_smcpp.csv
Show me the answer!
II. Inference of fine-scale recombination rates
For this part, we will need to use pyrho.
To access it on your instance, you will need to activate the following conda environment:
conda activate pyrho
And to move to the right directory:
cd ~/workshop_materials/21_recombination_estimation/pyrho/
- Lookup table construction
- Fine-scale recombination rate inference, which relies on two important parameters:
- The block penalty: determines the smoothness of the map.
- The window size w: determines how far apart SNPs can be before being ignored.
The software LDpop is used to build the lookup table, i.e. the table with all the two-locus likelihoods of observing the data for a grid of values of ρ, given a mutation rate. Compared to older software (see LDhelmet), it incorporates changes in demography into the likelihood estimation.
Here is the command to use. Since we ran smc++ earlier, we can now include the inferred changes in demography to obtain more accurate lookup tables.
pyrho make_table --samplesize [nhaplotypes] --approx --moran_pop_size [Nmoran] --mu [mutation_rate] --outfile [out.name] --popsizes [pop_sizes] --epochtimes [epochtimes]
Show me the answer!
You can open this file with this command:
less ~/workshop_materials/21_recombination_estimation/pyrho/scripts/script_pyrho_table.sh
Or, if you know how to exit vi:
vi ~/workshop_materials/21_recombination_estimation/pyrho/scripts/script_pyrho_table.sh
Then, the final commands with all arguments are:
pyrho make_table --samplesize 138 --approx --moran_pop_size 200 --mu 3.5e-9 --outfile ./output_files/69_pyrho_lookuptable_littoral --popsizes 75996,25195,12907,5699,6792,21764,84371,177574,103772 --epochtimes 1943,2484,2880,4134,7441,18379,54395,112126
and
pyrho make_table --samplesize 140 --approx --moran_pop_size 200 --mu 3.5e-9 --outfile ./output_files/70_pyrho_lookuptable_benthic --popsizes 72630,12907,5699,6792,21764,84371,177574,103772 --epochtimes 2006,2880,4134,7441,18379,54395,112126
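Note that the --popsizes and --epochtimes values above come from the smc++ results. As a minimal sketch, assuming the csv produced by smc++ plot has one row per time interval with the three columns described earlier (population label, time in generations, Ne; the csv path is an assumption), you could extract them like this:
csv=../smcpp/output_files/LS420024.2_BENLIT_smcpp.csv
#Ne values for the LIT population, comma-separated (--popsizes)
awk -F',' '$1=="LIT" {print $3}' ${csv} | paste -sd, -
#Epoch start times in generations, skipping the first epoch, which starts at time 0 (--epochtimes)
awk -F',' '$1=="LIT" {print $2}' ${csv} | tail -n +2 | paste -sd, -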
The make_table step is pretty long to run. You will find the output files in ./output_files/.
This second step uses a composite likelihood and the fused lasso to infer the population recombination rates.
There are two important parameters here: the block penalty and the window size described above. You could, for example, use simulations to find the most suitable values for these parameters, or you could use the hyperparam command of pyrho (a sketch follows).
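As a rough sketch only: the call below is loosely based on the example in the pyrho documentation, and the flags and file names are assumptions, so check pyrho hyperparam --help before running anything like it:
#Simulate under the inferred demography and score combinations of block penalty and window size
pyrho hyperparam --samplesize 140 --ploidy 2 --mu 3.5e-9 --tablefile ./output_files/70_pyrho_lookuptable_benthic --smcpp_file ../smcpp/output_files/LS420024.2_BENLIT_smcpp.csv --num_sims 3 --blockpenalty 15,20,50 --windowsize 25,50 --outfile ./output_files/benthic_hyperparam_results.txt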
The command to run pyrho optimize is:
pyrho optimize --vcffile ${vcf} --windowsize ${windowsize} --blockpenalty ${bp} --tablefile ${lookuptable} --ploidy 2 --outfile ${name_output}
Show me the answer!
You can open this file with this command:
less ~/workshop_materials/21_recombination_estimation/pyrho/scripts/script_pyrho_optimize.sh
Or, if you know how to exit vi:
vi ~/workshop_materials/21_recombination_estimation/pyrho/scripts/script_pyrho_optimize.sh
The final commands with all arguments are:
pyrho optimize --vcffile ../input_files/LS420024.2_benthic.vcf.gz --windowsize 50 --blockpenalty 15 --tablefile ./output_files/70_pyrho_lookuptable_benthic --ploidy 2 --outfile ./output_files/LS420024.2_benthic_bp15_pyrho.optimize
and
pyrho optimize --vcffile ../input_files/LS420024.2_littoral.vcf.gz --windowsize 50 --blockpenalty 15 --tablefile ./output_files/69_pyrho_lookuptable_littoral --ploidy 2 --outfile ./output_files/LS420024.2_littoral_bp15_pyrho.optimize
The output file is a three-column file. The first column is the start of an interval, the second is the end of the same interval, and the third column is the estimated recombination rate r for that interval, in recombination events per bp per generation.
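For a quick look at the result (using the benthic output produced above):
head -n 5 ./output_files/LS420024.2_benthic_bp15_pyrho.optimize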
Extra question
III. Map Visualisation using RStudio
The R code for this activity is available here. You can then connect to RStudio and copy-paste this code.
There are different units used to refer to recombination rates. For example, pyrho outputs r, the number of recombination events per base pair per generation.
Another common unit used to represent recombination rates is the centimorgan (cM). One centimorgan is the genetic distance corresponding to a 1% chance of observing a recombination event.
Show me the answer!
2.689911e-09 / 0.01 = 2.689911e-07 cM/bp.
Show me the answer!
One cM is still equal to a 1% chance of observing a recombination event. Thus, a rate of 1 cM/Mb corresponds to 10^-8 recombination events per bp per generation.
If we want to convert the output of pyrho into cM/Mb, you can do so with the following R command (you will need to adapt the name of the dataset and the name of the variable):
#Divide by 0.01 to convert r/bp into cM/bp, then multiply by 10^6 to get cM/Mb
maps$"cM/Mb"<-(maps$rate_r/0.01)*10^6
From this per-base rate of recombination, we can now obtain the cumulative genetic distance in centimorgans, which tells us about the total length of the genetic map.
Look at this example:
| Chromosome | Position(bp) | Rate(cM/Mb) | Map(cM) |
|---|---|---|---|
| chr1 | 2539 | 0.455158616243 | 0.00457837 |
| chr1 | 5806 | 0.375735428466 | ? |
Show me the answer!
You will have Map(cM) ≈ 0.0060654 for the second row.
To find the missing value in the table, you can use this calculation:
(5806-2539)*(0.455158616243/10^6) + 0.00457837
i.e. the interval length in bp, times the previous row's rate converted from cM/Mb to cM/bp, plus the previous cumulative map value.
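To compute this for the whole map at once in R, here is a minimal sketch; the data frame and column names (maps with start, end in bp and rate_r in r per bp per generation, as in the pyrho output) are assumptions to adapt to your own objects:
#Convert each interval to centimorgans and accumulate along the chromosome
maps$interval_cM <- (maps$end - maps$start) * (maps$rate_r / 0.01)
maps$map_cM <- cumsum(maps$interval_cM)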
Now that we have our output in two different units and are able to obtain the recombination map in cM, it would be interesting to have a first look at our recombination landscapes.
To do so, we can use the R code provided for this activity.
Show me the answer!
We now have our two recombination landscapes. You can see variation along the chromosome, with some regions having a higher chance of recombination events (i.e. a higher recombination rate).
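For reference, a minimal sketch of such a plot with ggplot2, assuming the benthic map is in a data frame maps_ben with the cM/Mb column created earlier (object names are assumptions):
library(ggplot2)
#Recombination rate along the chromosome, drawn as a step function
ggplot(maps_ben, aes(x = start, y = `cM/Mb`)) +
  geom_step() +
  labs(x = "Position (bp)", y = "Recombination rate (cM/Mb)")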
Once you have managed to do it, you can compare with the maps built with a different value of the block penalty.
Show me the answer!
From this plot, we see that the two maps differ in the number and size of 'spikes'. This makes perfect sense, because the block penalty penalises big jumps along the recombination map.
As you might have noticed, the landscape of recombination is not homogeneous along the genome, and when comparing different populations/species we might want to know how different the landscapes are. The cumulative curve representation is useful for this:
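A minimal sketch of such a comparison, assuming two data frames maps_ben and maps_lit with the end and map_cM columns computed earlier (object names are assumptions):
library(ggplot2)
maps_ben$pop <- "benthic"
maps_lit$pop <- "littoral"
both <- rbind(maps_ben, maps_lit)
#Cumulative genetic distance along the chromosome for each population
ggplot(both, aes(x = end, y = map_cM, colour = pop)) +
  geom_line() +
  labs(x = "Position (bp)", y = "Cumulative genetic distance (cM)")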