Visualizing the footprints of mutation spectrum evolution
Activity by Kelley Harris, 25 January 2020
In the lecture this morning, you learned that the mutational process is not completely clock-like, but is better regarded as a mixture of many different processes that damage DNA and cause it to be replicated unfaithfully. These different processes can be disentangled by looking at the relative rates of different types of mutations, and comparing these relative rates across populations or species. For more background, see Harris & Pritchard eLife 2017.
We will be using python scripts to extract mutation spectrum information from a VCF of human variation data, the final 1000 Genomes Phase 3 panel, using the human hg19 reference genome sequence to identify the sequence context in which each variant occurs and using a human-chimpanzee genome alignment for ancestral allele polarization. We will then visualize these differences in two different ways: using a heat map and a principal components analysis plot.
In the final module of the exercise, you will run a Jupyter Notebook that showcases a new, unpublished Harris Lab method for jointly inferring population size changes with mutation spectrum changes.
How to run this exercise
To carry out this exercise, you have two options: i) connect to your Amazon instance using Guacamole (ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:8080/guacamole/) and open a terminal through it or ii) open a terminal on your laptop and connect via
ssh [email protected] to your instance.
To view the pdf with the plots, you can i) open the pdf files through Guacamole desktop navigating to the right folder (see below) or ii) open an RStudio connection in your browser (ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:8787) and navigate to the right folder in
The python scripts that you will need to do the exercise are provided in the folder
~/workshop_materials/25_mutation_spectrum. You will notice that the folder contains a directory called
data, which contains processed data for most of the human genome but is missing chromosome 22. Your first step will be to download the missing chromosome 22 data from the UCSC genome browser and the 1000 Genomes FTP server, so that you get to practice looking through the rich treasure troves of public data available on the internet.
VCF files list the SNPs that occur in a set of whole genome sequences. For each SNP, they list the position in the reference sequence where the SNP occurs, the reference base at that position, and the alternate base that occurs in many of the reads that map to that position. However, the VCF file is missing two pieces of information that are crucial for mutation spectrum analysis. One piece is ancestral state information: is the reference allele or the alternate allele the one that more recently arose by mutation? The other missing piece is the triplet context: which base pairs flank the mutation in the reference sequence? To add these pieces of information to the VCF, we must download the reference genome sequence and an alignment between the reference genome and an outgroup genome, then reformat these files.
Downloading and preprocessing your reference files
Your first task is to use the UCSC Genome Browser to find the hg19 human reference sequence and an alignment between hg19 and the most up-to-date chimpanzee reference sequence. Locate the files you want to download then use
wget to download them and decompress the hg19 human reference sequence file only. Download the reference sequence and the hg19-chimp reference sequence for chromosome 22. Put these sequences into the directories
Show me the answer!
We will also use the UCSC Genome Browser to obtain bed file annotations of regions of the genome that we may want to filter out of our analysis. We want to filter out the regions annotated as “conserved” by the program PhastCons, which appear to be evolving under significant amounts of natural selection, and we also want to filter out regions annotated as repeats by RepeatMasker. First, discuss with your neighbor why it might be a good idea to filter out these regions. Next, use the Genome Browser website to look for the files
nestedRepeats.txt, making sure to get the files that are indexed with respect to the hg19 reference. Download these files into the empty folder
data/bed_files and decompress them with
gzip -d FILE_NAME.
Show me the answer!
gzip -d phastConsElements100way.txt.gz
gzip -d nestedRepeats.txt.gz
preprocess contains two scripts that are needed to reformat these raw reference files. You can move into that folder and run them with the commands:
python2 gen_oneline_format.py #reformats the reference fasta
python2 extract_diffs.py #reformats the alignment to the outgroup
Extracting the mutation spectrum from the VCF
Armed with your preprocessed reference files, you can now extract the mutation spectrum of chromosome 22. I pre-extracted the mutation spectra of the other chromosomes; you can see that these processed spectra are located in the folder
finescale_mut_spectra, and that the chromosome 22 mutation spectrum is missing.
To generate the chromosome 22 mutation spectrum, you will need to google the 1000 Genomes Project Phase 3 FTP server, download the VCF file for chromosome 22, and add it to the empty folder
data/vcfs. (Hint: look for the 1000 Genomes FTP, look through the directory structure for a folder called “release,” then look inside the release folder for the subdirectory that has been modified most recently. This will contain the VCF you want).
Show me the answer!
Then, in your Amazon instance terminal, use the following command from the
count directory to count the number of SNPs occurring in each triplet sequence context at each allele frequency:
python2 process_chromosomes.py -c [chromosome_number]
(This step can take up to 20 minutes to run. Do you want to take a coffee?)
Note that singletons are excluded from the counts due to concerns that they could be enriched for sequencing errors.
You should find that the directory
finescale_mut_spectra now contains new files with data from chromosome 22. Some tabulate all mutations that occur in chromosome 22, for each population separately, while the others tabulate mutations that occur only in conserved regions or only within repeat elements. When you start plotting mutation spectrum data, the plotting scripts will subtract away the mutations that occur within the regions you want to filter away.
It is easy to modify the
process_chromosomes.py script to output mutation spectra contained within any genomic annotation you can specify with a bed file, say, with high recombination rate regions or within open chromatin. If you have time at the end of the exercise, you can experiment with doing this.
Creating a mutation spectrum heat map
Now that you’ve tabulated mutation spectrum data for each population, it’s time to visualize the data so we can pick out differences between populations. The script you will use is
make_heatmap.py, within the
plot directory. There are a lot of different options built in, which I hope you’ll play around with.
First, make a figure with three side by side mutation spectrum comparisons: Great Britain (GBR) vs Chinese from Beijing (CHB); Finnish (FIN) vs Japanese from Tokyo (JPT); GBR vs FIN; JPT vs CHB.
First, you need to activate a special conda environment where all the python library dependencies are installed by executing this command in your instance terminal:
conda activate mushi
Then, the command for the heatmap is the following:
python make_heatmap.py -p GBR CHB FIN JPT GBR FIN
The output of this script will be named
heatmap.pdf, and you can open it in the Guacamole Desktop or opening a Rstudio connection in your browser and navigating to
Each grid square represents a mutation type (a mutation from some triplet ABC to some other triplet ADC). A red square means that the population listed first has a higher proportion of ABC>ADC mutations, relative to other mutation types, than the population listed second. A blue square means that the population listed first is deplete of ABC>ADC mutations relative to the population listed second. In both cases, a white dot on the square means that the difference between populations is significant (chi2 significance value p<10-5).
Take a moment to compare your heat map with your neighbor’s and discuss your observations about it.
Once you’ve successfully done your first heat map and discussed it, it’s time to start playing with the following optional command line arguments:
-c chrom1 chrom2 … : only use mutation counts from the chromosomes listed. Default is range(1,23).
-f min_frequency max_frequency: Only count mutations whose frequencies range between the listed min and max, inclusive. Default is min_frequency=0, max_frequency=1.
-e Exclude: If this flag is used, exclude the SNPs occurring in PhyloP conserved regions and repetitive regions.
-i Analyze chromosomes individually: If this flag is used, output a separate heat map for each specified chromosome individually.
-p pval : Print dots on all squares representing mutation types that are significantly enriched in one population at a chi-square significance level of pval. Default = 1e-5.
Use these flags to generate the following heat maps:
1. Generate a heat map using only chromosome 22 data and excluding repeats or conserved regions with the following 4 comparisons side by side: Great Britain vs Han Chinese from Beijing; Finland vs Japanese; Great Britain vs Finland; Japanese vs Han Chinese from Beijing. Discuss. Try generating the same heat map without excluding repeats or conserved regions. Do you see much of a difference? Repeat using precomputed summary data for all 22 autosomes.
2. Generate another heat map with the same population comparisons (all 22 autosomes), but restricting to alleles with frequency less than 0.98. Discuss the differences.
3. Generate two more heat maps with the same population comparisons (all 22 autosomes), one restricting to alleles with frequency 0.5-0.9; one with frequencies 0.1-0.5; one with frequency <0.1. Discuss the differences.
4. Generate a new heat map plot with 4 panels that each compare Great Britain vs Han Chinese from Beijing, each panel using data from a different chromosome. The panels should correspond to chromosomes 1, 2, 22, and X. What do you observe?
Principal component analysis
Although the mutation spectrum heat map is good for pinpointing differences between populations in the types of mutations that have been accumulating, it is less suitable for visualizing how the differentiation between populations compares to the heterogeneity within populations. To explore how much the mutation spectrum varies within populations, we will use a type of principal component analysis (PCA).
You have probably seen population-differentiation PCAs before, as pioneered in Novembre, et al. Nature 2008. These PCAs are generated by summarizing each genome as a 10,000(+)-dimensional genotype vector, then projecting these vectors onto the optimal 2-dimensional plane for viewing. Here, we will summarize each genome as a 96-dimensional vector that counts the proportion of derived alleles from that genome falling into the 96 possible frequency classes (You might think that there are 192 mutation categories of the form ABC>ADC, but recall that each mutation type is the same as its reverse complement, e.g. CCA>CTA is the same as TGG> TAG.)
To make mutation spectrum PCA plots, we need to count mutations in a different way than we counted them for the heat maps. Instead of counting the number of mutations that occur in each frequency class, we need to count the number that occur within each genome. As before, we exclude singletons due to data quality concerns. To count the mutation types in each genome for chromosome 22, run the following command in the
python2 count_derived_each_lineage_nosingle.py -c 22
(This command takes about 10 minutes to complete but maybe taking yet another coffee is not a good idea…)
As before, counts for the remaining chromosomes are waiting precomputed for you in the
After generating mutation counts for each genome, we need to switch over to the
plot directory and start plotting PCAs. The command
python2 make_pca.py -g YRI GBR
will make a PCA (file name is
YRI_GBR_mut_PCA_1kg_nosingle_altlegend.pdf) showing how distinct the Yoruban from Ibadan (YRI) genomes are from the Great Britain (GBR) genomes in terms of mutation spectrum, compared to the variation within the two groups. As with heat maps, you can restrict to a subset of chromosomes using the -c flag:
-c 1 2 … : only use mutation counts from the chromosomes listed. Default is range(1,23).
Here are some exercises to get you acquainted with mutation spectrum PCAs:
1. Using the chromosome 22 data you just generated, make a PCA plot comparing the following groups: Yoruba (YRI), African Americans from the Southwest US (ASW), Mexicans from LA (MXL), Han Chinese from Shanghai (CHS), Japanese (JPT), Great Britain (GBR), Finnish (FIN). Make another PCA for the same groups using all 22 autosomes. Why do these look different?
2. Generate continent-specific PCAs: one for the set of all East Asian groups (JPT, CHS, CHB, CDX, KHV) and a second PCA for the set of all Europeans (CEU, GBR, FIN, TSI, IBS). Discuss the similarities and differences between these plots and the intercontinental PCAs you generated.
3. For your intercontinental and intracontinental PCAs, experiment with including different numbers of chromosomes and see how much data you need for structure to emerge.
Using mushi to infer mutation spectrum change over time
For the last chapter of this exercise, you’ll be using some prototype software written by Will Dewitt of the University of Washington. This program jointly infers demographic history and mutation spectrum change from site frequency spectrum data, producing time calibrated histories of mutation rate change in different sequence contexts. A preprint describing the software is forthcoming, and many of you may find it useful for inferring PSMC-style histories from SFS data.
This part of the exercise is arranged in a Jupyter Notebook, a nice and easy way to make your python code organized and easy to use by other people.
Move to this folder
Start the Jupyter Notebook in your instance terminal
jupyter notebook cesky-krumlov-tcc-pulse-timing.ipynb --no-browser --port=8888
You should see a few lines mentioning that the Jupyter Notebook is running.
Now you need to connect to this Jupyter Notebook through your browser. It works as the Rstudio connection but using a different port (ec2-XX-XXX-XX-XXX.compute-1.amazonaws.com:8888). It will ask for the usual pwd.
Once you are in your Jupyter Notebook, move to the directory
~/workshop_materials/25_mutation_spectrum/mushi-1.0-alpha if you are not directed there automatically and click on the file called
cesky-krumlov-tcc-pulse-timing.ipynb that will then open in a new tab.
Go through the notebook cell by cell, executing each command by clicking the cell, and clicking
When you are finished with your exercise in the Jupyter Notebook, go back to terminal where you started and terminate it clicking
ctrl+c and answering “y” to the question that will be prompted.
Then de-activate the conda environment executing this command in your instance terminal: