Visualizing the footprints of mutation spectrum evolution

Activity by Kelley Harris, 31 January 2018


In today’s lecture, you learned that the mutational process is not completely clock-like, but is better regarded as a mixture of many different processes that damage DNA and cause it to be replicated unfaithfully. These different processes can be disentangled by looking at the relative rates of different types of mutations, and comparing these relative rates across populations or species. For more background, see Harris & Pritchard eLife 2017.

We will be using python scripts to extract mutation spectrum information from a VCF of human variation data, the final 1000 Genomes Phase 3 panel, using the human hg19 reference genome sequence to identify the sequence context in which each variant occurs and using a human-chimpanzee genome alignment for ancestral allele polarization. We will then visualize these differences in a two different ways: using a heat map and a principal components analysis plot.

If you have time at the end of the exercise or after the workshop, consider modifying the pipeline to have a look at mutation spectrum variation in a whole genome dataset of your own. Very little is known about the ancestral dynamics of the mutation process in species outside of humans and great apes, and it would be exciting to see how mutagenesis has been varying in species with different demographic histories and reproductive strategies.

Getting started

To carry out this exercise, connect to your AMI instance using the Terminal and the ssh command.

The python scripts that you will need to do the exercise are provided in the folder ~/workshop_materials/11_mutation_spectrum. You will notice that the folder contains a directory called data, which contains processed data for most of the human genome but is missing chromosome 22. Your first step will be to download the missing chromosome 22 data from the UCSC genome browser and the 1000 Genomes FTP server, so that you get to practice looking through the rich treasure troves of public data available on the internet.

VCF files list the SNPs that occur in a set of whole genome sequences. For each SNP, they list the position in the reference sequence where the SNP occurs, the reference base at that position, and the alternate base that occurs in many of the reads that map to that position. However, the VCF file is missing two pieces of information that are crucial for mutation spectrum analysis. One piece is ancestral state information: is the reference allele or the alternate allele the one that more recently arose by mutation? The other missing piece is the triplet context: which base pairs flank the mutation in the reference sequence? To add these pieces of information to the VCF, we must download the reference genome sequence and an alignment between the reference genome and an outgroup genome, then reformat these files.

Downloading and preprocessing your reference files

Your first task is to use the UCSC Genome Browser to find the hg19 human reference sequence and an alignment between hg19 and the most up-to-date chimpanzee reference sequence. Locate the files you want to download then use wget to download them and decompress the hg19 human reference sequence file only. Download the reference sequence and the hg19-chimp reference sequence for chromosome 22. Put these sequences into the directories data/hg19_reference and data/hg19_chimp_align, respectively.

Show me the answer!


We will also use the UCSC Genome Browser to obtain bed file annotations of regions of the genome that we may want to filter out of our analysis. We want to filter out the regions annotated as “conserved” by the program PhastCons, which appear to be evolving under significant amounts of natural selection, and we also want to filter out regions annotated as repeats by RepeatMasker. First, discuss with your neighbor why it might be a good idea to filter out these regions. Next, use the Genome Browser website to look for the files phastConsElements100way.txt and nestedRepeats.txt, making sure to get the files that are indexed with respect to the hg19 reference. Download these files into the empty folder data/bed_files and decompress them gzip -d FILE_NAME.

Show me the answer!

cd bed_files
gzip -d phastConsElements100way.txt.gz
gzip -d nestedRepeats.txt.gz
cd ..

The directory preprocess contains two scripts that are needed to reformat these raw reference files. You can move into that folder and run them with the commands:

python #reformats the reference fasta
python #reformats the alignment to the outgroup

Extracting the mutation spectrum from the VCF

Armed with your preprocessed reference files, you can now extract the mutation spectrum of chromosome 22. I pre-extracted the mutation spectra of the other chromosomes; you can see that these processed spectra are located in the folder finescale_mut_spectra, and that the chromosome 22 mutation spectrum is missing.

To generate the chromosome 22 mutation spectrum, you will need to google the 1000 Genomes Project Phase 3 FTP server, download the VCF file for chromosome 22, and add it to the empty folder data/vcfs. (Hint: look for the 1000 Genomes FTP, look through the directory structure for a folder called “release,” then look inside the release folder for the subdirectory that has been modified most recently. This will contain the VCF you want).

Show me the answer!


Then, in your AMI instance, use the following command from the count directory to count the number of SNPs occurring in each triplet sequence context at each allele frequency:

python -c [chromosome_number]

Note that singletons are excluded from the counts due to concerns that they could be enriched for sequencing errors.

You should find that the directory finescale_mut_spectra now contains new files with data from chromosome 22. Some tabulate all mutations that occur in chromosome 22, for each population separately, while the others tabulate mutations that occur only in conserved regions or only within repeat elements. When you start plotting mutation spectrum data, the plotting scripts will subtract away the mutations that occur within the regions you want to filter away.

It is easy to modify the script to output mutation spectra contained within any genomic annotation you can specify with a bed file, say, with high recombination rate regions or within open chromatin. If you have time at the end of the exercise, you can experiment with doing this.

Creating a mutation spectrum heat map

Now that you’ve tabulated mutation spectrum data for each population, it’s time to visualize the data so we can pick out differences between populations. The script you will use is, within the plot directory. There are a lot of different options built in, which I hope you’ll play around with.

First, make a figure with three side by side mutation spectrum comparisons: Great Britain (GBR) vs Chinese from Beijing (CHB); Finnish (FIN) vs Japanese from Tokyo (JPT); GBR vs FIN; JPT vs CHB. The command for doing this is:


The output of this script will be named “heatmap.pdf”, and you can open it either after using scp to download it to you own machine (e.g. with scp wpsg@[your-amazon-ip] . in a new terminal window), or by accessing it through X2Go or RStudio. Each grid square represents a mutation type (a mutation from some triplet ABC to some other triplet ADC). A red square means that the population listed first has a higher proportion of ABC>ADC mutations, relative to other mutation types, than the population listed second. A blue square means that the population listed first is deplete of ABC>ADC mutations relative to the population listed second. In both cases, a white dot on the square means that the difference between populations is significant (chi2 significance value p<10-5).

Take a moment to compare your heat map with your neighbor’s and discuss your observations about it.

Once you’ve successfully done your first heat map and discussed it, it’s time to start playing with the following optional command line arguments:

-c chrom1 chrom2 … : only use mutation counts from the chromosomes listed. Default is range(1,23).

-f min_frequency max_frequency: Only count mutations whose frequencies range between the listed min and max, inclusive. Default is min_frequency=0, max_frequency=1.

-e Exclude: If this flag is used, exclude the SNPs occurring in PhyloP conserved regions and repetitive regions.

-i Analyze chromosomes individually: If this flag is used, output a separate heat map for each specified chromosome individually.

-p pval : Print dots on all squares representing mutation types that are significantly enriched in one population at a chi-square significance level of pval. Default = 1e-5.

Use these flags to generate the following heat maps:

1. Generate a heat map using only chromosome 22 data with the following 4 comparisons side by side: Great Britain vs Han Chinese from Beijing; Finland vs Japanese; Great Britain vs Finland; Japanese vs Han Chinese from Beijing. Discuss. Try generating the same heat map without excluding repeats or conserved regions. Do you see much of a difference? Repeat using precomputed summary data for all 22 autosomes.

2. Generate another heat map with the same population comparisons (all 22 autosomes), but restricting to alleles with frequency less than 0.98. Discuss the differences.

3. Generate two more heat maps with the same population comparisons (all 22 autosomes), one restricting to alleles with frequency 0.5-0.9; one with frequencies 0.1-0.5; one with frequency <0.1. Discuss the differences.

4. Generate a new heat map plot with 4 panels that each compare Great Britain vs Han Chinese from Beijing, each panel using data from a different chromosome. The panels should correspond to chromosomes 1, 2, 22, and X. What do you observe?

Principal component analysis

Although the mutation spectrum heat map is good for pinpointing differences between populations in the types of mutations that have been accumulating, it is less suitable for visualizing how the differentiation between populations compares to the heterogeneity within populations. To explore how much the mutation spectrum varies within populations, we will use a type of principal component analysis (PCA).

You have probably seen population-differentiation PCAs before, as pioneered in Novembre, et al. Nature 2008. These PCAs are generated by summarizing each genome as a 10,000(+)-dimensional genotype vector, then projecting these vectors onto the optimal 2-dimensional plane for viewing. Here, we will summarize each genome as a 96-dimensional vector that counts the proportion of derived alleles from that genome falling into the 96 possible frequency classes (You might think that there are 192 mutation categories of the form ABC>ADC, but recall that each mutation type is the same as its reverse complement, e.g. CCA>CTA is the same as TGG> TAG.)

To make mutation spectrum PCA plots, we need to count mutations in a different way than we counted them for the heat maps. Instead of counting the number of mutations that occur in each frequency class, we need to count the number that occur within each genome. As before, we exclude singletons due to data quality concerns. To count the mutation types in each genome for chromosome 22, run the following command in the count directory:

python -c 22

As before, counts for the remaining chromosomes are waiting precomputed for you in the finescale_mut_spectra folder.

After generating mutation counts for each genome, we need to switch over to the plot directory and start plotting PCAs. The command

python -g YRI GBR

will make a PCA showing how distinct the Yoruban from Ibadan (YRI) genomes are from the Great Britain (GBR) genomes in terms of mutation spectrum, compared to the variation within the two groups. As with heat maps, you can restrict to a subset of chromosomes using the -c flag:

-c 1 2 … : only use mutation counts from the chromosomes listed. Default is range(1,23).

Here are some exercises to get you acquainted with mutation spectrum PCAs:
1. Using the chromosome 22 data you just generated, make a PCA plot comparing the following groups: Yoruba (YRI), African Americans from the Southwest US (ASW), Mexicans from LA (MXL), Han Chinese from Shanghai (CHS), Japanese (JPT), Great Britain (GBR), Finnish (FIN). Make another PCA for the same groups using all 22 autosomes. Why do these look different?
2. Generate continent-specific PCAs: one for the set of all East Asian groups (JPT, CHS, CHB, CDX, KHV) and a second PCA for the set of all Europeans (CEU, GBR, FIN, TSI, IBS). Discuss the similarities and differences between these plots and the intercontinental PCAs you generated.
3. For your intercontinental and intracontinental PCAs, experiment with including different numbers of chromosomes and see how much data you need for structure to emerge.


If you get tired of exploring patterns of human mutation spectrum variation, you can choose to accept the challenge of mapping mutation spectrum variation in Arabidopsis thaliana! To do so, you should first make a copy of the project folder 11_mutation_spectrum as 11a_mutation_spectrum. You’ll then need to replace the human data files in this folder with Arabidopsis data files, as follows:

The Arabidopsis 1001 Genomes Project is a lot like the human 1000 genomes project in that it sampled individuals from all over the world. In this project, individuals are called “accessions.” In your copied folder, replace the human 1000 genomes vcf with the vcf of chromosome 3 from Angela Hancock’s exercise in the folder ~/workshop_materials/09_adaptation/data/

To process the SNPs in this file, you will need to download the A. thaliana reference chromosome fasta for chromosome 3 from the UCSC browser. Instead of the alignment between human and chimp, you will need the alignment of A. thaliana chromosome 3 to its relative A. lyrata, available here:

Finally, you will need to replace the file 1000genomes_phase3_sample_IDs.txt with the accession label file from Angela Hancock’s exercise. Use the grep command to find all instances of 1000genomes_phase3_sample_IDs.txt in the folder counting scripts and replace it with the name of your new population file.

While you have the file open, edit the header section that currently reads:

group_to_populations = {
'EAS': ['CHB','JPT','CHS','CDX','KHV','CHD'],
'EUR': ['CEU','TSI','GBR','FIN','IBS'],
'AFR': ['YRI','LWK','GWD','MSL','ESN', 'ACB', 'ASW'],
'SAS': ['GIH','PJL','BEB','STU','ITU'],
'AMR': ['CLM','MXL','PUR','PEL'],

This current header assigns the 3-letter codes that stand for human populations to the 3-letter codes that stand for larger groups (for example, the first line says that CHB (Han Chinese from Beijing) and JPT (Japanese from Tokyo) are both EAS (East Asian) populations. Delete this header and replace it with the following one, which classifies Arabidopsis populations into the appropriate continents:

group_to_populations = {
‘east_eurasia': [‘central_europe’,’central_asia’,italy_balkan_caucasus’],
‘sweden’: [‘northern_sweden’,’southern_sweden’],
‘west_eurasia’: [‘western_europe’,’germany’,’iberian_peninsula’],
'relict': [‘relict’],
'admixed': [‘admixed’] }

Use what you learned from the human mutation spectrum exercise to preprocess the Arabidopsis data, count mutations, and investigate mutation spectrum variation across worldwide accessions of Arabidopsis.

BTW, bear in mind that this analysis has not been attempted before, so any cool results could be a nice paper!