badMIXTURE: An R package for how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots

Summary

In "A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots", Daniel Falush, Lucy van Dorp and Daniel Lawson discuss how mixture solutions used to understand population genetics data may be misinterpreted. This is the R package that performs the analysis in that paper.

Abstract

Genetic clustering algorithms, implemented in popular programs such as STRUCTURE and ADMIXTURE, have been used extensively in the characterisation of individuals and populations based on genetic data. A successful example is reconstruction of the genetic history of African Americans who are a product of recent admixture between highly differentiated populations. Histories can also be reconstructed using the same procedure for groups which do not have admixture in their recent history, where recent genetic drift is strong or that deviate in other ways from the underlying inference model. Unfortunately, such histories can be misleading. We have implemented an approach to assessing the goodness of fit of the model using the ancestry “palettes” estimated by CHROMOPAINTER and apply it to both simulated and real examples. Combining these complementary analyses with additional methods that are designed to test specific hypotheses allows a richer and more robust analysis of recent demographic history based on genetic data.
What you need

badMIXTURE example: checking a mixture decomposition from PLINK data

This example runs through the process of:

0. Getting the data
1. Converting plink data to mixPainter (chromopainter) format
2. Running ADMIXTURE
3. Running mixPainter in a way that allows comparisons to the ADMIXTURE run
4. Getting the data out of mixPainter and into the badMIXTURE R package

Step 0: Getting the data

From a linux or mac terminal, copy the example data from our repository:

alias mywget=wget
files="Recent_admix.bim Marginalisation_admix.bim Remnants_admix.bim Remnants_admix.fam Recent_admix.fam Marginalisation_admix.fam Marginalisation_admix.bed Recent_admix.bed Remnants_admix.bed"
for file in $files; do
    mywget "https://people.maths.bris.ac.uk/~madjl/finestructure/badmixture/$file"
done
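Before moving on, it can be worth confirming that all nine files arrived intact. The following is a minimal sketch, not part of the example repository: `check_badmixture_files` is a hypothetical helper name, and the only assumption is that the three datasets each come as a `.bed`, `.bim` and `.fam` file, as in the download list above.

```shell
# Hypothetical helper: count how many of the nine expected PLINK files
# are missing or empty in a directory (default: current directory).
check_badmixture_files() {
    dir="${1:-.}"
    missing=0
    for dataset in Recent_admix Marginalisation_admix Remnants_admix; do
        for ext in bed bim fam; do
            f="$dir/$dataset.$ext"
            # -s is true only if the file exists and is non-empty
            if [ ! -s "$f" ]; then
                echo "missing or empty: $f" >&2
                missing=$((missing + 1))
            fi
        done
    done
    # Print the count on stdout; succeed only if nothing is missing
    echo "$missing"
    [ "$missing" -eq 0 ]
}
```

Calling `check_badmixture_files .` in the download directory prints the number of missing files and lists each one on standard error.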

Get the scripts to process the data

git clone https://github.com/danjlawson/badMIXTUREexample.git
## Note that this step uses git. You don't strictly need it: you can download the scripts manually if you prefer. Put them in a folder called "badMIXTUREexample" under your working directory for all the paths to work correctly.

Step 1a: Get all the tools you need
badMIXTURE:

Follow the instructions at https://github.com/danjlawson/badMIXTURE. Specifically:

install.packages("devtools")
library(devtools)
install_github("danjlawson/badMIXTURE")

External tools:

## These are the tools we need. You can either put them all in your path, or update the variables with their full path
plink="plink1.9" # Available from https://www.cog-genomics.org/plink2
plink2chromopainter="plink2chromopainter.pl" # included in the finestructure download: https://people.maths.bris.ac.uk/~madjl/finestructure/finestructure.html
convertrecfile="convertrecfile.pl" # included in the finestructure download
makeuniformrecfile="makeuniformrecfile.pl" # included in the finestructure download
admixture="admixture" # available from https://www.genetics.ucla.edu/software/admixture/download.html
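Since the long-running steps fail late if a tool cannot be found, a quick check up front may save time. This is a minimal sketch under the assumption that each variable above holds either a command name on your path or a full path; `check_tools` is a hypothetical helper, not something shipped with the example scripts.

```shell
# Hypothetical helper: report whether each named tool can be resolved.
# Example call, using the variables defined above:
#   check_tools "$plink" "$plink2chromopainter" "$convertrecfile" \
#               "$makeuniformrecfile" "$admixture"
check_tools() {
    status=0
    for tool in "$@"; do
        # command -v succeeds when the shell can resolve the name to something runnable
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "NOT FOUND: $tool" >&2
            status=1
        fi
    done
    return "$status"
}
```

The function returns a non-zero status if any tool is missing, so it can be used as a guard before launching the conversion.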

Step 1b: Creating the commands to do the conversion

./doprocess.sh # This generates calls to "convert.sh" that will process the three datasets.
## AT THIS POINT YOU SHOULD READ AND CHECK convert.sh!! Understanding that is the point of this exercise!
## In particular, if you just run the scripts, they will use 8 cores of your machine for several days. You might not want this!

Steps 2-4: Running everything

The included script “convert.sh” explains how we did everything for the paper. The process is:

1. Run plink to convert the file to ped/map
   1a. (Optionally, prune the data down to fewer SNPs)
2. Run plink2chromopainter.pl to convert to chromopainter format
3. (Optionally, create a recombination map with either makeuniformrecfile.pl or convertrecfile.pl)
4. Run mixPainter

Each of these steps is a single command.
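As an illustration of the first step only, the sketch below prints (without running) a plink call that converts each dataset from bed/bim/fam to ped/map. The flags `--bfile`, `--recode` and `--out` are standard plink 1.9 options, but the options actually used for the paper are those in convert.sh, which remains the reference; `print_plink_recode` is a hypothetical helper name.

```shell
# Dry run: print, but do not execute, a plink conversion command.
# This is an assumption about what the first step looks like, not a copy
# of convert.sh.
print_plink_recode() {
    plink_bin="${plink:-plink1.9}"   # falls back to the tool name used earlier
    dataset="$1"                     # file prefix, e.g. Recent_admix
    echo "$plink_bin --bfile $dataset --recode --out $dataset"
}

for dataset in Recent_admix Marginalisation_admix Remnants_admix; do
    print_plink_recode "$dataset"
done
```

Printing the commands first makes it easy to compare them against convert.sh before committing any compute time.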

Additional information

ADMIXTURE takes 147 minutes using 8 cores to process the complete dataset; mixPainter takes 1962 minutes using 8 cores. Both scale linearly with the number of SNPs. With the number of individuals, mixPainter scales quadratically.
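These figures give a rough way to budget a reduced run. The sketch below assumes the scaling is exactly linear in SNPs and, for mixPainter, exactly quadratic in individuals, which real runs will only approximate; the function names are hypothetical, and the arguments are the fractions of SNPs and individuals retained relative to the complete dataset.

```shell
# Rough estimates only, scaled from the 8-core timings quoted above.
estimate_admixture_minutes() {
    # $1 = fraction of SNPs kept (e.g. 0.5); number of individuals unchanged
    awk -v s="$1" 'BEGIN { printf "%.2f\n", 147 * s }'
}

estimate_mixpainter_minutes() {
    # $1 = fraction of SNPs kept, $2 = fraction of individuals kept
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1962 * s * n * n }'
}
```

For example, `estimate_mixpainter_minutes 0.5 0.5` evaluates 1962 × 0.5 × 0.5² and prints `245.25`, so halving both the SNPs and the individuals would cut the mixPainter estimate to about four hours.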