## Activity on admixture and population structure

Activity by Anders Albrechtsen, 25 January 2018

Overview of exercises by Anders here .

## A. Use of NGSadmix to infer admixture proportions for numerous individuals

In this exercise we will try to use NGSadmix to analyze two different NGS datasets.

### Login to the server and set paths

First open a terminal and login to the server (you should `ssh` to your instance, therefore get your according IP address from the website ).
Next – before running any analyses – you need to set paths to the programs and the data you will use. Do this by pasting the following into your terminal window:

```# NB this must be done every time you open a new terminal
AA=/home/wpsg/workshop_materials/05_angsd/PhDCourse

# Set path to ANGSD program
ANGSD=\$AA/prog/angsd/angsd

# Set path to a bam file list with several bam files
BAMFOLDER=\$AA/sfs/data/smallerbams/
```

### First small example

We will first try to run an NGSadmix analysis of a small dataset consisting of bam files with low depth NGS data from 30 human samples: 10 from Nigeria (YRI), 10 from Japan (JPT) and 10 with European ancestry (CEU).

#### Make input data using ANGSD

The input to NGSadmix is genotype likelihoods (GLs). Therefore the first step of running an NGSadmix analysis if all you have are bams files is to calculate GLs. So let’s start bying doing that. First make a file that contains the paths of all the 30 bam files:

```find \$BAMFOLDER |  grep bam\$ > all.files
```

To see the content of the file you made type:

```cat all.files
```

Now calculate GLs from all the BAM files using ANGSD by running the following command in the terminal:

```\$ANGSD -bam all.files -GL 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 2e-6 -minMapQ 30 -minQ 20 -minInd 25 -minMaf 0.05 -doGlf 2 -out all -P 1
```

NOTE that this will take a bit of time to run (a few minutes, ca. ~2min).
While waiting, let’s try to understand the above command and get some info about the data. Here is an explanation of the options used in the command:

``` -bam all.files : tells ANGSD that the bam files to calculate GL from are listed in the file all.files
-GL 2 : tells ANGSD to use the GATK genotype likelihood model
-doMajorMinor 1 : tells ANGSD to infer the major and minor alleles
-doMaf 1 : tells ANGSD to calculate minor allele frequencies (needed by two of the options below: -SNP_pval and -minMaf)
-SNP_pval 2e-6 : tells ANGSD to use a p-value threshold of 2e-6 for calling SNPs
-minMapQ 30 : tells ANGSD what to require as minimum mapping quality (quality filter)
-minQ 20 : tells ANGSD what to require as minimum base quality (quality filter)
-minInd 25 : tells ANGSD to only output GLs for loci with data for at least 25 individuals for each site (quality filter)
-minMaf 0.05 : tells ANGSD to only output GLS for loci with a minimum minor allele frequency of 0.05 (quality filter)
-doGlf 2 : tells ANGSD to write the final genotype likelihoods into a file in beagle format
-out all : tells ANGSD to call all output files it generate "all" followed by the appropriate suffix e.g. "all.beagle.gz"
-P 1 : tells ANDSG use 1 thread (up to 100% CPU)

```

#### Explore the input data

Now let’s have a look at the GL file that you have created with ANGSD. It is a "beagle format" file called all.beagle.gz – and will be the input file to NGSadmix.
The first line in this file is a header line and after that it contains a line for each locus with GLs. By using the unix command wc we can count the number of lines in the file:

```gunzip -c all.beagle.gz | wc -l
```
• Use this to find out how many loci there are GLs for in the data set?

Next, to get an idea of what the GL file contains try from the command line to print the first 9 columns of the first 7 lines of the file:

```gunzip -c all.beagle.gz | head -n 7 | cut -f1-9 | column -t
```

In general, the first three columns of a beagle file contain marker name and the two alleles, allele1 and allele2, present in the locus (in beagle A=0, C=1, G=2, T=3).
All following columns contain genotype likelihoods (three columns for each individual: first GL for homozygote for allele1,
then GL for heterozygote and then GL for homozygote for allele2). Note that the GL values sum to one per site for each individuals. This is just a normalization of the genotype likelihoods in order to avoid underflow problems in the beagle software it does not mean that they are genotype probabilities.

• Based on this, what is the most likely genotype of Ind0 in the first locus and the second locus?

#### Run an analysis of the data with NGSadmix

Now you know how the input looks. Next, let’s try to perform an NGSadmix analyses of the GLs typing assuming the number of ancestral populations, K, is 3:

```\$NGSadmix -likes all.beagle.gz -K 3 -minMaf 0.05 -seed 1 -o all
```
• While waiting for the analysis to run then make sure you understand the command. If you are in doubt seek help here. Here you can also see what other options you have when you run an NGSadmix analyses.

#### Explore the output

The output from the analysis you just ran is three files:

• all.log (a "log file" that summarizes the analysis run)
• all.fopt.gz (an "fopt file", which has a line for each locus that contains an estimate of the allele frequency in each of the 3 assumed ancestral populations)
• all.qopt (a "qopt file", which has a line for each individual that contains an estimate of the individual’s ancestry proportion from each of the three assumed ancestral populations).

Let’s have a look at them one at a time. First, check the log file by typing

```cat all.log
```
• What is the log likelihood of the estimates achieved by NGSadmix (called "best like" in the log file)?

Next, check the first line of the fopt file by typing:

```zcat all.fopt.gz | head -n1
```
• Based on this: what is the estimated allele frequency of the first SNP in three assumed ancestral populations?

Finally, check the first line of the qopt file and thus the estimated admixture proportions for the first individual by typing:

```head -n1 all.qopt
```
• Based on this: does the individual look admixed?

You can see the ID of the first individual by getting the first line of the file you created with all your original bam files in the beginning:

```head -n1 all.files
```
• Based on that ID, which population does the individual come from?
• What does this suggest about what column to look for the frequencies for that population in the qopt file?
• Based on this and the frequency estimates for the first locus that you looked at earlier, what does NGSadmix estimate the allele frequency to be at the first locus in that population?

#### Plot the admixture proportion estimates

Finally, try to make a simple plot the estimated admixture proportions for all the individuals by opening the statistical program called R (which you could do by typing "R" in the terminal and pressing enter. We recommend to visualize results in RStudio in your local browser (e.g. Chrome, Firefox or Safari) by typing “[HereYourIP]:8787”, and then copy the following code:

```## open R
## set your working directory, e.g. setwd("~/")
# Get ID and pop info for each individual
s<-strsplit(basename(scan("all.files",what="theFuck")),"\\.")
pop<-sapply(s,function(x){paste(x[5],x[1],sep="_")})

# Plot them (ordered by population)
ord = order(pop)
par(mar=c(7,4,1,1))
```

Note that the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to the population they are from.

• Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
• What does the plot suggest about whether the individuals are admixed?

NB As you could tell from the number of loci included in the analysis, the above analysis is based on data from very few loci (actually we on purpose only analyzed data from a small part of the genome to make sure the analysis ran fast). Results from an analyses of data from the entire genome can be seen here.

### More realistic example

Now you know how to make input data to NGSadmix, how to run NGSadmix and what the output looks like. Let’s try to look at a more realistic size dataset. More specifically let’s try to run NGSadmix on data from the 1000 genomes project from the following populations:

 ASW HapMap African ancestry individuals from SW US CEU European individuals CHB Han Chinese in Beijing JPT Japanese individuals YRI Yoruba individuals from Nigeria MXL Mexican individuals from LA California

.

To save time we have already made the input file for you for this dataset and a file with population info.

```#A file with genotype likelihoods from 100 individuals in beagle format: path:

##A file with labels that indicate which population they are sampled from:
```

#### Take a quick look at the data

First try to get an overview of the dataset by copying the information file and making a summary using the following:

```#copy to folder

## cut first column | sort | count
cut -f 1 -d " " pop.info | sort | uniq -c
```
• Which countries are the samples from and how many samples from each?

Now look at the genotype file input.gz in the data folder (\$AA/admixture/data/). It is in the same format at the file we looked at in the previous example:

• Use the same approach as in the previous example to find out how many loci there are genotype likelihoods from

#### Run an analysis of the data with NGSadmix

Try to start an analysis of the data with NGSadmix with K=3 (-K 3), using 1 cpu (-P 1), using only SNPs with minor allele frequency above 0.05 (-minMaf 0.05), set the seed set to 21 (-seed 21) and set the prefix of the output files to myownoutfilesK3 (-o myownoutfilesK3).

``` \$NGSadmix -likes \$AA/admixture/data/input.gz -K 3 -P 1 -minMaf 0.05 -seed 21 -o myownoutfilesK3

```

Next, plot the estimated admixture proportions by running the following code in R :

```## open R
## set your working directory, e.g. setwd("~/")

## order according to population
ord<-order(pop[,1])
text(tapply(1:nrow(pop),pop[ord,1],mean),-0.05,unique(pop[ord,1]),xpd=T)
abline(v=cumsum(sapply(unique(pop[ord,1]),function(x){sum(pop[ord,1]==x)})),col=1,lwd=1.2)
```

Note that – like in the previous example – the order of the individuals in the plot are not the same as in the qopt file. Instead, to provide a better overview, the individuals have been ordered according to their population labels.

• Try to explain what the plot shows (what is on the axes, what do the colors mean and so on)
• What does the plot suggest in terms of population structure and admixture?

#### Other K values (if you have time)

• Plot the output (if you have trouble plotting it here is the plot I got: K4 plot.
• Based on all the results what can you say about the Mexican samples (MXL)?

## B. Use of fastNGSadmix to infer admixture proportions for 3 samples

Let’s try to use fastNGSadmix on a couple of samples. And let’s see if we can use it to guess the ancestry of 3 samples: sample1, sample2 and sample3 based on a reference panel.

### Setup paths

To this first setup some paths by typing:

```# Set path for the fastNGSadmix program

# Set path for all input files you will use in this exercise

## genotypes of reference panel

```

### Explore the files with the reference panel

As reference panel we will use data from these 7 populations:

 French French individuals Han Chinese individuals Chukchi Siberian individuals Karitiana Native American individuals Papuan individuals from Papua New Guinea, Melanesia Sindhi individuals from India YRI Yoruba individuals from Nigeria

.

The files with GL data from sample1, sample2 and sample3 are in beagle format — so exactly the same format as the input files you used for NGSadmix. So let’s not spend time on looking at those. But before we start analysing the data then have a quick look at the files with the reference panel (nInd.txt and refPanel.txt) so you know how they look (in case you at some point want to create your own reference panel – which there are scripts for that comes with fastNGSadmix). You can do this by running the following commands:

```# Show the full content of the file nInd.txt
# (a file that has info how many samples from each population the panel consists of)
cat \${inputpath}/nInd.txt | column -t

# Show the top 2 lines of the file refPanel.txt
# (a file that has info about allele frequencies for the 7 populations)
head -n2 \${inputpath}/refPanel.txt | column -t
```
• How many samples from each population does the reference population consist of?
• What are the allele frequencies of the first SNP for each of the populations?

### Analyse the samples with fastNGSadmix

Let’s try to run fastNGSadmix on the 3 samples one at a time with the following commands.:

```# Analyse sample1
\$fastNGSadmix -likes \${inputpath}/sample1.beagle -fname \${inputpath}/refPanel.txt -Nname \${inputpath}/nInd.txt -outfiles sample1 -whichPops all

# Analyse sample2
\$fastNGSadmix -likes \${inputpath}/sample2.beagle.gz -fname \${inputpath}/refPanel.txt -Nname \${inputpath}/nInd.txt -outfiles sample2 -whichPops all

# Analyse sample3
\$fastNGSadmix -likes \${inputpath}/sample3.beagle.gz -fname \${inputpath}/refPanel.txt -Nname \${inputpath}/nInd.txt -outfiles sample3 -whichPops all

```

As you can see the way to run it is similar to NGSadmix. The options -likes and -outfiles are the same (-outfiles is the equivalent of -o in NGSadmix). But now we also have the -fname and -Nname, which allows you to specify files with your reference panel. And then there is actually one more parameter that has to be set, namely -whichPops which allows you to specify that you only want to use a subset of the populations in the reference panel, or that you want to analyze all populations. So e.g. you can re-analyse sample1 using using only 6 of the 7 populations in your reference panel (excluding the French):

```# Re-analyse sample1 with a smaller reference panel
\$fastNGSadmix -likes \${inputpath}/sample1.beagle -fname \${inputpath}/refPanel.txt -Nname \${inputpath}/nInd.txt -outfiles sample1V2 -whichPops Han,Yoruba,Sindhi,Papuan,Chukchi,Karitiana
```

### Take a look at the output files

The output is very similar to that of NGSadmix. There is no fopt file, but there is a log file and and qopt file. And like for NGSadmix the qopt file will tell you the estimated admixture proportion for the sample you have analysed. E.g. try to have a look at the qopt file for sample 1:

```# Show content of sample1.qopt
cat sample1.qopt | column -t
```
• does the sample look admixed?

Now try to look in the log files for the four analyses using the command cat, so e.g. for sample1 type:

```# Show content of sample1.log
cat sample1.log
```
• how many loci are the 4 different analyses based on (this is in the log files and is called Overlap)?

### Plot the analysis results

Now open R and plot the results for all 4 analyses in turn:

```# Plot results for first analysis of sample1

# Plot results for 2nd analysis of sample1 (the one where we excluded French from the reference panel)

# Plot results for analysis of sample2

# Plot results for analysis of sample3
```
• Based on the results: what ancestry do you think the three samples have (ignore the second analysis of sample1 for now)?
• Now look at the results of the second analysis of sample1 (for which a different reference panel was used). Why do you think the result depends on the reference panel and what are the consequences?
• Do you trust the results for sample2 and sample3 given the number of loci it is based on?

In order to investigate this we can let fastNGSadmix run with bootstraps, where we randomly sample (with replacement),
the sites the analysis is based on. This tells us something about how susceptible our estimates are to change.

Run fastNGSadmix with 100 bootstraps for sample2 and sample3:

```\$fastNGSadmix -likes \${inputpath}/sample2.beagle.gz -fname \${inputpath}/refPanel.txt -Nname \${inputpath}/nInd.txt -outfiles sample2boot -whichPops all -boot 100

\$fastNGSadmix -likes \${inputpath}/sample3.beagle.gz -fname \${inputpath}/refPanel.txt -Nname \${inputpath}/nInd.txt -outfiles sample3boot -whichPops all -boot 100
```

Notice that now the FIRST row of the .qopt files, are the estimated ancestry based on ALL sites, and that the subsequent rows,
are the ones based on the bootstraps.

Now let’s try to plot the results in R by opening R and typing:

```
# Plot estimates

# Plot confidence intervals
## - first we take the 0.025 and 0.975 sample quantiles for constructing the confidence interval for out estimates

## - then we plot them
segments(b,lower,b,upper)
segments(b-0.2,lower,b+0.2,lower)
segments(b-0.2,upper,b+0.2,upper)

```
• Can you say with confidence what the ancestry of this sample is?

Let us also plot sample2 with 100 bootstraps:

```
#Open R
# set your working directory, e.g. setwd("~/")
# Plot estimates

# Plot confidence intervals
## - first we take the 0.025 and 0.975 sample quantiles for constructing the confidence interval for out estimates