Environmental correlation analysis tutorial

Activity by Angela Hancock, 30 January 2018

In this tutorial, we will use the association mapping software GEMMA to identify SNPs that are correlated with climate in a panel of Eurasian Arabidopsis thaliana accessions derived from the 1001 Genomes Project data.

The data set here is not exactly the same data set that was used for analysis previously. First, the SNPs here were re-called from the raw Illumina reads using the SHORE pipeline. These new calls were published together with additional sequence data and the calls are available here under accession #PRJEB240044 in the European Bioinformatics Information Portal (EBI) SVA. For this exercise, we have already filtered out individuals that appear to be contaminants as well as individuals for which specific latitude and longitude coordinates were not available.

Here, we will only run the analysis on one chromosome and one climate variable for the sake of time and simplicity.

Data needed for the tutorial

The files you will need to begin this tutorial are in ~/workshop_materials/09_adaptation/data/

File with geo-referenced individuals: 1001genomes_rmv_contaminants_longlat.txt
This file must minimally contain Sample IDs and coordinates.

Climate data are in the folder ~/workshop_materials/09_adaptation/data/WORLDCLIM/
For this exercise, we will use raster data from the Bioclim Project.

SNP data: 1001_reduced_chr3.recode.vcf
Chromosome 3 from Arabidopsis thaliana 1001 Genomes Project

Tutorial overview

  1. Load climate data and examine; could do some PCA to examine relationships
  2. Extract climate data for accessions, save as R data object
  3. Prepare genotype calls for analysis and load into GEMMA
  4. Prepare a climate file for running in GEMMA
  5. Create a kinship matrix using GEMMA
  6. Run the correlation analysis using GEMMA
  7. Examine the results

1. Load climate data into R and extract information for our accession locations

We will work with the Bioclim data set, which is derived from the WORLDCLIM data set.
WorldClim version 2 provides high resolution gridded average monthly climate data for minimum, mean, and maximum temperature and for precipitation for 1970-2000. These data were produced by interpolation with a digital elevation model and data from 60,000 weather stations. The BioClim Project provides data that are averaged across these years, and that are designed to represent ecologically relevant general trends and extremes. There are 19 bioclim variables:

BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp – min temp))
BIO3 = Isothermality (BIO2/BIO7) (* 100)
BIO4 = Temperature Seasonality (standard deviation *100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (BIO5-BIO6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO12 = Annual Precipitation
BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter

WORLDCLIM and BioClim data sets are available for several possible spatial resolutions ranging from 30 seconds (~1 km2) to 10 minutes (~340 km2). Each download is a “zip” file containing 12 GeoTiff (.tif) files, one for each month of the year (January is 1; December is 12). Here we use 2.5 m just to keep the file sizes manageable.

Here is the website and reference for this dataset in case you are interested in learning more about how the data were obtained and processed: http://worldclim.org/version2.

Fick, S.E. and R.J. Hijmans, 2017. Worldclim 2: New 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology.

General requirements: R packages dismo and maptools (these R packages have already been installed on the Amazon machine), the Bioclim data and the set of sample locations (long, lat).
—————-

The files we will use are in ~/workshop_materials/09_adaptation/data/WORLDCLIM/

Connect to your instance via Rstudio and set the correct working directory
setwd("~/workshop_materials/09_adaptation/data")

Load the climate files:
files <- list.files("WORLDCLIM", pattern='tif', full.names=TRUE)

Create a RasterStack using these files:
library(dismo)
bioclim2.5 <- stack(files)

You can get some information about the data structure:
dim(bioclim2.5)
bioclim2.5

You should see something like this:

class : RasterStack
dimensions : 4320, 8640, 37324800, 19 (nrow, ncol, ncell, nlayers)
resolution : 0.04166667, 0.04166667 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
names : wc2.0_bio_2.5m_01, wc2.0_bio_2.5m_02, wc2.0_bio_2.5m_03, wc2.0_bio_2.5m_04, wc2.0_bio_2.5m_05, wc2.0_bio_2.5m_06, wc2.0_bio_2.5m_07, wc2.0_bio_2.5m_08, wc2.0_bio_2.5m_09, wc2.0_bio_2.5m_10, wc2.0_bio_2.5m_11, wc2.0_bio_2.5m_12, wc2.0_bio_2.5m_13, wc2.0_bio_2.5m_14, wc2.0_bio_2.5m_15, ...
min values : -53.7253335, 0.0000000, 0.9529444, 0.0000000, -27.2479992, -71.8680038, 0.0000000, -64.5186653, -48.7379996, -36.0000000, -64.5186653, 0.0000000, 0.0000000, 0.0000000, 0.0000000, ...
max values : 33.96500, 28.18300, 100.00000, 2411.53446, 50.52800, 26.50000, 91.50000, 37.96800, 38.47667, 38.89133, 32.54467, 11246.00000, 2768.00000, 507.00000, 230.69151, ...

The data structure is loaded as a list of rasters for the 19 Bioclim variables. You can examine get a better idea of the global distributions of these variables and how they compare to one another by plotting them on maps. Choose a few to plot.
e.g., a plot of the first 9 variables looks like this:

plot(bioclim2.5[[1:9]])
(note that temperature is in units degrees C ×10).

2. Extract climate data for A. thaliana sample sites

Load A. thaliana location accession
locs1001 <- read.table("1001genomes_rmv_contaminants_longlat.txt", sep=" ", header=T)

Check that this file loaded properly and look at the columns included in this file:
head(locs1001)

We will use the maptools package:
library(maptools)
(you will see an error message saying that "rgeos" is not available, but you can ignore this message)
data(wrld_simpl)

If you want to see how the samples are distributed you can plot them onto a map. This is also a good check to make sure that the coordinates in your file are defined correctly.
plot(wrld_simpl, add=T)
points(locs1001[,4:5], col="red", cex=0.3)

It might be more useful to plot the samples onto a map that shows the distribution of a climate variable.
plot(bioclim2.5, 16) #precip wettest qtr
plot(wrld_simpl, add=T)
points(locs1001[,4:5], col="red", cex=0.3)

Extract climate data for A. thaliana accessions:
climate1001 <- extract(bioclim2.5, locs1001[,4:5])

The new data frame climate1001 contains a value based on the bioclim data for each sample location. There are 1125 Arabidopsis thaliana accessions and 19 bioclim variables, so climate1001 has 1125 rows and 19 columns.
dim(climate1001)
You should see the following:
[1] 1054 19

Take a look at the beginning of the file:
climate1001[1:5,1:5]

Notice that there is no information in this extracted data frame about which accession is which (accession IDs are missing). Add this information:
rownames(climate1001) <- locs1001[,1]

You compared some of the climate maps earlier. Now let’s look at how the variables are correlated based on the matrix of 1001 Genomes sampling sites. First, let’s add more interpretable names to the climate1001 data frame. First, read in some names for the bioclim variables. Note that these are shortened versions of the descriptions provided above, so you can also refer back to the above descriptions for more information.
You can combine the following names into a vector of labels if you would like to have names that are easier to interpret in the following plot:

"1_Ann_T"
"2_Diurnal_Range"
"3_Isothermality"
"4_T_Seasonality"
"5_MaxT_Wrmst_Month"
"6_MinT_Cldst_Month"
"7_T_Ann_Range"
"8_T_Wettest_Qtr"
"9_T_Driest_Qtr"
"10_T_Wrmst_Qtr"
"11_T_Cldst_Qtr"
"12_Ann_Precip"
"13_P_Wettest_Month"
"14_P_Driest_Month"
"15_P_Seasonality"
"16_P_Wettest_Qtr"
"17_P_Driest_Qtr"
"18_P_Wrmst_Qtr"
"19_P_Cldst_Qtr"

bioclim_names <- c("1_Ann_T","2_Diurnal_Range","...")

And add these to the data frame as columns:
colnames(climate1001) <- bioclim_names

Now, create a correlation matrix from the climate1001set. (Note that the cor() function takes a matrix as input, not a data.frame)
cors <- cor(as.matrix(climate1001), method="pearson", use="everything")

Now plot the result as a heatmap.
heatmap(cors, margins=c(10,10), revC=T, col=rev(heat.colors(256)))

The plot should look something like this:

Now, let’s create an R data file with extracted climate information that we will use for gene environment correlation analysis. (We will save it as an R object because it makes loading in the next step a bit cleaner, but we could instead have saved it as a plain text file.)

save(climate1001, file="climate1001.rda")

3. Prepare SNP data for running correlation analysis in GEMMA

Files needed: 1001_reduced_chr3.recode.vcf and climate data file we just created: climate1001.rda

We will use PLINK to prepare our files for running in the GEMMA software. GEMMA is a program designed to make genome-wide mapping using a mixed model efficient. The name is based on the “EMMA” software. The ‘G’ in GEMMA means ‘genome-wide’. In order to speed up the inference, whereas EMMA uses a more traditional approach, including computationally expensive steps to calculate the matrix determinate and to conduct matrix inversion; GEMMA takes first and second derivatives of the matrix and uses recursion steps for optimization. This allows the calculation of an exact solution to obtain betas and likelihoods with much reduced time (linear in number of individuals versus quadratic).We use GEMMA here for genotype-environment analysis, but of course you may want to use a similar approach to conduct genotype-phenotype associations (traditional GWAS).

Open a terminal and log in to your Amazon instance via ssh. Navigate to ~/workshop_materials/09_adaptation/data and create a subdirectory structure for running GEMMA on the plink file:

mkdir plink_chr3

The software PLINK is a multipurpose tool for the analysis of genotype data. It is not covered in the 2018 Workshop on Population and Speciation Genomics; however, if you would like to learn more about the software, you might want to follow the exercise instruction from the 2016 workshop, which you may find here. For file format conversion, we will use the software VCFtools, which you may remember from the VCF Activity of the first workshop day.

Convert the vcf file to PLINK format and save the output in the plink directory:
vcftools --vcf 1001_reduced_chr3.recode.vcf --plink --out plink_chr3/1001_chr3
(if you get an error message saying "Could not open temporary file" you may need to run "ulimit -n 3000" first).

This will create two files: 1001_chr3.ped and 1001_chr3.map. Using PLINK you can convert these into PLINK binary format files, which are required for running GEMMA:

cd plink_chr3/
plink --file 1001_chr3 --make-bed --out 1001_chr3

This creates .bed, .bim, and .fam files. The .bed file is a binary format file that contains genotype information. The .bim file contains chromosome locations of the variants included and the .fam file should contain information about phenotype -- or in this case should be modified to contain the environmental variable to analyse.

Since there is no phenotype information in the vcf file, the version of the .fam file produced by PLINK contains ‘-9’ in the phenotype column.

In order to run the correlation analysis we will need to modify the .fam file from GEMMA to include our climate variable information. We will use R to create a new .fam file by merging the one created by PLINK with the data for bio16 – Precipitation in the wettest quarter in our climate1001 data frame.

4. Prepare climate data for input into GEMMA

Go back to the browser Rstudio

First, let’s take a closer look at the climate variable we will use in this tutorial, Maximum Precipitation in the Wettest Quarter.

Go back to Rstudio and make a histogram of the variable values across the 1001 Genomes individuals:
hist(climate1001[,1], col=”blue”)

Now read the .fam file we just created:
fam <- read.table("plink_chr3/1001_chr3.fam")

Check that the file loaded properly:
head(fam)

Remove the phenotype column from the file:
fam <- fam[,1:5]

Create a climate table with the accession ids and the values for Precipitation in the Wettest Quarter (bio16):
bio16 <- cbind(as.numeric(rownames(climate1001)), climate1001[,16])

Merge this with the information in the .fam file using the accession IDs to create a new .fam file:
newfam <- merge(fam, bio16, by=1)

Save this file as 1001_chr3.fam:
write.table(newfam, "plink_chr3/1001_chr3.fam", row.names=F, col.names=F, quote=F)

5. Create a kinship matrix using GEMMA

Now go one more time to your terminal and execute the following in the folder data/plink_chr3:

gemma -bfile 1001_chr3 -gk 1 -o 1001_chr3
The output file and log file will appear in the automatically created subdirectory ‘output’.

6. Run GEMMA to calculate correlations between the bio16 variable and each genetic variant controlling for relatedness among samples

gemma -bfile 1001_chr3 -k output/1001_chr3.cXX.txt -lmm 4 -o 1001_chr3_bio16_alltests

7. Examine the results

Go back to the Rstudio in your browser so that we can easily see the plots we are going to make

Read the results file ("1001_chr3_bio16_alltests.assoc.txt" into Rstudio):
results <- read.table("plink_chr3/output/1001_chr3_bio16_alltests.assoc.txt", header=T)

How similar are the results from the three different tests?

Calculate pairwise correlations between p-values to answer this.
e.g., cor.test(results$p_wald , results$p_lrt )
- Under the null, the p-values should be uniformly distributed -

Make a histogram of the p-values:
hist(results$p_lrt, col="blue")

Another way to look at this: make a qqplot to compare the –logP distribution to a uniform distribution (-log transformation of a uniform distribution is an exponential distribution).
qqplot(rexp(length(results$p_lrt), rate = log(10)), -log10(results$p_lrt), xlab="Expected quantile", pch=18, cex=0.45)
abline(0,1)

The plot should look something like this:

How well do the p-values fit the expected distribution?

Create a Manhattan plot with the p-values from the likelihood ratio test:
plot(-log10(results$p_lrt), pch=18, cex=0.25, xlab="position", ylab="-logP", col="darkgrey")

The plot should look something like this:

We can add a threshold showing significant loci
abline(h=-log10(0.05/nrow(results)))
How many loci are above the threshold?