GWAS activity - Evolution and Genomics

GWAS analysis using python

Magnus Nordborg ([email protected]) and Daniele Filiault ([email protected]), Gregor Mendel Institute of Molecular Plant Biology

30 January 2020

Introduction

Genome-wide association studies (GWAS) are powerful tools for understanding the phenotype-genotype map. As sequencing of non-model organisms becomes more and more tractable, GWAS will be increasingly applicable to many research questions in population and speciation genomics. Today’s session should help make GWAS more accessible for you.

Background and Aim

The goal of this session is to learn the basic concepts behind simple GWAS analysis and understand practical considerations for conducting GWAS. Hopefully this session will give you the background needed to begin using simple GWAS and to explore the topic more on your own.

Learning goals

Determine whether a phenotype is appropriate for GWAS
Understand the types of data required to do GWAS
Run simple GWAS in python using the limix package
Understand Manhattan and QQ plots and what they can tell us
Recognize (and avoid) common GWAS pitfalls

Requirements

python
R
limix

Data set

Genotypes and K matrix

The SNP data and K matrix we are using today are available from the Arabidopsis thaliana 1001 genomes project website https://1001genomes.org/ and are described in the paper 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. In this exercise, we are subsetting the SNP data to 10% of the original dataset for practical reasons while still using the original K matrix.

Phenotypes

We are working with three different phenotypes in this exercise:

Flowering time at 16 degrees

a. This is the number of days between planting and the opening of the first flower for plants grown at 16 degrees in a growth chamber.
b. A GWAS for this phenotype was published alongside the SNP dataset in 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana.
c. We will use a subset of this dataset (200 accessions) in our first GWAS. Later, you can try to do GWAS using the entire dataset (1100 accessions).

Leaf chromium concentration

a. This is the leaf chromium concentration of plants grown in the greenhouse.
b. This dataset (and its GWAS) were published in: Genome-Wide Association Studies Identify Heavy Metal ATPase3 as the Primary Determinant of Natural Variation in Leaf Cadmium in Arabidopsis thaliana.

Rosette color

a. This is unpublished data describing the “purpleness” of plants grown in the field in Sweden (stressed plants are often more purple than their non-stressed counterparts).

How to do this activity

This tutorial consists of three jupyter notebooks that you will run on your Amazon instance and access through a web browser.

We will walk through the GWAS analysis of the first phenotype together, and then you will be free to work on the other phenotypes independently (instructions for independent work are shown below in this page).

Instructions to launch jupyter notebooks using the Amazon instance

Start a jupyter server
You need to initialize a Jupyter server on the Amazon instance using the Terminal. There are two ways to do this:
- Using the web browser and Apache Guacamole:
- Using SSH in a local Terminal window:
In the Terminal, go to the GWAS activity directory: cd ~/workshop_materials/30_GWAS
Start a new Screen session: screen and press Enter
Activate the conda environment called ‘limix’: conda activate limix
Initialize the jupyter notebook server: jupyter notebook
The Terminal window will look frozen but this is normal, the jupyter server will be running in the background.
In a web browser, enter to the address bar: ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8888 (replace XXX-XXX-XXX-XXX for your instance IP address and press Enter). You will be asked to provide a password. Enter the password used before. Then, the content of the activity directory will be displayed on screen:

Note there are three jupyter notebooks (files with extension .ipynb): 1_phenotype_exploration.ipynb, 2_GWAS.ipynb, and 3_GWAS_interpretation.ipynb.

To launch a notebook, just click on it. For example, let’s launch the first notebook called 1_phenotype_exploration.ipynb. It will be displayed on screen like this:

Instructions for Independent Work

In this section, you should work through the three notebooks with two different phenotypes:
- a. Phenotype 1 – Chromium concentration
- b. Phenotype 2 – The entire dataset for flowering time at 16 degrees.
Make sure you change the names of input and output files in section 1b of all three notebooks. To do this, just replace “subset_flowering_time_16” with either “chromium_concentration” or “flowering_time_16”. You shouldn’t change any other part of the file names.
Run the three notebooks step-by-step. Focus on what each step of code is doing and why (rather than trying to understand each line individually).
Do you understand:
- a. What an appropriate phenotype for GWAS looks like?
- b. The input matrices/variables that limix is using?
- c. How a linear mixed model tests for association between genotypes and phenotypes?
- d. How to read and interpret a Manhattan plot (including Bonferroni cutoff)?
- e. What a QQplot looks like if p-values are inflated?
- f. Why we use a minor allele frequency cutoff?
What are the differences in GWAS results among the three phenotypes? Which traits are simple and which are complex? Which have more p-value inflation? Which one do you think is more interesting and why?
If you are working more quickly than the others and run out of phenotypes, why not try one of the following optional challenge exercises?

a. Run GWAS with a phenotyping dataset whose accessions cover a small geographic area (./data/rosette_color.csv). This is a measure of the color of plants growing in the field, which is often a sign of stress. What’s different about GWAS here?
b. Try to run a GWAS with different minor allele frequency cutoffs. You will have to change input files and variables accordingly!
c. If you are interested in hdf5 files and how to use them in python, how about trying to understand the code in notebook 2 line by line?