GWAS analysis using python
30 January 2020
Genome-wide association studies (GWAS) are powerful tools for understanding the phenotype-genotype map. As sequencing of non-model organisms becomes more and more tractable, GWAS will be increasingly applicable to many research questions in population and speciation genomics. Today’s session should help make GWAS more accessible for you.
Background and Aim
The goal of this session is to learn the basic concepts behind simple GWAS analysis and understand practical considerations for conducting GWAS. Hopefully this session will give you the background needed to begin using simple GWAS and to explore the topic more on your own.
- Determine whether a phenotype is appropriate for GWAS
- Understand the types of data required to do GWAS
- Run simple GWAS in python using the limix package
- Understand Manhattan and QQ plots and what they can tell us
- Recognize (and avoid) common GWAS pitfalls
Genotypes and K matrix
The SNP data and K matrix we are using today are available from the Arabidopsis thaliana 1001 genomes project website https://1001genomes.org/ and are described in the paper 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. In this exercise, we are subsetting the SNP data to 10% of the original dataset for practical reasons while still using the original K matrix.
We are working with three different phenotypes in this exercise:
- Flowering time at 16 degrees
- a. This is the number of days between planting and the opening of the first flower for plants grown at 16 degrees in a growth chamber.
- b. A GWAS for this phenotype was published alongside the SNP dataset in 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana.
- c. We will use a subset of this dataset (200 accessions) in our first GWAS. Later, you can try to do GWAS using the entire dataset (1100 accessions).
- a. This is the leaf chromium concentration of plants grown in the greenhouse.
- b. This dataset (and its GWAS) were published in: Genome-Wide Association Studies Identify Heavy Metal ATPase3 as the Primary Determinant of Natural Variation in Leaf Cadmium in Arabidopsis thaliana.
- a. This is unpublished data describing the “purpleness” of plants grown in the field in Sweden (stressed plants are often more purple than their non-stressed counterparts).
How to do this activity
This tutorial consists of three jupyter notebooks that you will run on your Amazon instance and access through a web browser.
We will walk through the GWAS analysis of the first phenotype together, and then you will be free to work on the other phenotypes independently (instructions for independent work are shown below in this page).
Instructions to launch jupyter notebooks using the Amazon instance
- Start a jupyter server
You need to initialize a Jupyter server on the Amazon instance using the Terminal. There are two ways to do this:
- Using the web browser and Apache Guacamole:
- Type to the address bar of a web browser:
http://ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8080/guacamole(remember to replace XXX-XXX-XXX-XXX for your instance IP address). An Apache Guacamole log-in window will appear on screen
- Use popgen as ‘username’ and the same password as always
- Click on the Terminal and use the same password as before
- Using SSH in a local Terminal window:
- Open the Terminal in your machine and type:
ssh [email protected](again, replace XXX-XXX-XXX-XXX for your instance IP address and press Enter)
- In the Terminal, go to the GWAS activity directory:
- Start a new Screen session:
screenand press Enter
- Activate the conda environment called ‘limix’:
conda activate limix
- Initialize the jupyter notebook server:
The Terminal window will look frozen but this is normal, the jupyter server will be running in the background.
- In a web browser, enter to the address bar:
ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com:8888(replace XXX-XXX-XXX-XXX for your instance IP address and press Enter). You will be asked to provide a password. Enter the password used before. Then, the content of the activity directory will be displayed on screen:
- To launch a notebook, just click on it. For example, let’s launch the first notebook called 1_phenotype_exploration.ipynb. It will be displayed on screen like this:
Note there are three jupyter notebooks (files with extension .ipynb): 1_phenotype_exploration.ipynb, 2_GWAS.ipynb, and 3_GWAS_interpretation.ipynb.
Instructions for Independent Work
- In this section, you should work through the three notebooks with two different phenotypes:
- a. Phenotype 1 – Chromium concentration
- b. Phenotype 2 – The entire dataset for flowering time at 16 degrees.
This is the leaf chromium concentration of plants grown in the greenhouse. Phenotype file =
./data/cadmium_concentration.csv. A GWAS for this dataset is published: https://doi.org/10.1371/journal.pgen.1002923
QWhat does the Manhattan plot look like for a simple trait?
This is the entire flowering time dataset for 1100 accessions. Phenotype file =
QHow much longer does this GWAS take to run? Are the Manhattan and QQ plots different using the full dataset?
- Make sure you change the names of input and output files in section 1b of all three notebooks. To do this, just replace “subset_flowering_time_16” with either “chromium_concentration” or “flowering_time_16”. You shouldn’t change any other part of the file names.
- Run the three notebooks step-by-step. Focus on what each step of code is doing and why (rather than trying to understand each line individually).
- Do you understand:
- a. What an appropriate phenotype for GWAS looks like?
- b. The input matrices/variables that limix is using?
- c. How a linear mixed model tests for association between genotypes and phenotypes?
- d. How to read and interpret a Manhattan plot (including Bonferroni cutoff)?
- e. What a QQplot looks like if p-values are inflated?
- f. Why we use a minor allele frequency cutoff?
- What are the differences in GWAS results among the three phenotypes? Which traits are simple and which are complex? Which have more p-value inflation? Which one do you think is more interesting and why?
- If you are working more quickly than the others and run out of phenotypes, why not try one of the following optional challenge exercises?
- a. Run GWAS with a phenotyping dataset whose accessions cover a small geographic area (
./data/rosette_color.csv). This is a measure of the color of plants growing in the field, which is often a sign of stress. What’s different about GWAS here?
- b. Try to run a GWAS with different minor allele frequency cutoffs. You will have to change input files and variables accordingly!
- c. If you are interested in hdf5 files and how to use them in python, how about trying to understand the code in notebook 2 line by line?