R RADseq exercise

Background and Aim
Getting Started
Exercise 1 – Raw data manipulation
Exercise 2 – Consensus genotyping
Exercise 3 – Exploring selective sweeps based on allele frequencies
Exercise 4 – Exploring selective sweeps based on read coverage

Background and Aim

Restriction site-associated DNA sequencing (RADseq) is widely regarded as one of the most important methods in ecological genomics because it enables low-cost, large-scale genotyping, even in non-model species. Although it has been implemented in many different ways, the basic idea is to fragment the whole genome non-randomly by digestion with restriction enzymes and then sequence the DNA fragments flanking the restriction sites on a high-throughput sequencing platform. Several software packages (e.g., Stacks) have been developed for analyzing RADseq data. Here you will learn how to manipulate and analyze RADseq data, from filtering the raw sequencing reads to simple population genomic analyses, in the R environment using standard R functions and functions from Bioconductor packages.

If you are not familiar with RADseq library preparation, here is a brief overview.
Further information about RADseq is summarized nicely in this recent review.

Getting Started

During this tutorial we will primarily use the R statistical programming language, applying knowledge from the R session on Wednesday.
You can run R in either of its common forms (plain R or RStudio).
We recommend using your cloud instance for this exercise (enter ‘http://yourDNS:8787’ in your favorite browser, such as Safari, Firefox, or Chrome, and log in), as all the packages and files you need for this activity are already installed on the Amazon server. On the cloud instance, you will find the data files in /home/ubuntu/wpsg_2016/activities/RADseq_R

[If you already have R installed locally, you can also run the exercise on your own computer, as the data sets are truncated. The data files can be found here: Daniel_exercise.zip.]

We will use the R package ‘ShortRead’, released by the Bioconductor project. ShortRead is used for FASTQ input and manipulation, e.g. filtering and trimming reads for a variety of applications. The original publication is: Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H and Gentleman R (2009). ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics, 25, pp. 2607-2608. Further information can be found in the ShortRead manual. We will also use the package Rsamtools to handle RADseq data aligned to a reference genome.

As a quick reference for this tutorial, an R cheat sheet is provided here. It compiles R commands (both basic ones and ones specific to Bioconductor packages) that are useful for data manipulation and analysis in R.

Exercise 1 – Raw data manipulation

In exercise 1 we will learn how to manipulate FASTQ files, including searching for specific sequence motifs, quality filtering, and trimming of reads. Finally we will output a new filtered FASTQ file.

PDF_exercise_1
PDF_solution_1
File: illumina.SE100.fastq
These are stickleback RAD sequences from the Misty system, Canada, unpublished; 100k read subset from a single SE100 Illumina lane.
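
The steps named above (motif search, quality filtering, trimming, writing a new FASTQ file) can be sketched with ShortRead as follows. This is only a minimal illustration, not the exercise solution: the toy reads, the motif TGCAGG, the Phred+33 encoding, and the mean-quality threshold of 20 are all assumptions chosen so that the example runs without the course data.

```r
library(ShortRead)  # Bioconductor; also attaches Biostrings

# Toy FASTQ so the sketch runs anywhere. With the course data you would
# read "illumina.SE100.fastq" instead.
toy <- tempfile(fileext = ".fastq")
writeLines(c(
  "@read1", "TGCAGGACGTACGTAC", "+", strrep("I", 16),
  "@read2", "TGCAGGNNGTACGTAC", "+",
  paste0(strrep("I", 6), "##", strrep("I", 8)),
  "@read3", "ACGTACGTACGTACGT", "+", strrep("I", 16),
  "@read4", "TGCAGGACGTACGTAC", "+", strrep("#", 16)
), toy)

# Assumption: qualities are Phred+33 encoded.
fq <- readFastq(toy, qualityType = "FastqQuality")

# 1. Motif search: reads that start with the (assumed) motif.
motif <- "TGCAGG"
has_motif <- startsWith(as.character(sread(fq)), motif)

# 2. Quality filtering: no uncalled bases and mean Phred score >= 20.
no_n <- alphabetFrequency(sread(fq))[, "N"] == 0
mean_q <- alphabetScore(quality(fq)) / width(fq)
kept <- fq[has_motif & no_n & mean_q >= 20]

# 3. Trimming: remove the motif from the start of each retained read.
trimmed <- narrow(kept, start = nchar(motif) + 1L)

# 4. Output: write the filtered, trimmed reads to a new FASTQ file.
out <- tempfile(fileext = ".fastq")
writeFastq(trimmed, out, compress = FALSE)
```

In this toy set only read1 passes all three filters: read2 contains N, read3 lacks the motif, and read4 has low quality scores.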
______________________________________________________________________________________________

Exercise 2 – Consensus genotyping

In this exercise we will learn how to load and manipulate RADseq data aligned to a reference genome, inspect coverage and haplotype distribution, and call a consensus genotype for each locus.

PDF_exercise_2
PDF_solution_2
File: CGATA.bam
Illumina data, from Roesti et al. 2012 Mol. Ecol., limited to ChrIII only.
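
As a rough sketch of this workflow, the code below reads alignments with Rsamtools, groups reads into loci, tabulates coverage, and calls a consensus genotype from the haplotype counts at each locus. Several things here are assumptions for illustration only: the stand-in BAM file bundled with Rsamtools (it is not RADseq data, so most of its loci fall below the coverage cutoff), the definition of a locus as chromosome, leftmost position, and strand, and the hypothetical helper call_consensus() with its thresholds.

```r
library(Rsamtools)

# Stand-in BAM shipped with Rsamtools; use "CGATA.bam" for the exercise.
bam_file <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Read only the fields needed, skipping unmapped reads.
param <- ScanBamParam(
  what = c("rname", "pos", "strand", "seq"),
  flag = scanBamFlag(isUnmappedQuery = FALSE)
)
aln <- scanBam(bam_file, param = param)[[1]]

reads <- data.frame(
  chr = as.character(aln$rname),
  pos = aln$pos,
  strand = as.character(aln$strand),
  seq = as.character(aln$seq),
  stringsAsFactors = FALSE
)

# Assumption: reads sharing chromosome, leftmost position and strand
# belong to the same RAD locus.
locus_id <- paste(reads$chr, reads$pos, reads$strand, sep = ":")

# Coverage = number of reads per locus.
locus_cov <- table(locus_id)
summary(as.integer(locus_cov))

# Hypothetical helper: call a genotype from the haplotype counts at a locus.
# Returns NA below min_cov, "hap1/hap2" if the second most common haplotype
# reaches min_minor_frac of the reads, otherwise the most common haplotype.
call_consensus <- function(seqs, min_cov = 4, min_minor_frac = 0.25) {
  counts <- sort(table(seqs), decreasing = TRUE)
  n <- as.integer(counts)
  haps <- names(counts)
  if (sum(n) < min_cov) return(NA_character_)
  if (length(n) > 1 && n[2] / sum(n) >= min_minor_frac) {
    return(paste(haps[1:2], collapse = "/"))
  }
  haps[1]
}

genotypes <- tapply(reads$seq, locus_id, call_consensus)
```

The thresholds trade off missing data against miscalls: raising min_cov leaves more loci uncalled, and lowering min_minor_frac makes sequencing errors more likely to be called as a second allele.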
______________________________________________________________________________________________

Exercise 3 – Exploring selective sweeps based on allele frequencies

In this exercise we will focus on the analysis of a SNP matrix derived from RADseq data, estimating allele frequencies across the genome and comparing them between populations.

PDF_exercise_3
PDF_solution_3
File: SNP.mat.txt
This is a subset of the Roesti et al. 2015 Nat. Commun. lake-stream stickleback SNP data set, reduced to ChrI, and with the SNPs quality-filtered. If you are interested in the origin of the data, please have a look.
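
To illustrate the kind of calculation involved, here is a small base R sketch that estimates allele frequencies per population from a genotype matrix and compares them along a chromosome. The layout is an assumption, not the actual format of SNP.mat.txt: rows are SNPs, columns are individuals, and genotypes are coded as the number of copies (0, 1, 2) of one allele. The positions and population labels are invented.

```r
# Toy genotype matrix (assumed coding: copies of the alternative allele).
geno <- matrix(
  c(0, 0, 1, 2, 2, 2,
    1, 1, 0, 1, 0, 1,
    2, 2, 2, 0, 0, NA,
    0, 1, 0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("snp", 1:4),
    c("lake1", "lake2", "lake3", "stream1", "stream2", "stream3")
  )
)
pos <- c(10500, 48200, 91300, 150700)  # invented positions (bp)
pop <- rep(c("lake", "stream"), each = 3)

# Allele frequency per SNP: mean allele count divided by 2 (diploid),
# ignoring missing genotypes.
alt_freq <- function(g) rowMeans(g, na.rm = TRUE) / 2
p_lake <- alt_freq(geno[, pop == "lake", drop = FALSE])
p_stream <- alt_freq(geno[, pop == "stream", drop = FALSE])

# Absolute allele frequency difference between the two populations.
afd <- abs(p_lake - p_stream)

# A simple FST estimate from expected heterozygosities:
# FST = (Ht - Hs) / Ht, undefined (NA) where the SNP is monomorphic overall.
p_bar <- (p_lake + p_stream) / 2
h_t <- 2 * p_bar * (1 - p_bar)
h_s <- (2 * p_lake * (1 - p_lake) + 2 * p_stream * (1 - p_stream)) / 2
fst <- ifelse(h_t > 0, (h_t - h_s) / h_t, NA)

# Differentiation along the chromosome; peaks are candidate sweep regions.
plot(pos, afd, type = "h", xlab = "Position on chromosome (bp)",
     ylab = "Allele frequency difference")
```

With real data, single-SNP values are noisy, so in practice one would usually smooth or average them in windows before interpreting peaks.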

______________________________________________________________________________________________

Exercise 4 – Exploring selective sweeps based on read coverage

In this final exercise we will check coverage depth of RADseq data along the genome and compare it between populations.

PDF_exercise_4
PDF_solution_4
File: coverage.chrI.txt
This RAD locus coverage file is derived from alignments (4 lake-stream stickleback individuals from Roesti et al. 2015 Nat. Commun., ChrI only, physical resolution reduced).
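
A comparison of this kind might look like the base R sketch below. It normalizes each individual's coverage by its own median (so that differences in sequencing depth between individuals cancel out), averages within populations, and flags loci where the lake/stream ratio is extreme. The table layout, the individual names, the numbers, and the twofold cutoff are all assumptions for illustration; they are not taken from coverage.chrI.txt.

```r
# Toy coverage table: one row per RAD locus, one column per individual.
depth <- data.frame(
  pos = seq(1e5, 1e6, by = 1e5),
  lake1 = c(20, 22, 18, 21, 19, 5, 4, 20, 22, 21),
  lake2 = c(40, 44, 38, 41, 39, 12, 9, 42, 40, 43),
  stream1 = c(30, 31, 29, 30, 32, 30, 29, 31, 30, 28),
  stream2 = c(10, 11, 9, 10, 10, 11, 10, 9, 10, 11)
)
ind <- setdiff(names(depth), "pos")

# Normalize each individual by its median coverage.
raw <- as.matrix(depth[ind])
norm <- sweep(raw, 2, apply(raw, 2, median), "/")

# Mean normalized coverage per population.
lake <- rowMeans(norm[, grepl("^lake", ind), drop = FALSE])
stream <- rowMeans(norm[, grepl("^stream", ind), drop = FALSE])

# log2 ratio: 0 means equal relative coverage in both populations.
log2_ratio <- log2(lake / stream)

# Assumed cutoff: flag loci with more than a twofold difference.
outlier <- abs(log2_ratio) > 1
depth$pos[outlier]

plot(depth$pos, log2_ratio, type = "b",
     xlab = "Position on chromosome (bp)",
     ylab = "log2(lake / stream) normalized coverage")
abline(h = c(-1, 1), lty = 2)
```

In the toy data the two loci at 600 kb and 700 kb have strongly reduced coverage in the lake individuals and are the only ones flagged.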

And now please fill in our daily questionnaire.
