Table of contents
- Background and Aim
- Getting Started
- Exercise 1 – Raw data manipulation
- Exercise 2 – Consensus genotyping
- Exercise 3 – Exploring selective sweeps based on allele frequencies
- Exercise 4 – Exploring selective sweeps based on read coverage
Background and Aim
Restriction site-associated DNA sequencing (RADseq) is widely regarded as one of the most important methods in ecological genomics because it enables low-cost, large-scale genotyping, including in non-model species. Although it has been implemented in many different ways, the basic idea is always the same: the genome is fragmented non-randomly by digestion with restriction enzymes, and the DNA flanking the restriction sites is sequenced on a high-throughput platform. Several software packages (e.g., Stacks) have been developed for analyzing RADseq data. Here you will learn how to manipulate and analyze RADseq data in the R environment, from filtering the raw sequencing reads to simple population genomic analyses, using standard R functions and functions from Bioconductor packages.
Getting Started
During this tutorial we will primarily use the R statistical programming language, applying knowledge from the R session on Wednesday.
The software you are going to use is R, either on its own or through RStudio.
We recommend using your cloud instance for this exercise (enter ‘http://yourDNS:8787’ in your favorite browser, such as Safari, Firefox, or Chrome, and log in), as all the packages and files you need for this activity are already installed on the Amazon server. On the cloud instance, you will find the data files in /home/ubuntu/wpsg_2016/activities/RADseq_R
We will use the R package ‘ShortRead’, released by the Bioconductor project. ‘ShortRead’ is used for FASTQ input and manipulation, e.g. filtering, trimming, and otherwise processing reads for a variety of applications. The original publication is: Morgan M, Anders S, Lawrence M, Aboyoun P, Pagès H and Gentleman R (2009). ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics, 25, pp. 2607-2608. Further information can be found in the ShortRead manual. We will also use the package Rsamtools to handle RADseq data aligned to a reference genome.
As a quick resource facilitating this tutorial based on the R language, an R cheat sheet is provided here. It represents a compilation of R commands – both basic and specific to Bioconductor packages – useful for data manipulation and analysis in R.
Exercise 1 – Raw data manipulation
In Exercise 1 we will learn how to manipulate FASTQ files: searching for specific sequence motifs, quality filtering, and trimming reads. Finally, we will write the filtered reads to a new FASTQ file.
The data are unpublished stickleback RAD sequences from the Misty system, Canada: a subset of 100,000 reads from a single Illumina lane of single-end 100 bp (SE100) sequencing.
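The steps named for Exercise 1 (motif search, quality filtering, trimming, writing a new FASTQ file) can be sketched in base R alone, which shows what each operation does to a read. This is only an illustration under stated assumptions: the three reads, the motif `TGCAGG`, the mean-quality threshold of 35, the 12-base trim length, and the Phred+33 quality encoding are all made up for the example and are not taken from the exercise data. In the tutorial itself, ShortRead functions such as `readFastq()` and `writeFastq()` handle the file input and output.

```r
# Toy FASTQ file: 4 lines per read (id, sequence, "+", qualities).
# All reads, the motif, and the thresholds below are made up for illustration.
high  <- strrep("I", 16)                          # "I" encodes Phred 40 (ASCII 73 - 33)
mixed <- paste0(strrep("I", 12), strrep("#", 4))  # "#" encodes Phred 2 (ASCII 35 - 33)
fastq_lines <- c(
  "@read1", "TGCAGGACGTACGTAC", "+", high,
  "@read2", "TGCAGGTTTTACGTNN", "+", mixed,
  "@read3", "ACGTACGTACGTACGT", "+", high
)
in_file <- tempfile(fileext = ".fastq")
writeLines(fastq_lines, in_file)

# Parse the file into one row per read.
x <- readLines(in_file)
first <- seq(1, length(x), by = 4)
reads <- data.frame(id = x[first], seq = x[first + 1], qual = x[first + 3],
                    stringsAsFactors = FALSE)

# 1. Motif search: does the read start with the (hypothetical) motif?
motif <- "TGCAGG"
has_motif <- startsWith(reads$seq, motif)

# 2. Quality filter: mean Phred score per read, assuming Phred+33 encoding.
mean_q <- vapply(reads$qual, function(q) mean(utf8ToInt(q) - 33L),
                 numeric(1), USE.NAMES = FALSE)
kept <- reads[has_motif & mean_q >= 35, ]

# 3. Trimming: keep only the first 12 bases and their quality characters.
kept$seq  <- substr(kept$seq, 1, 12)
kept$qual <- substr(kept$qual, 1, 12)

# 4. Write the filtered, trimmed reads to a new FASTQ file.
out_file <- tempfile(fileext = ".fastq")
writeLines(as.vector(rbind(kept$id, kept$seq, "+", kept$qual)), out_file)
print(kept)
```

With these toy reads, only `@read1` survives: `@read2` fails the quality filter because of its low-quality tail, and `@read3` lacks the motif.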
Exercise 2 – Consensus genotyping
In this exercise we will learn how to load and manipulate RADseq data aligned to a reference genome, inspect coverage and haplotype distributions, and call a consensus genotype for each locus.
The data are Illumina reads from Roesti et al. 2012 Mol. Ecol., limited to ChrIII.
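One simple way to think about consensus genotyping is as a rule applied to the haplotype counts at each locus. The sketch below is not the procedure used in the exercise. It is a minimal base R illustration in which the function name `call_genotype`, the coverage threshold `min_cov`, the minor-haplotype fraction `min_minor_frac`, and the haplotype strings are all hypothetical. In the exercise, the haplotypes would come from the aligned reads rather than being typed in by hand.

```r
# Call a diploid consensus genotype from the haplotypes observed at one locus.
# min_cov and min_minor_frac are hypothetical thresholds chosen for illustration.
call_genotype <- function(haplotypes, min_cov = 8, min_minor_frac = 0.25) {
  counts <- sort(table(haplotypes), decreasing = TRUE)
  total <- sum(counts)
  # Too few reads: do not call a genotype.
  if (total < min_cov) return(NA_character_)
  top <- names(counts)
  # Heterozygous if the second most common haplotype is frequent enough.
  if (length(counts) >= 2 && counts[[2]] / total >= min_minor_frac) {
    return(paste(sort(top[1:2]), collapse = "/"))
  }
  # Otherwise homozygous for the dominant haplotype; rare variants are
  # treated as sequencing errors.
  paste(top[1], top[1], sep = "/")
}

# Made-up haplotype observations at four loci.
loci <- list(
  locusA = rep("ACGT", 10),                     # one haplotype, good coverage
  locusB = c(rep("ACGT", 6), rep("ACTT", 5)),   # two balanced haplotypes
  locusC = c("ACGT", "ACGT", "ACTT"),           # coverage too low
  locusD = c(rep("ACGT", 19), "ACTT")           # one rare variant read
)
genotypes <- vapply(loci, call_genotype, character(1))
print(genotypes)
```

The design choice here is to be conservative: a locus with low coverage gets `NA` rather than a guess, and a single deviating read among many does not create a heterozygous call.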
Exercise 3 – Exploring selective sweeps based on allele frequencies
In this exercise we will focus on analyzing a SNP matrix derived from RADseq data: estimating allele frequencies across the genome and comparing them between populations.
This is a subset of the Roesti et al. 2015 Nat. Commun. lake-stream stickleback SNP data set, reduced to ChrI, and with the SNPs quality-filtered. If you are interested in the background of the data, please have a look at the publication.
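Estimating allele frequencies from a SNP matrix and comparing them between populations can be sketched as follows. The layout is an assumption made for the example: individuals in rows, SNPs in columns, and genotypes coded as the number of copies of one allele (0, 1, or 2). The genotype values, SNP names, and population labels are invented, and the format of the exercise data set may differ.

```r
# Made-up genotype matrix: rows are individuals, columns are SNPs,
# entries count copies of one allele (0, 1, or 2). This coding is an assumption.
geno <- matrix(
  c(0, 0, 2,
    0, 1, 2,
    0, 0, 2,
    0, 1, 0,
    1, 1, 0,
    0, 2, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("ind", 1:6), c("snp1", "snp2", "snp3"))
)
pop <- c("lake", "lake", "lake", "stream", "stream", "stream")

# Allele frequency per SNP: mean allele count divided by 2 (diploid),
# ignoring missing genotypes.
allele_freq <- function(g) colMeans(g, na.rm = TRUE) / 2

freq_lake   <- allele_freq(geno[pop == "lake", , drop = FALSE])
freq_stream <- allele_freq(geno[pop == "stream", , drop = FALSE])

# Absolute allele frequency difference between the two populations.
delta <- abs(freq_lake - freq_stream)
print(round(rbind(freq_lake, freq_stream, delta), 3))
```

In this toy matrix, `snp3` is fixed for different alleles in the two populations, so its frequency difference is 1. Plotting such a difference against SNP position is one way to look for regions that stand out along a chromosome.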
Exercise 4 – Exploring selective sweeps based on read coverage
In this final exercise we will check the coverage depth of RADseq data along the genome and compare it between populations.
This RAD locus coverage file is derived from alignments (4 lake-stream stickleback individuals from Roesti et al. 2015 Nat. Commun., ChrI only, physical resolution reduced).
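Comparing coverage between populations usually requires normalizing first, because individuals differ in total sequencing depth. The sketch below shows one possible approach in base R. Everything in it is an assumption made for illustration: the table layout (one row per RAD locus, one column per individual), the positions, the coverage values, the individual names, and the choice of median normalization followed by a log2 ratio.

```r
# Made-up coverage table: one row per RAD locus, one column per individual.
depth <- data.frame(
  pos     = c(1e5, 2e5, 3e5, 4e5, 5e5),
  lake1   = c(20, 22, 18, 20, 20),
  lake2   = c(40, 44, 36, 40, 40),
  stream1 = c(30, 30, 3, 30, 30),
  stream2 = c(10, 10, 1, 10, 10)
)

# Normalize each individual by its own median coverage so that individuals
# sequenced to different depths become comparable.
counts <- as.matrix(depth[, -1])
norm <- sweep(counts, 2, apply(counts, 2, median), "/")

# Mean normalized coverage per population at each locus.
lake   <- rowMeans(norm[, c("lake1", "lake2")])
stream <- rowMeans(norm[, c("stream1", "stream2")])

# log2 ratio: values near 0 mean similar relative coverage in both
# populations; strongly negative values mean lower coverage in stream.
log2_ratio <- log2(stream / lake)
print(data.frame(pos = depth$pos, log2_ratio = round(log2_ratio, 2)))
```

In this toy table, the locus at position 3e5 has much lower relative coverage in both stream individuals, so it gets the most negative log2 ratio.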