1 Our question: Diet plasticity

Comma butterflies (Polygonia c-album) in Scandinavia can eat plants from several different families (Fig. 1). This kind of diet plasticity (generalization) is rare among herbivorous insects, butterflies in particular. The plants differ in both secondary chemistry and nutrition, suggesting that Comma larvae may use different molecular and physiological mechanisms to safely and efficiently metabolize tissue from these plants. We sequenced mRNA from the midguts of 4th instar larvae feeding on one of three different host plant genera: Ribes (gooseberry), Salix (willow) and Urtica (nettle) (Fig. 2). Using these RNA-seq data, we will investigate differences in the larval gut transcriptome that contribute to the ability to eat such divergent host plants.

Fig. 1. Adult and larva of the Comma butterfly (© R. Tuinstra; © M. Flåten; https://commons.wikimedia.org)

Fig. 2. P. c-album host plants: Ribes uva-crispa, Salix caprea, and Urtica dioica. (https://botany.cz/)

2 Background

Today’s activity walks through a reference-based differential gene expression (DGE) analysis. This means we map our reads to an existing reference genome rather than creating a de novo transcriptome from our RNA-seq data. The three main steps of reference-based DGE analysis are 1) alignment, 2) quantification and 3) analysis (Fig. 3).

This tutorial details the steps involved in:

  • Trimming RNAseq reads
  • Aligning RNAseq reads to a reference assembly
  • Quantifying gene expression levels
  • Identifying differentially expressed (DE) genes
  • Visualizing expression differences
  • Evaluating functional enrichment of DE gene sets

The workshop has covered trimming and alignment a lot already, so we will start by exploring the genome and annotation used to align and count the data and then go straight into differential expression.

Once you finish the DGE and want an extra challenge, you can continue with Trimming (Part 6), Mapping (Part 8) and Counting (Part 9) to see how they work.

Fig. 3. The reference-based differential expression analysis pipeline. The main input data are your RNAseq reads, the reference genome, and a gene annotation for that reference genome. Potential tools are listed beside each step. This activity uses Trimmomatic, hisat2, htseq and DESeq2.

2.1 Reference assembly and annotation

Start by navigating to our working directory.

cd /home/genomics/workshop_materials/
cd differential_expression

Let’s set this path in our environment.

export RNA_HOME=$(pwd)

You should map your RNA reads to an unmasked or softmasked genome assembly. Check the quality of your reference assembly using completeness (e.g., BUSCO, etc.) and contiguity (e.g., N50, etc.) statistics. If you are using a published assembly, these statistics should be available in the associated publication. If not, you can easily run them yourself.

Assembly evaluation tools:

You can also use BUSCO to check the quality of your annotation. Many other tools exist to assess and manipulate annotations, such as AGAT and gffread. However, be careful, because these may have built-in methods that add or remove features so that files conform to their gtf/gff standards. Always check the number of proteins, exons and transcripts before and after manipulating a gtf or gff.

Several metrics may indicate issues with your annotation, including the number of single exon genes or the number of overlapping genes. In metazoans, neither of these should be very high (although more and more evidence is being found for overlapping genomic features, both coding and non-coding).

Annotation evaluation tools:

Our references can be found in the directory refs. Before we get going, set variables for our genome and annotation:

cd $RNA_HOME/refs
reference=Pca_genome.fa
gtf_file=Pca_annotation.gtf

Use bash commands like grep, wc, cut, cat, less, and more to explore the assembly, annotation, and associated files.

  1. How big is the genome?
# use wc -c to count the bytes in the fasta  
wc -c $reference 
# 325522048
  2. How many scaffolds are there?
# use grep -c to count the number of contig headers
grep -c '>' $reference 
# 14319 scaffolds
  3. What percent of single copy orthologs (SCOs) are duplicated?
# look at the BUSCO summary
cat Pca_genome_BUSCOsummary.txt 
  4. What proportion of the scaffolds do not have any annotated genes?
cut -f 1 $gtf_file | uniq | wc -l
# 5007 scaffolds have annotated genes, so (14319 - 5007)/14319 ≈ 0.65 do not
  5. Look at column 9 in the annotation file. What are gene and transcript attributes called?
head $gtf_file 
# transcript_id and gene_id
  6. How many genes and how many transcripts are in the gff?
cat $gtf_file | cut -f 3 | grep -c gene
cat $gtf_file | cut -f 3 | grep -c transcript

2.2 Our RNA-seq data

Our data can be found in the directory data_raw. Move into the data_raw directory and explore the reads.

  1. How many reads are there for each sample?
#e.g.
zcat Pca_Ribes_rep1_sub_1.fq.gz | grep -c @D00483 

#100000
#hint: these are subsampled files to help the mapping and counting go quickly 

3 Gene Expression Analysis

Move back to $RNA_HOME. Use ls to explore the directory $RNA_HOME/diff_expr. It contains a file of gene counts (and one of exon counts, ignore this). We will save the output of the analysis in this directory.

From here on, everything will be done in RStudio.

Start by checking (getwd()) and setting (setwd()) your working directory.

#check your wd
getwd()

# if needed, set your wd. the activity is designed to be run in your main directory (aka $RNA_HOME from above)
setwd("/home/genomics/workshop_materials/differential_expression/")

3.1 Install/load packages

  • DESeq2: differential gene expression analysis
  • topGO: gene set enrichment analysis
  • magrittr, dplyr, tidyr, readr, tibble: data wrangling
    • (or load the full tidyverse, if you prefer)
  • ggplot2, RColorBrewer, Rgraphviz: visualizations
# load libraries
myLib <- as.vector(c("DESeq2","topGO", "magrittr", "dplyr","tidyr", "readr", "tibble", "ggplot2", "RColorBrewer"))

lapply(myLib,library,character.only=TRUE)

3.2 The data

3.2.1 Load the metadata

Remember that we are comparing expression among larvae feeding on different host plants (in the column condition). We also evenly sampled larvae from three different families. First, load the metadata for the samples:

# readr::read_tsv() to load the metadata
Pca_metadata <- read_tsv("metadata/Pca_metadata.tsv", 
                          col_types = c("ccfff"))

# here we specify the column types using a condensed string, one letter for each column. 
# 'c' stands for character, 'f' for factor

Inspect the metadata using str() and head(). Notice that we have added more samples than we were working with in the trimming, mapping, and counting parts of the activity.

  1. How many samples are there now?
unique(Pca_metadata$sampleName) %>% length()

We have three individuals in each family-condition combination.

table(Pca_metadata[3:4])
##          family
## condition 1 2 3
##    Ribes  3 3 3
##    Salix  3 3 3
##    Urtica 3 3 3

This is a little slim for a differential expression analysis. Many DGE tools, including DESeq2, recommend 7 or more samples in each group. Therefore, we are going to stick with looking at differences between host plants (condition) in this activity.

3.2.2 Load the count data

You created a count matrix from the HTSeq output (Pca_count.matrix.final.txt), but you were only using a subset of the fastq data for trimming, mapping and counting. The full count matrix can be found in the diff_expr directory: diff_expr/Pca_gene_counts.tsv.

Load the count data into a matrix and double check that the column names of the count matrix match the sample names in our metadata.

# load gene count data, convert the column gene_id to row names, convert to a matrix
Pca_counts <- read_tsv("diff_expr/Pca_gene_counts.tsv") %>% 
  column_to_rownames("gene_id") %>% 
  as.matrix()

# compare matrix column names with metadata sample names
colnames(Pca_counts) %in% Pca_metadata$sampleName

All of the values should be TRUE. Note that %in% only checks that every column name appears somewhere in the metadata; the columns should also be in the same order as the metadata rows.
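If you want a stricter check that also confirms the order matches, a quick sketch:

# check that the count matrix columns are in the same order as the metadata rows
all(colnames(Pca_counts) == Pca_metadata$sampleName)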

3.2.3 Create the DESeq object

Now we create the data object necessary for DESeq2 to analyze the data. Some R packages (esp. Bioconductor packages) define and use custom classes within R for storing data (input data, intermediate data and also results). These custom data structures are similar to lists in that they can contain multiple different data types/structures within them. Unlike lists, they have pre-specified data slots to hold specific types or classes of data. The data stored in these pre-specified slots can be accessed using functions designed and implemented specifically for the package.

Let’s start by creating the DESeqDataSet object and then we can learn a bit more about what is stored inside it. To create the object we will need the count matrix (Pca_counts) and the metadata table (Pca_metadata) as input.

We will also need to specify a design formula. The design formula specifies how the column(s) in the metadata table should be used in the analysis. For our data, we only have one column we are interested in: condition. The design ~ condition tells DESeq2 that for each gene we want to evaluate gene expression change with respect to the three different levels of this variable.
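As an aside, if we did want to account for family as well, it could be added to the design as a covariate. A minimal sketch (not used in this activity):

# sketch only: control for family before testing condition
Pca_dds_fam <- DESeqDataSetFromMatrix(countData = Pca_counts, 
                                      colData = Pca_metadata, 
                                      design = ~ family + condition)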

Create the DESeqDataSet (aka dds) object

Pca_dds <- DESeqDataSetFromMatrix(countData = Pca_counts, 
                              colData = Pca_metadata, 
                              design = ~ condition)
# Take a look:
Pca_dds
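As mentioned above, the data stored in the object's slots are accessed with package-specific functions. A few examples (output not shown):

# accessor functions for the DESeqDataSet object
counts(Pca_dds) %>% head()  # the raw count matrix
colData(Pca_dds)            # the sample metadata
design(Pca_dds)             # the design formula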

If you check out some of the other DESeq2 import functions with help("DESeqDataSet-class"), you can see that DESeq2 has a function to import HTSeq count files directly (DESeqDataSetFromHTSeqCount) which can streamline your pipeline a bit.

3.3 Normalization

There can be systematic biases in RNA sequencing data, such as differences in total sequencing depth between samples. We need to adjust the count data for such systematic biases in a process called normalization. Unfortunately, the term normalization is often used for this aspect of data analysis, even though it can be misleading: it has nothing to do with the normal distribution; nor does it involve data transformation. Rather, we aim to identify the nature and magnitude of systematic biases, and take them into account in our analysis of the data.

3.3.1 Filtering

DGE analysis is designed to deal with count data that is highly variable between samples. However, it is sometimes a good idea to remove genes that are very rarely expressed before you start your analysis. This can:

  • improve model fit
  • reduce the number of multiple comparisons (see below).

It is not a required step and tools like DESeq2 can handle the data without filtering first, but it is something to consider. Here, we will set a filter based on a fixed threshold: keep only those genes with at least 5 reads in at least 3 samples.

Filter the dds object.

# determine whether genes meet our threshold
keep <- rowSums(counts(Pca_dds) >= 5 ) >= 3

# filter the dds object to keep genes that meet the threshold
Pca_dds_filt <- Pca_dds[keep,]
  1. How many genes did not meet the filtering criteria?
# check the number of genes that meet our threshold
table(keep)

3.3.2 Normalize the counts within the dds object

The main factors considered during normalization are sequencing depth and RNA composition. DESeq2 uses the Median of Ratios normalization (MRN) method. Please check the presentation slides on normalization for a detailed explanation of how MRN is calculated. Briefly,

  • A pseudo-reference sample is created for each gene by taking the rowwise geometric mean.
  • The ratio is calculated of each sample in relation to this pseudo-reference.
  • A normalization factor (aka size factor) is calculated from the median value of all ratios for each column.
  • Normalized count values are calculated using the normalization factor.
# Estimate the size factors
Pca_dds_filt <- estimateSizeFactors(Pca_dds_filt) 
sizeFactors(Pca_dds_filt) # Check the size factors 
  1. Based on the size factor calculated for Pca_Ribes_rep6, were read counts in this sample systematically lower or higher than expected?
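If you want to convince yourself of how the median-of-ratios steps listed above work, here is a rough hand-rolled sketch (for intuition only; trust DESeq2's own calculation):

# manual median-of-ratios sketch
cts <- counts(Pca_dds_filt)                       # raw (filtered) counts
log_geo_means <- rowMeans(log(cts))               # log of the rowwise geometric mean (the pseudo-reference)
manual_sf <- apply(cts, 2, function(s) {
  log_ratios <- log(s) - log_geo_means            # each sample relative to the pseudo-reference
  exp(median(log_ratios[is.finite(log_ratios)]))  # median ratio = size factor
})
round(manual_sf / sizeFactors(Pca_dds_filt), 2)   # should be ~1 for every sample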

3.4 Unsupervised clustering

Now that we have the normalized counts, we can perform quality control and exploratory analysis using unsupervised clustering. These analyses include Principal Component Analysis (PCA) and hierarchical clustering.

PCA is a technique used to emphasize variation and bring out strong patterns in the data. We expect to see biological replicates belonging to the same experimental group to cluster together.

Hierarchical clustering also groups samples by expression similarity. A hierarchical tree indicates which samples are more similar to each other based on the normalized gene expression values. A heatmap is used to show the correlation of gene expression for all pairwise combinations of samples in the data. Clustering of samples in the hierarchical clustering analysis should be similar to clustering in the PCA. When using these unsupervised clustering methods, transformation of the normalized counts improves the distances/clustering for visualization. DESeq2 recommends a variance-stabilizing transformation (vst()) of the normalized counts as it moderates the variance across the mean, improving the clustering.

3.4.1 Principal component analysis

Transform the data and create a PCA using the DESeq2 function plotPCA().

# Transform the data using the variance-stabilizing transformation
Pca_vst <- vst(Pca_dds_filt, blind=TRUE)

# PCA plot
plotPCA(Pca_vst, intgroup=c("condition"))

# plotPCA depends on ggplot2, so we can spice it up a bit:
plotPCA(Pca_vst, intgroup=c("condition")) +
  theme_bw() +
  scale_color_brewer(palette = "Dark2", name = "Host plant") +
  theme(legend.text = element_text(face = "italic"))

# save your plot
ggsave("diff_expr/Pca_vst_PC1_PC2.png", height = 6, width = 6)

Hint: If you don’t like using the DESeq2::plotPCA() function and want to use one that might be a little more customizable (e.g., stats::prcomp() or FactoMineR::PCA()), you can extract the normalized, transformed values using assay(Pca_vst).
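For example, a minimal sketch of a “manual” PCA with stats::prcomp() (note that plotPCA() uses only the 500 most variable genes by default, so the two plots may differ slightly):

# manual PCA on the transformed values
vst_mat <- assay(Pca_vst)                        # genes x samples
pca <- prcomp(t(vst_mat))                        # samples as rows
pct_var <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
           condition = colData(Pca_vst)$condition) %>% 
  ggplot(aes(PC1, PC2, color = condition)) +
  geom_point(size = 2) +
  labs(x = paste0("PC1 (", pct_var[1], "%)"),
       y = paste0("PC2 (", pct_var[2], "%)")) +
  theme_bw()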

3.4.2 Heatmap of similarity between samples

Here, we are going to visualize which samples cluster together when using all of the filtered, normalized and transformed expression data. Start by creating a distance matrix among samples.

# Create the distance matrix
Pca_vst_dists <- dist(t(assay(Pca_vst)))
Pca_vst_mat <- as.matrix(Pca_vst_dists)

Now, run the clustering analysis on the samples.

# run clustering analysis
Pca_vst_hc <- hclust(Pca_vst_dists)

# create a color palette
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) 

# Plot the heatmap

heatmap(Pca_vst_mat, 
        Rowv=as.dendrogram(Pca_vst_hc), 
        symm=TRUE,
        col = rev(hmcol),
        margin=c(13, 13), 
        cexRow = 0.75,
        cexCol = 0.75) 

# save the heatmap
pdf(file = "diff_expr/Pca_vst_sampleHeatmap.pdf", height = 6, width = 6)
heatmap(Pca_vst_mat, 
        Rowv=as.dendrogram(Pca_vst_hc), 
        symm=TRUE,
        col = rev(hmcol),
        margin=c(13, 13), 
        cexRow = 0.75,
        cexCol = 0.75) 
dev.off()
  1. Do the samples cluster as you would expect from the experimental design?

3.5 Differential gene expression

Now that we have checked our samples for outliers and looked for preliminary evidence of clustering by condition, we will conduct the differential expression analysis. To run the actual analysis, we use the function DESeq(). It performs the normalization of the counts using the size factors and fits a generalized linear model for each gene.

# Run the analysis on the filtered read counts
Pca_dds_filt <- DESeq(Pca_dds_filt)

By re-assigning the results of the function back to the same variable name (Pca_dds_filt), we can fill in the slots of our DESeqDataSet object with the new values calculated by DESeq2.

3.5.1 Identifying DE genes

With DESeq2, the Wald test is commonly used for hypothesis testing when comparing two groups. To indicate to DESeq2 which two groups we want to compare, we provide a contrast: the specific pair of levels of the factor we want to test.

# check what contrasts are even possible
resultsNames(Pca_dds_filt)

Ribes is currently set as the first level in our variable condition, so DESeq2 automatically compares the remaining two levels back to Ribes.

Extract the results for Urtica vs. Ribes.

# specify the contrast by naming the factor you want to test 
# and the levels of the factor you want to compare.
contrast_U_R <- c("condition", "Urtica", "Ribes")

# extract the results for your specified contrast
Pca_res_table_U_R <- results(Pca_dds_filt, contrast=contrast_U_R)

# this can also be coded directly by using the names of the contrasts 
Pca_res_table_U_R <- results(Pca_dds_filt, name = "condition_Urtica_vs_Ribes")

A Wald test statistic is computed per gene, along with the probability that a test statistic at least as extreme as the observed value would occur at random. We reject the null hypothesis (i.e., that there is no difference in expression between groups) and conclude that a gene is differentially expressed when the calculated p-value is below our significance threshold (usually 0.05).

# sort results based on the padj
Pca_res_table_U_R <- Pca_res_table_U_R[order(Pca_res_table_U_R$padj),]

# You will now have the smallest padj values at the top of your table
head(Pca_res_table_U_R)

# save your results (rownames_to_column() keeps the gene IDs as a column) 
as.data.frame(Pca_res_table_U_R) %>% 
  rownames_to_column("gene_id") %>% 
  write_tsv("diff_expr/Pca_res_table_U_R.tsv")

The column headers are:

  • baseMean: mean of normalized counts for all samples
  • log2FoldChange: log2 fold change or LFC
  • lfcSE: standard error of the log2 fold change
  • stat: the Wald test statistic
  • pvalue: Wald test p-value
  • padj: Benjamini-Hochberg (BH) adjusted p-value

Note that we have p-values (pvalue) and adjusted p-values (padj) in the output. When we use the raw p-values and a significance threshold of 0.05, we are likely making a wrong conclusion (for example, believing a gene is significantly differentially expressed) for about 1 out of every 20 genes tested. Now, what if we are testing 100 genes? We are probably making ~5 wrong conclusions. 10,000 genes? ~500 wrong conclusions. Adjusting the p-values to account for testing multiple genes helps us to have more confidence in our conclusions.
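To see what the adjustment is doing, here is a quick sketch recomputing BH-adjusted p-values with p.adjust(). They will broadly agree with DESeq2's padj column, though not exactly, because DESeq2 also applies independent filtering and sets padj to NA for some low-information genes.

# BH adjustment by hand
manual_padj <- p.adjust(Pca_res_table_U_R$pvalue, method = "BH")
summary(manual_padj)
summary(Pca_res_table_U_R$padj)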

We can use additional thresholds, like log2 fold change (LFC), to identify DE genes. A common threshold is an absolute LFC of 1, which means expression of that gene is twice as high in one condition as in the other.
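As a quick sanity check of what an LFC means on the normalized count scale:

2^1    # LFC of  1: twice as much expression
2^-1   # LFC of -1: half as much expression
2^3.5  # LFC of  3.5: roughly 11 times as much expression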

3.5.2 Visualizing DE genes

3.5.2.1 Individual genes

To make sure we understand the results, it is sometimes good to start with a single gene. Look at the gene that is most differentially expressed (i.e., lowest padj) in the sorted results table.

head(Pca_res_table_U_R, 1)

Is it up- or down-regulated in larvae feeding on Ribes compared to Urtica? It can be hard to tell from the results alone. We can use DESeq2::plotCounts() to visualize differences in expression for larvae on different host plants.

plotCounts(Pca_dds_filt, gene="Polcal_g14648", intgroup="condition")

From this plot, we can see that Polcal_g14648, which had an LFC of -7.88699 in the results table, is expressed at lower levels on Urtica than on Ribes. This tells us that genes with negative LFC values and significant adjusted p-values (based on our chosen cutoff) are likely to be downregulated on Urtica and upregulated on Ribes.

Thus, we can think of the contrast c("condition", "Urtica", "Ribes") as asking the question “How does Urtica (the first named level) differ relative to Ribes (the second named level)?”, which should help when trying to specify your own contrasts.

3.5.2.2 MA plot

The MA plot shows the mean of the normalized counts versus the LFC for all genes. Genes that are significantly differentially expressed are colored so they are easy to identify. The default significance threshold (alpha) in plotMA() is 0.1, but today let’s use padj < 0.01 to be a little more conservative. Remember, negative LFC values mean the genes are less expressed in Urtica!

To make an MA plot directly from the DESeq object, we use the function plotMA(). Use ??DESeq2::plotMA to figure out how to create this plot.

# MA plot needs the dds object
plotMA(object = Pca_dds_filt, 
       ylim=c(-2,2), 
       alpha = 0.01)

Note that the triangles show genes that fall outside of our y axis limits. Try changing the ylim values to see how the number of triangles changes.

Once you’re happy with it, save it using the pdf() device you learned in the R workshop.

pdf(file = "diff_expr/Pca_MAplot_U_R.pdf", height = 6, width = 6)
plotMA(object = Pca_dds_filt, 
       ylim=c(-5,5), # I chose a y lim of 5
       alpha = 0.01)
dev.off()

Extra challenge: build the plot in ggplot instead.

# ggplot option
ggplot(as.data.frame(Pca_res_table_U_R), aes(x = log(baseMean), y = log2FoldChange)) +
  geom_point(aes(color = padj < 0.01), size = 0.5) +
  scale_color_manual(values = c("grey", "blue"))+
  theme_bw()
# ggsave option
ggsave("diff_expr/Pca_MAplot_U_R.pdf", height = 6, width = 6)

This plot allows us to evaluate the magnitude of fold changes and how they are distributed relative to mean expression. Generally, we would expect to see significant genes across the full range of expression levels. Do you observe this in your plot? In this example data set, you have lots of genes with very small count numbers (between 0 and 10). Do you see large log2 fold changes (>1) for genes with low counts? Is there a way to shrink the log2 fold change for genes with low counts? Have a look at the DESeq2 documentation and read about log2 fold change shrinkage.
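A minimal sketch of that shrinkage for our contrast, using DESeq2::lfcShrink() (the "apeglm" estimator is the one recommended in the DESeq2 vignette and may need to be installed separately):

# shrink the LFC estimates for the Urtica vs Ribes coefficient
Pca_res_shrunk_U_R <- lfcShrink(Pca_dds_filt,
                                coef = "condition_Urtica_vs_Ribes",
                                type = "apeglm")

# compare with the unshrunken MA plot above
plotMA(Pca_res_shrunk_U_R, ylim = c(-5, 5), alpha = 0.01)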

3.5.2.3 Volcano plot

The volcano plot builds on the MA plot by visualizing the relationship between LFC (on the x axis) and padj (-log10-transformed, i.e. -log10(padj), on the y axis). Using ggplot() and geom_point(), create a volcano plot. Quick reminder: the results table is a DESeqResults object, so you may have to convert it to a data frame (as.data.frame()) before using it in ggplot.

ggplot(as.data.frame(Pca_res_table_U_R), aes(x = log2FoldChange, y = -log10(padj))) +
 geom_point()

We can make this plot much more informative to readers.

  • Use geom_hline() to add a dashed horizontal line at our significance threshold (p = 0.01, but don’t forget the adjusted p-values are -log10-transformed!)
    • hint: google “linetype R”
  • Use geom_vline() to add dashed vertical lines at our LFC threshold (-1 and 1)
  • Color the points by whether they are significantly DE or not
    • you can do this by creating a new variable in the results table or by changing how you plot the points within ggplot
# option 1
Pca_res_df_U_R <- Pca_res_table_U_R %>% 
  as.data.frame() %>% 
  mutate(sigDE = if_else(padj < 0.01 & abs(log2FoldChange) > 1 , "sig", "nsig")) 

ggplot(as.data.frame(Pca_res_df_U_R), aes(x = log2FoldChange, y = -log10(padj))) +
 geom_point(aes(color = sigDE)) +
  scale_color_manual(values = c("grey", "blue"))+
  geom_hline(yintercept = -log10(0.01), linetype = "dashed")+
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  theme_bw()

## option 2
ggplot() +
  geom_point(
    data = as.data.frame(Pca_res_table_U_R) %>% 
      filter(padj > 0.01 | abs(log2FoldChange) < 1), 
    aes(x = log2FoldChange, y = -log10(padj)),
    color = "grey") +
  geom_point(
    data = as.data.frame(Pca_res_table_U_R) %>%
      filter(padj < 0.01 & abs(log2FoldChange) > 1), 
    aes(x = log2FoldChange, y = -log10(padj)), 
    color = "blue") +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  theme_bw()

# save the volcano plot
ggsave("diff_expr/Pca_volcanoPlot_U_R.pdf", height = 6, width = 6)

Volcano plots also give us an opportunity to see whether our results look “normal”. If we see big lines or streaks of genes all in a row, this might mean we have a lot of duplicated loci in our assembly or that our model doesn’t fit our data very well.

Note: When within-group variation is high, LFC values can be a little misleading. Just look at all the nonsignificant genes with very high or low LFC in our MA and volcano plots. Check out the DESeq2::lfcShrink() function for ways to remove the background noise introduced by these highly variable genes.

3.5.2.4 Heat map of differentially expressed genes.

Another useful visualization tool is a heatmap of gene expression. We used heatmaps above to show correlations between samples. We can also use heatmaps to show expression of significantly DE genes using heatmap(). Heatmaps contain a lot of information, so it is usually a good idea to use a subset of genes that interest you.

Make a heatmap of the 20 most differentially expressed (lowest p-value) genes.

# Identify the top 20 differentially expressed genes from the sorted results table. 
DE_genes <- rownames(head(Pca_res_table_U_R,20))

# Extract the expression levels of the most differentially expressed genes from the vst-transformed object
Pca_dds_DE_U_R <- assay(Pca_vst)[rownames(Pca_vst) %in% DE_genes,]

# create the heatmap using baseR heatmap
heatmap(Pca_dds_DE_U_R,
        margin = c(10,6))

# save plot
pdf(file = "diff_expr/Pca_geneHeatmap_U_R.pdf", height = 6, width = 6)
heatmap(Pca_dds_DE_U_R,
        margin = c(10,6))
dev.off()

4 Gene set enrichment analysis

Finally, we want to know the functions of our DE genes. We can do this using gene set enrichment analysis (GSEA). We have already generated a functional annotation for our annotated genes using EggNOG-mapper (http://eggnog-mapper.embl.de/), which assigned gene ontology (GO) terms to our genes. GO terms provide a unified language for biological processes, molecular functions and cellular components (The Gene Ontology Consortium et al. (2000)). Gene ontology is a widely used tool in genomics and transcriptomics but should be interpreted with caution (more on this later).

4.1 The functional annotation

Define the path to the functional annotation. We do this instead of loading the dataframe directly so we can use the path in a topGO function later on.

GO_path <- "refs/Pca_GO_annotation.tsv"

Take a look at the format of the functional annotation.

#read the go annotation tsv
GO_annotation <- read_tsv(GO_path, col_names = c("gene_id", "GOterms")) 

# take a look at the first few rows
head(GO_annotation)

For nonmodel organisms, functional annotation of predicted genes may be incomplete. We can still run GSEA, but should be aware that only a fraction of our genes (including DE genes) may be represented in our analysis.

dim(GO_annotation)

grepl(GO_annotation$GOterms, pattern ="GO") %>% sum()
  1. For this P. c-album annotation, what proportion of the genes have been annotated with GO terms?
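One way to turn those two numbers into an answer, assuming the annotation file lists every gene and leaves the GOterms field empty (or NA) for genes without GO terms:

# proportion of genes with at least one GO term
sum(grepl("GO", GO_annotation$GOterms)) / nrow(GO_annotation)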

4.2 GSEA with topGO

The package we will use to run the GSEA is topGO, which tests GO term enrichment while accounting for GO topology (or, how the GO terms are related to each other).

GSEA tests whether certain functions are enriched in a set of genes by evaluating whether the number of genes associated with a given GO term is higher than expected by chance, given how often the GO term appears in the whole set (the gene “universe”).
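To make that concrete, this is the kind of 2x2 table a Fisher's exact test evaluates for a single GO term (the numbers here are made up purely for illustration):

# rows: genes with / without the GO term; columns: in / not in the candidate set
go_table <- matrix(c(8, 92, 42, 4958), nrow = 2, byrow = TRUE)
fisher.test(go_table, alternative = "greater")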

Create the “gene universe”.

# here we will only analyze GO terms with at least 5 annotated genes,
# as this yields more stable results.
node_size= 5

# We will focus on Biological process terms (BP)
GO_category="BP"

# use topGO to read the functional annotation
geneID2GO <- readMappings(GO_path)

# define the gene universe
geneUniverse <- names(geneID2GO)

We will perform two GSEAs, one each for genes that are upregulated or downregulated on Urtica relative to Ribes using a significance cutoff of 0.01 and an LFC threshold of 1 or -1.

Create lists of genes that are upregulated (up_U_R) and downregulated (down_U_R).

up_U_R <- rownames(Pca_res_table_U_R %>% 
                           data.frame() %>%
                           filter(padj < 0.01 & log2FoldChange > 1))

down_U_R <- rownames(Pca_res_table_U_R %>%
                             data.frame() %>% 
                             filter(padj < 0.01 & log2FoldChange < -1))

# save your gene lists!
  1. How many genes are in our candidate lists?

Generally, GSEA is best applied to an intermediately sized gene set. If the set is too small, it is unlikely that any GO term will occur often enough to be enriched; in that case you can simply look at the genes themselves and draw conclusions about their functions. If the set is too large relative to the annotated gene universe, terms will occur in the list just about as often as expected… because you’ve sampled nearly all the genes!

Run the GSEA for the downregulated gene set.

genesOfInterest.bv <- down_U_R

geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
names(geneList.bv) <- geneUniverse

# Create a new topGO GSEA object
myGOdata.bv <- new("topGOdata",
                   description="Candidate genes", 
                   ontology=GO_category, 
                   allGenes=geneList.bv, 
                   annot = annFUN.gene2GO, 
                   gene2GO = geneID2GO, 
                   nodeSize = node_size)                                       

Now, we have to run the statistical analysis for each term. Here, we will run a classic Fisher’s exact test and a Fisher’s exact test corrected using topGO’s parent-child algorithm. When assessing a GO term, this algorithm evaluates its enrichment relative to its parent terms, which reduces false positives caused by the inheritance problem.

Run the statistical tests in topGO.

# classic Fisher
resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")

# Fisher with parent-child algorithm
resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")

# Create a GSEA results table
# GenTable() is  annoying and requires a number of terms we want reported (default is 10)
## so we leverage our topGOresult object
top_nodes <-  sum(resultParentchild@score < 0.05)

allRes_BP_down_U_R <- GenTable(myGOdata.bv,
                   classicFisher = resultClassic,
                   parentchildFisher = resultParentchild,
                   orderBy = "parentchildFisher",
                  topNodes = top_nodes)
  1. What are the most enriched biological processes in genes that are downregulated in Urtica relative to Ribes?

  2. P-values for each GO term are listed under the relevant test (classicFisher, parentchildFisher). Are the classic results different from the parent-child results?

head(allRes_BP_down_U_R, 20)

4.2.1 Visualizing GSEA

One common way of showing enriched terms is using a dot plot.

# subset the results table 
allRes_BP_down_U_R_top15 <- allRes_BP_down_U_R %>% 
  mutate(GO.term = paste(GO.ID, Term, sep = ": "),
         parentchildFisher = as.numeric(parentchildFisher)) %>% 
  arrange(parentchildFisher) %>% 
  head(15)

# plot the subsetted results
# points are scaled by the number of significant genes associated with each term, which gives us an idea of how confident we can be in the enrichment score. 

ggplot(allRes_BP_down_U_R_top15, aes(x = GO.term, y = -log10(parentchildFisher) ))+
  geom_point(aes(size = Significant)) + 
  coord_flip() +
  theme_bw()

# save your GO term plot

Finally, we can visualize the parent-child relationships of the most significantly enriched terms.

# Rgraphviz needs to be installed and loaded to build this plot
library(Rgraphviz)
showSigOfNodes(myGOdata.bv, score(resultParentchild), firstSigNodes = 15, useInfo = 'all')

Rectangles indicate the most significant terms (here, the 15 requested with firstSigNodes). Rectangle color represents relative significance (yellow = least, red = most significant). Each node lists the GO identifier, a trimmed GO term name, the raw p-value, and the number of significant genes out of the total number of genes annotated to that term.

4.2.2 Functional enrichment of upregulated genes

Modifying the code above, perform a GSEA on the genes that are upregulated in Urtica relative to Ribes.

genesOfInterest.bv <- up_U_R

geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
names(geneList.bv) <- geneUniverse

# Create a new topGO GSEA object
myGOdata.bv <- new("topGOdata",
                   description="Candidate genes", 
                   ontology=GO_category, 
                   allGenes=geneList.bv, 
                   annot = annFUN.gene2GO, 
                   gene2GO = geneID2GO, 
                   nodeSize = node_size)                                       

# classic Fisher
resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")

# Fisher with parent-child algorithm
resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")

# Create a GSEA results table
# GenTable() is  annoying and requires a number of terms we want reported (default is 10)
## so we leverage our topGOresult object
top_nodes <-  sum(resultParentchild@score < 0.05)

allRes_BP_up_U_R <- GenTable(myGOdata.bv,
                   classicFisher = resultClassic,
                   parentchildFisher = resultParentchild,
                   orderBy = "parentchildFisher",
                  topNodes = top_nodes)

4.2.3 Molecular functions

Fewer genes tend to be annotated with molecular functions than biological processes, so we have less power to detect significant enrichment when using these terms. Nevertheless, they can be interesting when trying to understand the roles of up- and downregulated genes.

Repeat the GSEA for the up and down-regulated genes, but this time test the molecular functions (GO_category="MF"). Can you run it as a for-loop?

# Now the molecular functions
GO_category="MF"

DEgenes_list <- list(down_U_R, up_U_R)
names(DEgenes_list) <- c("down_U_R", "up_U_R")

for (i in 1:2){
  # define candidate list
  genesOfInterest.bv <- DEgenes_list[[i]]
  
  # placing this within brackets causes all the lines to run in a single block
  # it can help you keep track of code that you usually run together
  
  geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
  names(geneList.bv) <- geneUniverse
  
  # Create a new topGO GSEA object
  myGOdata.bv <- new("topGOdata",
                     description="Candidate genes", 
                     ontology=GO_category, 
                     allGenes=geneList.bv, 
                     annot = annFUN.gene2GO, 
                     gene2GO = geneID2GO, 
                     nodeSize = node_size)                                       
  
  # classic Fisher
  resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")
  
  # Fisher with parent-child algorithm
  resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")
 top_nodes <-  sum(resultParentchild@score < 0.05)
  # Create a GSEA results table
  temp <- GenTable(myGOdata.bv,
                   classicFisher = resultClassic,
                   parentchildFisher = resultParentchild,
                   orderBy = "parentchildFisher",
                   topNodes = top_nodes)
  assign(paste0("allRes_MF_", names(DEgenes_list)[[i]]), temp)
}

head(allRes_MF_down_U_R)
head(allRes_MF_up_U_R)

5 Running a second contrast

5.1 Urtica vs. Salix

We know which genes are differentially expressed between Urtica and Ribes, but we still don’t know how the third host plant, Salix, fits into the picture.

Repeat the differential expression analysis. This time, compare expression between larvae reared on Urtica and Salix. Remember from resultsNames(Pca_dds_filt) that this was not one of the comparisons tested in our current dds object. To deal with this, we need to relevel the host plants within the condition variable and refit the DESeq model.

For a hint on releveling condition:

# relevel condition
Pca_dds_filt$condition <- relevel(Pca_dds_filt$condition, "Salix") 

# refit the model
Pca_dds_filt <- DESeq(Pca_dds_filt)

# Check your work!
# Is the right contrast there? 
resultsNames(Pca_dds_filt)

Try to do it on your own first, and compare to my code in the drop down if you need some help along the way:

Extract the results for Urtica vs. Salix.

# specify the contrast by naming the factor you want to test 
# and the levels of the factor you want to compare.
contrast_U_S <- c("condition", "Urtica", "Salix")

# extract the results for your specified contrast
Pca_res_table_U_S <- results(Pca_dds_filt, contrast=contrast_U_S)

# sort results based on the padj
Pca_res_table_U_S <- Pca_res_table_U_S[order(Pca_res_table_U_S$padj),]

# save your results (rownames_to_column() keeps the gene IDs as a column) 
as.data.frame(Pca_res_table_U_S) %>% 
  rownames_to_column("gene_id") %>% 
  write_tsv("diff_expr/Pca_res_table_U_S.tsv")

Visualize differential expression. Can you try adding labels to the 20 most significantly DE genes in your volcano plot?

# MA plot needs the dds object
pdf(file = "diff_expr/Pca_MAplot_U_S.pdf", height = 6, width = 6)
plotMA(object = Pca_dds_filt, 
       ylim=c(-2,2), 
       alpha = 0.01)
dev.off()

# code to add labels shown in comments:

# install.packages("ggrepel")
# library(ggrepel)
# top20_labs <- head(Pca_res_table_U_S, 20) %>%
#   as.data.frame() %>% 
#   rownames_to_column("gene_id")

ggplot() +
  # geom_text_repel(data = top20_labs, 
  #                 aes(x = log2FoldChange, y = -log10(padj), label = gene_id), size = 2)+
  geom_point(
    data = as.data.frame(Pca_res_table_U_S) %>% filter(padj > 0.01 | abs(log2FoldChange) < 1), 
    aes(x = log2FoldChange, y = -log10(padj)),
    color = "grey") +
  geom_point(
    data = as.data.frame(Pca_res_table_U_S) %>% filter(padj < 0.01 & abs(log2FoldChange) > 1), 
    aes(x = log2FoldChange, y = -log10(padj))) +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  theme_bw()

# save the volcano plot
ggsave("diff_expr/Pca_volcanoPlot_U_S.pdf", height = 6, width = 6)

Make a heatmap of the 20 most differentially expressed (lowest p-value) genes.

# Identify the top 20 differentially expressed genes from the sorted results table. 
DE_genes <- rownames(head(Pca_res_table_U_S,20))

# Extract the expression levels of the top DE genes from the vst-transformed object
Pca_dds_DE_U_S <- assay(Pca_vst)[rownames(Pca_vst) %in% DE_genes,]

heatmap(Pca_dds_DE_U_S,
        margin = c(10,6))

# save plot
pdf(file = "diff_expr/Pca_geneHeatmap_U_S.pdf", height = 6, width = 6)
heatmap(Pca_dds_DE_U_S,
        margin = c(10,6))
dev.off()

Run a functional enrichment analysis (GSEA) on the up- and downregulated genes. You do not need to load the annotation again; go straight to the candidate gene list steps (but don’t forget to indicate the correct GO category).

# Starting with BP

up_U_S <- rownames(Pca_res_table_U_S %>% 
                           data.frame() %>%
                           filter(padj < 0.01 & log2FoldChange > 1))

down_U_S <- rownames(Pca_res_table_U_S %>%
                             data.frame() %>% 
                             filter(padj < 0.01 & log2FoldChange < -1))

DEgenes_list <- list(down_U_S, up_U_S)
names(DEgenes_list) <- c("down_U_S", "up_U_S")

for (k in c("BP", "MF")) {
  
  GO_category=k
  
  for (i in 1:2){
    # define candidate list
    genesOfInterest.bv <- DEgenes_list[[i]]
    
    # placing this within brackets causes all the lines to run in a single block
    # it can help you keep track of code that you usually run together
    
    geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
    names(geneList.bv) <- geneUniverse
    
    # Create a new topGO GSEA object
    myGOdata.bv <- new("topGOdata",
                       description="Candidate genes", 
                       ontology=GO_category, 
                       allGenes=geneList.bv, 
                       annot = annFUN.gene2GO, 
                       gene2GO = geneID2GO, 
                       nodeSize = node_size)                                       
    
    # classic Fisher
    resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")
    
    # Fisher with parent-child algorithm
    resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")
    top_nodes <-  sum(resultParentchild@score < 0.05)
    # Create a GSEA results table
    temp <- GenTable(myGOdata.bv,
                     classicFisher = resultClassic,
                     parentchildFisher = resultParentchild,
                     orderBy = "parentchildFisher",
                     topNodes = top_nodes)
    assign(paste0("allRes_", GO_category, "_", names(DEgenes_list)[[i]]), temp)
  }
}

head(allRes_BP_down_U_S)
head(allRes_BP_up_U_S)
head(allRes_MF_down_U_S)
head(allRes_MF_up_U_S)

5.2 Compare DE genes

Are the same genes differentially expressed between Urtica and each of the other two host plants? Compare the results from the Urtica v. Ribes and Urtica v. Salix analyses visually. You can use Venn diagrams (e.g., gplots), Euler diagrams (eulerr), UpSet plots (e.g., UpSetR, ComplexHeatmap), alluvial plots (ggalluvial), or scatterplots (e.g., ggplot2).

What about the functional enrichment? Can you use the same tools to visualize functional overlap (i.e., GO terms) of the DE genes in the two contrasts?

You may need to install some packages. I didn’t write any hints for this part! Use Google, package manuals, and R help to produce the code.

THIS IS THE END OF THE DGE ANALYSIS. If you have time and want to give it a try, return to the command line to perform the trimming, mapping and counting steps that produce a count matrix.

6 Trimming

Many older pipelines begin by trimming reads to remove adapters and poorly sequenced bases, but this is not required. The commonly used aligners (e.g., STAR, hisat2) “soft clip” adapter-containing read ends, and this should result in reliable read counts, at least at the whole-gene or whole-transcript level. Some studies even suggest trimming can be problematic for subsequent read quantification (Williams et al. (2016), Liao and Shi (2020)).

6.1 When should you trim?

If you are planning to use the reads for variant calling, genome annotation or transcriptome assembly, you can trim adapters with tools like BBduk, Trimmomatic, flexbar, or TrimGalore (a wrapper for Cutadapt); for most of these you need to know your adapter sequences. In these cases, you should also quality trim your reads. The UC Davis genome center recommends that you “combine gentle quality trimming with a threshold quality score of Q15 with a read length filter retaining only reads longer than 35 bp in length” using the tools BBduk or Trimmomatic.

You might also consider trimming your reads if you are getting a low mapping rate.

Trimming tools:

6.2 Pre-trimming quality check

Use FastQC to check the quality of your reads before trimming (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

If installed, you can also run MultiQC (https://multiqc.info/) to generate a summary report for all of your reads. This is for convenience, and is not necessary for evaluating your reads.

cd $RNA_HOME/data_raw
fastqc *.fq.gz
multiqc .

6.3 Trimming with Trimmomatic

Trimmomatic can be used to remove adapters and to lightly trim and discard low quality fastq data. It supports multi-threading and the use of compressed input files (gzip or bzip2). For more details, consult the manual: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

Trimmomatic is run using a JAR file (Java ARchive) containing java code. JAR files are executed using the command java -jar, and we have to indicate where the software is installed.

cd $RNA_HOME

# Set threads depending on available cores
THREADS=2

# Set the path to the trimmomatic installation (necessary even when installed with Conda)
TRIMMOMATIC_ROOT=/home/genomics/software/Trimmomatic-0.39

Make the out directory (data_trimmed) for your trimmed reads.

mkdir data_trimmed

Run Trimmomatic for the first sample: Pca_Ribes_rep1_sub

# sample name
file=Pca_Ribes_rep1_sub

#in files
in1=$RNA_HOME/data_raw/${file}_1.fq.gz
in2=$RNA_HOME/data_raw/${file}_2.fq.gz

#output, paired
paired1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz
paired2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz

#output, unpaired
unpaired1=$RNA_HOME/data_trimmed/${file}_1.unpaired.trimmed.fq.gz
unpaired2=$RNA_HOME/data_trimmed/${file}_2.unpaired.trimmed.fq.gz

# run Trimmomatic
java -jar $TRIMMOMATIC_ROOT/trimmomatic-0.39.jar PE -threads $THREADS -phred33 $in1 $in2 \
  $paired1 $unpaired1 $paired2 $unpaired2 \
  ILLUMINACLIP:$TRIMMOMATIC_ROOT/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:15 MINLEN:36 2> data_trimmed/${file}_trimmed.log

Explanation of options:

  • PE stands for paired end reads
  • $in1 and $in2 are the input files
  • The *.paired.trimmed.fq.gz and *.unpaired.trimmed.fq.gz files are the output files of the paired and unpaired trimmed reads.
  • ILLUMINACLIP specifies the adapter file and the clipping settings
  • LEADING is the quality threshold for low quality or N bases at the leading end of the read
  • TRAILING is the quality threshold for low quality or N bases at the trailing end of the read
  • MINLEN is the minimum length threshold. After trimming, shorter sequences will be discarded.
  • SLIDINGWINDOW here means that a window of 4 bases is used and, if the average quality of this window is below 15, the read is clipped from where the window starts.

You can see that trimmomatic produces different files for paired and unpaired reads. For DGE analysis of paired-end samples, it is recommended to map only the paired reads.

Now run Trimmomatic on the remaining read pairs. Using the outline below, write a loop to trim all of the reads.

for reads in data_raw/*_1.fq.gz; 
  do
    file=$(basename $reads _1.fq.gz);
.
.
.
    echo ${file} trimmed; 
  done
for reads in data_raw/*_1.fq.gz; 
  do
    file=$(basename $reads _1.fq.gz);
    
    in1=$RNA_HOME/data_raw/${file}_1.fq.gz;
    in2=$RNA_HOME/data_raw/${file}_2.fq.gz;

    #output, paired
    paired1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz;
    paired2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz;

    #output, unpaired
    unpaired1=$RNA_HOME/data_trimmed/${file}_1.unpaired.trimmed.fq.gz;
    unpaired2=$RNA_HOME/data_trimmed/${file}_2.unpaired.trimmed.fq.gz;

    java -jar $TRIMMOMATIC_ROOT/trimmomatic-0.39.jar PE -threads $THREADS -phred33 $in1 $in2 \
      $paired1 $unpaired1 $paired2 $unpaired2 \
      ILLUMINACLIP:$TRIMMOMATIC_ROOT/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
      SLIDINGWINDOW:4:15 MINLEN:36 2> data_trimmed/${file}_trimmed.log;
    echo ${file} trimmed; 
done

To check that you didn’t make a mistake, look at the file sizes (which should all differ) with ls.

ls -l data_trimmed/*.paired.trimmed.fq.gz

6.4 Post-trimming quality check

Use FastQC to check the quality of your trimmed reads. Again, run MultiQC for a full summary:

cd $RNA_HOME/data_trimmed
fastqc *.paired.trimmed.fq.gz
multiqc .
  1. Have most of the adapter sequences been removed by trimming? If you’re not sure, compare your report to the pre-trimming report (data_raw/multiqc.html).

7 Strandedness (do not run, dependencies not installed)

Several alignment and count tools (e.g., hisat2, htseq) require that we specify strandedness of our sequence data. If the data are coming to you from the sequencing facility then you’ll know (or can ask if you are unsure). However, if you are using published or archived data, the library prep information may be vague or missing. (A brief background on why strandedness is important can be found here: https://bitesizebio.com/42077/rna-strandedness-a-road-travelled-in-both-directions/)

Tools: - RSeQC (https://rseqc.sourceforge.net/) - how_are_we_stranded_here (https://github.com/signalbash/how_are_we_stranded_here; requires Kallisto)

In this case, we know our RNA-seq libraries were prepared using an Illumina TruSeq Stranded mRNA kit and are reverse (RF/fr-firststrand) stranded.

Here is an example of how you would run how_are_we_stranded_here (again, do not run as software is not installed.)

# move into the refs folder
cd $RNA_HOME/refs/

# pull CDS from gtf using gffread
gffread $gtf_file \
  -g $reference \
  -x ${gtf_file%.gtf}_cds.fa

# move back into the main directory
cd $RNA_HOME

# run how_are_we_stranded_here
check_strandedness --gtf refs/$gtf_file \
  --transcripts refs/${gtf_file%.gtf}_cds.fa \
  --reads_1 data_trimmed/Pca_Ribes_rep1_sub_1.paired.trimmed.fq.gz \
  --reads_2 data_trimmed/Pca_Ribes_rep1_sub_2.paired.trimmed.fq.gz

#checking strandedness
#This is PairEnd Data
#Fraction of reads failed to determine: 0.0595
#Fraction of reads explained by "1++,1--,2+-,2-+": 0.0073 (0.8% of explainable reads)
#Fraction of reads explained by "1+-,1-+,2++,2--": 0.9332 (99.2% of explainable reads)
#Over 90% of reads explained by "1+-,1-+,2++,2--"
#Data is likely RF/fr-firststrand

Here is an excellent resource for how to use these results with various tools: https://rnabio.org/module-09-appendix/0009/12/01/StrandSettings/

8 Alignment

There is no consensus about the best combination of trimming + alignment + counting + normalization tools to use. Usually, a paper introducing a new tool finds that tool does a great, often better, job in comparison to existing tools. Hurray! But is it the best tool to use?

Independent analyses of different pipelines and methods find:

  • Combinations of tools and methods matter (Corchete et al. (2020), Liu et al. (2022), Baruzzo et al. (2017))
  • Alignment and adapter trimming (optional) are less important to the results than the counting and differential expression methods (Corchete et al. (2020)) (but see the previous point).

It is probably also good to consider which methods you or your lab group are most familiar with. Will you be able to determine the best parameters to choose? Will you be able to assess whether the output of your pipeline is reasonable or trustworthy?

Reference-guided aligners:

We will use Hisat2 to map our reads. Hisat2 is a splice-aware aligner that produces reliably mapped reads for both differential gene expression and splicing analyses. There are two main steps to alignment with Hisat2:

8.1 Genome indexing

Create a list of exons and splice sites from the annotation:

# move into the refs folder
cd $RNA_HOME/refs/

# generate splice site list
extract_splice_sites.py $gtf_file > ${gtf_file%.gtf}_splicesites.tsv

# generate exon list
hisat2_extract_exons.py $gtf_file > ${gtf_file%.gtf}_exons.tsv

Now that you have a list of splice sites and exons from the reference annotation, it is time to build an index for the genome using hisat2-build. This takes 20-30 minutes using 2 cores, so for now just take a look at the code:

# Set threads 
THREADS=2

# build index (do not run!)
hisat2-build -p $THREADS --ss ${gtf_file%.gtf}_splicesites.tsv \
  --exon ${gtf_file%.gtf}_exons.tsv $reference ${reference%.fa} 2> \
  ${reference%.fa}_${gtf_file%.gtf}_hisat-build.log

Unzip the pre-built index and move the index files into the refs directory. Feel free to take a look at the log file if you’re curious about what hisat2-build did.

# unzip
unzip Pca_genome_prebuilt.zip

# move index files ending in *ht2
mv Pca_genome_prebuilt/*ht2 .

Use ls to inspect the genome index before moving on to mapping.

ls -l *ht2
  1. How many index files are there?

8.2 Mapping with Hisat2

Map the paired reads for the first replicate (Pca_Ribes_rep1_sub).


cd $RNA_HOME
# make mapping directory
mkdir mapping

#in, paired
file=Pca_Ribes_rep1_sub
in1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz
in2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz

# align reads with hisat2
hisat2 -p $THREADS -x refs/${reference%.fa} -1 $in1 -2 $in2 \
  -S mapping/${file}.sam --summary-file mapping/${file}_summaryfile.txt \
  --dta --rna-strandness RF

Explanation of options:

  • -p Number of threads
  • -x Base name of index built earlier including splice sites and exons (minus the trailing .X.ht2)
  • -1 paired reads file 1
  • -2 paired reads file 2
  • -S Output file with aligned reads in SAM format
  • --summary-file Summary information about how the mapping went
  • --rna-strandness specify strandedness, remember ours are reverse-stranded
  • --dta output mapped reads that are compatible with transcript assemblers (e.g., StringTie)

Look at the output files:

ls mapping/
cat mapping/Pca_Ribes_rep1_sub_summaryfile.txt
  1. Was the mapping successful? What percentage of reads mapped successfully (overall alignment rate)?

Write a loop to map the remaining samples.

for reads in data_trimmed/*_1.paired.trimmed.fq.gz; 
  do
    file=$(basename $reads _1.paired.trimmed.fq.gz);
    
    in1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz
    in2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz

  hisat2 -p $THREADS -x refs/${reference%.fa} -1 $in1 -2 $in2 \
  -S mapping/${file}.sam --summary-file mapping/${file}_summaryfile.txt \
  --dta --rna-strandness RF

  echo ${file} mapped; 
  done

8.3 Sorting and indexing mapped reads

If you’re familiar with the SAM-format, you know this is the human-readable file describing the reads that mapped (or did not map) to the genome. To save space and prepare the file for downstream analysis, we should use samtools to convert the mapped reads to BAM format (view), sort them by read position (sort) and create an index (index) for each file.

cd $RNA_HOME/mapping

# convert and pipe directly to sort
samtools view -@ $THREADS -bS Pca_Ribes_rep1_sub.sam | \
  samtools sort -@ $THREADS - -o Pca_Ribes_rep1_sub_sorted.bam 

# index
samtools index -@ $THREADS Pca_Ribes_rep1_sub_sorted.bam 

Convert, sort and index the remaining files. Once they are done, remove the SAM files.

for reads in *.sam; 
  do
    file=$(basename $reads .sam);
    
    samtools view -@ $THREADS -bS ${file}.sam | \
      samtools sort -@ $THREADS - -o ${file}_sorted.bam; 
    
    samtools index -@ $THREADS ${file}_sorted.bam;
    
    echo ${file} bam sorted, indexed; 
  done

9 Quantification

As with mapping, there are many options for counting mapped reads. Corchete et al. (2020) found that pipelines including HTSeq were consistently among the highest ranked in their analyses. There are other options: featureCounts is very flexible, but with this flexibility come many opportunities to make mistakes that have large effects on your count matrices. StringTie can identify transcripts, assign reads to them, and quantify them.

9.1 Count reads mapping to whole genes

To see how many reads mapped to each gene, we will use HTSeq. By default, HTSeq counts one alignment per read. If there are two or more alignments for a read, HTSeq takes the one with the highest score, unless you specify otherwise. This alignment is called the primary alignment.

Look at the options for HTSeq with htseq-count -h, or find more details in the manual (https://htseq.readthedocs.io/en/release_0.11.1/count.html).

Create an output directory (counting) for your read count files.

mkdir -p $RNA_HOME/counting

We need to specify which genomic feature htseq-count should use to evaluate aligned reads. Features are identified in the 3rd column of Pca_annotation.gtf (use head -10 refs/$gtf_file to check it out). In this case, we plan to perform a differential whole-gene expression analysis: we are interested in reads that map to exon features, so we use the option -t to specify the feature type in the HTSeq command (exon is the default). However, we plan to count all of the reads mapping to each gene. Genes are identified by gene_id in the last column of the gtf (take a look at the last column of the annotation to see the naming format). We use the option -i to specify this identifier in the command below.

Count the first sample.

# move into the main directory
cd $RNA_HOME

# count mapped reads at the gene-level for the first sample
file=Pca_Ribes_rep1_sub

htseq-count mapping/${file}_sorted.bam refs/$gtf_file \
 -f bam -r pos -a 10 -s reverse -t exon -i gene_id -m union --nonunique none > \
 counting/${file}_geneCounts.txt

Explanation of options:

  • -f sam or bam format
  • -r whether the alignment is sorted by position (pos) or read name (name)
  • -a minimum quality filter, skips reads below this value (here we specify 10, even though it is the default)
  • -s specifies whether the reads are forward stranded (yes, default), reverse stranded (reverse), or not stranded (no)
  • -t sub feature type (3rd column in GFF file) to be used
  • -i GFF attribute to be used as feature ID
  • -m mode to handle reads overlapping more than one feature. Choices: union, intersection-strict, intersection-nonempty; default: union). Here we continue with the default. Check out the manual for an explanation of the mapping characteristics included and excluded by these options.
  • --nonunique removes reads that map to multiple feature IDs (limits ‘fusion genes’). This should not be used if you are interested in counting exons for differential exon usage analysis. (Do you know why not??)

Count the reads for the remaining samples.

for reads in mapping/*_sorted.bam; 
  do
    file=$(basename $reads _sorted.bam);

    htseq-count mapping/${file}_sorted.bam refs/$gtf_file \
      -f bam -r pos -a 10 -s reverse -t exon -i gene_id -m union --nonunique none > \
      counting/${file}_geneCounts.txt;
      
    echo $file counted;
  done

9.2 Create a count matrix

To create a count matrix from these individual count files, we use paste. The matrix should look something like this:

##     gene_id Pca_Ribes_rep1 Pca_Ribes_rep2 Pca_Ribes_rep3 ...
## 1 Polcal_g1              5             25             19 ...
## 2 Polcal_g2              0              6              3 ...
  1. Before joining the files, we need to make sure IDs are all in the same order. Why?

Confirm the ids are in the same order.

#create a list of ids from each count file
cd $RNA_HOME/counting

ls *geneCounts.txt | while read file; 
do cat $file | cut -f 1 > $file.id; 
done

# compare the md5sum values of the id files
md5sum *.id

# remove the id files you just made
rm *id

Now create the matrix file.

paste *geneCounts.txt | cut -f 1,2,4,6,8,10,12,14,16,18 > Pca_count.matrix.txt 

To get a final matrix with column headers, use echo and cat to add the header to the count matrix.

echo -e "gene_id\tPca_Ribes_rep1\tPca_Ribes_rep2\tPca_Ribes_rep3\tPca_Salix_rep1\tPca_Salix_rep2\tPca_Salix_rep3\tPca_Udi_rep1\tPca_Udi_rep2\tPca_Udi_rep3" > header.txt

cat header.txt Pca_count.matrix.txt > Pca_count.matrix.final.txt 

# check the matrix
head Pca_count.matrix.final.txt

This is a weirdly convoluted and fiddly process, and luckily, the program we will use for DGE analysis has built-in methods for converting HTSeq counts to input files. But now you know one way to do it yourself!
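For reference, here is a minimal sketch in R of that built-in route, using DESeq2's DESeqDataSetFromHTSeqCount() (this assumes the file naming produced by the counting loop above and simply parses the host plant out of the file names):

# build a sample table pointing at the individual HTSeq count files
count_files <- list.files("counting", pattern = "geneCounts.txt$")
sampleTable <- data.frame(
  sampleName = sub("_sub_geneCounts.txt", "", count_files),
  fileName   = count_files,
  condition  = factor(sub("Pca_([A-Za-z]+)_rep.*", "\\1", count_files)))

# import the counts directly into a DESeqDataSet
dds_htseq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                        directory   = "counting",
                                        design      = ~ condition)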

10 Other great resources:

References

Baruzzo, G, KE Hayer, EJ Kim, B Di Camillo, GA FitzGerald, and GR Grant. 2017. Simulation-Based Comprehensive Benchmarking of RNA-Seq Aligners. Nature Methods. Vol. 14.
Corchete, Luis A., Elizabeta A. Rojas, Diego Alonso-Lopez, Javier De Las Rivas, Norma C. Gutierrez, and Francisco J Burguillo. 2020. Systematic Comparison and Assessment of RNA-Seq Procedures for Gene Expression Quantitative Analysis. Scientific Reports. Vol. 10. https://doi.org/10.1038/s41598-020-76881-x.
The Gene Ontology Consortium: Ashburner, Michael, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, et al. 2000. Gene Ontology: Tool for the Unification of Biology. Nature Genetics. Vol. 25.
Liao, Yang, and Wei Shi. 2020. Read Trimming Is Not Required for Mapping and Quantification of RNA-Seq Reads at the Gene Level. NAR Genomics and Bioinformatics. Vol. 2. https://doi.org/10.1093/nargab/lqaa068.
Liu, Xu, Jialu Zhao, Tian Zhao, Wei Ding, Yuying Han, and Haihong Ye. 2022. A Comparison of Transcriptome Analysis Methods with Reference Genome. BMC Genomics. Vol. 23. https://doi.org/10.1186/s12864-022-08465-0.
Williams, Claire R., Alyssa Baccarella, Jay Z. Parrish, and Charles C. Kim. 2016. Trimming of Sequence Reads Alters RNA-Seq Gene Expression Estimates. BMC Bioinformatics. Vol. 17. https://doi.org/10.1186/s12859-016-0956-2.

  1. Much of this material has been adapted through the years from a Harvard University course: https://github.com/hbctraining/DGE_workshop/tree/master/lessons