Comma butterflies (Polygonia c-album) in Scandinavia can eat plants from several different families (Fig. 1). This kind of diet plasticity (generalization) is rare among herbivorous insects, butterflies in particular. The plants differ in both secondary chemistry and nutrition, suggesting that Comma larvae may use different molecular and physiological mechanisms to safely and efficiently metabolize tissue from these plants. We sequenced mRNA from the midguts of 4th instar larvae feeding on one of three different host plant genera: Ribes (gooseberry), Salix (willow) and Urtica (nettle) (Fig. 2). Using these RNA-seq data, we will investigate differences in the larval gut transcriptome that contribute to the ability to eat such divergent host plants.
Today’s activity walks through a reference-based differential gene expression (DGE) analysis. This means we map our reads to an existing reference genome rather than creating a de novo transcriptome from our RNA-seq data. The three main steps of reference-based DGE analysis are 1) alignment, 2) quantification and 3) analysis (Fig. 2).
This tutorial details the steps involved in a reference-based DGE analysis, from exploring the reference and count data through to functional enrichment of differentially expressed genes.
The workshop has already covered trimming and alignment in depth, so we will start by exploring the genome and annotation used to align and count the data, and then go straight into differential expression.
Once you finish the DGE and want an extra challenge, you can continue with trimming (part 6), mapping (part 8) and counting (part 9) to see how they work.
Start by navigating to our working directory.
cd /home/genomics/workshop_materials/
cd differential_expression
Let’s set this path in our environment.
export RNA_HOME=$(pwd)
You should map your RNA reads to an unmasked or softmasked genome assembly. Check the quality of your reference assembly using completeness (e.g., BUSCO, etc.) and contiguity (e.g., N50, etc.) statistics. If you are using a published assembly, these statistics should be available in the associated publication. If not, you can easily run them yourself.
Assembly evaluation tools include BUSCO for completeness and standard tools that report contiguity statistics such as N50.
You can also use BUSCO to check the quality of your annotation. Many other tools exist to assess and manipulate annotations, such as AGAT and gffread. However, be careful: these may have built-in methods that add or remove features so that files conform to their GTF/GFF standards. Always check the number of proteins, exons and transcripts before and after manipulating a gtf or gff.
Several metrics may indicate issues with your annotation, including the number of single exon genes or the number of overlapping genes. In metazoans, neither of these should be very high (although more and more evidence is being found for overlapping genomic features, both coding and non-coding).
Annotation evaluation tools include agat_sq_stat_basic.pl and agat_sp_statistics.pl (from AGAT).
Our references can be found in the refs directory. Before we get going, set variables for our genome and annotation:
cd $RNA_HOME/refs
reference=Pca_genome.fa
gtf_file=Pca_annotation.gtf
Use bash commands like grep, wc, cut, cat, less, and more to explore the assembly, annotation, and associated files.
# use wc -c to count the bytes in the fasta
wc -c $reference
# 325522048
# use grep -c to count the number of contig headers
grep -c '>' $reference
# 14319 scaffolds
# look at the BUSCO summary
cat Pca_genome_BUSCOsummary.txt
# count the number of scaffolds with annotated features
cut -f 1 $gtf_file | sort -u | wc -l
# 5007 of the 14319 scaffolds (~35%) contain annotated features
# look at the first lines of the gtf; note the transcript_id and gene_id attributes in the last column
head $gtf_file
# count the number of gene and transcript features in the annotation
cut -f 3 $gtf_file | grep -c gene
cut -f 3 $gtf_file | grep -c transcript
Our data can be found in the data_raw directory. Move into it and explore the reads.
#e.g.
zcat Pca_Ribes_rep1_sub_1.fq.gz | grep -c @D00483
#100000
#hint: these are subsampled files to help the mapping and counting go quickly
Move back to $RNA_HOME. Use ls to explore the directory $RNA_HOME/diff_expr. It contains a file of gene counts (and one of exon counts, which you can ignore). We will save the output of the analysis in this directory.
From here on, everything will be done in RStudio.
Start by checking (getwd()) and, if needed, setting (setwd()) your working directory.
#check your wd
getwd()
# if needed, set your wd. the activity is designed to be run in your main directory (aka $RNA_HOME from above)
setwd("/home/genomics/workshop_materials/differential_expression/")
We will use the following R packages:
- DESeq2: differential gene expression analysis
- topGO: gene set enrichment analysis
- magrittr, dplyr, tidyr, readr, tibble: data wrangling (or tidyverse, if you prefer)
- ggplot2, RColorBrewer, Rgraphviz: visualizations
# load libraries
myLib <- as.vector(c("DESeq2","topGO", "magrittr", "dplyr","tidyr", "readr", "tibble", "ggplot2", "RColorBrewer"))
lapply(myLib,library,character.only=TRUE)
Remember that we are comparing expression among larvae feeding on different host plants (in the column condition). We also evenly sampled larvae from three different families. First, load the metadata for the samples:
# readr::read_tsv() to load the metadata
Pca_metadata <- read_tsv("metadata/Pca_metadata.tsv",
col_types = c("ccfff"))
# here we specify the column types using a condensed string, one letter for each column.
# 'c' stands for character, 'f' for factor
Inspect the metadata using str() and head(). Notice that there are more samples here than we were working with in the trimming, mapping, and counting parts of the activity.
unique(Pca_metadata$sampleName) %>% length()
We have three individuals in each family-condition combination.
table(Pca_metadata[3:4])
## family
## condition 1 2 3
## Ribes 3 3 3
## Salix 3 3 3
## Urtica 3 3 3
This is a little slim for a differential expression analysis; many DGE tools, including DESeq2, recommend 7 or more samples per group. Therefore, rather than also modeling family, we are going to stick with looking at differences between host plants (condition) in this activity.
In the trimming, mapping and counting parts of this activity, you create a count matrix from the HTSeq output (Pca_count.matrix.final.txt), but only from a subset of the fastq data. The full count matrix can be found in the diff_expr directory: diff_expr/Pca_gene_counts.tsv.
Load the count data into a matrix and double check that the column names of the count matrix match the sample names in our metadata.
# load gene count data, convert the column gene_id to row names, convert to a matrix
Pca_counts <- read_tsv("diff_expr/Pca_gene_counts.tsv") %>%
column_to_rownames("gene_id") %>%
as.matrix()
# compare matrix column names with metadata sample names
colnames(Pca_counts) %in% Pca_metadata$sampleName
All of the values should be TRUE.
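Note that %in% only confirms that each column name appears somewhere in the metadata; DESeq2 also expects the columns of the count matrix to correspond to the rows of the metadata, in the same order. A minimal sketch of a stricter check (assuming sampleName uniquely identifies the samples):
# stricter check: same names AND same order
all(colnames(Pca_counts) == Pca_metadata$sampleName)
# if the names match but the order does not, reorder the matrix columns to match the metadata
Pca_counts <- Pca_counts[, Pca_metadata$sampleName]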
Now we create the data object DESeq2 needs to analyze the data. Some R packages (especially Bioconductor packages) define custom classes for storing data (input data, intermediate data and results). These custom data structures are similar to lists in that they can contain multiple different data types/structures. Unlike lists, they have pre-specified slots that hold specific types or classes of data, and the data stored in these slots are accessed using functions designed and implemented specifically for the package.
Let’s start by creating the DESeqDataSet object and then we can learn a bit more about what is stored inside it. To create the object we will need the count matrix (Pca_counts) and the metadata table (Pca_metadata) as input.
We will also need to specify a design formula. The design formula specifies how the column(s) in the metadata table should be used in the analysis. For our data, there is only one column we are interested in: condition. The design ~ condition tells DESeq2 that, for each gene, we want to evaluate changes in expression with respect to the three levels of this variable.
Create the DESeqDataSet (aka dds) object.
Pca_dds <- DESeqDataSetFromMatrix(countData = Pca_counts,
colData = Pca_metadata,
design = ~ condition)
# Take a look:
Pca_dds
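The pre-specified slots of this object can be inspected with standard accessor functions. A quick look (these are all existing DESeq2/SummarizedExperiment accessors):
# the count matrix stored in the object
head(counts(Pca_dds))
# the sample metadata and the design formula
colData(Pca_dds)
design(Pca_dds)
# genes x samples
dim(Pca_dds)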
If you check out some of the other DESeq2 import functions with help("DESeqDataSet-class"), you will see that DESeq2 has a function to import HTSeq count files directly (DESeqDataSetFromHTSeqCount), which can streamline your pipeline a bit.
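For example, here is a minimal, hypothetical sketch of that import route (the file names are illustrative and assume one htseq-count output file per sample in a counting/ directory; we do not run this here). The sample table needs sample names in the first column, count file names in the second, and the experimental variables after that.
# sketch only: not run in this activity
sampleTable <- data.frame(sampleName = Pca_metadata$sampleName,
                          fileName   = paste0(Pca_metadata$sampleName, "_geneCounts.txt"),
                          condition  = Pca_metadata$condition)
Pca_dds_htseq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                            directory   = "counting",
                                            design      = ~ condition)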
There can be systematic biases in RNA sequencing data, such as differences in total sequencing depth between samples. We need to adjust the count data for such systematic biases in a process called normalization. Unfortunately, the term normalization is often used for this aspect of data analysis, even though it can be misleading: it has nothing to do with the normal distribution; nor does it involve data transformation. Rather, we aim to identify the nature and magnitude of systematic biases, and take them into account in our analysis of the data.
DGE analysis is designed to deal with count data that are highly variable between samples. However, it is sometimes a good idea to remove genes that are very rarely expressed before you start your analysis. This can speed up the analysis and reduce the multiple-testing burden, since such genes carry too little information to ever be called differentially expressed. It is not a required step, and tools like DESeq2 can handle the data without filtering first, but it is something to consider. Here, we will set a filter based on a fixed threshold: keep only genes with at least 5 reads in at least 3 samples.
Filter the dds object.
# determine whether genes meet our threshold
keep <- rowSums(counts(Pca_dds) >= 5 ) >= 3
# filter the dds object to keep genes that meet the threshold
Pca_dds_filt <- Pca_dds[keep,]
# check the number of genes that meet our threshold
table(keep)
The main factors considered during normalization are sequencing depth and RNA composition. DESeq2 uses the Median of Ratios normalization (MRN) method. Please check the presentation slides on normalization for a detailed explanation of how MRN is calculated. Briefly, MRN builds a pseudo-reference sample from the per-gene geometric means across all samples, computes the ratio of each sample's counts to this reference for every gene, and takes the median of those ratios per sample as that sample's size factor.
# Estimate the size factors
Pca_dds_filt <- estimateSizeFactors(Pca_dds_filt)
sizeFactors(Pca_dds_filt) # Check the size factors
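If you are curious what estimateSizeFactors() is doing, here is a minimal sketch of the median-of-ratios calculation done by hand; it should closely match sizeFactors() above.
# median-of-ratios by hand
cts <- counts(Pca_dds_filt)
# per-gene log geometric mean across samples (the pseudo-reference sample)
log_geomeans <- rowMeans(log(cts))
# per sample: the median ratio of its counts to the reference, ignoring genes with zero counts
manual_sf <- apply(cts, 2, function(s)
  exp(median((log(s) - log_geomeans)[is.finite(log_geomeans) & s > 0])))
round(manual_sf, 3)
round(sizeFactors(Pca_dds_filt), 3)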
Now that we have the normalized counts, we can perform quality control and exploratory analysis using unsupervised clustering. These analyses include Principal Component Analysis (PCA) and hierarchical clustering.
PCA is a technique used to emphasize variation and bring out strong patterns in the data. We expect to see biological replicates belonging to the same experimental group to cluster together.
Hierarchical clustering also groups samples by expression similarity. A hierarchical tree indicates which samples are more similar to each other based on the normalized gene expression values, and a heatmap can show the pairwise distances (or correlations) between all samples in the data. Clustering of samples in the hierarchical clustering analysis should be similar to the clustering in the PCA. When using these unsupervised clustering methods, transforming the normalized counts improves the distances/clustering for visualization.
DESeq2 recommends a variance-stabilizing transformation (vst()) of the normalized counts, as it moderates the variance across the mean and improves the clustering.
Transform the data and create a PCA plot using the DESeq2 function plotPCA().
# Transform the data using the variance-stabilizing transformation
Pca_vst <- vst(Pca_dds_filt, blind=TRUE)
# PCA plot
plotPCA(Pca_vst, intgroup=c("condition"))
# plotPCA depends on ggplot2, so we can spice it up a bit:
plotPCA(Pca_vst, intgroup=c("condition")) +
theme_bw() +
scale_color_brewer(palette = "Dark2", name = "Host plant") +
theme(legend.text = element_text(face = "italic"))
# save your plot
ggsave("diff_expr/Pca_vst_PC1_PC2.png", height = 6, width = 6)
Hint: If you don’t like using the DESeq2::plotPCA() function and want to use one that might be a little more customizable (e.g., stats::prcomp() or FactoMineR::PCA()), you can extract the normalized, transformed values using assay(Pca_vst).
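For example, a minimal sketch with stats::prcomp() (the plotting choices here are just an illustration):
# PCA on the transformed counts (samples in rows, genes in columns)
pca <- prcomp(t(assay(Pca_vst)))
pca_df <- data.frame(PC1 = pca$x[, 1],
                     PC2 = pca$x[, 2],
                     condition = colData(Pca_vst)$condition)
ggplot(pca_df, aes(PC1, PC2, color = condition)) +
  geom_point(size = 3) +
  theme_bw()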
Here, we are going to visualize which samples cluster together when using all of the filtered, normalized and transformed expression data. Start by creating a distance matrix among samples.
# Create the distance matrix
Pca_vst_dists <- dist(t(assay(Pca_vst)))
Pca_vst_mat <- as.matrix(Pca_vst_dists)
Now, run the clustering analysis on the samples.
# run clustering analysis
Pca_vst_hc <- hclust(Pca_vst_dists)
# create a color palette
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
# Plot the heatmap
heatmap(Pca_vst_mat,
Rowv=as.dendrogram(Pca_vst_hc),
symm=TRUE,
col = rev(hmcol),
margin=c(13, 13),
cexRow = 0.75,
cexCol = 0.75)
# save the heatmap
pdf(file = "diff_expr/Pca_vst_sampleHeatmap.pdf", height = 6, width = 6)
heatmap(Pca_vst_mat,
Rowv=as.dendrogram(Pca_vst_hc),
symm=TRUE,
col = rev(hmcol),
margin=c(13, 13),
cexRow = 0.75,
cexCol = 0.75)
dev.off()
Now that we have checked our samples for outliers and looked for preliminary evidence of clustering by condition, we will conduct the differential expression analysis. To run the actual analysis, we use the function DESeq(). It normalizes the counts using the size factors and fits a negative binomial generalized linear model for each gene.
# Run the analysis on the filtered read counts
Pca_dds_filt <- DESeq(Pca_dds_filt)
By re-assigning the output of the function back to the same variable name (Pca_dds_filt), we fill in the slots of our DESeqDataSet object with the new values calculated by DESeq2.
With DESeq2, the Wald test is commonly used for hypothesis testing when comparing two groups. To indicate to DESeq2 which two groups we want to compare, we provide a contrast, i.e., the specific pair of factor levels we want to test.
# check what contrasts are even possible
resultsNames(Pca_dds_filt)
Ribes is currently set as the first level of our condition variable, so DESeq2 automatically compares the remaining two levels back to Ribes.
Extract the results for Urtica vs. Ribes.
# specify the contrast by naming the factor you want to test
# and the levels of the factor you want to compare.
contrast_U_R <- c("condition", "Urtica", "Ribes")
# extract the results for your specified contrast
Pca_res_table_U_R <- results(Pca_dds_filt, contrast=contrast_U_R)
# this can also be coded directly by using the names of the contrasts
Pca_res_table_U_R <- results(Pca_dds_filt, name = "condition_Urtica_vs_Ribes")
A Wald test statistic is computed per gene, and the associated p-value gives the probability of observing a test statistic at least as extreme as the observed value by chance alone. We reject the null hypothesis (i.e., that there is no difference in expression between groups) and conclude a gene is differentially expressed when the p-value is below our significance threshold (usually 0.05).
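In DESeq2, the Wald statistic for each gene is simply the LFC divided by its standard error, and the p-value comes from comparing that statistic to a standard normal distribution. You can confirm this for yourself with a quick sketch:
# Wald statistic = LFC / lfcSE
head(Pca_res_table_U_R$log2FoldChange / Pca_res_table_U_R$lfcSE)
head(Pca_res_table_U_R$stat)
# two-sided p-value from the standard normal distribution
head(2 * pnorm(abs(Pca_res_table_U_R$stat), lower.tail = FALSE))
head(Pca_res_table_U_R$pvalue)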
# sort results based on the padj
Pca_res_table_U_R <- Pca_res_table_U_R[order(Pca_res_table_U_R$padj),]
# You will now have the smallest padj values at the top of your table
head(Pca_res_table_U_R)
# save your results
write_tsv(as.data.frame(Pca_res_table_U_R), "diff_expr/Pca_res_table_U_R.tsv")
The column headers are:
- baseMean: mean of normalized counts across all samples
- log2FoldChange: log2 fold change (LFC)
- lfcSE: standard error of the log2 fold change
- stat: the Wald test statistic
- pvalue: Wald test p-value
- padj: Benjamini-Hochberg (BH) adjusted p-value
Note that we have both raw p-values (pvalue) and adjusted p-values (padj) in the output. If we use the raw p-values with a significance threshold of 0.05, we expect to make a wrong conclusion (for example, calling a gene significantly differentially expressed when it is not) for roughly 1 out of every 20 genes tested. Testing 100 genes? That is ~5 wrong conclusions. 10,000 genes? ~500. Adjusting the p-values to account for testing many genes at once helps us have more confidence in our conclusions.
We can use additional thresholds, like the log2 fold change (LFC), to identify DE genes. A common threshold is an absolute LFC of 1, which means the gene is expressed at least twice as highly in one condition as in the other (2^1 = 2).
To make sure we understand the results, it is sometimes good to start with a single gene. Look at the gene that is most differentially expressed (i.e., lowest padj) in the sorted results table.
head(Pca_res_table_U_R, 1)
Is it up- or down-regulated in larvae feeding on Ribes compared to Urtica? It can be hard to tell from the results table alone. We can use DESeq2::plotCounts() to visualize differences in expression for larvae on the different host plants.
plotCounts(Pca_dds_filt, gene="Polcal_g14648", intgroup="condition")
From this plot, we can see that Polcal_g14648, which had an LFC of -7.88699 in the results table, is expressed at lower levels on Urtica than on Ribes. This tells us that genes with negative LFC values and significant adjusted p-values (based on our chosen cutoff) are likely to be downregulated on Urtica and upregulated on Ribes.
Thus, we can think of the contrast c("condition", "Urtica", "Ribes") as asking the question "How does Urtica (the first named level) differ relative to Ribes (the second named level)?", which should help when trying to specify your own contrasts.
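One way to convince yourself of the direction is to flip the contrast and watch the sign of the LFC flip with it (a quick sketch using the example gene from above):
# the same comparison with the levels reversed: the LFC sign flips
Pca_res_table_R_U <- results(Pca_dds_filt, contrast = c("condition", "Ribes", "Urtica"))
Pca_res_table_R_U["Polcal_g14648", "log2FoldChange"]   # ~ +7.89
Pca_res_table_U_R["Polcal_g14648", "log2FoldChange"]   # ~ -7.89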
The MA plot shows the mean of the normalized counts versus the LFC for all genes. Genes that are significantly differentially expressed are colored so they are easy to spot. The default significance threshold (alpha) in plotMA() is 0.1, but today let’s use 0.01 to be a little more conservative. Remember, negative LFC values mean the genes are expressed at lower levels on Urtica!
To make an MA plot directly from the DESeq object, we use the function plotMA(). Use ??DESeq2::plotMA to figure out how to create this plot.
# MA plot needs the dds object
plotMA(object = Pca_dds_filt,
ylim=c(-2,2),
alpha = 0.01)
Note that the triangles show genes that fall outside of our y axis limits. Try changing the ylim values to see how the number of triangles changes. Once you’re happy with it, save it using the pdf() device you learned in the R workshop.
pdf(file = "diff_expr/Pca_MAplot_U_R.pdf", height = 6, width = 6)
plotMA(object = Pca_dds_filt,
ylim=c(-5,5), # I chose a y lim of 5
alpha = 0.01)
dev.off()
Extra challenge: build the plot in ggplot instead.
# ggplot option
ggplot(as.data.frame(Pca_res_table_U_R), aes(x = log(baseMean), y = log2FoldChange)) +
geom_point(aes(color = padj < 0.01), size = 0.5) +
scale_color_manual(values = c("grey", "blue"))+
theme_bw()
# ggsave option
ggsave("diff_expr/Pca_MAplot_U_R.pdf", height = 6, width = 6)
This plot allows us to evaluate the magnitude of fold changes and how they are distributed relative to mean expression. Generally, we would expect to see significant genes across the full range of expression levels. Do you observe this in your plot? In this example data set, you have lots of genes with very small counts (between 0 and 10). Do you see large log2 fold changes (>1) for genes with low counts? Is there a way to shrink the log2 fold change for genes with low counts? Have a look at the DESeq2 documentation and read about log2 fold change shrinkage.
The volcano plot complements the MA plot by visualizing the relationship between LFC (on the x axis) and the adjusted p-value (-log10-transformed, i.e. -log10(padj), on the y axis). Using ggplot() and geom_point(), create a volcano plot. Quick reminder: the results table is a DESeqResults object, so you may have to convert it to a data frame (as.data.frame()) before using it in ggplot.
ggplot(as.data.frame(Pca_res_table_U_R), aes(x = log2FoldChange, y = -log10(padj))) +
geom_point()
We can make this plot much more informative to readers, for example by coloring the points that pass our thresholds and adding:
- geom_hline() to add a dashed horizontal line at our significance threshold (padj = 0.01; don't forget the adjusted p-values are -log10-transformed!)
- geom_vline() to add dashed vertical lines at our LFC thresholds (-1 and 1)
# option 1
Pca_res_df_U_R <- Pca_res_table_U_R %>%
as.data.frame() %>%
mutate(sigDE = if_else(padj < 0.01 & abs(log2FoldChange) > 1 , "sig", "nsig"))
ggplot(as.data.frame(Pca_res_df_U_R), aes(x = log2FoldChange, y = -log10(padj))) +
geom_point(aes(color = sigDE)) +
scale_color_manual(values = c("grey", "blue"))+
geom_hline(yintercept = -log10(0.01), linetype = "dashed")+
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
theme_bw()
## option 2
ggplot() +
geom_point(
data = as.data.frame(Pca_res_table_U_R) %>%
filter(padj > 0.01 | abs(log2FoldChange) < 1),
aes(x = log2FoldChange, y = -log10(padj)),
color = "grey") +
geom_point(
data = as.data.frame(Pca_res_table_U_R) %>%
filter(padj < 0.01 & abs(log2FoldChange) > 1),
aes(x = log2FoldChange, y = -log10(padj)),
color = "blue") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
theme_bw()
# save the volcano plot
ggsave("diff_expr/Pca_volcanoPlot_U_R.pdf", height = 6, width = 6)
Volcano plots also give us an opportunity to see whether our results look “normal”. If we see big lines or streaks of genes all in a row, this might mean we have a lot of duplicated loci in our assembly or that our model doesn’t fit our data very well.
Note: When within-group variation is high, LFC values can be a little misleading. Just look at all the nonsignificant genes with very high or low LFC in our MA and volcano plots. Check out the DESeq2::lfcShrink() function for ways to remove the background noise introduced by these highly variable genes.
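A minimal sketch of how shrinkage could be applied here (the "apeglm" method needs the apeglm package installed; type = "normal" is built in and also accepts a contrast argument):
# shrink the LFCs for the Urtica vs Ribes coefficient
Pca_res_shrunk_U_R <- lfcShrink(Pca_dds_filt,
                                coef = "condition_Urtica_vs_Ribes",
                                type = "apeglm")   # or type = "normal"
# MA plot of the shrunken results: extreme LFCs at low counts are pulled toward zero
plotMA(Pca_res_shrunk_U_R, ylim = c(-5, 5), alpha = 0.01)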
Another useful visualization tool is a heatmap of gene expression. We used a heatmap above to show similarity among samples; we can also use heatmaps to show the expression of significantly DE genes using heatmap(). Heatmaps contain a lot of information, so it is usually a good idea to plot a subset of genes that interest you.
Make a heatmap of the 20 most differentially expressed (lowest p-value) genes.
# Identify the top 20 differentially expressed genes from the sorted results table.
DE_genes <- rownames(head(Pca_res_table_U_R,20))
# Extract the expression levels of the top DE genes from the vst-transformed object
Pca_dds_DE_U_R <- assay(Pca_vst)[rownames(Pca_vst) %in% DE_genes,]
# create the heatmap using baseR heatmap
heatmap(Pca_dds_DE_U_R,
margin = c(10,6))
# save plot
pdf(file = "diff_expr/Pca_geneHeatmap_U_R.pdf", height = 6, width = 6)
heatmap(Pca_dds_DE_U_R,
margin = c(10,6))
dev.off()
Finally, we want to know the functions of our DE genes. We can explore this using gene set enrichment analysis (GSEA). We have already generated a functional annotation for our annotated genes using EggNOG-mapper (http://eggnog-mapper.embl.de/), which assigned gene ontology (GO) terms to our genes. GO terms provide a unified language for biological processes, molecular functions and cellular components (Gene Ontology Consortium et al. 2000). Gene ontology is a widely used tool in genomics and transcriptomics but should be interpreted with caution (more on this later).
Define the path to the functional annotation. We do this instead of loading the dataframe directly so we can use the path in a topGO function later on.
GO_path <- "refs/Pca_GO_annotation.tsv"
Take a look at the format of the functional annotation.
#read the go annotation tsv
GO_annotation <- read_tsv(GO_path, col_names = c("gene_id", "GOterms"))
# take a look at the first few rows
head(GO_annotation)
For nonmodel organisms, functional annotation of predicted genes may be incomplete. We can still run GSEA, but should be aware that only a fraction of our genes (including DE genes) may be represented in our analysis.
dim(GO_annotation)
grepl(GO_annotation$GOterms, pattern ="GO") %>% sum()
The package we will use to run the GSEA is topGO, which tests GO term enrichment while accounting for the GO topology (i.e., how the GO terms are related to each other).
GSEA tests whether certain functions are enriched in a set of genes by evaluating whether the number of genes associated with a given GO term is higher than expected by chance, given how often the GO term appears in the whole gene set (the gene "universe").
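At its core this is a contingency-table test. A toy sketch of the logic for a single GO term (the numbers are made up):
# toy example: 2000 genes in the universe, 100 in our DE list;
# a GO term annotates 50 genes overall, 12 of which are in the DE list
counts_tab <- matrix(c(12,  88,     # DE genes: with / without the term
                       38, 1862),   # non-DE genes: with / without the term
                     nrow = 2, byrow = TRUE)
fisher.test(counts_tab, alternative = "greater")$p.value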
Create the “gene universe”.
# here I will be only analyzing GO terms with at least 5 members,
# as this yields more stable results.
node_size= 5
# We will focus on Biological process terms (BP)
GO_category="BP"
# use topGO to read the functional annotation
geneID2GO <- readMappings(GO_path)
# define the gene universe
geneUniverse <- names(geneID2GO)
We will perform two GSEAs, one each for genes that are upregulated or downregulated on Urtica relative to Ribes using a significance cutoff of 0.01 and an LFC threshold of 1 or -1.
Create lists of genes that are upregulated (up_U_R) and downregulated (down_U_R).
up_U_R <- rownames(Pca_res_table_U_R %>%
data.frame() %>%
filter(padj < 0.01 & log2FoldChange > 1))
down_U_R <- rownames(Pca_res_table_U_R %>%
data.frame() %>%
filter(padj < 0.01 & log2FoldChange < -1))
# save your gene lists!
Generally, GSEA is best applied to a gene set of intermediate size. If the set is too small, it is unlikely that any GO term will occur often enough to be enriched; instead, you can simply look at the genes themselves and draw conclusions about their functions. If the set is too large relative to the annotated gene universe, terms will occur in the list about as often as expected by chance, because you have effectively sampled all the genes!
Run the GSEA for the downregulated gene set.
genesOfInterest.bv <- down_U_R
geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
names(geneList.bv) <- geneUniverse
# Create a new topGO GSEA object
myGOdata.bv <- new("topGOdata",
description="Candidate genes",
ontology=GO_category,
allGenes=geneList.bv,
annot = annFUN.gene2GO,
gene2GO = geneID2GO,
nodeSize = node_size)
Now we have to run the statistical analysis for each term. Here, we will run a classic Fisher's exact test and a Fisher's exact test corrected using topGO's parent-child algorithm. When assessing a GO term, the parent-child algorithm evaluates its enrichment in relation to its parent terms, which reduces false positives due to the inheritance problem.
Run the statistical tests in topGO.
# classic Fisher
resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")
# Fisher with parent-child algorithm
resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")
# Create a GSEA results table
# GenTable() is annoying and requires a number of terms we want reported (default is 10)
## so we leverage our topGOresult object
top_nodes <- sum(resultParentchild@score < 0.05)
allRes_BP_down_U_R <- GenTable(myGOdata.bv,
classicFisher = resultClassic,
parentchildFisher = resultParentchild,
orderBy = "parentchildFisher",
topNodes = top_nodes)
What are the most enriched biological processes in genes that are downregulated in Urtica relative to Ribes?
P-values for each GO term are listed under the relevant test (classicFisher, parentchildFisher). Are the classic results different from the parent-child results?
head(allRes_BP_down_U_R, 20)
One common way of showing enriched terms is using a dot plot.
# subset the results table
allRes_BP_down_U_R_top15 <- allRes_BP_down_U_R %>%
mutate(GO.term = paste(GO.ID, Term, sep = ": "),
parentchildFisher = as.numeric(parentchildFisher)) %>%
arrange(parentchildFisher) %>%
head(15)
# plot the subsetted results
# points are scaled by the number of significant genes associated with each term, which gives an idea of how confident we can be in the enrichment score.
ggplot(allRes_BP_down_U_R_top15, aes(x = GO.term, y = -log10(parentchildFisher) ))+
geom_point(aes(size = Significant)) +
coord_flip() +
theme_bw()
# save your GO term plot
Finally, we can visualize the parent-child relationships of the most significantly enriched terms.
# Rgraphviz needs to be installed and loaded to build this plot
library(Rgraphviz)
showSigOfNodes(myGOdata.bv, score(resultParentchild), firstSigNodes = 15, useInfo = 'all')
Rectangles indicate the most significant terms (here the top 15, set by firstSigNodes). Rectangle color represents relative significance (yellow = least, red = most significant). Each node lists the GO identifier, a trimmed GO name, the raw p-value, the number of significant genes, and the total number of genes annotated to that GO term.
Modifying the code above, perform a GSEA on the genes that are up-regulated on Urtica relative to Ribes.
genesOfInterest.bv <- up_U_R
geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
names(geneList.bv) <- geneUniverse
# Create a new topGO GSEA object
myGOdata.bv <- new("topGOdata",
description="Candidate genes",
ontology=GO_category,
allGenes=geneList.bv,
annot = annFUN.gene2GO,
gene2GO = geneID2GO,
nodeSize = node_size)
# classic Fisher
resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")
# Fisher with parent-child algorithm
resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")
# Create a GSEA results table
# GenTable() is annoying and requires a number of terms we want reported (default is 10)
## so we leverage our topGOresult object
top_nodes <- sum(resultParentchild@score < 0.05)
allRes_BP_up_U_R <- GenTable(myGOdata.bv,
classicFisher = resultClassic,
parentchildFisher = resultParentchild,
orderBy = "parentchildFisher",
topNodes = top_nodes)
Fewer genes tend to be annotated with molecular functions than with biological processes, so we have less power to detect significant enrichment when using these terms. Nevertheless, they can be interesting when trying to understand the roles of up- and down-regulated genes.
Repeat the GSEA for the up- and down-regulated genes, but this time test the molecular functions (GO_category = "MF"). Can you run it as a for-loop?
# switch the GO category to molecular function (MF)
GO_category="MF"
DEgenes_list <- list(down_U_R, up_U_R)
names(DEgenes_list) <- c("down_U_R", "up_U_R")
for (i in 1:2){
# define candidate list
genesOfInterest.bv <- DEgenes_list[[i]]
# placing this within brackets causes all the lines to run in a single block
# it can help you keep track of code that you usually run together
geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
names(geneList.bv) <- geneUniverse
# Create a new topGO GSEA object
myGOdata.bv <- new("topGOdata",
description="Candidate genes",
ontology=GO_category,
allGenes=geneList.bv,
annot = annFUN.gene2GO,
gene2GO = geneID2GO,
nodeSize = node_size)
# classic Fisher
resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")
# Fisher with parent-child algorithm
resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")
top_nodes <- sum(resultParentchild@score < 0.05)
# Create a GSEA results table
temp <- GenTable(myGOdata.bv,
classicFisher = resultClassic,
parentchildFisher = resultParentchild,
orderBy = "parentchildFisher",
topNodes = top_nodes)
assign(paste0("allRes_MF_", names(DEgenes_list)[[i]]), temp)
}
head(allRes_MF_down_U_R)
head(allRes_MF_up_U_R)
We know which genes are differentially expressed between Urtica and Ribes but still don't know how the third host plant, Salix, fits into the picture.
Repeat the differential expression analysis, this time comparing expression between larvae reared on Urtica and Salix. Remember from resultsNames(Pca_dds_filt) that this was not one of the comparisons available in our current dds object. To deal with this, we will relevel the host plants within the condition variable and refit the DESeq model. For a hint on releveling condition:
# relevel condition
Pca_dds_filt$condition <- relevel(Pca_dds_filt$condition, "Salix")
# refit the model
Pca_dds_filt <- DESeq(Pca_dds_filt)
# Check your work!
# Is the right contrast there?
resultsNames(Pca_dds_filt)
Try to do it on your own first, and compare to my code in the drop down if you need some help along the way:
Extract the results for Urtica vs. Salix.
# specify the contrast by naming the factor you want to test
# and the levels of the factor you want to compare.
contrast_U_S <- c("condition", "Urtica", "Salix")
# extract the results for your specified contrast
Pca_res_table_U_S <- results(Pca_dds_filt, contrast=contrast_U_S)
# sort results based on the padj
Pca_res_table_U_S <- Pca_res_table_U_S[order(Pca_res_table_U_S$padj),]
# save your results
write_tsv(as.data.frame(Pca_res_table_U_S), "diff_expr/Pca_res_table_U_S.tsv")
Visualize differential expression. Can you try adding labels to the 20 most significantly DE genes in your volcano plot?
# MA plot needs the dds object
pdf(file = "diff_expr/Pca_MAplot_U_S.pdf", height = 6, width = 6)
plotMA(object = Pca_dds_filt,
ylim=c(-2,2),
alpha = 0.01)
dev.off()
# code to add labels shown in comments:
# install.packages("ggrepel")
# library(ggrepel)
# top20_labs <- head(Pca_res_table_U_S, 20) %>%
# as.data.frame() %>%
# rownames_to_column("gene_id")
ggplot() +
# geom_text_repel(data = top20_labs,
# aes(x = log2FoldChange, y = -log10(padj), label = gene_id), size = 2)+
geom_point(
data = as.data.frame(Pca_res_table_U_S) %>% filter(padj > 0.01 | abs(log2FoldChange) < 1),
aes(x = log2FoldChange, y = -log10(padj)),
color = "grey") +
geom_point(
data = as.data.frame(Pca_res_table_U_S) %>% filter(padj < 0.01 & abs(log2FoldChange) > 1),
aes(x = log2FoldChange, y = -log10(padj))) +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
theme_bw()
# save the volcano plot
ggsave("diff_expr/Pca_volcanoPlot_U_S.pdf", height = 6, width = 6)
Make a heatmap of the 20 most differentially expressed (lowest p-value) genes.
# Identify the top 20 differentially expressed genes from the sorted results table.
DE_genes <- rownames(head(Pca_res_table_U_S,20))
# Create a heatmap for the most expressed genes
Pca_dds_DE_U_S <- assay(Pca_vst)[rownames(Pca_vst) %in% DE_genes,]
heatmap(Pca_dds_DE_U_S,
margin = c(10,6))
# save plot
pdf(file = "diff_expr/Pca_geneHeatmap_U_S.pdf", height = 6, width = 6)
heatmap(Pca_dds_DE_U_S,
margin = c(10,6))
dev.off()
Run a functional enrichment analysis (GSEA) on the up- and down-regulated genes. You do not need to load the annotation again; go straight to the candidate gene list steps (but don't forget to indicate the correct GO category).
# define the candidate gene lists for the Urtica vs. Salix contrast, then loop over both GO categories (BP and MF)
up_U_S <- rownames(Pca_res_table_U_S %>%
data.frame() %>%
filter(padj < 0.01 & log2FoldChange > 1))
down_U_S <- rownames(Pca_res_table_U_S %>%
data.frame() %>%
filter(padj < 0.01 & log2FoldChange < -1))
DEgenes_list <- list(down_U_S, up_U_S)
names(DEgenes_list) <- c("down_U_S", "up_U_S")
for (k in c("BP", "MF")) {
GO_category=k
for (i in 1:2){
# define candidate list
genesOfInterest.bv <- DEgenes_list[[i]]
# placing this within brackets causes all the lines to run in a single block
# it can help you keep track of code that you usually run together
geneList.bv <- factor(as.integer(geneUniverse %in% genesOfInterest.bv))
names(geneList.bv) <- geneUniverse
# Create a new topGO GSEA object
myGOdata.bv <- new("topGOdata",
description="Candidate genes",
ontology=GO_category,
allGenes=geneList.bv,
annot = annFUN.gene2GO,
gene2GO = geneID2GO,
nodeSize = node_size)
# classic Fisher
resultClassic <- runTest(myGOdata.bv, algorithm="classic", statistic="fisher")
# Fisher with parent-child algorithm
resultParentchild <- runTest(myGOdata.bv, algorithm="parentchild", statistic="fisher")
top_nodes <- sum(resultParentchild@score < 0.05)
# Create a GSEA results table
temp <- GenTable(myGOdata.bv,
classicFisher = resultClassic,
parentchildFisher = resultParentchild,
orderBy = "parentchildFisher",
topNodes = top_nodes)
assign(paste0("allRes_", GO_category, "_", names(DEgenes_list)[[i]]), temp)
}
}
head(allRes_BP_down_U_S)
head(allRes_BP_up_U_S)
head(allRes_MF_down_U_S)
head(allRes_MF_up_U_S)
Are the same genes differentially expressed between Urtica and each of the other two host plants? Compare the results from the Urtica vs. Ribes and Urtica vs. Salix analyses visually. You can use Venn diagrams (e.g., gplots), Euler diagrams (eulerr), UpSet plots (e.g., UpSetR, ComplexHeatmap), alluvial plots (ggalluvial), or scatterplots (e.g., ggplot2).
What about the functional enrichment? Can you use the same tools to visualize functional overlap (i.e., GO terms) of the DE genes in the two contrasts?
You may need to install some packages. I didn't write any hints for this part! Use Google, package manuals and R help to produce the code.
THIS IS THE END OF THE DGE ANALYSIS. If you have time and want to give it a try, return to the command line to perform the trimming, mapping and counting steps that produce a count matrix.
Many older pipelines begin by trimming reads to remove adapters and poorly sequenced bases, but this is not strictly required. Commonly used aligners (e.g., STAR, hisat2) soft-clip adapters during alignment, and this should still give reliable read counts, at least at the whole-gene or whole-transcript level. Some studies even suggest trimming can be problematic for subsequent read quantification (Williams et al. 2016; Liao and Shi 2020).
If you are planning to use the reads for variant calling, genome annotation or transcriptome assembly, you can trim adapters with tools like BBduk, Trimmomatic, flexbar, or TrimGalore (a wrapper for Cutadapt; note that you need to know your adapter sequences). In these cases, you should also quality-trim your reads. The UC Davis genome center recommends combining "gentle quality trimming with a threshold quality score of Q15 with a read length filter retaining only reads longer than 35 bp in length", using tools such as BBduk or Trimmomatic.
You might also consider trimming your reads if you are getting a low mapping rate.
Trimming tools: BBduk, Trimmomatic, flexbar, TrimGalore/Cutadapt (see above).
Use FastQC to check the quality of your reads before trimming (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). If installed, you can also run MultiQC (https://multiqc.info/) to generate a summary report for all of your reads. This is for convenience and is not necessary for evaluating your reads.
cd $RNA_HOME/data_raw
fastqc *.fq.gz
multiqc .
Trimmomatic can be used to remove adapters and to lightly trim and discard low-quality fastq data. It supports multi-threading and compressed input files (gzip or bzip2). For more details, consult the manual: http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
Trimmomatic is run from a JAR file (Java ARchive) containing Java code. JAR files are executed using the command java -jar, so we have to indicate where the software is installed.
cd $RNA_HOME
# Set threads depending on available cores
THREADS=2
# Set the path to the trimmomatic installation (necessary even when installed with Conda)
TRIMMOMATIC_ROOT=/home/genomics/software/Trimmomatic-0.39
Make the output directory (data_trimmed) for your trimmed reads.
mkdir data_trimmed
Run Trimmomatic for the first sample, Pca_Ribes_rep1_sub:
#in files
in1=$RNA_HOME/data_raw/Pca_Ribes_rep1_sub_1.fq.gz
in2=$RNA_HOME/data_raw/Pca_Ribes_rep1_sub_2.fq.gz
#output, paired
paired1=$RNA_HOME/data_trimmed/Pca_Ribes_rep1_1.paired.trimmed.fq.gz
paired2=$RNA_HOME/data_trimmed/Pca_Ribes_rep1_2.paired.trimmed.fq.gz
#output, unpaired
unpaired1=$RNA_HOME/data_trimmed/Pca_Ribes_rep1_1.unpaired.trimmed.fq.gz
unpaired2=$RNA_HOME/data_trimmed/Pca_Ribes_rep1_2.unpaired.trimmed.fq.gz
# run Trimmomatic
java -jar $TRIMMOMATIC_ROOT/trimmomatic-0.39.jar PE -threads $THREADS -phred33 $in1 $in2 \
$paired1 $unpaired1 $paired2 $unpaired2 \
ILLUMINACLIP:$TRIMMOMATIC_ROOT/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:15 MINLEN:36 2> data_trimmed/Pca_Ribes_rep1_trimmed.log
Explanation of options:
- PE: paired-end mode
- The two input files are the forward and reverse reads (here $in1 and $in2)
- The four output files hold the paired and unpaired trimmed reads for each direction ($paired1/$unpaired1 and $paired2/$unpaired2)
- ILLUMINACLIP: the adapter file and clipping settings
- LEADING: quality threshold for trimming low-quality or N bases from the leading (5') end of the read
- TRAILING: quality threshold for trimming low-quality or N bases from the trailing (3') end of the read
- SLIDINGWINDOW: a window of 4 bases is slid along the read; if the average quality within the window drops below 15, the read is clipped from where that window starts
- MINLEN: minimum length threshold; reads shorter than this after trimming are discarded
You can see that Trimmomatic produces different files for paired and unpaired reads. For DGE analysis of paired-end samples, it is recommended to map only the paired reads.
Now run Trimmomatic on the remaining read pairs. Using the outline below, write a loop to trim all of the reads.
for reads in data_raw/*_1.fq.gz;
do
file=$(basename $reads _1.fq.gz);
.
.
.
echo ${file} trimmed;
done
for reads in data_raw/*_1.fq.gz;
do
file=$(basename $reads _1.fq.gz);
in1=$RNA_HOME/data_raw/${file}_1.fq.gz;
in2=$RNA_HOME/data_raw/${file}_2.fq.gz;
#output, paired
paired1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz;
paired2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz;
#output, unpaired
unpaired1=$RNA_HOME/data_trimmed/${file}_1.unpaired.trimmed.fq.gz;
unpaired2=$RNA_HOME/data_trimmed/${file}_2.unpaired.trimmed.fq.gz;
java -jar $TRIMMOMATIC_ROOT/trimmomatic-0.39.jar PE -threads $THREADS -phred33 $in1 $in2 \
$paired1 $unpaired1 $paired2 $unpaired2 \
ILLUMINACLIP:$TRIMMOMATIC_ROOT/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 \
SLIDINGWINDOW:4:15 MINLEN:36 2> data_trimmed/${file}_trimmed.log;
echo ${file} trimmed;
done
To check that you didn’t make a mistake, look at the file sizes (which should all differ) with ls.
ls -l data_trimmed/*.paired.trimmed.fq.gz
Use FastQC to check the quality of your trimmed reads. Again, run MultiQC for a full summary:
cd $RNA_HOME/data_trimmed
fastqc *.paired.trimmed.fq.gz
multiqc .
Several alignment and count tools (e.g., hisat2, htseq) require that we specify strandedness of our sequence data. If the data are coming to you from the sequencing facility then you’ll know (or can ask if you are unsure). However, if you are using published or archived data, the library prep information may be vague or missing. (A brief background on why strandedness is important can be found here: https://bitesizebio.com/42077/rna-strandedness-a-road-travelled-in-both-directions/)
Tools:
- RSeQC (https://rseqc.sourceforge.net/)
- how_are_we_stranded_here (https://github.com/signalbash/how_are_we_stranded_here; requires Kallisto)
In this case, we know our RNA-seq libraries were prepared using an Illumina TruSeq Stranded mRNA kit and are reverse (RF/fr-firststrand) stranded.
Here is an example of how you would run how_are_we_stranded_here (again, do not run this, as the software is not installed):
# move into the refs folder
cd $RNA_HOME/refs/
# pull CDS from gtf using gffread
gffread $gtf_file \
-g $reference \
-x ${gtf_file%.gtf}_cds.fa
# move back into the main directory
cd $RNA_HOME
# run how_are_we_stranded_here
check_strandedness --gtf refs/$gtf_file \
--transcripts refs/${gtf_file%.gtf}_cds.fa \
--reads_1 data_trimmed/Pca_Ribes_rep1_sub_1.paired.trimmed.fq.gz \
--reads_2 data_trimmed/Pca_Ribes_rep1_sub_2.paired.trimmed.fq.gz
#checking strandedness
#This is PairEnd Data
#Fraction of reads failed to determine: 0.0595
#Fraction of reads explained by "1++,1--,2+-,2-+": 0.0073 (0.8% of explainable reads)
#Fraction of reads explained by "1+-,1-+,2++,2--": 0.9332 (99.2% of explainable reads)
#Over 90% of reads explained by "1+-,1-+,2++,2--"
#Data is likely RF/fr-firststrand
Here is an excellent resource for how to use these results with various tools: https://rnabio.org/module-09-appendix/0009/12/01/StrandSettings/
There is no consensus about the best combination of trimming + alignment + counting + normalization tools to use. Usually, a paper introducing a new tool finds that tool does a great, often better, job in comparison to existing tools. Hurray! But is it the best tool to use?
Independent analyses of different pipelines and methods generally find that no single combination performs best in all situations (see, e.g., Corchete et al. 2020).
It is probably also good to consider which methods you or your lab group are most familiar with. Will you be able to determine the best parameters to choose? Will you be able to assess whether the output of your pipeline is reasonable or trustworthy?
Reference-guided aligners and assemblers:
- STAR (https://github.com/alexdobin/STAR)
- Hisat2 (http://daehwankimlab.github.io/hisat2/)
- StringTie (https://ccb.jhu.edu/software/stringtie/; a reference-guided transcript assembler rather than an aligner)
We will use Hisat2 to map our reads. Hisat2 is a splice-aware aligner that produces reliably mapped reads for both differential gene expression and splicing analyses. There are two main steps to alignment with Hisat2: 1) building a genome index that incorporates known splice sites and exons, and 2) aligning the reads to that index.
Create a list of exons and splice sites from the annotation:
# move into the refs folder
cd $RNA_HOME/refs/
# generate splice site list
extract_splice_sites.py $gtf_file > ${gtf_file%.gtf}_splicesites.tsv
# generate exon list
hisat2_extract_exons.py $gtf_file > ${gtf_file%.gtf}_exons.tsv
Now that you have a list of splice sites and exons from the reference annotation, it is time to build an index for the genome using hisat2-build. This takes 20-30 minutes using 2 cores, so for now just take a look at the code:
# Set threads
THREADS=2
# build index (do not run!)
hisat2-build -p $THREADS --ss ${gtf_file%.gtf}_splicesites.tsv \
--exon ${gtf_file%.gtf}_exons.tsv $reference ${reference%.fa} 2> \
${reference%.fa}_hisat-build.log
Unzip the pre-built index and move the index files into the refs directory. Feel free to take a look at the log file if you’re curious about what hisat2-build did.
# unzip
unzip Pca_genome_prebuilt.zip
# move index files ending in *ht2
mv Pca_genome_prebuilt/*ht2 .
Use ls to inspect the genome index before moving on to mapping.
ls -l *ht2
Map the paired reads for the first replicate (Pca_Ribes_rep1_sub).
cd $RNA_HOME
# make mapping directory
mkdir mapping
#in, paired
file=Pca_Ribes_rep1_sub
in1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz
in2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz
# align reads with hisat2
hisat2 -p $THREADS -x refs/${reference%.fa} -1 $in1 -2 $in2 \
-S mapping/${file}_Aligned_out.sam --summary-file mapping/${file}_summaryfile.txt \
--dta --rna-strandness RF
Explanation of options:
- -p: number of threads
- -x: base name of the index built earlier, including splice sites and exons (minus the trailing .X.ht2)
- -1: file of paired reads, mate 1
- -2: file of paired reads, mate 2
- -S: output file of aligned reads in SAM format
- --summary-file: summary information about how the mapping went
- --rna-strandness: strandedness of the library; remember ours is reverse-stranded (RF)
- --dta: report alignments tailored for transcript assemblers (e.g., StringTie)
Look at the output files:
ls mapping/
cat mapping/Pca_Ribes_rep1_sub_summaryfile.txt
Write a loop to map the remaining samples.
for reads in data_trimmed/*_1.paired.trimmed.fq.gz;
do
file=$(basename $reads _1.paired.trimmed.fq.gz);
in1=$RNA_HOME/data_trimmed/${file}_1.paired.trimmed.fq.gz
in2=$RNA_HOME/data_trimmed/${file}_2.paired.trimmed.fq.gz
hisat2 -p $THREADS -x refs/${reference%.fa} -1 $in1 -2 $in2 \
-S mapping/${file}.sam --summary-file mapping/${file}_summaryfile.txt \
--dta --rna-strandness RF
echo ${file} mapped;
done
If you’re familiar with the SAM format, you know this is the human-readable file describing the reads that mapped (or did not map) to the genome. To save space and prepare the files for downstream analysis, we use samtools to convert the mapped reads to BAM format (view), sort them by position (sort) and create an index (index) for each file.
cd $RNA_HOME/mapping
# convert and pipe directly to sort
samtools view -@ $THREADS -bS Pca_Ribes_rep1_sub.sam | \
samtools sort -@ $THREADS - -o Pca_Ribes_rep1_sub_sorted.bam
# index
samtools index -@ $THREADS Pca_Ribes_rep1_sub_sorted.bam
Convert, sort and index the remaining files. Once they are done, remove the SAM files.
for reads in *.sam;
do
file=$(basename $reads .sam);
samtools view -@ $THREADS -bS ${file}.sam | \
samtools sort -@ $THREADS - -o ${file}_sorted.bam;
samtools index -@ $THREADS ${file}_sorted.bam;
echo ${file} bam sorted, indexed;
done
As with mapping, there are many options for counting mapped reads. Corchete et al. (2020) found that pipelines including HTSeq were consistently among the highest ranked in their analyses. There are other options: featureCounts is very flexible, but with that flexibility come many opportunities to make mistakes that have large effects on your count matrices; StringTie can identify, assign reads to, and quantify transcripts.
To see how many reads mapped to each gene, we will use HTSeq. By default, HTSeq counts one alignment per read; if there are two or more alignments for a read, HTSeq takes the one with the highest score (the primary alignment), unless you specify otherwise. Look at the options for HTSeq with htseq-count -h, or find more details in the manual (https://htseq.readthedocs.io/en/release_0.11.1/count.html).
Create an output directory (counting) for your read count files.
mkdir -p $RNA_HOME/counting
We need to specify which genomic feature htseq-count should use when evaluating aligned reads. Features are identified in the 3rd column of Pca_annotation.gtf (use head -10 refs/$gtf_file to check it out). In this case, we plan to perform a differential whole-gene expression analysis: we are interested in reads that map to exon features, and we use the option -t to specify the feature type in the HTSeq command (exon is the default). However, we want to count all of the reads mapping to each gene, and genes are identified by gene_id in the last column of the gtf (take a look at the last column of the annotation to see the naming format). We use the option -i to specify this identifier in the command below.
Count the first sample.
# move into the main directory
cd $RNA_HOME
# count mapped reads at the gene-level for the first sample
file=Pca_Ribes_rep1_sub
htseq-count mapping/${file}_sorted.bam refs/$gtf_file \
-f bam -r pos -a 10 -s reverse -t exon -i gene_id -m union --nonunique none > \
counting/${file}_geneCounts.txt
Explanation of options:
- -f: input format (sam or bam)
- -r: whether the alignment is sorted by position (pos) or read name (name)
- -a: minimum alignment quality; reads below this value are skipped (here we specify 10, even though it is the default)
- -s: whether the reads are forward stranded (yes, the default), reverse stranded (reverse), or unstranded (no)
- -t: feature type (3rd column of the GFF/GTF) to count over
- -i: GFF attribute to use as the feature ID
- -m: mode for handling reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union). Here we keep the default; check the manual for an explanation of which mapping situations each mode includes or excludes.
- --nonunique none: discards reads that map to multiple feature IDs (limits 'fusion genes'). This should not be used if you are interested in counting exons for differential exon usage analysis. (Do you know why not?)
Count the reads for the remaining samples.
for reads in mapping/*_sorted.bam;
do
file=$(basename $reads _sorted.bam);
htseq-count mapping/${file}_sorted.bam refs/$gtf_file \
-f bam -r pos -a 10 -s reverse -t exon -i gene_id -m union --nonunique none > \
counting/${file}_geneCounts.txt;
echo $file counted;
done
To create a count matrix from these individual count files, we use paste. The matrix should look something like this:
## gene_id Pca_Ribes_rep1 Pca_Ribes_rep2 Pca_Ribes_rep3 ...
## 1 Polcal_g1 5 25 19 ...
## 2 Polcal_g2 0 6 3 ...
Confirm that the gene ids are in the same order in every count file.
#create a list of ids from each count file
cd $RNA_HOME/counting
ls *geneCounts.txt | while read file;
do cat $file | cut -f 1 > $file.id;
done
# compare the md5sum values of the id files
md5sum *.id
# remove the id files you just made
rm *id
Now create the matrix file.
paste *geneCounts.txt | cut -f 1,2,4,6,8,10,12,14,16,18 > Pca_count.matrix.txt
To get a final matrix with column headers, use echo and cat to add the header to the count matrix.
echo -e "gene_id\tPca_Ribes_rep1\tPca_Ribes_rep2\tPca_Ribes_rep3\tPca_Salix_rep1\tPca_Salix_rep2\tPca_Salix_rep3\tPca_Udi_rep1\tPca_Udi_rep2\tPca_Udi_rep3" > header.txt
cat header.txt Pca_count.matrix.txt > Pca_count.matrix.final.txt
# check the matrix
head Pca_count.matrix.final.txt
This is a somewhat convoluted and fiddly process; luckily, DESeq2 has a built-in method (DESeqDataSetFromHTSeqCount, mentioned earlier) for importing HTSeq count files directly. But now you know one way to do it yourself!
Much of this material has been adapted through the years from a Harvard University course: https://github.com/hbctraining/DGE_workshop/tree/master/lessons