Quantification and differential expression analysis
Table of contents
- Quantification and Differential gene expression analysis
- exercise 1: Data inspection, prepare for genomic alignment
- exercise 2: Quantify with the RSEM programs
- exercise 3: Differential gene expression analysis with DESeq
- exercise 4: Genome alignment of RNA-seq reads
Expected learning outcome
To understand the basics of RNA-Seq data, how to use RNA-Seq for different objectives and to familiarize yourself with some standard software packages for such analysis.
Getting started
The main goal of this activity is to go through a standard method to obtain gene expression values and perform differential gene expression analysis from an RNA-Seq experiment.
We will start performing gene quantification using the RSEM program and do differential gene expression analysis of the estimated counts using DESeq2. Ater that we do whole genome alignment and visualize the data using the TopHat2 spliced aligner.
All the data you need for this activity should be in the transcriptomics directory. Here we will use the genome.quantification and the fastq.quantification subdirectories.
Datasets and Software
We will be using the following bioinformatics tools:
- RSEM version v1.2.12 (http://www.biomedcentral.com/1471-2105/12/323)
- Picard 1.2.7 (http://broadinstitute.github.io/picard/)
- Bowtie2-2.1.0 (http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html)
- samtools 0.1.19 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/)
- RStudio 0.98 (http://www.rstudio.org/)
- IGV 2.3.40 (http://bib.oxfordjournals.org/content/14/2/178.abstract)
- igvtools 2.3.40 (http://bib.oxfordjournals.org/content/14/2/178.abstract)
Data should be available in two directories called fastq.quantification (12 FASTQ files) and genome.quantification (files related to the mouse genome).
fastq.quantification (short read data):
- control_rep1.1.fq
- control_rep1.2.fq
- control_rep2.1.fq
- control_rep2.2.fq
- control_rep3.1.fq
- control_rep3.2.fq
- exper_rep1.1.fq
- exper_rep1.2.fq
- exper_rep2.1.fq
- exper_rep2.2.fq
- exper_rep3.1.fq
- exper_rep3.2.fq
and
genome.quantification:
- ucsc.gtf (reduced annotation file)
- ucsc_into_genesymbol.rsem (isoform ID to official gene symbol translation)
- mm10.fa (FASTA file of the reduced genome)
Introduction
Sample pooling has revolutionized sequencing. It is now possible to sequence tens of samples together. Different objectives require different sequencing depths. Doing differential gene expression analysis requires less sequencing depth than transcript reconstruction so when pooling samples it is critical to keep the objective of the experiment in mind. In this activity we will use subsets of experimentally generated datasets. The dataset used here was generated for differential gene expression analysis while the other dataset used for the optional exercise was generated for transcript annotation. For quantification we will use a set of data generated from the same strain as the genome reference mouse (C57BL/6J). The study concentrated on the effect of liver JNK signaling pathway in fatty acid metabolism (see Vernia et al, Cell Metabolism, 2014), the study generated triplicate libraries for 4 different genotypes (wildtype, Jnk1 KO, Jnk2 KO, and Jnk1, Jnk2 KO) in two different diets.
We selected three replicates from wild type (control) and three from the double knock-out strain under fat diet (exper), the two genotypes and condition having largest expression differences. We further created a reduced subset of the data that includes the most striking result of the paper. The idea is to find genes that are in the same pathway as the gene that were knock out. We will use a reduced genome consisting of the first 9.5 million bases of mouse chromosome 16 and the first 50.5 million bases of chromosome 7 for this workshop since, mapping and quantifying the reads on whole genome takes more time. We wanted to reduce the running time.
Exercise 1: Data inspection, prepare for genomic alignment
RSEM depends on an existing annotation and will only score transcripts that are present in the given annotation file.
The first step is to prepare the transcript set that we will quantify. We selected the UCSC genes which is a very comprehensive, albeit a bit noisy dataset. As with all the data in this activity we will only use the subset of the genes that map to the genome regions we are using.
We will be working in the transcriptomics directory:
cd ~/workshop_data/transcriptomics/genome.quantification
RSEM provides a program to generate indices for transcriptome alignment which is a one time process for each transcriptome. To generate our indices use
rsem-prepare-reference --gtf ucsc.gtf --transcript-to-gene-map ucsc_into_genesymbol.rsem --bowtie2 mm10.fa mm10.rsem
Which should extract transcript sequences and generate bowtie2 index files against these sequence. You can check that the program ran successfully by ensuring that the following files were created:
mm10.rsem.1.bt2 mm10.rsem.2.bt2 mm10.rsem.3.bt2 mm10.rsem.4.bt2 mm10.rsem.chrlist mm10.rsem.grp mm10.rsem.idx.fa mm10.rsem.rev.1.bt2 mm10.rsem.rev.2.bt2 mm10.rsem.seq mm10.rsem.ti mm10.rsem.transcripts.fa
Please note that the GTF file format is not fully standardized so many times you need to modify the gene_id and transcript_id labels. Especially, when you downlaod this file from UCSC, trancript_id and gene_id’s are the same and usually they are transcript_ids. So, if you want to quantify gene and isoforms, you need to replace transcript_id’s in gene_id field with official gene symbol. This has already been taken care of in the ucsc.gtf file for this workshop.
Question: What do you think each file that is created with rsem-prepare-reference is?
Exercise 2: Quantify with the RSEM program
2.1 Calculate expression
We will assume that your current directory is transcriptomics and that you have four subdirectories within:
bin data fastq.quantification genome.quantification
We also have fastq.reconstruction and genome.reconstruction directories but we are not going to use them in this part. You can use them in optional part about transcript reconstruction, if you have time after you finish this part.
We’ll add one more subdirectory to hold the quantification results:
cd ~/workshop_data/transcriptomics mkdir rsem
We are now ready to align and then attempt to perform read assignment and counting for each isoform in the file provided above. You must process each one of the 6 libraries: ( Here “\” end of each line denotes that the command will continue in the next line to increase the readibility of the commands, sometimes we are writing one line command in multiple lines. You can remove all “\” and run the command in one line, if you prefer.)
rsem-calculate-expression --paired-end -p 2 --bowtie2 \ --output-genome-bam fastq.quantification/control_rep1.1.fq \ fastq.quantification/control_rep1.2.fq genome.quantification/mm10.rsem \ rsem/ctrl1.rsem
And similarly for each of the other 5 libraries
[toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]rsem-calculate-expression --paired-end -p 2 --bowtie2 \ --output-genome-bam fastq.quantification/control_rep2.1.fq \ fastq.quantification/control_rep2.2.fq genome.quantification/mm10.rsem \ rsem/ctrl2.rsem
rsem-calculate-expression --paired-end -p 2 --bowtie2 \ --output-genome-bam fastq.quantification/control_rep3.1.fq \ fastq.quantification/control_rep3.2.fq genome.quantification/mm10.rsem \ rsem/ctrl3.rsem
rsem-calculate-expression --paired-end -p 2 --bowtie2 \ --output-genome-bam fastq.quantification/exper_rep1.1.fq \ fastq.quantification/exper_rep1.2.fq genome.quantification/mm10.rsem \ rsem/expr1.rsem
rsem-calculate-expression --paired-end -p 2 --bowtie2 \ --output-genome-bam fastq.quantification/exper_rep2.1.fq \ fastq.quantification/exper_rep2.2.fq genome.quantification/mm10.rsem \ rsem/expr2.rsem
rsem-calculate-expression --paired-end -p 2 --bowtie2 \ --output-genome-bam fastq.quantification/exper_rep3.1.fq \ fastq.quantification/exper_rep3.2.fq genome.quantification/mm10.rsem \ rsem/expr3.rsem[/toggle]
To ensure that you have run all commands successfully, you should check that your rsem directory contains the following result files:
ctrl1.rsem.genes.results ctrl1.rsem.genome.bam ctrl1.rsem.genome.sorted.bam ctrl1.rsem.genome.sorted.bam.bai ctrl1.rsem.isoforms.results ctrl1.rsem.stat ctrl1.rsem.transcript.bam ctrl1.rsem.transcript.sorted.bam ctrl1.rsem.transcript.sorted.bam.bai ctrl2.rsem.genes.results ctrl2.rsem.genome.bam ctrl2.rsem.genome.sorted.bam ctrl2.rsem.genome.sorted.bam.bai ctrl2.rsem.isoforms.results ctrl2.rsem.stat ctrl2.rsem.transcript.bam ctrl2.rsem.transcript.sorted.bam ctrl2.rsem.transcript.sorted.bam.bai ctrl3.rsem.genes.results ctrl3.rsem.genome.bam ctrl3.rsem.genome.sorted.bam ctrl3.rsem.genome.sorted.bam.bai ctrl3.rsem.isoforms.results ctrl3.rsem.stat ctrl3.rsem.transcript.bam ctrl3.rsem.transcript.sorted.bam ctrl3.rsem.transcript.sorted.bam.bai expr1.rsem.genes.results expr1.rsem.genome.bam expr1.rsem.genome.sorted.bam expr1.rsem.genome.sorted.bam.bai expr1.rsem.isoforms.results expr1.rsem.stat expr1.rsem.transcript.bam expr1.rsem.transcript.sorted.bam expr1.rsem.transcript.sorted.bam.bai expr2.rsem.genes.results expr2.rsem.genome.bam expr2.rsem.genome.sorted.bam expr2.rsem.genome.sorted.bam.bai expr2.rsem.isoforms.results expr2.rsem.stat expr2.rsem.transcript.bam expr2.rsem.transcript.sorted.bam expr2.rsem.transcript.sorted.bam.bai expr3.rsem.genes.results expr3.rsem.genome.bam expr3.rsem.genome.sorted.bam expr3.rsem.genome.sorted.bam.bai expr3.rsem.isoforms.results expr3.rsem.stat expr3.rsem.transcript.bam expr3.rsem.transcript.sorted.bam expr3.rsem.transcript.sorted.bam.bai
2.3 Create consolidated table
In the bin directory we provide a simple script to take all the independent RSEM output and combine it into a single table, which is needed for inspection and ready for differential gene expression analysis.
To find out what the script does you may type the following command in the transcriptomics directory
cd ~/workshop_data/transcriptomics perl bin/rsem.to.table.pl -help
To quit the help page, just press q.
We will generate two tables using two measures for gene expression level. These are expected_count and tpm values. ‘expected count’ is the number of expected RNA-seq
fragments assigned to the transcript given maximum-likelihood transcript abundance estimates. ‘tpm’ is the number of transcripts per million.
We are not going to use tpm values in this workshop but, using rsem.to.table.pl script you can create this table to use it normalized counts in your further analysis.
perl bin/rsem.to.table.pl -out rsem.gene.summary.count.txt -indir rsem -gene_iso genes -quantType expected_count perl bin/rsem.to.table.pl -out rsem.gene.summary.tpm.txt -indir rsem -gene_iso genes -quantType tpm
Question: Can you find genes that look affected by the experiment?
Hint: To be able to look the data, you can use the following command;
column -t rsem/expr1.rsem.genes.results | less -S
2.4 Visualize the raw data: Make a IGV genome with the transcriptome
We need to create an artificial genome composed of the transcripts we used to annotate. Launch your IGV browser from the terminal in x2goclient:
igv.sh
Then, from the top menu, select
genome -> create .genome file …
A pop up box should ask for information on the genome. Give the genome a name such as for example mm10ReducedTranscriptome. The name is important as it will be used later, if you choose to name this genome differently take note and make sure you keep using that name in what follows. We will refer to this reduced transcriptopme as “mm10ReducedTranscriptome” and use the same for description.
Click on browse and navigate to your genome.quantification directory and select the mm10.rsem.transcripts.fa file and save the transcripts genome file in the same genome.quantification directory.
This also offers a good example of how to use IGV when working with an assembly or non-published genome sequence. In this cases, you create a genome file in the same way we created the transcript genome.
Open a new terminal to create tdf files.
IGV is highly versatile and can display a large number of file formats. In this exercise we will try two formats, sorted and indexed bam file and a TDF file. the tdf is a reduced/lighter version of the bam that only contains coverage information.
The TDF file will be generated from the bam file and will be faster to work with in IGV since the converage is precalculated. The bam file is calculates the coverage interactively, but has the advantage of displaying the individual reads and sequencing errors and SNPs.
To compare these two file formats, you can either create a TDF file or open bam file. To create tdf file:
igvtools count -w 5 rsem/ctrl1.rsem.transcript.sorted.bam rsem/ctrl1.rsem.transcript.sorted.bam.tdf \ genome.quantification/mm10.rsem.transcripts.fa
Go back to IGV and choose the mm10ReducedTranscriptome genome in the drop down menu at the top left corner. Open the bam file (rsem/ctrl1.rsem.transcript.sorted.bam) and tdf file you just generated.
Have a closer look at one of the genes called fgf21 by typing its ucsc_id (uc009gwe.1) into the search window on IGV and press go. Explore the different bam files you generated and look at the gene expression of the double knockout mice (expr files) versus the wildtype (ctrl files).
Can you see any difference in the expression patterns?
We will come back to the results of exercise 1 later where we will compare the results from RSEM and from Tophat (exercise 4).
We now have the files needed to identify differentially expressed genes.
Exercise 3: Differential gene expression analysis with DESeq
If you reach to this level, you can either skip to excersice 4 or wait for the rest of the class.
Because, we are going to explain the R code to make the plots and DESeq analysis step by step.
To do differential expression analysis we are going to use R and RStudio. To reach RStudio, first, please go to amazon console and copy your “Public DNS” from the “Instance” list.
Please, open chrome or firefox in your desktop in X2GoClient and login to rstudio using the credentials below.
username: ubuntu password: evomics2015
DESeq2 is an R package available via Bioconductor and is designed to normalise count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression.
The code we are going to use is in the directory below that you can open from RStudio.
~/workshop_data/transcriptomics/bin/deseq_plots.R
To open it from RStudio, please click to the “Files” Tab in the lower right panel. From home, move into the correct directory and then open deseq_plots.R file.
You can also reach the code we are going to use in RStudio from below. You can either follow the class from here or by opening the file from RStudio.
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]
#Tip: To run any line in RStudio you can use Ctrl+Enter or Command+Enter.
#To use any package in R, we need to load its library first.To run DESeq2
#you just need to type the following.
library("DESeq2")
#We are going to use other libraries and lets load them too.
library("ggplot2")
library("RColorBrewer")
library("gplots")
#1. Reading the data.
# We already made gene quantifications and merged expected counts into
# a single table. Here we prepared another table using whole reads
# (not only reduced reads) to give you an idea about the complete
# picture of DESeq analysis.
file <- "~/workshop_data/transcriptomics/data/deseq_dataset.tsv"
rsem<-read.table(file)
head(rsem)
#2. Read the data with header. We set the header=TRUE. In this way,
# the first row will be in the header of the data structure.
rsem <- read.table(file,sep="\t", header=TRUE)
#3. Read the data with row.names. We have the gene names in the
# first column in this table. When we set row.names=1 It will remove
# the first column from the data and put them to the row name section.
rsem <- read.table(file,sep="\t", header=TRUE, row.names=1,
stringsAsFactors = TRUE)
#4. Creating the data structure for DESeq Analysis, we need to select
# the columns that we are going to use in the analysis. Here we are
# going to use experiment and control data that will be 6 coulmns.
# Here in this step make sure to include the columns you want to use in
# your analysis.
data <- data.frame(rsem[, c("exper_rep1","exper_rep2","exper_rep3",
"control_rep1","control_rep2","control_rep3")])
#5. DESeq analysis is using Negative Binomial Distribution to model and
# test two conditions. Basically the quantifications per gene has to be
# count data. In this case we used RSEM to calculate the # reads per gene.
# RSEM estimates this value using a likelihood model that is why the
# values are not integer. We need to convert those values to integer.
cols <- c(1:6);
data[,cols] <- apply(data[,cols], 2,
function(x) as.numeric(as.integer(x)))
#6. Just to see the distribution of the reads we can use hist function in
# log10 scale.
hist(log10(data$exper_rep1), breaks=100)
#7. Now we are ready to look whole data with scatter plot. First lets
# calculate the average reads per gene. To find the average we can sum
# experiments per gene and divide it to the number of replicas.
# In this case we will divide the sum to 3. We will calculate the
# average for control libraries too. After that, we are ready to merge
# the average values from experiment and control to create a two column
# data structure using cbind.
avgall<-cbind(rowSums(data[1:3])/3, rowSums(data[4:6])/3)
# We can also change the column names using colnames function.
colnames(avgall)<-c("Treat", "Control")
#7. Make a simple scatter plot.
plot(avgall)
#8. Hmm!!! The values are ranging from 0 to 800k. So, let's use log2.
plot(log2(avgall))
#9. Let's change the x and y labels and plot again.
log2vals <- log2(avgall)
colnames(log2vals)<-c("log2(Treat)", "log2(Control)")
plot(log2vals)
#10. Let's make this pretty using ggplot.
gdat<-data.frame(avgall)
ggplot() +
geom_point(data=gdat, aes_string(x="Treat", y="Control"),
colour="black", alpha=6/10, size=3) +
scale_x_log10() +scale_y_log10()
######################################
## LETS START DESeq ANALYSIS
######################################
#
#11. We need to Define conditions for each library.
# Make sure the order of the columns matches to your table.
conds <- factor( c("Control","Control", "Control",
"Treat", "Treat","Treat") )
# DESeq function requires a special data structure called data frame.
colData <- as.data.frame((colnames(data)))
colData <- cbind(colData, conds)
colnames(colData) <- c("Libs","group")
groups <- factor(colData[,2])
#12. Filter out the genes if the # of total reads in all libs below 10.
# You can decide this value by looking your histogram we created to
# asses what this number should be.
sumd <- rowSums(data)
hist(log10(sumd), breaks=100)
# To select the sum of the rows > 10, we can use subset function
filtd <- subset(data, sumd > 10)
#13. Create DESeq data set using prepared colData table for condition
# definition and filtered data.
dds <- DESeqDataSetFromMatrix(countData=as.matrix(filtd),
colData=colData, design = ~ group)
#14. Run DESeq analysis
dr <- DESeq(dds);
#15. Put the results into variable.
res <- results(dr);
#16. Look for the column descriptions in the results
mcols(res, use.names=TRUE)
#17. Now we are ready to select the ones has lower padj values.
# higher fold changes and visualize them on our scatter plot with different
# color. padj values are corrected p-values which are multiplied by the number
# of comparisons. Here we are going to use 0.01 for padj value and > 1 log2foldchange.
# (1/2 < foldChange < 2)
f1 <- res[!is.na(res$padj) & !is.na(res$log2FoldChange),]
res_selected <- f1[(f1$padj<0.01 & abs(f1$log2FoldChange)>1),]
#18. To Add a legend for all data we are going to add a text "Add" to
# whole values.
Legend <- "All"
gdat1 <- cbind(gdat, Legend)
gdat_selected <- gdat[rownames(res_selected),]
#19. Add a legend for only significant ones that we selected.
Legend <- "Significant"
gdat_selected1 <- cbind(gdat_selected, Legend)
#20. We now need to merge selected and all data to draw them
# together on a plot with different colors.
gdat2 <- rbind(gdat1, gdat_selected1)
#21. Make ggplot
ggplot() +
geom_point(data=gdat2, aes_string(x="Treat", y="Control",
colour="Legend"), alpha=6/10, size=3) +
scale_colour_manual(values=c("All"="darkgrey","Significant"="red"))+
scale_x_log10() +scale_y_log10()
#22. There is another way to visualize the same data using MAPlots.
# MA plots are basically a plot with log ratios(M) in the y axis vs.
# mean(A) scale in x axis.
gdatMA <- gdat2
#We can just add pseudocount value to prevent calculating infinite values
# if the a gene quantification is 0. This transformation will not change
# anything in data distribution.
gdatMA$Treat <- gdatMA$Treat+0.0001
gdatMA$Control <- gdatMA$Control+0.0001
g <- gdatMA
colnames(g) <- c("A", "M", "Legend")
# Here A <- ( log2(x) + log2(y) )/2 (Average in log scale)
g$A <- log2(gdatMA$Treat*gdatMA$Control/2)
# Here M <- log2(x) - log2(y) (Minus in log scale)
g$M <- log2(gdatMA$Treat/gdatMA$Control)
# You can use any normalization method that is more suitable to your
# data in this part. Here We are going to use scale function.
# However, it might not be the good way for all type of data.
g$A <- scale(g$A, center=TRUE, scale=TRUE)
g$M <- scale(g$M, center=TRUE, scale=TRUE)
ggplot() +
geom_point(data=g, aes_string(x="A", y="M", colour="Legend"),
alpha=6/10, size=3) + ylab("Log2 Fold Change (M)") +
xlab("Log2 mean normalized counts (A)") +
scale_colour_manual(values=c("All"="black","Significant"="red"))+
geom_abline(slope=0, linetype=2)
#23. If you want to save any ggplot as pdf use the command below
ggsave("~/workshop_data/transcriptomics/MAplot.pdf")
#24. For MA Plot there is another builtin function that you can use.
plotMA(dr,ylim=c(-2,2),main="DESeq2");
#25. The third way of visualizing the data is making a Volcano Plot.
# Here on the x axis you have log2foldChange values and y axis you
# have your -log10 padj values. To see how significant genes are
# distributed. Highlight genes that have an absolute fold change > 2
# and a padj < 0.01
res$threshold = as.factor(abs(res$log2FoldChange) > 1 & res$padj < 0.01)
res$log10padj = -log10(res$padj)
dat<-data.frame(cbind(res$log2FoldChange, res$log10padj, res$threshold))
#Define your column names
colnames(dat)<-c("log2FoldChange", "log10padj", "threshold")
##Construct the plot object
ggplot(data=dat, aes_string(x="log2FoldChange", y="log10padj",
colour="threshold")) + geom_point(alpha=0.4, size=1.75) +
theme(legend.position = "none") +
xlim(c(-2.5, 2.5)) + ylim(c(0, 15)) +
xlab("log2 fold change") + ylab("-log10 p-value")
#26. Adding more information to your dataset and save as an excel file
# You can select the significant genes from the dataset and add their
# log2foldChange and padj values to the dataset
calc_cols<-res[, c("log2FoldChange", "padj")]
f1<-cbind(data[rownames(calc_cols), ], calc_cols)
f1<-f1[!is.na(res$padj), ]
res_selected<-f1[f1$padj<0.01 & abs(f1$log2FoldChange)>1, ]
write.csv(res_selected, "~/workshop_data/transcriptomics/selected_genes.csv")
# You can write all of your data to a csv file.
write.csv(data, "~/workshop_data/transcriptomics/all_data.csv")
#27. The forth way of visualizing the data that is widely used in this
# type of analysis is clustering and Heatmaps.
# Here we usually add a pseudocount value 0.1 in this case.
ld <- log2(filtd[rownames(res_selected),]+0.1)
#Scaling the value using their mean centers can be good it the data is
# uniformely distributed in all the samples.
cldt <- scale(t(ld), center=TRUE, scale=TRUE);
cld <- t(cldt)
# We can define different distance methods to calculate the distance
# between samples. Here we focus on euclidean distance and correlation
#a. Euclidean distance
distance<-dist(cldt, method = "euclidean")
# To plot only the cluster you can use the command below
plot(hclust(distance, method = "complete"),
main="Euclidean", xlab="")
#The heatmap
heatmap.2(cld, Rowv=TRUE,dendrogram="column",
Colv=TRUE, col=redblue(256),labRow=NA,
density.info="none",trace="none", cexCol=0.8);
#b. Correlation between libraries
#Here we calculate the correlation between samples.
dissimilarity <- 1 - cor(cld)
# We define it as distance. This will create a square matrix that
# will include the dissimilarites between samples.
distance <- as.dist(dissimilarity)
# To plot only the cluster you can use the command below
plot(hclust(distance, method = "complete"),
main="1-cor", xlab="")
#The heatmap
heatmap.2(cld, Rowv=TRUE,dendrogram="column",
Colv=TRUE, col=redblue(256),labRow=NA,
density.info="none",trace="none", cexCol=0.8)
[/toggle]
The main goals of this section are to:
Use basic plots to look at the results
Perform differential gene expression analysis
View and manipulate the raw data including hierarchical clustering and heatmaps
Exercise 4: Genome alignment of RNA-seq reads
RSEM depends on an existing annotation and will only scores transcripts that are present in the given annotation file.
If we want to do de-novo RNASeq quantification. We cannot use RSEM. We need to map the reads to whole genome. In this way, the reads mapped to undefined regions in annotation file can be quantified.
We will compare the alignments produced by RSEM (exercise 2) and tophat (this exercise) and the difference will become clear.
The fastq.quantification folder contains a relative small set of illumina sequencing reads. We will examine this set by first directly mapping to the reduced mouse genome using tophat.
Make sure you are in the transcriptomics directory for this activity. genome.quantification, genome.reconstruction, fastq. quantification and fastq.reconstruction should be subdirectories. Check this before you proceed.
To avoid cluttering the workspace we will direct the output of each exercise to its own directory. In this case for example:
cd ~/workshop_data/transcriptomics mkdir tophat
First build the index files using bowtie2 (for details of indexing, see exercise 1).
bowtie2-build genome.quantification/mm10.fa genome.quantification/mm10
The following file should be generated in genome.quantification directory:
mm10.1.bt2 mm10.2.bt2 mm10.3.bt2 mm10.4.bt2 mm10.rev.1.bt2 mm10.rev.2.bt2
Then align each of the libraries to the genome. The fastq.quantification subdirectory contains six different libraries, three for a control experiment from wild type mouse liver and from mouse that are deficient in two different proteins. Each genotype was sequenced in triplicates using paired-end 50 base paired reads.
To first explore the data visually in IGV, we'll use the TopHat2 aligner to map these reads to our reduced genome:
tophat2 --library-type fr-firststrand --segment-length 20 \ -G genome.quantification/ucsc.gtf -o tophat/th.quant.ctrl1 \ genome.quantification/mm10 fastq.quantification/control_rep1.1.fq \ fastq.quantification/control_rep1.2.fq
Here the parameters we used in tophat explained below;
--library-type: The default is unstranded (fr-unstranded). If either fr-firststrand or fr-secondstrand is specified.
--segment-length: Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.
-G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
-o: output directory
Please visit tophat manual to learn the rest of the parameters using the following link; http://ccb.jhu.edu/software/tophat/manual.shtml
And using this command as a template, align the other libraries
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]tophat2 --library-type fr-firststrand --segment-length 20 \ -G genome.quantification/ucsc.gtf -o tophat/th.quant.ctrl2 \ genome.quantification/mm10 fastq.quantification/control_rep2.1.fq \ fastq.quantification/control_rep2.2.fq tophat2 --library-type fr-firststrand --segment-length 20 \ -G genome.quantification/ucsc.gtf -o tophat/th.quant.ctrl3 \ genome.quantification/mm10 fastq.quantification/control_rep3.1.fq \ fastq.quantification/control_rep3.2.fq tophat2 --library-type fr-firststrand --segment-length 20 \ -G genome.quantification/ucsc.gtf -o tophat/th.quant.expr1 \ genome.quantification/mm10 fastq.quantification/exper_rep1.1.fq \ fastq.quantification/exper_rep1.2.fq tophat2 --library-type fr-firststrand --segment-length 20 \ -G genome.quantification/ucsc.gtf -o tophat/th.quant.expr2 \ genome.quantification/mm10 fastq.quantification/exper_rep2.1.fq \ fastq.quantification/exper_rep2.2.fq tophat2 --library-type fr-firststrand --segment-length 20 \ -G genome.quantification/ucsc.gtf -o tophat/th.quant.expr3 \ genome.quantification/mm10 fastq.quantification/exper_rep3.1.fq \ fastq.quantification/exper_rep3.2.fq[/toggle]
Question: What percent of reads were mapped for each library? [toggle hide="yes" border="yes" style="white" title_open="Hide Hint" title_closed="Show Hint"] Check the tophat reports of tophat for each of the six libraries, in particular the align_summary.txt file
[/toggle]
Tophat always creates reports its alignment in a file named "accepted_hits.bam". To make things clear we'll move this files onto a clean directory. Move the files by, for example, doing
cd ~/workshop_data/transcriptomics mv tophat/th.quant.ctrl1/accepted_hits.bam tophat/th.quant.ctrl1.bam mv tophat/th.quant.ctrl2/accepted_hits.bam tophat/th.quant.ctrl2.bam mv tophat/th.quant.ctrl3/accepted_hits.bam tophat/th.quant.ctrl3.bam mv tophat/th.quant.expr1/accepted_hits.bam tophat/th.quant.expr1.bam mv tophat/th.quant.expr2/accepted_hits.bam tophat/th.quant.expr2.bam mv tophat/th.quant.expr3/accepted_hits.bam tophat/th.quant.expr3.bam
To visualize the alignments we generate indexes (for rapid data access) and compressed read density plots:
picard-tools BuildBamIndex I=tophat/th.quant.ctrl1.bam O=tophat/th.quant.ctrl1.bam.bai
and with all the other libraries:
picard-tools BuildBamIndex I=tophat/th.quant.ctrl2.bam O=tophat/th.quant.ctrl2.bam.bai picard-tools BuildBamIndex I=tophat/th.quant.ctrl3.bam O=tophat/th.quant.ctrl3.bam.bai picard-tools BuildBamIndex I=tophat/th.quant.expr1.bam O=tophat/th.quant.expr1.bam.bai picard-tools BuildBamIndex I=tophat/th.quant.expr2.bam O=tophat/th.quant.expr2.bam.bai picard-tools BuildBamIndex I=tophat/th.quant.expr3.bam O=tophat/th.quant.expr3.bam.bai[/toggle]
Finally create read density files to be able to look at all libraries together
igvtools count -w 5 tophat/th.quant.ctrl1.bam tophat/th.quant.ctrl1.bam.tdf genome.quantification/mm10.fa
NOTE: You need to refer to the genome file (the last argument above) using the same name you used above. Similarly create density files for all other libraries.
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]igvtools count -w 5 tophat/th.quant.ctrl2.bam tophat/th.quant.ctrl2.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.ctrl3.bam tophat/th.quant.ctrl3.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.expr1.bam tophat/th.quant.expr1.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.expr2.bam tophat/th.quant.expr2.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.expr3.bam tophat/th.quant.expr3.bam.tdf genome.quantification/mm10.fa[/toggle]
Ideally, being mouse data we would use the mouse (mm10) genome provided by IGV. Data would be concentrated to the regions of chromosomes 7 and 16 we aligned to. However, the mm10 genome version may not be available from the install directory.
In this case it is better to make a new "genome" instead of downloading the full mm10 genome. This excercise is a little bit different than the one in excercise 2. In excercise 2 we visualized only transcripts. In that case, the introns were spliced out and we could visualize only exonic regions.
However, since tophat aligns whole genome we see intronic and intergenic alignments too.
To create the reduced genome for IGV:
Launch your IGV browser from the terminal in x2goclient. (Tip: Run igv.sh).
In the top menu select
genome --> create .genome file
You may enter any name and description of your liking, but you need to use this name in the next set of instructions. We will name it mm10.reduced
For the FASTA file use the genome.quantification/mm10.fa file
and for the Gene file use the genome.quantification/ucsc.gtf file
Please load some of the files you prepared. (ie: tophat/th.quant.expr1.bam.bam and tophat/ctrl1.quant.expr1.bam.bam or tophat/th.quant.expr1.bam.tdf and tophat/ctrl1.quant.expr1.bam.tdf)
Some of the genes are good examples of differentially expressed genes. For example the whole region around the key Fgf21 gene is upregulated in experiment vs controls, while the gene Crebbp is downregulated in experiments vs controls. To point your browser to either gene just type or copy the name of the gene in the location box at the top.
You can also compare the results with rsem genomic alignments. Make sure you are loading genome alignments not transcript in rsem results, since both alignment results exist in the rsem directory.
If you want to compare tdf files, you can use the command below.
igvtools count -w 5 rsem/ctrl1.rsem.genome.sorted.bam rsem/ctrl1.rsem.genome.sorted.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 rsem/expr1.rsem.genome.sorted.bam rsem/expr1.rsem.genome.sorted.bam.tdf genome.quantification/mm10.fa
Question: What is different in tophat and rsem alignments?