Table of contents
- Part I: Quantification and Differential gene expression analysis
- exercise 1: Data inspection, prepare for genomic alignment
- exercise 2: Genomic alignment of RNA-Seq reads
- exercise 3: Data visualization
- exercise 4: Quantification with RSEM
- exercise 5: Visualization of transcript level data
- exercise 6: Differential gene expression with DESeq
- Part II: Transcript reconstruction
- exercise 1: Prepare for Alignment
- exercise 2: Genomics alignment
- exercise 3: preparing and viewing alignments
- exercise 4: Transcript reconstruction using scripture
- exercise 5 (Optional): Transcript reconstruction using cufflinks
- exercise 6 (Optional): Analysis of the reconstruction with Cufflinks
expected learning outcome
To understand the basics of RNA-Seq data, how to use RNA-Seq for different objectives and to familiarize yourself with some standard software packages for such analysis.
getting started
Sample pooling has revolutionized sequencing. It is now possible to sequence 10s of samples together. Different objectives require different sequencing depths. Doing differential gene expression analysis requires less sequencing depth than transcript reconstruction so when pooling samples it is critical to keep the objective of the experiment in mind. In this activity we will use subsets of experimentally generated datasets. One dataset was generated for differential gene expression analysis while the other towards transcript annotation. For quantification we will use a set of data generated from the same strain as the genome reference mouse (C57BL/6J). The study concentrated on the effect of liver JNK signaling pathway in fatty acid methabolism (see Vernia et al, Cell Metabolism, 2014), the study generated triplicate libraries for 4 different genotypes (wildtype, Jnk1 KO, Jnk2 KO, and Jnk1, Jnk2 KO) in two different diets.
We selected three replicates from control (wild type) and three from the double knock-out strain under fat diet, the two genotypes and condition having largest expression differences. We further created a reduced subset of the data that includes the most striking result of the paper. The idea is to find genes that are in the same pathway as the gene that were knock out. We will use a reduced genome consisting of the first 9.5 million bases of mouse chromosome 16 and the first 50.5 million bases of chromosome 7.
For transcript reconstruction we will analyze a partial dataset from a pathogen stimulated Dendritic Cells (DC) time course. Specifically, we will look at reads aligning to chr1:16170000-66500000 from three libraries: an RNA-Seq library from unstimulated cells, an RNA-Seq library from RNA extracted 4 hours post stimulation, and a H3K4me3 Chip-Seq library from unstimulated DCs. We will work against a âgenomeâ consisting only of the first 66.55 megabases of mouse chromosome 1 (NCBI build 37).
Part I: Quantification and differential gene expression analysis
The main goal of this activity is to go through a standard method to obtain gene expression values and perform differential gene expression analysis from an RNA-Seq experiment.
We will start by alignment and visualizing the data using the TopHat2 spliced aligner. We will then perform gene quantification using the RSEM program and finally differential gene expression analysis of the estimated counts using DESeq.
All the data you need for this activity should be in the transcriptomics directory. Here we will use the genome.quantification and the fastq.quantification subdirectories.
Before you start
Ensure that you can access the following software:
- RSEM version v1.2.17
- Picard 1.119
- Bowtie2-2.1.0
- samtools 0.1.19
- this version of scripture (http://garberlab.umassmed.edu/software/scripture/downloads/scripture_beta_82012.jar)
- Java 6 or later
- RStudio
- IGV 2.3 or later
Data should be available in your
fastq.quantification (short read data):
- control_rep1.1.fq
- control_rep1.2.fq
- control_rep2.1.fq
- control_rep2.2.fq
- control_rep3.1.fq
- control_rep3.2.fq
- exper_rep1.1.fq
- exper_rep1.2.fq
- exper_rep2.1.fq
- exper_rep2.2.fq
- exper_rep3.1.fq
- exper_rep3.2.fq
and
genome.quantification (reduced genome and transcriptome data):
- ucsc.gtf
- gene.to.ucsc.transcriptid
- mm10.fa
We will download the following two scripts in the bin directory:
cd transcriptomics/bin
wget “http://garberlab.umassmed.edu/data/transcriptomics/bin/rsem.to.table.perl”
sudo mv rsem.to.table.perl rsem.to.table.pl
wget “http://garberlab.umassmed.edu/data/transcriptomics/bin/rsem.quant.R”
In what follows we assume that bowtie2, tophat2, samtools and RSEM part of your PATH and hence you can execute them without having to invoke them using their full path to the installation directory.
If you can not invoke, say samtools without getting a command not found error you need to configure your PATH variable
If you are using a bash shell you may set your PATH using:
export PATH=$PATH:<path-to-samtools>:<path-to-bowtie2>:<path-to-tophat2>:<path-to-rsem>
Where each <path-to- > needs to be replaced by the correct installation directory.
RSEM depends on an existing annotation and will only scores transcripts that are present in the given annotation file. We will compare the alignments produced by RSEM and tophat and this will become clear.
The first step is to prepare the transcript set that we will quantify. We selected the UCSC genes which is a very comprehensive, albeit a bit noisy dataset. As with all the data in this activity we will only use the subset of the genes that map to the genome regions we are using.
We will be working in the transcriptomics directory:
cd transcriptomics/genome.quantification
RSEM provides a program to generate indices to for transcriptome alignment which is a one time process for each transcriptome. To generate our indices use
rsem-prepare-reference \ --gtf ucsc.gtf --transcript-to-gene-map \ gene.to.ucsc.transcriptid --bowtie2 mm10.fa mm10.rsem
Which should extract transcript sequences and generate bowtie2 index files against these sequence. You can check that the program ran successfully by ensuring that the following files were created:
mm10.rsem.1.bt2 mm10.rsem.2.bt2 mm10.rsem.3.bt2 mm10.rsem.4.bt2 mm10.rsem.chrlist mm10.rsem.grp mm10.rsem.idx.fa mm10.rsem.n2g.idx.fa mm10.rsem.rev.1.bt2 mm10.rsem.rev.2.bt2 mm10.rsem.seq mm10.rsem.ti mm10.rsem.transcripts.fa
Question: Why do we need the –transcript-to-gene-map? What is the information in that table? Look at the GTF file, what is missing?
Question: What files were created? What do you think each one is?
We will also do a genome-wide alignment. The indices above will work for alignment of reads against transcript sequence, which was extracted by the rsem-prepare-reference and written to the mm10.rsem.transcripts.fa file above. For genome wide alignment we will now generate indices for the mm10.fa genome sequence:
bowtie2-build mm10.fa mm10
Which should generate the following indices:
mm10.1.bt2 mm10.2.bt2 mm10.3.bt2 mm10.4.bt2 mm10.rev.1.bt2 mm10.rev.2.bt2
You should check that the command successfully completed by ensuring that these files are in your directory.
Exercise 2: Quantify with the RSEM program
2.1 Calculate expression
We will assume that your current directory is transcriptomics and that you have three subdirectories within:
bin fastq.quantification genome.quantification
We’ll add one more subdirectory to hold the quantification results:
mkdir rsem We need form a soft link to samtools as well: sudo mkdir /usr/local/bin/sam sudo ln -s /usr/bin/samtools /usr/local/bin/sam/samtools
We are now is ready to align and then attempt to perform read assignment and counting for each isoform in the file provided above. You must process each one of the 6 libraries:
rsem-calculate-expression --paired-end --strand-specific -p 2 --bowtie2\ --output-genome-bam fastq.quantification/control_rep1.1.fq \ fastq.quantification/control_rep1.2.fq genome.quantification/mm10.rsem rsem/ctrl1.rsem
And similarly for each of the other 5 libraries
[toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]rsem-calculate-expression --paired-end --strand-specific -p 2 --bowtie2 --output-genome-bam \ fastq.quantification/control_rep2.1.fq fastq.quantification/control_rep2.2.fq \ genome.quantification/mm10.rsem rsem/ctrl2.rsem
rsem-calculate-expression --paired-end --strand-specific -p 2 --bowtie2 --output-genome-bam \ fastq.quantification/control_rep3.1.fq fastq.quantification/control_rep3.2.fq \ genome.quantification/mm10.rsem rsem/ctrl3.rsem
rsem-calculate-expression --paired-end --strand-specific -p 2 --bowtie2 --output-genome-bam \ fastq.quantification/exper_rep1.1.fq fastq.quantification/exper_rep1.2.fq \ genome.quantification/mm10.rsem rsem/expr1.rsem
rsem-calculate-expression --paired-end --strand-specific -p 2 --bowtie2 --output-genome-bam \ fastq.quantification/exper_rep2.1.fq fastq.quantification/exper_rep2.2.fq \ genome.quantification/mm10.rsem rsem/expr2.rsem
rsem-calculate-expression --paired-end --strand-specific -p 2 --bowtie2 --output-genome-bam \ fastq.quantification/exper_rep3.1.fq fastq.quantification/exper_rep3.2.fq \ genome.quantification/mm10.rsem rsem/expr3.rsem[/toggle]
To ensure that you have run all commands successfully, you should check that your rsem directory contains the following result files:
ctrl1.rsem.genes.results ctrl1.rsem.genome.bam ctrl1.rsem.genome.sorted.bam ctrl1.rsem.genome.sorted.bam.bai ctrl1.rsem.isoforms.results ctrl1.rsem.stat ctrl1.rsem.transcript.bam ctrl1.rsem.transcript.sorted.bam ctrl1.rsem.transcript.sorted.bam.bai ctrl2.rsem.genes.results ctrl2.rsem.genome.bam ctrl2.rsem.genome.sorted.bam ctrl2.rsem.genome.sorted.bam.bai ctrl2.rsem.isoforms.results ctrl2.rsem.stat ctrl2.rsem.transcript.bam ctrl2.rsem.transcript.sorted.bam ctrl2.rsem.transcript.sorted.bam.bai ctrl3.rsem.genes.results ctrl3.rsem.genome.bam ctrl3.rsem.genome.sorted.bam ctrl3.rsem.genome.sorted.bam.bai ctrl3.rsem.isoforms.results ctrl3.rsem.stat ctrl3.rsem.transcript.bam ctrl3.rsem.transcript.sorted.bam ctrl3.rsem.transcript.sorted.bam.bai expr1.rsem.genes.results expr1.rsem.genome.bam expr1.rsem.genome.sorted.bam expr1.rsem.genome.sorted.bam.bai expr1.rsem.isoforms.results expr1.rsem.stat expr1.rsem.transcript.bam expr1.rsem.transcript.sorted.bam expr1.rsem.transcript.sorted.bam.bai expr2.rsem.genes.results expr2.rsem.genome.bam expr2.rsem.genome.sorted.bam expr2.rsem.genome.sorted.bam.bai expr2.rsem.isoforms.results expr2.rsem.stat expr2.rsem.transcript.bam expr2.rsem.transcript.sorted.bam expr2.rsem.transcript.sorted.bam.bai expr3.rsem.genes.results expr3.rsem.genome.bam expr3.rsem.genome.sorted.bam expr3.rsem.genome.sorted.bam.bai expr3.rsem.isoforms.results expr3.rsem.stat expr3.rsem.transcript.bam expr3.rsem.transcript.sorted.bam expr3.rsem.transcript.sorted.bam.bai
2.3 Create consolidated table
In the bin directory we provide a simple script to take all the independent RSEM output and combine it into a single table, which is needed for inspection and ready for differential gene expression analysis.
To find out what the script does you may type the following command in the transcriptomics directory
perl bin/rsem.to.table.pl -help
We will generate two tables using two measures for gene expression level:
perl bin/rsem.to.table.pl -out rsem.gene.summary.count.txt -indir rsem -gene_iso genes -quantType expected_count perl bin/rsem.to.table.pl -out rsem.gene.summary.tpm.txt -indir rsem -gene_iso genes -quantType tpm
You should now take the time to inspect these tables, find genes that look affected by the experiment.
2.4 Visualize the raw data: Make a IGV genome with the transcriptome
We will need to create bam index files for each of the two alignments generated by RSEM. Before we do this, we need to create an artificial genome composed of the transcripts we used to annotate. Launch your IGV browser, from the top menu, select
genome -> create .genome file …
A pop up box should ask for information on the genome. Give the genome a name such as for example mm10Reduced. The name is important as it will be use later, if you choose to name this genome differently take note and make sure you keep using that name in what follows. We will refer to this reduced genome as “mm10Reduced” and use the same for description.
Click on browse and navigate to your genome.quantification directory and select the mm10.rsem.transcripts.fa file and save the transcripts.genome file in the same genome.quantification directory.
This also offers a good example of how to use IGV when working with an assembly or non-published genome sequence. In this cases, you create a genome file in the same way we created the transcript genome.
As we have done before we need to create TDF files for both the genome and transcript alignment e.g.:
igvtools count -w 5 rsem/ctrl1.rsem.transcript.sorted.bam rsem/ctrl1.rsem.transcript.sorted.bam.tdf \ genome.quantification/mm10Reduced.genome
and
igvtools count -w 5 rsem/ctrl1.rsem.genome.sorted.bam rsem/ctrl1.rsem.genome.sorted.bam.tdf \ genome.quantification/mm10Reduced.genome
Now we can compare the later with the tophat alignments and the former to the table we built.
Exercise 3: Differential gene expression analysis with DESeq
DESeq is an R package available via Bioconductor and is designed to normalise count data from high-throughput sequencing assays such as RNA-Seq and test for differential expression.
To run R you just need to type the following in the command line:
R
Then close R by typing:
q()
The R script rsem.quant.R in your bin folder, also available from the web, will be at the center of this activity. Open rsem.quant.R in a text editor such us emacs or vi. The comments within the script explain what each section does.
The tables we generated can be viewed in R with the “View” command, but this requires an X server properly installed in your machine. If you cannot use R’s view, you can also use any spreadsheet application such as Excel or OpenOffice.
[toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]library("DESeq2")
library("ggplot2")
library("RColorBrewer")
library("gplots")
#1. Read the data
file<-" /media/workshop_data/transcriptomics/deseq_data.tsv"
rsem<-read.table(file);
head(rsem)
#2. Read the data with header
rsem<- read.table(file,sep="\t", header=TRUE)
#3. Read the data with row.names
rsem<- read.table(file,sep="\t", header=TRUE, row.names=1,
stringsAsFactors = TRUE)
#4. Create data structure for DESeq Analysis
data <- data.frame(rsem[, c("exper_rep1","exper_rep2","exper_rep3",
"control_rep1","control_rep2","control_rep3")])
#5. Make sure all the cells in the table are integer
cols = c(1:6);
data[,cols] = apply(data[,cols], 2,
function(x) as.numeric(as.integer(x)))
#Make a histogram
hist(log(data$exper_rep1), breaks=100)
#6. Make a scatter plot with the averages of the replicas
avgall<-cbind(rowSums(data[1:3])/3, rowSums(data[4:6])/3)
colnames(avgall)<-c("Treat", "Control")
#7. Make a simple scatter plot
plot(avgall)
#8. Hmm!!! The values are ranging from 0 to 800k. So, let's use log2.
plot(log2(avgall))
#9. Let's change the x and y titles
log2vals <- log2(avgall)
colnames(log2vals)<-c("log2(Treat)", "log2(Control)")
plot(log2vals)
#10. Let's make this pretty using ggplot
library("ggplot2")
gdat<-data.frame(avgall)
ggplot() +
geom_point(data=gdat, aes_string(x="Treat", y="Control"),
colour="black", alpha=6/10, size=3) +
scale_x_log10() +scale_y_log10()
######################################
## LETS START DESeq ANALYSIS
######################################
#
#11. Define conditions for each library
conds <- factor( c("Control","Control", "Control",
"Treat", "Treat","Treat") )
colData = as.data.frame((colnames(data)))
colData<-cbind(colData, conds)
colnames(colData) = c("Control","group")
groups = factor(colData[,2])
#12. Filter out the genes if the # of total reads in all libs below 10.
sumd = apply(X=data,MARGIN=1,FUN=sum)
filtd = subset(data, sumd > 10)
#13. Create DESeq data set using prepared table.
dds = DESeqDataSetFromMatrix(countData=as.matrix(filtd),
colData=colData, design = ~ group)
#14. Run DESeq analysis
dds <- DESeq(dds);
#15. Put the results into variable.
res <- results(dds);
#16. Look for the column descriptions in the results
mcols(res, use.names=TRUE)
#17. select only significant ones
f1<-res[!is.na(res$padj) & !is.na(res$log2FoldChange), ]
res_selected<-f1[(f1$padj<0.01 & abs(f1$log2FoldChange)>1),]
#18. Add a legend for all data
Legend<-"All"
gdat1<-cbind(gdat, Legend)
gdat_selected<-gdat[rownames(res_selected),]
#19. Add a legend for only significant ones
Legend<-"Significant"
gdat_selected1<-cbind(gdat_selected, Legend)
#20. Merge selected and all data
gdat2<-rbind(gdat1, gdat_selected1)
#21. Make ggplot
ggplot() +
geom_point(data=gdat2, aes_string(x="Treat", y="Control", colour="Legend"),
alpha=6/10, size=3) +
scale_colour_manual(values=c("All"="darkgrey","Significant"="red"))+
scale_x_log10() +scale_y_log10()
#22. Make MA plot
gdatMA<-gdat2
gdatMA$Treat=gdatMA$Treat+1
gdatMA$Control=gdatMA$Control+1
g<-gdatMA
colnames(g)<-c("M", "A", "Legend")
g$M<-log2(gdatMA$Treat*gdatMA$Control/2)
g$A<-log2(gdatMA$Treat/gdatMA$Control)
g$M <- scale(g$M, center=TRUE, scale=TRUE);
g$A <- scale(g$A, center=TRUE, scale=TRUE);
ggplot() +
geom_point(data=g, aes_string(x="M", y="A", colour="Legend"),
alpha=6/10, size=3) + ylab("Log2 Fold Change (M)") +
xlab("Log2 mean normalized counts (A)") +
scale_colour_manual(values=c("All"="black","Significant"="red"))+
geom_abline(slope=0, linetype=2)
#23. If you want to save as pdf use the command below
ggsave("~/myplot.pdf")
#24. For MA Plot with a function
plotMA(dds,ylim=c(-2,2),main="DESeq2");
#25. For Volcano Plot
##Highlight genes that have an absolute fold change > 2 and a padj < 0.01
res$threshold = as.factor(abs(res$log2FoldChange) > 1 & res$padj < 0.01)
res$log10padj = -log10(res$padj)
dat<-data.frame(cbind(res$log2FoldChange, res$log10padj, res$threshold))
colnames(dat)<-c("log2FoldChange", "log10padj", "threshold")
##Construct the plot object
ggplot(data=dat, aes_string(x="log2FoldChange", y="log10padj",
colour="threshold")) + geom_point(alpha=0.4, size=1.75) +
theme(legend.position = "none") +
xlim(c(-2.5, 2.5)) + ylim(c(0, 15)) +
xlab("log2 fold change") + ylab("-log10 p-value")
#26. Adding more information to a dataset
## You can select the significant genes from the dataset and add their log2foldChange and padj values to the dataset
calc_cols<-[, c("log2FoldChange", "padj")]
f1<-cbind(data[rownames(calc_cols), ], calc_cols)
f1<-f1[!is.na(res$padj), ]
res_selected<-f1[f1$padj<0.01 & abs(f1$log2FoldChange)>1, ]
#27. For heatmap
ld <- log(filtd[rownames(res_selected),]+0.1,base=2)
cldt <- scale(t(ld), center=TRUE, scale=TRUE);
#a. Euclidean distance
distance<-dist(cldt, method = "euclidean")
#b. Correlation between libraries
cld <- t(cldt)
dissimilarity <- 1 - cor(cld)
distance <- as.dist(dissimilarity)
plot(hclust(distance, method = "complete"),
main="1-cor", xlab="")
heatmap.2(cld, Rowv=TRUE,dendrogram="column",
Colv=TRUE, col=redblue(256),labRow=NA,
density.info="none",trace="none");
[/toggle]
The main goals of this section are to:
- View and manipulate the raw data including a heatmaps
- Perform differential gene expression analysis
- Use basic plots to look at the results
Exercise 4: Genome alignment of RNA-seq reads
RSEM depends on an existing annotation and will only scores transcripts that are present in the given annotation file. We will compare the alignments produced by RSEM and tophat and this will become clear. The first step is to prepare the transcript set that we will quantify. We selected the UCSC genes which is a very comprehensive, albeit a bit noisy dataset. As with all the data in this activity we will only use the subset of the genes that map to the genome regions we are using.
The fastq.quantification folder contains a relative small set of illumina sequencing reads. We will examine this set by first directly mapping to the reduced mouse genome.
Make sure you are in the transcriptomics directory for this activity. genome.quantification, genome.reconstruction, fastq. quantification and fastq.reconstruction should be subdirectories. Check this before you proceed.
To avoid cluttering the workspace we will direct the output of each exercise to its own directory. In this case for example:
mkdir tophat
Then align each of the libraries to the genome. The fastq.quantification subdirectory contains six different libraries, three for a control experiment from wild type mouse liver and from mouse that are deficient in two different proteins. Each genotype was sequenced in triplicates using paired-end 50 base paired reads.
To first explore the data visually in IGV, we'll use the TopHat2 aligner to map these reads to our reduced genome:
tophat2 --library-type fr-firststrand --segment-length 20 -G genome.quantification/ucsc.gtf \ -o tophat/th.quant.ctrl1 genome.quantification/mm10 fastq.quantification/control_rep1.1.fq \ fastq.quantification/control_rep1.2.fq
And using this command as a template, align the other three different libraries
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]tophat2 --library-type fr-firststrand --segment-length 20 -G genome.quantification/ucsc.gtf \ -o tophat/th.quant.ctrl2 genome.quantification/mm10 fastq.quantification/control_rep2.1.fq \ fastq.quantification/control_rep2.2.fq
tophat2 --library-type fr-firststrand --segment-length 20 -G genome.quantification/ucsc.gtf \ -o tophat/th.quant.ctrl3 genome.quantification/mm10 fastq.quantification/control_rep3.1.fq \ fastq.quantification/control_rep3.2.fq
tophat2 --library-type fr-firststrand --segment-length 20 -G genome.quantification/ucsc.gtf \ -o tophat/th.quant.expr1 genome.quantification/mm10 fastq.quantification/exper_rep1.1.fq \ fastq.quantification/exper_rep1.2.fq
tophat2 --library-type fr-firststrand --segment-length 20 -G genome.quantification/ucsc.gtf \ -o tophat/th.quant.expr2 genome.quantification/mm10 fastq.quantification/exper_rep2.1.fq \ fastq.quantification/exper_rep2.2.fq
tophat2 --library-type fr-firststrand --segment-length 20 -G genome.quantification/ucsc.gtf \ -o tophat/th.quant.expr3 genome.quantification/mm10 fastq.quantification/exper_rep3.1.fq \ fastq.quantification/exper_rep3.2.fq[/toggle]
Question: What percent of reads were mapped for each library?
[toggle hide="yes" border="yes" style="white" title_open="Hide Hint" title_closed="Show Hint"] Check the tophat reports of tophat for each of the six libraries, in particular the align_summary.txt file[/toggle]
Tophat always creates reports its alignment in a file named "accepted_hits.bam". To make things clear we'll move this files onto a clean directory. Move the files by, for example, doing
mv tophat/th.quant.ctrl1/accepted_hits.bam tophat/th.quant.ctrl1.bam mv tophat/th.quant.ctrl2/accepted_hits.bam tophat/th.quant.ctrl2.bam mv tophat/th.quant.ctrl3/accepted_hits.bam tophat/th.quant.ctrl3.bam mv tophat/th.quant.expr1/accepted_hits.bam tophat/th.quant.expr1.bam mv tophat/th.quant.expr2/accepted_hits.bam tophat/th.quant.expr2.bam mv tophat/th.quant.expr3/accepted_hits.bam tophat/th.quant.expr3.bam
To visualize the alignments we generate indexes (for rapid data access) and compressed read density plots:
java -Xmx5g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat/th.quant.ctrl1.bam
and with all the other libraries:
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]java -Xmx5g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat/th.quant.ctrl2.bam java -Xmx5g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat/th.quant.ctrl3.bam java -Xmx5g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat/th.quant.expr1.bam java -Xmx5g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat/th.quant.expr2.bam java -Xmx5g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat/th.quant.expr3.bam[/toggle]
Ideally, being mouse data we would use the mouse (mm10) genome provided by IGV. Data would be concentrated to the regions of chromosomes 7 and 16 we aligned to. However, the mm10 genome version may not be available from the install directory. In this case it is better to make a new "genome" instead of downloading the full mm10 genome. For this:
Launch the IGV browser (Tip: Run igv.sh).
In the top menu select
genome --> create .genome file
You may enter any name and description of your liking, but you need to use this name in the next set of instructions. We will name it mm10.reduced
For the FASTA file use the genome.quantification/mm10.fa file
and for the Gene file use the genome.quantification/ucsc.corrected.gtf file
Finally create read density files to be able to look at all libraries together
igvtools count -w 5 tophat/th.quant.ctrl1.bam tophat/th.quant.ctrl1.bam.tdf genome.quantification/mm10.reduced.genome
NOTE: You need to refer to the genome file (the last argument above) using the same name you used above. Similarly create density files for all other libraries.
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]igvtools count -w 5 tophat/th.quant.ctrl2.bam tophat/th.quant.ctrl2.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.ctrl3.bam tophat/th.quant.ctrl3.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.expr1.bam tophat/th.quant.expr1.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.expr2.bam tophat/th.quant.expr2.bam.tdf genome.quantification/mm10.fa igvtools count -w 5 tophat/th.quant.expr3.bam tophat/th.quant.expr3.bam.tdf genome.quantification/mm10.fa[/toggle]
A few genes are good examples of differentially expressed genes. For example the whole region around the key Fgf21 gene is upregulated in experiment vs controls, while the gene Crebbp is downregulated in experiments vs controls. To point your browser to either gene just type or copy the name of the gene in the location box at the top.
We will revisit these genes below when we do differential gene expression.
Part II: Transcript reconstruction
Your genome.reconstruction folder should contain the partial chromosome 1 (chr1.fa) and the fastq.reconstruction the sequencing reads the two libraries we will use. NOTE: This is an older dataset that was originally analyzed against an earier version (mm9) of the mouse genome.
Exercise 1: Prepare the genome for alignment
As we did previously,it is necessary to create the BW transform of our genome before mapping reads. Use the bowtie-build2 program in the Bowtie distribution.
Change directory to the genome.reconstruction directory and invoke the BW transform. First change your working directory to the genome.reconstruction directory and build the BW index:
bowtie2-build chr1.fa chr1
Our alignment database will be called chr1. If all is successful your data directory should now contain 10 new files:
chr1.1.bt2 chr1.2.bt2 chr1.3.bt2 chr1.4.bt2 chr1.rev.1.bt2 chr1.rev.2.bt2
Exercise 2: Align reads to the genome
Tophat calls all alignment files ”accepted hits.bam”. It is good practice to create a directory for each alignment run. In your working directory, create two directories:
To align the reads in the data directory, invoke TopHat (Note that this will take 5-1:
tophat2 -o tophat.t0/ -r 250 genome.reconstruction/chr1 \ fastq.reconstruction/DC.LPS.t0.merged.sorted.chr1.p1.fq fastq.reconstruction/DC.LPS.t0.merged.sorted.chr1.p2.fq
then rename the alignment e.g.:
mv tophat.t0/accepted_hits.bam tophat.t0/t0.bam
And similarly for the 4h reads. Each of these commands take about 15 minutes. It is best to launch them in two different shells simultaneously.
[toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]tophat2 -o tophat.t4/ -r 250 genome.reconstruction/chr1 \ fastq.reconstruction/DC.LPS.t4.merged.sorted.chr1.p1.fq fastq.reconstruction/DC.LPS.t4.merged.sorted.chr1.p2.fq
mv tophat.t4/accepted_hits.bam tophat.t4/t4.bam[/toggle]
Exercise 3: Preparing and viewing alignments
Most applications require alignments to be sorted. Scripture and the IGV browser further require the alignments to be indexed.
java -Xmx2g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat.t0/t0.bam O=tophat.t0/t0.bam.bai igvtools count -w 5 --strands first tophat.t0/t0.bam tophat.t0/t0.bam.tdf genome.reconstruction/chr1.fa
Similarly, index the 4hr alignments: [toggle hide="yes" border="yes" style="white" title_open="Hide Commands" title_closed="Show Commands"]
java -Xmx2g -jar /usr/share/java/picard/BuildBamIndex.jar I=tophat.t4/t4.bam O=tophat.t4/t4.bam.bai
igvtools count -w 5 --strands first tophat.t4/t4.bam tophat.t4/t4.bam.tdf genome.reconstruction/chr1.fa[/toggle]
Launch the browser by typing igv.sh in the terminal window, then use the Genomes -> Load Genome From File dialogue to open the mm9 genome. To zoom in quickly you may type stat1 in the address textbox. Other interesting genes in the region: Pgap1 with a much longer UTR than that annotated, and Smap1 is a good showcase for the power of strand specific libraries.
Exercise 4: transcript reconstruction using Scripture
In this exercise you will use Scripture. Download it from here :
wget http://garberlab.umassmed.edu/data/transcriptomics/scripture_beta_82012.jar; mv scripture_beta_82012.jar bin/
Scripture is designed to reconstruct all transcripts expressed at significant levels in the sample sequenced. Scripture works with tasks; each process corresponds to tasks specified by the -task parameter:
java -Xmx2g -jar <path_to_scripture>/scripture_beta_82012.jar -task <your task> <task parameters>
If you do not specify the task, scripture executes transcript reconstruction, its default transcript reconstruction task. For a list of tasks, invoke scripture without parameters: java -Xmx2000m -jar <path_to_scripture>/scripture_beta_82012.jar.
To reconstruct transcripts de-novo, locate the scripture_alpha2.jar file and invoke (in one line):
java -Xmx2000m -jar <path to scripture>/scripture_beta_82012.jar -out tophat.t0/t0.scripture.bed \ -alignment tophat.t0/t0.bam -chrSequence genome.reconstruction/chr1.fa -trim 0.25
And similarly for the 4 hour sample. You can directly view the resulting scripture.t0.bed file in the IGV browser.
Exercise 5: Transcript reconstruction using cufflinks (optional)
To run Cufflinks for transcript reconstruction on the same libraries, invoke the command that takes into account that our library is strand specific. First, create an output directory, and then run Cufflinks:
- mkdir cufflinks.t0
- cufflinks --library-type fr-firststrand -o cufflinks.t0 -m 290 t0.sorted.bam
The resulting transcripts cufflinks.t0/transcripts.gtf can be viewed and compared with those from Scripture in the IGV browser. An interesting difference is the Cnm4 region, where a duplication makes for an interesting (if likely artificial) transcript. To look at this region, load cufflinks.t0/transcripts.gtf and scripture.t0.gtf into the IGV browser and navigate to the Cnm4 gene. Scripture will try to report most of the isoforms that are supported by the data; many will be spurious. One way to filter out spurious transcripts is by finding those that account for a small fraction of the expressed mRNAs. Cufflinks' read assignment strategy can be used to find isoforms that are either artifacts or expressed at uninteresting levels.
Exercise 6 (Optional): Analysis of the reconstruction with Cufflinks
6.1 Creating a single reconstruction set using cuffmerge The cuffmerge utility is designed to create a single transcript data set from multiple reconstructions. In this step we use this utility to merge our time point reconstructions into a single set: The input to cuffmerge is a file listing all the different reconstructions to merge. To create it, use your favorite text editor (e.g. vi) and type in the two scripture reconstruction .gtf files we previously obtained. The file should contain the following two lines. Save it with the name cuffmerge.input.txt: scripture.t0.gtf scripture.t4.gtf Then run cuffmerge: cuffmerge cuffmerge.input.txt Cuffmerge will output its merged reconstruction to a merged_asm/merged.gtf file, which we will use throughout this section. Note: Load the merged_asm/merged.gtf to your IGV session to visualize results pertaining to this transcript set. Make sure you also have loaded the alignment data (bam and tdf files) in your session so you can visualize the results we obtain in the next steps.