table of contents
- expected learning outcome
- getting started
- exercise 1: alignment of RNA-seq reads
- exercise 2: transcript reconstruction using Scripture
- exercise 3: analysis of the reconstruction with Cufflinks
- exercise 4: working with existing annotations
- exercise 5: ChIP-Seq analysis using Scripture
expected learning outcome
The learning objective of this activity is to understand the basics of RNA-Seq data and to gain proficiency in the use of Scripture and other programs.
We will analyze a partial dataset from a pathogen stimulated Dendritic Cells (DC) time course. Specifically, we will look at reads aligning to chr1:16170000-66500000 from three libraries: an RNA-Seq library from unstimulated cells, an RNA-Seq library from RNA extracted 4 hours post stimulation, and a H3K4me3 Chip-Seq library from unstimulated DCs.
We will work against a ”genome” consisting only of the first 66.55 megabases of mouse chromosome 1 (NCBI build 37).
The data for this activity can be found in ~/transcriptomics.
FOR THESE EXERCISES, YOU WILL NEED TO MAKE THE FOLLOWING SUBSTITUTIONS!
For <path to picard-tools>, substitute:
For <path to scripture>, substitute:
For <path to IGVTools>, substitute:
exercise 1: alignment of RNA-seq reads
Use TopHat to align RNA-Seq reads. The Bowtie aligner relies on a very efficient compression (Burrows-Wheeler transform). Your data folder should contain 5 read files (fastq) and the partial chromosome 1 (chr1.fa).
1.1 Creating the BW-transformed genome (~ 5 min.)
TopHat depends on Bowtie to map reads and read fragments, and the Bowtie aligner relies on a very efficient compression (Burrows-Wheeler transform) of the genome to perform its searches. It is necessary to create the BW transform of our genome before mapping reads.
Use the bowtie-build2 program in the Bowtie distribution.
Change directory to the data folder and invoke the BW transform:
bowtie2-build chr1.fa chr1
Our alignment database will be called chr1. If all is successful your data directory should now contain 10 new files:
1.2 Alignment using TopHat
Tophat calls all alignment files ”accepted hits.bam”. It is good practice to create a directory for each alignment run. In your working directory, create two directories:
To align the reads in the data directory, invoke TopHat:
tophat -o tophat.t0/ -r 250 chr1 DC.LPS.t0.merged.sorted.chr1.p1.fq DC.LPS.t0.merged.sorted.chr1.p2.fq
And similarly for the 4h reads.
Each of these commands take about 15 minutes. It is best to launch them in two different shells simultaneously.
1.3 Preparing and viewing alignments (~ 5 min)
Most applications require alignments to be sorted. Scripture and the IGV browser further require the alignments to be indexed. There is currently an incompatibility between samtools sort and Picard (which is used by Scripture); therefore you need to sort the BAM file using Picard and not samtools:
- java -Xmx2000m -jar <path to picard-tools>/SortSam.jar I=tophat.t0/accepted_hits.bam O=t0.sorted.bam SO=coordinate
- java -Xmx2000m -jar <path to picard-tools>/BuildBamIndex.jar I=t0.sorted.bam O=t0.sorted.bam.bai
Similarly, sort and index the 4hr alignments:
- java -Xmx2000m -jar <path to picard-tools>/SortSam.jar I=tophat.t4/accepted_hits.bam O=t4.sorted.bam SO=coordinate
- java -Xmx2000m -jar <path to picard-tools>/BuildBamIndex.jar I=t4.sorted.bam O=t4.sorted.bam.bai
1.4 Viewing the alignments
Once sorted and indexed, the alignments can be viewed in the Integrative Genomics Viewer (IGV).
Viewing alignments can be slow and is only useful to understand the specific structure of each gene. To be able to just look at read densities, IGV provides a counting function to estimate the number of reads aligning to any window in the genome.
To produce these files for our alignments, use igvtools:
- <path to IGVTools>/igvtools count -w 5 t0.sorted.bam t0.sorted.bam.tdf <path to IGVTools>/genomes/mm9.genome
- <path to IGVTools>/igvtools count -w 5 t4.sorted.bam t4.sorted.bam.tdf <path to IGVTools>/genomes/mm9.genome
Launch the browser by typing igv.sh in the terminal window, then use the Genomes -> Load Genome From File.. dialogue to open the mm9 genome. To zoom in quickly you may type stat1 in the address textbox.
Other interesting genes in the region: Pgap1 with a much longer UTR than that annotated, and Obfc2a with long UTR minor isoform.
exercise 2: transcript reconstruction using Scripture
Scripture is designed to reconstruct all transcripts expressed at significant levels in the sample sequenced.
2.1 Looking at the paired-end information (optional)
The two libraries sequenced were prepared using the dUTP method, which ensures the cDNA from all transcripts is always on reverse orientation to the original RNA. However, since the alignment contains both pairs, this is not obvious when looking at the alignments.
Scripture creates a new alignment that contains the original inserts in the orientation of one of the pairs. To generate this alignment invoke:
java -Xmx2000m -jar <path_to_scripture>/scripture_alpha2.jar -task pairedfile -pair1 t0.sorted.bam -usePair2Orientation -out t0.sorted.pairs.bam
And similarly for the 4 hour library.
These alignments need to be sorted and indexed like the original alignments using Picard.
- java -Xmx2000m -jar <path to picard-tools>/SortSam.jar I=t0.sorted.pairs.bam O=t0.sorted.pairs.sorted.bam SO=coordinate
- java -Xmx2000m -jar <path to picard-tools>/BuildBamIndex.jar I=t0.sorted.pairs.sorted.bam O=t0.sorted.pairs.sorted.bam.bai
And similarly for the 4 hour time point. The Bzw1 gene is a good example of a transcript having a pair of spliced exons without spliced alignments, but where paired-end data clearly indicates their connection.
2.2 Run Scripture
Scripture works with tasks; each process corresponds to tasks specified by the -task parameter:
java -Xmx2000m -jar <path_to_scripture>/scripture_alpha2.jar -task <your task> <task parameters>
If you do not specify the task, scripture executes transcript reconstruction, its default transcript reconstruction task.
For a list of tasks, invoke scripture without parameters: java -Xmx2000m -jar <path_to_scripture>/scripture_alpha2.jar
Locate the scripture_alpha2.jar file and invoke (in one line):
java -Xmx2000m -jar <path to scripture>/scripture_alpha2.jar -alignment t0.sorted.bam -chr chr1 -chrSequence chr1.fa -out scripture.t0.bed
And similarly for the 4 hour sample.
You can directly view the resulting scripture.t0.bed file in the IGV browser. However, in order to use scripture output as input to Cufflinks, we need to convert the output into a GTF file readable by Cufflinks:
java -Xmx1000m -jar <path to scripture>/scripture_alpha2.jar -task toGFF -cufflinks -in scripture.t0.bed -source SCRIPTURE -out scripture.t0.gtf -prefix t0
And similarly with the 4 hour sample. Stat1 is an interesting gene to explore.
2.3 Running Cufflinks (optional)
To run Cufflinks for transcript reconstruction on the same libraries, invoke the command that takes into account that our library is strand specific. First, create an output directory, and then run Cufflinks:
- mkdir cufflinks.t0
- cufflinks –library-type fr-firststrand -o cufflinks.t0 -m 290 t0.sorted.bam
The resulting transcripts cufflinks.t0/transcripts.gtf can be viewed and compared with those from Scripture in the IGV browser. An interesting difference is the Cnm4 region, where a duplication makes for an interesting (if likely artificial) transcript.
To look at this region, load cufflinks.t0/transcripts.gtf and scripture.t0.gtf into the IGV browser and navigate to the Cnm4 gene.
Scripture will try to report most of the isoforms that are supported by the data; many will be spurious. One way to filter out spurious transcripts is by finding those that account for a small fraction of the expressed mRNAs. Cufflinks’ read assignment strategy can be used to find isoforms that are either artifacts or expressed at uninteresting levels.
exercise 3: analysis of the reconstruction with Cufflinks
3.1 Creating a single reconstruction set using cuffmerge
The cuffmerge utility is designed to create a single transcript data set from multiple reconstructions. In this step we use this utility to merge our time point reconstructions into a single set:
The input to cuffmerge is a file listing all the different reconstructions to merge. To create it, use your favorite text editor (e.g. vi) and type in the two scripture reconstruction .gtf files we previously obtained. The file should contain the following two lines. Save it with the name cuffmerge.input.txt:
Then run cuffmerge:
Cuffmerge will output its merged reconstruction to a merged_asm/merged.gtf file, which we will use throughout this section.
Note: Load the merged_asm/merged.gtf to your IGV session to visualize results pertaining to this transcript set. Make sure you also have loaded the alignment data (bam and tdf files) in your session so you can visualize the results we obtain in the next steps.
3.2 Expression analysis with Cuffdiff
The single transcript file we created in the previous section can be readily used by cuffdiff for differential expression analysis at both isoform and gene levels.
To run cuffdiff type
cuffdiff -o cuffdiff.merged –library-type fr-firststrand merged_asm/merged.gtf -L t0,t4 t0.sorted.bam t4.sorted.bam
The –library-type fr-firststrand tells cuffdiff that the library construction method used for this library is strand specific and that the first read comes from the 3′ end of the RNA fragment as is the case with the dUTP library construction method used to generate the libraries used in this exercise.
The -L t0,t4 option tells cuffdiff to use the labels t0 for the first sample, and t4 for the second sample.
The -o cuffdiff.merged option forces output into the cuffdiff.merged subdirectory of your current directory.
Cuffdiff has several output files, described elsewhere in more detail. Here we will concentrate on cuffdiff.merged/gene_exp.diff and cuffdiff.merged/isoform_exp.diff files.
These files are tab-separated tables consisting of 13 columns and can be viewed in Excel or OpenOffice.
To quickly analyze these files we will use standard Unix commands.
To view the gene differential expression file use
less -S cuffdiff.merged/gene_exp.diff
less -S cuffdiff.merged/isoform_exp.diff
in two different terminal windows. While there is a single test_id row and gene_id for each locus in the gene_exp.diff file, the isoform_exp.diff file may contain several test_id rows for each gene_id and, in some cases several gene_ids for a single locus.
Next look for the most significant differentially expressed genes according to cuffdiff. For this we can use standard unix tools to sort cuffdiff output according to the significance of the test statistic for differential expression:
grep -v test_id cuffdiff.merged/gene_exp.diff | sort -n –key=12 | less -S
The locus chr1:40448452-40522254 is near the top of the list. Look at this locus in the IGV browser by pasting the locus into IGV’s location window. The locus corresponds to the Il1rl1 gene, an Il33 putative receptor, which has been shown to a regulator of the inflammatory response and associated with several auto-immune diseases. Interestingly this gene is down-regulated in DCs during the inflammatory response.
We can similarly examine the isoform expression for this gene:
grep ‘chr1:40448452-40522254’ cuffdiff.merged/isoform_exp.diff | less -S
Take the time to explore the different isoform usage for the 5 putative isoforms and the intronic segment.
Examine this locus in detail as it is an example of many interesting features.
exercise 4: working with existing annotations
Scripture implements a simple scoring scheme to compute the expression of a given set of transcripts. It does so by finding the set of constituent exons for any given gene and computing RPKM and other expression values.
First download the RefSeq annotations for the region of interest. Go to the UCSC genome browser, select the mouse mm9 genome, and download the RefSeq track of the Gene and Gene prediction track group in BED format (be sure to select full gene when prompted). In the position textbox, input chr1:16170000-66500000 and save the file as refseq.bed.
Then use Scripture to score genes:
- java -Xmx2000m -jar <path to scripture>/scripture_alpha2.jar -task score -in refseq.bed -alignment t0.sorted.pairs.sorted.bam -out refseq.t0.scored -useConstituentExons
- java -Xmx2000m -jar <path to scripture>/scripture_alpha2.jar -task score -in refseq.bed -alignment t4.sorted.pairs.sorted.bam -out refseq.t4.scored -useConstituentExons
The output consists of the orginal BED file with an extra set of 8 columns. These columns are:
- Family wise error rate (FWER) at the observed score
- Read enrichment (average reads per base over transcript / genome average)
- Total reads aligned transcript
- Average reads per base over transcript
- Genome average reads per base
- Transcript length
- Poisson nominal p-value for the observed count given the genome-wide average
exercise 5: ChIP-Seq analysis using Scripture
In this exercise we use Scripture to find regions of enrichment (peaks) in a H3K4me3 ChIP-Seq library of unstimulated DCs.
We first align reads to our toy chromosome 1, but first we need to unpack our reads:
To map the reads we use bowtie:
bowtie –best -S chr1 DC.K4me3.t0.fq DC.K4me3.t0.sam
Take the time to examine the output alignment to review the SAM format.
We then sort and index the alignment. Since the alignment is in SAM format, we can either convert it to BAM using the samtools view utility or use Picard (or Unix sort) to directly sort the alignment:
java -Xmx2000m -jar <path to picard-tools>/SortSam.jar I=DC.K4me3.t0.sam O=DC.K4me3.t0.sorted.bam SO=coordinate
and then index:
java -Xmx2000m -jar <path to picard-tools>/BuildBamIndex.jar I=DC.K4me3.t0.sorted.bam O=DC.K4me3.t0.sorted.bam.bai
and produce a count file to view in IGV:
<path to IGVTools>/igvtools count -w 5 DC.K4me3.t0.sorted.bam DC.K4me3.t0.sorted.bam.tdf <path to IGVTools>/genomes/mm9.genome
Find regions of enrichment using Scripture:
java -Xmx2000m -jar <path to scripture>/scripture_alpha2.jar -task chip -alignment DC.K4me3.t0.sorted.bam -chr chr1 -out DC.K4me3.t0.segments.bed -windows 500,100 -trim