Optional Part: Transcript reconstruction
Your genome.reconstruction folder should contain the partial chromosome 1 (chr1.fa) and the fastq.reconstruction the sequencing reads the two libraries we will use. NOTE: This is an older dataset that was originally analyzed against an earlier version (mm9) of the mouse genome.
Exercise 1: Prepare the genome for alignment
As we did previously,it is necessary to create the BW transform of our genome before mapping reads. Use the bowtie-build2 program in the Bowtie distribution.
Change directory to the genome.reconstruction directory and invoke the BW transform. First change your working directory to the genome.reconstruction directory and build the BW index:
bowtie2-build chr1.fa chr1
Our alignment database will be called chr1. If all is successful your data directory should now contain six new files:
chr1.1.bt2 chr1.2.bt2 chr1.3.bt2 chr1.4.bt2 chr1.rev.1.bt2 chr1.rev.2.bt2
Exercise 2: Align reads to the genome
Tophat calls all alignment files ”accepted hits.bam”. It is good practice to create a directory for each alignment run. In your working directory (transcriptomics), create two directories:
To align the reads in the data directory, invoke TopHat. Since this will take a while you can run Tophat2 for the 4h reads (t4) in a separate terminal (it takes ~10 mins):
tophat2 -o tophat.t0/ -r 250 genome.reconstruction/chr1 \ fastq.reconstruction/DC.LPS.t0.merged.sorted.chr1.p1.fq fastq.reconstruction/DC.LPS.t0.merged.sorted.chr1.p2.fq
then rename the alignment e.g.:
mv tophat.t0/accepted_hits.bam tophat.t0/t0.bam
And similarly for the 4h reads.[toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]
tophat2 -o tophat.t4/ -r 250 genome.reconstruction/chr1 \ fastq.reconstruction/DC.LPS.t4.merged.sorted.chr1.p1.fq fastq.reconstruction/DC.LPS.t4.merged.sorted.chr1.p2.fq
mv tophat.t4/accepted_hits.bam tophat.t4/t4.bam[/toggle]
Exercise 3: Preparing and viewing alignments
Most applications require alignments to be sorted. Scripture and the IGV browser further require the alignments to be indexed.
picard-tools BuildBamIndex I=tophat.t0/t0.bam O=tophat.t0/t0.bam.bai igvtools count -w 5 --strands first tophat.t0/t0.bam tophat.t0/t0.bam.tdf genome.reconstruction/chr1.fa
Similarly, index the 4hr alignments: [toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]
picard-tools BuildBamIndex I=tophat.t4/t4.bam O=tophat.t4/t4.bam.bai
igvtools count -w 5 --strands first tophat.t4/t4.bam tophat.t4/t4.bam.tdf genome.reconstruction/chr1.fa[/toggle]
Launch the browser by typing igv.sh in the terminal window, then use the Genome drop down menu then More and finally open the Mouse mm9 genome. To look interesting genes please use bam files. To zoom in quickly you may type stat1 in the address textbox. Other interesting genes in the region: Pgap1 with a much longer UTR than that annotated, and Smap1 is a good showcase for the power of strand specific libraries.
Exercise 4: transcript reconstruction using Scripture
In this exercise you will use Scripture.
Scripture is designed to reconstruct all transcripts expressed at significant levels in the sample sequenced. Scripture works with tasks; each process corresponds to tasks specified by the -task parameter:
scripture_beta_82012.jar -task <your task> <task parameters>
If you do not specify the task, scripture executes transcript reconstruction, its default transcript reconstruction task. For a list of tasks, invoke scripture without parameters: java -Xmx2000m -jar <path_to_scripture>/scripture_beta_82012.jar.
To reconstruct transcripts de-novo, locate the scripture_alpha2.jar file and invoke (in one line):
scripture_beta_82012.jar -out tophat.t0/t0.scripture.bed \ -alignment tophat.t0/t0.bam -chrSequence genome.reconstruction/chr1.fa -trim 0.25
And similarly for the 4 hour sample. You can directly view the resulting scripture.t0.bed file in the IGV browser.To generate a gtf file from scripture run:
cd tophat.t0 scripture_beta_82012.jar -task toGFF -cufflinks -in t0.scripture.bed -source SCRIPTURE -out scripture.t0.gtf -prefix t0
Modify the above scripture command for the t4 data as well (run in the tophat.t0 directory). You should now have a gtf file for each of the conditions.
Exercise 5: Transcript reconstruction using cufflinks
To run Cufflinks for transcript reconstruction on the same libraries, invoke the command that takes into account that our library is strand specific. First, create an output directory, and then run Cufflinks:
mkdir cufflinks.t0 cufflinks tophat.t0/t0.bam --library-type fr-firststrand -o cufflinks.t0/ -m 290
The resulting transcripts cufflinks.t0/transcripts.gtf can be viewed and compared with those from Scripture in the IGV browser generatied in exercise 4. An interesting difference is the Cnm4 region, where a duplication makes for an interesting (if likely artificial) transcript. To look at this region, load cufflinks.t0/transcripts.gtf and scripture.t0.gtf into the IGV browser and navigate to the Cnm4 gene. Scripture will try to report most of the isoforms that are supported by the data; many will be spurious. One way to filter out spurious transcripts is by finding those that account for a small fraction of the expressed mRNAs. Cufflinks’ read assignment strategy can be used to find isoforms that are either artifacts or expressed at uninteresting levels.
Exercise 6: Analysis of the reconstruction with Cufflinks
Creating a single reconstruction set using cuffmerge
The cuffmerge utility is designed to create a single transcript data set from multiple reconstructions. In this step we use this utility to merge our time point reconstructions into a single set:
To do this please first go to ~/workshop_data/transcriptomics directory.
The input to cuffmerge is a file listing all the different reconstructions to merge. To create it, use your favorite text editor (e.g. vi or emacs) and type in the two scripture reconstruction .gtf files we previously obtained.
The file should contain the following two lines. Save it with the name cuffmerge.input.txt:
Then run cuffmerge:
Cuffmerge will output its merged reconstruction to a merged_asm/merged.gtf file, which we will use throughout this section.
Note: Load the merged_asm/merged.gtf to your IGV session to visualize results pertaining to this transcript set. Make sure you also have loaded the alignment data (bam and tdf files) in your session so you can visualize the results we obtain in the next steps.