Transcript reconstruction - Evolution and Genomics

Optional Part: Transcript reconstruction

Your genome.reconstruction folder should contain the partial chromosome 1 (chr1.fa) and the fastq.reconstruction the sequencing reads the two libraries we will use. NOTE: This is an older dataset that was originally analyzed against an earlier version (mm9) of the mouse genome.

Exercise 1: Prepare the genome for alignment

As we did previously,it is necessary to create the BW transform of our genome before mapping reads. Use the bowtie-build2 program in the Bowtie distribution.

Change directory to the genome.reconstruction directory and invoke the BW transform. First change your working directory to the genome.reconstruction directory and build the BW index:

bowtie2-build chr1.fa chr1

Our alignment database will be called chr1. If all is successful your data directory should now contain six new files:

chr1.1.bt2
chr1.2.bt2
chr1.3.bt2
chr1.4.bt2
chr1.rev.1.bt2
chr1.rev.2.bt2

Exercise 2: Align reads to the genome

Tophat calls all alignment files ”accepted hits.bam”. It is good practice to create a directory for each alignment run. In your working directory (transcriptomics), create two directories:

cd ..

mkdir tophat.t0

mkdir tophat.t4

To align the reads in the data directory, invoke TopHat. Since this will take a while you can run Tophat2 for the 4h reads (t4) in a separate terminal (it takes ~10 mins):

tophat2 -o tophat.t0/ -r 250 genome.reconstruction/chr1 \
fastq.reconstruction/DC.LPS.t0.merged.sorted.chr1.p1.fq fastq.reconstruction/DC.LPS.t0.merged.sorted.chr1.p2.fq

then rename the alignment e.g.:

mv tophat.t0/accepted_hits.bam tophat.t0/t0.bam

And similarly for the 4h reads.

[toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]

tophat2 -o tophat.t4/ -r 250 genome.reconstruction/chr1 \
 fastq.reconstruction/DC.LPS.t4.merged.sorted.chr1.p1.fq fastq.reconstruction/DC.LPS.t4.merged.sorted.chr1.p2.fq

mv tophat.t4/accepted_hits.bam tophat.t4/t4.bam

[/toggle]

Exercise 3: Preparing and viewing alignments

Most applications require alignments to be sorted. Scripture and the IGV browser further require the alignments to be indexed.

picard-tools BuildBamIndex I=tophat.t0/t0.bam O=tophat.t0/t0.bam.bai

igvtools count -w 5 --strands first tophat.t0/t0.bam tophat.t0/t0.bam.tdf genome.reconstruction/chr1.fa

Similarly, index the 4hr alignments: [toggle hide=”yes” border=”yes” style=”white” title_open=”Hide Commands” title_closed=”Show Commands”]

picard-tools BuildBamIndex I=tophat.t4/t4.bam O=tophat.t4/t4.bam.bai

igvtools count -w 5 --strands first tophat.t4/t4.bam tophat.t4/t4.bam.tdf genome.reconstruction/chr1.fa

[/toggle]

Launch the browser by typing igv.sh in the terminal window, then use the Genome drop down menu then More and finally open the Mouse mm9 genome. To look interesting genes please use bam files. To zoom in quickly you may type stat1 in the address textbox. Other interesting genes in the region: Pgap1 with a much longer UTR than that annotated, and Smap1 is a good showcase for the power of strand specific libraries.

Exercise 4: transcript reconstruction using Scripture

In this exercise you will use Scripture.

Scripture is designed to reconstruct all transcripts expressed at significant levels in the sample sequenced. Scripture works with tasks; each process corresponds to tasks specified by the -task parameter:

scripture_beta_82012.jar -task <your task> <task parameters>

If you do not specify the task, scripture executes transcript reconstruction, its default transcript reconstruction task. For a list of tasks, invoke scripture without parameters: java -Xmx2000m -jar <path_to_scripture>/scripture_beta_82012.jar.
To reconstruct transcripts de-novo, locate the scripture_alpha2.jar file and invoke (in one line):

scripture_beta_82012.jar -out tophat.t0/t0.scripture.bed \
-alignment tophat.t0/t0.bam  -chrSequence genome.reconstruction/chr1.fa -trim 0.25

And similarly for the 4 hour sample. You can directly view the resulting scripture.t0.bed file in the IGV browser.To generate a gtf file from scripture run:

cd tophat.t0
scripture_beta_82012.jar -task toGFF -cufflinks -in t0.scripture.bed -source SCRIPTURE -out scripture.t0.gtf -prefix t0

Modify the above scripture command for the t4 data as well (run in the tophat.t0 directory). You should now have a gtf file for each of the conditions.

Exercise 5: Transcript reconstruction using cufflinks

To run Cufflinks for transcript reconstruction on the same libraries, invoke the command that takes into account that our library is strand specific. First, create an output directory, and then run Cufflinks:

mkdir cufflinks.t0
cufflinks tophat.t0/t0.bam --library-type fr-firststrand -o cufflinks.t0/ -m 290

The resulting transcripts cufflinks.t0/transcripts.gtf can be viewed and compared with those from Scripture in the IGV browser generatied in exercise 4. An interesting difference is the Cnm4 region, where a duplication makes for an interesting (if likely artificial) transcript. To look at this region, load cufflinks.t0/transcripts.gtf and scripture.t0.gtf into the IGV browser and navigate to the Cnm4 gene. Scripture will try to report most of the isoforms that are supported by the data; many will be spurious. One way to filter out spurious transcripts is by finding those that account for a small fraction of the expressed mRNAs. Cufflinks’ read assignment strategy can be used to find isoforms that are either artifacts or expressed at uninteresting levels.

Exercise 6: Analysis of the reconstruction with Cufflinks

Creating a single reconstruction set using cuffmerge

The cuffmerge utility is designed to create a single transcript data set from multiple reconstructions. In this step we use this utility to merge our time point reconstructions into a single set:
To do this please first go to ~/workshop_data/transcriptomics directory.

cd ~/workshop_data/transcriptomics

The input to cuffmerge is a file listing all the different reconstructions to merge. To create it, use your favorite text editor (e.g. vi or emacs) and type in the two scripture reconstruction .gtf files we previously obtained.
The file should contain the following two lines. Save it with the name cuffmerge.input.txt:

tophat.t0/scripture.t0.gtf
tophat.t4/scripture.t4.gtf

Then run cuffmerge:

cuffmerge cuffmerge.input.txt

Cuffmerge will output its merged reconstruction to a merged_asm/merged.gtf file, which we will use throughout this section.

Note: Load the merged_asm/merged.gtf to your IGV session to visualize results pertaining to this transcript set. Make sure you also have loaded the alignment data (bam and tdf files) in your session so you can visualize the results we obtain in the next steps.