Trinity

expected learning outcomes
getting started
system requirements
exercise 1
exercise 2
exercise 3

expected learning outcomes

By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. This approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.

Grabherr, MG et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology. 2011 May 15;29(7):644-52.

getting started

Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works as follows:

Inchworm assembles the RNA-Seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.

Chrysalis groups the Inchworm contigs into clusters and constructs a complete de Bruijn graph for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.

Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
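To make the de Bruijn graph idea concrete, here is a minimal Python sketch. It is not Trinity's implementation: the function names, the toy sequences, and the choice of k=5 are all assumptions made for illustration. It builds a graph from two short "isoforms" that share a prefix and reports the node where their paths diverge, which is the kind of branch Butterfly has to resolve.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_points(graph):
    """Nodes with more than one successor, i.e. places where paths diverge."""
    return sorted(node for node, nexts in graph.items() if len(nexts) > 1)


# Two made-up sequences that share the prefix ATGGCGT and then differ.
isoform_a = "ATGGCGTACCTGA"
isoform_b = "ATGGCGTTTGAGA"

graph = build_de_bruijn([isoform_a, isoform_b], k=5)
print(branch_points(graph))  # prints ['GCGT']
```

Real data adds sequencing errors, uneven coverage, and repeats, so a branch in the graph is only a candidate splice or paralog difference, not proof of one.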

The Trinity software package includes each of these tools and can be downloaded here.

system requirements

This exercise will not work on Mac OS. Those of you working on Macs will need to complete this exercise using your VirtualBox installation of Ubuntu.

A basic recommendation is to have 1 GB of RAM per 1 million pairs of Illumina reads in order to run the Inchworm and Chrysalis steps. Simpler transcriptomes require less memory than complex transcriptomes. Butterfly requires less memory and can also be spread across multiple processors.

The entire process can require ~1 hour per million pairs of reads in the current implementation. There are various things that can be done to modify performance. Please review the guidelines in the official Trinity documentation for more advice on this topic.
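As a rough planning aid, the two rules of thumb above (about 1 GB of RAM and about 1 hour per million read pairs) can be turned into a quick estimate. This is only a back-of-the-envelope sketch: the function name is hypothetical, and actual requirements depend on transcriptome complexity and on the Trinity version.

```python
def estimate_resources(read_pairs):
    """Return (ram_gb, hours) from the rules of thumb quoted in the text."""
    millions = read_pairs / 1_000_000
    ram_gb = millions * 1.0  # ~1 GB of RAM per million read pairs
    hours = millions * 1.0   # ~1 hour of run time per million read pairs
    return ram_gb, hours


ram_gb, hours = estimate_resources(20_000_000)
print(f"~{ram_gb:.0f} GB RAM, ~{hours:.0f} hours")  # prints ~20 GB RAM, ~20 hours
```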

Basic Trinity usage is as follows:

Trinity.pl --seqType (fq for fastq or fa for fasta) --left ~/path/to/reads_1.fq --right ~/path/to/reads_2.fq (or --single for single reads) --CPU 4 --bflyHeapSpace 10G --output ~/path/to/output_dir

NOTE: It is recommended to use fully specified paths for sequence files with Trinity.
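If you launch Trinity from a script, one way to follow the advice on fully specified paths is to expand them before building the command. The sketch below assumes a hypothetical helper, trinity_command; it only assembles the argument list from the options shown in the usage line above and does not run anything.

```python
import os
import shlex


def trinity_command(left, right, output_dir, seq_type="fq", cpu=4, heap="10G"):
    """Build a Trinity.pl argument list with fully specified paths."""
    def full(path):
        # Expand "~" and make the path absolute.
        return os.path.abspath(os.path.expanduser(path))

    return [
        "Trinity.pl",
        "--seqType", seq_type,
        "--left", full(left),
        "--right", full(right),
        "--CPU", str(cpu),
        "--bflyHeapSpace", heap,
        "--output", full(output_dir),
    ]


cmd = trinity_command(
    "~/path/to/reads_1.fq", "~/path/to/reads_2.fq", "~/path/to/output_dir"
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) once Trinity.pl is on your PATH.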

exercises

exercise 1: accessing data

Familiarize yourself with Trinity by having a look at the manual: http://trinityrnaseq.sourceforge.net/.
Have a look at the example data used in this exercise. The data were obtained from mouse dendritic cells (mouse_left.fasta and mouse_right.fasta) and a whitefly (whitefly_left.fasta and whitefly_right.fasta), and the files are located in ~/wg_jan2012/activities/Trinity/.

exercise 2: run Trinity

Run Trinity on the mouse and whitefly example data. Here is some information to help you set up your run. Trinity is run via the script Trinity.pl. Usage is as follows:

##################################################################
# Required:
#  --seqType <string>  :type of reads: (fq or fa)
#
#  If paired reads:
#      --left  <string>    :left reads
#      --right <string>    :right reads
#
#  Or, if unpaired reads:
#      --single <string>   :single reads
#
#      --output <string>   :name of directory for output (will be created if it doesn't already exist)
#                                 default( "trinity_out_dir" )
#
#  if strand-specific data, set:
#      --SS_lib_type <string>  :if paired: RF or FR,  if single: F or R
#
#  Butterfly-related options:
#      --run_butterfly                 :executes butterfly commands.  Do not set this if you want to spawn them on a computing grid.
#      --bfly_opts <string>            :parameters to pass through to butterfly (see butterfly documentation).
#      --bflyHeapSpace <string>        :java heap space setting for butterfly (default: 1000M) => yields command java -Xmx1000M -jar Butterfly.jar ... $bfly_opts
#
#  Inchworm-related options:
#      --no_meryl                      :do not use meryl for computing the k-mer catalog (default: uses meryl, providing improved runtime performance)
#      --min_kmer_cov <int>            :min count for K-mers to be assembled by Inchworm (default: 1)
#
# Misc:
#  --CPU <int>               :number of CPUs to use, default: 2
#  --min_contig_length <int> :minimum assembled contig length to report (def=200)
#  --paired_fragment_length <int>  :maximum length expected between fragment pairs (aim for 90% percentile)  (def=300)
#  --jaccard_clip     option, set if you have paired reads and you expect high gene density with UTR overlap (use FASTQ input file format for reads).
#  --run_ALLPATHSLG_error_correction :runs the read error correction process built into ALLPATHSLG.
#                                     (requires ALLPATHSLG to be installed, and installation directory indicated
#                                      by the env variable 'ALLPATHSLG_BASEDIR')
##################################################################

Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved.

If you have strand-specific data, specify the library type. There are four library types:

Paired reads:

RF: first read (/1) of fragment pair is sequenced as anti-sense (reverse(R)), and second read (/2) is in the sense strand (forward(F)); typical of the dUTP/UDG sequencing method.
FR: first read (/1) of fragment pair is sequenced as sense (forward), and second read (/2) is in the antisense strand (reverse)

Unpaired (single) reads:

F: the single read is in the sense (forward) orientation
R: the single read is in the antisense (reverse) orientation

By setting the --SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.
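The four library types can be read as a rule for which reads must be reverse-complemented to recover the sense strand. The following Python sketch illustrates that rule only; the function names are hypothetical and this is not how Trinity handles reads internally.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sense_strand(seq, lib_type, read_number=1):
    """Return the read as it would appear on the sense (transcript) strand."""
    if read_number not in (1, 2):
        raise ValueError("read_number must be 1 or 2")
    if lib_type in ("F", "R"):        # unpaired (single) reads
        antisense = lib_type == "R"
    elif lib_type in ("FR", "RF"):    # paired reads
        # First letter describes read /1, second letter describes read /2.
        antisense = lib_type[read_number - 1] == "R"
    else:
        raise ValueError(f"unknown library type: {lib_type}")
    return reverse_complement(seq) if antisense else seq


# With RF, read /1 is antisense and read /2 is already in the sense orientation.
print(sense_strand("AAGC", "RF", read_number=1))  # prints GCTT
print(sense_strand("AAGC", "RF", read_number=2))  # prints AAGC
```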

Typical Trinity usage is:

Trinity.pl --seqType (fq for fastq or fa for fasta) --left ~/path/to/reads_left --right ~/path/to/reads_right (or --single for single reads) --CPU 2 --bflyHeapSpace 10G --output ~/path/to/output_dir

NOTE: It is recommended to use fully specified paths for sequence files with Trinity.

Data sets: mouse (mouse_left.fasta, mouse_right.fasta), whitefly (whitefly_left.fasta, whitefly_right.fasta)

exercise 3: assess contiguity and full-lengthness

Explore the Trinity output file Trinity.fasta located in the trinity_out_dir/output directory (or output directory you specify). Transcripts are grouped as follows:
- components: the set of all sequences that share at least one k-mer (including paralogs)
- contigs: transcripts that share a number of k-mers (the set of isoforms of a gene)
- sequences (isoforms and allelic variation)
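To shortlist candidate isoforms, you can group the entries of Trinity.fasta by component and contig. The sketch below assumes that sequence identifiers look like comp0_c0_seq1 (component, contig, sequence); check the headers in your own Trinity.fasta, because the naming scheme depends on the Trinity version. The example records are made up.

```python
import re
from collections import defaultdict

# Assumed identifier layout: comp<N>_c<N>_seq<N>. Adjust to your own headers.
_ID = re.compile(r"^>(comp\d+)_(c\d+)_(seq\d+)")


def group_by_contig(fasta_lines):
    """Map (component, contig) to the list of sequence labels found under it."""
    groups = defaultdict(list)
    for line in fasta_lines:
        match = _ID.match(line)
        if match:
            component, contig, seq = match.groups()
            groups[(component, contig)].append(seq)
    return groups


# Made-up records standing in for the lines of Trinity.fasta.
example = [
    ">comp0_c0_seq1 len=300",
    "ACGT",
    ">comp0_c0_seq2 len=420",
    "ACGTT",
    ">comp1_c0_seq1 len=250",
    "GGCA",
]

# Contigs with more than one sequence are candidates for alternative isoforms.
candidates = {k: v for k, v in group_by_contig(example).items() if len(v) > 1}
print(candidates)  # prints {('comp0', 'c0'): ['seq1', 'seq2']}
```

To use it on real output, pass an open file handle, for example group_by_contig(open("Trinity.fasta")).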
Confirm that you assembled mouse data by opening the UCSC genome browser and following these steps:
– Select BLAT from the menu at the top of the page and paste in a mouse transcript sequence from Trinity.fasta.
– Select the mouse/mm9 genome and click “submit”.
– Click on the top scoring hit.
– Optional: examine the alignments by clicking “details” on the resulting page.
– Your sequences will be displayed in the browser.
– Enable the mouse annotations (ENSEMBL gene build, UCSC genes, human proteins etc.).

Find alternatively spliced isoforms in the mouse data:
– Examine the sequences in Trinity.fasta and “guess” which ones might be alternative isoforms
– Verify alternative splicing in the UCSC genome browser (see above)

In order to align our data against human proteins, translate the nucleotide output in the Trinity.fasta file in all six frames. This can be done using SatsumaProt:
SatsumaProt -t UCSC_genes_readable.prot -q Trinity.fasta > aligns.out

Example Output:
Target: 1-1093 Query: 1-777
--------------------------------------------------------------------------------
PAWLRRLCGQLLSERLMRPNGVQAVVRGILEGAGAGAAGGSDAEATAADWRKCDLIAKILASCPQQSLSPESYYKDICPQ
PAWLRRLCGQLLSERLMRPNGVQAVVRGILEGAGAGAAGGSDAE TAADW+KCDLIAKILASCPQQSLSPE+YY+DICPQ
PAWLRRLCGQLLSERLMRPNGVQAVVRGILEGAGAGAAGGSDAEVTAADWKKCDLIAKILASCPQQSLSPENYYRDICPQ
--------------------------------------------------------------------------------
ILDLFHLQDKLTARQFQRVATTTFITLSRERPELAAKYLLQPMLAPLQRCLSTAEIPESDMVPGAILVTEEELSRCVEDV
+LDLFH QDKLTARQFQRVATTTFITLSRERP LAAKYLLQP+LAPL RCL+TAE+ ESDMVPG ILVTEEELSRC+EDV
VLDLFHFQDKLTARQFQRVATTTFITLSRERPHLAAKYLLQPVLAPLHRCLNTAELSESDMVPGTILVTEEELSRCIEDV
--------------------------------------------------------------------------------
FKVYVVANEPVPVLLDSLLPLLRVFFSLYCFTQQSVSHIRSLCQEILLWILVKLERKKAIASLKGFSGLDKTVPTLHPQC
FKVYVV NEP+ VL+DSLLP+L V F LYCFT+QSVSHIRSLCQEILLWIL KLERKKAIASLKGF+GLDK VP+LH  C
FKVYVVGNEPLTVLMDSLLPVLGVLFLLYCFTKQSVSHIRSLCQEILLWILGKLERKKAIASLKGFAGLDKAVPSLHSLC
--------------------------------------------------------------------------------
QFRAATHGGIVITAKEAISDD-EDEALYQKVSSEQSQVEHLGDLLLHCQQCGLAGDFFIFCLKELSHLLEDREAEFTPKP
QFR AT GGI+IT KEAISD+ EDEALYQKVSSEQ +VEHLGDLL HCQ+CGLAGDFFIFCLKEL+H+  + E E   +P
QFRVATQGGIMITIKEAISDEDEDEALYQKVSSEQGRVEHLGDLLSHCQECGLAGDFFIFCLKELTHVASENETELKTEP
--------------------------------------------------------------------------------
SCYASLLELEHHQTLLIEDQERKLQVLQLLAVLCEKMSEQIFTHVTQVVDFVAATLQRACAGLAHEAESAVGSQTLSMSM
    SLLELE HQTLL+E QERKL VLQL+AVLCE+MSEQIFT+VTQVVDFVAATLQRACA LAH+AES V SQTLSMSM
FSSKSLLELEQHQTLLVEGQERKLLVLQLMAVLCERMSEQIFTNVTQVVDFVAATLQRACASLAHQAESTVESQTLSMSM
--------------------------------------------------------------------------------
GLVAVMLGGAVQLKSSDFAVLKQLLPLLERVSNTYPDPVIQELAADLRITISTHGAFSTDAVSTAAQSTLNQKDPGQKIE
GLVAVMLGGAVQLKSSDFAVLKQLLPLLE+VSNTYPDPVIQELA DLRITISTHGAF+T+AVS AAQSTLN+KD   KIE
GLVAVMLGGAVQLKSSDFAVLKQLLPLLEKVSNTYPDPVIQELAVDLRITISTHGAFATEAVSMAAQSTLNRKDLEGKIE
--------------------------------------------------------------------------------
EQRQ-TSPDISTEGAQ--K-----------------PPRTGQGSSGPCTATSQPPGSITTQQFREVLLSACDPEVPTRAA
E +Q TS +  T+ A                      P   QG + P T TSQ  GS+TT+Q +EVLLSA DP++PTRAA
E-QQQTSHERPTDVAHSHLEQQQSHETAPQTGLQSNAPIIPQGVNEPSTTTSQKSGSVTTEQLQEVLLSAYDPQIPTRAA
--------------------------------------------------------------------------------
ALRTLARWVEQREARALEEQKKLLQIFLENLEHEDSFVYLSAIQGIALLSDVYPEEILVDLLAKYDSGKDKHTPETRMKV
ALRTL+ W+EQREA+ALE Q+KLL+IFLENLEHED+FVYLSAIQG+ALLSDVYPE+IL DLLA+YDS KDKHTPETRMKV
ALRTLSHWIEQREAKALEMQEKLLKIFLENLEHEDTFVYLSAIQGVALLSDVYPEKILPDLLAQYDSSKDKHTPETRMKV
--------------------------------------------------------------------------------
GEVLMRVVRALGDMVSKYREPLIHTFLRGVRDPDAAHRASSLANLGELCQCLHFLLGPVVHEVTACLIAVAKTDNDVQVR
GEVLMR+VRALGDMVSKYREPLIHTFLRGVRDPD AHRASSLANLGELCQ L FLLG VVHEVTACLIAVAKTD +VQVR
GEVLMRIVRALGDMVSKYREPLIHTFLRGVRDPDGAHRASSLANLGELCQRLDFLLGSVVHEVTACLIAVAKTDGEVQVR
--------------------------------------------------------------------------------
RAAVHVVVLLLRGLSQKATEVLSDVLRDLYHLLKHVVRLEPDDVAKLHAQLALEELDEIMRNFLFPPQKLEKKIVVLP
RAA+HVVVLLLRGLSQKATEVLS VL+DLYHLLKHVV LEPDDVAKLHAQLALEELD+IM+NFLFPPQKLEKKI+VLP
RAAIHVVVLLLRGLSQKATEVLSAVLKDLYHLLKHVVCLEPDDVAKLHAQLALEELDDIMKNFLFPPQKLEKKIMVLP
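For reference, translating a nucleotide sequence in all six frames can be sketched in a few lines of Python. This only illustrates the concept and is not how SatsumaProt is implemented; it assumes the standard genetic code, and the function names and the short test sequence are made up.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    reverse = seq.translate(_COMPLEMENT)[::-1]
    return {
        f"{sign}{offset + 1}": translate(strand[offset:])
        for sign, strand in (("+", seq), ("-", reverse))
        for offset in range(3)
    }


for frame, protein in six_frames("ATGGCCTAAGG").items():
    print(frame, protein)
```

Stop codons appear as "*", so a frame with a long stretch free of "*" is the most likely coding frame.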

Repeat the exercises above with the whitefly data.
