expected learning outcomes

Satsuma is a tool that reliably aligns large and complex DNA sequences, providing maximum sensitivity (to find all there is to find), specificity (to find only real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). Satsuma addresses these three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous ‘battleship’-like search that allows two entire fish genomes (470 and 217 Mb) to be aligned in 120 CPU hours using 15 processors on a single machine. Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines is freely available under the LGPL license.

Grabherr, MG et al. (2010) Genome-wide synteny through highly sensitive sequence alignment: Satsuma. Bioinformatics. 2010 May 1;26(9):1145-51.

getting started

Download the manual from http://www.broadinstitute.org/science/programs/genome-biology/spines and familiarize yourself with Satsuma.

Data directory: ~/wg_jan2012/activities/Satsuma

system requirements

This exercise will not work on Mac OS. If you are working on a Mac, complete this exercise in your VirtualBox installation of Ubuntu.

exercises

exercise 1: set up the Drosophila yakuba vs. Drosophila ananassae project
The files you will need for this exercise are located in ~/wg_jan2012/activities/Satsuma/:

  • dyak.fasta: all of chr2L
  • dana.fasta: scaffold matching to chr2L
  1. Run SatsumaSynteny. Here is some information to help you set up your run. Usage is as follows:
    ##################################################################
    # -q : query fasta sequence
    # -t : target fasta sequence
    # -o : output directory
    # -l : minimum alignment length (def=0)
    # -t_chunk : target chunk size (def=4096)
    # -q_chunk : query chunk size (def=4096)
    # -t_chunk_seed : target chunk size (seed) (def=8192)
    # -q_chunk_seed : query chunk size (seed) (def=8192)
    # -n : number of CPUs (def=1)
    # -ni : number of initial search blocks (def=-1)
    # -lsf : submit jobs to LSF (def=0)
    # -nosubmit : do not run jobs (def=0)
    # -nowait : do not wait for jobs (def=0)
    # -chain_only : only chain the matches (def=0)
    # -refine_only : only refine the matches (def=0)
    # -min_prob : minimum probability to keep match (def=0.99999)
    # -proteins : align in protein space (def=0)
    # -cutoff : signal cutoff (def=1.8)
    # -cutoff_seed : signal cutoff (seed) (def=3)
    # -m : number of jobs per block (def=8)
    # -resume : resumes w/ the output of a previous run (xcorr*data) (def=)
    # -seed : loads seeds and runs from there (xcorr*data) (def=)
    # -pixel : number of blocks per pixel (def=24)
    # -nofilter : do not pre-filter seeds (slower runtime) (def=0)
    # -dups : allow for duplications in the query sequence (def=0)
    ##################################################################

     

    For aligning the examples, run:

    ~/wg_jan2012/software/Spines_latest/SatsumaSynteny -t target_genome -q query_genome -n 2 -m 8 -o output_directory

    This will run on 2 CPUs, frequently updating the results (-m).

  2. Monitor output (MicroSyntenyPlot for visualization). Progress can be checked by:

    cd output_directory
    ~/wg_jan2012/software/Spines_latest/MicroSyntenyPlot -s 80000 -i xcorr_aligns.seeds.out -o seeds.ps

    View the postscript file:
    On mac: open seeds.ps
    On linux: evince seeds.ps &

    to display the initial guess on where to align.
    and by:

    ~/wg_jan2012/software/Spines_latest/MicroSyntenyPlot -s 80000 -i xcorr_aligns.init.out -o init.ps

    View the postscript file:
    On mac: open init.ps
    On linux: evince init.ps &

    to display the synteny-filtered initial guesses.
    and by:

    ~/wg_jan2012/software/Spines_latest/MicroSyntenyPlot -s 80000 -i xcorr_aligns.temp.out -o temp.ps

    View the postscript file:
    On mac: open temp.ps
    On linux: evince temp.ps &

    to display the current state of the search. This file is updated regularly.

  3. Analyze and visualize final results
    File output_directory/satsuma_summary.chained.out contains the final coordinates:

    • Target sequence name (provided by fasta)
    • First target base
    • Last target base
    • Query sequence name (provided by fasta)
    • First query base
    • Last query base
    • Identity
    • Orientation

    Here is an example:

    chrX	5947	6164	chrX	9153	9360	0.626728	+
    chrX	6270	6452	chrX	9472	9654	0.576923	+

    File output_directory/MergeXCorrMatches.chained.out contains the final readable alignments.

    Here is an example:

    Query chr24 [29727636-29727834] vs target scaffold_24 [1206-1404] + length 198 check 198
    Identity (w/ indel count): 52.5253 %
    -------------------------------------------------------------------------------
    
    TCCCCACTTCTAAAGTAAACTGCACATAGGGACTTCTTTCCAAAGAGCACAGTCTGGAAAGGAGGGAAAAACAATTTTAC
           ||  |||      |  | ||     ||   ||||||| |  || ||  | |||||    || || ||||| ||
    ATATATTTTTAAAATATCTATTAAAATCAAACCTATGTTCCAAATATTACGGTACGAAAAGGGAAAAATAAGAATTTCAC
    
    WYMYMWYTTYWAAAKWWMWMTKMAMATMRRRMCTWYKTTCCAAAKAKYACRGTMYGRAAAGGRRRRAAWAASAATTTYAC AMB
    -------------------------------------------------------------------------------
    
    AGTCTATAAACCTGATAAACACTACCTCAGCCAGGTGCTCAAGGGCAACATCAAGACTCGTAAGTCATGTTGATAGTAGA
    ||   | ||  |||  |  ||| |||| | ||| ||| ||||||  ||  |||    || | |||||  ||||||  |
    AGCAAAGAAGTCTGGCAGTCACCACCTTAACCAAGTGATCAAGGTTAATGTCACTGATCATGAGTCACATTGATATAATG
    
    AGYMWAKAARYCTGRYARWCACYACCTYARCCARGTGMTCAAGGKYAAYRTCAMKRMTCRTRAGTCAYRTTGATAKWAKR AMB
    -------------------------------------------------------------------------------
    
    TCCTATTGATATGCTTTGCAAGGACAGAGTAATTGACA
    | |     ||||| | ||    ||  |     |   |
    TACCCCCCATATGATGTGATGAGAAGGGCATTTCACCT
    
    TMCYMYYSATATGMTKTGMWRRGAMRGRSWWWTYRMCW AMB
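
The coordinates in satsuma_summary.chained.out lend themselves to a quick tally on the command line. The following is a minimal sketch, assuming eight whitespace-separated columns in the order listed in step 3 (target name, first and last target base, query name, first and last query base, identity, orientation) and assuming that a block's length is the last base minus the first, as the "length 198" header in the readable alignment suggests. The function name summarize_chained and the temporary sample file are illustrative and not part of Satsuma.

```shell
#!/bin/sh
# Hypothetical helper (not part of Satsuma): tally a chained summary file.
# Assumed columns: target, t_first, t_last, query, q_first, q_last,
# identity (fraction between 0 and 1), orientation (+ or -).
summarize_chained() {
    LC_ALL=C awk '
        NF >= 8 {
            blocks++
            span  += $3 - $2            # target bases covered by this block
            ident += $7                 # running sum of identities
            if ($8 == "-") reverse++    # blocks on the reverse strand
        }
        END {
            if (blocks == 0) { print "no alignment blocks found"; exit 1 }
            printf "blocks=%d target_span=%d reverse=%d mean_identity=%.4f\n",
                   blocks, span, reverse, ident / blocks
        }
    ' "$1"
}

# Demo on two rows taken from the step 3 example, written to a temporary file.
sample=$(mktemp)
printf 'chrX\t5947\t6164\tchrX\t9153\t9360\t0.626728\t+\n'  > "$sample"
printf 'chrX\t6270\t6452\tchrX\t9472\t9654\t0.576923\t+\n' >> "$sample"
summary=$(summarize_chained "$sample")
echo "$summary"
rm -f "$sample"
```

On a real run, the same function would be called as summarize_chained output_directory/satsuma_summary.chained.out. The mean here is unweighted; weighting each identity by block length may be more appropriate when block sizes vary widely.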

exercise 2: set up the Candida albicans vs. Candida tropicalis project

  1. Run SatsumaSynteny
  2. Monitor output (MicroSyntenyPlot for visualization)
  3. Analyze and visualize final results
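
The three steps mirror exercise 1, so the commands can be collected in a small script. This is a minimal sketch that only prints the commands (a dry run) so they can be inspected before anything is executed. The FASTA file names calb.fasta and ctro.fasta, the output directory candida_out, and the choice of which species is target and which is query are placeholders rather than the actual course file names; substitute the files found in the data directory.

```shell
#!/bin/sh
# Dry-run sketch for exercise 2. All file and directory names below are
# placeholders (assumptions), not the actual course file names.
SPINES="$HOME/wg_jan2012/software/Spines_latest"
TARGET="calb.fasta"     # hypothetical: Candida albicans sequence
QUERY="ctro.fasta"      # hypothetical: Candida tropicalis sequence
OUT="candida_out"       # hypothetical output directory

# Step 1: the alignment command, with the same options as in exercise 1.
run_cmd="$SPINES/SatsumaSynteny -t $TARGET -q $QUERY -n 2 -m 8 -o $OUT"
echo "$run_cmd"

# Step 2: one plot command per intermediate file (seeds, init, temp).
plot_cmds=""
for stage in seeds init temp; do
    cmd="$SPINES/MicroSyntenyPlot -s 80000 -i xcorr_aligns.$stage.out -o $stage.ps"
    plot_cmds="$plot_cmds$cmd
"
    echo "(cd $OUT && $cmd)"
done

# Step 3: the final result files to inspect once the run has finished.
echo "less $OUT/satsuma_summary.chained.out"
echo "less $OUT/MergeXCorrMatches.chained.out"
```

Printing rather than executing keeps the sketch safe to run anywhere; once the placeholders are replaced, each printed line can be pasted into the terminal in order.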