expected learning outcome

The objective is to become familiar with the basic Newbler/gsAssembler analysis procedures. The walkthrough focuses on a transcriptome analysis, which produces slightly different output files than a genome assembly.

getting started

The following exercises use a set of reads from the NCBI Trace Archive that has been converted to SFF format. SFF (Standard Flowgram Format) is the binary format produced by a 454 pyrosequencing run.

You will find the file used in these exercises in the WCG Ubuntu installation. Go to the DataAnalysis directory:
cd ~/Documents/WCG/packages/DataAnalysis_2.5.3/

exercise 1: investigate and extract data from the SFF file

Locate the file called 454transcriptome.sff. Check the file with the following command at the command line:

sffinfo 454transcriptome.sff | less

The sffinfo program will output text data to the screen and with the pipe sign (|) followed by less it is possible to view the text page by page. You can use the arrows to navigate up or down in the text. Press q to quit and return to the command line.

Look at the line starting with # of Reads to see how many sequences the SFF file contains. Please note that this is a small set compared to what you would normally expect from an SFF file (hundreds of thousands of reads).

  • How many reads does the SFF file contain?
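If you would rather not page through the header by hand, the line of interest can be pulled out with a small filter. The following is a minimal sketch: read_count is a hypothetical helper name (it is not part of the 454 tools), and the header excerpt in the demo is made up for illustration.

```shell
# read_count (hypothetical helper): read sffinfo-style text on stdin and
# print only the value found on the "# of Reads" line.
read_count() {
    awk -F: '/# of Reads/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Demo on a made-up header excerpt (values are illustrative only,
# they are NOT taken from 454transcriptome.sff).
printf '%s\n' \
    'Common Header:' \
    '  # of Reads:    1234' \
    '  # of Flows:    800' | read_count
```

With the 454 tools installed, the same function would be used as: sffinfo 454transcriptome.sff | read_count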

Run sffinfo again but this time retrieve the sequences in fasta format:

sffinfo -seq 454transcriptome.sff >454transcriptome.fna

This time we are redirecting the output (with the > sign) to a file called 454transcriptome.fna, which is a text file in FASTA format. Use ls to see that the file has been generated and less to view the contents of the file.

You can check the number of sequences extracted and compare with the number you retrieved previously. It should be the same number of reads:

cat 454transcriptome.fna | grep -c ">"

cat sends the content of the file through the pipe, and grep selects the lines that contain the “>” sign (i.e. only the header of each sequence). With the -c option, grep prints the number of matching lines instead of the lines themselves.
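A slightly stricter variant of the same count is sketched below. It assumes only standard grep behaviour: the pattern is anchored with ^ so that only lines starting with ">" are counted, and grep reads the file directly. The helper name count_fasta_records and the sample file are made up for illustration.

```shell
# count_fasta_records (hypothetical helper): count the lines in file $1
# that START with ">", i.e. the FASTA header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Demo on a small made-up FASTA file (three records).
demo=$(mktemp)
printf '%s\n' '>read1' 'ACGT' '>read2' 'GGCC' '>read3' 'TTAA' > "$demo"
count_fasta_records "$demo"
rm -f "$demo"
```

On the workshop file the call would be: count_fasta_records 454transcriptome.fna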

Finally, you can run the sffinfo command with the -qual option to extract the quality scores from the SFF file. Have a look at the resulting file using less.

sffinfo -qual 454transcriptome.sff >454transcriptome.qual
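As a quick consistency check, the sequence file and the quality file should describe the same reads. The sketch below assumes that both files use ">" header lines, one per read; same_record_count is a hypothetical helper name and the demo files are made up.

```shell
# same_record_count (hypothetical helper): print the number of ">" header
# lines in files $1 and $2, and succeed only if the two numbers are equal.
same_record_count() {
    n_seq=$(grep -c '^>' "$1")
    n_qual=$(grep -c '^>' "$2")
    echo "sequences: $n_seq, quality records: $n_qual"
    [ "$n_seq" -eq "$n_qual" ]
}

# Demo on two small made-up files with two records each.
fna=$(mktemp)
qual=$(mktemp)
printf '%s\n' '>read1' 'ACGT' '>read2' 'GGCC' > "$fna"
printf '%s\n' '>read1' '30 30 28 25' '>read2' '31 29 27 20' > "$qual"
same_record_count "$fna" "$qual" && echo "counts match"
rm -f "$fna" "$qual"
```

On the workshop files the call would be: same_record_count 454transcriptome.fna 454transcriptome.qual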

exercise 2: running Newbler graphical interface

Great! Now let us have a look at one of the strengths of Newbler, the graphical interface. Run gsAssembler (which is commonly called Newbler).

gsAssembler&

You will see a Quick start page where you can select to generate a new assembly project or open previous projects. Click on the link called New Assembly Project.

You should get a popup window where you can fill in the name of the project, specify the location to save the output data, and specify what type of sequence analysis you would like to perform. The alternatives are cDNA (transcriptome) or genome. I will call the project “testproject”. Select cDNA.

Press OK and you have generated the file structure for the new transcriptome project.

Press the tab called Project and then the + button to the left. This will allow you to select one or more SFF files for analysis. Select the 454transcriptome.sff file and press OK (you do not need to make any changes to the options).

The 454transcriptome.sff file should now appear in the table.

Press Start to begin the assembly.

You can follow the progress in the bottom window. The analysis should be finished within a few minutes (depending on your computer).

Once the assembly is completed you can view the output files from the command line or by pressing the Result files tab. Clicking the file names in the left window will open the first 5000 lines for viewing.

Have a look at the files to get an idea of their contents. The most important is 454NewblerMetrics.txt, which gives an overview of the assembly. Towards the end of the file you can see how many isogroups and isotigs were generated. You can think of an isogroup as a putative gene and of its isotigs as the alternative splice variants (transcripts) of that gene. The file also reports basic statistics such as average length and coverage.
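The isogroup and isotig counts can also be read from the command line. The sketch below rests on an assumption you should verify against your own file: that 454NewblerMetrics.txt reports these counts on lines of the form "numberOfIsogroups = N;" and "numberOfIsotigs = N;". metric_value is a hypothetical helper name, and the demo file is a made-up excerpt.

```shell
# metric_value (hypothetical helper): print the value of the first
# "key = value;" line in file $2 whose key equals $1.
# ASSUMPTION: the metrics file uses this "key = value;" layout.
metric_value() {
    awk -v key="$1" '$1 == key { gsub(/;/, "", $3); print $3; exit }' "$2"
}

# Demo on a made-up excerpt (numbers are illustrative only).
demo=$(mktemp)
cat > "$demo" <<'EOF'
isogroupMetrics
{
    numberOfIsogroups = 12;
}
isotigMetrics
{
    numberOfIsotigs = 15;
}
EOF
echo "isogroups: $(metric_value numberOfIsogroups "$demo")"
echo "isotigs: $(metric_value numberOfIsotigs "$demo")"
rm -f "$demo"
```

If the key names differ in your version of the software, a case-insensitive search such as grep -i isotig 454NewblerMetrics.txt will show the relevant lines.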

Click Alignment results to view the alignments of the individual isotigs that were generated in the assembly process.

Select one isotig and select a position in the alignment. Try to find a region where the reads disagree. These positions are marked yellow. The selected column should change color so that the base or bases turn white on a blue background.

Press the Flowgrams tab at the top to view the flowgrams of the selected region. The top flowgram shows the consensus sequence and the bottom flowgram shows the currently selected read; you can change which read is displayed by changing the Read Name in the left window. A small green triangle in the flowgram marks the base you previously selected. Hover over the flowgrams to see the flow and count values in the box to the left. Can you see any evidence of homopolymer problems in the sequence?

Close the graphical interface by clicking Exit and selecting Yes when asked whether to save the project.

exercise 3: running Newbler in command line

A graphical interface is useful, but for routine analysis it is more convenient to run the analyses from the command line. This is also possible for Newbler. Run the following commands:

runAssembly -o testproject2 -cdna 454transcriptome.sff

cd testproject2

ls

Newbler will create a directory called testproject2 and put the result files there. You can view the files with less. These are the same files that were previously generated by the graphical interface.
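For routine use, a first look at the results can be scripted as well. The sketch below counts the FASTA records in every .fna file of a result directory; summarise_fasta_dir is a hypothetical helper name, and the demo directory and file names are made up rather than real Newbler output.

```shell
# summarise_fasta_dir (hypothetical helper): for every .fna file in
# directory $1, print the file name and its number of ">" header lines.
summarise_fasta_dir() {
    for f in "$1"/*.fna; do
        [ -e "$f" ] || continue    # skip if the glob matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}

# Demo on a made-up directory with two small FASTA files.
demo_dir=$(mktemp -d)
printf '%s\n' '>seq1' 'ACGT' '>seq2' 'GGCC' > "$demo_dir/a.fna"
printf '%s\n' '>seq1' 'TTAA' > "$demo_dir/b.fna"
summarise_fasta_dir "$demo_dir"
rm -rf "$demo_dir"
```

From inside testproject2 the call would be: summarise_fasta_dir .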