De novo Assembly of Illumina Reads with Velvet and Read Alignment with Bowtie

table of contents

expected learning outcome

The objective of this activity is to help you understand how to run Velvet, what its primary parameters do, how to accurately estimate the insert size of a paired-end library using Bowtie, and the overall process of producing a de novo assembly from Illumina reads.

useful resources

getting started

We will be assembling a bacterial genome known to be 1.63 Mb in length. The data set includes two Illumina libraries. The first is a 300bp-insert paired-end lane, reduced by half to satisfy the time and memory constraints of this activity; these reads are 54bp long. The second is a 3kb mate-pair lane which has not been reduced at all; these reads are 38bp long.

Even for small datasets, velvet uses a lot of memory, since a single graph containing all of the sequence information is needed for an assembly. For this activity you will need a computer with at least 2GB of RAM. If you are using a Mac, it is recommended to simply run natively instead of inside VirtualBox. To check the amount of memory on your Mac, click the Apple menu at the top left and select “About This Mac”. For those of you using the Workshop’s Ubuntu Linux distribution, you can determine the amount of memory by going to System->Administration->System Monitor. We will be running velvet multiple times to explore the parameter space. Each separate kmer assembly will use up to 1GB of disk space and may take upwards of 30 minutes to run depending on your parameters. If you follow the entire activity with the suggested commands, it will use upwards of 20GB of disk space. You can easily determine your disk space usage with the command:

df -h

The -h flag specifies human readable and should report in MB and GB values. If you want to run multiple assemblies with different parameters, plan accordingly. It may also be wise to coordinate with those working around you to save time and disk space by deciding which assemblies each person should run. If your system does not satisfy these requirements, partner with another participant whose computer does! We will share the results of each assembly in a spreadsheet on the computer at the front of the lab as we go along, so be sure to enter your results (as well as parameters used) so that others in the lab can see the cumulative results to make informed decisions about how to set their parameters from one run to the next.

exercise 1: preparing fastq files for velvet

cd to the ~/wg_jan2012/activities/Velvet_and_Bowtie directory, do an ls, and you will see five files:

  • 300bp_pe_1.fastq.gz
  • 300bp_pe_2.fastq.gz
  • 3kb_mp_1.fastq.gz
  • 3kb_mp_2.fastq.gz
  • NODE_3_length_277033_cov_81.263977.fa

There are four compressed fastq files and a previously assembled contig that we will use in exercise 3. While velvet can handle these compressed files, bowtie cannot, so we will uncompress them now for use with the rest of the activity. Run gunzip on each compressed file:

gunzip 300bp_pe_1.fastq.gz
gunzip 300bp_pe_2.fastq.gz
gunzip 3kb_mp_1.fastq.gz
gunzip 3kb_mp_2.fastq.gz

(Note: When uncompressed, these files will use ~2.3GB of disk space.)

These files represent a paired-end (pe) run, for which the orientation of the reads is => insert <=, and a mate-pair (mp) run, for which the orientation of the reads is <= insert =>. Velvet assumes our reads are oriented in a paired-end fashion, so the mate-pair reads have to be reverse-complemented to meet this assumption. This can be done with one of the FASTX-Toolkit modules:

fastx_reverse_complement -i 3kb_mp_1.fastq -o 3kb_mp_1RC.fastq
fastx_reverse_complement -i 3kb_mp_2.fastq -o 3kb_mp_2RC.fastq
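
What fastx_reverse_complement does to each read can be sketched with rev and tr. This toy pipeline handles a bare sequence only; the real tool also reverses the matching quality string and leaves the headers intact:

```shell
# Reverse-complement a single sequence: reverse the characters with rev,
# then swap A<->T and C<->G with tr.
echo "TGTTCTTATT" | rev | tr 'ACGT' 'TGCA'
# -> AATAAGAACA
```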

Velvet only handles interleaved or “shuffled” fasta and fastq files, where each pair is seen one after the other. For example:

@HWI-EAS210R_0001:6:1:3:663#CGATGT/1
TGTTCTTATTGGACCAGAANGAGGATTTAGTGAAGAAG
+HWI-EAS210R_0001:6:1:3:663#CGATGT/1
dgggggggggfggddbb]_BTTSSSfffdddddd^`cc
@HWI-EAS210R_0001:6:1:3:663#CGATGT/2
CTACTCTCTAACAGAAAGCAGATAGTCAAACCGTGTAA
+HWI-EAS210R_0001:6:1:3:663#CGATGT/2
d`Yd`fff`dffefde^cffee^^^Ic_cceeee^WcS

Velvet includes a Perl script to perform this shuffling from the original two separate unshuffled sequence files:

shuffleSequences_fastq.pl 300bp_pe_1.fastq 300bp_pe_2.fastq 300bp_pe_shuffled.fastq
shuffleSequences_fastq.pl 3kb_mp_1RC.fastq 3kb_mp_2RC.fastq 3kb_mp_shuffled.fastq
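
The interleaving that shuffleSequences_fastq.pl performs can be sketched with paste, assuming the mates appear in the same order in both files (tiny fabricated one-read files shown here):

```shell
# Two toy FASTQ files, one 4-line record each:
printf '@r1/1\nACGT\n+\nIIII\n' > toy_1.fastq
printf '@r1/2\nTTGG\n+\nIIII\n' > toy_2.fastq
# Flatten each 4-line record onto one tab-joined line, pair the records
# side by side, then unfold the tabs back into newlines:
paste - - - - < toy_1.fastq > toy_1.flat
paste - - - - < toy_2.fastq > toy_2.flat
paste toy_1.flat toy_2.flat | tr '\t' '\n'
```

This relies on the reads containing no tab characters, which is safe for FASTQ sequence and quality lines.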

exercise 2: running velveth

We will start velvet now and discuss it while it runs. Run the following command:

velveth auto 31,45,2 -fastq -shortPaired1 300bp_pe_shuffled.fastq -fastq -shortPaired2 3kb_mp_shuffled.fastq

Velveth reads in these sequence files and simply produces a hashtable and two output files (Roadmaps and Sequences) which are necessary for the subsequent program, velvetg. To see the help message for velveth, simply run:

velveth

When we ran velveth, we specified auto as our directory name. We then specified the hash length as 31,45,2, which runs velveth for hash lengths from 31 up to (but not including) 45 with a step of 2, i.e. 31-43 (note: k-mers have to be odd). This creates seven directories named auto_31 .. auto_43. To save disk space, velvet stores the Sequences file in the first directory (in this case auto_31) and symbolically links to it from the others. We then specified our two input files, each preceded by -fastq and its -shortPaired category flag.

exercise 3: running velvetg and determining optimal k

The first step is to run velvetg on the different values of k, to determine the optimal value of k. Choose a value of k that you want to run first and start velvetg by running the following command:

velvetg auto_33 -exp_cov auto

This command will automatically estimate the expected kmer coverage (note: this is different from the actual read coverage), the insert sizes of the individual paired-end and mate-pair libraries, and the coverage cutoff, a parameter used to filter low-coverage contigs from your assembly. All of these parameters can be specified manually, which we will do later in this activity. Look up the estimated kmer coverage, the insert sizes, and their standard deviations in the on-screen output of velvet and write them down. When velvetg finishes, it will report the number of nodes, the n50, and the max and total size of the assembly. If you look in the auto_* directory, you will also see a few files:

contigs.fa LastGraph PreGraph Sequences Graph2 Log Roadmaps stats.txt

These files are explained in detail in the manual, but the most useful files for post-analysis are the contigs.fa, Log, and stats.txt files. Running the following custom script will output the n50 as well as n90 values for this assembly:

calculateN50.pl auto_*/contigs.fa

You may notice that this n50 value is slightly different from what was reported by velvet. This is because velvet reports its n50 (as well as everything else) in kmer space. For example, the relationship between read coverage and kmer coverage is defined by the following:

Ck = C ∗ (L−k+1)/L

Where C=read coverage, L=read length, and k=kmer length. For other quantities, such as a contig length, simply add k-1 to the reported kmer-space value.
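
As a worked example, the kmer coverage of 99 used later for the auto_33 assembly, with our 54bp paired-end reads, corresponds to a considerably higher read coverage:

```shell
# C = Ck * L / (L - k + 1): recover read coverage from kmer coverage.
# Example numbers: Ck = 99 (exp_cov at k = 33), read length L = 54.
awk 'BEGIN { Ck = 99; L = 54; k = 33; printf "%.1f\n", Ck * L / (L - k + 1) }'
# -> 243.0
```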

exercise 4: running bowtie to estimate insert size distribution

Although the -exp_cov auto option gives you a rapid way to run velvet with insert sizes that reflect the actual insert sizes of your paired-end/mate-pair libraries, you can also estimate the insert lengths by mapping all of the paired reads to a large segment of known sequence (such as a sequenced BAC) or to a large contig that you have just produced with an assembly and that you trust. In this case, we will be aligning to a contig previously assembled from this bacterium. This will also allow us to estimate the standard deviation of the insert size of each library. You can then set the -ins_length* and -ins_length*_sd velvet parameters more accurately than the automatic estimates. You will first need to create a bowtie index. Run bowtie-build on the provided long contig:

bowtie-build NODE_3_length_277033_cov_81.263977.fa contig1.build

This will create six new files, which constitute the bowtie index:

  • contig1.build.1.ebwt
  • contig1.build.2.ebwt
  • contig1.build.3.ebwt
  • contig1.build.4.ebwt
  • contig1.build.rev.1.ebwt
  • contig1.build.rev.2.ebwt

We can now map our paired reads in fastq format (unshuffled) to the contig1.build reference sequence:

bowtie -S -q --solexa1.3-quals -p 1 -I 100 -X 600 --fr contig1.build -1 300bp_pe_1.fastq -2 300bp_pe_2.fastq 300bp_pe_def.sam
bowtie -S -q --solexa1.3-quals -p 1 -I 1000 -X 6000 --rf contig1.build -1 3kb_mp_1.fastq -2 3kb_mp_2.fastq 3kb_mp_def.sam

The bowtie manual is very extensive, and you should read it if you require more information. Here we specified -S to output SAM format, -q for fastq input, --solexa1.3-quals for the Illumina 1.3+ quality encoding, -p 1 for the number of processors (increase this on multi-processor/multi-core machines), -I for the minimum insert size, -X for the maximum insert size, --fr or --rf for the expected pair orientation, then the index basename, -1 for the left-end reads, -2 for the right-end reads, and finally the output file name. Unfortunately, with the -S option bowtie writes every single read to the SAM file regardless of whether it actually aligns to the reference. This creates a rather large file (in this case about 1GB), much of which is unmapped reads. To remove these unmapped reads you can use awk:

awk '$3!="*"' 300bp_pe_def.sam > 300bp_pe_def.filt.sam
awk '$3!="*"' 3kb_mp_def.sam > 3kb_mp_def.filt.sam

This awk command searches the 3rd column ($3) of the tab-delimited SAM file to see if it is a “*” character (which means there was not a match for that particular read) or the name of a contig or reference sequence that you indexed (hence, an actual match). Make sure your awk command actually worked, and if you want to save some space you may decide to remove the original bowtie SAM file at this point:

rm 300bp_pe_def.sam
rm 3kb_mp_def.sam
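
As a quick sanity check, the column-3 filter can be tried on two fabricated SAM lines, one mapped to a made-up reference name and one unmapped (a more robust alternative is to filter on the unmapped flag, bit 0x4 of column 2, e.g. with samtools view -F 4):

```shell
# r1 is mapped to "NODE_3"; r2 is unmapped ("*" in column 3).
# The filter should keep only the r1 line.
printf 'r1\t0\tNODE_3\t100\t42\t38M\t*\t0\t0\tACGT\tIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' \
  | awk '$3!="*"'
```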

exercise 5: viewing mapped pairs with Tablet

To launch tablet in the Ubuntu distribution, first start by typing the following commands in a terminal window:

cd ~/wg_jan2012/local/bin
rm tablet
ln -s ~/wg_jan2012/software/Tablet/tablet .

Then open tablet on the Ubuntu distribution by simply running:

tablet

On the Mac OS X distribution, install Tablet by double-clicking the provided Tablet .dmg and going through the installer steps, then run Tablet from your Applications folder. Once Tablet has loaded:

  • Click on “Open Assembly”.
  • Choose one of the filtered SAM files obtained previously as the primary assembly. Pick either 300bp_pe_def.filt.sam or 3kb_mp_def.filt.sam, and try to check with your neighbor so that each of you pick a different one.
  • Select the contig against which the reads were mapped in bowtie as the reference file (NODE_3_length_277033_cov_81.263977.fa) and click “Open”.
  • Once the assembly has loaded (this can take a while), select the only contig on the left menu, and wait until it loads (this also may take a while).
  • Explore the visualization features of the program, some of which are:
    • the navigation tools (zoom, …)
    • the various levels of highlighting the variants
    • the different layout styles
    • the different pack styles
    • in the advanced tab, click on “Coverage”
  • In a paired-end pack mode, mouse over some pairs, and have a look at the displayed information.
  • In the coverage mode, click in the low coverage regions in the overview window (above the main read display area).

exercise 6: re-running velvetg with optimized -ins_length and -ins_length_sd parameters

While Tablet helps to visualize the insert sizes, we can get a more precise estimate. The custom Perl script get_insert_sizes_from_sam.pl parses the filtered SAM files that we created earlier and determines the distance between the mapped mates of each pair:

get_insert_sizes_from_sam.pl 300bp_pe_def.filt.sam > 300bp_pe_def.sizes
get_insert_sizes_from_sam.pl 3kb_mp_def.filt.sam > 3kb_mp_def.sizes
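
The core of such a script can be sketched in awk: for a properly mapped pair, SAM column 9 (TLEN) of the leftmost mate already holds the observed insert size. The SAM line below is fabricated, and the actual Perl script may do more bookkeeping:

```shell
# Flag 99 = paired, properly aligned, first in pair; TLEN ($9) is
# positive for the leftmost mate, so printing positive TLENs yields one
# insert size per pair.
printf 'r1\t99\tNODE_3\t100\t42\t38M\t=\t416\t354\tACGT\tIIII\n' \
  | awk '$9 > 0 { print $9 }'
# -> 354
```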

Take a look at the output file; it is simply a text file with the insert size for each mapped pair. We can then input this .sizes file into R for analysis. Open R on the Ubuntu distribution by simply running:

R

On the Mac OS X distribution, you will need to install R first if you do not already have it. You can find it in ~/wg_jan2012/software/R-2.13.1/R-2.13.1.pkg. Double-click this .pkg and go through the installer steps to install R. Then you can run R from the command line as well. Type the following commands in R to calculate the mean and standard deviation, and draw a histogram of the insert size distributions:

sizes <- as.numeric(readLines("300bp_pe_def.sizes"))
mean(sizes, na.rm=TRUE)
median(sizes, na.rm=TRUE)
sd(sizes, na.rm=TRUE)
hist(sizes, nclass=50)

sizes <- as.numeric(readLines("3kb_mp_def.sizes"))
mean(sizes, na.rm=TRUE)
median(sizes, na.rm=TRUE)
sd(sizes, na.rm=TRUE)
hist(sizes, nclass=50)
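
If R is unavailable, the same mean and sample standard deviation can be sketched in awk, assuming one insert size per line as the .sizes files contain (toy numbers shown here):

```shell
# Toy .sizes file with five values, then a one-pass mean / sample sd:
printf '1\n2\n3\n4\n5\n' > toy.sizes
awk '{ n++; s += $1; ss += $1*$1 }
     END { m = s/n; printf "mean=%.1f sd=%.1f\n", m, sqrt((ss - n*m*m)/(n-1)) }' toy.sizes
# -> mean=3.0 sd=1.6
```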

When you are satisfied, you can quit R by typing q(). Are the values you obtained from this exercise different from the ones velvet estimated? If so, why might that be? We can now use the -ins_length* and -ins_length*_sd velvet parameters (with either velvet's estimates or the bowtie-derived values) to perform the assembly more accurately. With these new insert size estimates, run another velvetg assembly:

velvetg auto_33 -exp_cov 99.0 -ins_length1 354 -ins_length1_sd 43 -ins_length2 4252 -ins_length2_sd 821

exercise 7: additional velvetg parameters

Type velvetg from the command line to get the help:

velvetg

The velvet manual provides additional details for each parameter. The purpose of this exercise is to explore the parameters of interest and decide how they may affect the output of the assembly. You can cover more of velvet's parameter space by running assemblies with different parameter values than your neighbors. Try adjusting the cov_cutoff parameter: take a look at stats.txt, which provides per-contig coverage information starting in the 6th column, decide on a reasonable value for cov_cutoff, and re-run velvetg, perhaps even a few times. The same goes for the max_coverage parameter.
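
One quick way to choose a cov_cutoff is to scan the 6th column of stats.txt for low-coverage nodes. The two-node table below is fabricated for illustration (real stats.txt column values will differ):

```shell
# Fabricated stats.txt-style table; print nodes whose short-read kmer
# coverage (column 6) falls below a candidate cutoff of 5:
printf 'ID\tlgth\tout\tin\tlong_cov\tshort1_cov\n1\t500\t1\t1\t0\t45.2\n2\t60\t1\t1\t0\t2.1\n' \
  | awk 'NR > 1 && $6 < 5 { print "low-coverage node:", $1, "cov:", $6 }'
# -> low-coverage node: 2 cov: 2.1
```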

The min_pair_count variable is another parameter you may want to adjust. The default min_pair_count is 5, but after looking at the bowtie-mapped reads in tablet you may decide that this is too high or too low. Making it too low may result in scaffolds that are improperly connected during the velvet scaffolding step and thus mis-assembled. In fact, 5 is probably too low to start with for such a high-coverage data set.

Remember to share your results in the Excel file at the front of the lab so we can discuss them at the end of the activity.