Contig Assembly Using Minia - Evolution and Genomics

Introduction

Minia is a software for ultra-low memory DNA sequence assembly. It takes as input a set of short genomics sequences (typically data produced by the Illumina DNA sequencer). Its output is a set of contigs (assembled sequences) forming an approximation of the expected genome. Minia is based on a succinct representation of the de Bruijn graph. The computational resources required to run Minia are significantly lower than that of other assemblers. For example, it is able to assemble a human genome using only 5.7 GB of RAM.

Of note, Minia is only a contig assembler. It performs the first stage of a complete assembly (contig assembly of short reads) in a computationally efficient manner. Additional software is required to assemble these short contigs into longer ones.

Minia is primarily developed for the assembly of Illumina reads typically input as FASTQ files. However, it will accept FASTA input files, so it is technically possible to input reads from other sequencers using this format. However, this is neither recommended or supported.

In addition, Minia does not take the quality values from FASTQ files into consideration during assembly. Instead, during the first step of assembly Minia quantifies the abundance of possible k-mers and only uses k-mers above a threshold for assembly. This should eliminate a majority of erroneous k-mers.

Links:

This tutorial is for version 1 of Minia. Minia 1.x can be downloaded from: http://minia.genouest.org/files/minia-1.6906.tar.gz

Minia has Biostar support site which can be found at: http://biostar.genouest.org/

A short manual for Minia can be found here: http://minia.genouest.org/files/manual.pdf

Installation:

Note: if you used the customized workshop USB flash drive for software installation, you already have this program and do not need to install it. Minia is also installed on our AWS instance.

Minia can be run on any computer using Linux or Mac operating systems. You can download the program from http://minia.genouest.org/. Unzip them to your preferred directory and type make.

It is possible to compile Minia to consider long k-mer lengths. To do so, you need to specify this during the installation process. For example, to explore k-mers up to 100 in length you would use the following command:

make clean && make k=100

Memory Usage:

Rougly, Minia will require 3 GB of RAM per gigabases in the target genome.

Parameters:

A typical Minia 1.x run uses the following command structure:

minia input_file kmer_size min_abundance estimated_genome_size output_prefix

A completed example might look like the following:

minia reads.fastq 31 3 4500000 output_prefix

Where reads.fastq is the location of your input file, 31 is the k-mer length, 3 is the minimum abundance, 4500000 is an inaccurate estimate of the expected genome size (4.5 Mbp) and output_prefix indicates where the output files will be written. More details about each of these parameters is below.

input_file: the input file

FASTA or FASTQ: Minia ignores quality information encoded in a FASTQ file. It will accept paired or mate-pairs, but discards pairing information. Input files of different types can be mixed (i.e. FASTA and FASTQ, zipped and unzipped (see below))

Multiple Files: Minia will accept input from multiple, separate sequence data files. You will need to create a file listing all of the files you would like Minia to use. One file per line.

Compressed Files: Minia can directly read files compressed with gzip. Compressed files should end with .gz.

Output

The output of Minia is a set of contigs in the FASTA format in the file contigs.fa.

Exercise:

For this exercise we will be assembling contigs using sequence data obtained using the Illumina MiSeq platform. This is a high-quality data set produced and provided by Illumina. Details about the generation of the sequenced data can be found on their web site http://www.illumina.com/systems/miseq/scientific_data.ilmn. In short, genomic DNA was prepared from E. coli strain K12 MG1655 and sequenced on the MiSeq instrument using a paired-end 2 x 150 bp protocol. The genome of E. coli K12 is ~4.6 million base pairs.

We will also use a variety of other programs to assess assembly quality and also to compress and decompress large sequence data files. A list of these programs with brief description of what they do is below:

Quip: Aggressive compression of FASTQ, SAM and BAM files. This will help us to reduce the amount of drive space we take up and decrease data transfer times

Minia: software for ultra-low memory DNA sequence assembly

Bowtie2: mapps reads to a reference genome

Tablet: genome assembly viewer

QUAST: Calculates a variety of assembly metrics

SSPACE: connects contigs into larger scaffolds

You may choose to work either on your local Ubuntu VM, or on AWS. If your machine is on the slower/older side, we would recommend trying AWS.

If your choose to run this activity on AWS, you will need to log-in as we have before at: https://genomics2013.signin.aws.amazon.com/console.

Use the instructions from earlier in the week. You will need to:

1) Enter name and password

2) Select EC2

3) Launch new instance

4) Classic wizard

5) Select the My AMI’s tab

6) Select the AMI with WG_JAN2013 in the title

7) Under instance details make sure to select M1 Large

7) Continue through the remaining windows

8) Log-in via ssh using instructions from Konrad’s exercise

Konrad’s instructions for logging into the cloud can be found here.

Step 0.1: Navigate to the Workshop activities directory

You can find the example files in:

cd ~/wg_jan2013/activities/minia/

Step 0.2: Decompress the Illumina MiSeq data files and generate smaller subsets

Data generated on Illumina platforms generally results in very large data files. Working with such large data files can be challenging. Fortunately, compression algorithms have been developed to reduce FASTQ, SAM and BAM file size. This can aid with long term storage and file transfer across networks which can be a major bottleneck for sharing data with collaborators or working with cloud based services.

Quip is implements an aggressive file compression algorithm designed and is designed specifically for FASTQ files. It can typically reduce FASTQ, SAM and BAM files to 15% of their original size. This is particularly useful in the Workshop, but you may also find use of it in your own research so we will briefly introduce it here. Please note, file compression has nothing to do with assembly per se, this is only a method to compress the files we will be using for assembly.

Quip compressed files have the *.qp file extension. Decompress the file names Read_1_25.fastq.qp.

unquip Read_1_25.fastq.qp

Decompression will take a minute or two and you should end up with a ~467 MB file called Read_1_25.fastq.qp.

This file contains contains 1,432,368 reads. This is actually on 25% (hence the _25 in the filename) of the total reads that were sequenced. We are going to use this smaller subset of data to make the exercises doable during the allotted time. The other files contain increasing numbers of reads (e.g. Read_1_50.fastq has 50%). You may not have enough time to explore working with these denser data files, but they are included in case you have more time and wish to explore.

Step 1: de novo contig assembly using Minia

During experimental contig assembly you would want to assess the effect of read abundance, length of k and minimum abundance on the assembly. However, due to time constraints we will continue to work with the 25% read data set and will be evaluating only the effect of k on contig assembly.

We will be evaluating contig construction using QUAST which is an utility which takes a series of contigs as inputs and provides extensive metrics on their distributions.

Recall that the general syntax for Minia is:

minia input_file kmer_size min_abundance estimated_genome_size output_prefix

Run 25% of the reads through Minia with k=63, min_abundance=7 and a genome size of 4500000. Give the output a useful output_prefix so it is easy to keep track of.

Note! If you get an error when launching Minia on Amazon, you will need to type the following at the prompt: export LD_LIBRARY_PATH=

On AWS the run should take 20-30 minutes. This is a good time to read through the QUAST manual which we will be using to evaluate our assembly.

Evaluating contig metrics with QUAST:

Run QUAST on your output file. The syntax is below. You will want to specify a descriptive name for the output directory where QUAST will deposit a large number of output files.

quast.py -o output_directory contig_file

Move into your QUAST output directory and examine the report.txt file. A description of the output can be found on the QUAST web site: http://quast.bioinf.spbau.ru/manual_1.3.html.

Evaluating adjusting k on contig assembly

Rerun the Read_1_25.fastq data using a lower k of 44.

This run will also take up to 30 minutes. Once complete, evaluate the output using quast as we did above.

Optional Step 2: Mapping reads to a reference genome

Minia is an efficient de novo assembler. In reality, if you were sequencing a known strain of E. coli or any other organism with a reference genome, you would probably not rely on de novo assembly. Instead, you would map your reads against the known reference genome.

For this step we will go ahead and map the reads to an E. coli reference genome just to give you an idea of the level of coverage we obtain with a reference. In addition, some introduction to mapping will help with several exercises you will be completing in the next few days.

You will need to complete this exercise on the Ununtu VM and not on AWS. If your computer is slow it will take some time, which is why this is optional. We will be covering this topic in more detail in the coming days so don’t be rushed to get it done.

The E. coli reference genome is located in your activities/minia directory. It is called E coli_MG1655_reference.fasta. We will be using bowtie2 to map our FASTQ reads to this reference genome so the first thing to do is to move into the bowtie directory. Confirm that the reference fasta file is there before moving on.

Before mapping reads to the reference genome you will need to make a set of index files for the reference. This is accomplished using the bowtie2-build command. The syntax is to specify the file to index, in this case the FASTA file, and then a name for all of the output files. We will just call the index files the same thing as the input reference.

$bowtie2-build Ecoli_MG1655_reference.fasta Ecoli_MG1655_reference

This should only take a few seconds and you should end up with 6 new files in your bowtie directory. All of them will end with the *.bt2 extension. These files constitute the entire index for the reference sequence and are the files that Bowtie2 uses during the mapping procedure.

Map your decompressed reads to this index using the bowtie2 command. To use bowtie2 you need to (at a minimum) indicate the index files with the -x flag, and the location of the file containing the reads you want to map. In our case we will specify this using the -U flag to indicate we are looking at an unpaired read set. Following this we will use the -S flag to output a SAM file which we will look at in the next step.

bowtie2 -x Ecoli_MG1655_reference -U ../Read_1_100.fastq -S Read_1_100.sam

This will take several minutes. When the mapping is complete you will see some basic mapping metrics in your terminal and you should have a new Read_1_100.sam output file.

You can view sam output files in the Tablet visualization tool. You can launch Tablet by typing tablet at the command prompt. Select to “Open a New Assembly” and load the sam output file and the original FASTA reference file. After clicking open it will take a minute or two to open. You will need to select the one contig from the menu at the left to display the alignment visualization. Tablet will allow you to navigate and zoom in and out on the alignment.

Export a coverage summary by clicking on the red, circular tablet icon tablet_icon . This will create a text file with a detailed coverage report. An easy way to summarize this coverage file is to use the coveragestats.py script available on the Tablet web site. Do this by running this script on the exported coverage file

$python ~/wg_jan2013/local/bin/coveragestats.py exported_coverage_file

You should get the idea that there is extensive coverage of the genome given this data set.