Read mapping and variant calling

expected learning outcomes
system requirements
getting started
exercise 1
exercise 2
exercise 3
exercise 4
exercise 5

system requirements

For this activity, you will need the following programs installed:

Python (version 2.6 or 2.7)
Stampy
BWA
Samtools (includes bcftools, vcfutils.pl, tabix)
Platypus (temporary site)
VCFtools
IGV
annovar

getting started:

For this activity, you will need the following files:

These files will be provided to the Workshop on Genomics attendees by USB stick. Please do not initiate these downloads from the House of Prelate.

human_g1k_v37.fasta.gz: Human genome build 37 with NCBI identifiers (~800 Mb)
raw_reads.tgz: Raw trio reads: (73Mb)
EB_OTH_213.vcf.gz: Pre-called and merged trio variant calls (331 Mb)

Participants in the Workshop on Genomics should already have these files in ~/wg_jan2012/activities/Read_Mapping_and_Variant_Calling.

Workshop on Genomics participants should install and use a different version of Stampy for this activity: 1.0.13

tar xvfz stampy-1.0.13r1160.tgz
cd stampy-1.0.13
make
cd ..
cp -r stampy-1.0.13 ~/wg_jan2012/local/lib
ln -sf ~/wg_jan2012/local/lib/stampy-1.0.13/stampy.py ~/wg_jan2012/local/bin/stampy.py

NOTE: Participants in the Workshop on Genomics do not need to complete steps 1-3 below. Instead, ask for a data drive that contains the output from steps 1-3.

If you have at least 15G of disk space available, you may copy human_g1k_v37.tar.gz to ~/wg_jan2012/activities/Read_Mapping_and_Variant_Calling and uncompress it like so:

tar xvfz human_g1k_v37.tar.gz

The following steps take a while to complete. The runtimes assume a dual-core machine with 4G of RAM. You should also have several GB of disk space available before beginning this exercise.

Prepare Stampy index (*.stdix) and hash (*.sthash) (~60 minutes):
Prepare BWA index and hash (~15 minutes):
Download databases for Annovar (variable; up to 30 minutes):

annotate_variation.pl -downdb refGene humandb/ -build hg19
annotate_variation.pl -downdb band humandb/ -build hg19
annotate_variation.pl -downdb mce46way humandb/ -build hg19
annotate_variation.pl -downdb 1000g2010nov humandb/ -build hg19
annotate_variation.pl -downdb avsift humandb/ -build hg19
annotate_variation.pl -downdb segdup humandb/ -build hg19

Note 1: Some familiarity with the following languages and tools is assumed:

Basic Unix tools: man less head tail wc grep cut sort uniq gzip tar wget
Scripting languages: awk, one of perl or python (preferred)

If you’re not familiar with these basic tools, google them, or use man or tool –help to understand what they do. If you’re not familiar with awk, python or perl, make a mental note to look into them (the efficiency gained will make it worth your time), and for now just use the examples given below.

Note 2: Most of what you need is written below. If you don’t understand a command, do not run it as that would be missing the point of this tutorial. Ask yourself if you could explain what each step is accomplishing and how. If things are at all unclear, read the documentation (man command, and/or command –help) to find out what a tool does, and what the meaning is of all options.

Note 3: Working with large files is a challenge, as it is often difficult to spot errors that would be trivial to notice with smaller files. It is important to check the sanity of your input and output files at each stage, using basic Unix tools (head, less -S, wc) or viewers (samtools view, samtools tview, igv). Always look at a portion of the input and output files to make sure the tools worked as you expected them to. Count the number of input and output lines, and convince yourself that these numbers are in the right ballpark.

Note 4: The tutorial below makes use of various file formats, designed especially for working with next-gen sequencing data, and which have emerged in the last year or two. The main ones are:

SAM/BAM (for storing mapped reads): SAM1.pdf
VCF/BCF (for storing variant calls): specs.html

exercises

exercise 1: mapping reads

(Note: This exercise has already been completed and the output files are on the Workshop data drive in mapped_reads.tar.gz)

Using Stampy

(Note: This step has already been completed and the output files are on the Workshop data drive in human_g1k_v37.tar.gz. -G generates the *.stdix file, -H generates the *.sthash) Use the commands -G and -H to build the genome index and the hash table, using the (gzipped) genome reference as input.
Map the reads for the 3 individuals. Make sure to provide a read group identifier, and a sample name. The output is in SAM format. What is the the meaning of the data in each of the SAM files’ columns?

Using BWA (optional)

(Note: This step has already been completed and the output files are on the Workshop data drive in human_g1k_v37.tar.gz) Make the genome index files using the “index” command. Use the bwtsw algorithm.
Make .sai files for each set of reads (bwa aln), then combine them to produce a SAM file (bwa sampe).

If you are sure that the Stampy and BWA mapping steps worked properly, you may remove the following files if you are low on disk space: human_g1k_v37.bwt human_g1k_v37.pac human_g1k_v37.sa human_g1k_v37.sthash human_g1k_v37.stidx

Post process the mapped SAM file:

Convert to BAM format (samtools view –Sb)
Sort by coordinate (samtools sort)
Remove duplicates (samtools rmdup)
Index (samtools index)

Again, if you are low on disk space, you may remove the SAM files after you have converted them to BAM files.

exercise 2: snp and indel calling

Using Platypus

Linux users should first run the following command:
cp ~/wg_jan2012/software/PlatypusRelease/Platypus.py ~/wg_jan2012/local/bin

Mac users, download the package: Platypus_0.1.5.tgz
tar xvfz Platypus_0.1.5.tgz

# Build pysam first:
cd PlatypusRelease/pysam python setup.py install --prefix=PATH_TO_INSTALL_PLATYPUS_LIBRARIES

# Then build Platypus
cd PlatypusRelease/ python setup.py install --prefix=PATH_TO_INSTALL_PLATYPUS_LIBRARIES

IF THIS DOESN’T WORK, you can simply download the VCF file that would be generated: ebsnp.vcf_.gz

Unzip and rename the file: gunzip ebsnp.vcf_.gz && mv ebsnp.vcf_ ebsnp.vcf

Use the defaults. Call the 3 individuals together (note: EB_OTH_2 is the child, EB_OTH_1 and EB_OTH_3 are the parents). What do the columns in the VCF file mean? What information is in the header?

Using Samtools

Google “samtools variant calling” for the SourceForge page with an example. The required tools (samtools, bcftools, vcfutils.pl) are all part of the samtools package. Call the 3 individuals together.
Mac users may download bgzip and tabix: bgzip_and_tabix.tar.gz
tar xvfz bgzip_and_tabix.tar.gz
cp bgzip tabix ~/wg_jan2012/local/bin
Compress and index VCF files for further processing. Run bgzip and tabix -p vcf on the VCF files.

exercise 3: comparing the two call sets

Linux users should open ~/wg_jan2012/local/wg.profile in a text editor and add the following line:
export PERL5LIB=~/wg_jan2012/local/lib/:~/wg_jan2012/local/lib/perl5:~/wg_jan2012/local/lib/vcftools_0.1.7:$PERL5LIB
Then, open a new terminal. VCF tools such as vcf-compare should now work.

Use vcf-compare to get a summary of matching and unique calls in the two call sets. Which of the two sets contain more calls? Does the tool report issues? Scan through the call sets by eye. Are there calls you don’t trust? Could filtering have improved the call set?
Use vcf-merge to combine the two sets into one. Use awk ‘$10 == “.”’ to select calls unique to one or the other. Select a couple of SNP and indel calls in the region 61,500,000 – 62,500,000 and have a look at the relevant BAM files. Use both samtools tview and IGV. Can you draw any conclusions about the reliability of these unique SNP and indel calls?

exercise 4: finding de novo variants and annotation

On the original VCF files, use awk ‘$10 ~ “0/0” && $11 ~ 0/0’ to extract potential de novo calls in the child (assuming the child was listed third on the command line). How many real de novo mutations do you expect to see in an entire genome? How many in a 1 Mb region?
Select all SNPs in the region 61,500,000 – 62,500,000, and look at them manually using either samtools tview or IGV. How many candidates are you left with? How many look like real variants?
See http://www.openbioinformatics.org/annovar/annovar_startup.html
Convert the VCF file of putative de novo variants into Annovar format:
convert2annovar.pl in.vcf –format vcf4 –includeinfo > variants.ann
Run Annovar to find non-synonymous variants:
annotate_variation.pl –geneanno variants.ann humandb/ --buildver hg19
How many variants are predicted to cause coding changes? What is the affected gene?

exercise 5: the real analysis

The previous section showed how to find the causal variant, if you already know which 1 Mb portion of the genome it is in. In reality you don’t know this, and more filtering is required to drill down to a handful of likely candidates. Annovar is very helpful for this analysis.

First read the page under ‘necessary programs’ on the Annovar website; the steps outlined below take you through a similar filtering pipeline. Also browse the documentation for gene-based, region-based and filter-based annotations.

Merge the 3 VCF files together using vcf-merge. Use the order 2,1,3 to get the child as the 1st sample. Compress and index using bgzip and tabix -p vcf. (Note: this file is already in the activity folder as EB_OTH_213.vcf.gz)
Filter, using an AWK, Perl or Python script, to get a file with all putative de novo variants in the child (103331).
Convert the file into annovar format (use the –includeinfo option). Make a note of the number of transitions and transversions in the file. What does this suggest about the FP rate? Also note the number of SNPs and indels. What does this suggest of the FP rate among indels? How is the annovar format constructed?
Run annotate_variation.pl -geneanno humandb/ -buildver hg19.
Extract the variants with likely phenotypes: amino acid changing, frameshift inducing, and splice variants.
grep -v -w synonymous 213.ann.exonic_variant_function | grep -v -w nonframeshift | cut -f4- > 213.phenotype.unsorted.anngrep -w splicing 213.ann.variant_function | cut -f3- >> 213.phenotype.unsorted.ann

sort -k1,1 -k2,2g 213.phenotype.unsorted.ann > 213.phenotype.ann
Why is the cut operation necessary? How many variants are you left with?
Extract variants that are evolutionarily conserved (the LOD-score threshold 50 used below is fairly arbitrary).
annotate_variation.pl -regionanno -dbtype mce46way -buildver hg19 213.exon.splicing.ann humandb/ awk ‘match($0,/lod=[0-9]+/){if(substr($0,RSTART+4,RLENGTH-4)+0>50)print}’ < … | cut –f3- > …
How many variants are left now?
Remove variants in segmental duplications (do this in steps, rather than in one line as written below, to ensure that each command does what it is supposed to do):
annotate_variation.pl -regionanno -dbtype segdup -buildver hg19 …. humandb/ | cut -f4- | grep -v -f – …. > ….
How many variants were removed by this filter?
Remove variants that are present in the 1000 Genomes database.
annotate_variation.pl -filter -dbtype 1000g2010nov_all -buildver hg19 …. humandb/
How many variants were removed in this way?
Remove variants that appear benign according to the SIFT program:
annotate_variation.pl -filter -dbtype avsift -buildver hg19 …. humandb/How many variants are you left with now?
Convert the annovar output file back into VCF using standard Unix tools. Download BAM files from here. Start up IGV and have a look at the remaining 8 variants, loading up the 3 BAM files provided, and the VCF file, and identify any calls that appear to be data problems. How many calls remain?
The child (individual 2) presented with an epilepsy phenotype, and a de novo variant was suspected to have caused it. Is there a gene that is a likely candidate for this phenotype? Is this the same one that was identified in exercise 4.5?

table of contents

system requirements

getting started:

exercises