table of contents
- expected learning outcomes
- system requirements
- getting started
- exercise 1
- exercise 2
- exercise 3
- exercise 4
- exercise 5
system requirements
For this activity, you will need the following programs installed:
- Python (version 2.6 or 2.7)
- Stampy
- BWA
- Samtools (includes bcftools, vcfutils.pl, tabix)
- Platypus (temporary site)
- VCFtools
- IGV
- Annovar
getting started
For this activity, you will need the following files:
- human_g1k_v37.fasta.gz: Human genome build 37 with NCBI identifiers (~800 Mb)
- raw_reads.tgz: Raw reads for the trio (73 Mb)
- EB_OTH_213.vcf.gz: Pre-called and merged trio variant calls (331 Mb)
Build and install Stampy:
- tar xvfz stampy-1.0.13r1160.tgz
- cd stampy-1.0.13
- make
- cd ..
- cp -r stampy-1.0.13 ~/wg_jan2012/local/lib
- ln -sf ~/wg_jan2012/local/lib/stampy-1.0.13/stampy.py ~/wg_jan2012/local/bin/stampy.py
If you have at least 15G of disk space available, you may copy human_g1k_v37.tar.gz to ~/wg_jan2012/activities/Read_Mapping_and_Variant_Calling and uncompress it like so:
tar xvfz human_g1k_v37.tar.gz
The following steps take a while to complete. The runtimes assume a dual-core machine with 4G of RAM. You should also have several GB of disk space available before beginning this exercise.
- Prepare Stampy index (*.stidx) and hash (*.sthash) (~60 minutes):
- gunzip human_g1k_v37.fasta.gz
- stampy.py --assembly=ncbi37 --species=human -G human_g1k_v37 human_g1k_v37.fasta
- stampy.py -g human_g1k_v37 -H human_g1k_v37
- Prepare BWA index and hash (~15 minutes):
- bwa index -a bwtsw -p human_g1k_v37 human_g1k_v37.fasta
- Download databases for Annovar (variable; up to 30 minutes):
- annotate_variation.pl -downdb refGene humandb/ -buildver hg19
- annotate_variation.pl -downdb band humandb/ -buildver hg19
- annotate_variation.pl -downdb mce46way humandb/ -buildver hg19
- annotate_variation.pl -downdb 1000g2010nov humandb/ -buildver hg19
- annotate_variation.pl -downdb avsift humandb/ -buildver hg19
- annotate_variation.pl -downdb segdup humandb/ -buildver hg19
You should also be familiar with the following tools and file formats:
- Basic Unix tools: man less head tail wc grep cut sort uniq gzip tar wget
- Scripting languages: awk, one of perl or python (preferred)
If you’re not familiar with these basic tools, Google them, or use man or tool --help to understand what they do. If you’re not familiar with awk, python or perl, make a mental note to look into them (the efficiency gained will make it worth your time), and for now just use the examples given below.
- SAM/BAM (for storing mapped reads): SAM1.pdf
- VCF/BCF (for storing variant calls): specs.html
exercises
exercise 1: mapping reads
(Note: This exercise has already been completed and the output files are on the Workshop data drive in mapped_reads.tar.gz)
Using Stampy
- (Note: This step has already been completed and the output files are on the Workshop data drive in human_g1k_v37.tar.gz. -G generates the *.stidx file, -H generates the *.sthash file.) Use the -G and -H options to build the genome index and the hash table, using the (gzipped) genome reference as input.
- Map the reads for the 3 individuals, as in the sketch below. Make sure to provide a read group identifier and a sample name. The output is in SAM format. What is the meaning of the data in each of the SAM files’ columns?
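For example, a mapping command might look like this (a sketch only; the FASTQ file names are placeholders for the files extracted from raw_reads.tgz, and the read group/sample names are illustrative):
stampy.py -g human_g1k_v37 -h human_g1k_v37 --readgroup=ID:EB_OTH_1,SM:EB_OTH_1 -M EB_OTH_1_1.fastq EB_OTH_1_2.fastq > EB_OTH_1.sam
Repeat with the appropriate read files, read group and sample name for each of the three individuals.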
Using BWA (optional)
- (Note: This step has already been completed and the output files are on the Workshop data drive in human_g1k_v37.tar.gz) Make the genome index files using the “index” command. Use the bwtsw algorithm.
- Make .sai files for each set of reads (bwa aln), then combine them to produce a SAM file (bwa sampe).
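A sketch of this workflow (again, the FASTQ file names are placeholders and the -r read group string is illustrative):
bwa aln human_g1k_v37 EB_OTH_1_1.fastq > EB_OTH_1_1.sai
bwa aln human_g1k_v37 EB_OTH_1_2.fastq > EB_OTH_1_2.sai
bwa sampe -r '@RG\tID:EB_OTH_1\tSM:EB_OTH_1' human_g1k_v37 EB_OTH_1_1.sai EB_OTH_1_2.sai EB_OTH_1_1.fastq EB_OTH_1_2.fastq > EB_OTH_1.bwa.sam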
If you are sure that the Stampy and BWA mapping steps worked properly, you may remove the following files if you are low on disk space: human_g1k_v37.bwt human_g1k_v37.pac human_g1k_v37.sa human_g1k_v37.sthash human_g1k_v37.stidx
Post process the mapped SAM file:
- Convert to BAM format (samtools view -Sb)
- Sort by coordinate (samtools sort)
- Remove duplicates (samtools rmdup)
- Index (samtools index)
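For example, for one sample (a sketch; file names are placeholders, and note that older samtools versions take an output prefix for sort, whereas newer versions use -o):
samtools view -Sb EB_OTH_1.sam > EB_OTH_1.bam
samtools sort EB_OTH_1.bam EB_OTH_1.sorted
samtools rmdup EB_OTH_1.sorted.bam EB_OTH_1.rmdup.bam
samtools index EB_OTH_1.rmdup.bam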
Again, if you are low on disk space, you may remove the SAM files after you have converted them to BAM files.
exercise 2: snp and indel calling
Using Platypus
Linux users should first run the following command:
cp ~/wg_jan2012/software/PlatypusRelease/Platypus.py ~/wg_jan2012/local/bin
Mac users, download the package: Platypus_0.1.5.tgz
tar xvfz Platypus_0.1.5.tgz
# Build pysam first:
cd PlatypusRelease/pysam
python setup.py install --prefix=PATH_TO_INSTALL_PLATYPUS_LIBRARIES
# Then build Platypus
cd ..
python setup.py install --prefix=PATH_TO_INSTALL_PLATYPUS_LIBRARIES
IF THIS DOESN’T WORK, you can simply download the VCF file that would be generated: ebsnp.vcf_.gz
Unzip and rename the file: gunzip ebsnp.vcf_.gz && mv ebsnp.vcf_ ebsnp.vcf
- Use the defaults. Call the 3 individuals together (note: EB_OTH_2 is the child, EB_OTH_1 and EB_OTH_3 are the parents). What do the columns in the VCF file mean? What information is in the header?
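A sketch of the calling command (the BAM file names are placeholders for the sorted, duplicate-removed BAMs from exercise 1):
python Platypus.py callVariants --bamFiles=EB_OTH_1.rmdup.bam,EB_OTH_2.rmdup.bam,EB_OTH_3.rmdup.bam --refFile=human_g1k_v37.fasta --output=ebsnp.vcf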
Using Samtools
- Google “samtools variant calling” for the SourceForge page with an example. The required tools (samtools, bcftools, vcfutils.pl) are all part of the samtools package. Call the 3 individuals together.
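The pipeline on that page looks roughly like this (a sketch; the BAM and output file names are placeholders, and option letters may differ slightly between samtools/bcftools versions):
samtools mpileup -uf human_g1k_v37.fasta EB_OTH_1.rmdup.bam EB_OTH_2.rmdup.bam EB_OTH_3.rmdup.bam | bcftools view -bvcg - > trio.raw.bcf
bcftools view trio.raw.bcf | vcfutils.pl varFilter > trio.samtools.vcf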
- Mac users may download bgzip and tabix: bgzip_and_tabix.tar.gz
tar xvfz bgzip_and_tabix.tar.gz
cp bgzip tabix ~/wg_jan2012/local/bin
- Compress and index VCF files for further processing. Run bgzip and tabix -p vcf on the VCF files.
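For example, assuming your samtools call set is named trio.samtools.vcf:
bgzip trio.samtools.vcf
tabix -p vcf trio.samtools.vcf.gz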
exercise 3: comparing the two call sets
Linux users should open ~/wg_jan2012/local/wg.profile in a text editor and add the following line:
export PERL5LIB=~/wg_jan2012/local/lib/:~/wg_jan2012/local/lib/perl5:~/wg_jan2012/local/lib/vcftools_0.1.7:$PERL5LIB
Then open a new terminal. VCF tools such as vcf-compare should now work.
- Use vcf-compare to get a summary of matching and unique calls in the two call sets. Which of the two sets contain more calls? Does the tool report issues? Scan through the call sets by eye. Are there calls you don’t trust? Could filtering have improved the call set?
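For example, with both call sets compressed and indexed as above (the file names are placeholders):
vcf-compare ebsnp.vcf.gz trio.samtools.vcf.gz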
- Use vcf-merge to combine the two sets into one. Use awk '$10 == "."' to select calls unique to one or the other. Select a couple of SNP and indel calls in the region 61,500,000 – 62,500,000 and have a look at the relevant BAM files. Use both samtools tview and IGV. Can you draw any conclusions about the reliability of these unique SNP and indel calls?
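A sketch of the merge and selection (file names as above; note that if the two files share identical sample names, vcf-merge may require renaming them first):
vcf-merge ebsnp.vcf.gz trio.samtools.vcf.gz | bgzip -c > merged.vcf.gz
tabix -p vcf merged.vcf.gz
zcat merged.vcf.gz | awk '$10 == "."' | less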
exercise 4: finding de novo variants and annotation
- On the original VCF files, use awk '$10 ~ "0/0" && $11 ~ "0/0"' to extract potential de novo calls in the child (assuming the child was listed third on the command line). How many real de novo mutations do you expect to see in an entire genome? How many in a 1 Mb region?
- Select all SNPs in the region 61,500,000 – 62,500,000, and look at them manually using either samtools tview or IGV. How many candidates are you left with? How many look like real variants?
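One way to pull out the region (a sketch, assuming you saved the candidates from the previous step as candidate_denovo.vcf; the length() tests simply keep single-base ref and alt alleles, i.e. SNPs):
awk '!/^#/ && $2 >= 61500000 && $2 <= 62500000 && length($4)==1 && length($5)==1' candidate_denovo.vcf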
- See http://www.openbioinformatics.org/annovar/annovar_startup.html
- Convert the VCF file of putative de novo variants into Annovar format:
convert2annovar.pl in.vcf --format vcf4 --includeinfo > variants.ann
- Run Annovar to find non-synonymous variants:
annotate_variation.pl --geneanno variants.ann humandb/ --buildver hg19
How many variants are predicted to cause coding changes? What is the affected gene?
exercise 5: the real analysis
The previous section showed how to find the causal variant, if you already know which 1 Mb portion of the genome it is in. In reality you don’t know this, and more filtering is required to drill down to a handful of likely candidates. Annovar is very helpful for this analysis.
First read the page under ‘necessary programs’ on the Annovar website; the steps outlined below take you through a similar filtering pipeline. Also browse the documentation for gene-based, region-based and filter-based annotations.
- Merge the 3 VCF files together using vcf-merge. Use the order 2,1,3 to get the child as the 1st sample. Compress and index using bgzip and tabix -p vcf. (Note: this file is already in the activity folder as EB_OTH_213.vcf.gz)
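A sketch, assuming the per-individual call sets are named EB_OTH_1.vcf.gz, EB_OTH_2.vcf.gz and EB_OTH_3.vcf.gz and are already bgzip-compressed and tabix-indexed:
vcf-merge EB_OTH_2.vcf.gz EB_OTH_1.vcf.gz EB_OTH_3.vcf.gz | bgzip -c > EB_OTH_213.vcf.gz
tabix -p vcf EB_OTH_213.vcf.gz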
- Filter, using an awk, Perl or Python script, to get a file with all putative de novo variants in the child (103331 variants); one possible one-liner is sketched below.
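A sketch (with the 2,1,3 sample order the child is the first genotype column, $10, and the parents are $11 and $12; further refinements, such as excluding missing genotypes, are possible):
zcat EB_OTH_213.vcf.gz | awk '/^#/ || ($11 ~ "0/0" && $12 ~ "0/0" && $10 !~ "0/0")' > denovo_213.vcf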
- Convert the file into annovar format (use the --includeinfo option); see the sketch below. Make a note of the number of transitions and transversions in the file. What does this suggest about the FP rate? Also note the number of SNPs and indels. What does this suggest about the FP rate among indels? How is the annovar format constructed?
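For example (a sketch; denovo_213.vcf is the file from the previous step, and the awk one-liners simply count lines based on the ref and alt allele columns of the annovar format):
convert2annovar.pl -format vcf4 -includeinfo denovo_213.vcf > 213.ann
# count SNPs (columns 4 and 5 of the annovar format are the ref and alt alleles)
awk 'length($4)==1 && length($5)==1 && $4!="-" && $5!="-"' 213.ann | wc -l
# count transitions among the SNPs; the remaining SNPs are transversions
awk 'length($4)==1 && length($5)==1 && ($4$5=="AG" || $4$5=="GA" || $4$5=="CT" || $4$5=="TC")' 213.ann | wc -l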
- Run the gene-based annotation (assuming you named the converted file 213.ann):
annotate_variation.pl -geneanno -buildver hg19 213.ann humandb/
- Extract the variants with likely phenotypes: amino acid changing, frameshift inducing, and splice variants.
grep -v -w synonymous 213.ann.exonic_variant_function | grep -v -w nonframeshift | cut -f4- > 213.phenotype.unsorted.ann
grep -w splicing 213.ann.variant_function | cut -f3- >> 213.phenotype.unsorted.ann
sort -k1,1 -k2,2g 213.phenotype.unsorted.ann > 213.phenotype.ann
- Why is the cut operation necessary? How many variants are you left with?
- Extract variants that are evolutionarily conserved (the LOD-score threshold 50 used below is fairly arbitrary).
annotate_variation.pl -regionanno -dbtype mce46way -buildver hg19 213.phenotype.ann humandb/
awk 'match($0,/lod=[0-9]+/){if(substr($0,RSTART+4,RLENGTH-4)+0>50)print}' < … | cut -f3- > …
How many variants are left now?
- Remove variants in segmental duplications (do this in steps, rather than in one line as written below, to ensure that each command does what it is supposed to do):
annotate_variation.pl -regionanno -dbtype segdup -buildver hg19 …. humandb/ | cut -f4- | grep -v -f - …. > ….
How many variants were removed by this filter?
- Remove variants that are present in the 1000 Genomes database.
annotate_variation.pl -filter -dbtype 1000g2010nov_all -buildver hg19 …. humandb/
How many variants were removed in this way?
- Remove variants that appear benign according to the SIFT program:
annotate_variation.pl -filter -dbtype avsift -buildver hg19 …. humandb/
How many variants are you left with now?
- Convert the annovar output file back into VCF using standard Unix tools. Download BAM files from here. Start up IGV and have a look at the remaining 8 variants, loading up the 3 BAM files provided, and the VCF file, and identify any calls that appear to be data problems. How many calls remain?
- The child (individual 2) presented with an epilepsy phenotype, and a de novo variant was suspected to have caused it. Is there a gene that is a likely candidate for this phenotype? Is this the same one that was identified in exercise 4.5?