Evomics2022_SV_lab_2.2

Evomics 2022 – Structural variant tutorial

Tutorial written by Valentina Peona


Presentation slides

Introduction and wrap-up to the activities (both the lab and quiz).


Aim

Structural variants (SVs) are awesome! But their characterisation is hard and very dependent on the technologies implemented and the quality of the reference genome assembly. The call of SVs can be frustrating but it is essential to carry it out as carefully as possible so to not bias downstream analysis (which is the fun part!).

In this tutorial we are going to call SVs with short and long read sequencing data and compare those calls and discuss about what factors can improve the analysis.

The tutorial is relatively easy (map reads, call SVs, and compare VCF files) and uses what we can consider “optimal” data since we simulated both SVs and sequencing reads (PacBio, Nanopore and Illumina) from the human chromosome 22 (GRChg38). The simulated SVS are all homozygous! In this optimal scenario you will get an idea of how technology and tool dependent SV calling is and how manual inspection can be essential to get sensible results.

In more details, we simulated Illumina, PacBio, and Nanopore libraries at 30X coverage and simulated 110 SVs on a single chromosome. The SV simulation was done using SURVIVOR simSV and sequencing simulation was done using SURVIVOR simread for PacBio and Nanopore and wgsim for Illumina.

  • Illumina library (30X): paired-end reads, 150 bp each
  • PacBio library (30X): average read length of 6 kb. The error profile and length distribution were taken from the human sample and library HG002.
  • Nanopore library (30X): average read length of 5.5 kb. The error profile and length distribution were taken from the human sample and library NA12878.

You can find the human reference for chromosome 22 in the folder Data with the name GRChg38_chr22.fasta while you can find the mapped reads (BAM and BAI files) in Intermediate/BWA (Illumina) and Intermediate/Minimap2 (PacBio and Nanopore).

Tools

  • Minimap2 to map long reads (PacBio and Nanopore)
  • BWA to map short reads (Illumina)
  • Samtools to manipulate BAM files
  • Sniffles to call and genotype SVs
  • Manta to call SVs using short reads
  • Bcftools to manipulate VCF files with SV calls
  • BEDTools to manipulate VCF files
  • SURVIVOR to merge different VCF files with the SV calls
  • Samplot and IGV to visualise and inspect SVs

Folder structure

a13_structural_variation/SV_lab/
├─ Code/
├─ Data/
├─ Intermediate/
│  ├─ BWA/
│  ├─ Manta/
│  ├─ Minimap2/
│  ├─ Sniffles/
│  ├─ VCF_comparison/
├─ Results
├─ Tools
├─ Other

Conda environments

The conda-environments and additional programs are pre-installed on the instance, if you would like to recreate in future you click here.

In this tutorial we need to run several tools that need two separate Conda environments. A Yaml file is provided for each environment.

CONDA ENVS have been already installed for you on the Amazon server, please do not run the code below

The code below is here for illustrative purposes

Create main environment called SV_Env and install SURVIVOR.

# be sure that no other conda envs are active. If another env is active, please run:
# conda deactivate
# be sure that no other conda envs are active
conda deactivate

cd ~/workshop_materials/a13_structural_variation/SV_lab/Tools
mamba env create -f SV_Env.yml
git clone https://github.com/fritzsedlazeck/SURVIVOR.git
cd SURVIVOR/Debug
make

The code below is here for illustrative purposes

Create a second environment called Manta.

# be sure that no other conda envs are active. If another env is active, please run:
# conda deactivate
cd ~/workshop_materials/a13_structural_variation/SV_lab/Tools/
mamba env create -f Manta.yml
conda activate Manta

The code below is here for illustrative purposes

Manta installation instructions from: https://github.com/Illumina/manta/blob/master/docs/userGuide/installation.md

# create folder for the installation of Manta
mkdir -p ~/SV_lab/Tools/Manta
cd ~/workshop_materials/a13_structural_variation/SV_lab/Tools/Manta

# download from repository
wget https://github.com/Illumina/manta/releases/download/v1.6.0/manta-1.6.0.release_src.tar.bz2
tar -xjf manta-1.6.0.release_src.tar.bz2

# install
mkdir build && cd build
# Ensure that CC and CXX are updated to target compiler if needed, e.g.:
#     export CC=/path/to/cc
#     export CXX=/path/to/c++
../manta-1.6.0.release_src/configure --jobs=2 --prefix=./
make -j2 install

Now that everything is installed and ready to run, deactivate the Manta env and let’s start the analysis.

Map data

I already mapped the reads for you because it takes too long. This part of the tutorial is illustrative, you do not need to run any mapping command

If you are interested in how the reads are mapped, click here.

The first step of the analysis is the mapping of reads to the reference genome assembly. In our case we will separately map Illumina short reads with BWA, PacBio and Nanopore long reads with Minimap2 to the human reference of chromosome 22.

Short reads with BWA

Illumina (IL) reads

Map short Illumina reads with BWA to obtain a SAM file of aligned reads. Sort the SAM file by coordinate and covert it into BAM and index the resulting BAM file.

Code example, do not run

OUT=SV_IL

# create output directory
mkdir -p ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/BWA/
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/BWA/

# run bwa
REF=~/workshop_materials/a13_structural_variation/SV_lab/Data/GRChg38_chr22.fasta
R1=/path/to/forward_IL.fastq.gz
R2=/path/to/reverse_IL.fastq.gz

bwa index $REF

bwa mem -t 4 -o $OUT.sam -a $REF $R1 $R2

# convert sam into bam and index the bam file
samtools sort $OUT.sam > $OUT.bam
samtools index -b -@ 4 $OUT.bam $OUT.bai

Instead of running the above code, please find the BAM file in Intermediate/BWA/SV_IL.bam as well as its index file Intermediate/BWA/SV_IL.bai. The index file is very essential!

DO RUN

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/BWA/SV_IL.bam .
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/BWA/SV_IL.bai .

Long reads with Minimap2

PacBio (PB) reads

Map long PacBio reads with Minimap2 to obtain a SAM file of aligned reads. Sort the SAM file by coordinate and covert it into BAM and index the resulting BAM file.

Minimap2 is able to map different types of reads including short reads. You can and must specify what type of reads you want to map through the parameter -x. In this case we use -x map-pb for PacBio CSR reads. The newest releases of Minimap2 allow to specifically map alfo PacBio hifi reads. The different options are necessary for the program to predict the sequencing error pattern of each of the technologies (manual on GitHub).

In addition, it is very important to add the parameter --MD to the Minimap2 command because the MD tag is used downstream by Sniffles to call the SVs.

Code example, do not run

OUT=SV_PB

# create output directory
mkdir -p ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/PacBio
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/PacBio

# link necessary files
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Data/GRChg38_chr22.fasta .

# index reference before running minimap2
minimap2 -d GRChg38_chr22.mmi GRChg38_chr22.fasta

# run minimap2 on all the fastq.gz
minimap2 -a -Y -x map-pb --MD GRChg38_chr22.mmi $( ls ~/workshop_materials/a13_structural_variation/SV_lab/Data/Reads/PacBio/*.fastq.gz ) > $OUT.sam

# convert sam into bam and index the bam file
samtools sort $OUT.sam > $OUT.bam
samtools index -b -@ 4 $OUT.bam $OUT.bai

Instead of running the above code, please find the BAM file corresponding to the PacBio library in Intermediate/Minimap2/PacBio/SV_PB.bam as well as its index file Intermediate/Minimap2/PacBio/SV_PB.bai. The index file is essential!

DO RUN

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/PacBio/SV_PB.bam .
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/PacBio/SV_PB.bai .

Nanopore (ONT) reads

Map long Nanopore reads with Minimap2 to obtain a SAM file of aligned reads. Sort the SAM file by coordinate and covert it into BAM and index the resulting BAM file.

This time we set -x map-ont to specify that we are going to map Nanopore reads (manual on GitHub).

Again, the parameter --MD is necessary for the downstream analysis of Sniffles to call the SVs.

Code example, do not run

OUT=SV_ONT

# create output directory
mkdir -p ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/Nanopore
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/Nanopore

# link necessary files
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Data/GRChg38_chr22.fasta .

# index reference before running minimap2
minimap2 -d GRChg38_chr22.mmi GRChg38_chr22.fasta

# run minimap2
minimap2 -a -Y -x map-ont --MD GRChg38_chr22.mmi $( ls ~/workshop_materials/a13_structural_variation/SV_lab/Data/Reads/Nanopore/*.fastq.gz ) > $OUT.sam

# convert sam into bam and index the bam file
samtools sort $OUT.sam > $OUT.bam
samtools index -b -@ 4 $OUT.bam $OUT.bai

Instead of running the above code, please find the BAM file corresponding to the Nanopore library in Intermediate/Minimap2/Nanopore/SV_ONT.bam as well as its index file Intermediate/Minimap2/Nanopore/SV_ONT.bai. The index file is essential!

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/Nanopore
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/Nanopore/SV_ONT.bam .
ln -s ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Minimap2/Nanopore/SV_ONT.bai .

SV calling

Now we are ready to start the most important part of the analysis: SV calling! To do that we use two tools: Manta for the short reads and Sniffles for the long reads.

Illumina (IL) reads

Let’s start with calling SV using the Illumina mapped to the reference genome. We’re going to use Manta.

Manta user guide

In order to run Manta we need:

  • BAM file of aligned Illumina reads to GRCh38_chr22
  • reference genome GRCh38_chr22
  • index (.fai) of the reference genome produced by samtools faidx
  • configuration file (a template of this config file is always given with the installation of Manta. We provide a copy of it in the Tools/Manta folder)

Prepare the input data for Manta

# got to working directory
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta

# link BAM file if not already present
ln -s ../BWA/SV_IL.bam .
ln -s ../BWA/SV_IL.bai .

# copy config file, the file is already in the folder, do not run
#cp ~/workshop_materials/a13_structural_variation/SV_lab/Tools/Manta/bin/configManta.py.ini .

For the scope of the tutorial, we use the default parameters in the configManta.py.ini but feel free to experiment with all the other parameters. The path to the reference in configManta.py.ini has already been modified.

Run configManta.py to generate a Manta workflow that will call the SVs.

conda activate Manta
../../Tools/Manta/bin/configManta.py --config ./configManta.py.ini --bam SV_IL.bam

It creates the folder called MantaWorkflow that contains all the scripts and subfolders necessary to run the analysis. Inspect the MantaWorkflow folder, you will find several empty subfolders that will contain the results of the analysis. In particular the folder MantaWorkflow/results/variants is the most important for us.

The name MantaWorkflow is given by default but can be changed with the parameter --runDIR in the configManta.py command.

Enter the MantaWorkflow folder and run the workflow runWorkflow.py. This program will run for ~10 minutes.

cd MantaWorkflow

# run the workflow
./runWorkflow.py

Inspect the main output file MantaWorkflow/results/variants/candidateSV.vcf.gz. Looking at it you may notice the presence of a particular type of SV called BND. BND stands for breakend (of an SV) and indicate the breaking point of an inversion. Breaking points beloning to the same inversion share the same ID (e.g., MantaBND:1895). When you have more than two entries for the same BND ID, it means that there are multiple inversions occurring in the region with respect to the reference.

For some of the next analysis, we’re going to use SURVIVOR but SURVIVOR is not very happy to deal with inversions specified on different lines, therefore we need to consolidate the calling of inversions using the Manta script convertInversion.py which can be found in the Manta installation folder.

To run the script, we need to specify the path to samtools, reference genome, and VCF with SVs.

cd results/variants
gunzip candidateSV.vcf.gz

# consolidate inversions and redirect the new VCF file into the main Intermediate/Manta folder
~/workshop_materials/a13_structural_variation/SV_lab/Tools/Manta/libexec/convertInversion.py /home/wpsg/software/samtools-1.15.1/samtools ../../../../../Data/GRChg38_chr22.fasta candidateSV.vcf > ../../../SV_IL.vcf

Now the VCF file contains the SV INV.

Let’s proceed to call SV with PacBio and Nanopore data.

PacBio (PB) reads

Activate the SV_Env to run Sniffles.

conda deactivate
# to suppress irrelevant errors we use "2> /dev/null"
conda activate SV_Env 2> /dev/null

Before running Sniffles, look at the help page of the tool sniffles --help to get familiar with the parameters.

Which parameters would you add or modify? Feel free to experiment!

Run Sniffles. This will take ~5 minutes.

OUT=SV_PB

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio

# link BAM file if not already linked
ln -s ../../Minimap2/PacBio/$OUT.bam .
ln -s ../../Minimap2/PacBio/$OUT.bai .

# run Sniffles
sniffles -i $OUT.bam -v $OUT.vcf --minsupport 5 --minsvlen 50 -t 4

Nanopore (ONT) reads

Similarly to what we did with the PacBio reads, we run Sniffles on the Nanopore reads. This will take ~5 minutes.

OUT=SV_ONT

cd ../Nanopore

# link BAM file if not already linked
ln -s ../../Minimap2/Nanopore/$OUT.bam .
ln -s ../../Minimap2/Nanopore/$OUT.bai .

# run Sniffles
sniffles -i $OUT.bam -v $OUT.vcf --minsupport 5 --minsvlen 50 -t 4

Filtering of called SVs

The filtering of the SVs is aimed to remove commonly erroneous variants.

We can filter the vcf files using bcftools similarly to what was done in Weissensteiner et al. 2020. More details in the code deposited on GitHub.

Filters:
– Keep allele frequency (based on supporting reads) > 0.3 (if field is present)
– Keep SV length < 100 kb (to remove erroneous chromosome-scale variants)
– Keep SVs with number of supporting reads < 60 (to remove variants caused by high-copy repeats)

We are going to apply this filtering to all the VCF files produced here.

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta
bcftools view -i 'SVLEN<100000 && SVLEN>-100000 && PAIR_COUNT<30' SV_IL.vcf > SV_IL_filtered.vcf

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio
bcftools view -i 'AF>0.3 && SVLEN<100000 && SVLEN>-100000 && SUPPORT<60' SV_PB.vcf  > SV_PB_filtered.vcf

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/Nanopore
bcftools view -i 'AF>0.3 && SVLEN<100000 && SVLEN>-100000 && SUPPORT<60' SV_ONT.vcf > SV_ONT_filtered.vcf

Another important point to keep in mind is: GAPS! We’re going to remove all the SVs that overlap with sequencing gaps in the reference (“NNNNN” sequences, scaffold extremities) because they can easily lead to false positives.

We procede as follows:
– annotate the gaps in the reference in a BED file with a Python script
– intersect the VCF with the BED files containing the gaps with BEDtools

# annotate gaps
cd ~/workshop_materials/a13_structural_variation/SV_lab/Data
python ~/workshop_materials/a13_structural_variation/SV_lab/Tools/annotateGaps.py GRChg38_chr22.fasta >> GRChg38_chr22.fasta.gaps

BED=~/workshop_materials/a13_structural_variation/SV_lab/Data/GRChg38_chr22.fasta.gaps

# filter VCF from Manta on Illumina
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta/
VCF=SV_IL_filtered.vcf
OUTVCF=${VCF%.*}_gaps.vcf
bedtools intersect -v -a $VCF -b $BED | cat <(grep '#' $VCF) - > $OUTVCF

# filter VCF from Sniffles on PacBio
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio
VCF=SV_PB_filtered.vcf
OUTVCF=${VCF%.*}_gaps.vcf
bedtools intersect -v -a $VCF -b $BED | cat <(grep '#' $VCF) - > $OUTVCF

# filter VCF from Sniffles on Nanopore
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/Nanopore
VCF=SV_ONT_filtered.vcf
OUTVCF=${VCF%.*}_gaps.vcf
bedtools intersect -v -a $VCF -b $BED | cat <(grep '#' $VCF) - > $OUTVCF

Comparison of SVs and VCF files

General comparison

We can parse the VCF with several tools, here I suggest to import the files into R and use Bioconductor packages like StructuralVariantAnnotation (based on IRanges and GRanges) to explore the characteristics of the SVs called.

Get general information about the single VCF files using statVCF.R

# go to the directory with your VCF, e.g., PacBio
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio

# run the R script
# it will produce a .stat file and a pdf with histograms with the length distribution of the SVs
Rscript --vanilla ~/workshop_materials/a13_structural_variation/SV_lab/Tools/statVCF.R SV_PB_filtered_gaps.vcf

# rename the pdf
mv Rplots.pdf SV_PB_filtered_gaps_hist.pdf

Run this also for the VCFs created for Nanopore and illumina.

Click here if you need help adjusting the code above

This is what you can run for Nanopore:

# go to the directory with your VCF, e.g., Nanopore
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/Nanopore

# run the R script
# it will produce a .stat file and a pdf with histograms with the length distribution of the SVs
Rscript --vanilla ~/workshop_materials/a13_structural_variation/SV_lab/Tools/statVCF.R SV_ONT_filtered_gaps.vcf

# rename the pdf
mv Rplots.pdf SV_ONT_filtered_gaps_hist.pdf

This is what you can run for Illumina:

# go to the directory with your VCF, e.g., Illumina
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta

# run the R script
# it will produce a .stat file and a pdf with histograms with the length distribution of the SVs
Rscript --vanilla ~/workshop_materials/a13_structural_variation/SV_lab/Tools/statVCF.R SV_IL_filtered_gaps.vcf

# rename the pdf
mv Rplots.pdf SV_IL_filtered_gaps_hist.pdf

Once you produced the stats and pdf for each VCF you can collect the results into one folder:

# change into VCF_comparison directory
cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/VCF_comparison

# link stat and pdf files
ln -s ../Sniffles/PacBio/SV_PB_filtered_gaps_hist.pdf .
ln -s ../Sniffles/PacBio/SV_PB_filtered_gaps.vcf.stat .
ln -s ../Sniffles/Nanopore/SV_ONT_filtered_gaps_hist.pdf .
ln -s ../Sniffles/Nanopore/SV_ONT_filtered_gaps.vcf.stat .
ln -s ../Manta/SV_IL_filtered_gaps_hist.pdf .
ln -s ../Manta/SV_IL_filtered_gaps.vcf.stat .

Compare the results.

Get overlapping SVs between technologies

We can get information about which SVs overlap between the different VCF files by using SURVIVOR. SURVIVOR is designed to merge VCF files of SVs from different individuals or species and it indicates which ones are present in all VCF files. We can use the same tool to compare the technologies and treat them as they were different individuals/samples. Several parameters can be used to tune the merging.

# link the VFC files
ln -s ../Sniffles/PacBio/SV_PB_filtered_gaps.vcf .
ln -s ../Sniffles/Nanopore/SV_ONT_filtered_gaps.vcf .
ln -s ../Manta/SV_IL_filtered_gaps.vcf .

# list the VCF files to merge
ls SV*.vcf > list_vcf

SURVIVOR merge list_vcf 1000 1 1 0 0 50 merged_filtered.vcf

Here the filtering parameters of SURVIVOR are the following:

Max distance between breakpoints: 1000

Minimum number of supporting caller: 1

Take the type into account (1==yes, else no): 1

Take the strands of SVs into account (1==yes, else no): 0

Estimate distance based on the size of SV (1==yes, else no): 0

Minimum size of SVs to be taken into account: 50

Inspect the output file merged_filtered.vcf.

Plot the comparisons

Code from: https://github.com/fritzsedlazeck/SURVIVOR/wiki#plotting-the-comparison-of-multiple-input-vcf-files-after-merging

Extract the overlap information like this:

perl -ne 'print "$1\n" if /SUPP_VEC=([^,;]+)/' merged_filtered.vcf | sed -e 's/\(.\)/\1 /g' > merged_filtered_overlapp.txt

This will extract the string from the support vector representing 0 or 1 depending on if a sample/input VCF file supports an SV or not. The sed command will add a space between every character, which is needed for R.

Next, I am using the R package called VennDiagram. You need to install/download it for your R instance.

#enter R on the terminal
R

#here is the R code
library(VennDiagram)

t = read.table("merged_filtered_overlapp.txt", header = FALSE)

dst = data.matrix(t(t))

venn.diagram(list(Illumina = which(t[,1]==1), Nanopore = which(t[,2]==1), PacBio = which(t[,3]==1)) , fill = c("gray", "orange" ,"blue"), alpha = c(0.5, 0.5, 0.5), cex = 2, lty = 2, filename = "my_techs_overlapp.tiff");

#exit R 
quit()

You need to adjust this R code if you have more than 3 samples by extending the list, fill and alpha variable.

Evaluate the calling of simulated SVs

The SVs we called actually come from a simulated set of SVs therefore we do know which ones are real and which are not. We can evaluate this by again using a function of SURVIVOR.

To do this we need the VCF files produced for each technology separately and compare them to the BED file with the SVs. You can find the BED file in the Other folder. You can evaluate the filtered and unfiltered VCF files!

Look at the parameters of SURVIVOR eval before running the code. Would you change anything in the proposed code for PacBio below? How the evaluation results change with, for example, changing the distance between the called SV and its simulated instance?

SURVIVOR eval
VCF file
BED file from simulations
Max allowed distance to simulated
Output eval file

Code for PacBio:

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio

# link BED file with the list of simulated SVs
ln -s ../../../Other/simulated.bed .

# evaluate the unfiltered VCF file
SURVIVOR eval SV_PB.vcf simulated.bed 10 SV_PB_eval_res

# evaluate the filtered VCF file
SURVIVOR eval SV_PB_filtered.vcf simulated.bed 10 SV_PB_filtered_eval_res

# evaluate the filtered VCF file + filtered for gaps
SURVIVOR eval SV_PB_filtered_gaps.vcf simulated.bed 10 SV_PB_filtered_gaps_eval_res

Run this also for the VCFs created for Nanopore and illumina.

Click here if you need help adjusting the code above

This is what you can run for Nanopore:

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/Nanopore

# link BED file with the list of simulated SVs
ln -s ../../../Other/simulated.bed .

# evaluate the unfiltered VCF file
SURVIVOR eval SV_ONT.vcf simulated.bed 10 SV_ONT_eval_res

# evaluate the filtered VCF file
SURVIVOR eval SV_ONT_filtered.vcf simulated.bed 10 SV_ONT_filtered_eval_res

# evaluate the filtered VCF file + filtered for gaps
SURVIVOR eval SV_ONT_filtered_gaps.vcf simulated.bed 10 SV_ONT_filtered_gaps_eval_res

This is what you can run for Illumina:

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta

# link BED file with the list of simulated SVs
ln -s ../../Other/simulated.bed .

# evaluate the unfiltered VCF file
SURVIVOR eval SV_IL.vcf simulated.bed 10 SV_IL_eval_res

# evaluate the filtered VCF file
SURVIVOR eval SV_IL_filtered.vcf simulated.bed 10 SV_IL_filtered_eval_res

# evaluate the filtered VCF file + filtered for gaps
SURVIVOR eval SV_IL_filtered_gaps.vcf simulated.bed 10 SV_IL_filtered_gaps_eval_res

The command gives you two types of VCF files and a summary of the efficiency of the SV calling (by default printed on screen).

*_right.vcf: correctly called SVs
*addition.vcf: additionally called SVs

Example of the efficiency of the SV calling:

Overall: 110 8/20/8/0/10 21/20/22/0/1 2/5/2/0/13 0.418182 0.323529

This shows that 110 SVs were simulated. Followed by the true positive (simulated and found) SVs, the false negative (simulated not found) and the false positive (not simulated but found). Each of which is reported by type (DEL/DUP/INV/TRA/INS). The second last number is the sensitivity and the last number is the FDR rate.

More info here and here.

Visual inspection of SVs

Either use Guacamole, by going to “http://ec2-XX-XXX-XXX-XXX.compute-1.amazonaws.com:8080/guacamole/”, again replacing “XX-XXX-XXX-XXX” with your Amazon instance IP address. Go to the desktop environment and open the terminal inside of the ubuntu gui of guacamole and run this code:”./software/IGV_2.13.0/igv.sh”. IGV will open in your Guacamole desktop.

Or install IGV on your local computer from http://software.broadinstitute.org/software/igv/download. In this case download on local the BAM files and the fasta file for chr22 and open them with IGV.

Go to a location of interest that you identified from the VCF file and check if the evidence given by the BAM files align with the SV calling and genotype.

Can you find any region that do not look right?

Another fast(er) way to check visually if the information given by the mapped reads align with the SV calling is to use Samplot. Samplot gets BAM files and (optionally) VCF files to create coverage plots of the regions of interest.

Go to the GitHub repository and look at the manual to run it on your regions of interest.

Basic usage (example). It plots the region of chr22 from base 100,000 to 200,000 and saves the plot in region1.png

cd ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/VCF_comparison
samplot plot \
    -n Illumina PacBio Nanopore \
    -b ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Manta/SV_IL.bam \
      ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/PacBio/SV_PB.bam \
      ~/workshop_materials/a13_structural_variation/SV_lab/Intermediate/Sniffles/Nanopore/SV_ONT.bam \
    -o region1.png \
    -c NC_000022.11 \
    -s 34287816 \
    -e 34288320 \
    -t INV

Useful usage example that takes BAM file(s) + VCF with SVs + BED file with regions of interest

samplot vcf \
    --filter "SVTYPE == 'DEL' & SU >= 8" \
    --filter "SVTYPE == 'INV' & SU >= 5" \
    --vcf example.vcf\
    -d test/\
    -O png\
    --important_regions regions.bed\
    -b example.bam > samplot_commands.sh

References

Belyeu, J. R., Chowdhury, M., Brown, J., Pedersen, B. S., Cormier, M. J., Quinlan, A. R., & Layer, R. M. (2021). Samplot: a platform for structural variant visual validation and automated filtering. Genome Biology, 22(1), 2020.09.23.310110. doi: 10.1186/s13059-021-02380-5

Chen, X., Schulz-Trieglaff, O., Shaw, R., Barnes, B., Schlesinger, F., Källberg, M., … Saunders, C. T. (2016). Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics, 32(8), 1220–1222. doi: 10.1093/bioinformatics/btv710

Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., … Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. doi: 10.1093/gigascience/giab008

Jeffares, D. C., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., … Sedlazeck, F. J. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature Communications, 8(1), 14061. doi: 10.1038/ncomms14061

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100. doi: 10.1093/bioinformatics/bty191

Quinlan, A. R. (2014). BEDTools: The Swiss-Army tool for genome feature analysis. Current Protocols in Bioinformatics, 2014(1), 11.12.1-11.12.34. doi: 10.1002/0471250953.bi1112s47

Sedlazeck, F. J., Rescheneder, P., Smolka, M., Fang, H., Nattestad, M., von Haeseler, A., & Schatz, M. C. (2018). Accurate detection of complex structural variations using single-molecule sequencing. Nature Methods, 15(6), 461–468. doi: 10.1038/s41592-018-0001-7

Smolka, M., Paulin, L. F., Grochowski, C. M., Mahmoud, M., Behera, S., Gandhi, M., … Sedlazeck, F. J. (2022). Comprehensive Structural Variant Detection: From Mosaic to Population-Level. BioRxiv, 2022.04.04.487055. doi: 10.1101/2022.04.04.487055

Weissensteiner, M. H., Bunikis, I., Catalán, A., Francoijs, K. J., Knief, U., Heim, W., … Wolf, J. B. W. (2020). Discovery and population genomics of structural variation in a songbird genus. Nature Communications, 11(1), 3403. doi: 10.1038/s41467-020-17195-4