- Intrachromosomal duplications
- Interchrom dups
- Interspersed dups
- Tandem dups
- Can be in a direct orientation or can be inverted
Why do we care?
- Duplicated sequences are hypermutable and can undergo unequal crossing over, so paralogs become substrates for recombination. Duplications can therefore have implications for human disease.
Research in Drosophila has shown us the effects of duplication events on the structure of the eye, and also that these effects (the phenotype) can be reversed.
- Duplications promote structural variation
- Evolution of gene function; evolution by gene duplication
Normally duplications die a valiant death at the hand of selection and are purged, but while they persist they can still drive unequal crossing over. The creation of ‘new’ genes is rare, but it can and has happened. Bioinformatically, duplicated and repeated sequences are often the last to be assembled because they are so difficult to place. In humans, 60% of our duplications are separated by large distances; these interspersed duplications are modular in organization, difficult to resolve, large (>10 kb) and recent (sequence identity >95%). Also of interest is where the duplication events are centered: interchromosomal duplications cluster at the centromeres and subtelomeric ends. The second-best genome (the mouse) doesn’t have that. Sure, 4% of its genome is duplicated, but not in the same pattern as the human genome; instead it shows a pattern of tandem duplications within 1 Mb of each other!
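To make the vocabulary above concrete, here is a minimal sketch that labels a pair of duplicated copies using the categories from these notes. The `Copy` record, the function names, and the use of the 1 Mb distance as the tandem/interspersed cutoff are assumptions made for illustration; they are not from the talk or from any published tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a duplicated segment (hypothetical record for illustration)."""
    chrom: str
    start: int
    end: int
    strand: str = "+"


def classify_duplication(a, b, tandem_window=1_000_000):
    """Label a pair of copies by location, arrangement, and orientation.

    tandem_window is an assumed cutoff: copies on the same chromosome that
    lie within this distance are called tandem, otherwise interspersed.
    """
    if a.chrom != b.chrom:
        location, arrangement = "interchromosomal", "interspersed"
    else:
        location = "intrachromosomal"
        first, second = sorted((a, b), key=lambda c: c.start)
        gap = max(0, second.start - first.end)
        arrangement = "tandem" if gap <= tandem_window else "interspersed"
    orientation = "direct" if a.strand == b.strand else "inverted"
    return location, arrangement, orientation


def is_large_recent(length_bp, identity, min_length=10_000, min_identity=0.95):
    """Apply the 'large (>10 kb) and recent (>95% identity)' description."""
    return length_bp > min_length and identity > min_identity


pair = (Copy("chr1", 100, 200), Copy("chr1", 5_000_000, 5_000_100, "-"))
print(classify_duplication(*pair))
```

A real analysis would work from whole-genome self-alignments rather than hand-entered coordinates; the point here is only how the four labels relate to one another.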
- DiGeorge/VCFS/22q11 syndrome, which occurs in 1/2000 live births and results from a deletion at 22q11. It is usually not inherited (75-80% of cases are sporadic), and 180 phenotypes have been documented, such as cleft palate, heart defects, and risk for schizophrenia. These individuals are haploinsufficient, because they carry half of the genetic material they should have in that region.
- There is a recurrent deletion of particular pieces of DNA associated with developmental delay in children (chr7; 5 genes gone)
- Autism spectrum disorder (chr15; 3 Mb region deletion, explains 0.2% of autism in the population): about half of cases are ‘de novo’ (absent from the parents but arising in the germline) and half are transmitted, and there is no common phenotype. A second chr15 deletion (1.5 Mb, covering portions of brain development genes) is seen in 0.3% of children and gives a predisposition to seizures.
There are several health conditions associated with duplications and/or deletions and many are still not fully understood today:
“So simply just detecting CNV is not sufficient, you need to know structure/organization as well!”
- This region is actually an inversion
- It is a region of recurrent deletion at a site of common inversion in human populations
- It is largely restricted to Caucasian populations
- This inversion is associated with increased fecundity and an increased recombination rate in the individuals who carry it.
- The inverted allele is called H2
- This inversion polymorphism is a risk factor for the microdeletion, and if the microdeletion takes place then you get the syndrome.
- The inversion happened 2.3 million years ago, and the ancestor is thought to have migrated out of Africa into Arabia and Europe
In summary:
- The human genome is enriched for segmental duplications which predisposes us to recurrent large CNVs during germ-cell reproduction
- 15% of neurocognitive disease in intellectually disabled children is “caused” by CNVs
- Increased complexity is both good and bad. Ancestral duplication has predisposed us to the inversion polymorphism; this inversion polymorphism acquires a duplication, and the haplotype becomes positively selected, leading to an increased risk for this microdeletion.
There are many genome-wide structural variation (SV) discovery approaches:
- Hybridization based
- Single molecule
- Sequencing based
It’s important to know that when you’ve found something with NGS, you need to prove it’s real with something like array comparative genomic hybridization (array CGH), where you are looking for copy number imbalance. This technology is gradually being replaced by SNP microarrays such as those from Illumina.
- With SNP microarrays you are genotyping, and this works with anything except repetitive sequences
- Clone-based techniques provide sequence resolution of structural variation by building libraries, sequencing paired ends and aligning them to a reference. With paired reads you can tell whether they are concordant or discordant based on whether the ends map with the expected distance between them: too close is indicative of an insertion, too far is indicative of a deletion, and if the ends don’t face each other you might have an inversion. Basically these are indirect signatures that lead you to detecting structural variation. To fully understand what is going on in the genome, you can go back, find that clone and resequence it.
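The paired-end signatures described above can be sketched as a small decision function. This is an illustration only: the function name, the strand encoding, and the `tolerance` parameter are assumptions, and it treats every pair whose ends do not face each other as a possible inversion, exactly as the notes do. Real callers distinguish more orientation classes and require several supporting pairs before making a call.

```python
def classify_read_pair(left_strand, right_strand, observed_span,
                       expected_span, tolerance):
    """Classify one mapped end pair by the signatures described in the notes.

    left_strand / right_strand: strand ("+" or "-") of the leftmost- and
        rightmost-mapping end; ends face each other when these are "+" and "-".
    observed_span: distance between the mapped ends on the reference.
    expected_span, tolerance: the library's insert size and the allowed
        deviation (both hypothetical parameters for this sketch).
    """
    if (left_strand, right_strand) != ("+", "-"):
        # Ends do not face each other: orientation signature.
        return "possible inversion"
    if observed_span < expected_span - tolerance:
        # Ends map too close together: extra sequence in the sample.
        return "possible insertion"
    if observed_span > expected_span + tolerance:
        # Ends map too far apart: sequence missing from the sample.
        return "possible deletion"
    return "concordant"


# A 40 kb clone library with +/- 3 kb tolerance (made-up numbers).
for pair in [("+", "-", 40_000), ("+", "-", 30_000),
             ("+", "-", 60_000), ("+", "+", 40_000)]:
    print(pair, classify_read_pair(*pair, expected_span=40_000, tolerance=3_000))
```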
Experimental approaches, however, are incomplete, and when you overlap their results you rarely find much correspondence:
- Read pair analysis, where you can detect small deletions, novel insertions, and transposons; however, size and breakpoint resolution depend on insert size.
- Read depth analysis, where you can really only detect deletions and duplications and there is relatively poor breakpoint resolution.
- Split Read analysis, you can detect small novel insertions/deletions and mobile element insertions and you have about 1bp breakpoint resolution
- Local de novo assembly, which will detect novel structural variants in unique segments and also has 1bp breakpoint resolution.
However, there are limits to the ‘holy grail’ of NGS as well. The short length of NGS reads doesn’t let you place them in or around repeat regions, which limits what you can detect and impairs sensitivity and specificity. As with experimental approaches, computational approaches are also incomplete, and their results often don’t overlap either.
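Of the four approaches above, read depth is the simplest to illustrate. The sketch below assumes you already have per-window read depths and a set of control windows believed to be diploid; the function name and inputs are hypothetical, and real tools also correct for GC content and mappability, which is skipped here.

```python
from statistics import median


def estimate_copy_number(window_depths, control_depths, ploidy=2):
    """Estimate integer copy number per window from read depth.

    Depth in each window is scaled by the median depth of control windows
    assumed to be present at `ploidy` copies, then rounded.
    """
    baseline = median(control_depths)
    if baseline <= 0:
        raise ValueError("control windows must have positive depth")
    return [round(ploidy * depth / baseline) for depth in window_depths]


# Made-up depths: normal, heterozygous deletion, duplication, homozygous deletion.
print(estimate_copy_number([30, 15, 45, 0], control_depths=[29, 30, 31]))
```

Because the estimate is made per window, the breakpoint can only be placed to within a window, which is one way to see why breakpoint resolution is relatively poor for this approach.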
- Challenges include: size spectrums, class bias in terms of deletions vs duplications vs inversions, multi-allelic states, incomplete references, complexity of repetitive DNA, looking at the exome versus the genome and false negatives.
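One way to quantify how poorly two call sets correspond is to count the calls that overlap reciprocally. The sketch below is an assumption-laden illustration: the tuple layout, the function names, and the 50% threshold are chosen for the example and are not from the talk.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals (chrom, start, end).

    Coordinates are treated as half-open; returns 0.0 when the intervals are
    on different chromosomes or do not overlap.
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def concordant_calls(calls_a, calls_b, min_fraction=0.5):
    """Calls in calls_a matched by at least one call in calls_b."""
    return [a for a in calls_a
            if any(reciprocal_overlap(a, b) >= min_fraction for b in calls_b)]


method_1 = [("chr1", 100, 200), ("chr1", 1_000, 2_000)]
method_2 = [("chr1", 150, 250)]
shared = concordant_calls(method_1, method_2)
print(f"{len(shared)} of {len(method_1)} calls supported by both methods")
```

The pairwise loop is fine for a handful of calls; for genome-scale call sets you would sort the intervals or use an interval tree instead.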
A good program for working out CNV events is mrsFAST.
Part II
Tools have been developed for exome sequencing data, one of which is called CoNIFER (Copy Number Inference From Exome Reads). Another piece of software, which is good for validation, is XHMM. Here’s a Broad video on XHMM!
- Focus on comprehensive assessment of genetic variation: NGS methods are indirect and do not resolve structure, but they provide specificity and an excellent dynamic range response.
- High quality sequence resolution of complex structural variation to establish alternate references/haplotypes—often show extraordinary differences in genetic diversity
- Technological advances in whole genome sequencing “Third Generation Sequencing”: Long-read sequencing technologies with NGS throughput in order to sequence and assemble genomes de novo
So let’s talk about single molecule real-time sequencing…
Both of these programs have improved sequence assembly and accuracy and have allowed for the finishing of microbial genomes.
- As stated in many talks prior…multiple methods need to be considered and used in analysis of data and experimental methods.
- There will be a trade off in specificity and sensitivity in your analyses, know where those trade offs are.
- Complexity is still not fully understood.
- It helps to narrow the size spectrum of your structural variants
- NGS can help lead to more accurate prediction of copy number and has unparalleled specificity in genotyping duplicated genes though you MUST have a quality reference for the best results or inferences to be made.
- Third generation sequencing holds quite a bit of promise (ala SMRT) but the coverage needs to get up’d.
- The selective disadvantage of interspersed duplications is offset by the benefit of evolutionary plasticity and the emergence of new genes with new functions associated with core duplicons.
Marques-Bonet and Eichler, CSHL Quant Biol, 2008
Many of the genes involved aren’t well characterized in terms of what they actually do, but many have been implicated in or associated with human brain development, and those gene families have historically been involved in expansions. Evan highlights one such gene, SRGAP2, which is implicated in neuronal function. It has been duplicated 3 times in humans and is not duplicated in any other mammalian lineage. These duplicates appear to be fixed in humans and are all expressed. When SRGAP2 reaches a given concentration it folds over on itself (homodimerization) and allows for neuronal progression.
So once again, in summary (from Evan’s slides):
- Interspersed duplication architecture sensitized our genome to copy-number variation increasing our species predisposition to disease—children with autism and intellectual disability
- Duplication architecture has evolved recently in a punctuated fashion around core duplicons, which encode human- and great-ape-specific gene innovations (e.g. NPIP, NBPF, LRRC37; see slides for the examples given).
- Cores have propagated in a stepwise fashion, “transducing” flanking sequences; the human-specific acquired flanks are associated with brain developmental genes.
- Core Duplicon Hypothesis: Selective disadvantage of these interspersed duplications offset by newly minted genes and new locations within our species. Eg. SRGAP2C
Evan ended his talk on a great slide depicting what he feels is the relationship between disease and evolution…a yin/yang relationship. His group is doing some incredibly cool stuff and hopefully he’s reignited your appreciation for the complexities of the human genome…