CNVs and Humans are Cool

Dr. Evan Eichler
Howard Hughes Medical Institute
University of Washington


Topic: Genome Structural Variation


So when we talk about genome structural variation here, we are talking about deletions, duplications and inversions within human chromosomes. Evan’s mission in life is to remind us that the human is still a pretty f**king cool animal to study. Yes…I just came back from wine tasting, you all know what I mean. Humans still harbor ‘the unknown’ and that is worth investigating further. The human genome has not been completely decoded, and Evan and his group are committed to figuring out the mysteries that still lie in it.


For today’s lecture we are talking about structural variation: genomic differences of 50 bp or larger. Genome structural variation includes copy number variation (hereafter CNV) and balanced events such as inversions and translocations. It was originally defined as >1 kbp, but the cutoff is now >50 bp.


We will also chat about segmental duplications (SDs). An SD is a continuous portion of genomic sequence represented more than once in the genome, in effect a historical CNV (>90% sequence identity and >1 kb in length). These can arise via:
  • Intrachromosomal duplications
  • Interchromosomal duplications
  • Interspersed duplications
  • Tandem duplications
  • Can be in a direct orientation or can be inverted
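
The definition and categories above can be turned into a small filter. This is only a minimal sketch, assuming each pairwise alignment is summarized by chromosome, start position, strand, aligned length and percent identity. The `Alignment` record, the function name and the 1 Mb tandem cutoff are illustrative assumptions, not anything from Evan's pipeline.

```python
from dataclasses import dataclass

MIN_IDENTITY = 90.0          # percent identity, from the SD definition above
MIN_LENGTH = 1_000           # aligned length in bp, from the SD definition above
TANDEM_MAX_GAP = 1_000_000   # ASSUMED cutoff between tandem and interspersed copies


@dataclass
class Alignment:
    """Hypothetical summary of one pairwise alignment between two genomic copies."""
    chrom_a: str
    start_a: int
    chrom_b: str
    start_b: int
    length: int        # aligned length in bp
    identity: float    # percent identity
    same_strand: bool  # True if both copies are in the same orientation


def classify_duplication(aln):
    """Return (location, spacing, orientation), or None if not an SD."""
    # Apply the >90% identity and >1 kb thresholds.
    if aln.identity <= MIN_IDENTITY or aln.length <= MIN_LENGTH:
        return None
    orientation = "direct" if aln.same_strand else "inverted"
    # Copies on different chromosomes are interchromosomal by definition.
    if aln.chrom_a != aln.chrom_b:
        return ("interchromosomal", "interspersed", orientation)
    # On the same chromosome, use the distance between copies to call spacing.
    gap = abs(aln.start_a - aln.start_b)
    spacing = "tandem" if gap <= TANDEM_MAX_GAP else "interspersed"
    return ("intrachromosomal", spacing, orientation)
```

For example, two 20 kb copies on chr15 that sit 4 Mb apart on opposite strands would come back as intrachromosomal, interspersed and inverted under these assumed cutoffs.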

 Why do we care?

  • Duplicated sequences are hypermutable: paralogs become substrates for recombination, so unequal crossing over can occur between them. Duplications can therefore have implications for human disease.

Research in Drosophila has shown us the effects of duplication events on the structure of the eye, and also that these effects on phenotype can be reversed.

Two very important points for today:
  1. Duplications promote structural variation
  2. Evolution of gene function; evolution by gene duplication

Normally duplications die a valiant death at the hand of selection and are purged, but while they persist they can still drive unequal crossing over. The creation of ‘new’ genes is rare but it can and has happened. Bioinformatically, duplicated and repeated sequences are often the last to be assembled because they are so difficult to place. In humans, 60% of our dups are separated by a large distance; this is called interspersed duplication. These duplications are modular in organization, difficult to resolve, large (>10 kb) and recent (meaning an identity of >95%). Also of interest is where the duplication events are centered…interchromosomal duplications cluster at the centromeres and subtelomeric ends. The 2nd best genome (the mouse) doesn’t have that. Sure, 4% of the mouse genome is duplicated, but not in the same pattern as the human genome: the mouse shows a pattern of tandem duplications within 1 Mb of each other!

Many duplication events result in genomic disorders which are a group of diseases that result from genome rearrangement mediated mostly by non-allelic homologous recombination (see Evan’s slides for citations).


Evan gave several examples starting with:


  • DiGeorge/VCFS/22q11 syndrome, which occurs in 1/2000 live births and results from a deletion at 22q11. It is usually not inherited (75-80% of cases are sporadic) and 180 phenotypes have been documented, such as cleft palate, heart defects, and risk for schizophrenia. The region is dosage sensitive (haploinsufficient): these individuals have half the genetic material they should have there.
  • There is a recurrent deletion of a particular piece of DNA associated with developmental delay in children (chr7; 5 genes gone).
  • Autism spectrum disorder (chr15; 3 Mb deletion, explains 0.2% of autism in the population). About half of cases are ‘de novo’ (not present in the parents but arising in the germline) and half are transmitted, and there is no common phenotype. A separate 1.5 Mb deletion on chr15, with portions in brain development genes, is found in 0.3% of children and gives a predisposition for seizures.

There are several health conditions associated with duplications and/or deletions, and many are still not fully understood today.


“So simply just detecting CNV is not sufficient, you need to know structure/organization as well!”

So let’s talk about common and rare structural variation linked to the 17q21.31 deletion syndrome:
  • This region actually carries an inversion
  • It is a region of recurrent deletion at a site of common inversion in human populations
  • The inversion is largely restricted to Caucasian populations
  • This inversion is associated with increased fecundity and an increased recombination rate in the individuals who carry it.
  • The inverted allele is called H2
  • This inversion polymorphism is a risk factor for the microdeletion, and if the microdeletion takes place then you get the syndrome.
  • The inversion happened 2.3 million yrs ago and the ancestor is thought to have migrated out of Africa into Arabia and Europe

In summary:

  • The human genome is enriched for segmental duplications which predisposes us to recurrent large CNVs during germ-cell reproduction
  • 15% of neurocognitive disease in intellectually disabled children is “caused” by CNVs
  • Increased complexity is both good and bad. Ancestral duplication predisposed us to the inversion polymorphism; the inverted haplotype then acquired a duplication and became positively selected, leading to an increased risk for this microdeletion.

There are many genome-wide structural variation (SV) discovery approaches:

  • Hybridization based
  • Single molecule
  • Sequencing based

It’s important to know that when you’ve found something with NGS, you need to prove it’s real with something like array comparative genomic hybridization (array CGH), where you are looking for copy number imbalance. Array CGH is gradually being replaced by SNP microarrays such as those from Illumina.

  • SNP microarrays are used for genotyping and work with anything except repetitive sequences
  • Clone-based techniques provide sequence resolution of structural variation by building libraries, sequencing paired ends and aligning them to a reference. With paired reads you can tell whether they are concordant or discordant based on whether the ends map at the expected distance from each other: too close is indicative of an insertion, too far is indicative of a deletion, and if they don’t face each other you might have an inversion. Basically these are indirect signatures that lead you to detecting structural variation. To fully understand what is going on, you can go back, find that clone and resequence it to fully characterize the variant.
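
The paired-end signatures described above boil down to a few comparisons. Here is a minimal sketch, assuming you already know the distance between the two mapped ends on the reference and whether they face each other. The function name, its parameters and the example numbers are hypothetical, and real callers need many supporting pairs before believing a signature.

```python
def classify_read_pair(mapped_span, ends_face_each_other, expected_span, tolerance):
    """Classify one end-sequenced clone or read pair against the reference.

    mapped_span: distance in bp between the two ends as mapped on the reference
    ends_face_each_other: True if the ends map in the expected inward orientation
    expected_span: expected insert size of the library in bp
    tolerance: allowed deviation from the expected span in bp (an assumption)
    """
    # Wrong orientation is the signature of a possible inversion.
    if not ends_face_each_other:
        return "possible inversion"
    # Ends map too close together: the sample carries extra sequence
    # that the reference lacks.
    if mapped_span < expected_span - tolerance:
        return "possible insertion"
    # Ends map too far apart: the sample is missing sequence
    # that the reference has.
    if mapped_span > expected_span + tolerance:
        return "possible deletion"
    return "concordant"
```

With an assumed 40 kb insert and 3 kb tolerance, a pair mapping 60 kb apart would be flagged as a possible deletion and a pair mapping 25 kb apart as a possible insertion.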

Experimental approaches, however, are incomplete, and when you overlap their results you rarely find much correspondence.

In terms of NGS applications we have:
  • Read pair analysis, where you can detect small deletions, novel insertions, and transposons; however, size and breakpoint resolution depend on insert size.
  • Read depth analysis, where you can really only detect deletions and duplications, and there is relatively poor breakpoint resolution.
  • Split read analysis, where you can detect small novel insertions/deletions and mobile element insertions with about 1 bp breakpoint resolution.
  • Local de novo assembly, which detects novel structural variants in unique segments and also has 1 bp breakpoint resolution.
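
To make the read depth idea from the list above concrete, here is a minimal sketch that converts per-window read depth into an integer copy number. It assumes most windows are at the normal diploid state, so the median depth stands in for 2 copies. The function name and the simple median normalization are illustrative assumptions; real tools also correct for things like GC content and mappability.

```python
from statistics import median


def estimate_copy_number(window_depths, baseline_copies=2):
    """Estimate copy number per window from read depth.

    window_depths: mean read depth in consecutive fixed-size windows
    baseline_copies: copy number ASSUMED for a typical window (2 for diploid)
    """
    # Assume the median window represents the normal copy state.
    baseline_depth = median(window_depths)
    if baseline_depth == 0:
        raise ValueError("median depth is zero; cannot normalize")
    # Depth scales roughly linearly with copy number, so rescale and round.
    return [round(baseline_copies * depth / baseline_depth) for depth in window_depths]


depths = [30, 31, 29, 15, 30, 46, 60, 30]
print(estimate_copy_number(depths))  # [2, 2, 2, 1, 2, 3, 4, 2]
```

Notice that a call can only be placed to the nearest window, which is one way to see why read depth has relatively poor breakpoint resolution.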

However, there are limits to the ‘holy grail’ of NGS as well. The short reads of NGS can’t be placed confidently in or around repeat regions, which limits what you can detect and impairs sensitivity/specificity. As with experimental approaches, computational approaches are also incomplete and oftentimes their results don’t overlap either.

  • Challenges include: size spectrums, class bias in terms of deletions vs duplications vs inversions, multi-allelic states, incomplete references, complexity of repetitive DNA, looking at the exome versus the genome and false negatives.

A good program for working out CNV events is mrsFAST, a short-read mapper.

…and with visions of CNVs and developmental disorders dancing in our heads we were released to acquire some coffee…

Part II

There have been tools developed for exome sequencing data, one of which is called CoNIFER (Copy Number Inference From Exome Reads). Another piece of software, which is good for validation, is XHMM. Here’s a Broad video on XHMM!

Going forward we want to focus on:
  • Comprehensive assessment of genetic variation: NGS methods are indirect and do not resolve structure, but they provide specificity and excellent dynamic range response.
  • High-quality sequence resolution of complex structural variation to establish alternate references/haplotypes, which often show extraordinary differences in genetic diversity
  • Technological advances in whole genome sequencing (“Third Generation Sequencing”): long-read sequencing technologies with NGS throughput in order to sequence and assemble genomes de novo

So let’s talk about single molecule real-time sequencing…

Single molecule real-time sequencing (SMRT) involves long reads and no cloning or amplification, but has lower throughput and a 15% error rate, though improvements in this error rate are expected in the future. One of the fundamental problems with assembly is the presence of repeats that are longer than the sequenced fragments, which leads to little or no unique mapping/assembly of those fragments and results in a collapsed assembly. Some new tools that have been developed are helping to improve assemblies from sequences derived from PacBio SMRT sequencing.

Both of these programs have improved sequence assembly and accuracy and have allowed for the finishing of microbial genomes.
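
The repeat problem mentioned above comes down to whether a single read can reach unique sequence on both sides of a repeat. A minimal sketch, assuming a read needs some minimum amount of unique flanking sequence on each side to be placed; the function name and the 500 bp anchor are hypothetical values chosen for illustration.

```python
def read_spans_repeat(read_length, repeat_length, min_anchor=500):
    """Return True if one read can cover a repeat plus unique anchors on both sides.

    read_length: length of the read in bp
    repeat_length: length of the repeat copy in bp
    min_anchor: unique flanking sequence needed on EACH side (assumed value)
    """
    # The read must contain the whole repeat and an anchor on both flanks;
    # otherwise copies of the repeat are indistinguishable and tend to collapse.
    return read_length >= repeat_length + 2 * min_anchor


print(read_spans_repeat(100, 6_000))     # False: a short read sits entirely inside the repeat
print(read_spans_repeat(10_000, 6_000))  # True: a long read reaches unique sequence on both sides
```

This is why longer reads help even with a higher per-base error rate: a read that anchors on both flanks tells you which copy of the repeat it came from.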

So in terms of a plan of attack: you’d make BAC clones for your genome, sequence them using both Illumina and PacBio sequencing, and tile them. This has led to the successful assembly of complex regions of the human genome using BACs. Most of the difficulties were in indel areas, and one large-scale collapse of a 20 kbp region to 12 kbp occurred. But otherwise it was remarkably successful.


In summary:


  • As stated in many talks prior…multiple methods need to be considered and used, both in data analysis and in experimental work.
  • There will be a trade-off between specificity and sensitivity in your analyses; know where those trade-offs are.
  • Complexity is still not fully understood.
  • It helps to narrow the size spectrum of your structural variants
  • NGS can help lead to more accurate prediction of copy number and has unparalleled specificity in genotyping duplicated genes though you MUST have a quality reference for the best results or inferences to be made.
  • Third generation sequencing holds quite a bit of promise (à la SMRT) but the coverage needs to go up.

He continued on with a story of human/ape evolution and how there was an acceleration of duplication in the human/great ape ancestor. There is a 3-4 fold excess of de novo segmental duplication in the common ancestor of human, chimp and gorilla AFTER the divergence from the orangutan, and this accumulation was not continuous.


Ongoing is the Great Ape Diversity Genome Project, which seeks deep genome sequencing of 79 wild and captive-born great apes and 10 human genomes. They have found 167 Mbp affected by SNPs and SNVs, 469 affected by copy number. In looking at patterns of duplication and deletion in the human/ape genomes, they found that events have a ‘centroid’, where a ‘centromere’ or ‘core’ carries genes along with it when duplicated, leading to more and more gene duplications over time and a ‘mosaic’ architecture emerging.
They also found that rates of duplication slow down as we get to more and more closely related species; the same is true of deletions, though with deletions the clock is much more constant.
This led them to the hypothesis of Core Expansion, where an initial core is duplicated and then, over time, that core is duplicated again but brings more genes with it every time.
And this in turn led to the Core Duplicon Hypothesis:
  • The selective disadvantage of interspersed duplications is offset by the benefit of evolutionary plasticity and the emergence of new genes with new functions associated with core duplicons.
    Marques-Bonet and Eichler, CSHL Quant Biol, 2008

Many of the genes involved aren’t well characterized in terms of what they actually do, but many have been implicated in or associated with human brain development, and those gene families have historically been involved in expansions. Evan highlighted one such gene, SRGAP2, which is implicated in neuronal function. It has been duplicated 3 times in humans, and the duplicates are not found in any other mammalian lineage. These duplicates appear fixed in humans and are all expressed. When SRGAP2 reaches a given concentration it folds over on itself (homodimerization) and allows for neuronal progression.


So once again, in summary (from Evan’s slides):

  • Interspersed duplication architecture sensitized our genome to copy-number variation, increasing our species’ predisposition to disease (children with autism and intellectual disability)
  • Duplication architecture has evolved recently in a punctuated fashion around core duplicons which encode human/great-ape specific gene innovations (e.g. NPIP, NBPF, LRRC37, etc.; see slides for examples given).
  • Cores have propagated in a stepwise fashion, “transducing” flanking sequences; human-specific acquisitions of flanks are associated with brain developmental genes.
  • Core Duplicon Hypothesis: the selective disadvantage of these interspersed duplications is offset by newly minted genes and new locations within our species, e.g. SRGAP2C

Evan ended his talk on a great slide depicting what he feels is the relationship between disease and evolution…a yin/yang relationship. His group is doing some incredibly cool stuff and hopefully he’s reignited your appreciation for the complexities of the human genome…

…Dr. Mel