gsi (genealogical sorting index) activity

table of contents

  • expected learning outcomes
  • getting started
  • exercise 1: species differentiation in dolphins
  • exercise 2: species differentiation in dolphins, integrating across loci
  • exercise 3: population structure of parasitic lice

expected learning outcomes

The objective of this activity is to promote understanding of the genealogical sorting index, gsi, and its application to problems of lineage divergence (Cummings, Neel and Shaw 2008). The gsi is a novel and objective way to quantify genealogical structure by assessing the amount of exclusive ancestry of a group on a rooted tree and determining the probability of observing that amount of exclusive ancestry at random in a single lineage independent of any specific topology or distribution of coalescent times. The gsi thereby transcends the categorical view of monophyly and nonmonophyly characteristic of phylogenetic systematics to enable novel insight into the evolutionary process. Likewise, the gsi captures historical information about diverging populations from its quantification of exclusive relationship, independent of a reliance on estimates of coalescent times characteristic of historical population genetics. Although most often described in the context of evolutionary divergence, the gsi statistic is more broadly applicable to quantifying and assessing the significance of clustering of observations in labeled groups on any tree.
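
For reference, the statistic can be written out as below. This is a restatement of the definition in Cummings, Neel and Shaw (2008), not text from this activity, and the notation should be checked against the paper before it is relied on.

```latex
% gs: amount of exclusive ancestry for one group on a rooted tree
%   n   = minimum number of nodes needed to unite a group of n + 1 tips
%   d_u = degree of node u, one of the U nodes uniting the group
%         through its most recent common ancestor
\[
  gs = \frac{n}{\sum_{u=1}^{U} (d_u - 2)}
\]

% gsi: gs rescaled to the interval [0, 1]
%   I = all nodes on the tree
\[
  gsi = \frac{gs_{\mathrm{obs}} - \min(gs)}{\max(gs) - \min(gs)},
  \qquad \max(gs) = 1,
  \qquad \min(gs) = \frac{n}{\sum_{i=1}^{I} (d_i - 2)}
\]
```

Under this definition, a group that is monophyletic on a fully bifurcating tree has gsi = 1, and the value falls toward 0 as the group's ancestry becomes more intermixed with that of other groups.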

getting started

This learning activity can be done by one person working alone or, preferably, by two people working together. Partners can divide the work, then share and compare the results of their independently run analyses. The final analyses for the exercises in this activity are done using the web interface at genealogicalsorting.org.

exercise 1: species differentiation in dolphins

This example comes from a genetic study of the demography of speciation in two very closely related (sister taxa) allopatric dolphin species with an antitropical distribution: Pacific white-sided dolphin (Lagenorhynchus obliquidens, LOB) in the North Pacific, and dusky dolphin (L. obscurus, LOS) in the South Pacific (Fig. 1). Atlantic white-sided dolphin (L. acutus, LAC) is used as the outgroup (Hare et al. 2002). The data are sequences of four loci, all introns of nuclear protein-coding genes, sampled from multiple individuals of each species.

In this activity, we will provide you with trees generated for each of the four loci. The trees were generated in GARLI, rooted with the Atlantic white-sided dolphin individuals, and written to Newick format. Then, the outgroups were pruned from each tree using the Retree package in Phylip (you have this in the distribution you received the first day), resulting in the unrooted trees that you can download below. Note that the trees have to be rooted initially for this analysis, even though the outgroup should be removed before determining the probability of the observed gsi values for the groups of interest. Additionally, you should only include operational taxonomic units that under the null hypothesis might be considered equivalent.
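
The reason for removing the outgroup can be illustrated with a label-permutation test, one common way to estimate the probability of a tree statistic under a null hypothesis of no group structure. The sketch below is only an illustration: whether genealogicalsorting.org computes its p-values in exactly this way is an assumption, the statistic used is a simple stand-in and NOT the gsi, and the tree, tip names, and function names are all hypothetical.

```python
import random

def scan(node, labels, group):
    """Return (all tips in group?, number of tips, largest exclusive clade size)."""
    if isinstance(node, str):  # a tip
        in_group = labels[node] == group
        return in_group, 1, 1 if in_group else 0
    results = [scan(child, labels, group) for child in node]
    exclusive = all(r[0] for r in results)
    n_tips = sum(r[1] for r in results)
    best = n_tips if exclusive else max(r[2] for r in results)
    return exclusive, n_tips, best

def toy_statistic(tree, labels, group):
    """Stand-in statistic (not the gsi): fraction of the group's tips that
    fall in the largest clade containing only members of that group."""
    group_size = sum(1 for g in labels.values() if g == group)
    return scan(tree, labels, group)[2] / group_size

def permutation_p_value(tree, labels, group, n_perm=2000, seed=1):
    """Shuffle group labels across tips and count how often the permuted
    statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = toy_statistic(tree, labels, group)
    names = list(labels)
    groups = [labels[name] for name in names]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(groups)
        permuted = dict(zip(names, groups))
        if toy_statistic(tree, permuted, group) >= observed:
            hits += 1
    return observed, hits / n_perm

# Hypothetical rooted tree written as nested tuples, with two groups of three tips.
tree = ((("a1", "a2"), "a3"), (("b1", "b2"), "b3"))
labels = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
observed, p = permutation_p_value(tree, labels, "A")
print(observed, p)
```

Because every tip takes part in the shuffling, a test of this kind treats all tips as interchangeable under the null hypothesis. Outgroup sequences are not interchangeable with ingroup sequences, which is why they are pruned before the probability is determined.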

Figure 1. Pacific white-sided dolphin, Lagenorhynchus obliquidens (left), and dusky dolphin, L. obscurus (right).

  1. Download the processed tree for each of the four nuclear protein-coding gene introns: ACTtree2.phy, actin; BTMtree2.phy, butyrophilin; CAMKtree2.phy, calcium calmodulin-dependent kinase; and HEXBtree2.phy, lysosomal beta-hexosaminidase.
  2. Retrieve and save the group assignment file, which maps each operational taxonomic unit in the trees to one of the two species, L. obliquidens and L. obscurus (denoted by the specific epithet).
  3. Upload a tree file and the assignment file.
  4. Examine the tree for each locus, which should be presented with group labels. This will require you to reload the assignment file.
    • Based on examination of the tree before further analysis, make note of your impressions regarding the divergence between L. obliquidens and L. obscurus.
    • Make note of your impressions regarding the relative degree of exclusive ancestry for each species.
    • Visually compare trees from the different loci and make note of the relative differentiation between L. obliquidens and L. obscurus.
    • You may recall that different programs and program options affect the way polytomies are handled, and as a consequence your previous decisions in the phylogenetic analysis steps may affect your results. Do any of your trees have polytomies, and if so how might these affect the gsi values for the groups involved?
  5. Choose the analysis parameters and launch an analysis.
  6. Retrieve your results when the analysis is completed and examine the gsi values.
    • Are the analytical results consistent with your initial impressions from examining the trees?
    • In those cases where your initial impressions and the gsi values seem in conflict, reexamine the tree and the gsi values, and reevaluate your impressions.
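
The question above about polytomies can be checked quickly before uploading a tree. The following is a minimal sketch that counts the children of each internal node in a Newick string. It assumes a well-formed string without quoted labels or bracketed comments, and the function names are hypothetical.

```python
def children_per_node(newick):
    """Return the number of children of each internal node,
    in the order in which the nodes are closed."""
    counts = []
    open_nodes = []  # running child count for each currently open "("
    for ch in newick:
        if ch == "(":
            open_nodes.append(1)      # a new node starts with one child
        elif ch == "," and open_nodes:
            open_nodes[-1] += 1       # a comma separates sibling children
        elif ch == ")":
            counts.append(open_nodes.pop())
    return counts

def count_polytomies(newick):
    """Count internal nodes with more than two children."""
    return sum(1 for c in children_per_node(newick) if c > 2)

print(count_polytomies("((A,B),(C,D,E));"))
```

A polytomy that a program has written out as a bifurcation with a zero-length internal branch is not detected by this check, so branch lengths are worth inspecting as well.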

exercise 2: species differentiation in dolphins, integrating across loci

The dolphin data set also provides an opportunity to learn how to use information from multiple loci to quantify lineage divergence. You may recall that the variance in gene genealogies can be quite high, as demonstrated both by coalescent theory and empirical observations. In much the same way that data from multiple unlinked loci provide more precise estimates of θ (theta) in population genetics, integrating over multiple independent gene genealogies can provide more precise estimates of lineage divergence. The statistic for an ensemble of trees or gene genealogies is gsiT (Cummings, Neel and Shaw 2008).

  1. Combine the Newick tree files from the individual loci into a single multi-tree file, making note of the order of the trees. Note that there are several ways that this file might be created (e.g., UNIX command line, text editor, some other program, or a combination of these).
  2. Upload the multi-tree file and the assignment file again.
  3. Choose the analysis parameters and launch an analysis.
  4. Retrieve your results when the analysis is completed and examine the gsi values.
    • How does the value of gsiT, the ensemble statistic generated by integrating across gene genealogies, compare to the value of gsi for the individual trees?
    • How does the p-value associated with gsiT compare to those associated with the individual loci?
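
Step 1 of this exercise can be done in several ways. As one possibility, here is a minimal Python sketch that writes the trees to a single file in a fixed order. It assumes each input file holds one Newick tree ending in a semicolon. The function name and the output file name in the usage note are hypothetical.

```python
from pathlib import Path

def combine_trees(tree_paths, out_path):
    """Write the trees in tree_paths, in the order given, to one multi-tree file."""
    trees = []
    for path in tree_paths:
        text = Path(path).read_text().strip()
        if not text.endswith(";"):
            raise ValueError(f"{path} does not end with ';'")
        trees.append(text)
    # One tree per line, in the same order as tree_paths.
    Path(out_path).write_text("\n".join(trees) + "\n")
    return len(trees)
```

For the dolphin data, a call such as combine_trees(["ACTtree2.phy", "BTMtree2.phy", "CAMKtree2.phy", "HEXBtree2.phy"], "dolphin_all_loci.phy") would record the locus order in the list itself, which makes it easier to match each tree in the results to its locus.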

exercise 3: population structure of parasitic lice

This example comes from a genetic study of a parasitic louse species, Polyplax serrata (Fig. 2), and some of its mouse hosts in Europe (Stefka and Hypsa 2007). The mouse hosts here are striped field mouse, Apodemus agrarius, yellow-necked mouse, A. flavicollis, and wood mouse, A. sylvaticus (Fig. 3). The data are sequences of the mitochondrial gene for cytochrome c oxidase subunit I (COI) from 94 individuals of P. serrata, which were sampled from the three host Apodemus spp. in several areas of Europe. The data for this learning activity comprise a subset of the data from the original study.

Figure 2. Polyplax serrata.

Figure 3. Striped field mouse, Apodemus agrarius (left), yellow-necked mouse, A. flavicollis (center), and wood mouse, A. sylvaticus (right).

  1. The phylogenetic analysis has already been completed, and the tree file, liceTree.phy, is available.
  2. Retrieve and save the group assignment files, which map each operational taxonomic unit in the tree to a group. For this problem there are two assignment files: one based on host species (denoted by the specific epithet), and one based on geographical location as broad areas within Europe (central, central-eastern, eastern, southwestern, and western).
  3. Upload the tree file and an assignment file.
  4. Examine the tree, which should be presented with group labels.
  5. Make note of your impressions regarding the relative degree of exclusive ancestry for each group based on the assignment.
  6. Visually compare the tree with the different group assignments and make note of the relative differentiation based on host species and geography.
  7. Choose the analysis parameters and launch an analysis.
  8. Retrieve your results when the analysis is completed and examine the gsi values.
    • Are the analytical results consistent with your initial impressions from examining the trees?
    • In those cases where your initial impressions and the gsi values seem in conflict, reexamine the tree and the gsi values, and reevaluate your impressions.
    • Does exclusive ancestry seem higher when sequences from P. serrata are grouped based on host species, or on geographical region?
    • What host-parasite evolutionary scenarios are more consistent with the results, and which are less consistent with the results?
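
When switching between the two assignments, it can help to confirm that each one covers every tip in the tree and to see how many tips fall in each group. The sketch below does this for an assignment held in memory as a dictionary. The format of the real assignment files is not described here, so reading them is left out, and the tree, tip names, and group values shown are hypothetical. The tip-label extraction assumes unquoted labels.

```python
import re
from collections import Counter

def tip_labels(newick):
    """Return tip labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def check_assignment(newick, assignment):
    """Return (tips with no group, assigned names not in the tree, group sizes)."""
    tips = tip_labels(newick)
    missing = [t for t in tips if t not in assignment]
    unused = sorted(set(assignment) - set(tips))
    sizes = Counter(assignment[t] for t in tips if t in assignment)
    return missing, unused, sizes

# Hypothetical four-tip tree and two alternative groupings of the same tips.
newick = "((louse1:0.1,louse2:0.1):0.2,(louse3:0.1,louse4:0.3):0.1);"
by_host = {"louse1": "flavicollis", "louse2": "flavicollis",
           "louse3": "sylvaticus", "louse4": "agrarius"}
by_region = {"louse1": "central", "louse2": "western", "louse3": "central"}

print(check_assignment(newick, by_host))
print(check_assignment(newick, by_region))
```

Group sizes are worth noting alongside the gsi values, since groups with very few members give little basis for comparing exclusive ancestry between the host-based and geography-based assignments.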