Note, steps within objectives that have fill-in-the-blank prompts are
indicated as such using blue color font. Please fill in these prompts.
Additionally, if the software you are trying to use isn’t in your path,
it is likely in ~/software
.
Programs used: MAFFT, ClipKIT, IQTREE2, PhyKIT, & OrthoSNAP
Datasets of organisms with complex evolutionary histories, especially complex patterns of duplications and loss, can have few single-copy orthologs. As a result, it may be helpful to use newer strategies to obtain single-copy orthologs that are nested in larger, multi-copy orthogroups (OGs). To do so, we will implement a splitting and pruning procedure in OrthoSNAP. This software breaks or “snaps” branches in multi-copy gene families to identify single-copy OGs nested within; thus, single-copy OGs identified by OrthoSNAP are termed SNAP-OGs.
This figure describes how OrthoSNAP works and what it takes as input.
Here,the objective will be learning how to use OrthoSNAP to identify SNAP-OGs. Other skills and tasks reinforced during this practical include multiple sequence alignment, trimming, and tree inference. The dataset for this is orthologous groups of genes in FASTA format.
This objective is divided in five parts:
Hint:
# the command is 'tar'
Solution(s):
tar -zxvf data.tar.gz
Solution:
cd data
Hint:
# you can list the the content, grep for a pattern eg a fasta header
Solution:
# How many FASTA files are in data?
$ ls or $ ll or $ ls -1 | grep -c ".fa"
# What is the naming scheme used for each FASTA entry?
$ grep ">" OG0010342.fa
shows: Tachyglossus_aculeatus.outgroup|XP_038618634.1 (i.e., species_name.clade|gene_ID)
# How many of the FASTA files are single-copy orthologous groups?
for i in $(ls *fa); do
echo $i ;
grep ">" $i | sed 's/|.*$//g' | sort | uniq -c ;
done
# here is a breakdown of the more complex portion of the command
# grep ">" $i: get FASTA headers of file i
# sed 's/|.*$//g': remove everything after the pipe character to the end of the line, stripping gene names from the header
# sort | uniq -c: sort and count how many times each species is present in a FASTA file
# any file with species present more than two times is NOT a single-copy ortholog; thus, none of them are.
Align and trim the sequences using whatever multiple sequence alignment and trimming algorithm you’d like. In this tutorial, we will be using MAFFT and ClipKIT
Hint:
# to run mafft just type it's name
Solution:
# individual
$ mafft --auto OG0010342.fa > OG0010342.mafft
# loop
$ for i in $(ls OG00*.fa); do mafft --auto $i > $i.mafft ; done
# string interpolation (replacing the *.fa with *.mafft)
$ for i in *.fa; do mafft --auto $i > ${i/.fa/.mafft}; done; done
Solution:
$ for i in $(ls *mafft); do clipkit $i ; done
# OR
$ for i in *.mafft; do clipkit $i; done
Solution:
$ for i in *clipkit; do iqtree2 -s $i -pre $i -m JTT+F+G4 -fast ; done
Now, let’s break branches!
Hint:
# OrthoSNAP requires two arguments -f and -t, which point the software to where the tree and associated FASTA file are located.
Solution:
for i in OG001*clipkit; do orthosnap -t $i.treefile -f $i ; done
This step is a preview of a future practical on concatenation and partitioning.
Among other analyses, SNAP-OGs can be used to infer phylogenomic trees. To put into practice what we learned during the concatenation practical, we will now create a concatenation matrix using the SNAP-OGs.
Hint:
# use sed 's///g'
# use a loop
Solution:
for i in *orthosnap*fa; do mv $i temp ; cat temp | sed 's/|.*$//g' > $i ; done
# mv $i temp: create a temporary file of i
# cat temp | sed 's/|.*$//g' > $i: open the temporary file, strip everything after the pipe, and then rewrite to file name i
Solution:
# Create a concatenated matrix using the create_concat function in PhyKIT.
ls *orthosnap*fa > alignment_list.txt
pk_create_concat -a alignment_list.txt -p concat_orthosnap
# Examine the resulting output files. Using the occupancy file, which SNAP-OG had the highest and lowest taxon occupancy?
The occupancy file has six columns:
col 1: alignment file name
col 2: number of taxa present
col 3: number of taxa missing
col 4: fraction of taxa present
col 5: names of taxa that are missing
col 6: names of taxa that are present
highest: OG0010342.fa.mafft.clipkit.orthosnap.1.fa
lowest: OG0017677.fa.mafft.clipkit.orthosnap.0.fa
# Using the partition file, how long is the concatenated alignment?
There are 2,609 sites. Note, this may slightly vary depending on stochastic choices implemented in MAFFT.
Set up: 1. Download “full_dataset_trees.tar.gz”, which is available
here https://github.com/JLSteenwyk/2024_phylogenomics_workshop/raw/main/gene_trees_challenge/gene_tree_challenge_data/full_dataset_trees.tar.gz.
2. Unzip the directory. 3. Move ‘hypothesis_1.tre’, ‘hypothesis_2.tre’,
and ‘polytomy_test_groups_file.txt’ in full_dataset_trees
to the current directory.
So far, these files represent a subset of a much larger dataset, encompassing Eutherian mammals and outgroup taxa. For a long time, the evolutionary relationships among Eutherian mammals was contentious. Specifically, the evolutionary relationships among Afrotheria, Xenarthra, and other Eutherian mammals was unknown. The two major competing hypotheses were that Afrotheria and Xenarthra are sister lineages and that Afrotheria diverged from all other Eutherian mammals. In newick form, these hypotheses would be represented as follows:
Here, we will test support for each hypothesis.
Solution:
ls full_dataset_trees | grep -v "orthosnap" | wc -l
257
ls full_dataset_trees | grep "orthosnap" | wc -l
1475
# there are more 1,218 (1475 - 257) more SNAP-OGs than SC-OGs
We will now calculate how many genes support each hypothesis. This exact metric is typically referred to as “gene support frequencies” but was later adapted to be called “gene concordance factor.” We will revisit gene concordance factors later in the workshop; specifically, while talking about phylogenomic incongruence.
Let’s examine the phylogenetic trees with each alternative hypothesis. (The point here is just to show you how to quickly visualize a tree in the command-line and to have a deeper understanding of the underlying data.)
Solution:
pk_print_tree hypothesis_1.tre
Note, the higher taxonomic classification is after each genus and
species name. For example, Elephantulus_edwardii.Afrotheria
refers to genus and species Elephantulus_edwardii
from the
Afrotheria
clade. For simplicity, “other Eutheria” is
simplified to “Eutheria” and outgroup taxa are simply labeled as
“outgroup”.
Now, examine the alternative hypothesis:
Solution:
pk_print_tree hypothesis_2.tre
$ cat polytomy_test_groups_file.txt
Eutheria_relationships Chrysochloris_asiatica.Afrotheria;Echinops_telfairi.Afrotheria;Elephantulus_edwardii.Afrotheria;Loxodonta_africana.Afrotheria;Orycteropus_afer_afer.Afrotheria;Trichechus_manatus_latirostris.Afrotheria Choloepus_didactylus.Xenarthra;Dasypus_novemcinctus.Xenarthra Camelus_dromedarius.Eutheria;Canis_lupus_dingo.Eutheria;Capra_hircus.Eutheria;Desmodus_rotundus.Eutheria;Heterocephalus_glaber.Eutheria;Homo_sapiens.Eutheria;Macaca_mulatta.Eutheria;Mus_musculus.Eutheria;Oryctolagus_cuniculus.Eutheria;Phocoena_sinus.Eutheria;Sus_scrofa.Eutheria;Zalophus_californianus.Eutheria Aythya_fuligula.outgroup;Dermochelys_coriacea.outgroup;Falco_rusticolus.outgroup;Lacerta_agilis.outgroup;Malaclemys_terrapin_terrapin.outgroup;Mauremys_reevesii.outgroup;Tachyglossus_aculeatus.outgroup;Taeniopygia_guttata.outgroup
There are five columns in the polytomy_test_groups_file.txt file and one row. Each row specifies different tests to conduct; each column specifies the following: * col 1: A placeholder name for users to label each different polytomy test * col 2: group 0 (in this case, Afrotheria) * col 3: group 1 (in this case, Xenarthra) * col 4: group 2 (in this case, other Eutheria) * col 5: group 3 (taxa that are not the three focal groups of interest in a rootet quartet; in this case, outgroup taxa)
ls full_dataset_trees/ | grep -v "orthosnap" | awk '{print "full_dataset_trees/"$1}' > SC-OG_list.txt
$ pk_ptt -t SC-OG_list.txt -g polytomy_test_groups_file.txt
Gene Support Frequency Results
==============================
chi-squared: 29.4524
p-value: 0.0
total genes: 252
0-1: 121
0-2: 51
1-2: 80
The output of this test is as follows: * chi-squared: chi-squared value from chi-squared polytomy test * p-value: p-value from chi-squared polytomy test * total genes: the total number of genes used for polytomy testing + Note, this is fewer than how many SC-OGs because not all are informative for testing, such as those that have insufficient taxon representation for each clade * 0-1: the number of genes that support a sister relationship between group 0 and group 1 * 0-2: the number of genes that support a sister relationship between group 0 and group 2 * 1-2: the number of genes that support a sister relationship between group 1 and group 2 * As a reminder: + group 0: Afrotheria + group 1: Xenarthra + group 2: other Eutheria
Which hypothesis is best supported among SC-OGs?
$ ls FILES_orthosnap_full/*orthosnap* > SNAP-OG_list.txt
$ pk_ptt -t SNAP-OG_list.txt -g polytomy_test_groups_file.txt
Gene Support Frequency Results
==============================
chi-squared: 104.6823
p-value: 0.0
total genes: 1363
0-1: 582
0-2: 283
1-2: 498
Which hypothesis is best supported among SNAP-OGs? How similar are patterns of support among SNAP-OGs and SC-OGs?
Solution:
SNAP-OGs and SC-OGs are similar in their phylogenetic informativeness, suggesting that both are equally useful in phylogenomic analyses.
One benefit from SNAP-OGs compared to SC-OGs is that there can be substantially more SNAP-OGs than SC-OGs, which can be helpful for hypothesis testing, such as explored here.