Table of contents
- Expected learning outcome
- Exercise 1: Identifying homologs and non-homologs; effects of scoring matrices and algorithms
- Exercise 2: Comparison of protein:protein and translated DNA:protein to DNA:DNA searches – more sensitive DNA searches
- Exercise 3: Confirming statistical estimates with shuffles
- Exercise 4: Significant similarities within sequences – domain duplication
- Conclusion
Expected learning outcome
This activity will guide you through a number of similarity searching exercises using FASTA and BLAST. These exercises use programs on the FASTA WWW Search page and the Molecular Evolution BLAST WWW Search page [pgm].
An alternative (and more compact) version of these exercises is at: fasta.bioch.virginia.edu/mol_evol/.
In Exercise 1, you will search a small database for homologs using FASTA, Smith-Waterman (SSEARCH), or BLAST. The goal is to demonstrate that: (a) In general, all three programs produce about the same results using their default parameters. (b) Similarity statistics are accurate – unrelated sequences have E()-values near 1. (c) Homologous sequences do not always have significant scores, but additional searches (with different queries) can produce significant similarities. In Exercise 2, we show that protein:protein (BLASTP) or translated DNA:protein (BLASTX) searches are far far more sensitive than DNA:DNA searches. Exercise 3 shows how to use shuffling to independently confirm statistical signficance. Excercise 4 shows how to find duplicated domains, and that scoring matrices matter more for getting accurate domain boundaries.
Links below with the [pgm] highlight should take you to the appropriate program, with much of the information filled in (program, query sequence, database). Links with [seq] link to the NCBI page describing the protein or DNA sequence.
Exercise 1: Identifying homologs and non-homologs; effects of scoring matrices and algorithms
Use the FASTA search page [pgm] to compare Drosophila glutathione transferase GSTT1_DROME [seq](gi|121694) to the PIR1 Annotated protein sequence database. Be sure to press , not .
- Inspect the output.
- How long is the query sequence?
- How many sequences are in the PIR1 database?
- What scoring matrix was used?
- What were the gap penalties?
- What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
- Looking at an alignment, where are the boundaries of the alignment (the best local region)?
- What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm that your candidate non-homolog was truly unrelated?
- Note that this drosophila glutathione transferase shares significant similarity with both sequences from bacteria (SSPA_ECO57, stringent starvation protein) and mammals. How might you test whether the stringent starvation protein is homologous to glutathione transferases? (Hint – compare your candidate non-homolog with SwissProt for a more comprehensive test.)
- Compare the expectation (E()) value for the distant relationship between GSTT1_DROME andGSTP1_HUMAN (class-pi). How would you demonstrate that GSTT1_DROME is homologous toGSTP1_HUMAN?
[show answer] - Examine how the expectation value changes with different scoring matrices (BLOSUM62, BlastP62, PAM250) and different gap penalties. (The default scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of -10 to open a gap and -2 for each residue in the gap – e.g. -12 for a one residue gap).
- What happens to the E()-value for the 100% identical sequence with the different matrices and different gap penalties?
- What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
- What happens to the E()-value for the highest scoring unrelated sequence with the different matrices?
- Try the search with ssearch [pgm] (Smith-Waterman). Again, look at the E()-values for distant homologs and the highest scoring unrelated sequence.
Do the same search (121694) using the Course BLAST [pgm] WWW page (BLASTP search instead of FASTA).
- Inspect the output.
- How long is the query sequence?
- How many sequences are in the PIR1 database?
- What scoring matrix was used?
- What were the gap penalties?
- What are the numbers after the description of the library sequence? Which one is best for inferring homology?
- Looking at an alignment, where are the boundaries of the alignment (the best local region)?
- What is the highest scoring non-homolog?
- How do the blastp E()-values compare with the FASTA (blosum62) E()-values for the distantly related mammalian and plant sequences?
Exercise 2: Comparison of protein:protein and translated DNA:protein to DNA:DNA searches – more sensitive DNA
In this exercise, we will try to find GSTT1_DROME homologs in the Arabidopsis genome, using (1) protein:protein (BLASTP), (2) DNA:protein (BLASTX), (3) protein:DNA (TBLASTN), and (4) DNA:DNA (BLASTN) searches.
The BLASTP, BLASTX, etc. links below are pre-set to search Arabidopsis sequences.
- BLASTP. Compare the GSTT1_DROME [seq] (gi|121694) protein sequence to Arabidopsis proteins usingBLASTP [pgm].What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- BLASTX. Try the same search using the GSTT1_DROME cDNA DMGST [seq] (gi|8033) againstArabidopsis proteins using BLASTX [pgm].What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- TBLASTN. Use GSTT1_DROME [seq] (gi|121694) against translated Arabidopsis DNA using TBLASTN [pgm].What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- Finally, try the DNA:DNA comparison. Use BLASTN [pgm] to compare dmgst (gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.Are there detectable Arabidopsis homologs?
Exercise 3: Confirming statistical estimates with shuffles
Use the PRSS shuffle [pgm] program to evaluate the statistical significance of a match.
- Compare GSTT1_DROME [seq] (gi|121694) to GSTA4_RAT [seq] (gi|121714) using PRSS [pgm].What is the E()-value? What equivalent database size is used to calculate the E()-value? Why isn’t the database size 1? What should it be?
[show answer] - Use PRSS [pgm] to compare OPSD_HUMAN [seq] (gi|129207) to US27_HCMVA [seq] (gi|137159) Human Herpes GPCR homolog. Compare with uniform or window shuffling. How does the E()-value change? Why?
[show answer] - Use the FASTA search page [pgm] to compare rat synapsin-1 SYN1_RAT [seq] (gi|6686305) to the PIR1 Annotated protein sequence database.
- What is the E()-value against the human IL3 receptor IL3RB_HUMAN [seq]?
- Use PRSS [pgm] to compare SYN1_RAT with IL3RB_HUMAN. Now what is the E()-value? What happens with the window shuffle?
- Look at the two sequences using pseg [pgm]. (Select “show low complexity as domains”.)
Exercise 4: Significant similarities within sequences – domain duplication
Exploring domains with local alignments – Calmodulin
- Use lalign [pgm] to examine local similarities between calmodulin CALM_HUMAN and itself.
- Use plalign [pgm] to plot the same alignment. How many repeats are present in this sequence?
- If a protein is made up of five copies of a domain, how many off-diagonal (non-identical) alignments should be seen?
- What happens to the domain alignment plot when you use a shallower scoring matrix? (Try BP62 and MD20.)
Exploring domains with local alignments – Cortactin (SRC8_HUMAN)
- Use lalign [pgm] to examine local similarities between SRC8_HUMAN and itself. Note the average percent identity for the some of the most significant alignments.
- Use plalign [pgm] to plot the same alignment. How many repeats are shown in this alignment? How long do they appear to be?
- Based on the percent identities you saw in step 1, what would the appropriate scoring matrix be to accurately identify the cortactin domains?
- What happens to the domain alignment plot when you use a shallower scoring matrix? (Try BP62 and MD40.) How do the BP62, PAM120, and MD40 alignments differ? Why are they different? Which matrix appears to best identify all the repeats found in the PFAM diagram?
- You can look at the PFAM annotation of this protein at PFAM: SRC8_HUMAN [pfam] (Cortactin). How many repeats are shown in this diagram? How long are they?
For more complex domain alignments, try mwkw, or mouse RNA polymerase (rpb1_mouse residues 1500- ) against itself. Try the rpb1_mouse alignment using the MD20 scoring matrix as well as BLOSUM50.
Conclusion
Where to get the FASTA package: faculty.virginia.edu/wrpearson/fasta/
The “normal” FASTA WWW site