Table of contents
- expected learning outcome
- exercise 1: parse sequence files, calculate summary statistics
- exercise 2: calculate summary statistics for a genome assembly
- exercise 3: multiple alignment parsing and processing
- exercise 4: BLAST parsing and processing
- exercise 5: processing sequence files from GenBank or Swiss-Prot
- exercise 6: using local and remote sequence databases
- exercise 7: advanced: pattern matching on sequence data
expected learning outcomes
The objective of this activity is to promote learning how to use BioPerl to address parsing sequence files, tree files, and location information.
Download the Presentation slides .
exercise 1: parse sequence files, calculate summary statistics
Write a program that will read in the FASTA format sequence files available from here or here. You will need to uncompress them first with the gunzip command. On the command line, type gunzip FILENAME.gz.
- Read in the data using the Bio::SeqIO module.
- Calculate some simple summary statistics for sequences, including:
- The GC content. How many G or C bases are in the genomic sequence (yeast_chromosomes.fasta)?
- The AA frequencies of the two protein files (yeast_proteins.fasta,yeast_proteins_dubious.fasta).
- Compute the total and average length of the proteins in these files.
- Is there something different about the amino acid composition of the proteins that are either in the official geneset or the dubious set?
- Print out the reverse complement of the chromosome file and save that to a file calledyeast_chromosomes.revcom
exercise 2: calculate summary statistics for a genome assembly
- Given a FASTA file of a genome assembly, first, calculate the length of each contig and store this in an array. Use Bio::SeqIO to read in the data.
- Write a subroutine that takes a list of numbers and calculates their mean value.
- Write a subroutine that takes a list of numbers and calculates their median value. Here is help for how to calculate the median.
- Write a subroutine to calculate the N50 for the list of numbers. What about the N90? There are already some answers to this and some Perl code that does this for you to look at, here.
FYI – a set of programs that can do N50 for you include running
faLen < genome | stats
Which is part of Colin Dewey’s code for Mercator. Also see the Appendix of Colin’s Thesis.
Also see Thomas Girke’s N50 calculating code in R and it will graph the values (so it computes all N, eg N10,N50, N90,…).
exercise 3: multiple alignment parsing and processing
The Bio::AlignIO and Bio::SimpleAlign modules are useful for parsing and manipulating multiple alignments.
- Write a parser to read in a multiple alignment in ClustalW or multi-FASTA format. You can get some examples here or use your own.
- Print the number of sequences in the alignment using the method each_seq.
- Print the length of the alignment.
- (Advanced) Print the number of columns that contain any gaps in the alignment.
- Print a slice of the alignment from position 20 to 30.
exercise 4: BLAST parsing and processing
Parse this BLAST file which is in the full text format. Some files are [here]. You will want to use this tutorial to match the data you can get out of the objects. Write a script to print out the following:
- Print the name of the Query.
- Print out the name of each Hit, indented.
- Print out the number of HSPs for each Hit.
- Print out the percent similarity and percent identity for each HSP.
- Print out the e-value for each HSP.
The SearchIO system is flexible so it will work for BLAST or FASTA output. Try your program on the FASTA result.
exercise 5: processing sequence files from GenBank or Swiss-Prot
Download a genbank file from here.
- Use Bio::SeqIO to parse the file (it is in GenBank format).
- Count the total number of CDS features in the file.
- Print out the start and stop of all the CDS features in the file.
- How many CDS features have a product type which is NOT “hypothetical protein”?
exercise 6: using local and remote sequence databases
- Use the example from the lectures to query for the accession number ABQ71724 or NC_000069. Use
- Using the downloaded yeast proteome, use the module Bio::DB::Fasta to index the FASTA file and search for the gene ID YJR104C. Write out this sequence to a file.Now you can use Bio::DB::Fasta to get sequences out by name.
exercise 7: advanced: pattern matching on sequence data
- Use the yeast_proteins.fasta file which is in FASTA format.
- Use Bio::SeqIO to read the file.
- Write a tool to identify which proteins have a Nuclear Localization Signal (NLS) based on the classical motif K(K/R)X(K/R) or the bipartite motif. This would look like this regular expression: /K[KR].[KR]/
- Write a yeast NES mapping tool based on the NES motif L-X(2,3)-[LIVFM]-X(2,3)-L-X-[LI]. This would look like …
- Restrict the search to the 60 terminal amino acids.
- How many yeast proteins are likely to be nuclear localized? How many are likely to be exported?
Some background and an example of the NLS and NES tool info: