Perl

expected learning outcome
exercise 1: write a program to mimic the unix ‘cat’ program
exercise 2: tab delimited simple processing to find best reciprocal BLAST hits
exercise 3: using hashes to make unique lists
exercise 4: write a program to re-sort a data file by a particular column
exercise 5: calculate gene summary statistics from a GFF file

expected learning outcomes

The objective of this activity is to promote learning basic Perl programming, data, and string manipulation.

Get access to the examples from the slides as single files.

To install perldoc on Ubuntu Linux, issue the following command: sudo apt-get install perl-doc

The solutions to the problems are available. Sometime there are more than one scripts generated for a problem, and you will see these as _2 or _3. etc.

Download all the data, solutions, and examples as tar files. You uncompress these with ‘tar xvf FILE.tar.gz’.

exercise 1: write a program to mimic the unix ‘cat’ program

‘cat’ is a program that essentially prints out the contents of a file; you can provide multiple files and their output will be concatenated together.

Try it:

cat file1.fa file2.fa cat file1.f1 file2.fa > file3.fa

The program should open each of the file(s) one at a time.
Print their contents to the STDOUT stream.

exercise 2: tab delimited simple processing to find best reciprocal BLAST hits

Parse a tab delimited file from BLAST (available here).
Open the file and use the split function to separate out the columns (see example of this in the slides – stop using Excel).
Throw away lines that start with # or where the e-value is not significant.
Keep a hash that tracks the BEST (top) hit for a query.
Now walk back through the hash, find cases where the query->subject pair is the same as the subject->query pair, and print these out using code like this:
my $best_hit = $data{$query}; if( $query eq $data{$best_hit} ) { }

exercise 3: using hashes to make unique lists

Often, we have a set of data from different sources and we may want to make this list into a unique, nonredundant list of identifiers. So your task is to collapse this given list of strings into a unique list. Use a hash to collect these (as keys) and print out the list of keys when you are finished.

Try making a list like this:
my @list = qw(green yellow red blue red green blue purple orange yellow orange red purple);

BONUS: Can you print out both the unique list and the number of times each item occurs?

DOUBLE BONUS: Can you print out the list in the order of the most frequently occurring element?

exercise 4: write a program to re-sort a data file by a particular column

Download the Cufflinks results file of gene expression in tab delimited format from here. There are several.expr files there; each one is for a different strain.
Read the .expr file and print out the number of genes which have a FPKM > 100.
Print out a new version of the file which is sorted by gene expression of q2, so that the most highly expressed are at the top.
The file FGSC_1131-FGSC_8878.genes_exp.diff represents a comparison of two strains’ expression. How many genes are > 2-fold upregulated in strain 1 (q1) vs strain 2 (q2)?

exercise 5: calculate gene summary statistics from a GFF file

Download some GFF files from here. Try the Mycobacterium one first. The GFF format is a tab delimited format for describing features, originally developed for describing gene features but has expanded to many types. You can see the original specification here and the GFF3 format is documented with examples.

Describe the total length (number of base pairs) of ‘gene’ sequences in the file. What about all types of features? How many annotated stop codons are there in the Mycobacterium annotation?

Write a script to process the GFF (tab delimited format).
The 3rd column is the feature type (e.g. gene, CDS).
The 4th and 5th columns are the feature start, stop.
Calculate the length of each feature type (or just try and do ‘gene’) type.

More advanced:

The GFF3 file is a fungus that has introns as determined by the breaks in the CDS features, e.g., a gene with 1 intron will have 2 CDS features. The intron can be inferred as this region is within the mRNA region but not a CDS.
How many introns are there in C. immitis?
How many base pairs are intronic in C. immitis?
Can you create a table that summarizes the number of CDS, mRNA, and Gene features and their mean or median length for this genome based on the GFF3 file?

Coding examples

example1, strings

#!/usr/bin/perl -w
use strict;

my $A = 20;
my $B = 20;
my $C = 10;
# Some logic testing
if( $A< $B ) {
	print "A is less than B\n";
} elsif( $B < $C ) {
	print "B is less than C\n";
} else {
	print "neither of those were true\n";
}
# a one line testing
print "A is $A\n" if ($A > 10);

# multi-lines for printing can be broken up
print "This is a very long message that takes up ",
"lots of space on the screen\n";

my $g =1;

# Loops
while( $g < 8) {
	print "g is $g\n";
	$g++;
}
for ( $g = 1; $g < 8; $g++ ) {
	print "g is $g\n";
}

example 2, arrays

#!/usr/bin/perl -w
use strict;
my @fruit = qw(apple pear peach);
my @tropical = qw(pineapple passion star);

print join(",",@fruit),"\n";

my @all = (@fruit,@tropical);
print join(" ",@all),"\n";
my @sorted = sort @all;
print join(" ",@sorted),"\n";

my @numbers = (20,21,17,0.87,400,5);

print "original list", join(" ",@numbers), "\n";
@numbers = sort { $a <=> $b } @numbers;

print "numerically sorted ", join(" ",@numbers), "\n";
@numbers = sort @numbers;
print "lexically sorted ", join(" ",@numbers), "\n";
@numbers = sort{$b<=>$a}@numbers;
print "reverse sorted ", join(" ",@numbers), "\n";

Table of contents