Introduction

This workshop will be an overview of an RNASeq workflow using widely accepted analysis techniques. We will perform a case study examining gene knockout of Atg5 in macrophages of B6 mice.

This Workflow will be posted to my github repository with the outputs from the commands

WorkFlow Overview

  1. Quality control (Skipping this step)
  2. Map reads to genome
  3. Count reads mapped to annotated genes
  4. Data Assessment/Quality Control
  5. Detect differentially expressed genes
  6. Data Visualization (Pretty Plots)
  7. Pathway analysis

Data originate from a study on norovirus infection (https://www.ncbi.nlm.nih.gov/sra/?term=PRJEB10074)

Case Study

Black 6 Mice +/- Atg5 (5 knockouts and 9 controls)   - Macrophage cells
Sequencing – Ilumina HiSeq
  - Single-end
  - 50 bp reads

READ MAPPING

UPDATE:

Type the following commands to download the correct raw data files.

$ wget https://www.dropbox.com/s/xe5zszwq849ym6h/Data.tar.gz  
$ tar -zxvf Data.tar.gz  

This will overwrite your Data folder. If you look in this Data directory, you will see the correct raw read files. You will also notice a file called mm_ref_GRCm38.p2_chr1.fa . This is the fasta file required for running IGV after mapping. Move this file to MouseGenome_GRCm38

Let’s recall some useful unix commands from earlier in the week. List the files in this directory. Now, Take a look at a few lines of “AACGCATT.fq”. How many reads are there in this file?

Now let’s move up one directory back to our main project directory, Transcriptomics.

STEP 1: READ MAPPING with STAR

OPTIONS (more options available in the STAR manual)

--runThreadN NumberOfThreads
--genomeDir /path/to/genomeDir
--readFilesIn /path/to/read1 [/path/to/read2] 
--readFilesCommand (uncompression command)*
--outFileNamePrefix 
--outFilterMismatchNmax N (recommended 0.06*readLength)
--outSamtype (sorted or unsorted)

STAR Command

$ STAR --runThreadN 2 --outBAMsortingThreadN 2 \
  --genomeDir <GENOME_INDEX_DIRECTORY> \
  --readFilesIn <READS_DIR><file> \
  --outFileNamePrefix <OUTPUT_DIR><prefix>. \
  --outFilterMismatchNmax 3 \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM SortedByCoordinate 

Now run the STAR command with “AACGCATT.fq” as the input file and output it to the STAR_Mapping directory. ####Note: For each exercise, “prefix” will be replaced with “AACGCATT”

Once the command has completed Take a look at the files inside STAR_Mapping generated by STAR to make sure all files have been generated. You should have files with the same extensions as in the image below. alt text

We can use the following command to print the contents of the Log file to your screen to get some general mapping information

$ cat AACGCATT.Log.final.out

You will notice that the number of input reads is listed in this file. Was your answer about the number of reads correct?

PAUSE 🛑

STEP 1.2: VIEW ALIGNMENTS IN IGV

  1. Load up x2g0 and select the terminal icon
  2. enter the command “igv.sh” and a browser should load
  3. Select File tab -> Load from File (~/workshop_materials/Transcriptomics/BAMS/AACGCATT.bam)
  4. Select Genome tab -> Load Genome From File (~/workshop_materials/Transcriptomics/MouseGenome_GRCm38/mm_ref_GRCm38.p2_chr1.fa)
  5. In the Search box type in a chromosome location
    i.) We only have chromosome one in this case ii.) for an example, type in “chr1:9,841,441-9,841,599” iii.) double click to zoom in and let’s look at the read mapping together

return to Transcriptomics.

For convenience, lets copy the bam file generated by STAR into the BAMS directory.

STEP 2: Counting

Before we execute the counting command, we need to convert the bam files into sam format.
Move into the BAMS directory if you aren’t there already.
Run the below command to convert the bam file into sam format

$  samtools view -h <prefix>.bam > <prefix>.sam

Now that we have the alignment file in sam output, use one of your unix commands to take a looks at some of the lines in this file.

PAUSE 🛑

STEP 2: COUNTING With HTSeq

PAUSE 🛑

Considerations for real studies:

When performing an RNASeq study, we have multiple samples and multiple output files. We don’t want to perform these commands one by one, so we can use some unix tools to automate these repetitive tasks. This is beyond the scope of this workshop, but I recommend you take a look at the links below at some point, so you can learn how to write some basic scripts execute a command like STAR iteratively over multiple files

   for-do-loop - iterate over multiple files
   using variables
   grep
   “|” - redirect output of a command to another command

Below is an example of how we use for-do-loops to execute the HTSeq command across multiple files. Try not to let yourself be overwhelmed with the syntax. Slowly, over time, you will become equipped with all the tools to write scripts similary to automate tasks.