This workshop will be an overview of an RNASeq workflow using widely accepted analysis techniques. We will perform a case study examining gene knockout of Atg5 in macrophages of B6 mice.
This workflow will be posted to my GitHub repository, along with the outputs from the commands.
Workflow Overview
Data originate from a study on norovirus infection (https://www.ncbi.nlm.nih.gov/sra/?term=PRJEB10074)
Black 6 Mice +/- Atg5 (5 knockouts and 9 controls) - Macrophage cells
Sequencing – Illumina HiSeq
- Single-end
- 50 bp reads
STAR
TopHat (Tuxedo suite)
HISAT
STAR, TopHat, and HISAT are all splice-aware read aligners. For this case study, we will use STAR.
Start your AWS instance in a terminal (or PuTTY).
Project Directory - /home/genomics/workshop_materials/Transcriptomics
You will notice several subdirectories in the project directory:
- Our raw sample fastq files are located in the Data directory
- Mapping results will output to STAR_Mapping
- Our Mouse genome (fasta and gtf) is located at MouseGenome_GRCm38
- Genome index is located at GenomeIndex2
- Counting results will go to HTSeq
- Differential Expression results will go to the DESeq2 directory
- You should make a directory, Figures, to place our pretty plots in later
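The layout above can be checked, and the Figures directory created, with a small helper. This is only a sketch: `check_project_layout` is a hypothetical name, and the default path assumes the project directory given above.

```shell
# Hypothetical helper: verify the expected subdirectories and create Figures.
# Assumption: the project root defaults to the workshop path; pass another path to override.
check_project_layout() {
    local project_dir="${1:-/home/genomics/workshop_materials/Transcriptomics}"
    local sub
    # Subdirectory names are taken from the list above.
    for sub in Data STAR_Mapping MouseGenome_GRCm38 GenomeIndex2 HTSeq DESeq2; do
        if [ -d "$project_dir/$sub" ]; then
            echo "found    $sub"
        else
            echo "MISSING  $sub"
        fi
    done
    # Create the Figures directory for later plots; -p makes this safe to repeat.
    mkdir -p "$project_dir/Figures"
}
```

Running `check_project_layout` with no argument uses the workshop path; a plain `mkdir Figures` from inside the project directory achieves the same result for the last step.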
Now let's go to the Data directory.
Type the following commands to download the correct raw data files.
$ wget https://www.dropbox.com/s/xe5zszwq849ym6h/Data.tar.gz
$ tar -zxvf Data.tar.gz
This will overwrite your Data folder. If you look in the Data directory, you will see the correct raw read files. You will also notice a file called mm_ref_GRCm38.p2_chr1.fa. This is the fasta file required for running IGV after mapping. Move this file to MouseGenome_GRCm38.
Let’s recall some useful unix commands from earlier in the week. List the files in this directory. Now take a look at a few lines of “AACGCATT.fq”. How many reads are there in this file?
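One possible way to answer the read-count question is sketched below. `count_reads` is a hypothetical helper, and it assumes the usual FASTQ layout of exactly four lines per read (header, sequence, `+`, qualities) in an uncompressed file.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumption: every read occupies exactly 4 lines (no wrapped sequence lines).
count_reads() {
    local fastq="$1"
    local lines
    lines=$(wc -l < "$fastq")   # total number of lines in the file
    echo $(( lines / 4 ))       # 4 lines per read
}
```

From inside the Data directory, `ls -lh` lists the files, `head -n 8 AACGCATT.fq` shows the first two reads, and `count_reads AACGCATT.fq` prints the number of reads.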
Now let’s move up one directory back to our main project directory, Transcriptomics.
OPTIONS (more options available in the STAR manual)
--runThreadN NumberOfThreads
--genomeDir /path/to/genomeDir
--readFilesIn /path/to/read1 [/path/to/read2]
--readFilesCommand (uncompression command)*
--outFileNamePrefix
--outFilterMismatchNmax N (recommended 0.06*readLength)
--outSAMtype (output format, e.g. BAM SortedByCoordinate or BAM Unsorted)
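As a quick check of the mismatch recommendation above, the threshold can be computed from the read length. `max_mismatches` is a hypothetical helper that rounds 6% of the read length down to a whole number.

```shell
# Hypothetical helper: 6% of the read length, rounded down (integer arithmetic).
max_mismatches() {
    local read_length="$1"
    echo $(( read_length * 6 / 100 ))
}
```

For the 50 bp reads in this data set, `max_mismatches 50` prints 3.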
$ STAR --runThreadN 2 --outBAMsortingThreadN 2 \
--genomeDir <GENOME_INDEX_DIRECTORY> \
--readFilesIn <READS_DIR><file> \
--outFileNamePrefix <OUTPUT_DIR><prefix>. \
--outFilterMismatchNmax 3 \
--outReadsUnmapped Fastx \
--outSAMtype BAM SortedByCoordinate
Now run the STAR command with “AACGCATT.fq” as the input file and write the output to the STAR_Mapping directory.
Note: For each exercise, replace “prefix” with “AACGCATT”.
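With the placeholders filled in, the command might look like the sketch below. It assumes you are in the Transcriptomics project directory and that the relative paths match the layout described earlier (GenomeIndex2, Data, STAR_Mapping); `run_star_mapping` is a hypothetical wrapper, not part of STAR.

```shell
# Hypothetical wrapper around the STAR command template above.
# Assumption: run from the Transcriptomics project directory, so the
# relative paths GenomeIndex2, Data and STAR_Mapping resolve correctly.
run_star_mapping() {
    local sample="$1"   # sample name without the .fq extension, e.g. AACGCATT
    STAR --runThreadN 2 --outBAMsortingThreadN 2 \
        --genomeDir GenomeIndex2 \
        --readFilesIn "Data/${sample}.fq" \
        --outFileNamePrefix "STAR_Mapping/${sample}." \
        --outFilterMismatchNmax 3 \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM SortedByCoordinate
}
```

Calling `run_star_mapping AACGCATT` maps the first sample; the same function can be reused for the other samples by changing the argument.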
Once the command has completed, take a look at the files inside STAR_Mapping to make sure everything has been generated by STAR. You should have files with the same extensions as in the image below.
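The check can also be scripted. The sketch below assumes the file names STAR normally writes for the options used here (coordinate-sorted BAM, unmapped reads written out, single-end data); `check_star_outputs` is a hypothetical helper, and the workshop image remains the reference for what to expect.

```shell
# Hypothetical helper: report which of STAR's usual output files exist for a prefix.
# Assumption: file names follow STAR's defaults for the options used in this workshop.
check_star_outputs() {
    local prefix="$1"   # e.g. STAR_Mapping/AACGCATT.
    local missing=0
    local suffix
    for suffix in Aligned.sortedByCoord.out.bam Log.final.out Log.out \
                  Log.progress.out SJ.out.tab Unmapped.out.mate1; do
        if [ -e "${prefix}${suffix}" ]; then
            echo "ok       ${prefix}${suffix}"
        else
            echo "missing  ${prefix}${suffix}"
            missing=1
        fi
    done
    # Non-zero exit status if any expected file is absent.
    return "$missing"
}
```

For example, `check_star_outputs STAR_Mapping/AACGCATT.` lists each expected file, and `cat STAR_Mapping/AACGCATT.Log.final.out` shows the mapping summary.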