READ MAPPING

UPDATE:

Type the following commands to download the correct raw data files.

$ wget https://www.dropbox.com/s/xe5zszwq849ym6h/Data.tar.gz  
$ tar -zxvf Data.tar.gz

This will overwrite your Data folder. If you look in this Data directory, you will see the correct raw read files. You will also notice a file called mm_ref_GRCm38.p2_chr1.fa . This is the fasta file required for running IGV after mapping. Move this file to MouseGenome_GRCm38

Let’s recall some useful unix commands from earlier in the week. List the files in this directory. Now, Take a look at a few lines of “AACGCATT.fq”. How many reads are there in this file?

Now let’s move up one directory back to our main project directory, Transcriptomics.

STEP 1: READ MAPPING with STAR

OPTIONS (more options available in the STAR manual)

--runThreadN NumberOfThreads
--genomeDir /path/to/genomeDir
--readFilesIn /path/to/read1 [/path/to/read2] 
--readFilesCommand (uncompression command)*
--outFileNamePrefix 
--outFilterMismatchNmax N (recommended 0.06*readLength)
--outSamtype (sorted or unsorted)

STAR Command

$ STAR --runThreadN 2 --outBAMsortingThreadN 2 \
  --genomeDir <GENOME_INDEX_DIRECTORY> \
  --readFilesIn <READS_DIR><file> \
  --outFileNamePrefix <OUTPUT_DIR><prefix>. \
  --outFilterMismatchNmax 3 \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM SortedByCoordinate

Now run the STAR command with “AACGCATT.fq” as the input file and output it to the STAR_Mapping directory. ####Note: For each exercise, “prefix” will be replaced with “AACGCATT”

Once the command has completed Take a look at the files inside STAR_Mapping generated by STAR to make sure all files have been generated. You should have files with the same extensions as in the image below.

We can use the following command to print the contents of the Log file to your screen to get some general mapping information

$ cat AACGCATT.Log.final.out

You will notice that the number of input reads is listed in this file. Was your answer about the number of reads correct?

PAUSE 🛑

STEP 1.2: VIEW ALIGNMENTS IN IGV

Load up x2g0 and select the terminal icon
enter the command “igv.sh” and a browser should load
Select File tab -> Load from File (~/workshop_materials/Transcriptomics/BAMS/AACGCATT.bam)
Select Genome tab -> Load Genome From File (~/workshop_materials/Transcriptomics/MouseGenome_GRCm38/mm_ref_GRCm38.p2_chr1.fa)
In the Search box type in a chromosome location
i.) We only have chromosome one in this case ii.) for an example, type in “chr1:9,841,441-9,841,599” iii.) double click to zoom in and let’s look at the read mapping together

return to Transcriptomics.

For convenience, lets copy the bam file generated by STAR into the BAMS directory.

STEP 2: Counting

Before we execute the counting command, we need to convert the bam files into sam format.
Move into the BAMS directory if you aren’t there already.
Run the below command to convert the bam file into sam format

$  samtools view -h <prefix>.bam > <prefix>.sam

Now that we have the alignment file in sam output, use one of your unix commands to take a looks at some of the lines in this file.

PAUSE 🛑

STEP 2: COUNTING With HTSeq

A Few Popular Counting Software packages

  HTSeq
  Cufflinks (Tuxedo Suite)
  Salmon (alignment free read quantification)

OPTIONS

-s <yes/no/reverse> (strand specific assay)  
-t <feature type>

Below is the HTSeq command for single sam file. Let’s execute it on our same file and save it to the HTSeq Directory

$ htseq-count -s yes -t exon <prefix>.sam /home/genomics/workshop_materials/Transcriptomics/MouseGenome_GRCm38/gencode.vM6.annotation.gtf > <OutputDirectory/prefix>.txt

PAUSE 🛑

Considerations for real studies:

When performing an RNASeq study, we have multiple samples and multiple output files. We don’t want to perform these commands one by one, so we can use some unix tools to automate these repetitive tasks. This is beyond the scope of this workshop, but I recommend you take a look at the links below at some point, so you can learn how to write some basic scripts execute a command like STAR iteratively over multiple files

   for-do-loop - iterate over multiple files
   using variables
   grep
   “|” - redirect output of a command to another command

Below is an example of how we use for-do-loops to execute the HTSeq command across multiple files. Try not to let yourself be overwhelmed with the syntax. Slowly, over time, you will become equipped with all the tools to write scripts similary to automate tasks.

alt text

In addition, there are many variations of the above script that could accomplish the same output. As you become more familiar with unix and coding in the terminal, you will acquire skills to make your analyses much more efficient.

Up until now, we have been working on a single file and a subset of it’s reads. I have placed the HTSeq output for each file in a github repository. You can go to the Counts directory and type the following command to retrieve the files.

$ git pull

Open a browser and log into RStudio

#load libraries
library("phyloseq"); packageVersion("phyloseq")
library("DESeq2"); packageVersion("DESeq2")

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:phyloseq':
## 
##     distance

## Loading required package: GenomicRanges

## Loading required package: GenomeInfoDb

## Loading required package: SummarizedExperiment

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:phyloseq':
## 
##     sampleNames

## Loading required package: DelayedArray

## Loading required package: matrixStats

## 
## Attaching package: 'matrixStats'

## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians

## Loading required package: BiocParallel

## 
## Attaching package: 'DelayedArray'

## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

## The following objects are masked from 'package:base':
## 
##     aperm, apply

library("ggplot2"); packageVersion("ggplot2")
library("ggrepel"); packageVersion("ggrepel")
library("data.table"); packageVersion("data.table")

## 
## Attaching package: 'data.table'

## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift

## The following object is masked from 'package:GenomicRanges':
## 
##     shift

## The following object is masked from 'package:IRanges':
## 
##     shift

## The following objects are masked from 'package:S4Vectors':
## 
##     first, second

Read HTSeq output into DESeq2 ###The following commands are adopted from the DESeq2 manual

First we need to read in the files

directory <- "~/Documents/Transcriptomics/Transcriptomics2018"
sampleFiles <- list.files(path=directory,pattern=".txt")
sampleFiles

##  [1] "AACGCATT.txt" "AACTTGAC.txt" "AAGTAGAG.txt" "AGTTGCTT.txt"
##  [5] "CACATCCT.txt" "CATGCTTA.txt" "CCAGTTAG.txt" "CGCAATTC.txt"
##  [9] "CGTTACCA.txt" "GTATAACA.txt" "TCCAGCAA.txt" "TGCTCGAC.txt"
## [13] "TTCGCTGA.txt" "TTGAATAG.txt"

Now read in your metadata from the “Conditions.txt” file

sampleCondition<- read.table("~/Documents/Transcriptomics/Conditions.txt",head=TRUE) #file with sample data to be compared
sampleCondition

##    SampleID condition
## 1  AGTTGCTT   Cre_d28
## 2  AACTTGAC   Cre_d28
## 3  AACGCATT   Cre_d28
## 4  CATGCTTA   Cre_d28
## 5  GTATAACA   Cre_d28
## 6  AAGTAGAG   Cre_d28
## 7  TGCTCGAC   Cre_d28
## 8  CACATCCT   Cre_d28
## 9  CGCAATTC   Cre_d28
## 10 CCAGTTAG    FF_d28
## 11 TTCGCTGA    FF_d28
## 12 CGTTACCA    FF_d28
## 13 TCCAGCAA    FF_d28
## 14 TTGAATAG    FF_d28

sampleCondition <- sampleCondition[order(sampleCondition$SampleID),] #We need to order this table by SampleID before merging with the sampleFiles
sampleCondition

##    SampleID condition
## 3  AACGCATT   Cre_d28
## 2  AACTTGAC   Cre_d28
## 6  AAGTAGAG   Cre_d28
## 1  AGTTGCTT   Cre_d28
## 8  CACATCCT   Cre_d28
## 4  CATGCTTA   Cre_d28
## 10 CCAGTTAG    FF_d28
## 9  CGCAATTC   Cre_d28
## 12 CGTTACCA    FF_d28
## 5  GTATAACA   Cre_d28
## 13 TCCAGCAA    FF_d28
## 7  TGCTCGAC   Cre_d28
## 11 TTCGCTGA    FF_d28
## 14 TTGAATAG    FF_d28

combine sample data into a data frame with file names

sampleTable <- data.frame(sampleName = sampleFiles, fileName = sampleFiles, condition = sampleCondition)
sampleTable #verify that the "fileName" column matches the condition.SampleID column

##      sampleName     fileName condition.SampleID condition.condition
## 3  AACGCATT.txt AACGCATT.txt           AACGCATT             Cre_d28
## 2  AACTTGAC.txt AACTTGAC.txt           AACTTGAC             Cre_d28
## 6  AAGTAGAG.txt AAGTAGAG.txt           AAGTAGAG             Cre_d28
## 1  AGTTGCTT.txt AGTTGCTT.txt           AGTTGCTT             Cre_d28
## 8  CACATCCT.txt CACATCCT.txt           CACATCCT             Cre_d28
## 4  CATGCTTA.txt CATGCTTA.txt           CATGCTTA             Cre_d28
## 10 CCAGTTAG.txt CCAGTTAG.txt           CCAGTTAG              FF_d28
## 9  CGCAATTC.txt CGCAATTC.txt           CGCAATTC             Cre_d28
## 12 CGTTACCA.txt CGTTACCA.txt           CGTTACCA              FF_d28
## 5  GTATAACA.txt GTATAACA.txt           GTATAACA             Cre_d28
## 13 TCCAGCAA.txt TCCAGCAA.txt           TCCAGCAA              FF_d28
## 7  TGCTCGAC.txt TGCTCGAC.txt           TGCTCGAC             Cre_d28
## 11 TTCGCTGA.txt TTCGCTGA.txt           TTCGCTGA              FF_d28
## 14 TTGAATAG.txt TTGAATAG.txt           TTGAATAG              FF_d28

Now we have our metadata formatted in a way that can be used by DESeq2, but we need to create a data matrix that can be used by the DESeq function

ddsCounts <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ condition.condition) #creates DESEqDataSet for DESeq function

PAUSE 🛑

Before we run the DESeq function, we need to do a final quality control step to make sure there are no batch effects or outliers

#The data first need to be transformed using the variance stabilizing transformation
vsd <- vst(ddsCounts, blind=FALSE) #variance stabilizing transformation
pcaData <- plotPCA(vsd, intgroup=c("condition.condition"),returnData=TRUE)
percentVar <- round(100 * attr(pcaData,"percentVar"))
head(pcaData)

##                     PC1       PC2   group condition.condition         name
## AACGCATT.txt -15.560314 -5.106014 Cre_d28             Cre_d28 AACGCATT.txt
## AACTTGAC.txt -15.560314 -5.106014 Cre_d28             Cre_d28 AACTTGAC.txt
## AAGTAGAG.txt  -8.813421 -5.173291 Cre_d28             Cre_d28 AAGTAGAG.txt
## AGTTGCTT.txt -17.843509 -5.159003 Cre_d28             Cre_d28 AGTTGCTT.txt
## CACATCCT.txt  -4.543260 -3.578570 Cre_d28             Cre_d28 CACATCCT.txt
## CATGCTTA.txt  24.039225  5.152656 Cre_d28             Cre_d28 CATGCTTA.txt

ggplot(pcaData, aes(PC1, PC2, color=condition.condition))+
  geom_text(mapping = aes(label = name),size=2.5)+
  geom_point(size=2.5) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  xlim(-25,25)+
  coord_fixed()

We can see that the sample “CGCAATTC.txt” performs drastically differently than the rest of the samples, so we need to remove it

vsd2 <- vsd[,vsd$condition.SampleID != "CGCAATTC"] #there is a typo here in the Rmarkdown you pulled from github. Remove the '.txt' extention here

pcaData2 <- plotPCA(vsd2, intgroup=c("condition.condition"),returnData=TRUE)
percentVar <- round(100 * attr(pcaData2,"percentVar"))
head(pcaData2)

##                     PC1         PC2   group condition.condition
## AACGCATT.txt -18.438509  -0.6177934 Cre_d28             Cre_d28
## AACTTGAC.txt -18.438509  -0.6177934 Cre_d28             Cre_d28
## AAGTAGAG.txt -11.179033  -5.6603309 Cre_d28             Cre_d28
## AGTTGCTT.txt -20.724749  -5.3125679 Cre_d28             Cre_d28
## CACATCCT.txt  -6.858376  -4.2964690 Cre_d28             Cre_d28
## CATGCTTA.txt  23.763822 -14.9062063 Cre_d28             Cre_d28
##                      name
## AACGCATT.txt AACGCATT.txt
## AACTTGAC.txt AACTTGAC.txt
## AAGTAGAG.txt AAGTAGAG.txt
## AGTTGCTT.txt AGTTGCTT.txt
## CACATCCT.txt CACATCCT.txt
## CATGCTTA.txt CATGCTTA.txt

ggplot(pcaData2, aes(PC1, PC2, color=condition.condition))+
  #geom_text(mapping = aes(label = name),size=1.5)+
  geom_point(size=3) +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) + 
  coord_fixed()

####You can see, there is now separation between the two conditions, so we won’t do anymore sample removal based on the PCA results ####Now we have to remove the outlier(s) from the dds counts matrix since this is what the DESeq function will be run on

ddsCounts2 <- ddsCounts[,ddsCounts$condition.SampleID != "CGCAATTC"] #there is a typo here in the Rmarkdown you pulled from github. Remove the '.txt' extention here

#Now we can run the differential expression analysis
dds <- DESeq(ddsCounts2) #DESeq2 command for differential expression test

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

## -- replacing outliers and refitting for 21 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)

## estimating dispersions

## fitting model and testing

Extracting DESeq2 Results

Below is a description of the column headers that will appear in the DESeq2 results table: * Basemean - mean of normalized counts for all samples * log2foldchange * lfcse - standard error * stat - wald statistic (log2foldchange / lfcse) * pvalue * padj - pvalue adjusted for false discovery ####Warning: You MUST resist the temptation to use the p-value when evaluating DESeq2 results. The adjusted p-value (padj) accounts for multiple hypothesis testing

#text file with ensembl ID's and corresponding gene names
mmGenes = read.table("~/Documents/Transcriptomics/MouseGeneTable.txt", header=TRUE, row.names = 1)
mmGenes = as.matrix(mmGenes) #I convert the data frame to a matrix to get a downstream function, cbind, to work properly
head(mmGenes)

##                       External_Gene_Name
## ENSMUSG00000000001.4  "GNAI3"           
## ENSMUSG00000000003.13 "PBSN"            
## ENSMUSG00000000028.12 "CDC45"           
## ENSMUSG00000000031.13 "H19"             
## ENSMUSG00000000037.14 "SCML2"           
## ENSMUSG00000000049.9  "APOH"

extract results table from DESeq2

res <- results(dds)
res.dds <- res[order(res$padj),] #order table by adj p value
summary(res.dds) #lets take a look at how many genes are up-regulated/down-regulated

## 
## out of 19026 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1444, 7.6%
## LFC < 0 (down)     : 1524, 8%
## outliers [1]       : 18, 0.095%
## low counts [2]     : 6832, 36%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

  # Quick check of factor levels
mcols(res.dds, use.names = TRUE) #this is important for results interpretation

## DataFrame with 6 rows and 2 columns
##                        type
##                 <character>
## baseMean       intermediate
## log2FoldChange      results
## lfcSE               results
## stat                results
## pvalue              results
## padj                results
##                                                                  description
##                                                                  <character>
## baseMean                           mean of normalized counts for all samples
## log2FoldChange log2 fold change (MLE): condition.condition FF d28 vs Cre d28
## lfcSE                  standard error: condition.condition FF d28 vs Cre d28
## stat                   Wald statistic: condition.condition FF d28 vs Cre d28
## pvalue              Wald test p-value: condition.condition FF d28 vs Cre d28
## padj                                                    BH adjusted p-values

We can add the gene names to the results by merging the mmGenes table with the res.dds table

res.dds = cbind(as(res.dds, "data.frame"), Gene_Name = as(mmGenes[rownames(res.dds), ],"matrix"))
head(res.dds)

##                         baseMean log2FoldChange     lfcSE       stat
## ENSMUSG00000069515.5  1755.61412      -2.490739 0.2074743 -12.005048
## ENSMUSG00000023913.15 4207.92575      -2.075657 0.2226230  -9.323642
## ENSMUSG00000028124.13 1049.89492      -1.589033 0.1713642  -9.272843
## ENSMUSG00000028691.10 9768.76201      -1.903420 0.2101529  -9.057312
## ENSMUSG00000093765.5    32.94265      -2.201636 0.2467593  -8.922201
## ENSMUSG00000022947.7    31.30477      -2.374491 0.2771568  -8.567317
##                             pvalue         padj Gene_Name
## ENSMUSG00000069515.5  3.342679e-33 4.070046e-29      LYZ1
## ENSMUSG00000023913.15 1.124135e-20 6.843736e-17    PLA2G7
## ENSMUSG00000028124.13 1.812483e-20 7.356266e-17      GCLM
## ENSMUSG00000028691.10 1.337044e-19 4.069961e-16     PRDX1
## ENSMUSG00000093765.5  4.571121e-19 1.113159e-15   GM20658
## ENSMUSG00000022947.7  1.059241e-17 2.149553e-14      CBR3

Make a new object to add significance boolean for plotting

resdt.dds = data.table(as(res.dds, "data.frame"),
                 keep.rownames = TRUE)
setnames(resdt.dds, "rn", "Ensembl_ID") 

resdt.dds

##                   Ensembl_ID     baseMean log2FoldChange     lfcSE
##     1:  ENSMUSG00000069515.5 1755.6141168    -2.49073851 0.2074743
##     2: ENSMUSG00000023913.15 4207.9257497    -2.07565720 0.2226230
##     3: ENSMUSG00000028124.13 1049.8949185    -1.58903345 0.1713642
##     4: ENSMUSG00000028691.10 9768.7620105    -1.90342025 0.2101529
##     5:  ENSMUSG00000093765.5   32.9426531    -2.20163571 0.2467593
##    ---                                                            
## 45702:  ENSMUSG00000107387.1    0.0000000             NA        NA
## 45703:  ENSMUSG00000107388.1    0.1148895     0.05009061 3.1853473
## 45704:  ENSMUSG00000107389.1    0.3429328     0.25155769 1.8079178
## 45705:  ENSMUSG00000107390.1    0.1759695    -0.31057669 2.4419312
## 45706:  ENSMUSG00000107392.1    0.0000000             NA        NA
##                stat       pvalue         padj     Gene_Name
##     1: -12.00504850 3.342679e-33 4.070046e-29          LYZ1
##     2:  -9.32364232 1.124135e-20 6.843736e-17        PLA2G7
##     3:  -9.27284345 1.812483e-20 7.356266e-17          GCLM
##     4:  -9.05731230 1.337044e-19 4.069961e-16         PRDX1
##     5:  -8.92220091 4.571121e-19 1.113159e-15       GM20658
##    ---                                                     
## 45702:           NA           NA           NA RP24-343A12.4
## 45703:   0.01572532 9.874535e-01           NA  RP23-408A1.4
## 45704:   0.13914222 8.893378e-01           NA   RP23-8L20.8
## 45705:  -0.12718487 8.987941e-01           NA RP24-351O18.4
## 45706:           NA           NA           NA   RP23-57N5.3

resdt.dds[, Significant := padj < .1]
resdt.dds[!is.na(Significant)]

##                   Ensembl_ID    baseMean log2FoldChange     lfcSE
##     1:  ENSMUSG00000069515.5 1755.614117  -2.490739e+00 0.2074743
##     2: ENSMUSG00000023913.15 4207.925750  -2.075657e+00 0.2226230
##     3: ENSMUSG00000028124.13 1049.894918  -1.589033e+00 0.1713642
##     4: ENSMUSG00000028691.10 9768.762010  -1.903420e+00 0.2101529
##     5:  ENSMUSG00000093765.5   32.942653  -2.201636e+00 0.2467593
##    ---                                                           
## 12172: ENSMUSG00000026927.15   87.101498   1.068005e-04 0.1576998
## 12173:  ENSMUSG00000024033.9    1.506442  -1.001425e-04 0.8582128
## 12174: ENSMUSG00000025290.14  812.246831   9.475938e-05 0.2630923
## 12175: ENSMUSG00000027002.11    4.563147   8.091800e-05 0.5605311
## 12176: ENSMUSG00000060261.13   44.095044   1.385327e-05 0.2523291
##                 stat       pvalue         padj Gene_Name Significant
##     1: -1.200505e+01 3.342679e-33 4.070046e-29      LYZ1        TRUE
##     2: -9.323642e+00 1.124135e-20 6.843736e-17    PLA2G7        TRUE
##     3: -9.272843e+00 1.812483e-20 7.356266e-17      GCLM        TRUE
##     4: -9.057312e+00 1.337044e-19 4.069961e-16     PRDX1        TRUE
##     5: -8.922201e+00 4.571121e-19 1.113159e-15   GM20658        TRUE
##    ---                                                              
## 12172:  6.772395e-04 9.994596e-01 9.997881e-01   SDCCAG3       FALSE
## 12173: -1.166872e-04 9.999069e-01 9.999562e-01     RSPH1       FALSE
## 12174:  3.601754e-04 9.997126e-01 9.999562e-01     RPS24       FALSE
## 12175:  1.443595e-04 9.998848e-01 9.999562e-01    NCKAP1       FALSE
## 12176:  5.490160e-05 9.999562e-01 9.999562e-01     GTF2I       FALSE

resdt.dds

##                   Ensembl_ID     baseMean log2FoldChange     lfcSE
##     1:  ENSMUSG00000069515.5 1755.6141168    -2.49073851 0.2074743
##     2: ENSMUSG00000023913.15 4207.9257497    -2.07565720 0.2226230
##     3: ENSMUSG00000028124.13 1049.8949185    -1.58903345 0.1713642
##     4: ENSMUSG00000028691.10 9768.7620105    -1.90342025 0.2101529
##     5:  ENSMUSG00000093765.5   32.9426531    -2.20163571 0.2467593
##    ---                                                            
## 45702:  ENSMUSG00000107387.1    0.0000000             NA        NA
## 45703:  ENSMUSG00000107388.1    0.1148895     0.05009061 3.1853473
## 45704:  ENSMUSG00000107389.1    0.3429328     0.25155769 1.8079178
## 45705:  ENSMUSG00000107390.1    0.1759695    -0.31057669 2.4419312
## 45706:  ENSMUSG00000107392.1    0.0000000             NA        NA
##                stat       pvalue         padj     Gene_Name Significant
##     1: -12.00504850 3.342679e-33 4.070046e-29          LYZ1        TRUE
##     2:  -9.32364232 1.124135e-20 6.843736e-17        PLA2G7        TRUE
##     3:  -9.27284345 1.812483e-20 7.356266e-17          GCLM        TRUE
##     4:  -9.05731230 1.337044e-19 4.069961e-16         PRDX1        TRUE
##     5:  -8.92220091 4.571121e-19 1.113159e-15       GM20658        TRUE
##    ---                                                                 
## 45702:           NA           NA           NA RP24-343A12.4          NA
## 45703:   0.01572532 9.874535e-01           NA  RP23-408A1.4          NA
## 45704:   0.13914222 8.893378e-01           NA   RP23-8L20.8          NA
## 45705:  -0.12718487 8.987941e-01           NA RP24-351O18.4          NA
## 45706:           NA           NA           NA   RP23-57N5.3          NA

write.table(res.dds,file="~/Documents/Transcriptomics/FF_vs_Cre.DESeq.out.txt",quote=F,sep="\t")

make rnk table for Gene Set Enrichment Analysis

RNK = data.table(Gene_Name = res.dds$Gene_Name, stat = res.dds$stat) #These are 2 columns from our deseq2 output that we need for GSEA
RNK = subset(RNK, stat != "NA")
head(RNK)

##    Gene_Name       stat
## 1:      LYZ1 -12.005048
## 2:    PLA2G7  -9.323642
## 3:      GCLM  -9.272843
## 4:     PRDX1  -9.057312
## 5:   GM20658  -8.922201
## 6:      CBR3  -8.567317

write.table(RNK, "~/Documents/Transcriptomics/FF_vs_Cre.DESeq.rnk",quote=F,sep="\t", row.names = F)

Volcano plots are a good way of visualizing the effects of the experimental conditions on the cells

volcano = ggplot(
    data = resdt.dds[!is.na(Significant)],
    mapping = aes(x = log2FoldChange,
                  y = -log10(padj),
                  color = Significant,
                  label = Ensembl_ID, label1 = Gene_Name)) +
    theme_bw() +
    geom_point() + 
    geom_point(data = resdt.dds[(Significant)], size = 7, alpha = 0.7) + #Larger circles for the significant values
    theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
    geom_hline(yintercept = -log10(.1)) +
    ggtitle("DESeq2 Negative Binomial Test Volcano Plot \nProximal Colon Abx vs no Abx") +
    theme(axis.title = element_text(size=12)) +
    theme(axis.text = element_text(size=12)) +
    theme(legend.text = element_text(size=12)) +
    geom_vline(xintercept = 0, lty = 2)
#volcano
volcano + xlim(-7,7) #xlim isn't necessary, but I add it here to make the plot symmetrical about the y axix

summary(resdt.dds)

##   Ensembl_ID           baseMean        log2FoldChange       lfcSE      
##  Length:45706       Min.   :    0.00   Min.   :-5.806   Min.   :0.076  
##  Class :character   1st Qu.:    0.00   1st Qu.:-0.527   1st Qu.:0.254  
##  Mode  :character   Median :    0.00   Median :-0.140   Median :0.462  
##                     Mean   :   38.99   Mean   :-0.150   Mean   :1.094  
##                     3rd Qu.:    2.23   3rd Qu.: 0.376   3rd Qu.:1.813  
##                     Max.   :32611.22   Max.   : 4.526   Max.   :3.185  
##                                        NA's   :26680    NA's   :26680  
##       stat             pvalue           padj                Gene_Name    
##  Min.   :-12.005   Min.   :0.000   Min.   :0.00    MMU-MIR-669A-4:    9  
##  1st Qu.: -0.758   1st Qu.:0.092   1st Qu.:0.11    FLG           :    8  
##  Median : -0.165   Median :0.437   Median :0.39    ACEA_U3       :    5  
##  Mean   : -0.049   Mean   :0.441   Mean   :0.42    SCARNA15      :    5  
##  3rd Qu.:  0.791   3rd Qu.:0.762   3rd Qu.:0.70    SCARNA4       :    5  
##  Max.   :  6.913   Max.   :1.000   Max.   :1.00    SNORA17       :    5  
##  NA's   :26680     NA's   :26698   NA's   :33530   (Other)       :45669  
##  Significant    
##  Mode :logical  
##  FALSE:9208     
##  TRUE :2968     
##  NA's :33530    
##                 
##                 
##

save.image("./Transcriptomics.RData") #it's to to save the image so all the object that have been initiated can be accessed at any time

After running GSEA, we may want to make some heat maps. To do that we need to merge HTSeq counts files with Gene list file to make a counts table

counts = mmGenes #initialize counts table as gene list with gene names and ensemble IDs
#subset sampleFiles to remove the outlier we removed earlier
sampleFiles

#iterate over each HTSeq file
for(i in 1:length(sampleFiles)){ 
  file <- read.table(paste(directory,sampleFiles[i],sep=""),header=FALSE,sep="\t",row.names = 1)

  colnames(file) = c(sub('\\.txt$', '', sampleFiles[i])) #add the file name minus the ".txt" suffix to the header so we can identify each column
  
  dat = subset(file, !(row.names(file)%in%c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique"))) #remove the extra fields from the counts table
  
 counts = cbind(as(counts,"matrix"),as(dat,"data.frame")) #each time the loop iterates, it will add a new column for the file's counts
  
}
head(counts)
write.table(counts,"~/Transcriptomics/CountsTable.txt",quote=F,sep="\t")

Because R…We have to open the file “CountsTable.txt” and add the column header “Ensembl_ID” followed by a tab to the file, so the columns match with their headers. You can open the file in RStudio by double clicking the file

Gene Set Enrichment Analysis (GSEA)

Open x2goclient Because R…We have to open the file “CountsTable.txt” and add the column header “Ensembl_ID” followed by a tab to the file, so the columns match with their headers. Click the GSEA tab (a java browser should open)

Enrichment Plots

1. Open GSEA

2. select the “Load data” tab and select the rnk file (blue square by the name)

alt text

3. Tools/GSEAPreranked

3.1 Select Gene Sets

i) gseaftp.broadinstitute.org://pub/gsea/gene_sets/h.all.v6.2.symbols.gmt (Hallmark Gene Symbols)
ii) gseaftp.broadinstitute.org://pub/gsea/gene_sets/c2.cp.v6.2.symbols.gmt (Canonical Pathways)
alt text

3.2 Permutations = 1000

3.3 ranked list - select your rnk file

3.4 Basic Fields - Analysis Name - give your output prefix (include conditions in the filename)

3.5 Basic Fields - Enrichment Statistic : Weighted

3.6 Basic Fields - Max size 2000, min 15 (these are the limits of gene set sizes)

3.7 Basic Fields - Change the path to GSEA

alt text

3.8 RUN

PAUSE 🛑

Make a pretty heatmap based on an enriched GSEA pathway

In x2go, go to https://software.broadinstitute.org/morpheus/

Open /home/genomics/workshop_materials/Transcriptomics/CountsTable.txt

alt text

Click the first count value at the top left and make sure only the counts matrix is blue (also verify that column headers are lined up -> click ok

alt text

Now we need to normalize our data

Click “Tools” -> select Adjust -> Log 2 Transform and Quantile normalize

- Log2 Transform to make the data more interpretable for the heatmap
- Quantile normalize to account for variance in the dataset
alt text

The following is a list of genes which are enriched for the Atg5 knockout condition from “HALLMARK_REACTIVE_OXIGEN_SPECIES_PATHWAY

[ERCC2 GCLC CDKN2D HMOX2 PRDX2 PFKP GPX4 TXNRD1 TXNRD2 MGST1 FES JUNB PRDX4 SOD2 GLRX ATOX1 PRDX6 GSR CAT ABCC1 SRXN1 NQO1 PRDX1 GCLM]

Copy and paste it into the box to the right of the “columns” tab

Select the drop down arrow and select “Exact Match”

alt text

This will select all the genes in our list

Now, Select Tools -> New Heat Map

alt text

We can annotate columns by adding the conditions file as long as the column headers match the a column in the conditions file.

Go to File -> Open -> Conditions.txt and open the file (located in our project directory)

Click the drop down arrow and select “Annotate Columns”

$alt text$

alt text

Under “File Field Name”, select SampleID -> OK

alt text

We can perform hierarchical clustering to see if the gene set clusters by conditions.

Select Tools -> hierarchical clustering -> Euclidean distance and complete linkage

alt text

There are multiple distance formulas that could be applied, but this is the default formula used in the pheatmap package which is recommended R function for making heatmaps by the DESeq2 manual

RNASeq WorkFlow

Barry L Hykes Jr

10/2/2018

Introduction

Case Study

READ MAPPING

Popular read aligners

FILE ORGANIZATION

UPDATE:

STEP 1: READ MAPPING with STAR

STAR Command

PAUSE 🛑

STEP 1.2: VIEW ALIGNMENTS IN IGV

STEP 2: Counting

PAUSE 🛑

STEP 2: COUNTING With HTSeq

A Few Popular Counting Software packages

PAUSE 🛑

Considerations for real studies:

Open a browser and log into RStudio

First we need to read in the files

Now read in your metadata from the “Conditions.txt” file

combine sample data into a data frame with file names

Now we have our metadata formatted in a way that can be used by DESeq2, but we need to create a data matrix that can be used by the DESeq function

PAUSE 🛑

Before we run the DESeq function, we need to do a final quality control step to make sure there are no batch effects or outliers

We can see that the sample “CGCAATTC.txt” performs drastically differently than the rest of the samples, so we need to remove it

Extracting DESeq2 Results

extract results table from DESeq2

Make a new object to add significance boolean for plotting

make rnk table for Gene Set Enrichment Analysis

Gene Set Enrichment Analysis (GSEA)

Enrichment Plots

1. Open GSEA

2. select the “Load data” tab and select the rnk file (blue square by the name)

3. Tools/GSEAPreranked

3.1 Select Gene Sets

3.2 Permutations = 1000

3.3 ranked list - select your rnk file

3.4 Basic Fields - Analysis Name - give your output prefix (include conditions in the filename)

3.5 Basic Fields - Enrichment Statistic : Weighted

3.6 Basic Fields - Max size 2000, min 15 (these are the limits of gene set sizes)

3.7 Basic Fields - Change the path to GSEA

3.8 RUN

PAUSE 🛑

Make a pretty heatmap based on an enriched GSEA pathway

In x2go, go to https://software.broadinstitute.org/morpheus/

Open /home/genomics/workshop_materials/Transcriptomics/CountsTable.txt

Click the first count value at the top left and make sure only the counts matrix is blue (also verify that column headers are lined up -> click ok

Now we need to normalize our data

Click “Tools” -> select Adjust -> Log 2 Transform and Quantile normalize

The following is a list of genes which are enriched for the Atg5 knockout condition from “HALLMARK_REACTIVE_OXIGEN_SPECIES_PATHWAY

Copy and paste it into the box to the right of the “columns” tab

Select the drop down arrow and select “Exact Match”

This will select all the genes in our list

Now, Select Tools -> New Heat Map

We can annotate columns by adding the conditions file as long as the column headers match the a column in the conditions file.

Go to File -> Open -> Conditions.txt and open the file (located in our project directory)

Click the drop down arrow and select “Annotate Columns”

Under “File Field Name”, select SampleID -> OK

We can perform hierarchical clustering to see if the gene set clusters by conditions.

Select Tools -> hierarchical clustering -> Euclidean distance and complete linkage

There are multiple distance formulas that could be applied, but this is the default formula used in the pheatmap package which is recommended R function for making heatmaps by the DESeq2 manual