This workshop will be an overview of an RNASeq workflow using widely accepted analysis techniques. We will perform a case study examining gene knockout of Atg5 in macrophages of B6 mice.
This Workflow will be posted to my github repository with the outputs from the commands
WorkFlow Overview
Data originate from a study on norovirus infection (https://www.ncbi.nlm.nih.gov/sra/?term=PRJEB10074)
Black 6 Mice +/- Atg5 (5 knockouts and 9 controls) - Macrophage cells
Sequencing – Ilumina HiSeq
- Single-end
- 50 bp reads
STAR
Tophat (Tuxedo suite)
HISAT
For this case study, we will use STAR
start AWS instance in terminal (or putty)
Project Directory - /home/genomics/workshop_materials/Transcriptomics
You will notice many other subdirectories in the Project directory
- Our raw sample fastq files are located in the Data directory
- Mapping results will output to STAR_Mapping
- Our Mouse genome (fasta and gtf) is located at MouseGenome_GRCm38
- Genome index is located at GenomeIndex2
- Counting results will go to HTSeq
- Differential Expression results will go to the DESeq2 directory
- You should make a directory, Figures, to place our pretty plots in later
Now lets Go to the Data directory
Type the following commands to download the correct raw data files.
$ wget https://www.dropbox.com/s/xe5zszwq849ym6h/Data.tar.gz
$ tar -zxvf Data.tar.gz
This will overwrite your Data folder. If you look in this Data directory, you will see the correct raw read files. You will also notice a file called mm_ref_GRCm38.p2_chr1.fa . This is the fasta file required for running IGV after mapping. Move this file to MouseGenome_GRCm38
Let’s recall some useful unix commands from earlier in the week. List the files in this directory. Now, Take a look at a few lines of “AACGCATT.fq”. How many reads are there in this file?
Now let’s move up one directory back to our main project directory, Transcriptomics.
OPTIONS (more options available in the STAR manual)
--runThreadN NumberOfThreads
--genomeDir /path/to/genomeDir
--readFilesIn /path/to/read1 [/path/to/read2]
--readFilesCommand (uncompression command)*
--outFileNamePrefix
--outFilterMismatchNmax N (recommended 0.06*readLength)
--outSamtype (sorted or unsorted)
$ STAR --runThreadN 2 --outBAMsortingThreadN 2 \
--genomeDir <GENOME_INDEX_DIRECTORY> \
--readFilesIn <READS_DIR><file> \
--outFileNamePrefix <OUTPUT_DIR><prefix>. \
--outFilterMismatchNmax 3 \
--outReadsUnmapped Fastx \
--outSAMtype BAM SortedByCoordinate
Now run the STAR command with “AACGCATT.fq” as the input file and output it to the STAR_Mapping directory. ####Note: For each exercise, “prefix” will be replaced with “AACGCATT”
Once the command has completed Take a look at the files inside STAR_Mapping generated by STAR to make sure all files have been generated. You should have files with the same extensions as in the image below.
We can use the following command to print the contents of the Log file to your screen to get some general mapping information
$ cat AACGCATT.Log.final.out
You will notice that the number of input reads is listed in this file. Was your answer about the number of reads correct?
return to Transcriptomics.
For convenience, lets copy the bam file generated by STAR into the BAMS directory.
Before we execute the counting command, we need to convert the bam files into sam format.
Move into the BAMS directory if you aren’t there already.
Run the below command to convert the bam file into sam format
$ samtools view -h <prefix>.bam > <prefix>.sam
Now that we have the alignment file in sam output, use one of your unix commands to take a looks at some of the lines in this file.
HTSeq
Cufflinks (Tuxedo Suite)
Salmon (alignment free read quantification)
OPTIONS
-s <yes/no/reverse> (strand specific assay)
-t <feature type>
Below is the HTSeq command for single sam file. Let’s execute it on our same file and save it to the HTSeq Directory
$ htseq-count -s yes -t exon <prefix>.sam /home/genomics/workshop_materials/Transcriptomics/MouseGenome_GRCm38/gencode.vM6.annotation.gtf > <OutputDirectory/prefix>.txt
When performing an RNASeq study, we have multiple samples and multiple output files. We don’t want to perform these commands one by one, so we can use some unix tools to automate these repetitive tasks. This is beyond the scope of this workshop, but I recommend you take a look at the links below at some point, so you can learn how to write some basic scripts execute a command like STAR iteratively over multiple files
for-do-loop - iterate over multiple files
using variables
grep
“|” - redirect output of a command to another command
Below is an example of how we use for-do-loops to execute the HTSeq command across multiple files. Try not to let yourself be overwhelmed with the syntax. Slowly, over time, you will become equipped with all the tools to write scripts similary to automate tasks.
In addition, there are many variations of the above script that could accomplish the same output. As you become more familiar with unix and coding in the terminal, you will acquire skills to make your analyses much more efficient.
Up until now, we have been working on a single file and a subset of it’s reads. I have placed the HTSeq output for each file in a github repository. You can go to the Counts directory and type the following command to retrieve the files.
$ git pull
#load libraries
library("phyloseq"); packageVersion("phyloseq")
library("DESeq2"); packageVersion("DESeq2")
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:phyloseq':
##
## distance
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:phyloseq':
##
## sampleNames
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## Loading required package: BiocParallel
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
##
## aperm, apply
library("ggplot2"); packageVersion("ggplot2")
library("ggrepel"); packageVersion("ggrepel")
library("data.table"); packageVersion("data.table")
##
## Attaching package: 'data.table'
## The following object is masked from 'package:SummarizedExperiment':
##
## shift
## The following object is masked from 'package:GenomicRanges':
##
## shift
## The following object is masked from 'package:IRanges':
##
## shift
## The following objects are masked from 'package:S4Vectors':
##
## first, second
Read HTSeq output into DESeq2 ###The following commands are adopted from the DESeq2 manual
directory <- "~/Documents/Transcriptomics/Transcriptomics2018"
sampleFiles <- list.files(path=directory,pattern=".txt")
sampleFiles
## [1] "AACGCATT.txt" "AACTTGAC.txt" "AAGTAGAG.txt" "AGTTGCTT.txt"
## [5] "CACATCCT.txt" "CATGCTTA.txt" "CCAGTTAG.txt" "CGCAATTC.txt"
## [9] "CGTTACCA.txt" "GTATAACA.txt" "TCCAGCAA.txt" "TGCTCGAC.txt"
## [13] "TTCGCTGA.txt" "TTGAATAG.txt"
sampleCondition<- read.table("~/Documents/Transcriptomics/Conditions.txt",head=TRUE) #file with sample data to be compared
sampleCondition
## SampleID condition
## 1 AGTTGCTT Cre_d28
## 2 AACTTGAC Cre_d28
## 3 AACGCATT Cre_d28
## 4 CATGCTTA Cre_d28
## 5 GTATAACA Cre_d28
## 6 AAGTAGAG Cre_d28
## 7 TGCTCGAC Cre_d28
## 8 CACATCCT Cre_d28
## 9 CGCAATTC Cre_d28
## 10 CCAGTTAG FF_d28
## 11 TTCGCTGA FF_d28
## 12 CGTTACCA FF_d28
## 13 TCCAGCAA FF_d28
## 14 TTGAATAG FF_d28
sampleCondition <- sampleCondition[order(sampleCondition$SampleID),] #We need to order this table by SampleID before merging with the sampleFiles
sampleCondition
## SampleID condition
## 3 AACGCATT Cre_d28
## 2 AACTTGAC Cre_d28
## 6 AAGTAGAG Cre_d28
## 1 AGTTGCTT Cre_d28
## 8 CACATCCT Cre_d28
## 4 CATGCTTA Cre_d28
## 10 CCAGTTAG FF_d28
## 9 CGCAATTC Cre_d28
## 12 CGTTACCA FF_d28
## 5 GTATAACA Cre_d28
## 13 TCCAGCAA FF_d28
## 7 TGCTCGAC Cre_d28
## 11 TTCGCTGA FF_d28
## 14 TTGAATAG FF_d28
sampleTable <- data.frame(sampleName = sampleFiles, fileName = sampleFiles, condition = sampleCondition)
sampleTable #verify that the "fileName" column matches the condition.SampleID column
## sampleName fileName condition.SampleID condition.condition
## 3 AACGCATT.txt AACGCATT.txt AACGCATT Cre_d28
## 2 AACTTGAC.txt AACTTGAC.txt AACTTGAC Cre_d28
## 6 AAGTAGAG.txt AAGTAGAG.txt AAGTAGAG Cre_d28
## 1 AGTTGCTT.txt AGTTGCTT.txt AGTTGCTT Cre_d28
## 8 CACATCCT.txt CACATCCT.txt CACATCCT Cre_d28
## 4 CATGCTTA.txt CATGCTTA.txt CATGCTTA Cre_d28
## 10 CCAGTTAG.txt CCAGTTAG.txt CCAGTTAG FF_d28
## 9 CGCAATTC.txt CGCAATTC.txt CGCAATTC Cre_d28
## 12 CGTTACCA.txt CGTTACCA.txt CGTTACCA FF_d28
## 5 GTATAACA.txt GTATAACA.txt GTATAACA Cre_d28
## 13 TCCAGCAA.txt TCCAGCAA.txt TCCAGCAA FF_d28
## 7 TGCTCGAC.txt TGCTCGAC.txt TGCTCGAC Cre_d28
## 11 TTCGCTGA.txt TTCGCTGA.txt TTCGCTGA FF_d28
## 14 TTGAATAG.txt TTGAATAG.txt TTGAATAG FF_d28
ddsCounts <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory, design = ~ condition.condition) #creates DESEqDataSet for DESeq function
#The data first need to be transformed using the variance stabilizing transformation
vsd <- vst(ddsCounts, blind=FALSE) #variance stabilizing transformation
pcaData <- plotPCA(vsd, intgroup=c("condition.condition"),returnData=TRUE)
percentVar <- round(100 * attr(pcaData,"percentVar"))
head(pcaData)
## PC1 PC2 group condition.condition name
## AACGCATT.txt -15.560314 -5.106014 Cre_d28 Cre_d28 AACGCATT.txt
## AACTTGAC.txt -15.560314 -5.106014 Cre_d28 Cre_d28 AACTTGAC.txt
## AAGTAGAG.txt -8.813421 -5.173291 Cre_d28 Cre_d28 AAGTAGAG.txt
## AGTTGCTT.txt -17.843509 -5.159003 Cre_d28 Cre_d28 AGTTGCTT.txt
## CACATCCT.txt -4.543260 -3.578570 Cre_d28 Cre_d28 CACATCCT.txt
## CATGCTTA.txt 24.039225 5.152656 Cre_d28 Cre_d28 CATGCTTA.txt
ggplot(pcaData, aes(PC1, PC2, color=condition.condition))+
geom_text(mapping = aes(label = name),size=2.5)+
geom_point(size=2.5) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
xlim(-25,25)+
coord_fixed()
vsd2 <- vsd[,vsd$condition.SampleID != "CGCAATTC"] #there is a typo here in the Rmarkdown you pulled from github. Remove the '.txt' extention here
pcaData2 <- plotPCA(vsd2, intgroup=c("condition.condition"),returnData=TRUE)
percentVar <- round(100 * attr(pcaData2,"percentVar"))
head(pcaData2)
## PC1 PC2 group condition.condition
## AACGCATT.txt -18.438509 -0.6177934 Cre_d28 Cre_d28
## AACTTGAC.txt -18.438509 -0.6177934 Cre_d28 Cre_d28
## AAGTAGAG.txt -11.179033 -5.6603309 Cre_d28 Cre_d28
## AGTTGCTT.txt -20.724749 -5.3125679 Cre_d28 Cre_d28
## CACATCCT.txt -6.858376 -4.2964690 Cre_d28 Cre_d28
## CATGCTTA.txt 23.763822 -14.9062063 Cre_d28 Cre_d28
## name
## AACGCATT.txt AACGCATT.txt
## AACTTGAC.txt AACTTGAC.txt
## AAGTAGAG.txt AAGTAGAG.txt
## AGTTGCTT.txt AGTTGCTT.txt
## CACATCCT.txt CACATCCT.txt
## CATGCTTA.txt CATGCTTA.txt
ggplot(pcaData2, aes(PC1, PC2, color=condition.condition))+
#geom_text(mapping = aes(label = name),size=1.5)+
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
####You can see, there is now separation between the two conditions, so we won’t do anymore sample removal based on the PCA results ####Now we have to remove the outlier(s) from the dds counts matrix since this is what the DESeq function will be run on
ddsCounts2 <- ddsCounts[,ddsCounts$condition.SampleID != "CGCAATTC"] #there is a typo here in the Rmarkdown you pulled from github. Remove the '.txt' extention here
#Now we can run the differential expression analysis
dds <- DESeq(ddsCounts2) #DESeq2 command for differential expression test
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 21 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
Below is a description of the column headers that will appear in the DESeq2 results table: * Basemean - mean of normalized counts for all samples * log2foldchange * lfcse - standard error * stat - wald statistic (log2foldchange / lfcse) * pvalue * padj - pvalue adjusted for false discovery ####Warning: You MUST resist the temptation to use the p-value when evaluating DESeq2 results. The adjusted p-value (padj) accounts for multiple hypothesis testing
#text file with ensembl ID's and corresponding gene names
mmGenes = read.table("~/Documents/Transcriptomics/MouseGeneTable.txt", header=TRUE, row.names = 1)
mmGenes = as.matrix(mmGenes) #I convert the data frame to a matrix to get a downstream function, cbind, to work properly
head(mmGenes)
## External_Gene_Name
## ENSMUSG00000000001.4 "GNAI3"
## ENSMUSG00000000003.13 "PBSN"
## ENSMUSG00000000028.12 "CDC45"
## ENSMUSG00000000031.13 "H19"
## ENSMUSG00000000037.14 "SCML2"
## ENSMUSG00000000049.9 "APOH"
res <- results(dds)
res.dds <- res[order(res$padj),] #order table by adj p value
summary(res.dds) #lets take a look at how many genes are up-regulated/down-regulated
##
## out of 19026 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1444, 7.6%
## LFC < 0 (down) : 1524, 8%
## outliers [1] : 18, 0.095%
## low counts [2] : 6832, 36%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# Quick check of factor levels
mcols(res.dds, use.names = TRUE) #this is important for results interpretation
## DataFrame with 6 rows and 2 columns
## type
## <character>
## baseMean intermediate
## log2FoldChange results
## lfcSE results
## stat results
## pvalue results
## padj results
## description
## <character>
## baseMean mean of normalized counts for all samples
## log2FoldChange log2 fold change (MLE): condition.condition FF d28 vs Cre d28
## lfcSE standard error: condition.condition FF d28 vs Cre d28
## stat Wald statistic: condition.condition FF d28 vs Cre d28
## pvalue Wald test p-value: condition.condition FF d28 vs Cre d28
## padj BH adjusted p-values
We can add the gene names to the results by merging the mmGenes table with the res.dds table
res.dds = cbind(as(res.dds, "data.frame"), Gene_Name = as(mmGenes[rownames(res.dds), ],"matrix"))
head(res.dds)
## baseMean log2FoldChange lfcSE stat
## ENSMUSG00000069515.5 1755.61412 -2.490739 0.2074743 -12.005048
## ENSMUSG00000023913.15 4207.92575 -2.075657 0.2226230 -9.323642
## ENSMUSG00000028124.13 1049.89492 -1.589033 0.1713642 -9.272843
## ENSMUSG00000028691.10 9768.76201 -1.903420 0.2101529 -9.057312
## ENSMUSG00000093765.5 32.94265 -2.201636 0.2467593 -8.922201
## ENSMUSG00000022947.7 31.30477 -2.374491 0.2771568 -8.567317
## pvalue padj Gene_Name
## ENSMUSG00000069515.5 3.342679e-33 4.070046e-29 LYZ1
## ENSMUSG00000023913.15 1.124135e-20 6.843736e-17 PLA2G7
## ENSMUSG00000028124.13 1.812483e-20 7.356266e-17 GCLM
## ENSMUSG00000028691.10 1.337044e-19 4.069961e-16 PRDX1
## ENSMUSG00000093765.5 4.571121e-19 1.113159e-15 GM20658
## ENSMUSG00000022947.7 1.059241e-17 2.149553e-14 CBR3
resdt.dds = data.table(as(res.dds, "data.frame"),
keep.rownames = TRUE)
setnames(resdt.dds, "rn", "Ensembl_ID")
resdt.dds
## Ensembl_ID baseMean log2FoldChange lfcSE
## 1: ENSMUSG00000069515.5 1755.6141168 -2.49073851 0.2074743
## 2: ENSMUSG00000023913.15 4207.9257497 -2.07565720 0.2226230
## 3: ENSMUSG00000028124.13 1049.8949185 -1.58903345 0.1713642
## 4: ENSMUSG00000028691.10 9768.7620105 -1.90342025 0.2101529
## 5: ENSMUSG00000093765.5 32.9426531 -2.20163571 0.2467593
## ---
## 45702: ENSMUSG00000107387.1 0.0000000 NA NA
## 45703: ENSMUSG00000107388.1 0.1148895 0.05009061 3.1853473
## 45704: ENSMUSG00000107389.1 0.3429328 0.25155769 1.8079178
## 45705: ENSMUSG00000107390.1 0.1759695 -0.31057669 2.4419312
## 45706: ENSMUSG00000107392.1 0.0000000 NA NA
## stat pvalue padj Gene_Name
## 1: -12.00504850 3.342679e-33 4.070046e-29 LYZ1
## 2: -9.32364232 1.124135e-20 6.843736e-17 PLA2G7
## 3: -9.27284345 1.812483e-20 7.356266e-17 GCLM
## 4: -9.05731230 1.337044e-19 4.069961e-16 PRDX1
## 5: -8.92220091 4.571121e-19 1.113159e-15 GM20658
## ---
## 45702: NA NA NA RP24-343A12.4
## 45703: 0.01572532 9.874535e-01 NA RP23-408A1.4
## 45704: 0.13914222 8.893378e-01 NA RP23-8L20.8
## 45705: -0.12718487 8.987941e-01 NA RP24-351O18.4
## 45706: NA NA NA RP23-57N5.3
resdt.dds[, Significant := padj < .1]
resdt.dds[!is.na(Significant)]
## Ensembl_ID baseMean log2FoldChange lfcSE
## 1: ENSMUSG00000069515.5 1755.614117 -2.490739e+00 0.2074743
## 2: ENSMUSG00000023913.15 4207.925750 -2.075657e+00 0.2226230
## 3: ENSMUSG00000028124.13 1049.894918 -1.589033e+00 0.1713642
## 4: ENSMUSG00000028691.10 9768.762010 -1.903420e+00 0.2101529
## 5: ENSMUSG00000093765.5 32.942653 -2.201636e+00 0.2467593
## ---
## 12172: ENSMUSG00000026927.15 87.101498 1.068005e-04 0.1576998
## 12173: ENSMUSG00000024033.9 1.506442 -1.001425e-04 0.8582128
## 12174: ENSMUSG00000025290.14 812.246831 9.475938e-05 0.2630923
## 12175: ENSMUSG00000027002.11 4.563147 8.091800e-05 0.5605311
## 12176: ENSMUSG00000060261.13 44.095044 1.385327e-05 0.2523291
## stat pvalue padj Gene_Name Significant
## 1: -1.200505e+01 3.342679e-33 4.070046e-29 LYZ1 TRUE
## 2: -9.323642e+00 1.124135e-20 6.843736e-17 PLA2G7 TRUE
## 3: -9.272843e+00 1.812483e-20 7.356266e-17 GCLM TRUE
## 4: -9.057312e+00 1.337044e-19 4.069961e-16 PRDX1 TRUE
## 5: -8.922201e+00 4.571121e-19 1.113159e-15 GM20658 TRUE
## ---
## 12172: 6.772395e-04 9.994596e-01 9.997881e-01 SDCCAG3 FALSE
## 12173: -1.166872e-04 9.999069e-01 9.999562e-01 RSPH1 FALSE
## 12174: 3.601754e-04 9.997126e-01 9.999562e-01 RPS24 FALSE
## 12175: 1.443595e-04 9.998848e-01 9.999562e-01 NCKAP1 FALSE
## 12176: 5.490160e-05 9.999562e-01 9.999562e-01 GTF2I FALSE
resdt.dds
## Ensembl_ID baseMean log2FoldChange lfcSE
## 1: ENSMUSG00000069515.5 1755.6141168 -2.49073851 0.2074743
## 2: ENSMUSG00000023913.15 4207.9257497 -2.07565720 0.2226230
## 3: ENSMUSG00000028124.13 1049.8949185 -1.58903345 0.1713642
## 4: ENSMUSG00000028691.10 9768.7620105 -1.90342025 0.2101529
## 5: ENSMUSG00000093765.5 32.9426531 -2.20163571 0.2467593
## ---
## 45702: ENSMUSG00000107387.1 0.0000000 NA NA
## 45703: ENSMUSG00000107388.1 0.1148895 0.05009061 3.1853473
## 45704: ENSMUSG00000107389.1 0.3429328 0.25155769 1.8079178
## 45705: ENSMUSG00000107390.1 0.1759695 -0.31057669 2.4419312
## 45706: ENSMUSG00000107392.1 0.0000000 NA NA
## stat pvalue padj Gene_Name Significant
## 1: -12.00504850 3.342679e-33 4.070046e-29 LYZ1 TRUE
## 2: -9.32364232 1.124135e-20 6.843736e-17 PLA2G7 TRUE
## 3: -9.27284345 1.812483e-20 7.356266e-17 GCLM TRUE
## 4: -9.05731230 1.337044e-19 4.069961e-16 PRDX1 TRUE
## 5: -8.92220091 4.571121e-19 1.113159e-15 GM20658 TRUE
## ---
## 45702: NA NA NA RP24-343A12.4 NA
## 45703: 0.01572532 9.874535e-01 NA RP23-408A1.4 NA
## 45704: 0.13914222 8.893378e-01 NA RP23-8L20.8 NA
## 45705: -0.12718487 8.987941e-01 NA RP24-351O18.4 NA
## 45706: NA NA NA RP23-57N5.3 NA
write.table(res.dds,file="~/Documents/Transcriptomics/FF_vs_Cre.DESeq.out.txt",quote=F,sep="\t")
RNK = data.table(Gene_Name = res.dds$Gene_Name, stat = res.dds$stat) #These are 2 columns from our deseq2 output that we need for GSEA
RNK = subset(RNK, stat != "NA")
head(RNK)
## Gene_Name stat
## 1: LYZ1 -12.005048
## 2: PLA2G7 -9.323642
## 3: GCLM -9.272843
## 4: PRDX1 -9.057312
## 5: GM20658 -8.922201
## 6: CBR3 -8.567317
write.table(RNK, "~/Documents/Transcriptomics/FF_vs_Cre.DESeq.rnk",quote=F,sep="\t", row.names = F)
Volcano plots are a good way of visualizing the effects of the experimental conditions on the cells
volcano = ggplot(
data = resdt.dds[!is.na(Significant)],
mapping = aes(x = log2FoldChange,
y = -log10(padj),
color = Significant,
label = Ensembl_ID, label1 = Gene_Name)) +
theme_bw() +
geom_point() +
geom_point(data = resdt.dds[(Significant)], size = 7, alpha = 0.7) + #Larger circles for the significant values
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5)) +
geom_hline(yintercept = -log10(.1)) +
ggtitle("DESeq2 Negative Binomial Test Volcano Plot \nProximal Colon Abx vs no Abx") +
theme(axis.title = element_text(size=12)) +
theme(axis.text = element_text(size=12)) +
theme(legend.text = element_text(size=12)) +
geom_vline(xintercept = 0, lty = 2)
#volcano
volcano + xlim(-7,7) #xlim isn't necessary, but I add it here to make the plot symmetrical about the y axix
summary(resdt.dds)
## Ensembl_ID baseMean log2FoldChange lfcSE
## Length:45706 Min. : 0.00 Min. :-5.806 Min. :0.076
## Class :character 1st Qu.: 0.00 1st Qu.:-0.527 1st Qu.:0.254
## Mode :character Median : 0.00 Median :-0.140 Median :0.462
## Mean : 38.99 Mean :-0.150 Mean :1.094
## 3rd Qu.: 2.23 3rd Qu.: 0.376 3rd Qu.:1.813
## Max. :32611.22 Max. : 4.526 Max. :3.185
## NA's :26680 NA's :26680
## stat pvalue padj Gene_Name
## Min. :-12.005 Min. :0.000 Min. :0.00 MMU-MIR-669A-4: 9
## 1st Qu.: -0.758 1st Qu.:0.092 1st Qu.:0.11 FLG : 8
## Median : -0.165 Median :0.437 Median :0.39 ACEA_U3 : 5
## Mean : -0.049 Mean :0.441 Mean :0.42 SCARNA15 : 5
## 3rd Qu.: 0.791 3rd Qu.:0.762 3rd Qu.:0.70 SCARNA4 : 5
## Max. : 6.913 Max. :1.000 Max. :1.00 SNORA17 : 5
## NA's :26680 NA's :26698 NA's :33530 (Other) :45669
## Significant
## Mode :logical
## FALSE:9208
## TRUE :2968
## NA's :33530
##
##
##
save.image("./Transcriptomics.RData") #it's to to save the image so all the object that have been initiated can be accessed at any time
After running GSEA, we may want to make some heat maps. To do that we need to merge HTSeq counts files with Gene list file to make a counts table
counts = mmGenes #initialize counts table as gene list with gene names and ensemble IDs
#subset sampleFiles to remove the outlier we removed earlier
sampleFiles
#iterate over each HTSeq file
for(i in 1:length(sampleFiles)){
file <- read.table(paste(directory,sampleFiles[i],sep=""),header=FALSE,sep="\t",row.names = 1)
colnames(file) = c(sub('\\.txt$', '', sampleFiles[i])) #add the file name minus the ".txt" suffix to the header so we can identify each column
dat = subset(file, !(row.names(file)%in%c("__no_feature","__ambiguous","__too_low_aQual","__not_aligned","__alignment_not_unique"))) #remove the extra fields from the counts table
counts = cbind(as(counts,"matrix"),as(dat,"data.frame")) #each time the loop iterates, it will add a new column for the file's counts
}
head(counts)
write.table(counts,"~/Transcriptomics/CountsTable.txt",quote=F,sep="\t")
Because R…We have to open the file “CountsTable.txt” and add the column header “Ensembl_ID” followed by a tab to the file, so the columns match with their headers. You can open the file in RStudio by double clicking the file
Open x2goclient Because R…We have to open the file “CountsTable.txt” and add the column header “Ensembl_ID” followed by a tab to the file, so the columns match with their headers. Click the GSEA tab (a java browser should open)
i) gseaftp.broadinstitute.org://pub/gsea/gene_sets/h.all.v6.2.symbols.gmt (Hallmark Gene Symbols)
ii) gseaftp.broadinstitute.org://pub/gsea/gene_sets/c2.cp.v6.2.symbols.gmt (Canonical Pathways)
- Log2 Transform to make the data more interpretable for the heatmap
- Quantile normalize to account for variance in the dataset