From Gene Counts to Differential Expression - DESeq2 Tutorial
18 minute read
note: this tutorial is loosely based off the DESeq2 and bioconductor documentation found here. Some explanation text is taken directly from the documentation and referenced.
In this tutorial, we will walk through going from gene counts to differential expression results.
Introduction
The true end goal in RNA Sequencing analysis is determining the genes that are differentially expressed between a control and treatment group. In this case, if you followed the previous tutorial, you know we are investigating the transcriptomic response of TCPMOH on the developing zebrafish. In this dataset subset we have two control biological replicates and two TCPMOH biological replicates. To perform differential expression analysis, we will be using using DESeq2 (Love, Huber, and Anders 2014). DESeq2 is a great tool for dealing with RNA-seq data and running Differential Gene Expression (DGE) analysis. It takes read count files from different samples, combines them into a big table (with genes in the rows and samples in the columns) and applies normalization for sequencing depth and library composition. We do not need to account for gene length normalization does because we are comparing the counts between sample groups for the same gene.
Load necessary packages
The first step is to make sure we have the proper packages loaded. We will be using DESeq2
, tidyverse
, and ggplot2
. If you haven’t already installed these before, you can run the following commands:
# install DESeq2
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
# install tidyverse (ggplot2 is in tidyverse)
install.packages("tidyverse")
Now we can load the packages into our workspace by running the following commands:
# package setup
library( "DESeq2" )
library(tidyverse)
library(ggplot2)
Getting Started - Data
This tutorial assumes you have a file, per sample, of gene counts. This would be a typical output from a tool such as featureCounts which we used in the previous tutorial. To give an idea of what this looks like, lets peak into one of the files:
# define the path to the file
# note this path is relative to the current working directory and the files are stored in a folder called data
dmso1_data_file_path <- 'data/DMSO_1_counts.tsv'
# read in the file
dmso_1_count_data <- read.csv(dmso1_data_file_path, header = TRUE, sep = "\t")
# show the first few lines
head(dmso_1_count_data)
## Geneid DMSO_1.RNA.STAR..mapped.bam
## 1 CYTB 136445
## 2 ND6 1443
## 3 ND5 77429
## 4 ND4 71269
## 5 ND4L 12122
## 6 ND3 10141
As you can see, the file has two columns. One identifying the gene name (Geneid) and the second identifying the number of counts per gene. It is useful to go ahead and rename the second column as it is currently cumbersome.
# rename the second column
colnames(dmso_1_count_data)[2] ="DMSO_1"
Lets go ahead and do this for the rest of the samples:
# dmso2
dmso_2_count_data <- read.csv('data/DMSO_2_counts.tsv', header = TRUE, sep = "\t")
colnames(dmso_2_count_data)[2] ="DMSO_2"
# tcpmoh1
tcpmoh_1_count_data <- read.csv('data/TCPMOH_1_counts.tsv', header = TRUE, sep = "\t")
colnames(tcpmoh_1_count_data)[2] ="TCPMOH_1"
# tcpmoh2
tcpmoh_2_count_data <- read.csv('data/TCPMOH_2_counts.tsv', header = TRUE, sep = "\t")
colnames(tcpmoh_2_count_data)[2] ="TCPMOH_2"
Organize Data
To actually run DESeq2, we want all the gene count data in one data frame. Therefore, we need to merge the data sets based on the Geneid.
# put all data frames into list
df_list <- list(dmso_1_count_data, dmso_2_count_data, tcpmoh_1_count_data, tcpmoh_2_count_data)
# merge all data frames in list
count_data <- df_list %>% reduce(full_join, by='Geneid')
# make the Geneid the row id
count_data <- count_data %>% remove_rownames %>% column_to_rownames(var="Geneid")
head(count_data)
## DMSO_1 DMSO_2 TCPMOH_1 TCPMOH_2
## CYTB 136445 176209 137586 193845
## ND6 1443 2208 1317 1695
## ND5 77429 93110 74762 93454
## ND4 71269 94885 79189 100194
## ND4L 12122 14628 11453 14168
## ND3 10141 14687 10184 13766
We also need a meta data table. This can be done in R or by filling out a csv/tsv file and reading it in. I will just generate it here.
# get the sample names from the count_data matrix
SampleName <- c(colnames(count_data))
# specify the conditions for each sample
# In my sample names, DMSO_1 and DMSO_2 are control replicated and TCPMOH_1 and TCPMOH_2 are treated replicates
condition <- c("control", "control", "treated", "treated")
# generate the metadata data frame
meta_data <- data.frame(SampleName, condition)
# make the sample name the row id
meta_data <- meta_data %>% remove_rownames %>% column_to_rownames(var="SampleName")
meta_data
## condition
## DMSO_1 control
## DMSO_2 control
## TCPMOH_1 treated
## TCPMOH_2 treated
We want to double check that the name of the columns in the counts matrix is the same as the names in the meta data file. We can do this using the following two lines of code. Ideally, you should see TRUE
and TRUE
returned. If not, the rows in meta_data
do not match the columns in counts_data
and you should trouble shoot before moving forward. This is usually just a misorganization somewhere or a typo.
all(colnames(count_data) %in% rownames(meta_data))
## [1] TRUE
all(colnames(count_data) == rownames(meta_data))
## [1] TRUE
Differential Expression Analysis
Set Up DESeq Objects
Now that we have the data organized, we can run the differential expression analysis. The first step is to create a DESeq
data object to complete the downstream analysis.
# create deseq data set object
dds <- DESeqDataSetFromMatrix(countData = count_data,
colData = meta_data,
design = ~ condition)
We can investigate the DESeq
object:
# run the below to see the object elements
dds
## class: DESeqDataSet
## dim: 30421 4
## metadata(1): version
## assays(1): counts
## rownames(30421): CYTB ND6 ... cep97 rpl24
## rowData names(0):
## colnames(4): DMSO_1 DMSO_2 TCPMOH_1 TCPMOH_2
## colData names(1): condition
We see here that the dimension is 36351x4
meaning we have 36351 genes and 4 samples. The row names are the gene ids. Everything looks as it should! An optional step is to pre fliter some of the genes with low counts.
Pre-Filtering
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.
# filter any counts less than 10
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
dds
## class: DESeqDataSet
## dim: 24356 4
## metadata(1): version
## assays(1): counts
## rownames(24356): CYTB ND6 ... cep97 rpl24
## rowData names(0):
## colnames(4): DMSO_1 DMSO_2 TCPMOH_1 TCPMOH_2
## colData names(1): condition
You can see that when we pre-filter, we end up with 26859 genes left. The only downside to this step is these genes will no longer be included in the final output file. This isn’t an issue for most cases but if you are like me and often aggregate results from multiple studies, this can make your life difficult as the number of genes will now be different. Anyway, pre-filtering is great if you are just running a ‘typical’ workflow.
Set Reference Factor
By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels. Here, we just want to be sure the level we are comparing against is truly the control using the relevel
function.
# set the reference to be the control
dds$condition <- relevel(dds$condition, ref = 'control')
Getting Normalized Counts
Sometimes, we want the normalized counts as an output for downstream analysis. To obtain them, you can run the following commands.
# get normalized counts
dds <- estimateSizeFactors(dds)
normalized_counts <- counts(dds, normalized = TRUE)
# If you would like to write the normalized counts to a file, you can run the following command. Note that you need to specify the file path.
# write.table(normalized_counts, file=normalized_counts_data_file_path, sep = '\t', quote=F, col.names = NA)
Run Differential Expression
The standard differential expression analysis steps are wrapped into a single function, DESeq
. Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.
dds <- DESeq(dds)
res <- results(dds)
res
## log2 fold change (MLE): condition treated vs control
## Wald test p-value: condition treated vs control
## DataFrame with 24356 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CYTB 158703.52 0.0480368 0.105504 0.455308 0.6488875 0.962519
## ND6 1643.65 -0.2958785 0.155837 -1.898644 0.0576113 0.487636
## ND5 83990.25 -0.0468041 0.102839 -0.455123 0.6490211 0.962519
## ND4 85437.84 0.0876639 0.110877 0.790639 0.4291546 0.917258
## ND4L 12990.62 -0.0880488 0.107552 -0.818662 0.4129792 0.908968
## ... ... ... ... ... ... ...
## hikeshi 1064.229 0.0618543 0.146767 0.421445 0.673430 0.966905
## eed 875.412 -0.0960908 0.166265 -0.577937 0.563307 0.948176
## nfkbiz 948.290 0.0612779 0.172558 0.355116 0.722503 0.977389
## cep97 661.991 -0.0895011 0.212066 -0.422043 0.672993 0.966905
## rpl24 35114.027 0.1329105 0.123686 1.074584 0.282561 0.844292
We can also get a summary:
summary(res)
##
## out of 24356 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 201, 0.83%
## LFC < 0 (down) : 270, 1.1%
## outliers [1] : 0, 0%
## low counts [2] : 9916, 41%
## (mean count < 204)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
If you would like to know the exact number of adjusted p-values that are <0.01, you can run the following command:
sum(res$padj < 0.1, na.rm=TRUE)
## [1] 471
As you can see, the results function defaults to a p-value <0.1. For stronger statistical significance, we often want to look at those <0.05. This can be specified in the results function:
res05 <- results(dds, alpha=0.05)
summary(res05)
##
## out of 24356 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 148, 0.61%
## LFC < 0 (down) : 197, 0.81%
## outliers [1] : 0, 0%
## low counts [2] : 6611, 27%
## (mean count < 75)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
And then the number of genes with an adjusted p-value <0.05 are:
sum(res$padj < 0.05, na.rm=TRUE)
## [1] 348
As you can see, this number is less as an adjusted p-value <0.05 is a stronger statistical significance. At this point, it is useful to download the results into one location. We can do that by writing the results to a csv file.
# write results to file, notice you need to specify the file path.
# write.csv(as.data.frame(res), file=deseq_results_file_path)
Visualizing Results
At this point, we can visualize and make conclusions for our data. The first question is what is considered a differentially expressed gene? There are two things to consider: (1) the lof2FC value and (2) the adjusted p-value.
DESeq2 uses the so-called Benjamini-Hochberg (BH) adjustment; in brief, this method calculates for each gene an adjusted p value which answers the following question: if one called significant all genes with a p value less than or equal to this gene’s p value threshold, what would be the fraction of false positives (the false discovery rate, FDR) among them (in the sense of the calculation outlined above)? These values, called the BH-adjusted p values, are given in the column padj of the results object. Hence, if we consider a fraction of 10% false positives acceptable, we can consider all genes with an adjusted p value below 10%=0.1 as significant.
Mathematically speaking, any log2FC value different than zero with low variability. Typically speaking, you will see people identify a differentially expressed gene for log2FC cutoff value between 1-1.5 (positive or negative). Additionally, the adjusted p-value is typically chosen to be <0.05 but is sometimes chosen to be <0.1 (DESeq default) or even <0.01 for stronger statistical confidence. For this tutorial, I will choose genes with an adjusted p-value <0.05 and a |Log2FC|<1
.
# convert results data to basic dataframe
data <- data.frame(res)
head(data)
## baseMean log2FoldChange lfcSE stat pvalue padj
## CYTB 158703.518 0.04803679 0.1055039 0.4553082 0.64888750 0.9625189
## ND6 1643.651 -0.29587855 0.1558368 -1.8986437 0.05761133 0.4876364
## ND5 83990.255 -0.04680413 0.1028385 -0.4551225 0.64902109 0.9625189
## ND4 85437.841 0.08766391 0.1108773 0.7906391 0.42915461 0.9172576
## ND4L 12990.616 -0.08804881 0.1075521 -0.8186621 0.41297922 0.9089675
## ND3 12017.929 -0.07758008 0.1247725 -0.6217724 0.53409151 0.9438768
Notice the column names! The columns we are typically interested in are log2FoldChange
and padj
.
PCA Plot
Portions of this section were taken from this tutorial. DESeq2 has a built-in function for plotting PCA plots, that uses ggplot2 under the hood. This is great because it saves us having to type out lines of code and having to fiddle with the different ggplot2 layers. In addition, it takes the rlog object as an input directly, hence saving us the trouble of extracting the relevant information from it.
The function plotPCA() requires two arguments as input: an rlog object and the intgroup (the column in our metadata that we are interested in).
rld <- rlog(dds)
plotPCA(rld)
By default the function uses the top 500 most variable genes. Ideally, we would want to see that our samples cluster together based on treatment group.
Volcano Plots
A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. We can get as simple or as fancy as we would like when it comes to volcano plots!
Identify Up and Down Regulated Genes
# add an additional column that identifies a gene as unregulated, downregulated, or unchanged
# note the choice of pvalue and log2FoldChange cutoff.
data <- data %>%
mutate(
Expression = case_when(log2FoldChange >= log(1) & padj <= 0.05 ~ "Up-regulated",
log2FoldChange <= -log(1) & padj <= 0.05 ~ "Down-regulated",
TRUE ~ "Unchanged")
)
head(data)
## baseMean log2FoldChange lfcSE stat pvalue padj
## CYTB 158703.518 0.04803679 0.1055039 0.4553082 0.64888750 0.9625189
## ND6 1643.651 -0.29587855 0.1558368 -1.8986437 0.05761133 0.4876364
## ND5 83990.255 -0.04680413 0.1028385 -0.4551225 0.64902109 0.9625189
## ND4 85437.841 0.08766391 0.1108773 0.7906391 0.42915461 0.9172576
## ND4L 12990.616 -0.08804881 0.1075521 -0.8186621 0.41297922 0.9089675
## ND3 12017.929 -0.07758008 0.1247725 -0.6217724 0.53409151 0.9438768
## Expression
## CYTB Unchanged
## ND6 Unchanged
## ND5 Unchanged
## ND4 Unchanged
## ND4L Unchanged
## ND3 Unchanged
Get the Top 10 Most Variable Genes
top <- 10
# we are getting the top 10 up and down regulated genes by filtering the column Up-regulated and Down-regulated and sorting by the adjusted p-value.
top_genes <- bind_rows(
data %>%
filter(Expression == 'Up-regulated') %>%
arrange(padj, desc(abs(log2FoldChange))) %>%
head(top),
data %>%
filter(Expression == 'Down-regulated') %>%
arrange(padj, desc(abs(log2FoldChange))) %>%
head(top)
)
# create a datframe just holding the top 10 genes
Top_Hits = head(arrange(data,pvalue),10)
Top_Hits
## baseMean log2FoldChange lfcSE stat pvalue
## pdk2b 9748.7672 1.329178 0.1346152 9.873903 5.402079e-23
## si:ch211-132b12.7 4790.7498 -1.239026 0.1256370 -9.861952 6.085478e-23
## trim63a 622.7489 1.887092 0.1942775 9.713383 2.644124e-22
## eevs 2600.6421 -1.623744 0.1680947 -9.659699 4.471781e-22
## zgc:113054 3347.4297 -1.115578 0.1315768 -8.478527 2.280618e-17
## si:ch211-161h7.5 2632.1364 -1.424045 0.1735748 -8.204213 2.321063e-16
## hebp2 2600.6605 -1.098667 0.1359765 -8.079829 6.485740e-16
## ndrg1b 716.7809 1.302238 0.1659977 7.844918 4.332367e-15
## psme4a 1946.1987 1.279067 0.1718202 7.444217 9.752090e-14
## hsp70.3 817.8511 -1.287905 0.1735650 -7.420303 1.168528e-13
## padj Expression
## pdk2b 4.393715e-19 Up-regulated
## si:ch211-132b12.7 4.393715e-19 Down-regulated
## trim63a 1.272705e-18 Up-regulated
## eevs 1.614313e-18 Down-regulated
## zgc:113054 6.586426e-14 Down-regulated
## si:ch211-161h7.5 5.586024e-13 Down-regulated
## hebp2 1.337916e-12 Down-regulated
## ndrg1b 7.819922e-12 Up-regulated
## psme4a 1.564669e-10 Up-regulated
## hsp70.3 1.687355e-10 Down-regulated
Basic Volcano Plot
data$label = if_else(rownames(data) %in% rownames(Top_Hits), rownames(data), "")
# basic plot
p1 <- ggplot(data, aes(log2FoldChange, -log(pvalue,10))) + # -log10 conversion
geom_point( size = 2/5) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"P-Value")) +
xlim(-4.5, 4.5)
p1
Basic Volcano Plot - Red Significant Genes
# basic plot with line + red for p < 0.05
p2 <- ggplot(data, aes(log2FoldChange, -log(pvalue,10))) + # -log10 conversion
geom_point(aes(color = Expression), size = 2/5) +
#geom_hline(yintercept= -log10(0.05), linetype="dashed", linewidth = .4) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"P-Value")) +
scale_color_manual(values = c("firebrick3", "black", "firebrick3")) +
xlim(-4.5, 4.5) +
theme(legend.position = "none")
p2
# with labels for top 10 sig overall
library(ggrepel)
p3 <- ggplot(data, aes(log2FoldChange, -log(pvalue,10))) + # -log10 conversion
geom_point(aes(color = Expression), size = 2/5) +
# geom_hline(yintercept=-log10(0.05), linetype="dashed", linewidth = .4) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"P-Value")) +
scale_color_manual(values = c("firebrick3", "black", "firebrick3")) +
xlim(-4.5, 4.5) +
theme(legend.position = "none") +
geom_text_repel(aes(label = label), size = 2.5)
p3
Up/Down Regulated Genes Identified
# plot with up/down
p4 <- ggplot(data, aes(log2FoldChange, -log(pvalue,10))) + # -log10 conversion
geom_point(aes(color = Expression), size = 2/5) +
#geom_hline(yintercept=-log10(0.05), linetype="dashed", linewidth = .4) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"P-Value")) +
scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
xlim(-4.5, 4.5)
p4
Up/Down Regulated Genes Identified and Labels
p5 <- ggplot(data, aes(log2FoldChange, -log(pvalue,10))) + # -log10 conversion
geom_point(aes(color = Expression), size = 2/5) +
# geom_hline(yintercept=-log10(0.05), linetype="dashed", linewidth = .4) +
xlab(expression("log"[2]*"FC")) +
ylab(expression("-log"[10]*"P-Value")) +
scale_color_manual(values = c("dodgerblue3", "black", "firebrick3")) +
xlim(-4.5, 4.5) +
geom_text_repel(aes(label = label), size = 2.5)
p5
Conclusion
Hope this tutorial was helpful! There are many other things that you can accomplish while analyzing differentially expressed genes. In fact, the next logical step would be to perform a gene enrichment analysis. Tutorial coming soon…