User Guide

Introduction


This Shiny app is a wrapper around DESeq2, an R package for “Differential gene expression analysis based on the negative binomial distribution”.

It is meant to provide an intuitive interface for researchers to easily upload, analyze, visualize, and explore RNAseq count data interactively with no prior programming knowledge in R.

This tool supports simple or multi-factorial experimental design. It also allows for exploratory analysis when no replicates are available.

The app also provides Surrogate Variable Analysis (svaseq) to detect hidden batch effects. Surrogate Variables (SVs) can then be included as adjustment factors in downstream analysis (e.g. differential expression). For more information on svaseq, go to this link

For details on how this package is used for RNASeq count data analysis and visualization, see documentation


Features


Various visualizations and output data are included:

  • Clustering

    • R-Log, Variance Stabilizing Transformation (VST) output matrices
    • PCA plots, Heatmaps

  • Differential Expression

    • Comparison Data (logFC, p-value, etc, sample vs sample, etc …)
    • MA plots


  • Surrogate Variable Analysis

    • SV plots
    • PCA plots

  • Gene Expression

    • Gene Boxplots



Input File & Conditions


1. Example data (simple/multi-factorial experiment)

  • We recommend trying one or both of the pre-loaded example data sets to run through the analysis and get familiar with the app. You should then be able to replicate the analysis on your own data sets.

  • The simple factorial example loads the Tissue Comparison Data (Human) (RNASeq counts) that belongs to this published study

  • The multi-factorial example loads the Mouse Data (RNASeq counts and experiment design metadata) that belongs to this published study

2. Upload your own data (gene counts)

  • A .csv/.txt file that contains a table of the gene counts

  • The first column should have gene names/ids followed by columns for sample counts. The file can be either comma or tab delimited

  • If your counts are not merged, you can use this Gene Count Merger to consolidate all your sample count files

  • For convenience, if this is a single-factor experiment with replicates, and sample names end in an underscore followed by the replicate number (see figure 2), the conditions are set automatically by parsing the sample names

  • If there are no replicates, select the No replicates option so that default experiment conditions can be set for the next step. This is necessary because newer versions of DESeq2 (> 1.22.0) do not work on experiments with no replicates. See here for more details

  • Avoid using special characters or spaces in sample names (figure 1), other than underscores to denote replicates (figure 2)

  • First column can either contain gene.ids or gene.names

  • Prefilter: You can also set a minimum number of counts per gene to include

  • For a simple-factor experiment sample counts file, download and view this file

  • For a multi-factorial experiment example file, download and view this counts file and the following metadata file.

  • Experimental design metadata can either be uploaded as a csv file or constructed in-page in the “Edit Conditions” step
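
The expected counts file layout (gene names/ids in the first column, one column per sample, comma or tab delimited) can be illustrated with a minimal Python sketch. This is not the app's own code: the function name `read_counts`, the delimiter heuristic, and the example gene and sample names are all assumptions made for illustration.

```python
import csv
import io

def read_counts(text):
    """Parse a comma- or tab-delimited counts table into (samples, counts)."""
    first_line = text.splitlines()[0]
    # Assumption: whichever delimiter appears more often in the header is used.
    delimiter = "\t" if first_line.count("\t") > first_line.count(",") else ","
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    header, body = rows[0], rows[1:]
    samples = header[1:]  # the first column holds gene names/ids
    counts = {row[0]: [int(v) for v in row[1:]] for row in body}
    return samples, counts

# Hypothetical example data, tab delimited.
example = (
    "gene\tliver_1\tliver_2\tbrain_1\tbrain_2\n"
    "GeneA\t10\t12\t0\t1\n"
    "GeneB\t5\t7\t30\t28\n"
)
samples, counts = read_counts(example)
print(samples)          # sample names taken from the header row
print(counts["GeneB"])  # integer counts for one gene
```

The same function accepts a comma-delimited file, since the delimiter is chosen from the header line.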

figure 1 (No Replicates): sample file

figure 2 (Replicates): sample file


3. Experiment Conditions

Setup experiment condition table

  • By default, if there are replicates, the sample name will be set as the condition for those samples

  • For example, if we take samples with replicates (figure 2), the default condition table will be:

figure 3 (Condition Table): sample file



Upload Gene Counts

(select .CSV)

.csv/.txt counts file (tab or comma delimited)

For details on this data, see this publication

Config & Prefilter

* For convenience, if this is a single-factor experiment and column names end in an underscore followed by the replicate number (e.g. sampleX_1, sampleX_2, etc.), the sample names will be parsed automatically and the conditions table will be set for the next step.
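
The automatic parsing described above can be sketched as follows. This is an illustration only, not the app's implementation; the function name `conditions_from_sample_names` and the rule for names without a replicate suffix (the full name is kept as the condition) are assumptions.

```python
import re

def conditions_from_sample_names(sample_names):
    """Map each sample name to a condition by stripping a trailing _<number>."""
    table = {}
    for name in sample_names:
        match = re.fullmatch(r"(.+)_(\d+)", name)
        # Assumption: a name without a replicate suffix is its own condition.
        table[name] = match.group(1) if match else name
    return table

print(conditions_from_sample_names(["sampleX_1", "sampleX_2", "sampleY_1"]))
# {'sampleX_1': 'sampleX', 'sampleX_2': 'sampleX', 'sampleY_1': 'sampleY'}
```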

Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis.

For more details, click here

Prefiltering is not necessary, but it can speed up processing
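
As a rough sketch of what a count prefilter does, the Python snippet below drops genes whose counts fall under a minimum. It assumes the threshold is applied to the total count across all samples; the app may apply it differently. The function name `prefilter` and the default of 10 are hypothetical.

```python
def prefilter(counts, min_total=10):
    """Keep genes whose summed count across samples is at least min_total."""
    # Assumption: the minimum applies to the row sum, not to each sample.
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}

# Hypothetical counts: gene -> per-sample counts.
counts = {"GeneA": [0, 1, 0, 2], "GeneB": [5, 7, 30, 28]}
print(prefilter(counts))  # {'GeneB': [5, 7, 30, 28]}
```

Removing rows with very few reads shrinks the table that later steps have to process, which is where the speed-up comes from.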


Gene Counts Table


Design Formula

Conditions/Factors

Option 1) Edit Table:


  • Tag samples with corresponding conditions
  • Download CSV


Option 2) Upload Experiment design table (meta table)

.csv/.txt counts file (tab or comma delimited)

1. Initialize DESeq2 Dataset

Initialize DESeq2 dataset with current counts and experimental design conditions

Surrogate variable analysis (svaseq): hidden batch effects


We can sometimes identify the source of batch effects and, using statistical models, remove any sample-specific variation that can be predicted from features like sequence content or gene length. Here we use Surrogate Variable Analysis (SVA), which does not require knowing exactly how the counts will vary across batches. It uses only the biological condition and looks for large-scale variation that is orthogonal to it. This approach therefore requires that the technical variation be orthogonal to the biological conditions.

For more information, see following link

Estimate Surrogate Variables (SVA)


SVA Plot

Remove Batch Effect (PCA Visualization)

Here we use limma::removeBatchEffect in order to regress the effect of selected Surrogate/Batch variable(s)

This is strictly for visualization purposes only, and the corrected counts are NOT used for downstream analysis

Inspect the plots and decide whether to include Surrogate Variables as adjustment factors in design for downstream analysis
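
To give a feel for what regressing out a surrogate variable means, here is a deliberately simplified Python sketch for one gene and one covariate. It is not limma's implementation: it fits an ordinary least-squares slope against a single centered covariate and ignores the biological design, which limma::removeBatchEffect can take into account. The function name `regress_out` and the numbers are hypothetical.

```python
def regress_out(values, covariate):
    """Remove the linear effect of one covariate from one gene's values."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    centered = [c - mean_c for c in covariate]
    sxx = sum(c * c for c in centered)
    if sxx == 0:
        return list(values)  # constant covariate: nothing to remove
    # Least-squares slope of the values on the centered covariate.
    slope = sum(c * (v - mean_v) for c, v in zip(centered, values)) / sxx
    # Subtract the fitted covariate effect; the gene's mean is preserved.
    return [v - slope * c for v, c in zip(values, centered)]

# Hypothetical transformed expression values that track a surrogate variable.
expression = [1.0, 3.0, 5.0, 7.0]
sv1 = [0.0, 1.0, 2.0, 3.0]
print(regress_out(expression, sv1))  # [4.0, 4.0, 4.0, 4.0]
```

In this toy case all variation is explained by the surrogate variable, so the corrected values collapse to the gene's mean.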



PCA (VST) after regression of batch/surrogate variable(s)

Run DESeq


DESeq run settings:

The DESeq function performs Differential Expression analysis based on the Negative Binomial Distribution using the following steps:

  1. estimation of size factors
  2. estimation of dispersion
  3. Negative Binomial GLM fitting and Wald statistics
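
The first of these steps can be illustrated with the median-of-ratios idea that DESeq2's default size factor estimation is based on. The Python sketch below is a simplified illustration, not the package's code; the function name `size_factors` and the example counts are assumptions.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of per-gene rows."""
    n_samples = len(counts[0])
    # Only genes with nonzero counts in every sample have a finite
    # log geometric mean, so only those are used.
    log_rows = [[math.log(c) for c in row] for row in counts if all(c > 0 for c in row)]
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(counts)
print(sf)
# Normalized counts are the raw counts divided by each sample's size factor.
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

In this example the second size factor is twice the first, so dividing by the size factors puts both samples on a common scale.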

Design Formula:


Showing only the first 5 rows of colData table:

Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis. If an experimental design is supplied which does not contain the necessary degrees of freedom for differential analysis, DESeq will provide a message to the user and follow the strategy outlined in Anders and Huber (2010) under the section "Working without replicates", wherein all the samples are considered as replicates of a single group for the estimation of dispersion. As noted in the reference above: "Some overestimation of the variance may be expected, which will make that approach conservative." Furthermore, "while one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation."

(Optional) Surrogate Variable Analysis (SVA)

Run Surrogate Variable Analysis (for hidden batch detection)

You may choose to include computed Surrogate Variables (SVs) in your design formula for downstream differential expression analysis
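
As an illustration of what including SVs in the design formula looks like, the Python sketch below assembles a formula string. The helper `design_formula` and the names `SV1`, `SV2`, and `condition` are hypothetical; the ordering follows the usual DESeq2 convention of placing the variable of interest last.

```python
def design_formula(factors, surrogate_variables=()):
    """Build a design formula string with adjustment terms first."""
    # Adjustment terms (SVs) go first, the factors of interest last.
    terms = list(surrogate_variables) + list(factors)
    return "~ " + " + ".join(terms)

print(design_formula(["condition"]))                  # ~ condition
print(design_formula(["condition"], ["SV1", "SV2"]))  # ~ SV1 + SV2 + condition
```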

Regularized Log Transformation

This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. The rlog transformation produces a similar variance stabilizing effect as varianceStabilizingTransformation, though rlog is more robust in the case when the size factors vary widely. The transformation is useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.


Distance Heatmap

PCA Plot

Download rlog.csv

Variance Stabilizing Transformation


This function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.


Distance Heatmap

PCA Plot

Download vsd.csv

Differential Expression Analysis

VS

Conditions:

MA Plot Settings

MA Plot

Download .csv

Gene Expression Boxplot

Plot Settings:


Boxplot

Download .csv

Heatmap

* This heatmap uses normalized counts which can be viewed/downloaded below the figure


Heatmap

Download Normalized Counts .csv
