Introduction

This vignette provides a walk through tutorial on how to use MuSiC2 to estimate cell type proportions for bulk RNA-seq data using scRNA-seq data as reference when the bulk and scRNA-seq data are generated from samples with multiple clinical conditions.

Installation

# install devtools if necessary
if (!"devtools" %in% rownames(installed.packages())) {
  install.packages('devtools')
}
# install the MuSiC2 package
if (!"MuSiC2" %in% rownames(installed.packages())) {
  devtools::install_github('Jiaxin-Fan/MuSiC2')
}
# load
library(MuSiC2)

Data

Similar as MuSiC (Wang et al., 2019), MuSiC2 uses two types of input data:

Bulk RNA sequencing expression data collected from samples with 2 different clincial conditions, e.g., healthy and diseased. These are the data we want to deconvolve.
Single-cell RNA sequencing (scRNA-seq) expression data collected from samples with single condition, e.g., healthy. The cell types of scRNA-seq are pre-determined. These serve as the reference for estimating cell type proportions of the bulk data.

Both datasets should be in the form of ExpressionSet. The details of constructing ExpressionSet can be found on this page.

MuSiC2 Deconvolution

MuSiC2 is an iterative algorithm aiming to improve cell type deconvolution for bulk RNA-seq data when the bulk data and scRNA-seq reference are generated from samples with different clinical conditions. The key idea of MuSiC2 is that, when the bulk samples and single-cell samples are from different clinical conditions, the majority of genes shall still have similar cell-type-specific gene expression pattern between conditions. By removing genes with cell-type-specific differential expression (DE) between conditions from the single-cell reference, MuSiC2 can refine the reference gene list and yield more accurate cell type proportion estimates.

Assuming we want to deconvolve bulk RNA-seq samples generated from both Healthy and Diseased conditions, using scRNA-seq data generated only from the Healthy condition as the reference. MuSiC2 iterates over 2 steps. In Step 1, we use MuSiC (Wang et al. 2019) to infer the cell type proportions of the bulk samples under both conditions by borrowing information from the scRNA-seq data. In Step 2, for samples within each condition, we deconvolve the bulk-level expression over the cell type proportion estimates obtained in Step 1 to infer the cell-type-specific mean expression for each gene and identify cell-type-specific DE genes between conditions. Then, by removing genes with cell-type-specific DE from the scRNA-seq data, we can update the cell type proportion estimates in Step 1 for bulk samples generated under Diseased condition. By alternating between cell type deconvolution (Step 1) and cell-type-specific DE gene detection and removal (Step 2), MuSiC2 gradually refines the list of “stable” genes retained in the scRNA-seq reference and improves the cell type proportion estimation for the diseased samples. An overview of MuSiC2 is shown in Figure 1.

To test for the cell-type-specific DE genes, a resampling procedure is employed in order to achieve a reliable estimate. Specifically, at each resampling iteration, we generate a subset of samples by random sampling without replacement under each clinical condition, and compute the log fold change of cell-type-specific expression between conditions, \(logFC_g^k=\frac{\mu_{g, diseased}^k}{\mu_{g, healthy}^k}\). We define a statistic \(T_g^k\) as the absolute value of the ratio of the mean and standard deviation (SD) of the \(logFC_g^k\) over all resamples as a measure of the cell-type-specific DE. Genes with \(T_g^k\) in the top 5% for common cell types, i.e., cell types with average proportion ≥ 10%, or in the top 1% for rare cell types, i.e., cell types with average proportion < 10%, are considered as cell-type-specific DE genes. Since fold change is sensitive to genes with low expression, we suggest that genes with bulk-level average sequencing depth < 20 are retained as “stable” genes and excluded from the cell-type-specific DE detection. We further filter the genes by their expression levels in the random samples. Specifically, we compute the mean of \(\mu_{g,healthy}^k\) and \(\mu_{g,diseased}^k\) over the resamples, and retain genes with cell-type-specific expression in the bottom 5% for samples in both conditions as “stable” genes and exclude them from the cell-type-specific DE detection. See the Methods session of the MuSiC2 manuscript for additional details.

Sample Analysis

For illustration purpose, in this tutorial, we deconvolved the benchmark bulk RNA-seq data, which contain raw RNA-seq read counts and sample annotation data for 100 healthy and 100 diseased (i.e., Type 2 diabetes (T2D)) samples simulated based on pancreatic islets scRNA-seq RNA-seq data from Segerstolpe et al. (2016). The procedure for generating the benchmark dataset can be found in the Methods session of the MuSiC2 manuscript. We deconvolved the benchmark bulk RNA-seq data using scRNA-seq data generated from 6 healthy subjects by Segerstolpe et al. (2016). Both datasets can be found on this page.

## Bulk RNA-seq data
benchmark.eset = readRDS("./data/bulk-eset.rds")
benchmark.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 10000 features, 200 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 1 10 ... 200 (200 total)
##   varLabels: sampleID group
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

# clinical conditions
table(benchmark.eset$group)

## 
## healthy     t2d 
##     100     100

## scRNA-seq data
seger.eset = readRDS("./data/single-eset.rds")
seger.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 25453 features, 1097 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: AZ_A10 AZ_A11 ... HP1509101_P9 (1097 total)
##   varLabels: sampleID SubjectName cellTypeID cellType
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

Cell Type Deconvolution

The cell type proportions are estimated by the function music2_prop. The essential inputs are:

bulk.eset: ExpressionSet of bulk data;
sc.eset: ExpressionSet of single cell data;
condition: character, the phenoData of bulk dataset used for indicating clinical conditions;
control: character, the clinical condition of bulk samples that is the same as the clinical condition of the single cell samples;
case: character, the clinical condition of bulk samples that is different from the clinical condition of the single cell samples;
clusters: character, the phenoData from single cell dataset used as clusters;
samples: character, the phenoData from single cell dataset used as samples;
select.ct: vector of cell types. Default is NULL, which uses all cell types provided in the single-cell data;
n_resample: numeric, number of resamples used for detecting cell-type-specific DE genes. Default is 20;
sample_prop: numeric, proportion of samples to be randomly sampled without replacement under each clinical condition for each resample. Default is 0.5;
cutoff_c: numeric, cutoff on the upper quantile of \(T_g^k\) statistics for detecting cell-type-specific DE genes for common cell types (i.e., cell type proportion \(\ge\) 0.1). Default is 0.05;
cutoff_r: numeric, cutoff on the upper quantile of \(T_g^k\) statistics for detecting cell-type-specific DE genes for rare cell types (i.e., cell type proportion < 0.1). Default is 0.01;

The output of music2_prop is a list with elements:

Est.prop: matrix, cell type proportion estimates;
convergence: logical, whether MuSiC2 converged or not;
n.iter: numeric, number of iterations;
DE.genes: vector, cell-type-specific DE genes being removed;

For illustration purpose, we constrained our analysis on 6 well-studied cell types: acinar, alpha, beta, delta, ductal and gamma. Figure 2 below showed the estimated cell type proportion of MuSiC2 separated by disease status (e.g., healthy and T2D).

# music2 deconvolution
set.seed(1234)
est.prop = music2_prop(bulk.eset = benchmark.eset, sc.eset = seger.eset, condition='group', control='healthy',case='t2d', clusters = 'cellType', samples = 'sampleID', select.ct = c('acinar','alpha','beta','delta','ductal','gamma'), n_resample=20, sample_prop=0.5,cutoff_c=0.05,cutoff_r=0.01)$Est.prop

Figure 2: Cell Type Composition. Jitter plots showing estimated cell type proportions of benchmark bulk RNA-seq samples by disease status (healthy and T2D), estimated using MuSiC2 with healthy scRNA-seq data as reference. The medians of cell type proportions across samples is showed by the black horizontal lines.

We also deconvolved the benchmark bulk RNA-seq data using MuSiC (Wang et al., 2019), and evaluated the accuracy of both deconvolution methods by comparing the estimated cell type proportions obtained by MuSiC2 and by MuSiC to the true proportions. Below we present the individual-level root mean square error (RMSE) across cell types for the two deconvolution methods separated by disease status (e.g., healthy and T2D) (Figure 3: left). As expected, because MuSiC2 only refines the gene list in the single cell reference when deconvolving bulk samples generated from clinical condition that differs from the single cell data, MuSiC and MuSiC2 had exactly the same performance for healthy samples with estimation bias close to 0. For diseased samples, MuSiC2 improved the estimation accuracy, highlighting the significance of gene selection for deconvolution. Especially for beta cells, MuSiC2 produced much more accurate cell type proportion estimates for diseased bulk samples than MuSiC, which suffered from severe underestimation (Figure 3: right).

Figure 3: Estimation Accuracy. We evaluated the performance of MuSiC2 and compared to MuSiC using the benchmark bulk RNA-seq samples with healthy scRNA-seq data as reference. (Left) Boxplots of individual-level root mean square error (RMSE) across cell types separated by disease status (healthy and T2D). (Right) Boxplots of beta cell proportions comparing true proportions with estimated proportions by MuSiC2 and by MuSiC, separated by disease status (healthy and T2D).

Reference

Wang, X., Park, J., Susztak, K., Zhang, N.R., and Li, M. 2019. “Bulk Tissue Cell Type Deconvolution with Multi-Subject Single-Cell Expression Reference.” Nature Communications 10: 380.

Segerstolpe, Å., Palasantza, A., Eliasson, P., Andersson, E.M., Andréasson, A.C., et al. 2016. “Single-cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell metabolism. 24: 593-607.

MuSiC2: cell type deconvolution for multi-condition bulk RNA-seq data

Jiaxin Fan (jiaxinf@pennmedicine.upenn.edu)