Detecting cell-type-specific AEI in bulk RNA-seq

Step 1: Estimating cell type proportions by deconvolution
- Data preparation
- Cell type deconvolution
Step 2: Detecting cell-type-specific AEI
- Data preparation
- Cell-type-specific AEI detection
Step 2 extension: Associating cell-type-specific AEI with covariates

This vignette provides a walk through tutorial on how to use BSCET to characterize cell-type-specific allelic expression imbalance (AEI), and the association of cell-type-specific AEI with clinical factors by integrative analysis of bulk and single cell RNA sequencing (RNA-seq) data. BSCET involves two steps. First, we infer cell type proportions in the bulk RNA-seq samples by incorporating cell-type-specific gene expression information provided by a scRNA-seq reference. Second, for each heterozugous transcribed SNP, BSCET aggregates allele-specific read counts from the bulk RNA-seq data across individuals to model cell-type-specific expression difference between two alleles by linear regression, and tests for the evidence of cell-type-specific AEI.

Below, using the human pancreas datasets, we demonstrate BSCET step by step.

Step 1: Estimating cell type proportions by deconvolution

In this step, we aim to infer cell type proportions in the bulk RNA-seq samples by incorporating cell-type-specific gene expression information provided by a scRNA-seq reference. This can be achieved by cell type deconvolution algorithms such as MuSiC (Wang et al. 2019). Following MuSiC, we deconvolve bulk RNA-seq data of 89 subjects from Fadista et al. (2014) using single cell reference obtained from Segerstolpe et al. (2016). The detailed tutorial of MuSiC (Wang et al. 2019) can be found on this page.

Data preparation

MuSiC (Wang et al. 2019) takes a bulk RNA-seq data that contain a mixture expression of various cell types, and a scRNA-seq data that include the single cell expression from multiple samples as input. Both datasets should be in the form of ExpressionSet, a convenient data structure to hold expression data along with sample/feature annotation. The details of constructing ExpressionSet can be found on this page.

library(MuSiC)
## Bulk RNA-seq data
fadista.eset = readRDS("./GSE50244bulkeset.rds")
fadista.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 32581 features, 89 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: Sub1 Sub2 ... Sub89 (89 total)
##   varLabels: sampleID SubjectName ... tissue (7 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

## scRNA-seq data
seger.eset = readRDS("./EMTABesethealthy.rds")
seger.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 25453 features, 1097 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: AZ_A10 AZ_A11 ... HP1509101_P9 (1097 total)
##   varLabels: sampleID SubjectName cellTypeID cellType
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

Cell type deconvolution

We constrained our analysis on 6 well-studied cell types: acinar, alpha, beta, delta, ductal and gamma.

music_result = music_prop(bulk.eset = fadista.eset, sc.eset = seger.eset, clusters = 'cellType', samples = 'sampleID', select.ct = c('acinar','alpha', 'beta', 'delta','ductal','gamma'), verbose = F)
est.prop = music_result$Est.prop.weighted

Below shows the estimated cell type proportions for the 89 bulk samples obtained from MuSiC (Wang et al. 2019).

Although we illustrated Step 1 using estimated cell type proportions obtained from MuSiC (Wang et al. 2019), however, BSCET is not limited to MuSiC and one can infer cell type compositions using other methods, such as CIBERSORT (Newman et al. 2019), in this step.

Step 2: Detecting cell-type-specific AEI

Given the estimated cell type proportions obtained in Step 1, for each heterozygous transcribed SNP, BSCET aggregates allele-specific read counts of the bulk RNA-seq data across heterozygous individuals and detects cell-type-specific AEI of the SNP.

For individual \(j\), let \(X_j^T\) and \(X_j^t\) be the read counts for the reference and alternative alleles of the transcribed SNP, respectively. The difference of two allele-specific reads can be expressed as:

\[X_j^T-X_j^t = \alpha+m_j\sum_{k=1}^Kp_j^kS_j^k\theta^k+\epsilon_j \;\;\;\;\;\;\;\;\;\;\;\;\;(1)\]

where the intercept \(\alpha\) captures the allelic expression difference not explained by the selected \(K\) cell types; \(m_j\) is the total number of cells in the bulk tissue for individual \(j\); \(S_j^k\) is the subject-specific average cell size of cell type \(k\) and can be approximated by the sample mean, \(S^k\); \(p_j^k\) is the individual-specific proportion of cells from cell type \(k\); \(θ^k\) represents allelic expression difference between two alleles of the transcribed SNP for cell type \(k\); \(\epsilon_j\) is random error and assumed to follow \(N(0,\sigma^2)\).

To detect cell-type-specific AEI in the population, for cell type \(k\), we test the following hypothesis:

\[H_0: \theta^k=0 \;\; vs \;\; H_1: \theta^k \neq 0\]

using a t statistic. Details of BSCET can be found in the Method section of our manuscript.

Data preparation

BSCET analyzes one transcribed SNP at a time. In this section, we will walk you through on how to generate input data for BSCET.

First, we obtained the allele-specific read count of the SNP, i.e, \(X_j^T\) and \(X_j^t\). The input allelic bulk RNA-seq data are in form of text file and should contain columns:

SNP: character, the name or chromosome location of each heterozygous transcriobed SNP;
id: character, individual identifier;
ref: numeric, the SNP-level allele-specific read counts of the reference allele;
alt: numeric, the SNP-level allele-specific read counts of the alternative allele;

The tables below show an example allelic RNA-seq data of SNP chr1:22336305.

Second, we calculate the subject-specific total number of cells in the bulk tissue, i.e., \(m_j\). As \(m_j\) is not observed for the bulk data, we esimate it using the individual-specific library size factor, which can be obtained by the function estimateSizeFactor of DESeq2 (Love, Huber, and Anders 2014). A detailed tutorial of DESeq2 (Love, Huber, and Anders 2014) can be found on this page.

library(DESeq2)
expmat <- exprs(fadista.eset)
coldata = matrix(colnames(expmat),ncol=1)
colnames(coldata) = 'id'
rownames(coldata) = colnames(expmat)
dds <- DESeqDataSetFromMatrix(countData = expmat,
                              colData = coldata,
                              design = ~ id)
dds <- estimateSizeFactors(dds)
mj = sizeFactors(dds)
head(mj)

##      Sub1      Sub2      Sub3      Sub4      Sub5      Sub6 
## 0.9595915 0.9580428 1.0016846 0.8483268 1.0137396 0.8437796

Third, we obtained the subject-specific cell type compositions of the bulk tissue, i.e., \(p_j^k\), which can be estimated using cell type deconvolution showed in Step 1. Given both \(m_j\) and \(p_j^k\), for each individual \(j\), we can calculate their \(m_jp_j^k\), showed in the table below.

Finally, we combine \(m_jp_j^k\) with the allele-specific reads, \(X_j^T\) and \(X_j^t\), and obtain the input data of BSCET as the following:

Cell-type-specific AEI detection

To detect cell-type-specific AEI, BSCET regresses the allelic count difference, i.e., \(X_j^T-X_j^t\), over \(m_jp_j^k\) through linear regression across samples, and the p-values of coefficients are used to indicate the significance of cell-type-specific effect. As we can see below, the SNP was detected as having cell-type-specific AEI for acinar cells.

mod=lm(ref-alt ~ `acinar`+`alpha`+`beta`+`delta`+`ductal`+`gamma`,data=dat)
p_value = summary(mod)$coefficient[,4][-1]
print(round(p_value,5))

##  acinar   alpha    beta   delta  ductal   gamma 
## 0.00032 0.44113 0.09326 0.22204 0.93712 0.75730

Step 2 extension: Associating cell-type-specific AEI with covariates

We can readily extend the model in Step 2 to assess covariate effect on cell-type-specific AEI. Let \(V_j\) be the covariate of interest for individual \(j\). We can modify the model by adding an interaction term between the covariate and the estimated cell type proportions:

\[X_j^T-X_j^t=\alpha+m_j\sum_{k=1}^Kp_j^k(\theta^k +V_j \theta_{\Delta}^k)+ \epsilon_j \;\;\;\;\;\;\;\;\;\;\;\;\;(2)\]

where \(\theta_{\Delta}^k\) is the covariate effect on the cell-type-specific AEI. In practice we will likely test for the covariate effect on cell-type-specific AEI only if a cell-type-specific AEI has been detected based on model (1). Therefore, for model (2), we are interested in testing whether the cell-type-specific AEI changes with the covariate, i.e., for each cell type \(k\), we test the following hypothesis:

\[H_0: \theta_{\Delta}^k=0 \;\; vs \;\; H_1: \theta_{\Delta}^k\neq 0\]

As the bulk RNA-seq data come from pancreatic islets, here, we test for the effect of HbA1c, a well-known biomarker for T2D diagnosis, on cell-type-specific AEI. By including HbA1c level for each individual, we can have the data as the following:

To detect covariate effect on cell-type-specific AEI, BSCET models the allelic count difference over the cell type proportions and interactions between proportion estimates and HbA1c level through linear regression, and p-values for interaction terms are used to indicate the significance of covariate effect on cell-type-specific AEI. As we can see below, cell-type-specific AEI was not associated with HbA1c level for all cell types.

mod=lm(ref-alt ~ `acinar`+`alpha`+`beta`+`delta`+`ductal`+`gamma`+
          HbA1c:(`acinar`+`alpha`+`beta`+`delta`+`ductal`+`gamma`),data=dat)
p_value = summary(mod)$coefficient[,4][8:13]
print(round(p_value,5))

## acinar:HbA1c  alpha:HbA1c   beta:HbA1c  delta:HbA1c ductal:HbA1c  gamma:HbA1c 
##      0.49077      0.05900      0.16795      0.05154      0.39757      0.13501

Fadista, J., P. Vikman, E. O. Laakso, I. G. Mollet, J. L. Esguerra, J. Taneera, P. Storm, et al. 2014. “Global Genomic and Transcriptomic Analysis of Human Pancreatic Islets Reveals Novel Genes Influencing Glucose Metabolism.” Proceedings of the National Academy of Sciences of the United States of America 111: 13924.

Love, M. I., W. Huber, and S. Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 550.

Newman, A. M., C. B. Steen, C. L. Liu, A. J. Gentles, A. A. Chaudhuri, F. Scherer, and others. 2019. “Determining Cell Type Abundance and Expression from Bulk Tissues with Digital Cytometry.” Nat Biotechnology 37: 773–82.

Segerstolpe, å., A. Palasantza, P. Eliasson, E. Andersson, A. Andréasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metabolism 24: 593–607.

Wang, X., J. Park, K. Susztak, N. R. Zhang, and M. Li. 2019. “Bulk Tissue Cell Type Deconvolution with Multi-Subject Single-Cell Expression Reference.” Nature Communications 10: 380.

Developed by Jiaxin Fan (jiaxinf@pennmedicine.upenn.edu).